Sort with the limited memory
Suppose I have X GB o开发者_运维知识库f RAM space available and I need to sort a huge array of data (much greater the all the available memory. It's stored on the hard drive). Could you give a hint, how that could be accomplished?
You are looking for external sorting. The largest cost in these scenarios is often disk IO. So the trick is to use an algorithm that minimises disk IO. The usual approach is to read suitably large chunks in to memory, sort those chunks, saving them back to disk, then merge the sorted chunks.
A search for "external sorting" or "sort merge" and your technologies of choice should give some good results.
Let us assume you have a huge file H and limit memory M.
I have two solutions.
Solution 1:
step 1: Read M from the file and write it into a temp buffer.
step 2: Sort (you should use in-place sorting algorithm, like QuickSort,HeapSort) the values.
step 3: Create a temp file and write the buffer into the temp file. Save the name of the temp file.
step 4: Repeat step 1 to 3 until read file H out, and sort M out and save all temp files.
step 5: Merge all temp files to a new file. Create a new file,and open all the temp files,put the file handles into a array.
step 6: Find the minimum number from the set of numbers currently pointed to by the file read pointer. You can use normal way to do that, compare every number, and use a temp variable to remember it (time complexity is O(n). Or you can create a priorityQueue,and a map, the key of the map is the number, and the value of the map is the sequence of the file handles.(time complexity is O(lgn),the second way wastes more memory, but performance is better,if you want a better way, you can use a int to replace list using bitmap, becase the temp file names are sequeced.
step 7: Save the number to the new file.
step 8: Read another number from the file that contained the minimum number at step 6.
step 9: Repeat step 7 to 8 until all numbers from all the temp files have been processed.
Solution 2:
step 1 : Create N temp files, the range of the numbers of every temp files are different.(For example,the range of temp_file_1 is from 0 to 1000000, and the next temp file is from 1000001 to 2000000...)
step 2: Read m from the H file and write the number into the different temp files until nothing can read from file H.
step 3: Sort every temp files.
step 4: Create a new file, merge all temp files into this new file one by one.
What's the difference between the solutions. According to solution 1, every number is sorted once and is compared many times(step 5), but read and write only twice. about solution 2, no merge processing, but every single number is read and wrote three times.
What you are looking for is an external sort.
http://en.wikipedia.org/wiki/External_sorting
Practically, if you don't want to write too much code, and disk usage is not an issue, put you data into a database with a proper index, and then just select * order by
from there.
I would image you might construct some kind of index system that you can separate your sorted data into.
Imagine shelfs in a library. Go through 1/x of the data, sorting all elements onto the shelfs, and cache each shelf to an individual file on the disk. Then sort the next 1/x of data, write it back to the disk, ect. Once you have all your data in shelves go through and sort each shelf individually, then finally, merge them into one nice sorted file.
You can sort the many huge files (the sorted result can be terabytes and bigger) with ZZZServer. It is free for non-commercial use. I am affiliated with the product:
ZZZServer -sortinit -sort file1.txt
ZZZServer -sort file2.txt
ZZZServer -sort file3.txt
...
ZZZServer -sortsave sorted.txt
After sorting the result is saved in
sorted.txt
Your input files must be encoded in the UTF-8 or ASCII format!
The ZZZServer using about 1MB RAM on sorting big files!
精彩评论