sorting 50 000 000 numbers
Suppose we need to sort 50 000 000 numbers, and suppose the numbers are stored in a file. What is the most efficient algorithm for solving this problem? A parallel sorting algorithm?
How should I do it? Maybe you can point me to a useful link?
I can't use a standard algorithm.
So I'm asking about methods and algorithms. :) OK, I have read about parallel mergesort, but it's not clear to me.
Solution, the first version: the code is located here.
50 million is not particularly large. I would just read them into memory, sort them, and write them out. It should take just a few seconds. How fast do you need it to be? How complicated do you need it to be?
On my old laptop it took 28 seconds. If I had more processors it might be a little faster, but much of the time is spent reading and writing the file (15 seconds), which wouldn't be any faster.
One of the critical factors is the size of your cache. The comparison itself is very cheap provided the data is in cache. As the L3 cache is shared, one thread is all you need to make full use of it.
import java.io.*;
import java.util.Arrays;
import java.util.Random;

public static void main(String... args) throws IOException {
    generateFile();

    long start = System.currentTimeMillis();
    int[] nums = readFile("numbers.bin");
    Arrays.sort(nums);
    writeFile("numbers2.bin", nums);
    long time = System.currentTimeMillis() - start;
    System.out.println("Took " + time / 1000 + " secs to sort " + nums.length + " numbers.");
}

// Write 50 million random ints to a binary file.
private static void generateFile() throws IOException {
    Random rand = new Random();
    int[] ints = new int[50 * 1000 * 1000];
    for (int i = 0; i < ints.length; i++)
        ints[i] = rand.nextInt();
    writeFile("numbers.bin", ints);
}

// Read a length-prefixed array of ints from a binary file.
private static int[] readFile(String filename) throws IOException {
    DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(filename), 64 * 1024));
    int len = dis.readInt();
    int[] ints = new int[len];
    for (int i = 0; i < len; i++)
        ints[i] = dis.readInt();
    dis.close();
    return ints;
}

// Write the array length followed by each int.
private static void writeFile(String name, int[] numbers) throws IOException {
    DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(name), 64 * 1024));
    dos.writeInt(numbers.length);
    for (int number : numbers)
        dos.writeInt(number);
    dos.close();
}
Off the top of my head, merge sort seems to be the best option when it comes to parallelisation and distribution, as it uses a divide-and-conquer approach. For more information, google "parallel merge sort" and "distributed merge sort".
For a single-machine, multiple-core example, see Correctly multithreaded quicksort or mergesort algo in Java?. If you can use Java 7 fork/join, then see "Java 7: more concurrency" and "Parallelism with Fork/Join in Java 7".
For distributing it over many machines, see Hadoop, which has a distributed merge sort implementation: see MergeSort and MergeSorter. Also of interest: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds.
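To make the single-machine fork/join idea concrete, here is a minimal sketch of a parallel merge sort. The class name and THRESHOLD are my own illustrative choices, not taken from the linked answers:

import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Minimal parallel merge sort using fork/join (illustrative sketch only).
class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 1 << 16; // below this size, sort sequentially
    private final int[] a;
    private final int lo, hi; // sorts the range a[lo..hi)

    ParallelMergeSort(int[] a, int lo, int hi) {
        this.a = a; this.lo = lo; this.hi = hi;
    }

    @Override
    protected void compute() {
        if (hi - lo <= THRESHOLD) {
            Arrays.sort(a, lo, hi);
            return;
        }
        int mid = (lo + hi) >>> 1;
        // Sort the two halves in parallel, then merge them.
        invokeAll(new ParallelMergeSort(a, lo, mid),
                  new ParallelMergeSort(a, mid, hi));
        merge(mid);
    }

    // Merge the sorted halves a[lo..mid) and a[mid..hi) back into a[lo..hi).
    private void merge(int mid) {
        int[] tmp = Arrays.copyOfRange(a, lo, mid);
        int i = 0, j = mid, k = lo;
        while (i < tmp.length && j < hi)
            a[k++] = (tmp[i] <= a[j]) ? tmp[i++] : a[j++];
        while (i < tmp.length)
            a[k++] = tmp[i++];
    }

    static void sort(int[] a) {
        new ForkJoinPool().invoke(new ParallelMergeSort(a, 0, a.length));
    }
}

ParallelMergeSort.sort(nums) could stand in for the Arrays.sort call in the code above; whether it actually wins depends on core count and on how much of the time is really spent in I/O.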
For sorting that many elements, your best bet is merge sort. It's usually the algorithm used by databases. Even though it is not as fast as quicksort, it uses intermediate storage, so you don't need huge amounts of memory to perform the sort.
Also, as pointed out by sje397 and Scott in the comments, merge sort is highly parallelizable.
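The "intermediate storage" this answer refers to is an external merge sort: sort the file in memory-sized chunks, write each sorted chunk to disk, then merge the chunks. A rough sketch of the final k-way merge phase, assuming each chunk was already sorted and written as a plain sequence of ints with DataOutputStream (the file names and the Entry helper are hypothetical):

import java.io.*;
import java.util.PriorityQueue;

// K-way merge of pre-sorted chunk files into one sorted output file.
static void mergeChunks(String[] chunkFiles, String outFile) throws IOException {
    // One entry per chunk: its smallest not-yet-consumed value and its stream.
    class Entry {
        int value;
        DataInputStream in;
    }
    PriorityQueue<Entry> heap = new PriorityQueue<>((x, y) -> Integer.compare(x.value, y.value));

    for (String name : chunkFiles) {
        DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream(name)));
        Entry e = new Entry();
        e.in = in;
        e.value = in.readInt();          // assumes every chunk file is non-empty
        heap.add(e);
    }

    try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(outFile)))) {
        while (!heap.isEmpty()) {
            Entry e = heap.poll();
            out.writeInt(e.value);       // emit the globally smallest remaining value
            try {
                e.value = e.in.readInt();
                heap.add(e);             // this chunk has more data, put it back
            } catch (EOFException eof) {
                e.in.close();            // this chunk is exhausted
            }
        }
    }
}

Only one value per chunk is held in memory at a time, which is why this approach works even when the whole data set doesn't fit in RAM.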
It depends a lot on the problem domain. For instance, if all the numbers are positive ints, the best way may be to make an array of counts indexed from 0 to MAX_INT, count how many times each number occurs as you read the file, and then print out each int with a non-zero count as many times as it occurred. That's an O(n) "sort". There's an official name for that sort, but I forget what it is.
By the way, I got asked this question in a Google interview. From the problem constraints I came up with this solution, and it seemed to be the answer they were looking for. (I turned down the job because I didn't want to move.)
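A minimal sketch of that counting approach, assuming the values are known to be non-negative ints no larger than some bound MAX (a hypothetical, problem-specific limit; counting every value up to Integer.MAX_VALUE would need roughly 8 GB just for the counts):

import java.io.*;

// Counting "sort": tally each value while streaming the file, then emit the values in order.
// Assumes every number is an int in [0, max]; max is a hypothetical bound from the problem domain.
static void countingSort(String inFile, String outFile, int max) throws IOException {
    int[] counts = new int[max + 1];

    try (DataInputStream in = new DataInputStream(new BufferedInputStream(new FileInputStream(inFile)))) {
        int n = in.readInt();                 // same length-prefixed format as the code above
        for (int i = 0; i < n; i++)
            counts[in.readInt()]++;           // one O(n) pass, no comparisons at all
    }

    try (DataOutputStream out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(outFile)))) {
        long total = 0;
        for (int c : counts) total += c;
        out.writeInt((int) total);
        for (int value = 0; value <= max; value++)
            for (int c = 0; c < counts[value]; c++)
                out.writeInt(value);          // each value appears as many times as it was seen
    }
}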
Do not be afraid of the large number; in fact, 50 000 000 numbers is not that big. If the numbers are integers, each is 4 bytes, so the memory needed for the whole array is 50 000 000 * 4 / 1024 / 1024 ≈ 190.7 megabytes, which is relatively small. With the math done, you can just quicksort it, which runs in O(n log n). Note that the built-in sort for .NET arrays uses quicksort; in Java, Arrays.sort for primitive arrays also uses a quicksort variant (dual-pivot quicksort since Java 7).
Sorting 250 000 000 integers on my machine took about 2 minutes, so go for it :)
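If a newer JDK is available, Arrays.parallelSort (added in Java 8, so after this question) does the same in-memory sort across all cores; a minimal sketch:

import java.util.Arrays;
import java.util.Random;

// Sort 50 million ints in memory using all available cores (Java 8+).
public class ParallelSortDemo {
    public static void main(String[] args) {
        int[] nums = new int[50 * 1000 * 1000];
        Random rand = new Random();
        for (int i = 0; i < nums.length; i++)
            nums[i] = rand.nextInt();

        long start = System.currentTimeMillis();
        Arrays.parallelSort(nums);            // splits the array and merges sorted pieces in parallel
        System.out.println("Sorted in " + (System.currentTimeMillis() - start) + " ms");
    }
}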
They are not that many. If they are 10-byte extended-precision floats, for example, the array would be about 500 MB; it could almost fit on my phone! ;) So I'd say go for quicksort if that's all it is.
50e6 numbers is very small nowadays; don't make things more complex than they need to be...
bash$ sort -n < file > sorted.file