File-based merge sort on large datasets in Java

2023-03-12 08:16 Q&A author:

Given large datasets that don't fit in memory, is there any library or API to perform a sort in Java? The implementation would possibly be similar to the Linux utility sort.

Java provides a general-purpose sorting routine which can be used as part of the larger solution to your problem. A common approach to sort data that's too large to all fit in memory is this:

1) Read as much data as will fit into main memory, let's say 1 GB

2) Sort that 1 GB in memory (here's where you'd use Java's built-in sort from the Collections framework, such as Collections.sort or List.sort)

3) Write that sorted 1 GB to disk as "chunk-1"

4) Repeat steps 1-3 until you've gone through all the data, saving each data chunk in a separate file. So if your original data was 9 GB, you will now have 9 sorted chunks of data labeled "chunk-1" through "chunk-9"

5) You now just need a final merge pass (the merge step of a merge sort) to combine the 9 sorted chunks into a single fully sorted data set. The merge works very efficiently against these pre-sorted chunks. It essentially opens 9 file readers (one for each chunk), plus one file writer (for output). It then compares the first data element from each reader and selects the smallest value, which is written to the output file. The reader from which that value came advances to its next data element, and the 9-way comparison to find the smallest value is repeated, again writing the result to the output file. This process repeats until all data has been read from all the chunk files.

6) Once step 5 has finished reading all the data, you are done: your output file now contains a fully sorted data set.
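The merge in step 5 can be sketched as follows. This is a minimal illustration, not a tuned implementation, and it makes several assumptions that are not in the answer above: each record is one line of UTF-8 text, records are ordered by natural String order, and the names ChunkMerger, ChunkCursor and merge are hypothetical. Instead of comparing all 9 heads on every step, it keeps the readers in a java.util.PriorityQueue, which picks the smallest head in logarithmic time but produces the same output.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ChunkMerger {

    /** One open chunk file plus the line currently at its head (hypothetical helper). */
    private static final class ChunkCursor implements Comparable<ChunkCursor> {
        final BufferedReader reader;
        String head;

        ChunkCursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        /** Moves to the next line; returns false once the chunk is exhausted. */
        boolean advance() throws IOException {
            head = reader.readLine();
            return head != null;
        }

        @Override
        public int compareTo(ChunkCursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Merges already-sorted chunk files into one sorted output file. */
    public static void merge(List<Path> sortedChunks, Path output) throws IOException {
        PriorityQueue<ChunkCursor> heap = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            // Open one reader per chunk and queue it by its first line.
            for (Path chunk : sortedChunks) {
                BufferedReader reader = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(reader);
                ChunkCursor cursor = new ChunkCursor(reader);
                if (cursor.head != null) {
                    heap.add(cursor); // skip empty chunks
                }
            }
            // Repeatedly write the smallest head, then re-queue that reader.
            while (!heap.isEmpty()) {
                ChunkCursor smallest = heap.poll();
                out.write(smallest.head);
                out.newLine();
                if (smallest.advance()) {
                    heap.add(smallest);
                }
            }
        } finally {
            for (BufferedReader reader : readers) {
                reader.close();
            }
        }
    }
}
```

Only one line per chunk is held in memory at a time (plus the readers' buffers), so memory use depends on the number of chunks rather than on the size of the data.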

With this approach you could easily write a generic "megasort" utility of your own that takes a filename and maxMemory parameter and efficiently sorts the file by using temp files. I'd bet you could find at least a few implementations out there for this, but if not you can just roll your own as described above.
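As a rough idea of how the first half of such a utility (steps 1 to 4) might look, here is a hedged sketch. The class name ChunkSplitter, the method splitIntoSortedChunks and the parameter maxLinesPerChunk are all hypothetical. It assumes one record per line of UTF-8 text, and it caps chunks by line count as a simple stand-in for a real maxMemory limit in bytes, which would need an estimate of each record's in-memory size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkSplitter {

    /**
     * Reads the input in batches of at most maxLinesPerChunk lines, sorts each
     * batch in memory, and writes it to its own temp file. Returns the chunk
     * files in the order they were written.
     */
    public static List<Path> splitIntoSortedChunks(Path input, int maxLinesPerChunk)
            throws IOException {
        if (maxLinesPerChunk < 1) {
            throw new IllegalArgumentException("maxLinesPerChunk must be at least 1");
        }
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() >= maxLinesPerChunk) {
                    chunks.add(writeSortedChunk(buffer, chunks.size() + 1));
                    buffer.clear(); // reuse the buffer for the next batch
                }
            }
        }
        // Flush whatever is left after the last full batch.
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer, chunks.size() + 1));
        }
        return chunks;
    }

    /** Sorts the lines in place and writes them to a new temp file. */
    private static Path writeSortedChunk(List<String> lines, int index) throws IOException {
        Collections.sort(lines); // Java's built-in sort, natural String order
        Path chunk = Files.createTempFile("chunk-" + index + "-", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }
}
```

The returned chunk files are each sorted on their own and still need the final merge pass from step 5 to produce one sorted output. A real utility would also delete the temp files once the merge has finished.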

The most common way to handle large datasets is in memory (you can buy a server with 1 TB these days) or in a database.

If you are not going to use a database (or buy more memory), you can write it yourself fairly easily.

There are libraries that perform Map-Reduce functions and may help, but they may add more complexity than they save.
