
Sorting gigantic binary files with C#

I have a large file, roughly 400 GB in size, generated daily by an external closed system. It is a binary file with the following format:

byte[8]byte[4]byte[n]

Where n is equal to the int32 value of byte[4].

The file has no delimiters; to read the whole file you just repeat until EOF, with each "item" represented as byte[8]byte[4]byte[n].

The file looks like

byte[8]byte[4]byte[n]byte[8]byte[4]byte[n]...EOF

byte[8] is a 64-bit number representing a timestamp in .NET Ticks. I need to sort this file by that timestamp, but I can't seem to figure out the quickest way to do so.

Presently, I load the Ticks and the byte[n] start and end positions into a struct, reading to the end of the file. After this, I sort the List in memory by the Ticks property, then open a BinaryReader and seek to each position in Ticks order, read the byte[n] value, and write it to an external file.

At the end of the process I end up with a sorted binary file, but it takes FOREVER. I am using C# .NET and a pretty beefy server, but disk IO seems to be an issue.
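Roughly, what I do now looks like this (simplified; the file names are placeholders):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class IndexSort
{
    struct Entry
    {
        public long Ticks;   // byte[8]: .NET Ticks
        public long Offset;  // start of the whole record in the source file
        public int Length;   // 12 + n: header plus payload
    }

    static void Main()
    {
        var index = new List<Entry>();

        // Pass 1: sequential scan recording each record's ticks and position.
        using (var src = new BinaryReader(File.OpenRead("input.bin")))
        {
            long pos = 0, len = src.BaseStream.Length;
            while (pos < len)
            {
                long ticks = src.ReadInt64();               // byte[8]
                int n = src.ReadInt32();                    // byte[4]
                index.Add(new Entry { Ticks = ticks, Offset = pos, Length = 12 + n });
                src.BaseStream.Seek(n, SeekOrigin.Current); // skip byte[n]
                pos += 12 + n;
            }
        }

        // Sort the index in memory by timestamp.
        index.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));

        // Pass 2: random reads in sorted order.
        using (var src = File.OpenRead("input.bin"))
        using (var dst = File.Create("sorted.bin"))
        {
            foreach (var e in index)
            {
                var record = new byte[e.Length];
                src.Seek(e.Offset, SeekOrigin.Begin);
                int read = 0;
                while (read < e.Length)
                    read += src.Read(record, read, e.Length - read);
                dst.Write(record, 0, e.Length);
            }
        }
    }
}
```

The second pass is where all the time goes: each record written in sorted order costs a random seek into the 400 GB source file.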

Server Specs:

  • 2x 2.6 GHz Intel Xeon (hex-core with HT, 24 threads total)
  • 32 GB RAM
  • 500 GB RAID 1+0
  • 2 TB RAID 5

I've looked all over the internet and can only find examples where a "huge" file is 1 GB (makes me chuckle).

Does anyone have any advice?


A great way to speed up this kind of file access is to memory-map the entire file into address space and let the OS take care of reading whatever bits of the file it needs. So do the same thing you're doing right now, except read from memory instead of using a BinaryReader/seek/read.

You've got lots of main memory, so this should provide pretty good performance (as long as you're using a 64-bit OS).
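Roughly, with System.IO.MemoryMappedFiles it could look like this (a sketch only; the sorted index and the file names are assumptions carried over from the question):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

class MappedRead
{
    // Copy one record from the mapped source to the output, given its offset
    // from the sorted index (building that index works the same as before).
    static void CopyRecord(MemoryMappedViewAccessor view, Stream dst, long offset)
    {
        int n = view.ReadInt32(offset + 8);    // byte[4] length field
        var record = new byte[12 + n];         // header + payload
        view.ReadArray(offset, record, 0, record.Length);
        dst.Write(record, 0, record.Length);
    }

    static void Main()
    {
        using (var mmf = MemoryMappedFile.CreateFromFile("input.bin", FileMode.Open))
        using (var view = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
        using (var dst = File.Create("sorted.bin"))
        {
            // For each entry in the sorted index, copy its record; a single
            // hypothetical offset stands in for that loop here.
            CopyRecord(view, dst, 0);
        }
    }
}
```

The OS page cache then decides which parts of the file stay resident, which with 32 GB of RAM covers a lot of the hot data.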


Use merge sort. It's online and parallelizes well.

http://en.wikipedia.org/wiki/Merge_sort
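Concretely, for a file this size that means an external merge sort: sort RAM-sized chunks into temporary runs, then k-way merge the runs. A rough sketch, assuming the record format from the question (the file names and 1 GB chunk size are made up; PriorityQueue needs .NET 6+):

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class ExternalMergeSort
{
    static void Main()
    {
        var runs = WriteSortedRuns("input.bin", chunkBytes: 1L << 30);
        MergeRuns(runs, "sorted.bin");
    }

    // Phase 1: cut the file into sorted runs that fit in RAM.
    static List<string> WriteSortedRuns(string path, long chunkBytes)
    {
        var runs = new List<string>();
        using var src = new BinaryReader(File.OpenRead(path));
        long len = src.BaseStream.Length;
        while (src.BaseStream.Position < len)
        {
            var records = new List<(long Ticks, byte[] Raw)>();
            long used = 0;
            while (used < chunkBytes && src.BaseStream.Position < len)
            {
                var (ticks, raw) = ReadRecord(src);
                records.Add((ticks, raw));
                used += raw.Length;
            }
            records.Sort((a, b) => a.Ticks.CompareTo(b.Ticks));
            string run = $"run_{runs.Count}.bin";
            using (var dst = File.Create(run))
                foreach (var r in records) dst.Write(r.Raw, 0, r.Raw.Length);
            runs.Add(run);
        }
        return runs;
    }

    // Phase 2: k-way merge, always emitting the run with the smallest ticks.
    static void MergeRuns(List<string> runs, string outPath)
    {
        var readers = runs.Select(r => new BinaryReader(File.OpenRead(r))).ToList();
        var next = new byte[runs.Count][];           // next record per run
        var queue = new PriorityQueue<int, long>();  // run index keyed by ticks

        for (int i = 0; i < readers.Count; i++)
            if (readers[i].BaseStream.Length > 0)
            {
                long t; (t, next[i]) = ReadRecord(readers[i]);
                queue.Enqueue(i, t);
            }

        using var dst = File.Create(outPath);
        while (queue.TryDequeue(out int i, out _))
        {
            dst.Write(next[i], 0, next[i].Length);
            if (readers[i].BaseStream.Position < readers[i].BaseStream.Length)
            {
                long t; (t, next[i]) = ReadRecord(readers[i]);
                queue.Enqueue(i, t);
            }
        }
        readers.ForEach(r => r.Dispose());
    }

    // Read one byte[8]byte[4]byte[n] record, returning the ticks key and the
    // raw bytes of the whole record for verbatim rewriting.
    static (long, byte[]) ReadRecord(BinaryReader r)
    {
        long ticks = r.ReadInt64();
        int n = r.ReadInt32();
        var raw = new byte[12 + n];
        BitConverter.GetBytes(ticks).CopyTo(raw, 0);
        BitConverter.GetBytes(n).CopyTo(raw, 8);
        r.Read(raw, 12, n);
        return (ticks, raw);
    }
}
```

Both phases are sequential I/O, which is exactly what spinning RAID arrays are good at.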


If you can learn Erlang or Go, they could be very powerful and scale extremely well, since you have 24 threads. Use async I/O and a merge sort. And since you have 32 GB of RAM, load as much as you can into RAM, sort it there, then write it back to disk.
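You can get the same parallelism in C# with the TPL; the per-chunk sorts are embarrassingly parallel. A tiny sketch, assuming the chunks have already been read into memory (the data structure here is hypothetical):

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelRunSort
{
    // Sort each in-memory chunk on its own thread. 'chunks' holds one list
    // of (ticks, raw record) pairs per RAM-sized chunk.
    static void SortChunks(List<List<(long Ticks, byte[] Raw)>> chunks)
    {
        Parallel.ForEach(chunks, chunk =>
            chunk.Sort((a, b) => a.Ticks.CompareTo(b.Ticks)));
    }
}
```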


I would do this in several passes. On the first pass, I would build a list of ticks, then distribute them evenly into many (hundreds of?) buckets. If you know ahead of time that the ticks are evenly distributed, you can skip this initial pass. On a second pass, I would split the records into these few hundred separate files of about the same size (the much smaller files represent groups of ticks in the order you want). Then I would sort each file separately in memory, and finally concatenate the files.

It's somewhat similar to a hash sort, I think.
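A sketch of the distribution pass, assuming ticks fall in a known range (the 256-bucket count and the date boundaries are invented for illustration; a real run would take them from the sampling pass):

```csharp
using System;
using System.IO;

class BucketSplit
{
    static void Main()
    {
        const int buckets = 256;
        // These boundaries would come from the initial sampling pass; a
        // one-year tick range is assumed here purely for illustration.
        long min = new DateTime(2011, 1, 1).Ticks;
        long max = new DateTime(2012, 1, 1).Ticks;

        var outs = new FileStream[buckets];
        for (int i = 0; i < buckets; i++)
            outs[i] = File.Create($"bucket_{i}.bin");

        using (var src = new BinaryReader(File.OpenRead("input.bin")))
        {
            long len = src.BaseStream.Length;
            while (src.BaseStream.Position < len)
            {
                long ticks = src.ReadInt64();        // byte[8]
                int n = src.ReadInt32();             // byte[4]
                byte[] payload = src.ReadBytes(n);   // byte[n]

                // Linear interpolation into the bucket range, clamped.
                int b = (int)((ticks - min) * (buckets - 1) / (max - min));
                b = Math.Max(0, Math.Min(buckets - 1, b));

                outs[b].Write(BitConverter.GetBytes(ticks), 0, 8);
                outs[b].Write(BitConverter.GetBytes(n), 0, 4);
                outs[b].Write(payload, 0, n);
            }
        }
        foreach (var f in outs) f.Dispose();
        // Each bucket now holds a contiguous tick range and should fit in
        // memory: sort each one, then concatenate bucket_0 .. bucket_255.
    }
}
```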

