
Comparing large text files - Is comparing hashes faster than using subsets of the file?

Say I have two large (text) files which are allegedly identical, but I want to make sure. The entire Harry Potter series of 'adult' and 'child' editions perhaps...

If the full text's string representation is too large to be held in memory at once, is it going to be faster to:

  • a) Generate a hash of each file and compare the two hashes

or

  • b) Read in manageable chunks of each file and compare them until you either reach EOF or find a mismatch

In other words, would the convenience of comparing 2 small hashes be offset by the time it took to generate said hashes?

I'm expecting a couple of "it depends" answers, so if you want some assumptions to work with:

  • Language is C# in .NET
  • Text files are 3GB each
  • Hash function is MD5
  • Maximum 'spare' RAM is 1GB
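
For concreteness, option a) could look like the minimal sketch below (the class and method names are just illustrative). MD5 in .NET can hash a Stream incrementally, so neither file has to be held in memory at once; both files do, however, have to be read end-to-end before the two 16-byte hashes can be compared.

```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class HashCompare
{
    // Option a): hash both files and compare the hashes.
    public static bool HashesMatch(string pathA, string pathB)
    {
        using (var md5 = MD5.Create())
        using (var streamA = File.OpenRead(pathA))
        using (var streamB = File.OpenRead(pathB))
        {
            // ComputeHash(Stream) reads the file in small internal blocks,
            // so a 3GB text never needs to fit in RAM at once.
            byte[] hashA = md5.ComputeHash(streamA);
            byte[] hashB = md5.ComputeHash(streamB);

            // Comparing two 16-byte hashes is trivially cheap...
            // ...but both files had to be read in full to produce them.
            return hashA.SequenceEqual(hashB);
        }
    }
}
```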


  1. The MD5 checksum will be slower, since you have to process both files in full to get the result. You say you have 3GB files and only 1GB of memory to spare; you do the math.

  2. Checking them in byte chunks will actually detect any difference earlier, and you can also check the file size, file length, etc. first...

I would go with option 2.
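
A minimal sketch of option 2 under the question's assumptions (the buffer size and all names here are illustrative): check the lengths first, then read both files in fixed-size chunks and stop at the first mismatch, which also lets you report where the files diverge.

```csharp
using System;
using System.IO;

static class ChunkCompare
{
    // 1 MB read buffer per file: comfortably inside the 1GB of "spare" RAM.
    const int BufferSize = 1024 * 1024;

    // Returns the byte offset of the first difference, or -1 if the files are identical.
    public static long FirstDifference(string pathA, string pathB)
    {
        // Cheapest check first: different lengths means different files.
        long length = new FileInfo(pathA).Length;
        if (length != new FileInfo(pathB).Length)
            return 0; // treat differing lengths as an immediate mismatch

        using (var streamA = File.OpenRead(pathA))
        using (var streamB = File.OpenRead(pathB))
        {
            var bufferA = new byte[BufferSize];
            var bufferB = new byte[BufferSize];
            long offset = 0;

            while (offset < length)
            {
                // The files have equal length, so both reads return the same count.
                int read = ReadFull(streamA, bufferA);
                ReadFull(streamB, bufferB);
                if (read == 0)
                    break; // defensive: stop if the file shrank while reading

                for (int i = 0; i < read; i++)
                    if (bufferA[i] != bufferB[i])
                        return offset + i; // early exit at the first mismatch

                offset += read;
            }
        }
        return -1; // reached EOF without finding a mismatch
    }

    // Stream.Read may return fewer bytes than requested, so loop until the
    // buffer is full or EOF to keep the two buffers aligned.
    static int ReadFull(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = stream.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
```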


Option A is only useful if you will reuse the hash (i.e. you have other files to compare), so that the cost of calculating the hash isn't a factor...

Otherwise, Option B is what I would go for...

To get the maximum speed I would use MemoryMappedFile instances and XOR the content - the comparison can stop at the first difference it encounters (i.e. when the XOR operation returns something != 0). Regarding memory consumption, you can use a "moving window" (i.e. via the call to CreateViewAccessor), which would allow for processing files of literally TB size...

It could even be worth testing the performance of XOR against some LINQ-based comparison methods... and always start by comparing the file sizes; that way you avoid doing unnecessary calculations...
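
A minimal sketch of that MemoryMappedFile idea (the 64 MB window size and all names here are assumptions for illustration, not anything from the original post):

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedCompare
{
    // Size of the "moving window"; 64 MB is an arbitrary choice to tune.
    const long WindowSize = 64L * 1024 * 1024;

    public static bool AreEqual(string pathA, string pathB)
    {
        // Always start with the cheap file-size comparison.
        long length = new FileInfo(pathA).Length;
        if (length != new FileInfo(pathB).Length)
            return false;
        if (length == 0)
            return true; // mapping a zero-byte file would throw

        using (var mmfA = MemoryMappedFile.CreateFromFile(pathA, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var mmfB = MemoryMappedFile.CreateFromFile(pathB, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < length; offset += WindowSize)
            {
                long size = Math.Min(WindowSize, length - offset);

                // Only `size` bytes of each file are mapped at a time, so even
                // TB-sized files stay within a modest memory footprint.
                using (var viewA = mmfA.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
                using (var viewB = mmfB.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
                {
                    long i = 0;

                    // XOR 8 bytes at a time; a non-zero result marks the first difference.
                    for (; i + sizeof(long) <= size; i += sizeof(long))
                        if ((viewA.ReadInt64(i) ^ viewB.ReadInt64(i)) != 0)
                            return false;

                    // Trailing bytes that don't fill a whole 8-byte word.
                    for (; i < size; i++)
                        if (viewA.ReadByte(i) != viewB.ReadByte(i))
                            return false;
                }
            }
        }
        return true;
    }
}
```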


Assuming you have no future use for the hash information (to compare against other texts, or to check after potential future changes), then there are two cases: A) the documents are the same, B) the documents are different.

If A, then there's almost no difference between the two scenarios. Both involve reading the entire files one chunk at a time and doing a calculation/compare on every byte. The computational overhead of the hash is minimal compared to the work of reading the files.

If B, then it's possible you'd find a difference in the first page of the files, at which point you'd be able to quit the process.

So depending on the relative probability of A vs. B, it seems direct comparison would be faster on average. Note also that you could then report where the change occurs, which you could not do in the hash scenario.
