Comparing large text files - Is comparing hashes faster than using subsets of the file?
Say I have two large (text) files which are allegedly identical, but I want to make sure. The entire Harry Potter series of 'adult' and 'child' editions perhaps...
If the full text's string representation is too large to be held in memory at once, is it going to be faster to:
- a) Hash both files in their entirety and then test to see if the hashes are identical
or
- b) Read in manageable chunks of each file and compare them until you either reach EOF or find a mismatch
In other words, would the convenience of comparing 2 small hashes be offset by the time it took to generate said hashes?
I'm expecting a couple of "it depends" answers, so if you want some assumptions to work with:
- Language is C# in .NET
- Text files are 3GB each
- Hash function is MD5
- Maximum 'spare' RAM is 1GB
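For concreteness, option (a) could look roughly like this - streaming each file through MD5 so the whole 3GB never has to sit in memory at once (class and method names here are just placeholders):

```csharp
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class HashCompare
{
    // Stream a file through MD5; ComputeHash reads the stream in small
    // internal buffers, so memory use stays far below the 1GB limit.
    static byte[] ComputeMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }

    static bool FilesMatchByHash(string pathA, string pathB)
    {
        // Both files must be read in full before any verdict is possible.
        return ComputeMd5(pathA).SequenceEqual(ComputeMd5(pathB));
    }
}
```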
The MD5 checksum will be slower, since you have to process both files in their entirety before you get an answer - and you say the files are 3GB each with only 1GB of spare RAM, so do the math.
Comparing them in byte chunks will detect any difference earlier, and you can rule out obvious mismatches up front by first checking the file sizes, etc.
I would go with option (b).
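A rough sketch of that chunked comparison (the 1MB buffer size is an arbitrary choice):

```csharp
using System.IO;

static class ChunkCompare
{
    // Compare two files chunk by chunk, stopping at the first mismatch.
    static bool FilesAreEqual(string pathA, string pathB, int bufferSize = 1 << 20)
    {
        if (new FileInfo(pathA).Length != new FileInfo(pathB).Length)
            return false;                              // cheap early exit on size

        using (var a = File.OpenRead(pathA))
        using (var b = File.OpenRead(pathB))
        {
            var bufA = new byte[bufferSize];
            var bufB = new byte[bufferSize];
            int readA;
            while ((readA = ReadFully(a, bufA)) > 0)
            {
                if (ReadFully(b, bufB) != readA)
                    return false;
                for (int i = 0; i < readA; i++)
                    if (bufA[i] != bufB[i])
                        return false;                  // mismatch found, stop here
            }
            return true;
        }
    }

    // Stream.Read may return fewer bytes than asked for, so keep reading
    // until the buffer is full or EOF is reached.
    static int ReadFully(Stream s, byte[] buffer)
    {
        int total = 0, n;
        while (total < buffer.Length &&
               (n = s.Read(buffer, total, buffer.Length - total)) > 0)
            total += n;
        return total;
    }
}
```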
Option A is only useful if you reuse the hash (i.e. have other files to compare) so that the cost of calculating the hash isn't a factor...
Otherwise Option B is what I would go for...
To get the maximum speed I would use `MemoryMappedFile` instances and XOR the content - the comparison can stop at the first difference encountered (i.e. when the XOR operation returns something != 0). Regarding memory consumption, you can use a "moving window" (i.e. via calls to `CreateViewAccessor`), which would allow for literally processing files of TB size...
It could even be worth testing the performance of the XOR approach against some LINQ-based comparison methods... and always start by comparing the file sizes - that way you avoid doing unnecessary calculations...
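A sketch of that idea - both the 64MB window size and the long-at-a-time XOR loop are just illustrative choices:

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

static class MappedCompare
{
    // Slide a read-only view ("moving window") over both mappings and XOR
    // the contents 8 bytes at a time; the first non-zero result means the
    // files differ.
    static bool FilesAreEqual(string pathA, string pathB, long window = 64L << 20)
    {
        long length = new FileInfo(pathA).Length;
        if (length != new FileInfo(pathB).Length)
            return false;                              // sizes differ: done

        using (var mmfA = MemoryMappedFile.CreateFromFile(pathA, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        using (var mmfB = MemoryMappedFile.CreateFromFile(pathB, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
        {
            for (long offset = 0; offset < length; offset += window)
            {
                long size = Math.Min(window, length - offset);
                using (var viewA = mmfA.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
                using (var viewB = mmfB.CreateViewAccessor(offset, size, MemoryMappedFileAccess.Read))
                {
                    long i = 0;
                    for (; i + sizeof(long) <= size; i += sizeof(long))
                        if ((viewA.ReadInt64(i) ^ viewB.ReadInt64(i)) != 0)
                            return false;
                    for (; i < size; i++)              // tail bytes, if any
                        if (viewA.ReadByte(i) != viewB.ReadByte(i))
                            return false;
                }
            }
            return true;
        }
    }
}
```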
Assuming you have no future use for the hash information (to compare against other texts, or to check after potential future changes), there are two cases: A) the documents are the same, or B) the documents are different.
If A, then there's almost no difference between the two scenarios. Both involve reading the entire files one chunk at a time and doing a calculation/compare on every byte. The computational overhead of the hash is minimal compared to the work of reading the files.
If B, then it's possible you'd find a difference in the first page of the files, at which point you'd be able to quit the process.
So depending on the relative probability of A vs. B, it seems direct comparison would be faster on average. Note also that you could then report where the difference occurs, which you could not in the hash scenario.