Checksum to detect duplicated files and renamed files

I have a question about file checksums.

In my test application, I got the same checksum value for a duplicate of my original file. Also, when my original file was renamed, the checksum generated was the same.

So, can I use the checksum to skip processing of a duplicated file or a renamed file?
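The behaviour described above is expected, because a checksum is computed from the file's bytes only and never sees the file name. A minimal sketch that reproduces it, assuming SHA-256 from Python's `hashlib` as the checksum (the question does not say which algorithm was used; `file_checksum` and `demo` are illustrative names):

```python
import hashlib
import os
import shutil
import tempfile


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file's bytes (the name is never read)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Checksum an original file, a copy of it, and the original after a rename."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        with open(original, "wb") as f:
            f.write(b"some file content\n")
        original_sum = file_checksum(original)

        # Duplicate: same bytes under a different name.
        duplicate = os.path.join(tmp, "duplicate.txt")
        shutil.copyfile(original, duplicate)
        duplicate_sum = file_checksum(duplicate)

        # Rename: the bytes are untouched, only the directory entry changes.
        renamed = os.path.join(tmp, "renamed.txt")
        os.rename(original, renamed)
        renamed_sum = file_checksum(renamed)

        return original_sum, duplicate_sum, renamed_sum
```

All three values returned by `demo()` are equal, which matches what the test application showed.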


Yes, but you should use a checksum algorithm that is suitable for generating fingerprints of your files (for example, a cryptographic hash such as SHA-256 rather than a simple additive sum). Not all checksums are suitable for this.


Well, in general, yes. It depends on what sort of checksum you're using, though.


You should use the checksum to decide that you might skip processing of a file. Use a file compare to actually decide.

A checksum on a new file will match that of your original file if their contents are the same. It will also match for some files that are not identical, because there are more possible file content strings than there are checksum values, no matter what checksum scheme you use. (You can make this very improbable, but you can't make the problem go away.)
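To make the collision point concrete, here is a small sketch using a deliberately weak toy checksum (the sum of all bytes modulo 256, a hypothetical scheme chosen only because its collisions are easy to see). Real schemes have far more possible values, but the same counting argument applies to them:

```python
def toy_checksum(data: bytes) -> int:
    """Toy checksum: sum of all byte values modulo 256 (only 256 possible results)."""
    return sum(data) % 256


# Two different contents made of the same bytes in a different order.
content_a = b"ab"
content_b = b"ba"

same_checksum = toy_checksum(content_a) == toy_checksum(content_b)
same_content = content_a == content_b
```

Here `same_checksum` is true while `same_content` is false, so a matching checksum alone cannot prove that two files are identical.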

So if file X (to be processed) has checksum C, the same as file A (already processed), you should compare the content of X with the content of A. If they are identical, you can use the answer for A as the answer for X. If your checksum scheme is at all decent and X and A are NOT identical, you will find out after comparing just a few bytes. (You can even check the file sizes first, but I doubt this saves you any time statistically.)
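The checksum-then-compare procedure could be sketched as follows. This is only an illustration under stated assumptions: SHA-256 is an arbitrary choice of checksum, and `find_processed_duplicate` and the `processed` dictionary (checksum mapped to paths of already processed files) are hypothetical names, not part of any library. The byte-for-byte comparison uses the standard `filecmp.cmp` with `shallow=False`, which also rejects files of different sizes before reading their content.

```python
import filecmp
import hashlib


def file_checksum(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file's content."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_processed_duplicate(x_path, processed):
    """Return the path of an already processed file identical to x_path, or None.

    `processed` (hypothetical structure) maps a checksum to a list of paths
    of files that have already been processed.
    """
    checksum = file_checksum(x_path)
    # A matching checksum only nominates candidates; it does not decide.
    for a_path in processed.get(checksum, []):
        # shallow=False forces a comparison of the actual file contents.
        if filecmp.cmp(x_path, a_path, shallow=False):
            return a_path
    return None
```

If the function returns a path, the result already computed for that file can be reused for X; if it returns `None`, X has to be processed and then added to `processed` under its checksum.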

Of course, there's the cost of computing the checksum on X: to compute it, you must read all of X. If generating the answer is cheap compared to doing disk reads, there isn't much point in avoiding the work.
