How does comparing images through md5 work?
Does this method compare the pixel values of the images? I'm guessing it won't work if the images are different sizes from each other, but what if they are identical and only saved in different formats? For example, I took a screenshot and saved it as a .jpg, then took another and saved it as a .gif.
An MD5 hash is computed over the actual binary data of the file, and different formats will have completely different binary data.
So for MD5 hashes to match, the files must be identical. (There are exceptions in fringe cases, namely hash collisions.)
This is actually one way forensic law enforcement finds data it deems contraband (in reference to images).
It is an MD5 checksum, the same thing you often see when downloading a file: if the MD5 of the downloaded file matches the MD5 given by the provider, then the file transfer was successful. See http://en.wikipedia.org/wiki/Checksum. If there is even 1 bit of difference between the 2 files, the resulting hash will be completely different.
Due to the difference in encoding between a JPG and GIF, the 2 will not have the same MD5 hash.
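To make the "even 1 bit of difference" point concrete, here is a minimal sketch using Python's standard hashlib module. The byte string is a made-up stand-in for a file's contents, not a real image.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of raw bytes as a hex string."""
    return hashlib.md5(data).hexdigest()

# Hypothetical stand-in for the bytes of an image file.
original = b"pretend these bytes are the contents of an image file"

# Flip the lowest bit of the first byte to simulate a 1-bit difference.
altered = bytes([original[0] ^ 0x01]) + original[1:]

print(md5_hex(original))
print(md5_hex(altered))
print("same hash:", md5_hex(original) == md5_hex(altered))
```

The two digests printed have nothing visibly in common, which is why a matching MD5 is only useful for detecting byte-identical files.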
md5 is a hash algorithm, so it does not compare images; it works on data. The data you put in can be nearly anything, like the contents of a file. It then outputs a hash string based on the contents, which is the raw data of the file.
So when you feed an image into md5, you are basically not comparing images but the raw data of the image files. The hash algorithm does not know anything about the image beyond its raw data, so a jpg and a gif (or any other image format) of the same screenshot will never be the same.
Even if you compare the decoded images, they will not produce the same hash, because there will be small differences the human eye cannot see (depending on the amount of compression used). This might be different when comparing the decoded data of losslessly encoded images, but I'm not sure about that.
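On the lossless case, a toy sketch may help. It assumes two made-up container formats ("RAW1" and "ZIP1", invented here for illustration, not real image formats) that both store the same pixel bytes without loss. Hashing the files gives different results, while hashing the decoded pixels gives the same one.

```python
import hashlib
import zlib

def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# A hypothetical 4x2 grayscale "image", one byte per pixel.
pixels = bytes([0, 0, 255, 255, 0, 0, 255, 255])

# Two invented lossless containers holding the same pixels.
file_a = b"RAW1" + pixels                 # stored uncompressed
file_b = b"ZIP1" + zlib.compress(pixels)  # stored with lossless compression

def decode(blob: bytes) -> bytes:
    """Recover the pixel bytes from either toy container."""
    tag, body = blob[:4], blob[4:]
    if tag == b"RAW1":
        return body
    if tag == b"ZIP1":
        return zlib.decompress(body)
    raise ValueError("unknown container")

files_match = md5_hex(file_a) == md5_hex(file_b)
pixels_match = md5_hex(decode(file_a)) == md5_hex(decode(file_b))
print("files match:", files_match)
print("decoded pixels match:", pixels_match)
```

With real formats the decoded buffers would also have to agree on color depth, palette and channel order before hashing, and any lossy step (such as JPEG compression) would break the match.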
Take a look at the Wikipedia article on hash functions for a more detailed explanation and technical background.
When you look at the raw bytes, a .jpg (JFIF) file starts with the bytes FF D8 and carries the marker 'JFIF' a few bytes in, while a .gif starts with 'GIF'. In other words, comparing the on-disk bytes of the "same image" in two different formats is pretty much guaranteed to produce two different MD5 hashes, since the files' contents differ, even if the actual image is the "same picture".
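As a rough illustration of how different the leading bytes are, here is a small sketch that guesses the format from a file's first bytes. The function name and the sample headers are made up for this example. Note that a JPEG file actually begins with the bytes FF D8 FF, with the 'JFIF' marker a few bytes in, and a GIF begins with 'GIF87a' or 'GIF89a'.

```python
def sniff_format(header: bytes) -> str:
    """Guess an image format from the first bytes of a file (sketch only)."""
    if header.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if header[:6] in (b"GIF87a", b"GIF89a"):
        return "gif"
    return "unknown"

# Hand-written sample headers, not taken from real files.
jpeg_header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00"
gif_header = b"GIF89a\x01\x00\x01\x00"

print(sniff_format(jpeg_header))
print(sniff_format(gif_header))
```

Since the very first bytes already differ, the MD5 of the two files cannot be expected to agree.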
To do a hash-based image comparison, you have to compare two images using the same format. It would be very difficult to produce a .jpg and a .gif of the same image that would compare equal if you converted them both to (say) a .bmp. The results would share a file format, but the internal requirements of .gif (8-bit palette, lossless LZW compression) vs. the internal requirements of .jpg (24-bit, lossy discrete cosine transform compression) mean it's nigh-on impossible to get the same .bmp from both source images.
If you're comparing hashes then every single byte of the two images will have to match - they can't use different compression formats, or "look the same". They have to be identical.
md5 is a hash. It is a code that is calculated from a bunch of data - any data really.
An md5 code is certainly not unique, but the chance that two different images have the exact same code is quite small. Therefore you could compare images by calculating an md5 code from each of them and comparing the codes.
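A minimal sketch of that idea, assuming the file contents are already in memory. The file names and byte strings are invented for illustration. It only finds byte-for-byte duplicates, so the .gif of the same screenshot is not grouped with the .jpg copies.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

def group_exact_duplicates(files: Dict[str, bytes]) -> List[List[str]]:
    """Group file names whose contents share the same MD5 digest."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.md5(data).hexdigest()].append(name)
    # Keep only digests seen more than once, i.e. exact duplicates.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]

# Made-up contents standing in for real image files.
files = {
    "shot.jpg": b"\xff\xd8\xff fake jpeg bytes",
    "copy_of_shot.jpg": b"\xff\xd8\xff fake jpeg bytes",
    "shot.gif": b"GIF89a fake gif bytes",
}

print(group_exact_duplicates(files))
```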
You cannot compare using the MD5 sum, as all the other posters have noted. However, you can compare the images in a different way, and it will tell you their similarity regardless of image type, or even size. You can use libPuzzle: http://libpuzzle.pureftpd.org/project/libpuzzle
This is a great library for image comparison and works very well.
It will still not work. Any image file contains a header portion and the binary image buffer. In the scenario described:
1. The headers will be different between .jpg and .gif, resulting in a different md5 sum.
2. The image buffer itself may be different due to image compression, as used by, say, the .jpg format.
md5sum is a tool used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change.
Most commonly, md5sum is used to verify that a file has not changed as a result of a faulty file transfer, a disk error or non-malicious meddling. The md5sum program is included in most Unix-like operating systems or compatibility layers such as Cygwin.
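For anyone without md5sum at hand, roughly the same integrity check can be sketched in Python with hashlib. The function names and the chunk size below are my own choices, not part of any tool.

```python
import hashlib

def file_md5(path: str, chunk_size: int = 65536) -> str:
    """Compute the MD5 of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: str, expected_hex: str) -> bool:
    """Check a file against a published checksum, ignoring case and whitespace."""
    return file_md5(path) == expected_hex.strip().lower()
```

Calling file_md5 on a .jpg and on a .gif of the same screenshot would return two unrelated digests, while calling it twice on the same unchanged file returns the same digest both times.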
Hence it cannot be used to compare images.
Running md5sum on images will generate an md5 hash based on each image's raw data. The output hash strings for these images will not be the same, since the image formats (GIF and JPEG) are not the same.
In addition, the file sizes of these images will not be the same either. GIF images can often be bigger than JPEG files, which means the MD5 hash strings will not tally at all.