Is it possible to detect duplicate image files?
I have over 10,000 product image files, and the problem is that many of the images are duplicates.
If there is no image, there is a standard image that says 'no image'.
How can I detect whether an image is this standard 'no image' file?
Update: The image has a different name, but it is exactly the same image otherwise.
People are saying to hash it, so would I do something like this?
import cStringIO
import md5
from PIL import Image

im = cStringIO.StringIO(file.read())          # 'file' is an already-open file object
img = Image.open(im)                          # decode the image with PIL
digest = md5.md5(img.tostring()).hexdigest()  # hash the decoded pixel data
As a side note, for images I find raster-data hashes far more effective than file hashes.
ImageMagick provides a reliable way to compute such hashes, and several Python bindings are available. This helps detect identical images stored with different lossless compressions or different metadata.
Usage example:
>>> import PythonMagick
>>> img = PythonMagick.Image("image.png")
>>> img.signature()
'e11cfe58244d7cf98a79bfdc012857a9391249dca3aedfc0fde4528eed7f7ba7'
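To apply this to the duplicate-hunting problem, one rough sketch (my own addition, not part of the original answer; the "products" directory is hypothetical and assumed to contain only image files) would group files by signature:

import os
from collections import defaultdict
import PythonMagick

# Group files by their ImageMagick signature (a hash of the raster data,
# so the same picture under different names or metadata groups together).
by_signature = defaultdict(list)
for name in os.listdir("products"):
    path = os.path.join("products", name)
    by_signature[PythonMagick.Image(path).signature()].append(path)

# Any signature that maps to more than one path is a set of duplicates.
for sig, paths in by_signature.items():
    if len(paths) > 1:
        print(sig, paths)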
I wrote a script for this a while back. First it scans all files, noting their sizes in a dictionary (a sketch of that step follows the example below). You end up with:
images[some_size] = ['x/a.jpg', 'b/f.jpg', 'n/q.jpg']
images[some_other_size] = ['q/b.jpg']
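The size-scanning step isn't shown in the answer; a minimal sketch of it, assuming all product images live under a single hypothetical "products" directory, could be:

import os
from collections import defaultdict

# Map file size -> list of image paths that have exactly that size.
images = defaultdict(list)
for dirpath, dirnames, filenames in os.walk("products"):   # hypothetical root directory
    for name in filenames:
        path = os.path.join(dirpath, name)
        images[os.path.getsize(path)].append(path)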
Then, for each key (image size) where there's more than 1 element in the dictionary, I'd read some fixed amount of the file and do a hash. Something like:
import os
import md5                       # use hashlib.md5 on Python 3
from collections import defaultdict

# Only sizes shared by more than one file can contain duplicates.
possible_dupes = [size for size in images if len(images[size]) > 1]

for size in possible_dupes:
    hashes = defaultdict(list)
    for fname in images[size]:
        m = md5.new()
        m.update(open(fname, 'rb').read(10000))   # hash only the first 10 KB
        hashes[m.digest()].append(fname)
    for k in hashes:
        if len(hashes[k]) <= 1:
            continue
        # keep the first file in each group of duplicates, delete the rest
        for fname in hashes[k][1:]:
            os.remove(fname)
This is all off the top of my head, haven't tested the code, but you get the idea.
Assuming you are talking about images that are identical in terms of their image data:
Compute the hash of the "no image" image and compare it to the hashes of the other images. If the hashes are the same, it is the same file.
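A minimal sketch of that comparison, using hashlib and made-up file names (the original answer doesn't specify an implementation):

import hashlib

def file_md5(path):
    # Hash the raw bytes of the file; identical files give identical digests.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# "no_image.jpg" and the product paths below are hypothetical names.
no_image_hash = file_md5("no_image.jpg")
product_paths = ["products/a.jpg", "products/b.jpg"]

for path in product_paths:
    if file_md5(path) == no_image_hash:
        print(path, "is the standard 'no image' placeholder")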
I had trouble installing PythonMagick on Fedora but Wand (another ImageMagick binding) worked.
from wand.image import Image
img = Image(filename="image.jpg")
print(img.signature)
Just be sure to install everything first:
yum install python3-wand ImageMagick
If you're looking for exact duplicates of a particular image: load this image into memory, then loop over your image collection; skip any file that doesn't have the same size; compare the contents of the files that have the same size, stopping at the first difference.
Computing a hash in this situation is actually counter-productive because you'd have to read each file completely into memory (instead of being able to stop at the first difference) and perform a CPU-intensive task on it.
If there are several sets of duplicates, on the other hand, computing a hash of each file is better.
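A sketch of the single-image search using the standard library's filecmp module, which checks file sizes first and then compares contents in chunks, stopping at the first difference (the paths are hypothetical):

import filecmp
import os

reference = "no_image.jpg"        # hypothetical path to the image you're looking for

duplicates = []
for dirpath, dirnames, filenames in os.walk("products"):   # hypothetical image root
    for name in filenames:
        path = os.path.join(dirpath, name)
        # filecmp.cmp with shallow=False compares sizes first, then reads the
        # contents chunk by chunk and stops at the first difference.
        if filecmp.cmp(reference, path, shallow=False):
            duplicates.append(path)

print(duplicates)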
If you're also looking for visual near-duplicates, findimagedupes can help you.
Hash them. Files with the same hash are duplicates (at least, the chance of two different files producing the same hash is so astronomically small that you can treat it as impossible in practice).