Is it possible to detect duplicate image files?

2023-01-09 14:24

I have over 10K image files for products. The problem is that many of the images are duplicates.

If there is no image, there is a standard image that says 'no image'.

How can I detect if the image is this standard 'no image' image file?

Update: The image has a different name, but it is otherwise exactly the same image.

People are saying Hash, so would I do this?

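import hashlib
from io import BytesIO
from PIL import Image  # assuming Pillow is what im.open was meant to be
im = BytesIO(file.read())  # file: an image file opened in binary mode
img = Image.open(im)
hashlib.md5(img.tobytes()).hexdigest()  # hash of the decoded pixel data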

As a side note, for images, I find raster data hashes to be far more effective than file hashes.

ImageMagick provides a reliable way to compute such hashes, and there are several Python bindings available. This helps detect identical images that were saved with different lossless compression or different metadata.

Usage example:

>>> import PythonMagick
>>> img = PythonMagick.Image("image.png")
>>> img.signature()
'e11cfe58244d7cf98a79bfdc012857a9391249dca3aedfc0fde4528eed7f7ba7'

I wrote a script for this a while back. First it scans all files, noting their sizes in a dictionary. You end up with:

images[some_size] = ['x/a.jpg', 'b/f.jpg', 'n/q.jpg']
images[some_other_size] = ['q/b.jpg']
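
The scanning step could look something like the sketch below. The function name group_by_size and the root argument are placeholders of mine, not names from the original script.

```python
import os
from collections import defaultdict

def group_by_size(root):
    """Map each file size in bytes to the list of paths with that size."""
    images = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            images[os.path.getsize(path)].append(path)
    return images
```

Calling images = group_by_size("products") would then give a dictionary of the shape shown above, where "products" is a hypothetical directory name.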

Then, for each key (file size) whose list has more than one entry, I'd read some fixed amount of each file and hash it. Something like:

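import hashlib
import os
from collections import defaultdict

possible_dupes = [size for size in images if len(images[size]) > 1]
for size in possible_dupes:
    hashes = defaultdict(list)
    for fname in images[size]:
        with open(fname, 'rb') as f:
            # Hash only the first 10000 bytes of each same-sized file
            digest = hashlib.md5(f.read(10000)).hexdigest()
        hashes[digest].append(fname)
    for k in hashes:
        if len(hashes[k]) <= 1: continue
        # Keep the first file with this hash and delete the others
        for fname in hashes[k][1:]:
            os.remove(fname)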

This is all off the top of my head, haven't tested the code, but you get the idea.

I'm assuming that by "same image" you mean identical image data.

Compute the hash of the "no image" image and compare it to the hashes of the other images. If the hashes are the same, it is the same file.
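
A minimal sketch of that comparison, assuming plain file hashes are good enough (the files are byte-for-byte copies). The helper names file_md5 and find_placeholders are made up for illustration.

```python
import hashlib

def file_md5(path):
    """Return the MD5 hex digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def find_placeholders(reference_path, candidate_paths):
    """Return the candidates whose bytes hash to the same value as the reference."""
    reference = file_md5(reference_path)
    return [p for p in candidate_paths if file_md5(p) == reference]
```

Here reference_path would be the known 'no image' file and candidate_paths the product images to check.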

I had trouble installing PythonMagick on Fedora but Wand (another ImageMagick binding) worked.

from wand.image import Image

img = Image(filename="image.jpg")
print(img.signature)

Just be sure to install everything first:

yum install python3-wand ImageMagick

If you're looking for exact duplicates of a particular image: load this image into memory, then loop over your image collection; skip any file that doesn't have the same size; compare the contents of the files that have the same size, stopping at the first difference.
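
One way to sketch this in Python is with the standard library's filecmp module, which compares file contents block by block and stops at the first block that differs. Note that, unlike the description above, this version reads the reference file from disk on each call rather than keeping it in memory. The function name is_copy_of is my own.

```python
import filecmp
import os

def is_copy_of(reference_path, candidate_path):
    """Return True if both files contain exactly the same bytes."""
    # Cheap check first: files of different sizes cannot be identical
    if os.path.getsize(reference_path) != os.path.getsize(candidate_path):
        return False
    # shallow=False forces a content comparison instead of trusting os.stat() data
    return filecmp.cmp(reference_path, candidate_path, shallow=False)
```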

Computing a hash in this situation is actually counter-productive because you'd have to read each file completely (instead of being able to stop at the first difference) and perform a CPU-intensive task on it.

If there are several sets of duplicates, on the other hand, computing a hash of each file is better.
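
For that case, a rough sketch might group paths by a digest of the full file contents. SHA-256 and the 64 KiB chunk size are arbitrary choices of mine, and group_by_digest is a hypothetical name.

```python
import hashlib
from collections import defaultdict

def group_by_digest(paths, chunk_size=65536):
    """Return {digest: [paths]} for every digest shared by two or more files."""
    groups = defaultdict(list)
    for path in paths:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Feed the file to the hash in chunks so memory use stays small
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        groups[h.hexdigest()].append(path)
    return {digest: ps for digest, ps in groups.items() if len(ps) > 1}
```

Each file is read once, so the cost stays linear in the total size of the collection no matter how many sets of duplicates there are.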

If you're also looking for visual near-duplicates, findimagedupes can help you.

Hash them. Files with matching hashes are duplicates (strictly speaking, two different files can share a hash, but for accidental duplicates this is so improbable that you can ignore it).
