Finding duplicate files by content across multiple directories

I have downloaded some files from the internet related to a particular topic, and now I want to check whether any of them are duplicates. The issue is that the names of the files will be different, but the content may match.

Is there any way to write some code that will iterate through the multiple folders and report which of the files are duplicates?


If you are working on Linux/*nix systems, you can use SHA tools such as sha512sum, now that MD5 collisions can be generated deliberately:

find /path -type f -print0 | xargs -0 sha512sum | awk '{file=substr($0, index($0, "  ") + 2)} ($1 in seen){print "duplicate: " file " and " seen[$1]} !($1 in seen){seen[$1]=file}'

If you want to work in Python, here is a simple implementation:

import hashlib
import os

def sha(filename):
    '''Return the SHA-512 hex digest of a file, or None on error.'''
    d = hashlib.sha512()
    try:
        # open in binary mode so binary files and all platforms hash identically
        with open(filename, 'rb') as f:
            # read in fixed-size chunks to keep memory usage bounded
            for chunk in iter(lambda: f.read(65536), b''):
                d.update(chunk)
    except OSError as e:
        print(e)
        return None
    return d.hexdigest()

s = {}
path = os.path.join("/home", "path1")
for r, d, f in os.walk(path):
    for name in f:
        filename = os.path.join(r, name)
        digest = sha(filename)
        if digest is None:
            continue  # unreadable file; already reported
        if digest not in s:
            s[digest] = filename
        else:
            print("Duplicates: %s <==> %s" % (filename, s[digest]))

If you think that sha512sum is not enough, you can confirm matches byte for byte with Unix tools like diff, or with filecmp in Python.
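For instance, a minimal filecmp check might look like this (the file names are placeholders):

import filecmp

# shallow=False compares actual file contents rather than just os.stat() signatures
if filecmp.cmp('file_a', 'file_b', shallow=False):
    print('identical content')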


You can traverse the folders recursively, compute the MD5 of each file, and then look for duplicate MD5 values; this will give you duplicate files content-wise. Which language do you want to implement this in?

The following Perl program does exactly that:

use strict;
use warnings;
use File::Find;
use Digest::MD5 qw(md5_hex);

my @directories_to_search = ('a', 'e');
my %hash;

find(\&wanted, @directories_to_search);

sub wanted {
        return unless -f $_;   # File::Find has already chdir'd, so $_ is the bare name
        open my $fh, '<', $_ or die "Cannot open $File::Find::name: $!";
        binmode $fh;           # read raw bytes so binary files hash correctly
        my $con = do { local $/; <$fh> };   # slurp the whole file
        close $fh;
        my $digest = md5_hex($con);         # compute the digest once
        if ($hash{$digest}) {
                print "Dup found: $File::Find::name and $hash{$digest}\n";
        } else {
                $hash{$digest} = $File::Find::name;
        }
}


MD5 is a good way to find two identical files, but a matching hash is not sufficient to assume that two files are identical! (In practice the risk is small, but it exists.) So you also need to compare the content.

PS: Also, if you just want to check the text content, keep in mind that the line-ending characters differ between Windows ('\r\n') and Linux ('\n'), so a byte-for-byte comparison can report two otherwise identical text files as different.
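A minimal Python sketch of such a comparison (the helper name same_content and the ignore_line_endings flag are my own):

def same_content(path_a, path_b, ignore_line_endings=False):
    '''Byte-for-byte comparison, optionally normalizing Windows line endings.'''
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        a, b = fa.read(), fb.read()
    if ignore_line_endings:
        # treat '\r\n' (Windows) and '\n' (Linux) as equivalent
        a = a.replace(b'\r\n', b'\n')
        b = b.replace(b'\r\n', b'\n')
    return a == b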

EDIT:

Reference: two different files can have the same MD5 checksum (MD5 collision vulnerability, Wikipedia):

However, now that it is easy to generate MD5 collisions, it is possible for the person who created the file to create a second file with the same checksum, so this technique cannot protect against some forms of malicious tampering. Also, in some cases the checksum cannot be trusted (for example, if it was obtained over the same channel as the downloaded file), in which case MD5 can only provide error-checking functionality: it will recognize a corrupt or incomplete download, which becomes more likely when downloading larger files.


Do a recursive search through all the files, grouping them by size; for any byte size shared by two or more files, do an MD5 or SHA-1 hash computation to see if they are in fact identical.
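A minimal Python sketch of that size-first approach (the function name find_duplicates is my own, and SHA-1 stands in for the MD5/SHA-1 choice above):

import hashlib
import os
from collections import defaultdict

def find_duplicates(root):
    '''Group files under root by size, then hash only same-size candidates.'''
    by_size = defaultdict(list)
    for r, _, files in os.walk(root):
        for name in files:
            path = os.path.join(r, name)
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError:
                pass  # skip unreadable entries such as broken symlinks
    by_hash = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        for path in paths:
            with open(path, 'rb') as f:
                by_hash[hashlib.sha1(f.read()).hexdigest()].append(path)
    return [group for group in by_hash.values() if len(group) > 1]

Hashing only the size-collision groups avoids reading files that cannot possibly have a duplicate.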

Regex will not help with this problem.

There are plenty of code examples on the net; I don't have time to knock out this code now. (This will probably elicit some downvotes - shrug!)
