Script to find all similarly named files (differing only by case?)
I have been working on an SVN repo using the command line only. I now have to bring in users who require a GUI to interface with the repo; however, this is presenting a number of problems with similarly named files.
As it happens, a large number of images have been duplicated, owing to lack of communication or laziness.
I would like to be able to search recursively from a given folder and identify all files that differ only by case/capitalization and also have the same file size, since it is certainly possible that name conflicts exist between genuinely different files, although I've not encountered any yet.
I don't mind hammering out a Perl script to handle this myself, but I'm wondering whether such a thing already exists, or if anybody has any tips, before I roll my sleeves up?
Thanks :D
I lean on md5sum for this type of problem:
find * -type f | xargs md5sum | sort | uniq -Dw32
If you are using svn, you'll want to exclude your .svn directories. This will print out all files with their paths that have identical content.
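For example, pruning the .svn directories could look like this (a sketch that keeps the same GNU uniq -Dw32 trick as above):
find . -name .svn -type d -prune -o -type f -print | xargs md5sum | sort | uniq -Dw32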
If you really want to only match files that differ by case, you can add a few more things to the above pipeline:
find * -type f | xargs md5sum | sort | uniq -Dw32 | awk -F'[ /]' '{ print $NF }' | sort -f | uniq -Di
This strips the directory part, so only the duplicated base names are printed, for example:
myimage_23.png
MyImage_23.png
A shell script to list every filename in a Subversion working directory that differs only in case from another filename in the same directory (and will therefore cause problems for Subversion clients on case-insensitive file systems, which cannot distinguish between such names):
find . -name .svn -type d -prune -false -o -print | \
  perl -ne 'push @{$f{lc($_)}}, $_; END{map{print @{$f{$_}}} grep {@{$f{$_}}>1} sort keys %f}'
I have not used it personally but the Duplicate Files Finder looks like it would be suitable.
However, it will identify any duplicate files, regardless of file name, so you might have to filter the results if you only want duplicates with case-insensitive-matching file names.
It is open source, available on Windows and Linux, has both command line and GUI interfaces, and from the description the algorithm sounds very fast (only compares files with the same size rather than producing a checksum for every file).
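If you would rather stay on the command line, a rough sketch of that same-size-first idea might look like this (assuming GNU find, awk, xargs, and md5sum; untested against your tree):
# checksum only files whose size occurs more than once
find . -name .svn -type d -prune -o -type f -printf '%s\t%p\n' \
  | awk -F'\t' '{ n[$1]++; p[$1] = p[$1] $2 "\n" }
                END { for (s in n) if (n[s] > 1) printf "%s", p[s] }' \
  | xargs -d '\n' md5sum | sort | uniq -Dw32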
I guess it would be something like:
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

sub check_dir {
    my ($dir, $out) = @_;
    $out ||= [];
    opendir my $dh, $dir or die "Impossible to read dir $dir: $!";
    # Ignore entries starting with a dot (this also skips .svn)
    my @files = sort { lc $a cmp lc $b } grep { /^[^.]/ } readdir($dh);
    closedir $dh;
    # Full paths of the non-directory entries, in case-insensitive order
    my @nd = grep { ! -d $_ } map { File::Spec->catfile($dir, $_) } @files;
    for my $i (0 .. $#nd - 1) {
        push @$out, $nd[$i]
            if lc $nd[$i] eq lc $nd[$i+1]
            and -s $nd[$i] == -s $nd[$i+1];
    }
    # Recurse into subdirectories
    for my $entry (@files) {
        my $path = File::Spec->catfile($dir, $entry);
        check_dir($path, $out) if -d $path;
    }
    return $out;
}

print join "\n", @{ check_dir(shift @ARGV) }, "";
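Saved to a file, the invocation would be something like this (the script name here is just an example):
perl find_case_dups.pl /path/to/images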
Please check it before using it; I have no access to Windows machines (this problem does not arise on Un*x). Also note that when two files have the same name (except for case) and the same size, only the first will be printed; with three such files, only the first two, and so on (of course, you will need to keep one!).
As far as I know, what you want doesn't exist as such. However, here's an implementation in bash:
#!/bin/bash
dir=("$@")
matched=()
files=()

lc(){ tr '[:upper:]' '[:lower:]' <<< "${*}" ; }

in_list() {
    local search="$1"
    shift
    local list=("$@")
    for file in "${list[@]}" ; do
        [[ $file == "$search" ]] && return 0
    done
    return 1
}

while read -r file ; do
    files+=("$file")
done < <(find "${dir[@]}" -type f | sort)

for file1 in "${files[@]}" ; do
    for file2 in "${files[@]}" ; do
        if
            # check that the file did not match already
            ! in_list "$file1" "${matched[@]}" &&
            # check that the files are not the same file
            ! [ "$(stat -f %i "${file1}")" -eq "$(stat -f %i "${file2}")" ] &&
            # check that the size of the files are the same
            [ "$(stat -f %z "${file1}")" = "$(stat -f %z "${file2}")" ] &&
            # check that the non-directory part (aka file name) of the two
            # files match case insensitively
            [ "$(lc "${file1##*/}")" = "$(lc "${file2##*/}")" ]
        then
            matched+=("$file1")
            echo "$file1"
            break
        fi
    done
done
EDIT: Added comments and, inspired by TLP's comment, made only the file part of the path matter for equality comparisons. This has now been tested to a reasonable minimum degree and I expect that it won't explode in your face.
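One portability note: the stat -f %i and stat -f %z calls above are the BSD/macOS spellings. If you run this on Linux with GNU coreutils instead (an assumption about your platform, not part of the original script), the equivalent calls would be:
stat -c %i "$file1"   # inode number (GNU coreutils)
stat -c %s "$file1"   # size in bytes (GNU coreutils)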
Here's a Ruby script to recursively search for files that differ only in case.
#!/usr/bin/ruby
# encoding: utf-8

def search( directory )
  # Group the entries of this directory by their uppercased path.
  set = {}
  Dir.entries( directory ).each do |entry|
    next if entry == '.' || entry == '..'
    path = File.join( directory, entry )
    key = path.upcase
    set[ key ] = [] unless set.has_key?( key )
    set[ key ] << entry
    search( path ) if File.directory?( path )
  end
  # Keep only the groups with more than one entry, i.e. names that
  # collapse to the same thing when case is ignored, and print them.
  set.delete_if { |key, entries| entries.size == 1 }
  set.each do |key, entries|
    entries.each do |entry|
      puts File.join( directory, entry )
    end
  end
end

search( File.expand_path( ARGV[ 0 ] ) )