Script to find all similarly named files (differing only by case?)
I have been working on an SVN repo using the command line only. I now have to bring in users who require a GUI to interface with the repo; however, this is presenting a number of problems with similarly named files.
As it happens, a large number of images have been duplicated, owing to lack of communication or laziness.
I would like to be able to search recursively from a given folder and identify all files that differ only by case/capitalization and also have the same file size, since it is certainly possible that name conflicts exist between genuinely different files, although I've not encountered any yet.
I don't mind hammering out a Perl script to handle this myself, but I'm wondering whether such a thing already exists, or if anybody has any tips, before I roll my sleeves up?
Thanks :D
I lean on md5sum for this type of problem:
find * -type f | xargs md5sum | sort | uniq -Dw32
If you are using svn, you'll want to exclude your .svn directories. This will print out all files with their paths that have identical content.
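For example, pruning the .svn directories could look like this (a sketch that keeps the same GNU uniq -Dw32 trick as above):
find . -name .svn -type d -prune -o -type f -print | xargs md5sum | sort | uniq -Dw32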
If you really want to only match files that differ by case, you can add a few more things to the above pipeline:
find * -type f | xargs md5sum | sort | uniq -Dw32 | awk -F'[ /]' '{ print $NF }' | sort -f | uniq -Di
This strips the directory part, so only the duplicated base names are printed, for example:
myimage_23.png
MyImage_23.png
A shell script to list every filename in a Subversion working directory that differs only in case from another filename in the same directory (and will therefore cause problems for Subversion clients on case-insensitive file systems, which cannot distinguish between such names):
find . -name .svn -type d -prune -false -o -print | \
  perl -ne 'push @{$f{lc($_)}}, $_; END{map{print @{$f{$_}}} grep {@{$f{$_}}>1} sort keys %f}'
I have not used it personally but the Duplicate Files Finder looks like it would be suitable.
However, it will identify any duplicate files, regardless of file name, so you might have to filter the results if you only want duplicates with case-insensitive-matching file names.
It is open source, available on Windows and Linux, has both command line and GUI interfaces, and from the description the algorithm sounds very fast (only compares files with the same size rather than producing a checksum for every file).
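If you would rather stay on the command line, a rough sketch of that same-size-first idea might look like this (assuming GNU find, awk, xargs, and md5sum; untested against your tree):
# checksum only files whose size occurs more than once
find . -name .svn -type d -prune -o -type f -printf '%s\t%p\n' \
  | awk -F'\t' '{ n[$1]++; p[$1] = p[$1] $2 "\n" }
                END { for (s in n) if (n[s] > 1) printf "%s", p[s] }' \
  | xargs -d '\n' md5sum | sort | uniq -Dw32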
I guess it would be something like:
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

sub check_dir {
    my ($dir, $out) = @_;
    $out ||= [];
    opendir my $dh, $dir or die "Impossible to read dir $dir: $!";
    # Ignore entries starting with a dot (this also skips .svn)
    my @files = sort { lc $a cmp lc $b } grep { /^[^.]/ } readdir($dh);
    closedir $dh;
    # Full paths of the non-directory entries, in case-insensitive order
    my @nd = grep { ! -d $_ } map { File::Spec->catfile($dir, $_) } @files;
    for my $i (0 .. $#nd - 1) {
        push @$out, $nd[$i]
            if lc $nd[$i] eq lc $nd[$i+1]
            and -s $nd[$i] == -s $nd[$i+1];
    }
    # Recurse into subdirectories
    for my $entry (@files) {
        my $path = File::Spec->catfile($dir, $entry);
        check_dir($path, $out) if -d $path;
    }
    return $out;
}

print join "\n", @{ check_dir(shift @ARGV) }, "";
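Saved to a file, the invocation would be something like this (the script name here is just an example):
perl find_case_dups.pl /path/to/images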
Please check it before using it; I have no access to Windows machines (this problem does not arise on Un*x). Also note that when two files have the same name (except for case) and the same size, only the first will be printed; with three such files, only the first two, and so on (of course, you will need to keep one!).
As far as I know, what you want doesn't exist as such. However, here's an implementation in bash:
#!/bin/bash
dir=("$@")
matched=()
files=()

lc(){ tr '[:upper:]' '[:lower:]' <<< "${*}" ; }

in_list() {
    local search="$1"
    shift
    local list=("$@")
    for file in "${list[@]}" ; do
        [[ $file == "$search" ]] && return 0
    done
    return 1
}

while read -r file ; do
    files+=("$file")
done < <(find "${dir[@]}" -type f | sort)

for file1 in "${files[@]}" ; do
    for file2 in "${files[@]}" ; do
        if
            # check that the file did not match already
            ! in_list "$file1" "${matched[@]}" &&
            # check that the files are not the same file
            ! [ "$(stat -f %i "${file1}")" -eq "$(stat -f %i "${file2}")" ] &&
            # check that the size of the files are the same
            [ "$(stat -f %z "${file1}")" = "$(stat -f %z "${file2}")" ] &&
            # check that the non-directory part (aka file name) of the two
            # files match case insensitively
            [ "$(lc "${file1##*/}")" = "$(lc "${file2##*/}")" ]
        then
            matched+=("$file1")
            echo "$file1"
            break
        fi
    done
done
EDIT: Added comments and, inspired by TLP's comment, made only the file part of the path matter for equality comparisons. This has now been tested to a reasonable minimum degree and I expect that it won't explode in your face.
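One portability note: the stat -f %i and stat -f %z calls above are the BSD/macOS spellings. If you run this on Linux with GNU coreutils instead (an assumption about your platform, not part of the original script), the equivalent calls would be:
stat -c %i "$file1"   # inode number (GNU coreutils)
stat -c %s "$file1"   # size in bytes (GNU coreutils)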
Here's a Ruby script to recursively search for files that differ only in case.
#!/usr/bin/ruby
# encoding: utf-8

def search( directory )
  # Group the entries of this directory by their uppercased path.
  set = {}
  Dir.entries( directory ).each do |entry|
    next if entry == '.' || entry == '..'
    path = File.join( directory, entry )
    key = path.upcase
    set[ key ] = [] unless set.has_key?( key )
    set[ key ] << entry
    search( path ) if File.directory?( path )
  end
  # Keep only the groups with more than one entry, i.e. names that
  # collapse to the same thing when case is ignored, and print them.
  set.delete_if { |key, entries| entries.size == 1 }
  set.each do |key, entries|
    entries.each do |entry|
      puts File.join( directory, entry )
    end
  end
end

search( File.expand_path( ARGV[ 0 ] ) )