Replacing the last lines of a group of text using AWK
I have this output from running various commands:
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
d41d8cd98f00b204e9800998ecf8427e 1317506438 /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
2430ffcf28e7ef6990e46ae081f1fb08 1317522636 /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
2430ffcf28e7ef6990e46ae081f1fb08 1317506569 /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
I want to pipe it through awk to make it look like this:
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
Any ideas?
Some clarifications:
The last file before the newline or EOF will always be the original file; everything before it should be marked as a duplicate.
The first column is the md5sum of the file and the second is the modification date. You will notice that the last file in a group always has the oldest timestamp; this is the criterion I am using to determine which file is the "original": the oldest file.
Here are the commands I'm using to get the list of all duplicates:
find ${PWD} -type f -exec stat -c %Y {} \; -exec md5sum '{}' \; | sed -r 'N;s/([0-9]+)\n([^ ]+) /\2 \1/g' | sort -r | uniq -w 32 --all-repeated=separate
Sort the lines (using sort), store the hash in a temporary variable, and compare it with the current one using an if statement. Another if statement should get rid of possible blank lines.
For example:
| sort | awk '{ if ($0) { if (TEMP != $1) { print "Original: " $0 } else { print "Duplicate: " $0 } TEMP = $1 } }'
Edit:
Since you provided those clarifications, you could do it this way:
| tac | awk '{ if ($0) { if (TEMP != $1) { print "Original: " $0 } else { print "Duplicate: " $0 } TEMP = $1 } else { print "" } }' | tac
tac inverts the line order, achieving exactly what sort did in the first example. The second tac restores the original order.
This sed one-liner might work:
sed '$G' source | # append a newline to source
sed -nr '$!{N;s/^([^ ]+ )[^ ]+ +(.*\n)\1/Duplicate: \2\1/;s/^[^ ]+ [^ ]+ +(.*\n)$/Original: \1/;P;D}'
Appending a newline to the source turns the problem into two plain substitutions, with no special-casing needed at EOF.
I guess a sed solution is acceptable as you used sed in the source file prep.
How do you know what's a copy and what's the original? That would be my question.
It would be easy if the duplicates all had Copy in the name, but in your first example one of the duplicates is called New Text Document.txt, and the original is in the .svn directory, which should never have been looked at.
It looks like you have the MD5 hash in the first column, which means you could sort on that, and then use awk to loop through your output and print a blank line whenever the hash changes. That would group your files together.
The original vs. copy question is going to be much more difficult. You'll have to work out a good criterion for that. You might choose the earliest modification date (mdate). You could sort on that too. When you break on the hash, you could simply assume the first file in the list (because it has the earliest date) to be the original.
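For instance, a minimal sketch of that grouping (untested; input here is a hypothetical file holding the hash/date/name listing):
# sort by hash, then by date (oldest first); print a blank line when the hash changes
sort -k1,1 -k2,2n input |
awk 'NR > 1 && $1 != prev { print "" } { print; prev = $1 }'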
Or, you could simply assume that the ones with the word Copy embedded in the file name are the copies. And then it might not really matter all that much. Do you want the program to merely identify duplicates or delete them? If the program is merely identifying duplicates, there's no need to figure out which ones are the original and which ones are the duplicates. You can probably do that better with your eye than any algorithm.
By the way, what exactly are the three columns? I'm assuming the first is a hash and the last is the file name, but what is the middle one?
Maybe this will work, if blank lines appear after the last line of each group, including the very last group, and if the file names never contain blanks. It hinges on the presence of the blank lines.
awk 'NF == 3 { if (save != "") { printf("Duplicate: %s\n", save); } save = $3; }
     NF == 0 { printf("Original: %s\n", save); save = ""; }'
If the last blank line is missing, the last line will not be printed.
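If the trailing blank line cannot be guaranteed, one fix (a small sketch) is an extra END rule on the script above:
END { if (save != "") printf("Original: %s\n", save); }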
Either way, this doesn't work because of the blanks in the file names (so most lines do not have just 3 fields). Awk is not really the most appropriate tool. I tend to use Perl when Awk is not suitable:
#!/usr/bin/env perl
use strict;
use warnings;

my $save = "";
while (<>)
{
    chomp;
    if ($_ =~ m/^ (?:[\da-fA-F]+) \s+ (?:\d+) \s+ (\S.*)/x)
    {
        # A data line: the previously remembered file is a duplicate.
        print "Duplicate: $save\n" if $save ne "";
        $save = $1;
    }
    else
    {
        # A blank line ends a group: the last file seen is the original.
        print "Original: $save\n\n";
        $save = "";
    }
}
print "Original: $save\n" if $save ne "";   # the last group has no trailing blank line
This produces:
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
If you must use Awk, then you'll need to work on $0 when NF >= 3, removing the hash and inode number (or whatever the second value on the data line is) to find the filename.
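For example, a sketch of that in Awk (untested, mirroring the Perl logic above):
awk 'NF >= 3 { file = $0
               sub(/^[^ ]+ +[^ ]+ +/, "", file)        # strip hash and second value
               if (save != "") printf("Duplicate: %s\n", save)
               save = file }
     NF == 0 { printf("Original: %s\n\n", save); save = "" }
     END     { if (save != "") printf("Original: %s\n", save) }'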
awk '{
    for (i = 1; i < NF; i++) {
        sub(/^[^ ]+ +[^ ]+ +/, "", $i)   # keep only the file name
        print "Duplicate:", $i
    }
    sub(/^[^ ]+ +[^ ]+ +/, "", $NF)
    print "Original:", $NF
}' FS='\n' RS= infile