Replacing the last lines of a group of text using AWK
I have this output from running various commands:
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
d41d8cd98f00b204e9800998ecf8427e 1317522632 /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
d41d8cd98f00b204e9800998ecf8427e 1317506438 /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
2430ffcf28e7ef6990e46ae081f1fb08 1317522636 /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
2430ffcf28e7ef6990e46ae081f1fb08 1317506569 /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
I want to pipe it through awk to make it look like this:
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
Any ideas?
Some clarifications:
The last file before the newline or EOF will always be the original file; everything before it should be marked as a duplicate.
The first column is the md5sum of the file and the second is the modification date. You will notice that the last file in a group always has the oldest timestamp; this is the criterion I am using to determine which file is the "original": the oldest file.
Here are the commands I'm using to get the list of all duplicates:
find ${PWD} -type f -exec stat -c %Y {} \; -exec md5sum '{}' \; | sed -r 'N;s/([0-9]+)\n([^ ]+) /\2 \1/g' | sort -r | uniq -w 32 --all-repeated=separate
Sort the lines (using sort), store the hash in a temporary variable, and compare it with the current one using an if statement. Another if statement should get rid of possible blank lines.
For example:
| sort | awk '{ if ($0) { if (TEMP != $1) { print "Original: " $0 } else { print "Duplicate: " $0 } TEMP = $1 } }'
Edit:
Since you provided those clarifications, you could do it this way:
| tac | awk '{ if ($0) { if (TEMP != $1) { print "Original: " $0 } else { print "Duplicate: " $0 } TEMP = $1 } else { print "" } }' | tac
tac inverts the line order, achieving exactly what sort did in the first example. The second tac restores the original order.
This sed one-liner might work:
sed '$G' source | # append a newline to source
sed -nr '$!{N;s/^([^ ]+ )[^ ]+ +(.*\n)\1/Duplicate: \2\1/;s/^[^ ]+ [^ ]+ +(.*\n)$/Original: \1/;P;D}'
Appending a newline to the source turns the problem into two plain substitutions, with no special-casing needed at EOF.
I guess a sed solution is acceptable as you used sed in the source file prep.
How do you know what's a copy and what's the original? That would be my question.
It would be easy if the duplicates all had Copy in the name, but in your first example one of the duplicates is called New Text Document.txt, and the original is in the .svn directory, which should never have been looked at.
It looks like you have the MD5 hash in the first column, which means you could sort on that, and then use awk to loop through your output and print a blank line whenever the hash changes. That would group your files together.
The original vs. copy question is going to be much more difficult. You'll have to work out a good criterion for that. You might choose the earliest modification date (mdate). You could sort on that too. When you break on the hash, you could simply assume the first file in the list (because it has the earliest date) to be the original.
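For instance, a minimal sketch of that grouping (untested; input here is a hypothetical file holding the hash/date/name listing):
# sort by hash, then by date (oldest first); print a blank line when the hash changes
sort -k1,1 -k2,2n input |
awk 'NR > 1 && $1 != prev { print "" } { print; prev = $1 }'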
Or, you could simply assume that the ones with the word Copy embedded in the file name are the copies. And then it might not really matter all that much. Do you want the program to merely identify duplicates or delete them? If the program is merely identifying duplicates, there's no need to figure out which ones are the original and which ones are the duplicates. You can probably do that better with your eye than any algorithm.
By the way, what exactly are the three columns? I'm assuming the first is a hash and the last is the file name, but what is the middle one?
Maybe this will work, if blank lines appear after the last line of each group, including the very last group, and if the file names never contain blanks. It hinges on the presence of the blank lines.
awk 'NF == 3 { if (save != "") { printf("Duplicate: %s\n", save); } save = $3; }
     NF == 0 { printf("Original: %s\n", save); save = ""; }'
If the last blank line is missing, the last line will not be printed.
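If the trailing blank line cannot be guaranteed, one fix (a small sketch) is an extra END rule on the script above:
END { if (save != "") printf("Original: %s\n", save); }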
Either way, this doesn't work because of the blanks in the file names (so most lines do not have just 3 fields). Awk is not really the most appropriate tool. I tend to use Perl when Awk is not suitable:
#!/usr/bin/env perl
use strict;
use warnings;

my $save = "";
while (<>)
{
    chomp;
    if ($_ =~ m/^ (?:[\da-fA-F]+) \s+ (?:\d+) \s+ (\S.*)/x)
    {
        # A data line: the previously remembered file is a duplicate.
        print "Duplicate: $save\n" if $save ne "";
        $save = $1;
    }
    else
    {
        # A blank line ends a group: the last file seen is the original.
        print "Original: $save\n\n";
        $save = "";
    }
}
print "Original: $save\n" if $save ne "";   # the last group has no trailing blank line
This produces:
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy.txt
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New Text Document - Copy (2).txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/.svn/tmp/tempfile.tmp
Duplicate: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2 - Copy.txt
Original: /home/evan/school_work/unix/Projects/Project2/finddups/test/New folder/junk2.txt
If you must use Awk, then you'll need to work on $0 when NF >= 3, removing the hash and inode number (or whatever the second value on the data line is) to find the filename.
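For example, a sketch of that in Awk (untested, mirroring the Perl logic above):
awk 'NF >= 3 { file = $0
               sub(/^[^ ]+ +[^ ]+ +/, "", file)        # strip hash and second value
               if (save != "") printf("Duplicate: %s\n", save)
               save = file }
     NF == 0 { printf("Original: %s\n\n", save); save = "" }
     END     { if (save != "") printf("Original: %s\n", save) }'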
awk '{
    for (i = 1; i < NF; i++) {
        sub(/^[^ ]+ +[^ ]+ +/, "", $i)   # keep only the file name
        print "Duplicate:", $i
    }
    sub(/^[^ ]+ +[^ ]+ +/, "", $NF)
    print "Original:", $NF
}' FS='\n' RS= infile