awk: how to remove duplicates in fields only if the previous fields are the same

I am trying to remove duplicates from fields (and replace them with blanks) only if the previous fields are the same. For example:

Sample input:

France  Paris      Museum of Fine Arts          blabala
France  Paris      Museum of Fine Arts          blajlk
France  Paris      Yet another museum           lqmsjdf
France  Paris      Museum of National History            mlqskjf
France  Bordeaux   Museum of Fine Arts          qsfsqf
France  Bordeaux   City Hall                lmqjflqsk
France  Bordeaux   City Hall                    lqkjfqlskjflqskfj
Spain   Madrid     Museum of Fine Arts          lqksjfh
Spain   Madrid     Museum of Fine Arts          qlmfjlqsjf
Spain   Barcelona  City Hall                nvqjvvnqk
Spain   Barcelona  Museum of Fine Arts          lmkqjflqksfj

Desired output:

France    Paris        Museum of Fine Arts                   blabala
                                                             blajlk
                       Yet another museum                    lqmsjdf
                       Museum of National History            mlqskjf
          Bordeaux     Museum of Fine Arts                   qsfsqf
                       City Hall                             lmqjflqsk
                                                             lqkjfqlskjflqskfj
Spain     Madrid       Museum of Fine Arts                   lqksjfh
                                                             qlmfjlqsjf
          Barcelona    City Hall                             nvqjvvnqk
                       Museum of Fine Arts                   lmkqjflqksfj

Thank you very much in advance for any help.


Give this a try:

awk -F '\t' 'BEGIN {OFS=FS} {if ($1 == prev1) $1 = ""; else prev1 = $1; if ($2 == prev2) $2 = ""; else prev2 = $2; if ($3 == prev3) $3 = ""; else prev3 = $3; print}' inputfile

Here is a shorter version that works for any number of fields (the last field is always printed):

awk -F '\t' 'BEGIN {OFS=FS} {for (i=1; i<=NF-1;i++) if ($i == prev[i]) $i = ""; else prev[i] = $i; print}' inputfile

The output won't be aligned for on-screen use, but there will be the correct number of tabs.

The output will look like this:

field1 TAB field2 TAB field3 TAB field4
TAB TAB TAB field4
TAB TAB field3 TAB field4
TAB field2 TAB field3 TAB field4
etc.
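
One caveat with both commands above: each field is compared only with the previous value in the same column, so a value can still be blanked after a field to its left has changed (for example, if the last Paris row and the first Bordeaux row named the same museum). A minimal sketch of a stricter variant follows. It assumes tab-separated input with the same number of fields on every line, and the `dedup_hier` wrapper name is hypothetical, not part of the original answer.

```shell
# dedup_hier: hypothetical helper name, used here for illustration only.
# Blanks a field only when it AND every field to its left repeat the
# previous line. Assumes every line has the same number of fields.
dedup_hier() {
    awk -F '\t' 'BEGIN { OFS = FS }
    {
        same = 1                      # 1 while all fields to the left matched
        for (i = 1; i < NF; i++) {    # the last field is always printed
            cur = $i
            if (same && NR > 1 && cur == prev[i])
                $i = ""
            else
                same = 0              # a change here keeps every field to the right
            prev[i] = cur             # remember the original, unblanked value
        }
        print
    }' "$@"
}
```

Call it as `dedup_hier inputfile`, or pipe data into it. Because `prev` stores the original value rather than the blanked one, a run of three or more identical rows is handled the same way as a run of two.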

If you need columns aligned, that is also possible.
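
As a rough sketch of how that alignment could be done, the tab-separated output can be padded to fixed column widths by reading the file twice: once to find the widest value in each column, and once to print. The `align_tsv` name and the two-space gap between columns are assumptions made for illustration.

```shell
# align_tsv: hypothetical helper name. Takes ONE file argument, which awk
# reads twice, so it cannot be fed from a pipe.
align_tsv() {
    awk -F '\t' '
    NR == FNR {                       # first pass: widest value per column
        for (i = 1; i <= NF; i++)
            if (length($i) > w[i]) w[i] = length($i)
        next
    }
    {                                 # second pass: pad all fields but the last
        line = ""
        for (i = 1; i < NF; i++)
            line = line sprintf("%-" (w[i] + 2) "s", $i)
        print line $NF
    }' "$1" "$1"
}
```

To use it, save the deduplicated output to a file first (for example a hypothetical `dedup.tsv`) and then run `align_tsv dedup.tsv`. Blank fields are padded with spaces, so the remaining values stay in their columns.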

Edit:

This version allows you to specify the fields to deduplicate:

#!/usr/bin/awk -f
BEGIN {
    FS="\t"; OFS=FS
    deduplist=ARGV[1]
    ARGV[1]=""
    split(deduplist,tmp," ")
    for (i in tmp) dedup[tmp[i]]=1
}
{
    for (i=1; i<=NF;i++)
        if (i in dedup) {
            if ($i == prev[i])
                $i = ""
            else
                prev[i] = $i
        }
    # prevent printing lines that are completely blank because 
    # it's an exact duplicate of the preceding line and all fields 
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/) 
        print
}

Run it like this: ./script.awk "2 3" inputfile to deduplicate fields 2 and 3.


Try this Perl one-liner:

perl  -F"\t" -nae '@O=@F;if(!$x){$x=1}else{for($i=0;$i<=$#S;$i++){$F[$i]=""if($S[$i] eq "" || $S[$i] eq $F[$i])}};print join "\t",@F;@S=@O;'

I've assumed the fields are tab separated.
