awk how to remove duplicates in fields only if previous fields are the same

2023-02-06 23:30 问答作者：

I am trying to remove duplicates from fields (and replace them with blanks) only if the previous fields are the same. For example:

Sample input:

France  Paris      Museum of Fine Arts          blabala
France  Paris      Museum of Fine Arts          blaj开发者_开发问答lk
France  Paris      Yet another museum           lqmsjdf
France  Paris      Museum of National History            mlqskjf
France  Bordeaux   Museum of Fine Arts          qsfsqf
France  Bordeaux   City Hall                lmqjflqsk
France  Bordeaux   City Hall                    lqkjfqlskjflqskfj
Spain   Madrid     Museum of Fine Arts          lqksjfh
Spain   Madrid     Museum of Fine Arts          qlmfjlqsjf
Spain   Barcelona  City Hall                nvqjvvnqk
Spain   Barcelona  Museum of Fine Arts          lmkqjflqksfj

Desired output:

France    Paris        Museum of FineArts                    blabala
                                                             blajlk
                       Yet another museum                    lqmsjdf
                       Museum of National History            mlqskjf
          Bordeaux     Museum of Fine Arts                   qsfsqf
                       City Hall                             lmqjflqsk
                                                             lqkjfqlskjflqskfj
Spain     Madrid       Museum of Fine Arts                   lqksjfh
                                                             qlmfjlqsjf
          Barcelona   City Hall                              nvqjvvnqk
                      Museum of Fine Arts                    lmkqjflqksfj

Thank you much in advance for any kind of help.

Give this a try:

awk -F '\t' 'BEGIN {OFS=FS} {if ($1 == prev1) $1 = ""; else prev1 = $1; if ($2 == prev2) $2 = ""; else prev2 = $2; if ($3 == prev3) $3 = ""; else prev3 = $3; print}' inputfile

Here is a shorter version that works for any number of fields (the last field is always printed):

awk -F '\t' 'BEGIN {OFS=FS} {for (i=1; i<=NF-1;i++) if ($i == prev[i]) $i = ""; else prev[i] = $i; print}' inputfile

The output won't be aligned for on-screen use, but there will be the correct number of tabs.

The output will look like this:

field1 TAB field2 TAB field3 TAB field4
TAB TAB TAB field4
TAB TAB field3 TAB field4
TAB field2 TAB field3 TAB field4
etc.

If you need columns aligned, that is also possible.

Edit:

This version allows you to specify the fields to deduplicate:

#!/usr/bin/awk -f
BEGIN {
    FS="\t"; OFS=FS
    deduplist=ARGV[1]
    ARGV[1]=""
    split(deduplist,tmp," ")
    for (i in tmp) dedup[tmp[i]]=1
}
{
    for (i=1; i<=NF;i++)
        if (i in dedup) {
            if ($i == prev[i])
                $i = ""
            else
                prev[i] = $i
        }
    # prevent printing lines that are completely blank because 
    # it's an exact duplicate of the preceding line and all fields 
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/) 
        print
}

Run it like this: ./script.awk "2 3" inputfile to deduplicate fields 2 and three.

Try this Perl one-liner:

perl  -F"\t" -nae '@O=@F;if(!$x){$x=1}else{for($i=0;$i<=$#S;$i++){$F[$i]=""if($S[$i] eq "" || $S[$i] eq $F[$i])}};print join "\t",@F;@S=@O;'

See it

I've assumed the fields are tab separated.

继续阅读：duplicates

awk how to remove duplicates in fields only if previous fields are the same

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？