awk: how to remove duplicates in fields only if the previous fields are the same
I am trying to remove duplicates from fields (and replace them with blanks) only if the previous fields are the same. For example:
Sample input:
France Paris Museum of Fine Arts blabala
France Paris Museum of Fine Arts blajlk
France Paris Yet another museum lqmsjdf
France Paris Museum of National History mlqskjf
France Bordeaux Museum of Fine Arts qsfsqf
France Bordeaux City Hall lmqjflqsk
France Bordeaux City Hall lqkjfqlskjflqskfj
Spain Madrid Museum of Fine Arts lqksjfh
Spain Madrid Museum of Fine Arts qlmfjlqsjf
Spain Barcelona City Hall nvqjvvnqk
Spain Barcelona Museum of Fine Arts lmkqjflqksfj
Desired output:
France Paris Museum of Fine Arts blabala
blajlk
Yet another museum lqmsjdf
Museum of National History mlqskjf
Bordeaux Museum of Fine Arts qsfsqf
City Hall lmqjflqsk
lqkjfqlskjflqskfj
Spain Madrid Museum of Fine Arts lqksjfh
qlmfjlqsjf
Barcelona City Hall nvqjvvnqk
Museum of Fine Arts lmkqjflqksfj
Thank you very much in advance for any kind of help.
Give this a try:
awk -F '\t' 'BEGIN {OFS=FS} {if ($1 == prev1) $1 = ""; else prev1 = $1; if ($2 == prev2) $2 = ""; else prev2 = $2; if ($3 == prev3) $3 = ""; else prev3 = $3; print}' inputfile
Here is a shorter version that works for any number of fields (the last field is always printed):
awk -F '\t' 'BEGIN {OFS=FS} {for (i=1; i<=NF-1;i++) if ($i == prev[i]) $i = ""; else prev[i] = $i; print}' inputfile
The output won't be aligned for on-screen use, but there will be the correct number of tabs.
The output will look like this:
field1 TAB field2 TAB field3 TAB field4
TAB TAB TAB field4
TAB TAB field3 TAB field4
TAB field2 TAB field3 TAB field4
etc.
If you need columns aligned, that is also possible.
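As one possible way to do that, here is a minimal sketch (not part of the original answer) that buffers the tab-separated output and pads every column except the last to the width of its widest entry. The function name align_tsv is made up for illustration. It assumes the input fits in memory and that length() matches the display width, which holds for plain ASCII text.

```shell
# Hypothetical helper: pad tab-separated columns so they line up on screen.
align_tsv() {
    awk -F '\t' '
    {
        line[NR] = $0                      # buffer every line for the second pass
        for (i = 1; i <= NF; i++)
            if (length($i) > width[i]) width[i] = length($i)
    }
    END {
        for (r = 1; r <= NR; r++) {
            n = split(line[r], f, FS)
            out = ""
            # left-justify each column but the last, then add two spaces as a gutter
            for (i = 1; i < n; i++)
                out = out sprintf("%-" width[i] "s  ", f[i])
            print out f[n]
        }
    }' "$@"
}

# Demo with made-up data: the second line has an empty first field.
printf 'a\tbb\tc\n\tdddd\te\n' | align_tsv
```

You would pipe the output of either command above into it, for example: awk ... inputfile | align_tsv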
Edit:
This version allows you to specify the fields to deduplicate:
#!/usr/bin/awk -f
BEGIN {
    FS="\t"; OFS=FS
    deduplist=ARGV[1]
    ARGV[1]=""
    split(deduplist,tmp," ")
    for (i in tmp) dedup[tmp[i]]=1
}
{
    for (i=1; i<=NF;i++)
        if (i in dedup) {
            if ($i == prev[i])
                $i = ""
            else
                prev[i] = $i
        }
    # prevent printing lines that are completely blank because
    # it's an exact duplicate of the preceding line and all fields
    # are being deduplicated
    if ($0 !~ /^[[:blank:]]*$/)
        print
}
Run it like this: ./script.awk "2 3" inputfile
to deduplicate fields 2 and 3.
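One caveat: the versions above compare each field only with the same field on the previous line. A repeated field 3 is therefore blanked even when field 2 has just changed (the sample data happens not to hit this case). If you want the stricter reading of the question, where a field is blanked only when every field to its left also matches, a sketch along these lines might work. The name dedup_hier is hypothetical, and like the commands above it assumes tab-separated input and always prints the last field.

```shell
# Hypothetical helper: blank a field only if it and all fields to its left
# are identical to the previous line. The last field is always kept.
dedup_hier() {
    awk -F '\t' 'BEGIN { OFS = FS }
    {
        same = 1                           # 1 while every field so far matched
        for (i = 1; i < NF; i++) {
            cur = $i
            if (same && cur == prev[i])
                $i = ""
            else
                same = 0                   # first mismatch: keep this and later fields
            prev[i] = cur                  # always remember the original value
        }
        print
    }' "$@"
}

# Demo with made-up data: "City Hall" repeats, but the city changed,
# so field 3 is kept on the second line and only "France" is blanked.
printf 'France\tParis\tCity Hall\ta\nFrance\tBordeaux\tCity Hall\tb\n' | dedup_hier
```

Call it as dedup_hier inputfile, or pipe data into it as in the demo line.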
Try this Perl one-liner:
perl -F"\t" -nae '@O=@F;if(!$x){$x=1}else{for($i=0;$i<=$#S;$i++){$F[$i]=""if($S[$i] eq "" || $S[$i] eq $F[$i])}};print join "\t",@F;@S=@O;'
I've assumed the fields are tab separated.