Shell script: removing duplicate records

I am having trouble removing duplicate entries (I am not good at shell!). Here is the situation: an application creates a flat text file. Each line is one record, and each field is separated by the delimiter "~|" (quotes excluded). So a record looks like:

Field1~|Field2~|Field3~|Field4~|Field5~|Field6~|Field7~|

Some records are duplicates. A record counts as a duplicate when its Field2 value matches that of another record. How can I write a shell/awk/sed script to remove duplicate records based on this criterion? The script then has to write its output to another file. I could have done this in the application itself, but that is not possible because of performance problems. Thanks for your help.

Input file

Field1~|ABA~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|PQR~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|XYZ~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|ABA~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|RST~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|PQR~|Field3~|Field4~|Field5~|Field6~|Field7~|

Output should be:

Field1~|ABA~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|PQR~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|XYZ~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|RST~|Field3~|Field4~|Field5~|Field6~|Field7~|

(The order of the records doesn't matter.)


Not sure if I understood the question correctly, but is this what you're looking for?

test.txt:

Field1~|Field2~|Field3~|Field4~|Field5~|Field6~|Field7~|
foo~|Field2~|bar~|Field4~|Field5~|Field6~|Field7~|
Field1~|foobar~|Field3~|Field4~|Field5~|Field6~|Field7~|

Calling sort:

sort --field-separator="~" --key 2,2 --unique test.txt

Results in:

Field1~|Field2~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|foobar~|Field3~|Field4~|Field5~|Field6~|Field7~|


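If you want to remove every record whose Field2 value occurs more than once (keeping only the records that were never duplicated):

nawk -F'~[|]' '{a[$2]++;b[$2]=$0}END{for(i in a) if (a[i]==1){print b[i]} }' file

If you want to keep one record (the first) for each Field2 value:

nawk -F'~[|]' '!a[$2]++' file

(The separator is written as '~[|]' because awk treats a multi-character -F value as a regular expression, in which a bare | means alternation.)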
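
Since the question also asks for the result to be written to another file, here is a minimal sketch of the keep-the-first-record variant with output redirection. The file names (records.txt, records.dedup.txt) and the temporary directory are assumptions for illustration only, and plain awk is used in place of nawk on the assumption that either is available.

```shell
#!/bin/sh
# Hypothetical paths; substitute the real input and output files.
workdir=$(mktemp -d)
input="$workdir/records.txt"
output="$workdir/records.dedup.txt"

# Sample data taken from the question.
cat > "$input" <<'EOF'
Field1~|ABA~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|PQR~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|XYZ~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|ABA~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|RST~|Field3~|Field4~|Field5~|Field6~|Field7~|
Field1~|PQR~|Field3~|Field4~|Field5~|Field6~|Field7~|
EOF

# -F'~[|]' splits on the literal two-character delimiter "~|";
# the brackets stop awk from reading | as regex alternation.
# seen[$2]++ is 0 the first time a Field2 value appears, so
# !seen[$2]++ is true only for the first record with that key.
awk -F'~[|]' '!seen[$2]++' "$input" > "$output"

cat "$output"
```

Because records are printed as they are read, the output keeps the input order, and the first occurrence of each Field2 value is the one that survives.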