
removing redundant data

I have a file which looks like this (3 columns and n rows):

chr8    101999980   102031975
chr8    101999980   102033533 
chr8    101999980   102033533 
chr8    101999980   102032736 
chr8    101999980   102034799 
chr8    101999980   102034799 
chr8    101999980   102034397
chr8    101999980   102032736

and from this data I want to remove the redundant lines. These exact repeated lines could be present anywhere in the dataset, and I want to do it with a bash script.


If maintaining the order is important:

awk '!c[$0]++' filename

This can be read as follows:

  • uses each line as an array key (c[$0]),
  • post-increments (++) the value to keep a count of how many times that line has been seen, and
  • performs the default action only if the line has never been seen before (!)
    • c[$0]++ returns 0, i.e. false, when that key is not yet in the array
    • the default action is {print}
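
For example, assuming the sample above is saved in a file called regions.txt (the name is just for illustration), the one-liner could be used like this:

$ awk '!c[$0]++' regions.txt > regions.dedup.txt

The first occurrence of each line is kept in its original position, and any later exact repeats are dropped, no matter where they appear in the file.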


You can pipe your file through sort and uniq:

$ sort yourFile | uniq > newFile
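
If your sort supports the -u flag (GNU and BSD sort both do), the same result can be had in a single step; this is just a minor variation on the command above:

$ sort -u yourFile > newFile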


If order does not matter:

sort yourfile | uniq > outputfile

uniq only collapses adjacent identical rows, which is why you usually need sort first. In your sample file the duplicates happen to sit right next to each other, so sort is not strictly required; if that is not guaranteed in general, sort the file first.

$ uniq yourfile | wc -l
6
$ sort yourfile | uniq | wc -l
6

Both with and without sort, the result here is 6 lines, but you did not say whether adjacent duplicates are the norm in your data.
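
To make the adjacency point concrete, here is a small self-contained sketch (file name assumed) where the duplicate lines are not next to each other:

$ printf 'a\nb\na\n' > demo.txt
$ uniq demo.txt | wc -l
3
$ sort demo.txt | uniq | wc -l
2

Without sort, uniq leaves the non-adjacent duplicate in place; with sort, it is removed.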

