removing redundant data
I have a file which looks like this (3 columns and n number of rows)
chr8 101999980 102031975
chr8 101999980 102033533
chr8 101999980 102033533
chr8 101999980 102032736
chr8 101999980 102034799
chr8 101999980 102034799
chr8 101999980 102034397
chr8 101999980 102032736
and from this data I want to remove the redundant lines using a bash script; these exact repeated lines could be present anywhere in the dataset.
If maintaining the order is important:
awk '!c[$0]++' filename
This can be read as follows:
- uses each line as an array key (c[$0]),
- post-increments (++) the value to keep a count of such lines, and
- performs the default action only if the line has never been seen before (!).
n++ returns 0, or false, if n is unset, and the default action is {print}, so each distinct line is printed only the first time it appears.
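For illustration, assuming the sample data above is saved in a file called yourfile (a hypothetical name), the one-liner keeps the first occurrence of each line and preserves the original order:
$ awk '!c[$0]++' yourfile
chr8 101999980 102031975
chr8 101999980 102033533
chr8 101999980 102032736
chr8 101999980 102034799
chr8 101999980 102034397
Redirect to a different file (awk '!c[$0]++' yourfile > deduped) to save the result; do not redirect back onto yourfile itself, or the shell will truncate it before awk reads it.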
You can pipe your file through sort and uniq:
$ sort yourFile | uniq > newFile
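Assuming yourFile holds the sample data above, the sorted, deduplicated result looks like this; note that sort reorders the lines, so the output is in sorted order rather than input order:
$ sort yourFile | uniq
chr8 101999980 102031975
chr8 101999980 102032736
chr8 101999980 102033533
chr8 101999980 102034397
chr8 101999980 102034799
sort can also deduplicate on its own with sort -u yourFile > newFile, which gives the same result without the extra uniq process.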
If order does not matter:
sort yourfile | uniq > outputfile
uniq only removes adjacent identical rows, which is why you need sort first. In your sample file the duplicates are not all adjacent (chr8 101999980 102032736 appears on two non-adjacent lines), so sorting is required:
$ uniq yourfile | wc -l
6
$ sort yourfile | uniq | wc -l
5
Without sort, the non-adjacent duplicate survives and 6 lines remain; after sorting, all duplicates are adjacent and uniq leaves the 5 distinct lines. Since the question says repeated lines can appear anywhere in the dataset, sort the file before piping it to uniq.
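If you want to inspect the redundant lines rather than just drop them, uniq -d on sorted input prints one copy of each repeated line, e.g. for the sample data:
$ sort yourfile | uniq -d
chr8 101999980 102032736
chr8 101999980 102033533
chr8 101999980 102034799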