Uniq in awk; removing duplicate values in a column using awk
I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78,  WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458,  atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in so开发者_StackOverflowmething like this:
ENST00000371026 WDR78   WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32   WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458   atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk ' 
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there. 
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
  split($2, valueArray,",");
  j=0;
  for (i in valueArray)
  { 
    if (!(valueArray[i] in duplicateArray))
    { 
      duplicateArray[valueArray[i]] = 1
    }
  };
  printf $1 "\t";
  for (j in duplicateArray)
  {
    if (j)    # prevents printing an extra comma
    {
      printf j ",";
    }
  }
  printf "\t";
  print $3
  delete duplicateArray    # for non-gawk, use split("", duplicateArray)
}'
Perl:
perl -F'\t' -lane'
  $F[1] = join ",", grep !$_{$_}++, split ",", $F[1]; 
  print join "\t", @F; %_ = ();
  ' infile  
awk:
awk -F'\t' '{
  n = split($2, t, ","); _2 = x
  split(x, _) # use delete _ if supported
  for (i = 0; ++i <= n;)
    _[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
  $2 = _2 
  }-3' OFS='\t' infile
The line 4 in the awk script is used to preserve the original order of the values in the second field after filtering the unique values.
Sorry, I know you asked about awk... but Perl makes this much more simple:
$ perl -n -e ' @t = split(/\t/);
  %t2 = map { $_ => 1 } split(/,/,$t[1]);
  $t[1] = join(",",keys %t2);
  print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part                            # parts of a line
declare -a part2                           # parts 2. column
declare -A check                           # used to remember items in part2
while read  line ; do
  part=( $line )                           # split line using whitespaces
  IFS=','                                  # separator is comma
  part2=( ${part[1]} )                     # split 2. column using comma
  if [ ${#part2[@]} -gt 1 ] ; then         # more than 1 field in 2. column?
    check=()                               # empty check array
    new2=''                                # empty new 2. column
    for item in ${part2[@]} ; do 
      (( check[$item]++ ))                 # remember items in 2. column
      if [ ${check[$item]} -eq 1 ] ; then  # not yet seen?
        new2=$new2,$item                   # add to new 2. column
      fi 
    done
    part[1]=${new2#,}                      # remove leading comma
  fi 
  IFS=$'\t'                                # separator for the output
  echo "${part[*]}"                        # rebuild line
done < "$infile"
 
         加载中,请稍侯......
 加载中,请稍侯......
      
精彩评论