
Using AWK, how can I remove these kinds of duplicates?

I am new to AWK and have only a basic understanding of it. I want to remove duplicates from a file, for example:

    0008.ASIA. NS AS2.DNS.ASIA.CN.
    0008.ASIA. NS AS2.DNS.ASIA.CN.
    ns1.0008.asia. NS AS2.DNS.ASIA.CN.
    www.0008.asia. NS AS2.DNS.ASIA.CN.
    anish.asia NS AS2.DNS.ASIA.CN.
    ns2.anish.asia NS AS2.DNS.ASIA.CN
    ANISH.asia. NS AS2.DNS.ASIA.CN.

This is a sample file. Running this command on it:

    awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[$1]++;}END{for (x in b)print x}'

gives me this output:

    0008.ASIA.
    anish.asia.
    ANISH.asia

But I want the output like this:

    0008.ASIA
    anish.asia

or

    0008.ASIA
    ANISH.asia

How do I remove these kinds of duplicates?

Thanks in advance, Anish kumar.V

Thanks for your immediate response. Actually I wrote a complete script in bash, and now I am in the final stage. How do I invoke Python in that? :-(

#!/bin/bash

current_date=`date +%d-%m-%Y_%H.%M.%S`
today=`date +%d%m%Y`
yesterday=`date -d 'yesterday' '+%d%m%Y'`
RootPath=/var/domaincount/asia/
MainPath=$RootPath${today}asia
LOG=/var/tmp/log/asia/asiacount$current_date.log

mkdir -p $MainPath
echo Intelliscan Process started for Asia TLD $current_date 

exec 6>&1 >> $LOG

#################################################################################################
## Download the zone file using wget; it will try only one time
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD ftp://ftp.anish.com:21/zonefile/anish.zone.gz
then
    echo Download Not Success Domain count Failed With Error
    exit 1
fi
### The downloaded file is gzipped, so we need to unzip it and start the domain count process ###
gunzip -c asia.zone.gz > $MainPath/$today.asia

###### It will start the Count #####
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
awk '/Total/ {print $2}' $RootPath/zonefile/$today.asia > $RootPath/$today.count

a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.asia $RootPath/zonefile/$yesterday.asia)

echo "$current_date Count For Asia TlD $a"
echo "$current_date Overall Count For Asia TlD $c"
echo "$current_date New Registration Domain Counts $((c - a))"
echo "$current_date Deleted Domain Counts $((c - b))"

exec >&6 6>&-
cat $LOG | mail -s "Asia Tld Count log" 07anis@gmail.com

In that script,

 awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia

is the part where I am now searching for how to get the distinct values, so any suggestion using AWK would be better for me. Thanks again for your immediate response.


kent$  cat a
0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.


kent$  awk -F' NS' '{ gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[tolower($1)]++;}END{for (x in b)print x}' a
anish.asia
0008.asia

btw, it is interesting that I gave you a solution at http://www.unix.com/shell-programming-scripting/167512-using-awk-how-its-possible.html, then you added something new to your file, and then I added the tolower() function here. :D
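
If you want to carry that tolower() idea back into the counting line of your bash script, a minimal sketch could look like the following (it keeps your original paths, strips the trailing dot before building the dedup key, and mirrors the two-label filter with split()):

    awk '{ key = tolower($1); sub(/\.$/, "", key) }
         key ~ /\.asia$/ && split(key, parts, ".") == 2 && !seen[key]++ {
             print key; tot++
         }
         END { print "Total", tot, "Domains" }' $MainPath/$today.asia > $RootPath/zonefile/$today.asia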


By putting your AWK script into a separate file, you can tell what's really going on. Here's a simple approach to your "filter out the duplicates" problem:

# For each line in the file
{

  # Decide on a unique key (e.g. case-insensitive, without the trailing period)
  unique_key = tolower($1)
  sub(/\.$/, "", unique_key)

  # If this line isn't a duplicate (it hasn't been found yet)
  if (!(unique_key in already_found)) {

    # Mark this unique key as found
    already_found[unique_key] = "found"

    # Print out the relevant data
    print($1)
  }
}

You can run AWK files by passing the -f option to awk.
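
For example, if the script above were saved as dedupe.awk (a file name chosen here just for illustration), it could be run against the zone file from the question like this:

    awk -f dedupe.awk $MainPath/$today.asia > $RootPath/zonefile/$today.asia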

If the above script isn't recognizable as an AWK script, here it is in inline form:

awk '{ key = tolower($1); sub(/\.$/, "", key); if (!(key in found)) { found[key] = 1; print($1) } }'


Or, just use the shell:

echo '    0008.ASIA. NS AS2.DNS.ASIA.CN.
    0008.ASIA. NS AS2.DNS.ASIA.CN.
    ns1.0008.asia. NS AS2.DNS.ASIA.CN.
    www.0008.asia. NS AS2.DNS.ASIA.CN.
    anish.asia NS AS2.DNS.ASIA.CN.
    ns2.anish.asia NS AS2.DNS.ASIA.CN
    ANISH.asia. NS AS2.DNS.ASIA.CN.' |
while read domain rest; do
    domain=${domain%.}
    case "$domain" in
        (*.*.*) : ;;
        (*.[aA][sS][iI][aA]) echo "$domain" ;;
    esac
done |
sort -fu

produces

0008.ASIA
anish.asia


Don't use AWK. Use Python

import sys

result = set()
for line in sys.stdin:
    words = line.split()
    # skip blank lines and keep only names whose first field mentions "asia"
    if words and "asia" in words[0].lower():
        result.add(words[0].lower())
for name in result:
    print name

That might be easier to work with than AWK. Yes. It's longer. But it may be easier to understand.
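
Since the question also asks how to invoke Python from the existing bash script: assuming the code above is saved as, say, count_domains.py (a hypothetical file name) and reads the zone file from standard input via sys.stdin, the call inside the script could look like:

    python count_domains.py < $MainPath/$today.asia > $RootPath/zonefile/$today.asia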


Here is an alternative solution. Let sort create your case-folded and unique list (and it will be sorted!)

  {
   cat - <<EOS
   0008.ASIA. NS AS2.DNS.ASIA.CN.
   0008.ASIA. NS AS2.DNS.ASIA.CN.
   ns1.0008.asia. NS AS2.DNS.ASIA.CN.
   www.0008.asia. NS AS2.DNS.ASIA.CN.
   anish.asia NS AS2.DNS.ASIA.CN.
   ns2.anish.asia NS AS2.DNS.ASIA.CN
   ANISH.asia. NS AS2.DNS.ASIA.CN.

EOS
 } |   awk '{
      #dbg print "$0=" $0
      targ=$1
      sub(/\.$/, "", targ)
      n=split(targ,tmpArr,".")
      #dbg print "n="n
      if (n > 2) targ=tmpArr[n-1] "." tmpArr[n]
      print targ 
     }' \
 | sort -f -u

output

0008.ASIA
anish.asia

Edit: fixed sort -i -u to sort -f -u. Many other Unix utilities use '-i' to indicate 'ignore case'. My test showed me I needed to fix it, but I forgot to fix it in the final posting.
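
To run the same pipeline over the real zone file instead of the here-document, something along these lines (reusing the paths from the script in the question, and dropping the "Total" line that script printed) should work:

    awk '{ targ = $1
           sub(/\.$/, "", targ)
           n = split(targ, tmpArr, ".")
           if (n > 2) targ = tmpArr[n-1] "." tmpArr[n]
           print targ
         }' $MainPath/$today.asia | sort -f -u > $RootPath/zonefile/$today.asia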
