Using AWK, how do I remove these kinds of duplicates?
I am new to AWK and have only a basic understanding of it. I want to remove duplicates from a file, for example:
0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.
This is a sample file. Running this command over it, I got the following output:
awk 'BEGIN{IGNORECASE=1}/^[^ ]+asia/ { gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[$1]++;}END{for (x in b)print x}'
0008.ASIA.
anish.asia. ANISH.asia
But I want output like this:
0008.ASIA
anish.asia
or
0008.ASIA
ANISH.asia
How do I remove these kinds of duplicates?
Thanks in advance, Anish kumar.V
Thanks for your immediate response. Actually, I wrote a complete script in bash and am now in the final stage. How do I invoke Python in that? :-(
#!/bin/bash
current_date=`date +%d-%m-%Y_%H.%M.%S`
today=`date +%d%m%Y`
yesterday=`date -d 'yesterday' '+%d%m%Y'`
RootPath=/var/domaincount/asia/
MainPath=$RootPath${today}asia
LOG=/var/tmp/log/asia/asiacount$current_date.log
mkdir -p $MainPath
echo Intelliscan Process started for Asia TLD $current_date
exec 6>&1 >> $LOG
#################################################################################################
## Using wget to download the zone file; it will try only one time
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD ftp://ftp.anish.com:21/zonefile/anish.zone.gz
then
echo Download Not Success Domain count Failed With Error
exit 1
fi
### The downloaded file is in gzip format; we need to unzip it and start the domain count process ###
gunzip asia.zone.gz > $MainPath/$today.asia
###### It will start the Count #####
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
awk '/Total/ {print $2}' $RootPath/zonefile/$today.asia > $RootPath/$today.count
a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.asia $RootPath/zonefile/$yesterday.asia)
echo "$current_date Count For Asia TlD $a"
echo "$current_date Overall Count For Asia TlD $c"
echo "$current_date New Registration Domain Counts $((c - a))"
echo "$current_date Deleted Domain Counts $((c - b))"
exec >&6 6>&-
cat $LOG | mail -s "Asia Tld Count log" 07anis@gmail.com
In that script,
awk '/^[^ ]+ASIA/ && !_[$1]++{print $1; tot++}END{print "Total",tot,"Domains"}' $MainPath/$today.asia > $RootPath/zonefile/$today.asia
is the only part where I am still trying to work out how to get the distinct values, so any suggestions using AWK would be better for me. Thanks again for your immediate response.
kent$ cat a
0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.
kent$ awk -F' NS' '{ gsub(/\.$/,"",$1);split($1,a,".")} length(a)==2{b[tolower($1)]++;}END{for (x in b)print x}' a
anish.asia
0008.asia
BTW, it is interesting that I gave you a solution at http://www.unix.com/shell-programming-scripting/167512-using-awk-how-its-possible.html, you then added something new to your file, and so I added the tolower() function here. :D
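If you also want the "Total ... Domains" line that your original one-liner prints, the same tolower() idea might be folded back in roughly like this (an untested sketch; seen is just an arbitrary array name, and the file path is the one from your script):
awk '{ key = tolower($1); sub(/\.$/, "", key); n = split(key, parts, ".") }
     n == 2 && !seen[key]++ { print key; tot++ }
     END { print "Total", tot, "Domains" }' $MainPath/$today.asia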
By putting your AWK script into a separate file, you can tell what's really going on. Here's a simple approach to your "filter out the duplicates" problem:
# For each line in the file
{
    # Decide on a unique key (eg. case insensitive without trailing period)
    unique_key = tolower($1)
    sub(/\.$/, "", unique_key)

    # If this line isn't a duplicate (it hasn't been found yet)
    if (!(unique_key in already_found)) {
        # Mark this unique key as found
        already_found[unique_key] = "found"

        # Print out the relevant data
        print($1)
    }
}
You can run AWK files by passing the -f option to awk.
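For example, if the script above were saved as dedupe.awk (a file name chosen here purely for illustration), it could be run against the zone file from the question like this:
awk -f dedupe.awk $MainPath/$today.asia > $RootPath/zonefile/$today.asia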
If the above script isn't recognizable as an AWK script, here it is in inline form:
awk '{ key = tolower($1); sub(/\.$/, "", key); if (!(key in found)) { found[key] = 1; print($1) } }'
Or, just use the shell:
echo ' 0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.' |
while read domain rest; do
    domain=${domain%.}
    case "$domain" in
        (*.*.*) : ;;
        (*.[aA][sS][iI][aA]) echo "$domain" ;;
    esac
done |
sort -fu
produces
0008.ASIA
anish.asia
Don't use AWK. Use Python
import sys

# Collect the unique, lower-cased first field of every line whose first field contains "asia"
result = set()
for line in sys.stdin:
    words = line.split()
    if words and "asia" in words[0].lower():
        result.add(words[0].lower())

for name in result:
    print name
That might be easier to work with than AWK. Yes. It's longer. But it may be easier to understand.
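As for calling it from the bash script in the question: assuming the code above is saved as, say, count_domains.py (an illustrative name) and reads from standard input as shown, the invocation could look roughly like this:
python count_domains.py < $MainPath/$today.asia > $RootPath/zonefile/$today.asia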
Here is an alternative solution. Let sort create your case-folded and uniquified list (and it will be sorted!):
{
cat - <<EOS
0008.ASIA. NS AS2.DNS.ASIA.CN.
0008.ASIA. NS AS2.DNS.ASIA.CN.
ns1.0008.asia. NS AS2.DNS.ASIA.CN.
www.0008.asia. NS AS2.DNS.ASIA.CN.
anish.asia NS AS2.DNS.ASIA.CN.
ns2.anish.asia NS AS2.DNS.ASIA.CN
ANISH.asia. NS AS2.DNS.ASIA.CN.
EOS
} | awk '{
    #dbg print "$0=" $0
    targ=$1
    sub(/\.$/, "", targ)
    n=split(targ,tmpArr,".")
    #dbg print "n="n
    if (n > 2) targ=tmpArr[n-1] "." tmpArr[n]
    print targ
}' \
| sort -f -u
output
0008.ASIA
anish.asia
Edit: fixed sort -i -u to sort -f -u. Many other Unix utilities use '-i' to indicate 'ignore case'. My test showed me I needed to fix it, and I forgot to fix the final posting.
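For instance, a quick way to see the difference (assuming GNU sort):
printf 'ANISH.asia\nanish.asia\n' | sort -u      # both lines survive
printf 'ANISH.asia\nanish.asia\n' | sort -f -u   # case is folded, one line survives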