Add Tab Separator to Grep

I am new to grep and awk, and I would like the "frequency.txt" output file to contain tab-separated values. (The script goes through a large corpus and outputs each individual word with the number of times it is used in the corpus; I modified it for the Khmer language.) I've looked around (grep a tab in UNIX), but I can't find an example that makes sense to me for this bash script (I'm too much of a newbie).

I am using this bash script in cygwin:

#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's/​/ /g' -e 's/\t/ /g' \
    -e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
    -e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
    -e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
    -e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
  tr [:upper:] [:lower:] | \
  sort | \
  uniq -c | \
  sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'

Awk is printing with a comma, but that is only on-screen. How can I place a tab (a comma would work as well), between the frequency and the term?

Here's a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):

ព្រះ​វិញ្ញាណ​នឹង​ប្រពន្ធ​ថ្មោង​ថ្មី​ពោល​ថា អញ្ជើញ​មក ហើយ​អ្នក​ណា​ដែល​ឮ​ក៏​ថា អញ្ជើញ​មក​ដែរ អ្នក​ណា​ដែល​ស្រេក នោះ​មាន​តែ​មក ហើយ​អ្នក​ណា​ដែល​ចង់​បាន មាន​តែ​យក​ទឹក​ជីវិត​នោះ​ចុះ ឥត​ចេញ​ថ្លៃ​ទេ។

Here is an example output of frequency.txt as it is now (frequency and then term):

25605 នឹង
25043 ជា
22004 បាន
20515 នោះ

I want the output frequency.txt to look like this (where TAB is an actual tab character):

25605TABនឹង
25043TABជា
22004TABបាន
20515TABនោះ

Thanks for your help!


You should be able to replace the whole lengthy sed command with this:

tr -d '[a-zA-Z][0-9]«»:;.,()-?។”“|០១២៣៤៥៦៧៨៩'
tr '\t' ' '

Comments:

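  • 's/​/ /g' - if the pattern between the first two slashes is really empty, sed re-uses the previous regular expression, which was [a-zA-Z], and replaces each match with a space; those characters were already deleted, so this is a no-op. If the pattern instead contains an invisible character (such as the word separator used in the corpus), the command converts that character to a space and should be kept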
  • 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' - the pipe characters don't delimit alternatives inside square brackets; they are literal (and more than one is redundant). A hyphen between two characters forms a range, so put it last if you want to delete literal hyphens. A cleaner version would be 's/[«»:;.,()?។”“|-]//g' (leaving one pipe in case you really want to delete them)
  • 's/ /\n/g' - earlier, you replaced tabs with spaces, now you're replacing the spaces with newlines
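
The bracket-expression point is easy to check on a throwaway string. This is only a sketch; the sample text and the variable names are made up for illustration:

```shell
# Sample input is invented for illustration only.
sample='a|b(c)-d,e'

# Inside [...] the pipe is an ordinary character, so it is deleted like the rest.
# The hyphen comes last so that it is literal rather than a range operator.
cleaned=$(printf '%s\n' "$sample" | sed 's/[|(),-]//g')
printf '%s\n' "$cleaned"
```

Only the letters should be left in the output.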

You should be able to have the tabs you want by inserting this in your pipeline right after the uniq:

sed 's/^ *\([0-9]\+\) /\1\t/'
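
As a rough sketch of where that sed command sits, here is the counting part of the pipeline run on a few made-up words instead of the real corpus. It assumes GNU sed, since \+ and \t in this position are GNU extensions:

```shell
# Made-up sample words, one per line, standing in for the cleaned corpus.
words=$(printf '%s\n' ជា នឹង ជា បាន នឹង ជា)

# uniq -c prints "   3 word"; the sed step strips the padding and turns
# the single space after the count into a tab, then sort -rn orders by count.
result=$(printf '%s\n' "$words" | sort | uniq -c |
  sed 's/^ *\([0-9]\+\) /\1\t/' | sort -rn)
printf '%s\n' "$result"
```

Each output line should be the count, a real tab character, and then the word, with the most frequent word first.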

If you want the AWK command to output a tab:

awk 'BEGIN{OFS="\t"} {print $2, $1}'
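
A small sketch of the OFS approach on two sample records (the records are copied from the example output in the question, the variable name is made up). Note that the tab is written in double quotes inside the single-quoted awk program:

```shell
# Two "count word" records in the current frequency.txt layout.
swapped=$(printf '%s\n' '25605 នឹង' '25043 ជា' |
  awk 'BEGIN { OFS = "\t" } { print $2, $1 }')
printf '%s\n' "$swapped"
```

Because print is given two comma-separated arguments, awk joins them with OFS, so each line comes out as the word, a tab, and the count.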


What about redirecting the awk output to a file with ">"?
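
Redirecting awk's standard output with ">" writes it to a file instead of the screen. A minimal sketch, assuming hypothetical file names in a temporary directory:

```shell
# Hypothetical file names in a temporary directory, for illustration only.
tmpdir=$(mktemp -d)
printf '%s\n' '25605 នឹង' '25043 ជា' > "$tmpdir/frequency.txt"

# ">" sends awk's output to frequency.tsv rather than the terminal.
awk '{ print $1 "\t" $2 }' "$tmpdir/frequency.txt" > "$tmpdir/frequency.tsv"
cat "$tmpdir/frequency.tsv"
```

The input and output files are deliberately different; redirecting to the same file that awk is reading would truncate it before awk reads anything.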


The following script should get you where you need to go. The pipe to tee will let you see output on the screen while at the same time writing the output to ./outfile

#!/bin/sh  

sed ':a;N;s/[a-zA-Z0-9។០១២៣៤៥៦៧៨៩\n«»:;.,()?”“-]//g;ta' < dictionary.txt | \
gawk '{$0=toupper($0);for(i=1;i<=NF;i++)a[$i]++}
   END{for(item in a)printf "%s\t%d\n", item, a[item]}' | \
tee ./outfile