Using grep to filter out words from a stopwords file
I want to use grep together with a stopwords file to filter out common English words from another file. The file "somefile" contains one word per line.
cat somefile | grep -v -f stopwords
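The problem with this approach is that grep treats each stopword as a pattern that may match anywhere in a line, so every word in somefile that merely contains a stopword is removed as well. I want the opposite: remove a word from somefile only if that exact word occurs in stopwords.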
How can I do this?
Example
somefile contains the following:
hello
o
orange
stopwords contains the following:
o
I want to filter out only the word "o" from somefile, not hello and orange.
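To make the problem concrete, here is a small sketch that recreates the two example files and runs the original command. The scratch directory and the variable names are only illustrative.

```shell
# Recreate the example files in a scratch directory (paths are illustrative).
workdir=$(mktemp -d)
printf '%s\n' hello o orange > "$workdir/somefile"
printf '%s\n' o > "$workdir/stopwords"

# Without -w or -x the pattern "o" may match anywhere in a line, so all three
# lines count as matches and -v removes every one of them.
# "|| true" only hides grep's non-zero exit status when nothing is printed.
substring_result=$(grep -v -f "$workdir/stopwords" "$workdir/somefile" || true)
printf 'remaining lines: [%s]\n' "$substring_result"
```

With these inputs nothing survives the filter, because "hello" and "orange" both contain the letter "o".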
I thought about it some more, and found a solution...
Use grep's -w switch to match whole words:
grep -v -w -f stopwords somefile
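One caveat, sketched below: -w only requires the match to be surrounded by non-word characters, so a line such as "o-ring" (a made-up extra entry, not part of the original example) is still removed. If somefile really holds one word per line, combining -x (match the whole line) with -F (treat stopwords as literal strings) may be closer to what was asked. Treat this as a possible variant rather than the definitive answer.

```shell
# Example files plus one hypothetical extra line, "o-ring".
workdir=$(mktemp -d)
printf '%s\n' hello o orange o-ring > "$workdir/somefile"
printf '%s\n' o > "$workdir/stopwords"

# -w: "o" only has to be delimited by non-word characters,
# so "o-ring" is dropped along with "o".
whole_word=$(grep -v -w -f "$workdir/stopwords" "$workdir/somefile")

# -x -F: a stopword must equal the entire line, taken literally,
# so only the line "o" is dropped and "o-ring" is kept.
whole_line=$(grep -v -x -F -f "$workdir/stopwords" "$workdir/somefile")

printf -- '-w keeps:\n%s\n' "$whole_word"
printf -- '-x -F keeps:\n%s\n' "$whole_line"
```

-F also stops characters such as "." in a stopword from being read as regular-expression syntax.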
Assuming you have a stopwords file /tmp/words:
in
the
you can generate a sed program from it with:
sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' /tmp/words > /tmp/words.sed
This gives you /tmp/words.sed:
s/\<in\>/[CENSORED]/g;
s/\<the\>/[CENSORED]/g;
Then use it to censor any text file:
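sed -f /tmp/words.sed /input/file/to/filter.txt > /censored/output.txt
No -e or extended-regexp option is needed here: \< and \> are word-boundary escapes that GNU sed accepts in ordinary (basic) regular expressions. Note that putting -e directly before -f would make sed take "-f" as the script text instead of reading the script from the file.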
Of course you can change [CENSORED] to any other string, or to an empty string, if you wish.
This solution handles files with several words per line as well as files with one word per line.
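The whole pipeline can be tried end to end with the sketch below. It assumes GNU sed (for \< and \>), and the scratch directory, the input sentences, and the variable names are invented for the demonstration.

```shell
# Scratch copies of the stopwords file and a made-up input text.
workdir=$(mktemp -d)
printf '%s\n' in the > "$workdir/words"
printf '%s\n' 'the cat sat in the tin' 'nothing to hide' > "$workdir/input.txt"

# Turn each stopword into a substitution command: s/\<word\>/[CENSORED]/g;
sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' "$workdir/words" > "$workdir/words.sed"

# Apply the generated program; only whole-word occurrences are replaced,
# so "tin" and "nothing" are left alone.
censored=$(sed -f "$workdir/words.sed" "$workdir/input.txt")
printf '%s\n' "$censored"
```

Stopwords that contain "/" or regular-expression metacharacters would need escaping before being turned into sed commands; this sketch does not handle that.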