Using grep to filter out words from a stopwords file

2023-04-02 22:02 Asked by:

I want to use grep together with a stopwords file to filter out common English words from another file. The file "somefile" contains one word per line.

cat somefile | grep -v -f stopwords

The problem with this approach is that it checks whether a word in stopwords occurs anywhere within a line of somefile, but I want the opposite: to check whether each word in somefile occurs in stopwords.

How can I do this?

Example

somefile contains the following:

hello
o
orange

stopwords contains the following:

o

I want to filter out only the word "o" from somefile, not hello and orange.

I thought about it some more, and found a solution...

Use grep's -w switch to match whole words:

grep -v -w -f stopwords somefile
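
To make the difference concrete, here is a minimal sketch that recreates the sample files from the question in a scratch directory and compares the original command with the -w variant. The scratch directory and variable names are assumptions made for illustration. The last command, using -x and -F, is not from the answer above; it is a stricter alternative that treats each stopword as a literal string and requires it to match the entire line.

```shell
#!/bin/sh
# Scratch directory and file names are hypothetical, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' hello o orange > "$workdir/somefile"
printf '%s\n' o > "$workdir/stopwords"

# Substring matching: every line that contains "o" is removed.
# grep exits non-zero when it prints nothing, hence the "|| true".
substring_result=$(grep -v -f "$workdir/stopwords" "$workdir/somefile" || true)

# Whole-word matching: only lines where "o" appears as a separate word are removed.
word_result=$(grep -v -w -f "$workdir/stopwords" "$workdir/somefile" || true)

# Whole-line, fixed-string matching (an alternative, not part of the answer above).
line_result=$(grep -v -x -F -f "$workdir/stopwords" "$workdir/somefile" || true)

printf 'substring match kept: [%s]\n' "$substring_result"
printf 'whole-word match kept:\n%s\n' "$word_result"
printf 'whole-line match kept:\n%s\n' "$line_result"

rm -rf "$workdir"
```

For one-word-per-line input the -w and -x variants behave the same. They differ on lines such as "o-ring": -w would remove it, because "o" stands there as a whole word, while -x -F removes a line only when it is exactly a stopword.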

Assuming you have a stopwords file /tmp/words:

in
the

you can generate a sed script from it with:

sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' /tmp/words > /tmp/words.sed

This gives you /tmp/words.sed:

s/\<in\>/[CENSORED]/g;
s/\<the\>/[CENSORED]/g;

You can then use it to censor any text file:

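sed -f /tmp/words.sed /input/file/to/filter.txt > /censored/output.txt

Only -f is needed here: -e expects an inline script as its argument, so it must not be placed in front of -f. The \< and \> word-boundary anchors are GNU sed extensions and work without enabling extended regular expressions. Of course you can change [CENSORED] to any other string, or to an empty string, if you wish.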

This solution handles files with many words per line as well as files with one word per line.
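
The whole sed approach can be sketched end to end as follows. This assumes GNU sed, since \< and \> are GNU extensions; the scratch directory, the sample sentence, and the variable names are made up for illustration, and sed is invoked with -f alone.

```shell
#!/bin/sh
# Scratch directory and sample text are hypothetical, for illustration only.
workdir=$(mktemp -d)
printf '%s\n' in the > "$workdir/words"
printf '%s\n' 'the cat sat in the inn' 'other' > "$workdir/input.txt"

# Turn each stopword into a substitution command such as s/\<in\>/[CENSORED]/g;
sed 's|^|s/\\<|; s|$|\\>/[CENSORED]/g;|' "$workdir/words" > "$workdir/words.sed"

# Apply the generated script; -f reads the sed commands from the file.
censored=$(sed -f "$workdir/words.sed" "$workdir/input.txt")
printf '%s\n' "$censored"

rm -rf "$workdir"
```

Words that merely contain a stopword, such as "inn" or "other", are left alone because the word-boundary anchors require a whole-word match.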
