How can I remove stop words from a large text file?
I have a billion-word corpus which I have collected into a single scalar, and a .regex file that contains all the stop words I want to eliminate from my text.
I don't know how to use this .regex file, so I have read all the stop words from it into an array.
To remove the stop words I do something like this:
grep { $scalarText =~ s/\b\Q$_\E\b/ /g } @stopList;
This takes a long time to execute. How can I use the .regex file in my Perl script to remove the stop words? Or is there any faster way to remove the stop words?
Yes, I imagine what you're doing there is extremely slow, and for a couple of reasons: each s/// in that grep rescans the entire billion-word string once per stopword, and it all happens only after the whole corpus has been loaded. I think you need to process your stopword regex before you build up your string of a billion words from your corpus.
I have no idea what a .regex file is, but I'm going to presume it contains a legal Perl regular expression, something that you can compile using no more than:
$stopword_string = `cat foo.regex`;
$stopword_rx = qr/$stopword_string/;
That probably presumes that there's a (?x) at the start.
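For instance (purely hypothetical contents, just to illustrate that assumption), such a file might hold a single free-spacing pattern along these lines:

(?xi)                                    # free-spacing, case-insensitive
\b (?: the | a | an | and | or | of ) \b

With (?x) at the front, the embedded whitespace and comments are ignored when the file's contents are interpolated into qr//.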
But if your stopword file is a list of lines, you will need to do something more like this:
chomp(@stopwords = `cat foo.regex`);
# if each stopword is an independent regex:
$stopword_string = join "|" => @stopwords;
# else if each stopword is a literal
$stopword_string = join "|" => map {quotemeta} @stopwords;
# now compile it (maybe add some qr//OPTS)
$stopword_rx = qr/\b(?:$stopword_string)\b/;
WARNING: Be very careful with \b: it's only going to do what you think it does above if the first character of the first word and the last character of the last word are alphanumunders (\w characters). Otherwise, it will be asserting something you probably don't mean. If that could be a possibility, you will need to be more specific: the leading \b would need to become (?:(?<=\A)|(?<=\s)), and the trailing \b would need to become (?=\s|\z). That's what most people think \b means, but it really doesn't.
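If your stopwords really can begin or end with non-\w characters, here is a minimal sketch of the compiled pattern using the stricter edge assertions described above:

$stopword_rx = qr/
    (?: (?<=\A) | (?<=\s) )       # left edge: start of string, or preceded by whitespace
    (?: $stopword_string )        # the stopword alternation built earlier
    (?= \s | \z )                 # right edge: followed by whitespace or end of string
/x;

Because the earlier construction ran each stopword through quotemeta, any spaces or # characters inside the stopwords arrive escaped, so the /x modifier doesn't disturb them.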
Having done that, you should apply the stopword regex to the corpus as you're reading it in. The best way to do this is not to put the stuff into your string in the first place that you'll just need to take out later.
So instead of doing
$corpus_text = `cat some-giant-file`;
$corpus_text =~ s/$stopword_rx//g;
do this:
my $corpus_path = "/some/path/goes/here";
open(my $corpus_fh, "< :encoding(UTF-8)", $corpus_path)
    || die "$0: couldn't open $corpus_path: $!";
my $corpus_text = q##;
while (<$corpus_fh>) {
    chomp;                       # or not
    s/$stopword_rx/ /g;          # strip the stopwords from each line as it is read
    $corpus_text .= $_;
}
close($corpus_fh)
    || die "$0: couldn't close $corpus_path: $!";
That will be much faster than putting stuff in there that you just have to weed out again later.
My use of cat above is just a shortcut. I don't expect you to actually call a program, least of all cat, just to read in a single file, unprocessed and unmolested. ☺
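For completeness, a minimal pure-Perl sketch of that slurp, assuming the stopword file is still called foo.regex:

my $stopword_string = do {
    local $/;                                    # undef the record separator: slurp mode
    open(my $fh, "<", "foo.regex")
        || die "$0: couldn't open foo.regex: $!";
    <$fh>;                                       # read the whole file in one go
};
my $stopword_rx = qr/$stopword_string/;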
You may want to use Regexp::Assemble to compile a list of Perl regexes into one regex.
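A minimal sketch of that approach, assuming the stopwords sit in @stopwords and are literal strings rather than regexes:

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
$ra->add(quotemeta $_) for @stopwords;      # add each stopword as an escaped literal
my $stopword_rx = $ra->re;                  # one combined regex with shared prefixes factored out
$corpus_text =~ s/\b$stopword_rx\b/ /g;     # then apply it the same way as above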
I found a faster way to do it. It saves me around 4 seconds:
my $qrstring = '\b(?:' . (join '|', map { quotemeta } @stopList) . ')\b';   # escape any regex metacharacters in the stopwords
$scalarText =~ s/$qrstring/ /g;
where @stopList is the array of all my stop words and $scalarText is my whole text.
Can anyone please tell me a faster way if you know any?