Stopword removal and save the new file

2023-02-15 03:18

I have text files from which I need to remove stop words. The stop words are stored in a text file, which I load into my Perl script and store in an array called "stops".

Currently I load a different set of text files into a separate array and then do a pattern match to see whether any of the words are stop words. I can print the stop words and know which ones occur in the files, but how do I remove them from the text file and store a new text file that has no stop words?

For example, Stopwords: the a to of and into

Text File: "The girl was driving and crashed into a man"

Resulting file: girl was driving crashed man

I load the file in:

$dirtoget = "/Users/j/temp/";
opendir( IMD, $dirtoget ) || die("Cannot open directory");
@thefiles = readdir(IMD);

foreach $f (@thefiles) {
if ( $f =~ m/\.txt$/ ) {

    open( FILE, "/Users/j/temp/$f" ) or die "Cannot open FILE";

    while (<FILE>) {
        @file = <FILE>;

Here is the pattern matching loop:

  foreach $word(split) {
                foreach $x (@stop) {
                   if  ($x =~ m/\b\Q$word\E\b/) {
                 $word='';
                        print $word,"\n";

Setting $word to be null.

Or I could do:

    $word = '' if exists $stops{$word};

I'm just not sure how to make the output file no longer contain the matching words. Is it stupid to store the words which don't match in an array and output them to a file?

Overwriting the files in-place is possible, but a hassle. The Unix way of doing this is to print the non-stopwords to standard output (which print does by default) and redirect that to a new file:

./remove_stopwords.pl textfile.txt > withoutstopwords.txt

then proceed with the file withoutstopwords.txt. This also allows the use of the program in a pipeline.
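
A minimal sketch of such a filter, under some assumptions that are not in the question: the stop-word file is passed as the first command-line argument, words are separated by whitespace, and matching is case-insensitive. The helper names load_stopwords and strip_stopwords are made up for illustration. Punctuation attached to a word (such as "man.") is not handled here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read whitespace-separated stop words from an
# open filehandle and return them as a lookup hash (lower-cased keys).
sub load_stopwords {
    my ($fh) = @_;
    my %stops;
    while ( my $line = <$fh> ) {
        $stops{ lc $_ } = 1 for split ' ', $line;
    }
    return \%stops;
}

# Hypothetical helper: drop every word found in the stop list and
# rejoin the remaining words with single spaces (no trailing newline).
sub strip_stopwords {
    my ( $text, $stops ) = @_;
    return join ' ', grep { !$stops->{ lc $_ } } split ' ', $text;
}

# Act as a filter only when arguments are given on the command line:
# the first one is the stop-word file, the rest are the text files.
if (@ARGV) {
    my $stopfile = shift @ARGV;
    open my $fh, '<', $stopfile or die "Cannot open $stopfile: $!";
    my $stops = load_stopwords($fh);
    close $fh;

    # <> reads the remaining files in @ARGV, or STDIN if none are left.
    while ( my $line = <> ) {
        print strip_stopwords( $line, $stops ), "\n";
    }
}
```

It could then be run as ./remove_stopwords.pl stopwords.txt textfile.txt > withoutstopwords.txt, where stopwords.txt is an assumed file name. A hash lookup per word avoids looping over the whole stop list for every word.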

Shorter:

use strict;
use warnings;
use English qw<$LIST_SEPARATOR $NR>;

my $stop_regex 
    = do { 
        local $LIST_SEPARATOR = '\\E|\\Q';
        eval "qr/\\b(\\Q@{stop}\\E)\\b/";
    };
@ARGV = glob( '/Users/j/temp/*.txt' );
while ( <> ) { 
    next unless m/$stop_regex/;
    print "Stop word '$1' found at $ARGV line $NR\n";
}

What do you want to do with these words? If you wanted to replace them then you could do this:

use English qw<$INPLACE_EDIT $LIST_SEPARATOR $NR>;
local $INPLACE_EDIT = '.bak';    # extension used for the backup of each original file

...
while ( <> ) { 
    if ( m/$stop_regex/ ) {
        s/$stop_regex/$something_else/g;
    }
    print;
}

With $INPLACE_EDIT set, perl renames each input file to a backup (the original name plus the extension you gave) and sends the output of print to a new file with the original name, so each file ends up edited in place with the backup left beside it. If that's what you want to do.

You can use the substitution operator to delete words from your files:

use warnings;
use strict;

my @stop = qw(foo bar);
while (<DATA>) {
    my $line = $_;
    $line =~ s/\b$_\b//g for @stop;
    print $line;
}

__DATA__
here i am
with a foo
and a bar too
lots of foo foo food

prints:

here i am
with a
and a  too
lots of   food
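
As the output above shows, deleting the words leaves the surrounding spaces behind, and the match is case-sensitive. A rough sketch of one way to tidy that up, assuming you want case-insensitive matching and single spaces between the remaining words (the function name remove_stops is invented for this example, and the stop list is the one from the question):

```perl
use strict;
use warnings;

my @stop = qw(the a to of and into);

# Build a single alternation; quotemeta guards against stop words
# that contain regex metacharacters.
my $alternation = join '|', map { quotemeta $_ } @stop;
my $stop_regex  = qr/\b(?:$alternation)\b/i;

# Hypothetical helper: returns the line without stop words and
# without leading, trailing, or repeated whitespace (newline included).
sub remove_stops {
    my ($line) = @_;
    $line =~ s/$stop_regex//g;    # delete the stop words
    $line =~ s/\s+/ /g;           # collapse runs of whitespace
    $line =~ s/^\s+|\s+$//g;      # trim both ends
    return $line;
}

print remove_stops("The girl was driving and crashed into a man"), "\n";
# prints: girl was driving crashed man
```

Compiling the alternation once with qr// also means each line is scanned a single time instead of once per stop word.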
