
Removing stop words and saving the result to a new file

I have text files from which I need to remove stop words. The stop words are stored in a text file, which I load into my Perl script and store in an array called "stops".

Currently I load a different set of text files into a separate array and then do a pattern match to see whether any of the words are stop words. I can print the stop words and know which ones occur in the files, but how do I remove them from the text file and save a new text file that has no stop words?

For example, Stopwords: the a to of and into

Text File: "The girl was driving and crashed into a man"

Resulting file: girl was driving crashed man

I load the file in:

$dirtoget = "/Users/j/temp/";
opendir( IMD, $dirtoget ) || die("Cannot open directory");
@thefiles = readdir(IMD);

foreach $f (@thefiles) {
if ( $f =~ m/\.txt$/ ) {

    open( FILE, "/Users/j/temp/$f" ) or die "Cannot open FILE";

    while (<FILE>) {
        @file = <FILE>;

Here is the pattern matching loop:

  foreach $word(split) {
                foreach $x (@stop) {
                   if  ($x =~ m/\b\Q$word\E\b/) {
                 $word='';
                        print $word,"\n";

This sets $word to the empty string.

Or I could do:

    $word = '' if exists $stops{$word};

I'm just not sure how to make the output file no longer contain the matching words. Is it stupid to store the words that don't match in an array and write them to a file?


Overwriting the files in place is possible, but a hassle. The Unix way of doing this is to print the non-stopwords to standard output (which print does by default) and redirect that to a file:

./remove_stopwords.pl textfile.txt > withoutstopwords.txt

then proceed with the file withoutstopwords.txt. This also allows the use of the program in a pipeline.
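
A minimal sketch of what such a remove_stopwords.pl could look like, assuming a hash lookup (as the question suggests with %stops) and case-insensitive matching. The helper name remove_stopwords and the inlined stop-word list are illustrative assumptions; in practice the list would be read from the stop-word file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the line with every stop word dropped.
# $stops is a reference to a hash whose keys are lower-cased stop words.
sub remove_stopwords {
    my ( $line, $stops ) = @_;
    # split ' ' splits on runs of whitespace and ignores leading blanks
    my @kept = grep { !exists $stops->{ lc $_ } } split ' ', $line;
    return join ' ', @kept;
}

# Stop words inlined for this sketch (assumption); normally loaded from a file.
my %stops = map { $_ => 1 } qw(the a to of and into);

# Read the files named on the command line (or standard input)
# and print each filtered line to standard output.
while ( my $line = <> ) {
    print remove_stopwords( $line, \%stops ), "\n";
}
```

Run as shown above, with the output redirected to a new file. For the example line in the question this prints "girl was driving crashed man". Note that this sketch compares whole whitespace-separated tokens, so a stop word with punctuation attached (such as "the,") would be kept.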


Shorter:

use strict;
use warnings;
use English qw<$NR>;

# Assumes @stop has already been loaded from the stop-word file.
# quotemeta escapes regex metacharacters so each stop word matches literally.
my $alternation = join '|', map { quotemeta } @stop;
my $stop_regex  = qr/\b($alternation)\b/;

@ARGV = glob( '/Users/j/temp/*.txt' );
while ( <> ) { 
    next unless m/$stop_regex/;
    print "Stop word '$1' found at $ARGV line $NR\n";
}

What do you want to do with these words? If you wanted to replace them then you could do this:

use English qw<$INPLACE_EDIT $LIST_SEPARATOR $NR>;
local $INPLACE_EDIT = '.bak';    # backup extension, including the dot

...
while ( <> ) { 
    if ( m/$stop_regex/ ) {
        s/$stop_regex/$something_else/g;
    }
    print;
}

With $INPLACE_EDIT active, perl renames each input file to a backup with the '.bak' extension, opens a new file under the original name, and sends the output of print to that new file. The original content stays in the backup, if that's what you want to do.


You can use the substitution operator to delete words from your files:

use warnings;
use strict;

my @stop = qw(foo bar);
while (<DATA>) {
    my $line = $_;
    $line =~ s/\b\Q$_\E\b//g for @stop;    # \Q...\E matches each stop word literally
    print $line;
}

__DATA__
here i am
with a foo
and a bar too
lots of foo foo food

prints:

here i am
with a
and a  too
lots of   food
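
The output above still contains the blanks that surrounded the deleted words. A possible variation, sketched under the assumption that collapsed whitespace and case-insensitive matching are wanted: build one alternation from the stop words, delete the matches, then tidy the spacing. The helper name strip_stopwords is hypothetical.

```perl
use strict;
use warnings;

my @stop = qw(foo bar);

# One case-insensitive alternation; quotemeta makes each word literal.
my $alternation = join '|', map { quotemeta } @stop;
my $stop_regex  = qr/\b(?:$alternation)\b/i;

# Hypothetical helper: delete stop words, then clean up leftover blanks.
sub strip_stopwords {
    my ($line) = @_;
    $line =~ s/$stop_regex//g;     # delete every stop word
    $line =~ s/[ \t]+/ /g;         # collapse runs of blanks into one space
    $line =~ s/^\s+|\s+$//g;       # trim leading and trailing whitespace
    return $line;
}

for my $line ( 'with a foo', 'and a bar too', 'lots of foo foo food' ) {
    print strip_stopwords($line), "\n";
}
```

This prints "with a", "and a too" and "lots of food", each on its own line. As before, "food" survives because \b requires a word boundary on both sides of the stop word.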