Removing stop words and saving the result to a new file
I have text files from which I need to remove stop words. The stop words are stored in a separate text file, which my Perl script loads into an array called @stop.
Currently I load the other text files into a separate array and run a pattern match to see whether any of their words are stop words. I can print the stop words and see which ones occur in the files, but how do I remove them from the text and write a new file that contains no stop words?
E.g. Stopwords: the a to of and into
Text File: "The girl was driving and crashed into a man"
Resulting file: girl was driving crashed man
I load the files like this:
$dirtoget = "/Users/j/temp/";
opendir( IMD, $dirtoget ) || die("Cannot open directory");
@thefiles = readdir(IMD);
foreach $f (@thefiles) {
if ( $f =~ m/\.txt$/ ) {
open( FILE, "/Users/j/temp/$f" ) or die "Cannot open FILE";
while (<FILE>) {
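# $_ now holds one line of the file; slurping the handle into @file here
# would swallow every remaining line, so the words are split from $_ instead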
Here is the pattern matching loop:
foreach $word (split) {
    foreach $x (@stop) {
        if ( lc($word) eq $x ) {    # assumes the stop words are stored in lower case
            print $word, "\n";      # report the stop word before blanking it
            $word = '';
        }
    }
}
This sets $word to the empty string.
Or I could do:
$word = '' if exists $stops{$word};
I'm just not sure how to make the output file no longer contain the matching words. Is it stupid to store the words that don't match in an array and write that array out to a file?
Overwriting the files in place is possible, but a hassle. The Unix way of doing this is to print the non-stop-words to standard output (which print does by default) and redirect that to a new file:
./remove_stopwords.pl textfile.txt > withoutstopwords.txt
Then proceed with the file withoutstopwords.txt. This also allows the program to be used in a pipeline.
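A minimal sketch of such a filter, assuming whitespace-separated words and a case-insensitive comparison. The names make_stop_set, strip_stopwords and filter_stream are made up for illustration, and the stop words are hard-coded here rather than read from the stop-word file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup hash from a list of stop words (lower-cased).
sub make_stop_set {
    my %set = map { lc($_) => 1 } @_;
    return \%set;
}

# Return the line with stop words dropped. Words are split on whitespace
# and compared in lower case after stripping surrounding punctuation.
sub strip_stopwords {
    my ($stops, $line) = @_;
    my @kept = grep {
        my $bare = lc $_;
        $bare =~ s/^\W+|\W+$//g;
        !exists $stops->{$bare};
    } split ' ', $line;
    return join ' ', @kept;
}

# Copy every line from $in to $out with the stop words removed.
sub filter_stream {
    my ($stops, $in, $out) = @_;
    while (my $line = <$in>) {
        print {$out} strip_stopwords($stops, $line), "\n";
    }
    return;
}

# Illustration with the example from the question.
my $stops = make_stop_set(qw(the a to of and into));
print strip_stopwords($stops, 'The girl was driving and crashed into a man'), "\n";
# prints: girl was driving crashed man

# To act as a real filter, the script would instead end with:
#   filter_stream($stops, \*STDIN, \*STDOUT);
```

Keeping the non-matching words in an array and joining them back together, as asked in the question, is exactly what strip_stopwords does, so that approach is not stupid at all.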
Shorter:
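use strict;
use warnings;
use English qw<$LIST_SEPARATOR $NR>;
my @stop = qw(the a to of and into);    # placeholder; load these from your stop-word file
my $stop_regex
    = do {
        # joining on '\E|\Q' inside \Q...\E quotes each word of the alternation
        local $LIST_SEPARATOR = '\\E|\\Q';
        eval "qr/\\b(\\Q@{stop}\\E)\\b/";
    };
@ARGV = glob( '/Users/j/temp/*.txt' );
while ( <> ) {
    next unless m/$stop_regex/;
    print "Stop word '$1' found at $ARGV line $NR\n";
}
continue {
    close ARGV if eof;    # reset $NR so line numbers restart for each file
}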
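The same kind of regex can also be built without a string eval, which avoids trouble if a stop word ever contains a slash. This is only a sketch; build_stop_regex is a name invented here, not something from the code above:

```perl
use strict;
use warnings;

# Compile one regex that matches any of the given words as a whole word.
# quotemeta() escapes regex metacharacters, so no string eval is needed.
sub build_stop_regex {
    my @words = @_;
    my $alternation = join '|', map { quotemeta($_) } @words;
    return qr/\b($alternation)\b/;
}

my $stop_regex = build_stop_regex(qw(the a to of and into));
if ( 'crashed into a man' =~ $stop_regex ) {
    print "Stop word '$1' found\n";    # prints: Stop word 'into' found
}
```

The compiled regex drops into the loop above unchanged, since $1 still holds the stop word that matched.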
What do you want to do with these words? If you wanted to replace them then you could do this:
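use English qw<$INPLACE_EDIT $LIST_SEPARATOR $NR>;
local $INPLACE_EDIT = '.bak';    # keep each original as <name>.bak
...
my $something_else = '';         # assumed replacement text; empty deletes the word
while ( <> ) {
    if ( m/$stop_regex/ ) {
        s/$stop_regex/$something_else/g;
    }
    print;
}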
With $INPLACE_EDIT active, perl renames each original file to a backup with the given suffix (here '.bak'), and print then writes to a new file under the original name. When it moves on to the next file, it does the same again. If that's what you want to do.
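Put together, an in-place run might look like the sketch below. It uses $^I, which is the variable English calls $INPLACE_EDIT; the sub name remove_in_place and the case-insensitive match are assumptions made for the example:

```perl
use strict;
use warnings;

# Remove every word in @$stops from each file in @$files, editing in place.
# The original of each file is kept under the same name plus '.bak'.
sub remove_in_place {
    my ($stops, $files) = @_;
    return unless @$files;    # with an empty @ARGV, <> would read STDIN
    my $alternation = join '|', map { quotemeta($_) } @$stops;
    my $stop_regex  = qr/\b(?:$alternation)\b/i;

    local $^I   = '.bak';     # same variable as $INPLACE_EDIT
    local @ARGV = @$files;
    while ( my $line = <> ) {
        $line =~ s/$stop_regex//g;
        print $line;          # goes to the rewritten file, not the terminal
    }
    return;
}

# Hypothetical call, using the directory from the question:
#   remove_in_place( [qw(the a to of and into)], [ glob('/Users/j/temp/*.txt') ] );
```

Localizing $^I and @ARGV inside the sub keeps the in-place behaviour from leaking into the rest of the script.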
You can use the substitution operator to delete words from your files:
use warnings;
use strict;
my @stop = qw(foo bar);
while (<DATA>) {
my $line = $_;
$line =~ s/\b$_\b//g for @stop;
print $line;
}
__DATA__
here i am
with a foo
and a bar too
lots of foo foo food
prints:
here i am
with a
and a too
lots of food
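That output still has the blanks that surrounded each deleted word. A variation that also swallows the spaces after a stop word and ignores case is sketched here; delete_stopwords is an invented name, and treating only spaces and tabs as separators is an assumption:

```perl
use strict;
use warnings;

# Delete stop words from one line and tidy the whitespace left behind.
# Matching is case-insensitive; each word is escaped with quotemeta().
sub delete_stopwords {
    my ($line, @stop) = @_;
    return $line unless @stop;
    my $alternation = join '|', map { quotemeta($_) } @stop;
    $line =~ s/\b(?:$alternation)\b[ \t]*//gi;    # the word plus the blanks after it
    $line =~ s/[ \t]+$//mg;                       # blanks stranded at the end of a line
    return $line;
}

print delete_stopwords( "The girl was driving and crashed into a man\n",
                        qw(the a to of and into) );
# prints: girl was driving crashed man
```

Using one alternation also means each line is scanned once rather than once per stop word.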