Removing stop words and saving the new file in Perl

I have written a Perl script that loads a list of stop words into an array.

Then I read a directory containing ".ner" files. Each file is opened, and each word is split out and compared against the words in the stop-word file. If a word matches, it is set to "" (nothing, so it is removed). I then copy the file to another location, so that I can tell files with stop words apart from files without. But does this change the file so that it no longer contains stop words, or does the copy still hold the original contents?

#!/usr/bin/perl

#use strict;
#use warnings;

my @stops;
my @file;

use File::Copy;

open( STOPWORD, "/Users/jen/stopWordList.txt" ) or die "Can't Open: $!\n";

@stops = <STOPWORD>;
while (<STOPWORD>)    #read each line into $_
{
    chomp @stops;     # Remove newline from $_
    push @stops, $_;  # add the line to @triggers
}

close STOPWORD;

$dirtoget="/Users/jen/temp/";

opendir(IMD, $dirtoget) || die("Cannot open directory");

@thefiles= readdir(IMD);

foreach $f (@thefiles){
    if ($f =~ m/\.ner$/){
        print $f,"\n";

        open (FILE, "/Users/jen/temp/$f")or die"Cannot open FILE"; 

        if ( FILE eq "" ) {
            close FILE;
        }
        else{
            while (<FILE>) {

               foreach $word(split(/\|/)){

                    foreach $x (@stops) {
                       if  ($x =~ m/\b\Q$word\E\b/) {
                            $word = '';   
             copy("/Users/jen/temp/$f","/Users/jen/correct/$f")or die "Copy failed: $!";
                    close FILE;
                    } 
                    }
                }
            }
        }
    }
}
closedir(IMD);
exit 0;

The format of the file I am splitting and comparing is as follows:

'<title>|NN|O Woman|NNP|O jumped|VBD|O for|IN|O life|NN|O after|IN|O firebomb|NN|O attack|NN|O -|:|O National|NNP|I-ORG News|NNP|I-ORG ,|,|I-ORG Frontpage|NNP|I-ORG -|:|I-ORG Independent.ie</title>|NNP|'

Should I be specifying where the words should be split, i.e. split(/|/)?


You should ALWAYS use: use strict; use warnings;

Use the three-argument form of open and test whether the open failed.

As codaddict said, a split with no arguments is equivalent to split(' ', $_).

Here is a proposal to do the job (as far as I understood what you wanted).

#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;

my @stops = qw(put here your stop words);
my %stops = map{$_ => 1} @stops;

my $path = '/Users/jen/temp/';
my $out = $path.'outputfile';

# collect the names of the files in the input directory
opendir my $dh, $path or die "can't open directory '$path' : $!";
my @thefiles = readdir $dh;
closedir $dh;

open my $fout, '>', $out or die "can't open '$out' for writing : $!";

foreach my $file(@thefiles) {
    next unless $file =~ /\.ner$/;
    open my $fh, '<', $path.$file or die "can't open '$file' for reading : $!";
    my @lines = <$fh>;    # read from the file handle, not the file name
    close $fh;
    foreach my $line(@lines) {
        my @words = split/\|/,$line;
        foreach my $word(@words) {
            # blank out the field if it is a stop word
            $word = '' if exists $stops{$word};
        }
        print $fout join '|',@words;
    }
}
close $fout;
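
Note that setting $word = '' only changes the copy held in memory. The file on disk, and anything made from it with File::Copy's copy, keeps the original text until the filtered lines are written out. Splitting on | alone also yields fields such as "O for" rather than "for", because each token in the sample has the form word|POS|NER. The following is a minimal sketch of a token-aware filter, assuming tokens are separated by whitespace as in the sample line. The helper name remove_stop_tokens and the stop words used are placeholders, not part of the original code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop every "word|POS|NER" token whose word is a stop word.
# $stops is a reference to a hash keyed by lowercase stop words.
sub remove_stop_tokens {
    my ($line, $stops) = @_;
    my @kept;
    foreach my $token (split ' ', $line) {
        my ($word) = split /\|/, $token;    # first field is the word itself
        next if defined $word && exists $stops->{ lc $word };
        push @kept, $token;
    }
    return join ' ', @kept;
}

my %stops = map { $_ => 1 } qw(for after);  # placeholder stop words
my $line  = 'Woman|NNP|O jumped|VBD|O for|IN|O life|NN|O after|IN|O firebomb|NN|O';
print remove_stop_tokens($line, \%stops), "\n";
# prints: Woman|NNP|O jumped|VBD|O life|NN|O firebomb|NN|O
```

Because split ' ' discards the trailing newline, a caller that writes the result to a file would need to add "\n" back after each line.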


A split with no arguments is equivalent to split(' ', $_).

Since you want the lines to be split on | you need to do:

split/\|/
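
To make the difference concrete, here is a small sketch comparing the three forms on a made-up line in the question's format. It also covers the unescaped split(/|/) from the question: an unescaped | is an alternation between two empty patterns, so it matches the empty string and splits between every character.

```perl
use strict;
use warnings;

my $line = "for|IN|O life|NN|O\n";   # made-up sample in the question's format

# No arguments: same as split(' ', $_), so fields are separated by whitespace.
my @by_space;
{
    local $_ = $line;
    @by_space = split;
}

# Escaped pipe: fields are separated by a literal '|'.
my @by_pipe = split /\|/, $line;

# Unescaped pipe: an empty alternation, which splits between every character.
my @by_char = split /|/, "a|b";

print scalar(@by_space), "\n";   # 2: "for|IN|O", "life|NN|O"
print scalar(@by_pipe),  "\n";   # 5: "for", "IN", "O life", "NN", "O\n"
print scalar(@by_char),  "\n";   # 3: "a", "|", "b"
```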


@jenniem001,

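# $file is the input path, $duplicate the output path, @words the stop words
open my $in, '<', $file or die "can't open '$file': $!";
my $whole_file = do { local $/; <$in> };    # slurp the whole file
close $in;
$whole_file =~ s/\b\Q$_\E\b//ig for @words;
open my $out, '>', $duplicate or die "can't open '$duplicate': $!";
print {$out} $whole_file;
close $out;

That will remove the stop words from your file's contents and write the result to a duplicate. Just give $duplicate a name :)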
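
One caveat when applying a case-insensitive substitution to the tagged format from the question: a stop word such as "in" also matches the part-of-speech tag IN, so the tag is emptied along with the word. The sketch below illustrates this on a made-up sample. The helper name strip_stops and its $ignore_case flag are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: strip stop words from a string, optionally ignoring case.
sub strip_stops {
    my ($text, $ignore_case, @stops) = @_;
    foreach my $stop (@stops) {
        if ($ignore_case) { $text =~ s/\b\Q$stop\E\b//ig }
        else              { $text =~ s/\b\Q$stop\E\b//g }
    }
    return $text;
}

my $sample = 'jumped|VBD|O in|IN|O';          # made-up sample line
print strip_stops($sample, 1, 'in'), "\n";    # prints: jumped|VBD|O ||O
print strip_stops($sample, 0, 'in'), "\n";    # prints: jumped|VBD|O |IN|O
```

Either way the substitution leaves empty fields behind rather than removing the whole token, so a per-token filter may suit this format better.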
