
Counting individual words in a text file

I am trying to count the number of times a particular word occurs in a text file. The text file is passed to the Perl program as a command-line argument.

while($text = <>)
{
    @words = split (/\W*\s+\W*/, $text);
    @words = grep (/^[a-zA-Z\-]+$/, @words);
    foreach $word (@words)
    {
        $wordCount{$word}++;
    }
}

I do not clearly understand these two lines:

@words = split (/\W*\s+\W*/, $text);
@words = grep (/^[a-zA-Z\-]+$/, @words);

I know split is going to split the string into the array variable, but how? Does it split on non-word characters? I don't understand the regular expression used in the split call.

What does grep do? Its regular expression is also unclear to me.

P.S. When I test this, the code seems to have a bug: if I give it a text file containing the following text:

the quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.

It counts the words fox and dog only once, which is incorrect.

What is wrong here?


I'm not sure that splitting into an array is the most memory-efficient method, especially for very large texts. If the input contains very long lines (for example, a few megabytes of text without line breaks), each split builds a very large array, which uses a lot of memory.

Instead, I would do something like this:

my (%wordCount, $count);
while (my $text = <>) {
    while ($text =~ /([A-Za-z\-]+)/g) {
        my $word = lc($1);    # don't differentiate between 'Dog' and 'dog'
        $count++;             # total word count
        $wordCount{$word}++;  # individual word count
    }
}

This also makes it easy to add new characters if you find any that you want to allow inside a word. For example, if you think this_file should count as a word, change the character class to [A-Za-z\-_].
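
As a quick check, here is a minimal sketch that runs the same matching loop on the sample sentence from the question instead of on a file. The hard-coded $text and the final print loop are only there for illustration.

```perl
use strict;
use warnings;

# Sample sentence from the question, hard-coded for illustration.
my $text = 'the quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.';
my %wordCount;

# Pull out every run of letters/hyphens; punctuation never becomes part of a word.
while ($text =~ /([A-Za-z\-]+)/g) {
    $wordCount{ lc $1 }++;
}

# Print the counts in alphabetical order.
print "$_: $wordCount{$_}\n" for sort keys %wordCount;
```

With this input, fox and dog are each counted twice, and rose is counted once.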

As for your questions:

The regex \W*\s+\W* means: match zero or more non-word characters, followed by one or more whitespace characters, followed by zero or more non-word characters. It is a rather odd way to split, but it basically splits on whitespace and also strips any non-word characters adjacent to that whitespace, which gives a more accurate per-word count (e.g. it will not treat "dog," and "dog" as two different words).
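
To make the effect of that split visible, here is a small sketch that applies it to the sample sentence from the question. It assumes the line has no trailing newline; the bracketed output format is only for illustration.

```perl
use strict;
use warnings;

# Sample line from the question, assumed to have no trailing newline.
my $text = 'the quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.';

# Split on whitespace, also swallowing any non-word characters that touch it.
my @words = split /\W*\s+\W*/, $text;

# Print each field in brackets so that leftover punctuation is visible.
print join(' ', map { "[$_]" } @words), "\n";
```

The comma after "brown" disappears because it sits next to a space, but [dog.rose] and [fox.] come out with their punctuation intact.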

grep returns the list of values that match the regex. Here the regex matches any value in @words that consists only (from start to end) of upper- or lower-case letters and hyphens. If a value contains any other character, grep excludes it.
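
A minimal sketch of that filter, using a hypothetical list of fields chosen to show what passes and what does not:

```perl
use strict;
use warnings;

# Hypothetical fields, similar to what the split might leave behind.
my @fields = ('dog', 'dog.rose', 'fox.', 'well-known', 'R2D2');

# Keep only the fields made up entirely of ASCII letters and hyphens.
my @kept = grep { /^[a-zA-Z\-]+$/ } @fields;

print "@kept\n";   # prints: dog well-known
```

Anything containing a dot or a digit is dropped as a whole; grep does not trim the offending character.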

The bug is that "dog.rose" and "fox." are not split correctly, because their punctuation is not next to any whitespace ("fox." stays intact only when the line has no trailing newline). They are therefore not stripped of their non-word characters, and grep then removes them entirely.


As always with these questions, there are a million different ways to define what a "word" is. The following takes the definition already in use here (a sequence of letters, with internal hyphens allowed) and makes it work for the two failing cases noted:

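my $text = 'the quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.';
my %wordCount;
# Non-capturing group: in list context, m//g then returns each whole match
# (a repeated capturing group would return only its last repetition).
for my $word ( $text =~ /(?:[a-zA-Z]+|-(?=[a-zA-Z\-])(?<=[a-zA-Z\-]-))+/g ) {
    ++$wordCount{$word};
}

# Sort by ascending count, then alphabetically.
for my $word ( sort { $wordCount{$a} <=> $wordCount{$b} || $a cmp $b } keys %wordCount ) {
    print "$word: $wordCount{$word}\n";
}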
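
To illustrate how a pattern of this shape treats hyphens, here is a small sketch on a made-up string. The pattern is written with a non-capturing group so that each whole match is returned, and the test string is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical test string: an internal hyphen, stray outer hyphens, and a dot.
my $text = 'well-known -fox- dog.rose';

# A hyphen is accepted only when it is preceded and followed by a letter or hyphen.
my @words = $text =~ /(?:[a-zA-Z]+|-(?=[a-zA-Z\-])(?<=[a-zA-Z\-]-))+/g;

print join(', ', @words), "\n";   # prints: well-known, fox, dog, rose
```

The leading and trailing hyphens around "fox" are rejected by the lookbehind and lookahead respectively, while the hyphen inside "well-known" is kept.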


\W matches non-word characters
\s matches whitespace

As you might already have guessed, it doesn't work because there's no whitespace between the words in dog.rose.

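I would split on \b (a word boundary). This is simpler than your \W*\s+\W* and also separates dog.rose. Note that splitting on \b returns the runs of whitespace and punctuation between the words as well, so filter those out before counting:

my %wordCount;
while (my $text = <>)
{
    # keep only the tokens made of word characters; drop the separators
    my @words = grep { /^\w+$/ } split(/\b/, $text);
    foreach my $word (@words)
    {
        $wordCount{$word}++;
    }
}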
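
To see what split /\b/ actually returns, here is a small sketch on a fragment of the sample text (the variable names are only illustrative). It shows that the separators come back as fields too, which is why they need to be filtered out before counting.

```perl
use strict;
use warnings;

# Fragment of the sample sentence from the question.
my $text = 'dog.rose is brown, violet';

# Split at every word boundary; separators are returned as fields as well.
my @tokens = split /\b/, $text;
print join('', map { "[$_]" } @tokens), "\n";
# prints: [dog][.][rose][ ][is][ ][brown][, ][violet]

# Keep only the tokens that consist of word characters.
my @words = grep { /^\w+$/ } @tokens;
print "@words\n";   # prints: dog rose is brown violet
```

Keep in mind that \w also matches digits and underscores, and that a hyphenated word is split into its parts by this approach.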