Counting individual words in a text file

2023-03-09 07:36

I am trying to count the number of times a particular word occurs in a text file. The text file is given as a command-line argument to the Perl program.

while($text = <>)
{
    @words = split (/\W*\s+\W*/, $text);
    @words = grep (/^[a-zA-Z\-]+$/, @words);
    foreach $word (@words)
    {
        $wordCount{$word}++;
    }
}

I do not have a clear understanding of these lines:

@words = split (/\W*\s+\W*/, $text);
@words = grep (/^[a-zA-Z\-]+$/, @words);

I know split is going to split the string into the array variable, but how? Does it split on non-word characters? I don't understand the regular expression used in the split call.

What does grep do? Its regular expression is also unclear to me.

P.S. When I test this, the code seems to have a bug. If I give it a text file containing the following text:

the quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.

It counts the words fox and dog only once, which is incorrect.

What is wrong here?

I'm not sure that splitting into an array is the most memory-efficient method, especially for very large texts. The loop reads one line at a time, but if a line is long (say, a text file of a few megabytes with no line breaks), split will build a very large array, which uses a lot of memory.

Instead, I would do something like this:

while ($text = <>) {
    while ($text =~ /([A-Za-z\-]+)/g)  {
        my $word = lc($1);    # don't differentiate between 'Dog' and 'dog'
        $count++;             # total word count
        $wordCount{$word}++;  # individual word count
    }
}

This also makes it easy to add new characters if you find any that you want to allow in a word. For example, if you think this_file should count as one word, change the character class to [A-Za-z\-_].
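As a rough sketch of that idea, the scan can be wrapped in a helper that takes the character class as a parameter. The name count_words and its second argument are assumptions made for illustration; they are not part of the answer's code.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): count words in a string,
# where a "word" is a run of characters from the given character class.
sub count_words {
    my ($text, $class) = @_;
    $class //= 'A-Za-z\-';              # default: letters and hyphen
    my %count;
    while ($text =~ /([${class}]+)/g) {
        $count{ lc $1 }++;              # fold case so 'Dog' and 'dog' are one word
    }
    return \%count;
}

my $sample = 'The quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.';
my $counts = count_words($sample);
print "$_: $counts->{$_}\n" for sort keys %$counts;

# Adding the underscore to the class keeps 'this_file' together as one word.
my $with_underscore = count_words('open this_file now', 'A-Za-z\-_');
print join(', ', sort keys %$with_underscore), "\n";
```

Because the scan walks the string match by match, no intermediate array of pieces is built; only the hash of counts grows.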

As for your questions:

The regex \W*\s+\W* means: match zero or more non-word characters, followed by one or more whitespace characters, followed by zero or more non-word characters. It is a rather odd way to split, but it basically splits on whitespace and strips any non-word characters next to that whitespace, which gives a more accurate word count (for example, it will not treat "dog," and "dog" as two different words).

grep returns the list of values that match the regex. Here the regex matches any value in @words that consists entirely (from start to end) of upper- or lower-case letters and hyphens. If a value contains any other character, grep excludes it.

The bug is that "dog.rose" and "fox." are not split correctly, because the period is not next to any whitespace (for "fox." this happens when the text does not end with a newline). They are therefore not implicitly stripped of non-word characters, and the grep then removes them entirely.
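To make the two stages concrete, here is a small sketch that runs the question's split and grep on the sample line and prints what each stage produces. The variable names @pieces and @dropped are introduced here for illustration, and the sample string is assumed to have no trailing newline.

```perl
use strict;
use warnings;

# Sample line from the question, without a trailing newline.
my $text = 'the quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.';

# Stage 1: split on whitespace plus any non-word characters touching it.
my @pieces = split /\W*\s+\W*/, $text;

# Stage 2: keep only pieces made entirely of letters and hyphens.
my @words = grep { /^[a-zA-Z\-]+$/ } @pieces;

# Pieces rejected by the grep, i.e. the words that go missing from the count.
my @dropped = grep { !/^[a-zA-Z\-]+$/ } @pieces;

print "after split: @pieces\n";
print "after grep:  @words\n";
print "dropped:     @dropped\n";
```

The dropped list contains "dog.rose" and "fox.", which is why dog and fox are each counted only once and rose is not counted at all.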

As always with these questions, there are a million different ways to define what a "word" is. The code below keeps the definition already in use here (a sequence of letters, with internal dashes allowed) but makes it work for the two failing cases noted:

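my $text = 'the quick brown fox jumps over the lazy dog dog.rose is brown, violet jumps the fox.';
my %wordCount;
# The outer group captures the whole word; the inner group is non-capturing,
# so a hyphenated word is returned whole rather than as its last segment.
for my $word ( $text =~ /((?:[a-zA-Z]+|-(?=[a-zA-Z\-])(?<=[a-zA-Z\-]-))+)/g ) {
    ++$wordCount{$word};
}

# Print by ascending count, then alphabetically.
for my $word ( sort { $wordCount{$a} <=> $wordCount{$b} || $a cmp $b } keys %wordCount ) {
    print "$word: $wordCount{$word}\n";
}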
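The lookarounds in that pattern are what restrict dashes to the inside of a word. The following sketch, run on a made-up test string, is one way to see the effect; the pattern here uses an outer capturing group so that each match comes back whole, and the name $word_re is an assumption for illustration.

```perl
use strict;
use warnings;

# A dash is accepted only when a letter or dash follows it (lookahead)
# and a letter or dash precedes it (lookbehind).
my $word_re = qr/((?:[a-zA-Z]+|-(?=[a-zA-Z\-])(?<=[a-zA-Z\-]-))+)/;

my $text  = 'a well-known fix -dash- end-';   # made-up test string
my @words = $text =~ /$word_re/g;             # list context: one capture per match

print join('|', @words), "\n";
```

The internal dash in "well-known" is kept, while the leading and trailing dashes around "dash" and after "end" are left out of the words.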

\W matches non-word characters
\s matches whitespace

As you might already have guessed, it doesn't work because there's no whitespace between the words in dog.rose.

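I would split on \b (which matches at word boundaries). This should be simpler and more correct than your \W*\s+\W*. Note that split on \b also returns the whitespace and punctuation between the words, so keep only the pieces that contain a word character:

while($text = <>)
{
    # split /\b/ returns the separators too; grep keeps pieces with a word character
    @words = grep (/\w/, split (/\b/, $text));
    foreach $word (@words)
    {
        $wordCount{$word}++;
    }
}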
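A quick way to see what split on \b actually returns is to print every piece in brackets. This is only a sketch on a shortened sample string; the names @pieces and @words are chosen here for illustration.

```perl
use strict;
use warnings;

my $text = 'dog dog.rose, fox.';   # shortened sample, chosen for illustration

# Every boundary between a word character and a non-word character is a split
# point, so the separators come back as list elements alongside the words.
my @pieces = split /\b/, $text;
print join('', map { "[$_]" } @pieces), "\n";

# Keeping only the pieces that contain a word character leaves the words.
my @words = grep { /\w/ } @pieces;
print "@words\n";
```

Keep in mind that \b follows the \w definition of a word character, so digits and underscores count as part of a word and a hyphenated word such as well-known is split into two words.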
