How do I use a Perl regex to count sentences?
I've struggled with regular expressions in Perl from the start. I wrote a quick script to count the sentences in some input text, but it doesn't work: I just get the number 1 back at the end, and I know the specified file contains several sentences, so the count should be higher. I can't see the issue...
#!C:\strawberry\perl\bin\perl.exe
#strict
#diagnostics
#warnings
$count = 0;
$file = "c:/programs/lorem.txt";
open(IN, "<$file") || die "Sorry, the file failed to open: $!";
while($line = <IN>)
{
    if($line =~ m/^[A-Z]/)
    {
        $count++;
    }
}
close(IN);
print("Sentances count was: ($count)");
The file lorem.txt is here:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo. Nullam dictum felis eu pede mollis pretium. Integer tincidunt. Cras dapibus. Vivamus elementum semper nisi. Aenean vulputate eleifend tellus. Aenean leo ligula, porttitor eu, consequat vitae, eleifend ac, enim. Aliquam lorem ante, dapibus in, viverra quis, feugiat a, tellus. Phasellus viverra nulla ut metus varius laoreet. Quisque rutrum. Aenean imperdiet. Etiam ultricies nisi vel augue. Curabitur ullamcorper ultricies nisi. Nam eget dui. Etiam rhoncus. Maecenas tempus, tellus eget condimentum rhoncus, sem quam semper libero, sit amet adipiscing sem neque sed ipsum. Nam quam nunc, blandit vel, luctus pulvinar, hendrerit id, lorem. Maecenas nec odio et ante tincidunt tempus. Donec vitae sapien ut libero venenatis faucibus. Nullam quis ante. Etiam sit amet orci eget eros faucibus tincidunt. Duis leo. Sed fringilla mauris sit amet nibh. Donec sodales sagittis magna. Sed consequat, leo eget bibendum sodales, augue velit cursus nunc,
I don't know what's in your lorem.txt, but the code that you've given is not counting sentences. It's counting lines, and more specifically only the lines that begin with a capital letter.
This regex:
/^[A-Z]/
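will only match at the beginning of a line, and only if the first character on that line is a capital letter. So a line that starts in lowercase, such as
it. And then we went...
will not be matched. It also means each line is counted at most once, so if your whole text sits on a single line, the count can never exceed 1.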
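If you want to match every capital letter, just remove the ^ from the beginning of the regex. You would also need the /g modifier to count every match on a line rather than testing each line once.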
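To make the difference concrete, here is a minimal sketch that counts matches across a whole string instead of testing each line once. It assumes that a sentence ends with a run of ., ! or ? followed by whitespace or the end of the text, which is my own simplification and not something the question specifies. The helper name count_sentences is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: counts runs of ., ! or ? that are followed by
# whitespace or the end of the text. Abbreviations such as "e.g." are
# counted as sentence ends, so treat the result as an estimate.
sub count_sentences {
    my ($text) = @_;
    # A /g match in list context returns every match; assigning that list
    # to () and then to a scalar yields the number of matches.
    my $count = () = $text =~ /[.!?]+(?=\s|\z)/g;
    return $count;
}

my $sample = "One sentence. Here is another.\nAnd yet another.\n";
print "Sentence count was: ", count_sentences($sample), "\n";   # prints 3
```

Because the pattern is not anchored with ^, it does not matter whether the text is spread over many lines or sits on a single one.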
This does not answer your specific question about regexp, but you could consider using a CPAN module: Text::Sentence. You can look at its source code to see how it defines a sentence.
use warnings;
use strict;
use Data::Dumper;
use Text::Sentence qw(split_sentences);
my $text = <<EOF;
One sentence. Here is another.
And yet another.
EOF
my @sentences = split_sentences($text);
print Dumper(\@sentences);
__END__
$VAR1 = [
'One sentence.',
'Here is another.',
'And yet another.'
];
A Google search also turned up Lingua::EN::Sentence.
You are currently counting all lines that begin with a capital letter. Perhaps you intend to count all words that start with a capital letter? If so, try:
m/\W[A-Z]/
(Although this is not a robust count of sentences)
On another note, there is no need to do the file manipulation explicitly. Perl does a really good job of that for you. Try this:
$ARGV[0] = "c:/programs/lorem.txt" unless @ARGV;
while ($line = <>) { ...
If you do insist on doing an explicit open/close, using bareword filehandles such as IN is considered bad practice. In other words, instead of "open IN...", use a lexical filehandle with the three-argument form of open: "open my $fh, '<', $file_name;"
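As a rough sketch of that advice, the following reads a file through a lexical filehandle opened with three-argument open. The helper name read_file and the punctuation-based count are my own assumptions for illustration, not part of the original script. The file name is taken from the command line so that nothing runs when no file is given.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the whole contents of the file at $path.
sub read_file {
    my ($path) = @_;
    # Three-argument open with a lexical filehandle instead of a bareword.
    open(my $fh, '<', $path) or die "Sorry, the file failed to open: $!";
    local $/;                         # no record separator: read it all at once
    my $text = <$fh>;
    close($fh);
    $text = '' unless defined $text;  # guard against an empty file
    return $text;
}

# Only runs when a file name is passed on the command line.
if (@ARGV) {
    my $text = read_file($ARGV[0]);
    # Assumed heuristic: a sentence ends in ., ! or ? before whitespace or EOF.
    my $count = () = $text =~ /[.!?]+(?=\s|\z)/g;
    print "Sentence count was: ($count)\n";
}
```

The lexical $fh is scoped to the sub, which keeps it from clashing with other handles the way a global bareword like IN can.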