How to count Chinese words in a file using a regex in Perl?
I tried the following Perl code to count the Chinese words in a file. It runs, but it does not produce the right result. Any help is greatly appreciated.
The error message is:
Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
Total things = 125, valid words =
It seems to me that the problem is the file format. The "Total things" value is 125, which is the number of strings (125 lines). The strangest part is that my console displays all the individual Chinese words correctly. The utf8 pragma is installed.
#!/usr/bin/perl -w
use strict;
use utf8;
use Encode qw(encode);
use Encode::HanExtra;
my $input_file = "sample_file.txt";
my ($total, $valid);
my %count;
open (FILE, "< $input_file") or die "Can't open $input_file: $!";
while (<FILE>) {
    foreach (split) { #break $_ into words, assign each to $_ in turn
        $total++;
        next if /\W|^\d+/; #strange words skip the remainder of the loop
        $valid++;
        $count{$_}++; # count each separate word stored in a hash
        ## next comes here ##
    }
}
print "Total things = $total, valid words = $valid\n";
foreach my $word (sort keys %count) {
print "$word \t was seen \t $count{$word} \t times.\n";
}
##---Data----
sample_file.txt
那天约二更时,只见封肃方回来,欢天喜地.众人忙问端的.他乃说道:"原来本府新升的太爷姓贾名化,本贯胡州人氏,曾与女婿旧日相交.方才在咱门前过去,因见娇杏那丫头买线, 所以他只当女婿移住于此.我一一将原故回明,那太爷倒伤感叹息了一回,又问外孙女儿,我说看灯丢了.太爷说:`不妨,我自使番役务必探访回来.'说了一回话, 临走倒送了我二两银子."甄家娘子听了,不免心中伤感.一宿无话.至次日, 早有雨村遣人送了两封银子,四匹锦缎,答谢甄家娘子,又寄一封密书与封肃,转托问甄家娘子要那娇杏作二房. 封肃喜的屁滚尿流,巴不得去奉承,便在女儿前一力撺掇成了,乘夜只用一乘小轿,便把娇杏送进去了.雨村欢喜,自不必说,乃封百金赠封肃, 外谢甄家娘子许多物事,令其好生养赡,以待寻访女儿下落.封肃回家无话.
We set STDOUT to the :utf8 IO layer so that say won't emit malformed data, then open the file with the same layer so that the diamond operator won't read malformed data. Afterward, inside the while loop, rather than splitting on the empty string, we use a regex with the "East_Asian_Width: Wide" Unicode property.
The use utf8 line is only there for my own sanity checking and can be removed.
use strict;
use warnings;
use 5.010;
use utf8;
use autodie;
binmode(STDOUT, ':utf8');
open my $fh, '<:utf8', 'sample_file.txt';
my ($total, $valid);
my %count;
while (<$fh>) {
    $total += length;        # every character on the line, including the newline
    for (/(\p{Ea=W})/g) {    # each East_Asian_Width=Wide character
        $valid++;
        $count{$_}++;
    }
}
say "Total things = $total, valid words = $valid";
for my $word (sort keys %count) {
    say "$word \t was seen \t $count{$word} \t times.";
}
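As a rough illustration of why the input layer matters, here is a minimal sketch that compares the same two ideographs as decoded characters and as raw UTF-8 bytes. The variable names are made up for this example, and the characters are written as escapes so the encoding of the script itself does not matter.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two ideographs from the sample text, written as code point escapes.
my $chars = "\x{5C01}\x{8083}";

# Roughly what a file handle without any layer hands back: raw bytes.
my $bytes = encode('UTF-8', $chars);

my $char_len = length $chars;   # counts characters
my $byte_len = length $bytes;   # counts bytes

# On raw bytes, each byte is treated as a code point below 0x100,
# so none of them has the East_Asian_Width=Wide property.
my @wide_in_bytes = $bytes =~ /\p{Ea=W}/g;

# After decoding, the ideographs are visible to the regex again.
my @wide_in_chars = decode('UTF-8', $bytes) =~ /\p{Ea=W}/g;

printf "chars=%d bytes=%d wide(bytes)=%d wide(chars)=%d\n",
    $char_len, $byte_len, scalar @wide_in_bytes, scalar @wide_in_chars;
```

The same effect is what makes the character class tests in the question's loop behave unexpectedly when the file is read without a decoding layer.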
EDIT: J-16 SDiZ and daxim pointed out that the chances of sample_file.txt being in UTF-8 are slim. Read their comments, then take a look at the Encode module in perldoc, specifically the 'Encoding via PerlIO' section.
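If the file turns out not to be UTF-8, the same counting loop can be pointed at another encoding through a PerlIO layer. The sketch below assumes, purely for illustration, that the data is in cp936 (GBK); the real encoding of sample_file.txt is not known from the question. The helper name count_wide_chars and the temporary file are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count East_Asian_Width=Wide characters in a file, decoding it with
# the named encoding through an :encoding(...) PerlIO layer.
sub count_wide_chars {
    my ($path, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $path
        or die "Can't open $path: $!";
    my %count;
    while (my $line = <$fh>) {
        $count{$_}++ for $line =~ /\p{Ea=W}/g;
    }
    close $fh;
    return \%count;
}

# Build a tiny cp936-encoded file to try the helper on (assumed encoding).
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode $tmp_fh, ':encoding(cp936)';
print {$tmp_fh} "\x{5C01}\x{8083}\x{5C01}\n";
close $tmp_fh;

my $counts = count_wide_chars($tmp_path, 'cp936');
printf "distinct wide characters: %d\n", scalar keys %$counts;
```

Swapping 'cp936' for 'UTF-8', 'big5', or whatever the file really uses is the only change needed once the encoding is known.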
I may be able to offer some insight, but it's hard to tell whether my answer will be helpful. First, I only speak and read English, so I do not speak or read Chinese. I do happen to be the author of RegexKitLite, which is an Objective-C wrapper around the ICU regex engine. This is obviously not Perl. :)
Despite this, the ICU regex engine happens to have a feature that sounds remarkably like what you're trying to do. Specifically, the ICU regex engine has the UREGEX_UWORD modifier option, which can be turned on dynamically via the normal (?w:...) syntax. This modifier performs the following action:
Controls the behavior of \b in a pattern. If set, word boundaries are found according to the definitions of word found in Unicode UAX 29, Text Boundaries. By default, word boundaries are identified by means of a simple classification of characters as either “word” or “non-word”, which approximates traditional regular expression behavior. The results obtained with the two options can be quite different in runs of spaces and other non-word characters.
You can use this in a regex like (?w:\b(.*?)\b) to "extract" words from a string. The ICU regex engine has a fairly powerful "word breaking engine" that is specifically designed to find word breaks in written languages that, unlike English, do not have an explicit space character between words. Again, I do not read or write these languages, but my understanding is that the text reads "itisroughlysomethinglikethis". The ICU word breaking engine uses heuristics, and occasionally dictionaries, to find the word breaks. It is my understanding that Thai happens to be a particularly difficult case. In fact, I use ฉันกินข้าว (Thai for "I eat rice", or so I was told) with a regex of (?w)\b\s* to perform a split operation on the string to extract the words. Without (?w) you cannot split on word breaks. With (?w) it results in the words ฉัน, กิน, and ข้าว.
Provided the above sounds like the problem you're having, then this could be the reason. If this is the case, then I am not aware of any way to accomplish this in Perl, but I wouldn't consider this opinion an authoritative answer, since I use the ICU regex engine more often than the Perl one and am clearly not properly motivated to find a working Perl solution when I've already got one. :) Hope this helps.
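For what it's worth, Perl 5.22 and later do offer a \b{wb} assertion based on the UAX 29 word boundary rules. The sketch below is an assumption-laden illustration rather than an equivalent of ICU's (?w): as far as I know, Perl applies the default rules without a dictionary, so each Han ideograph ends up as its own token instead of being grouped into multi-character words.

```perl
use strict;
use warnings;
use 5.022;   # \b{wb} is only available from Perl 5.22 onward

# Two ideographs, a space, and an ASCII word, written as escapes.
my $text = "\x{5C01}\x{8083} abc";

# Split at UAX 29 word boundaries, then keep only the pieces that
# contain at least one word character (this drops the whitespace).
my @tokens = grep { /\w/ } split /\b{wb}/, $text;

printf "%d tokens\n", scalar @tokens;
```

Real Chinese word segmentation needs a dictionary or statistical model, which this assertion does not provide.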