Why can't I use the map function to create a good hash from a simple data file in Perl?
The post is updated. Please kindly jump to the Solution part, if you've already read the posted question. Thanks!
Here's the minimized code to exhibit my problem:
The input data file for the test was saved with Windows' built-in Notepad in UTF-8 encoding. It has the following three tab-separated lines:
abacus	æbәkәs
abalone	æbәlәuni
abandon	әbændәn
The Perl script file was also saved with Windows' built-in Notepad in UTF-8 encoding. It contains the following code:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";
In the output, the hash table seems to be okay:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
But it is actually not, because I only get two values instead of three:
æbәlәuni
әbændәn
Perl gives the following warning message:
Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.
Where's the problem? Can someone kindly explain? Thanks.
The Solution
Millions of thanks to all of you guys :) The culprit has finally been found and the problem is fixable :) As @Sinan insightfully pointed out, I'm now 100% sure that the culprit is the three-byte BOM (EF BB BF) that Notepad added to my data file when it was saved as UTF-8, and that Perl does not strip when reading the file. Although many suggested that I use "<:utf8" and ">:utf8" to read and write the files, these UTF-8 layers alone do not solve the problem: the BOM is still read in and ends up in the first hash key. Instead they may cause some other problems.
To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
Now, the output is exactly what I expected:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
æbәkәs
æbәlәuni
әbændәn
Please note the script is saved as UTF-8 encoding and the code does not have to include any utf-8 labels because the input file and the output file are both pre-saved as UTF-8 encoding.
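The `seek $in,3,0` line above skips three bytes unconditionally, so it would eat the first three characters of a file saved without a BOM. A minimal sketch of a more defensive variant, assuming the file is still read as raw bytes; `skip_utf8_bom` and `load_pairs` are hypothetical helper names, and the in-memory strings stand in for the real data file:

```perl
use strict;
use warnings;

# Hypothetical helper: skip a UTF-8 BOM (bytes EF BB BF) only if present.
sub skip_utf8_bom {
    my ($fh) = @_;
    my $start  = tell $fh;
    my $got    = read $fh, my $head, 3;
    my $is_bom = defined $got && $got == 3 && $head eq "\xEF\xBB\xBF";
    # Rewind when the first three bytes were not a BOM.
    seek $fh, $start, 0 unless $is_bom;
    return $is_bom;
}

# Hypothetical helper: build the word => pronunciation hash from raw bytes.
sub load_pairs {
    my ($data) = @_;
    open my $in, '<', \$data or die "Cannot open in-memory file: $!";
    skip_utf8_bom($in);
    my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
    return \%hash;
}

# In-memory stand-ins for the data file, with and without a BOM.
for my $data ("\xEF\xBB\xBFabacus\tx\nabalone\ty\n", "abacus\tx\nabalone\ty\n") {
    my $hash = load_pairs($data);
    print exists $hash->{abacus} ? "abacus found\n" : "abacus missing\n";
}
```

With a real file, the same helper could be called on the handle right after `open`, in place of the fixed `seek`.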
Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.
Note: To clarify a little more, if I use:
open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
The output is this:
$VAR1 = {
'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni ",
'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s "
};
æbәlәuni
әbændәn
And the warning message:
Use of uninitialized value in print at C:\hash_test.pl line 13, line 3.
I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3 when it should be at line 4 after having read the last line.
When I tried your code with an input file saved from GVim, which is configured on my system to save as UTF-8, I did not see the problem. Now that I have tried it with Notepad and looked at the output file, I see:
"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s "
where \x{feff} is the BOM.
In your Dumper output, there is a spurious blank before abacus (where you had not specified :utf8 for the output handle).
As I had mentioned originally (lost to the umpteen edits on this post; thanks for the reminder, hobbs), specify '<:utf8' when you are opening the input file.
If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.
#! /usr/bin/env perl
use Data::Dumper;
open my $in, '<:utf8', "hash_test.txt";
open my $out, '>:utf8', "hash_result.txt";
my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";
If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8 for reading a file.
open my $in, '<:encoding(utf8)', "hash_test.txt";
Read PerlIO for more information.
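Decoding the input does not remove the BOM by itself: it turns the three BOM bytes into the single character \x{feff} at the start of the first line, as the Dumper output above shows. A minimal sketch of stripping that character after decoding, assuming tab-separated lines; `read_pairs` is a hypothetical helper name, and the in-memory byte string stands in for a Notepad-saved file:

```perl
use strict;
use warnings;

# Hypothetical helper: read "word<TAB>value" lines from a decoded handle
# and drop a leading BOM character (U+FEFF) if the first line has one.
sub read_pairs {
    my ($fh) = @_;
    my @lines = <$fh>;
    $lines[0] =~ s/\A\x{FEFF}// if @lines;
    chomp @lines;
    return map { split /\t/, $_, 2 } @lines;
}

# Stand-in for the data file: BOM bytes followed by UTF-8 encoded text.
my $bytes = "\xEF\xBB\xBFabacus\t\xC3\xA6b\nabandon\t\xD3\x99b\n";
open my $in, '<:encoding(UTF-8)', \$bytes or die "Cannot open: $!";
my %hash = read_pairs($in);

binmode STDOUT, ':encoding(UTF-8)';
print join(',', sort keys %hash), "\n";
```

Because the substitution only fires when the first character really is U+FEFF, the same code should behave identically on files saved without a BOM.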
I think your answer may be sitting right in front of you. The output from Data::Dumper which you posted is:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
Notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This is incorrect because of that character before abacus in the Dumper() hash. You could try plugging it into a loop which should take care of it:
foreach my $k (keys %hash) {
print $out $hash{$k};
}
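Since the stray character is invisible in most editors, it can help to print each key as hex values. A minimal sketch; `show_chars` is a hypothetical helper name, and the example key assumes the BOM was read as three raw bytes, as in the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as space-separated hex values,
# one per character (one per byte if the string was never decoded).
sub show_chars {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $str;
}

# Assumed example key: "abacus" preceded by the undecoded BOM bytes.
my %hash = ("\xEF\xBB\xBFabacus" => 'value');
print show_chars($_), "\n" for sort keys %hash;
```

Setting $Data::Dumper::Useqq = 1 before calling Dumper() is another way to make such characters show up as escapes in the dump.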
Use split /\s/ instead of split /\t/.
Works For Me. Are you sure your example matches your actual code and data?