Why can't I use the map function to create a good hash from a simple data file in Perl?

This post has been updated. If you have already read the question, please jump to the Solution section. Thanks!

Here's the minimized code to exhibit my problem:

The input data file for the test was saved with Windows' built-in Notepad in UTF-8 encoding. It has the following three lines:

abacus  æbәkәs
abalone æbәlәuni
abandon әbændәn

The Perl script file was also saved with Windows' built-in Notepad in UTF-8 encoding. It contains the following code:

#!perl -w

use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";

In the output, the hash table seems to be okay:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

But it is actually not, because I only get two values instead of three:

æbәlәuni
әbændәn

Perl gives the following warning message:

Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.

Where's the problem? Can someone kindly explain? Thanks.

The Solution

Many thanks to all of you :) The culprit is finally found and the problem is fixable :) As @Sinan insightfully pointed out, the cause of the problem described above is the three-byte BOM (byte order mark) that Notepad added to my data file when it saved it as UTF-8, and which Perl does not strip automatically. Many suggested that I use "<:utf8" and ">:utf8" to read and write the files, but these layers alone do not solve the problem, because the BOM is still read as part of the first key (see the Note at the end of this section). They may also cause other problems.

To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:

#!perl -w

use Data::Dumper;
use strict;
use autodie;

open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

seek $in,3,0; # skip the three-byte UTF-8 BOM (assumes the file starts with one)
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

Now, the output is exactly what I expected:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };
æbәkәs
æbәlәuni
әbændәn

Please note that the script is saved as UTF-8, and the code does not need any :utf8 layers: without them Perl reads and writes the bytes unchanged, and the input file is already UTF-8, so the output file ends up as UTF-8 too.

Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.
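The unconditional seek above drops the first three bytes even when the file has no BOM, for example when it was saved by an editor other than Notepad. A minimal sketch of a safer variant follows. The helper name skip_utf8_bom is hypothetical, and the in-memory string only stands in for hash_test.txt with ASCII placeholder values:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): skip a UTF-8 BOM
# only if the handle really starts with one. Returns 1 if a BOM was skipped.
sub skip_utf8_bom {
    my ($fh) = @_;
    my $got = read($fh, my $head, 3);    # peek at the first three bytes
    return 1 if defined $got && $got == 3 && $head eq "\xEF\xBB\xBF";
    seek($fh, 0, 0);                     # no BOM: rewind to the start
    return 0;
}

# In-memory stand-in for hash_test.txt (BOM + three tab-separated lines).
my $data = "\xEF\xBB\xBFabacus\tone\nabalone\ttwo\nabandon\tthree";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my $had_bom = skip_utf8_bom($in);
my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
close $in;

print "BOM found: $had_bom\n";
print "$_ => $hash{$_}\n" for sort keys %hash;
```

With a real file, the same helper would be called right after open and before the first read.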

Note: To clarify a little more, if I use:

open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

The output is this:

$VAR1 = {
          'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni
",
          'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
          "\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
        };
æbәlәuni
әbændәn

And the warning message:

Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.


I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3 when it should be at line 4 after having read the last line.

When I first tried your code, I saved the input file with GVim, which is configured on my system to save as UTF-8, and I did not see the problem. Now that I have tried it with Notepad, I see this in the output file:

"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"

where \x{feff} is the BOM.

In your Dumper output, there is a spurious blank before abacus (where you had not specified :utf8 for the output handle).

As I had mentioned originally (lost to the umpteen edits on this post — thanks for the reminder hobbs), specify '<:utf8' when you are opening the input file.
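Opening with a decoding layer turns the three BOM bytes into the single character U+FEFF, which still has to be removed from the first line. A minimal sketch, assuming the data layout from the question; the in-memory byte string is only a stand-in for a Notepad-saved hash_test.txt:

```perl
use strict;
use warnings;

# In-memory stand-in for the data file: BOM followed by UTF-8 encoded text.
my $bytes = "\xEF\xBB\xBFabacus\t\xC3\xA6b\xD3\x99k\xD3\x99s\n"
          . "abandon\t\xD3\x99b\xC3\xA6nd\xD3\x99n";

open my $in, '<:encoding(UTF-8)', \$bytes or die "Cannot open: $!";
my @lines = <$in>;
close $in;

# After decoding, the BOM is one character, U+FEFF, at the very start.
$lines[0] =~ s/\A\x{FEFF}// if @lines;

my %hash = map { chomp; split /\t/, $_, 2 } @lines;

binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $hash{$_}\n" for sort keys %hash;
```

The substitution is a no-op for files without a BOM, so the same code handles both cases.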


If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.

#! /usr/bin/env perl
use Data::Dumper;
open my $in,  '<:utf8', "hash_test.txt";
open my $out, '>:utf8', "hash_result.txt";

my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";

If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8, for reading a file.

open my $in, '<:encoding(utf8)', "hash_test.txt";

Read PerlIO for more information.


I think your answer may be sitting right in front of you. The output from Data::Dumper which you posted is:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

Notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This fails because the actual key includes that character before abacus, as the Dumper() output shows. You could iterate over the keys instead, which sidesteps the problem:

foreach my $k (keys %hash) {
  print $out $hash{$k};
}
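An invisible character in a key is easier to spot when the key is printed byte by byte. A minimal sketch, with a hypothetical helper hex_bytes and a hand-built hash that mimics what an undecoded read of a BOM-prefixed file would produce:

```perl
use strict;
use warnings;

# Hypothetical helper: render a string as space-separated hex values,
# one per character (one per byte for undecoded data).
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $s;
}

# Stand-in for the hash from the question: the first key carries the BOM.
my %hash = ("\xEF\xBB\xBFabacus" => 'first', 'abalone' => 'second');

for my $k (sort keys %hash) {
    print hex_bytes($k), "\n";    # the BOM shows up as EF BB BF
}

print "plain 'abacus' key exists: ", (exists $hash{abacus} ? 'yes' : 'no'), "\n";
```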


split/\s/ instead of split/\t/
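Whether this helps depends on what actually separates the columns in the file, which the post does not show. A small sketch of how the three split forms behave on a line whose columns are separated by two spaces instead of a tab; the sample line is an assumption, not the original data:

```perl
use strict;
use warnings;

# Assumed sample line: two spaces between the columns, no tab.
my $line = "abacus  value";

my @by_tab = split /\t/, $line, 2;   # no tab present: the whole line is one field
my @by_ws  = split /\s/, $line, 2;   # splits at the first space only
my @awk    = split ' ',  $line, 2;   # special case: splits on runs of whitespace

print scalar(@by_tab), " field(s) with /\\t/\n";
print "second field with /\\s/: [$by_ws[1]]\n";
print "second field with ' ':  [$awk[1]]\n";
```

With /\s/ the second field keeps a leading space, while the ' ' form drops it.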


Works For Me. Are you sure your example matches your actual code and data?
