How can I convert an input file to UTF-8 encoding in Perl?

Posted 2022-12-09 11:12

I already know how to convert the non-UTF-8 content of a file to UTF-8 line by line, using something like the following code:

use Encode;

# outfile.txt is encoded in GB-2312
open my $filter,"<",'c:/outfile.txt';

while(<$filter>){
#convert each line of outfile.txt to UTF-8 encoding
    $_ = Encode::decode("gb2312", $_);
...}

But I think Perl can convert the whole input file to UTF-8 directly, so I've tried something like

# outfile.txt is encoded in GB-2312
open my $filter,"<:utf8",'c:/outfile.txt';

(Perl says something like: utf8 "\xD4" does not map to Unicode)

and

open my $filter,"<",'c:/outfile.txt'; 
$filter = Encode::decode("gb2312", $filter);

(Perl says "readline() on unopened filehandle")

Neither works. Is there some way to convert the input file to UTF-8 directly?

Update:

It looks like things are not as simple as I thought. I can now convert the input file to UTF-8 in a roundabout way: I open the input file, write its content out to a new file as UTF-8, and then read the new file for further processing. This is the code:

open my $filter,'<:encoding(gb2312)','c:/outfile.txt';
open my $filter_new, '+>:utf8', 'c:/outfile_new.txt';
print $filter_new $_ while <$filter>;
seek $filter_new, 0, 0;   # rewind, otherwise the read loop starts at end of file
while (<$filter_new>){
...
}

But this is too much work, and it is even more troublesome than simply decoding the content of $filter line by line.

I think I misunderstood your question. What you want, I think, is to read a file in a non-UTF-8 encoding and then work with the data as UTF-8 in your program. That's much easier. Once you read the data through the right encoding layer, Perl holds it as a character string, which it represents internally as UTF-8. So, just do what you have to do.

When you write it back out, use whatever encoding you want to save it as. However, you don't have to put it back in a file to use it.
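As a minimal sketch of that read-then-write flow: the file name sample_gb2312.txt and the two sample characters are made up for illustration, and the script creates the file itself so it can run on its own. The input handle decodes GB2312, and the output encoding is chosen separately on STDOUT.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample file: the GB2312 bytes for U+4E2D U+6587 plus a newline.
my $path = 'sample_gb2312.txt';
open my $raw, '>:raw', $path or die "Cannot write $path: $!";
print {$raw} encode('gb2312', "\x{4E2D}\x{6587}\n");
close $raw or die "Cannot close $path: $!";

# Decode while reading: each line arrives as a string of characters.
open my $in, '<:encoding(gb2312)', $path or die "Cannot read $path: $!";
chomp(my $line = <$in>);
close $in;

# Encode while writing: the output encoding belongs to the output handle.
binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters: %s\n", length $line, $line;

unlink $path;   # remove the sample file again
```

No intermediate UTF-8 file is involved; the string in $line can be processed directly between the read and the write.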

old answer

The :utf8 I/O layer only reads the data assuming it's already properly encoded. It's not going to convert the encoding for you. By telling open to use utf8, you're telling it that the data already is UTF-8.

You have to use the Encode module just as you've shown (unless you want to write your own I/O layer). You can convert bytes to UTF-8, or if you know the encoding, you can convert from one encoding to another. Since it looks like you already know the encoding, you might want the from_to() function.
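A small sketch of from_to(), assuming the input is a byte string already read from the file without any decoding layer. The four sample bytes are the GB2312 encoding of U+4E2D U+6587 and are only there for illustration.

```perl
use strict;
use warnings;
use Encode qw(from_to);

# Raw GB2312 bytes, as they would come from a handle opened with plain "<".
my $bytes = "\xD6\xD0\xCE\xC4";

# from_to() converts $bytes in place and returns the new length in bytes
# (undef if the conversion fails).
my $new_length = from_to($bytes, 'gb2312', 'UTF-8');
defined $new_length or die "Conversion failed";

# $bytes is still a byte string, now holding UTF-8 encoded octets.
printf "%d bytes: %s\n", $new_length,
    join ' ', map { sprintf '%02X', ord } split //, $bytes;
# prints "6 bytes: E4 B8 AD E6 96 87"
```

Note that from_to() works on bytes on both sides, so the result is not a character string: length($bytes) counts octets here.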

If you're just starting out with Perl and Unicode, go through Juerd's Perl Unicode Advice before you do anything.

The :encoding layer will return UTF-8, suitable for perl's use. That is, perl will recognize each character as a character, even if they are multiple bytes. Depending on what you are going to do next with the data, this may be adequate.

But if you are doing something with the data where perl will try to downgrade it from utf8, you either need to tell perl not to (for instance, doing a binmode(STDOUT, ":utf8") to tell perl that output to stdout should be utf8), or you need to have perl treat your utf8 as binary data (interpreting each byte separately, and knowing nothing about the utf8 characters.)

To do that, all you need is to apply an additional layer to your open:

open my $foo, "<:encoding(gb2312):bytes", ...;

Note that the output of the following will be the same:

perl -we'open my $foo, "<:encoding(gb2312):bytes", "foo"; $bar = <$foo>; print $bar'
perl -CO -we'open my $foo, "<:encoding(gb2312)", "foo"; $bar = <$foo>; print $bar'

The difference is in what perl knows about the data. In the second command, perl knows that the data read is utf8 (so length($bar) reports the number of characters) and has to be told explicitly (by -CO) that STDOUT will accept utf8. In the first, perl makes no assumptions about the data (so length($bar) reports the number of bytes) and just prints it out as is.
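The two ways of counting can be illustrated without a file. This is only a sketch: the sample bytes stand in for one line of GB2312 input, and the encode() call approximates what the :bytes layer leaves you with rather than reproducing the layer itself.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# GB2312 bytes for U+4E2D U+6587, standing in for one line of the input file.
my $gb_bytes = "\xD6\xD0\xCE\xC4";

# Character view, like a line read through "<:encoding(gb2312)".
my $chars = decode('gb2312', $gb_bytes);

# Byte view (an approximation of adding ":bytes"): the same text as UTF-8 octets.
my $octets = encode('UTF-8', $chars);

printf "characters: %d, UTF-8 bytes: %d\n", length $chars, length $octets;
# prints "characters: 2, UTF-8 bytes: 6"
```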
