How can I decode UTF-16 data in Perl when I don't know the byte order?

2022-12-31 12:42 问答作者：

If I open a file ( and spec开发者_如何学Goify an encoding directly ) :

open(my $file,"<:encoding(UTF-16)","some.file") || die "error $!\n";
while(<$file>) {
    print "$_\n";
}
close($file);

I can read the file contents nicely. However, if I do:

use Encode;

open(my $file,"some.file") || die "error $!\n";
while(<$file>) {
    print decode("UTF-16",$_);
}
close($file);

I get the following error:

UTF-16:Unrecognised BOM d at F:/Perl/lib/Encode.pm line 174

How can I make it work with decode?

EDIT: here are the first several bytes:

FF FE 3C 00 68 00 74 00

If you simply specify "UTF-16", Perl is going to look for the byte-order mark (BOM) to figure out how to parse it. If there is no BOM, it's going to blow up. In that case, you have to tell Encode which byte-order you have by specifying either "UTF-16LE" for little-endian or "UTF-16BE" for big-endian.

There's something else going on with your situation though, but it's hard to tell without seeing the data you have in the file. I get the same error with both snippets. If I don't have a BOM and I don't specify a byte order, my Perl complains either way. Which Perl are you using and which platform do you have? Does your platform have the native endianness of your file? I think the behaviour I see is correct according to the docs.

Also, you can't simply read a line in some unknown encoding (whatever Perl's default is) then ship that off to decode. You might end up in the middle of a multi-byte sequence. You have to use Encode::FB_QUIET to save the part of the buffer that you couldn't decode and add that to the next chunk of data:

open my($lefh), '<:raw', 'text-utf16.txt';

my $string;
while( $string .= <$lefh> ) {
    print decode("UTF-16LE", $string, Encode::FB_QUIET) 
    }

You need to specify either UTF-16BE or UTF-16LE. See http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM

What you're trying to do impossible.

You're reading lines of text without specifying an encoding, so every byte that contains a newline character (default \x0a) ends a line. But this newline character may very well be in the middle of an UTF-16 character, in which case your next line can't be decoded. If your data is UTF-16LE, this will happen all the time – line feeds are \x0a \x00. If you have UTF16-BE, you might get lucky (newlines are \x00 \x0a), until you get a character with \x0a in the high byte.

So, don't do that, open the file in the right encoding.

继续阅读：decode perl utf-16

How can I decode UTF-16 data in Perl when I don't know the byte order?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？