Perl strings internals

2023-01-02 07:30 问答作者：

How do perl strings represented internally? What encoding is used? How do I handle different encodings properly?

I've been using perl for quite a long time, but it didn't include a lot of string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions.

Until this moment I thought about perl strings as sequences of bytes, which did fit pretty well for my tasks. Now I need to do some processi开发者_StackOverflowng of UTF-8 encoded file and here starts trouble.

First, I read file into string like this:

open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');

my $contents;

{
    local $/;
    $contents = <$in>;
}

close($in);

then simply print it:

print $contents;

And I get two things: a warning Wide character in print at <scriptname> line <n> and a garbage in console. So I can conclude that perl strings have a concept of "character" that can be "wide" or not, but when printed these "wide" characters are represented in console as multiple bytes, not as single "character". (I wonder now why did all my previous experience with binary files worked quite how I expected it to work without any "character" issues).

Why then I see garbage in console? If perl stores strings as character in some known encoding, I don't think there is a big problem to find out console encoding and print text properly. (I use Windows, BTW).

If perl stores strings as variable-width character sequences (e.g. using same UTF-8 encoding), why is it done this way? From my C experience handling strings is PAIN.

Update.

I use two computers for testing, one runs Windows 7 x64 with English language pack installed, but with Russian regional settings (so I have cp866 as OEM codepage and cp1251 as ANSI) with ActivePerl 5.10.1 x64; another runs Windows XP 32 bit Russian localization with Cygwin Perl 5.10.0.

Thanks to links, now I have much more solid understanding on what's going on and how things should be done.

Setting utf8 before reading from the file is good, it automagically decodes the bytes into the internal encoding. (Which is also UTF-8 but you don't need to know, and shouldn't rely on.)

Before printing you need to encode the characters back to bytes.

use Encode;  
utf8::encode($contents);

There is also a two argument form of encode, for other encodings than unicode. (That sentence echoes too much, doesn't it?)

Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

Oh, and it must use multi-byte strings, because otherwise it's just not unicode.

Perl strings are stored internally in one of two encodings, either a 8-bit byte oriented native encoding, or UTF-8. For backwards comparability the assumption is that all I/O and strings are in native encoding, unless otherwise specified. Native encoding is usually 8-bit ASCII, but this can be changed with use locale.

In your sample you call binmode on your input handle changing it to use :utf8 semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting native encoded characters.

Perl in an attempt to do the right thing will allow a UTF-8 string to be sent to a native encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters and it will almost certainly guess wrong. That is what the warning means, a multi-byte character was sent to a stream only expecting single byte characters and the result was that the character was probably damaged in translation.

Depending on what you want to accomplish you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single byte character set that can be printed safely or if you know that whatever is attached to STDOUT can handle UTF-8 you can use binmode(STDOUT, ':utf8'); to tell Perl you want any data sent to STDOUT to be sent as UTF-8.

You should mention your actual Windows and Perl versions as this really depends on your used versions and installed language packages.
Otherwise have a look at the PerlUnicode manual first -

Perl uses logically-wide characters to represent strings internally.

it will confirm your statements.

Windows does not fully install all UTF8 character- thus this is might be the reason for your issue. You may need to install an additional language package.

继续阅读：encoding perl string

Perl strings internals

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？