
Why does Perl lose foreign characters on Windows; can this be fixed (if so, how)?

Note below how ã changes to a. NOTE 2: Before you blame this on CMD.EXE and Windows pipe weirdness, see Experiment 2 below, which gets a similar problem using File::Find.

The particular problem I'm trying to fix involves working with image files stored on a local drive, and manipulating the file names which may contain foreign characters. The two experiments shown below are intermediate debugging steps.

The ã character is common in Latin-derived languages, e.g. Portuguese: http://pt.wikipedia.org/wiki/Cão

Experiment 1

Look closely; note how cão becomes cao.

[screenshot of the console session not preserved]
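Since the screenshot did not survive, here is a hypothetical reconstruction of the kind of pipe test meant here (the directory and file names are made up); the report is that cão comes back as plain cao:

rem Hypothetical reconstruction: list the image files and echo them back through Perl.
rem With the default OEM/ANSI code pages, cão reportedly comes back as cao.
cd C:\images
dir /b | perl -ne "print"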

Experiment 2

Here I tried using File::Find instead of piped input, in case the issue was with the Windows implementation of the | shell operator. The issue actually gets worse: the ã becomes π (pi):

[screenshot of the console session not preserved]
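Again the screenshot is gone, so this is a hypothetical reconstruction of the File::Find variant (the path is made up). The comment notes why the observed symptom is plausible:

use strict;
use warnings;
use File::Find;

# Walk the image directory and print every name found.
# readdir() hands Perl bytes in the ANSI code page; printed to a console
# running the OEM code page, byte 0xE3 (ã in cp1252) renders as π in cp437,
# which would explain the "ã becomes π" symptom described above.
find(sub { print "$File::Find::name\n"; }, 'C:/images');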


Debugging update:

I tried some of the tricks listed at http://perldoc.perl.org/perlunicode.html, e.g. use utf8, use feature 'unicode_strings', etc., to no avail.
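For what it's worth, those pragmas operate on the script itself rather than on the OS interface, which is why they don't help here; a minimal sketch of what each one actually covers:

use utf8;                            # only declares that this script's source code is UTF-8
use feature 'unicode_strings';       # only fixes string-op semantics (uc, regexes, etc.)
binmode STDOUT, ':encoding(UTF-8)';  # only adds an encoding layer to this one handle
# None of these change the bytes that readdir(), -e or open() exchange with
# Windows, which go through the ANSI code page regardless.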


Environment and Version Info

The OS is Windows 7, 64-bit.

The Perl is:

This is perl 5, version 12, subversion 2 (v5.12.2) built for MSWin32-x64-multi-thread
(with 8 registered patches, see perl -V for more detail)

Copyright 1987-2010, Larry Wall

Binary build 1202 [293621] provided by ActiveState http://www.ActiveState.com
Built Sep  6 2010 22:53:42


Perl, like many other scripting languages, is built on the C runtime.

On Windows, the standard MS C runtime for narrow (byte) characters uses an encoding which defaults to the Windows system encoding (‘ANSI code page’) for IO activities such as opening files or writing to the console.

The ANSI code page is always a locale-specific encoding: usually single-byte, but multi-byte in some locales (e.g. China, Japan). It is never UTF-8 or anything else capable of reproducing the whole of Unicode; which characters Perl IO can cope with depends on the Windows locale (the "language for non-Unicode programs" setting).
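Within whatever the local ANSI code page can represent, the byte strings Perl gets back can at least be decoded explicitly. A sketch, assuming a Western-European system where the ANSI code page is cp1252 and a hypothetical image directory (adjust both to your setup):

use strict;
use warnings;
use Encode qw(decode encode);

my $acp = 'cp1252';   # assumption: the "language for non-Unicode programs" code page

opendir my $dh, 'C:/images' or die "opendir: $!";   # hypothetical path
for my $bytes (readdir $dh) {
    my $name = decode($acp, $bytes);    # ANSI-code-page bytes -> Perl characters
    # ... manipulate $name as text here ...
    my $back = encode($acp, $name);     # re-encode before handing it to rename()/open()
}
closedir $dh;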

Although console apps can be switched to UTF-8 with the chcp 65001 command, doing so triggers a number of serious inconsistencies. This causes difficulty for a lot of tools on Windows and is something Microsoft really needs to fix, but so far their attitude is that Unicode equals UTF-16: everyone who wants Unicode to work must use the wide-character interfaces.
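For reference, the usual attempt looks like the sketch below (switch the console to code page 65001 and push a UTF-8 layer onto STDOUT); on the cmd.exe and C-runtime combination of this era, it is precisely the sort of thing that misbehaves:

chcp 65001
perl -e "binmode STDOUT, ':encoding(UTF-8)'; print qq{c\x{e3}o\n}"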

So you won't currently be able to deal reliably with files that have non-ASCII names in Perl on Windows. Sorry.

You could try Python (which added special Windows-only filename handling to get around this problem from version 2.3 onwards; see PEP 277), or one of the Unicode-aware Windows Scripting Host languages. Either way, getting Unicode out to the console on Windows still has further pitfalls.
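Staying in Perl, the "wide-character interfaces" route means calling the W-suffixed Win32 APIs yourself. A rough sketch, assuming Win32API::File (normally shipped with Windows Perl) and its CreateFileW/OsFHandleOpen; the filename and the idea of opening a single image file are my own illustration, and this covers open(), not readdir():

use strict;
use warnings;
use Encode qw(encode);
use Win32API::File qw(:ALL);

my $name  = "c\x{e3}o.jpg";                        # "cão.jpg" as Perl characters
my $wname = encode('UTF-16LE', $name) . "\0\0";    # W APIs want NUL-terminated UTF-16LE

# Open the file via CreateFileW so the name never passes through the ANSI code page.
my $h = CreateFileW($wname, GENERIC_READ, FILE_SHARE_READ, [],
                    OPEN_EXISTING, 0, [])
    or die "CreateFileW failed: $^E";

OsFHandleOpen(\*IMG, $h, 'r') or die "OsFHandleOpen failed: $!";
binmode IMG;
# ... read image data from IMG as usual ...
close IMG;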


The following three-liner works as expected on my newly minted ActivePerl 5.12.2:

use utf8;                                    # the string literal below is UTF-8 source text
open(my $file, '>:encoding(UTF-8)', 'output.txt') or die $!;
print $file "さっちゃん";

I think the culprit is cmd.exe.
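One way to separate "Perl lost the character" from "cmd.exe merely can't display it" is to print ordinals instead of glyphs, so console rendering drops out of the picture; a small sketch, to be run in the image directory:

use strict;
use warnings;

opendir my $dh, '.' or die "opendir: $!";
for my $name (readdir $dh) {
    # Show the exact ordinals Perl received for each file name.
    printf "%s => %s\n", $name, join ' ', map { sprintf '%02X', ord } split //, $name;
}
closedir $dh;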
