Question about pathname encoding

What have I done to get such a strange encoding in this path-name?

In my file manager (Dolphin) the path-name looks good.

#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);

my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;

say Dump $s1;
say Dump $s2;

# SV = PV(0x824b50) at 0x9346d8
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
#   CUR = 4
#   LEN = 16

# SV = PV(0x7a7150) at 0x934c30
#   REFCNT = 1
#   FLAGS = (PADMY,POK,pPOK,UTF8)
#   PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
#   CUR = 8
#   LEN = 16

say $s1;
say $s2;

# Léo
# Lakmé

$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );

say $s1;
say $s2;

# L�o
# Lakmé


Unfortunately your operating system's pathname API is another "binary interface" where you will have to use Encode::encode and Encode::decode to get predictable results.

Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as latin-1, UTF-8 or other character encoding is an application decision. Consequently the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied) with the value returned by your OS via readdir(), and that's how you got code points mashed with octets.

Rule of thumb: Whenever you pass a path name to the OS, Encode::encode() it so that it is a sequence of octets. Whenever you get a path name from the OS, Encode::decode() it from the encoding the file system uses, so that your application works with characters.

You can make your program work by calling find this way:

find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') );

And then calling Encode::decode() when using the value of $File::Find::name:

my $path = Encode::decode('utf8', $File::Find::name);
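
Putting the two calls together, a minimal sketch might look like this. It assumes a file system that stores names as raw UTF-8 octets (typical on Linux). The temporary directory and the file name Lakmé.txt are made up for illustration and are not taken from the question.

```perl
use strict;
use warnings;
use utf8;                          # literals in this source are character strings
use Encode qw(encode decode);
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical test tree: <tmp>/Léo/Lakmé.txt, created with encoded (octet) names.
my $root = tempdir( CLEANUP => 1 );
my $dir  = "$root/" . encode( 'UTF-8', 'Léo' );
mkdir $dir or die "mkdir failed: $!";
open my $fh, '>', "$dir/" . encode( 'UTF-8', 'Lakmé.txt' ) or die "open failed: $!";
close $fh;

my @names;
find(
    sub {
        return unless -f $_;
        # The start directory was octets, so $File::Find::name is octets too.
        # Decode it once, at the boundary, before using it as text.
        push @names, decode( 'UTF-8', $File::Find::name );
    },
    $dir,                          # octets go to the OS
);

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @names;
```

The point of the design is that encoding and decoding happen only at the boundary with the OS, so octets and characters never get concatenated.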

To make this clearer, here is how $File::Find::name was formed:

use 5.014;    # enables say
use Encode;

# This is a way to get $dir to be represented as a UTF-8 string

my $dir = 'L' .chr(233).'o'.chr(256);
chop $dir;

say "dir: ", d($dir); # length = 3

# This is what readdir() is returning:

my $leaf = encode('utf8', 'Lakm' . chr(233));

say "leaf: ", d($leaf); # length = 6

$File::Find::name = $dir . '/' . $leaf;

say "File::Find::name: ", d($File::Find::name);

sub d {
  join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0]))
}


The POSIX filesystem API is broken as no encoding is enforced. Period.

Many problems can happen. For example, a single pathname can even contain both Latin-1 and UTF-8 components, depending on how (and whether) the various filesystems along the path handle encoding.
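
One defensive approach for such mixed paths is to decode each component separately, trying strict UTF-8 first and falling back to a legacy encoding. The sketch below is only a heuristic: decode_path is a hypothetical helper, and Latin-1 as the fallback is an assumption. Some Latin-1 byte sequences also happen to be valid UTF-8, so the guess can be wrong.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode a path one component at a time.
# Strict UTF-8 is tried first; if that fails, Latin-1 is assumed.
sub decode_path {
    my ($octets) = @_;
    my @parts = map {
        my $part  = $_;
        my $chars = eval {
            decode( 'UTF-8', $part, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
        };
        defined $chars ? $chars : decode( 'ISO-8859-1', $part );
    } split m{/}, $octets, -1;
    return join '/', @parts;
}

# First component is Latin-1 octets, second is UTF-8 octets.
my $mixed = "L\xE9o/Lakm\xC3\xA9";
my $path  = decode_path($mixed);

binmode STDOUT, ':encoding(UTF-8)';
print "$path\n";
```

If the application has to pass the name back to the OS later, it should keep the original octets as well, because the decoded form no longer records which encoding each component used.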
