How can I guess if a string has text or binary data in Perl?
What is the best way to find out whether a scalar value is ASCII/UTF-8 (text) or binary data in Perl? Is this code right:
if (is_utf8($scalar, 1) or ($scalar =~ m/\A [[:ascii:]]* \Z/xms)) {
    # $scalar is text
}
else {
    # $scalar is binary
}
Is there a better way?
is_utf8 tests whether the Perl utf8 flag is turned on or not. It's possible for a scalar to contain correctly formed UTF-8 and not have the flag turned on. I think it's possible to deliberately turn the flag on even with malformed UTF-8 too, but I'm not sure.
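To illustrate the difference between the flag and the content, here is a small sketch. The sample string and variable names are made up for the example; it only assumes the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Raw bytes that happen to be well-formed UTF-8 ("café").
# Assumption for this example: the literal is written with \x escapes,
# so Perl stores it as plain bytes with the flag off.
my $octets = "caf\xC3\xA9";

# Decoding produces a character string; for non-ASCII data the flag is on.
my $text = decode('UTF-8', $octets);

my $flag_on_octets = is_utf8($octets) ? 1 : 0;
my $flag_on_text   = is_utf8($text)   ? 1 : 0;

print "octets: flag=$flag_on_octets length=", length($octets), "\n";
print "text:   flag=$flag_on_text length=",   length($text),   "\n";
```

So the flag describes how Perl stores the string internally, not whether the bytes you were handed are valid UTF-8.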
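To check whether the scalar contains UTF-8 data, you need to check the flag and, if it is not set, also try something like
use Encode qw(decode_utf8);
eval {
    # FB_CROAK makes decode_utf8 die on malformed input; LEAVE_SRC keeps $scalar unmodified
    my $utf8 = decode_utf8($scalar, Encode::FB_CROAK | Encode::LEAVE_SRC);
};
and then check for errors in $@.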
To check whether a non-UTF-8 scalar contains non-ASCII data, your idea $scalar =~ m/\A [[:ascii:]]* \Z/xms looks ok.
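Putting the two checks together, a rough sketch might look like this. The helper name classify_bytes and its return values are hypothetical, and it assumes the input is a string of raw bytes, not an already decoded character string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify a string of raw bytes.
# Returns 'ascii', 'utf8', or 'binary'.
sub classify_bytes {
    my ($bytes) = @_;

    # Only code points 0..127: plain ASCII (which is also valid UTF-8).
    return 'ascii' if $bytes =~ m/\A [[:ascii:]]* \z/xms;

    # Decode a copy, because decode() with a CHECK argument may modify
    # its input. FB_CROAK makes decode() die on malformed UTF-8.
    my $copy = $bytes;
    my $ok = eval {
        Encode::decode('UTF-8', $copy, Encode::FB_CROAK);
        1;
    };
    return $ok ? 'utf8' : 'binary';
}

print classify_bytes("hello"), "\n";
print classify_bytes("caf\xC3\xA9"), "\n";
print classify_bytes("\xFF\xFE\x00"), "\n";
```

Note that this is still only a guess: control characters such as NUL are ASCII too, and some binary data happens to be valid UTF-8.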
The best way, clearly, is to simply keep track when you are reading the data. You as the programmer should already know whether you are getting text (and its encoding) or binary data. When you're reading text, you Encode::decode() it (see http://p3rl.org/UNI for details) into Perl text strings.
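As a minimal sketch of decoding at the boundary: the sample bytes are invented for illustration, and the example assumes you already know the input is UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket.
# Assumption for this example: they are known to be UTF-8 ("naïve").
my $octets = "na\xC3\xAFve";

# Decode once, on input: the bytes become a Perl text string.
my $text = decode('UTF-8', $octets);

# Text operations now count characters, not bytes.
my $chars = length $text;
my $bytes = length $octets;

# Encode again only when writing the text back out.
my $out = encode('UTF-8', $text);

print "characters: $chars, bytes: $bytes\n";
```

Binary data simply skips the decode step and stays a byte string throughout.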
If you really don't know beforehand, the -T and -B file tests offer a heuristic.
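A short sketch of how the file tests could be used. The helper name guess_file_type is hypothetical, and the two temporary files exist only to make the example self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: guess a file's type with Perl's -T heuristic.
sub guess_file_type {
    my ($path) = @_;
    return 'missing' unless -e $path;
    return 'empty'   if -z $path;    # empty files pass both -T and -B
    return 'text'    if -T $path;
    return 'binary';
}

# Demo file containing plain text.
my ($text_fh, $text_file) = tempfile(UNLINK => 1);
print {$text_fh} "Just some plain text.\n" x 10;
close $text_fh;

# Demo file containing NUL and high bytes.
my ($bin_fh, $bin_file) = tempfile(UNLINK => 1);
binmode $bin_fh;
print {$bin_fh} "\x00\x01\x02\xFF" x 100;
close $bin_fh;

print "$text_file: ", guess_file_type($text_file), "\n";
print "$bin_file: ",  guess_file_type($bin_file),  "\n";
```

The tests only inspect the first block of the file, so treat the result as a guess.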
Disregard Kinopiko's answer. In the vast majority of cases, you should not need to know about the internal representation of data, and messing with the utility functions from the utf8 pragma module is the wrong approach.