How can I change my regular expression to read UTF-8?
I got very far in a script I am working on only to find out it has a problem reading UTF-8 characters.
I have a contact in Sweden that made a VM on his machine with some UTF-8 in it and when my script hit that VM it lost its mind, but it was able to read all of the other VMs that are in the "normal" charset.
Anyhow, maybe my code will make more sense.
#!/usr/bin/perl
use strict;
use warnings;
#use utf8;
use Net::OpenSSH;
# Create a hash for storing the options needed by Net::OpenSSH
my %ssh_options = (
port => '22',
user => 'root',
password => 'password'
);
# Create a new Net::OpenSSH object
my $ssh = Net::OpenSSH->new('192.168.2.101', %ssh_options);
# Create an array and capture the ESX\ESXi output from the current server
my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
shift @getallvms;
# Process data gathered from server
foreach my $vm (@getallvms) {
# Match ID, NAME
$vm =~ m/^(?<id> \d+)\s+(?<name> .+?)\s+/xm;
my $id = "$+{id}";
my $name = "$+{name}";
print "$id\n";
print "$name\n";
print "\n";
}
I have narrowed the problem down to my regular expression, because here is the raw output from the server before the regular expression is applied.
416
TEST Box åäö!"''*#
And this is what I get after I apply my regular expression
416
TEST
For some reason the regular expression is not matching, and I just don't know why. The regular expression in the example is my third attempt at getting it to work.
I wrote my regular expression this way because I only need the first two blocks of information; the expression you have wants to copy the entire line.
The FULL line that I am matching looks like this:
432 TEST Box åäö!"''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04
The subpattern
(?<name> .+?)\s+
in your regular expression means “match and remember one or more non-newline characters, but stop as soon as you find whitespace,” so $name contains TEST because the pattern stopped matching when it saw the space just before Box.
The VI Toolkit wiki gives an example of the getallvms subcommand's output:
# vmware-vim-cmd -H 10.10.10.10 -U root -P password /vmsvc/getallvms
Vmid    Name      File                           Guest OS          Version   Annotation
64      bartPE    [store] BartPE/BartPE.vmx      winXPProGuest     vmx-04
96      trustix   [store] Trustix/Trustix.vmx    otherLinuxGuest   vmx-04
The case is slightly different from the example in your question, but it appears that we can look for [store] as a bumper for the match:
/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix
The non-greedy quantifier +? means match one or more of something, but hand control to the rest of the pattern as quickly as possible. Remember that [ has a special meaning in regular expressions, but the pattern \[ matches a literal left bracket rather than introducing a character class.
I think of this technique as bookending or tacking-and-stretching. If you want to extract a chunk of text that's difficult to characterize, look for surrounding features that are easy to match, often as simple as ^ or $. Then use a stretchy pattern to grab everything in between, usually (.+) or (.+?). Read the “Quantifiers” section of the perlre documentation for an explanation of your many options.
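As a rough sketch of the difference, the snippet below runs both patterns against one line. The sample line is a simplified, ASCII-only stand-in for the getallvms output, and the variable names $short and $full are only illustrative.

```perl
use strict;
use warnings;

# Simplified, ASCII-only stand-in for one line of getallvms output.
my $line = '432 TEST Box name [Store] TEST Box/TEST Box.vmx slesGuest vmx-04';

# Non-greedy capture followed by \s+ stops at the first run of whitespace.
my ($short) = $line =~ /^\d+\s+(.+?)\s+/;

# The same capture, bookended by a literal "[store]" (matched case-insensitively).
my ($full) = $line =~ /^\d+ \s+ (.+?) \s+ \[store\]/xi;

print "short: $short\n";   # short: TEST
print "full:  $full\n";    # full:  TEST Box name
```

The only change between the two patterns is the bookend after the capture; the stretchy (.+?) is the same in both.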
This fixes the immediate problem, and you can also add polish in a few areas.
Do not use $1, $2, and friends unconditionally! Always test that the pattern matches before using capture variables. For example
if (/(foo|bar|baz)/) {
print "got $1\n";
}
else {
print "no match\n";
}
An unprotected print $1 can produce surprising results that are tough to debug.
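One such surprise, sketched here with made-up input strings, is that a failed match leaves the capture variables from the previous successful match in place, so an unchecked $1 can silently report stale data.

```perl
use strict;
use warnings;

# Hypothetical inputs, chosen only to show one success and one failure.
my $first  = 'id=42';
my $second = 'no digits here';

$first  =~ /id=(\d+)/;   # succeeds and sets $1 to 42
$second =~ /id=(\d+)/;   # fails and leaves $1 untouched
my $stale = $1;          # still 42, left over from the earlier match

# Checking the match result avoids picking up the stale value.
my $checked = $second =~ /id=(\d+)/ ? $1 : undef;

print "stale:   $stale\n";
print "checked: ", (defined $checked ? $checked : 'no match'), "\n";
```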
Judicious use of Perl's defaults can help emphasize the computation and let the mechanism fade into the background. Dropping $vm in favor of $_ as the implicit loop variable and implicit match target makes for a nicer result.
Your comments merely translate from Perl to English. The most helpful comments explain the why, not the what. Also keep in mind Rob Pike's advice on commenting:
If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.
In the assignments from %+, the quotes don't do anything useful. The values are already strings, so remove the quotes.
my $id = $+{id};
my $name = $+{name};
Below is a modified version of your code that captures everything after the number but before [store] into $name. The utf8 pragma declares that your source code contains UTF-8; it says nothing about your input, which is a common misunderstanding. The test below uses a canned echo to simulate the output from vim-cmd on the Swedish VM.
As Tom suggested, I use the Encode module to decode the output that arrives through the SSH connection and encode it for the benefit of the local host before printing it out.
The perlunifaq documentation advises decoding external data into Perl's internal format and then encoding any output just before it's written. I assume that the value returned from $ssh->capture(...) uses UTF-8 encoding, that is, that the remote host is sending UTF-8. We see the expected result because I'm running a modern distribution of Linux and ssh-ing back to it, but in the wild, you may be dealing with some other encoding.
You're able to get away with skipping the calls to decode and encode because Perl's internal format happens to match those of the hosts you're using. In general, however, cutting corners can get you into trouble:
- What if I don't decode?
- What if I don't encode?
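To make the cost of skipping decode a little more concrete, here is a minimal sketch. It assumes the three Swedish letters arrive as raw UTF-8 bytes, which is an assumption about the remote host rather than something Net::OpenSSH guarantees.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "åäö" as raw UTF-8 bytes, the way it might arrive over the wire (assumed).
my $bytes = "\xC3\xA5\xC3\xA4\xC3\xB6";

# Decode into Perl's internal character representation.
my $chars = decode('UTF-8', $bytes);

my $byte_len = length $bytes;   # counts bytes
my $char_len = length $chars;   # counts characters

# Only the decoded string looks like three word characters to the regex engine.
my $bytes_match = $bytes =~ /^\w{3}$/ ? 1 : 0;
my $chars_match = $chars =~ /^\w{3}$/ ? 1 : 0;

print "bytes: $byte_len, chars: $char_len\n";   # bytes: 6, chars: 3
print encode('UTF-8', $chars), "\n";            # encode again just before output
```

Without the decode step, length and \w operate on six bytes instead of three letters, which is the kind of mismatch the two questions above are about.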
Finally, the code!
#! /usr/bin/env perl
use strict;
use utf8;
use warnings;
use Encode;
use Net::OpenSSH;
my %ssh_options = ();
my $ssh = Net::OpenSSH->new('localhost', %ssh_options);
# Create an array and capture the ESX\ESXi output from the current server
#my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
my @getallvms = $ssh->capture(<<EOEcho);
echo -e 'JUNK\n416 TEST Box åäö!"'\\'\\''*# [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx slesGuest vmx-04'
EOEcho
shift @getallvms;
for (@getallvms) {
$_ = decode "utf8", $_, Encode::FB_CROAK;
if (/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix) {
my $id = $+{id};
my $name = $+{name};
print encode("utf8", $id), "\n",
encode("utf8", $name), "\n",
"\n";
}
else {
print "no match\n";
}
}
Output:
416
TEST Box åäö!"''*#
If you know the string you work on is UTF-8 and Net::OpenSSH doesn't (and hence doesn't mark it as such), you can convert it to an internal representation Perl can work on with one of:
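use Encode;
utf8::decode( $in_place );        # decodes the variable in place (core function, no import needed)
$decoded = decode_utf8( $raw );   # returns a decoded copy and leaves $raw alone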
So you have to make sure that Perl understands those names as UTF-8 encoded strings. So far I don't think it does. It is worth reading a comprehensive overview of UTF-8 in Perl.
You can test whether a string carries Perl's internal UTF-8 flag with Encode::is_utf8 and decode it with Encode::decode('UTF-8', $your_string).
UTF-8 is still pretty messy in Perl, IMHO. You have to be pretty patient with it.
To print UTF-8 strings properly, you should use something like this in your script:
BEGIN {
binmode(STDOUT, ':encoding(UTF-8)');
binmode(STDERR, ':encoding(UTF-8)'); # Error messages
}
Once you get Perl to understand your UTF-8 names, you can match them properly with regular expressions too.
Recent Net::OpenSSH releases have native support for charset encoding/decoding in capture methods:
my @getallvms = $ssh->capture({stream_encoding => 'utf8'},
'vim-cmd vmsvc/getallvms');