How can I change my regular expression to read UTF-8?

2023-02-09 11:33 问答作者：

I got very far in a script I am working on only to find out it has a problem reading UTF-8 characters.

I have a contact in Sweden that made a VM on his machine with some UTF-8 in it and when my script hit that VM it lost its mind, but it was able to read all of the other VMs that are in the "normal" charset.

Anyhow, maybe my code will make more sense.

#!/usr/bin/perl
use strict;
use warnings;
#use utf8;
use Net::OpenSSH;

# Create a hash for storing the options needed by Net::OpenSSH
my %ssh_options = (
    port => '22',
    user => 'root',
    password => 'password'
);

# Create a new Net::OpenSSH object
my $ssh = Net::OpenSSH->new('192.168.2.101', %ssh_options);

# Create an array and capture the ESX\ESXi output from开发者_如何学Go the current server
my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
shift @getallvms;
# Process data gathered from server
foreach my $vm (@getallvms) {
    # Match ID, NAME
    $vm =~  m/^(?<id> \d+)\s+(?<name> .+?)\s+/xm;
    my $id = "$+{id}";
    my $name = "$+{name}";
    print "$id\n";
    print "$name\n";
    print "\n";
}

I have narrowed it down to my regular expression as the problem, because here the raw output from the server before regular expression is applied.

416
TEST Box åäö!"''*#

And this is what I get after I apply my regular expression

416
TEST

For some reason the regular expression is not matching, I just don't know why. And the current regular expression in the example is the third attempt at getting it to work.

The FULL line that I am matching looks like this. The way my regular expression was done was because I only need the first two blocks of information, the expression you have wants to copy the entire line.

The code:

432    TEST Box åäö!"''*#   [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx   slesGuest    vmx-04

The subpattern

(?<name> .+?)\s+

in your regular expression means “match and remember one or more non-newline characters, but stop as soon as you find whitespace,” so $name contains TEST because the pattern stopped matching when it saw the space just before Box.

The VI Toolkit wiki gives an example of the getallvms subcommand's output:

# vmware-vim-cmd -H 10.10.10.10 -U root -P password /vmsvc/getallvms
Vmid    Name               File                 Guest OS       Version   Annotation
64     bartPE    [store] BartPE/BartPE.vmx     winXPProGuest     vmx-04
96     trustix   [store] Trustix/Trustix.vmx   otherLinuxGuest   vmx-04

The case is slightly different from the example in your question, but it appears that we can look for [store] as a bumper for the match:

/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix

The non-greedy quantifier +? means match one or more of something, but the match wants to hand control to the rest of the pattern as quickly as possible. Remember that [ has a special meaning in regular expressions, but the pattern \[ matches a literal rather than introducing a character class.

I think of this technique as bookending or tacking-and-stretching. If you want to extract a chunk of text that's difficult to characterize, look for surrounding features that are easy to match—often as simple as ^ or $. Then use a stretchy pattern to grab everything in between, usually (.+) or (.+?). Read the “Quantifiers” section of the perlre documentation for an explanation of your many options.

This fixes the immediate problem, and you can also add polish in a few areas.

Do not use $1, $2, and friends unconditionally! Always test that the pattern matches before using capture variables. For example

if (/(foo|bar|baz)/) {
  print "got $1\n";
}
else {
  print "no match\n";
}

An unprotected print $1 can produce surprising results that are tough to debug.

Judicious use of Perl's defaults can help emphasize the computation and lets the mechanism fade into the background. Dropping $vm in favor of $_ as the implicit loop variable and implicit match target makes for a nicer result.

Your comments merely translate from Perl to English. The most helpful comments explain the why, not the what. Also keep in mind Rob Pike's advice on commenting:

If your code needs a comment to be understood, it would be better to rewrite it so it's easier to understand.

In the assignments from %+, the quotes don't do anything useful. The values are already strings, so remove the quotes.

my $id   = $+{id};
my $name = $+{name};

Below is a modified version of your code that captures everything after the number but before [store] into $name. The utf8 pragma declares that your source code—not, as with a common mistake, your input—contains UTF-8. The test below simulates with a canned echo the output from vim-cmd on the Swedish VM.

As Tom suggested, I use the Encode module to decode the output that arrives through the SSH connection and encode it for benefit of the local host before printing it out.

The perlunifaq documentation advises decoding external data into Perl's internal format and then encoding any output just before it's written. I assume that the value returned from $ssh->capture(...) uses UTF-8 encoding, that is, that the remote host is sending UTF-8. We see the expected result because I'm running a modern distribution of Linux and ssh-ing back to it, but in the wild, you may be dealing with some other encoding.

You're able to get away with skipping the calls to decode and encode because Perl's internal format happens to match those of the hosts you're using. In general, however, cutting corners can get you into trouble:

What if I don't decode?
What if I don't encode?

Finally, the code!

#! /usr/bin/env perl

use strict;
use utf8;
use warnings;

use Encode;
use Net::OpenSSH;

my %ssh_options = ();
my $ssh = Net::OpenSSH->new('localhost', %ssh_options);

# Create an array and capture the ESX\ESXi output from the current server
#my @getallvms = $ssh->capture('vim-cmd vmsvc/getallvms');
my @getallvms = $ssh->capture(<<EOEcho);
echo -e 'JUNK\n416 TEST Box åäö!"'\\'\\''*#    [Store] TEST Box +w6XDpMO2IQ-_''_+Iw/TEST Box +w6XDpMO2IQ _''_+Iw.vmx   slesGuest    vmx-04'
EOEcho
shift @getallvms;

for (@getallvms) {
  $_ = decode "utf8", $_, Encode::FB_CROAK;

  if (/^(?<id> \d+) \s+ (?<name> .+?) \s+ \[store]/mix) {
    my $id   = $+{id};
    my $name = $+{name};
    print encode("utf8", $id),   "\n",
          encode("utf8", $name), "\n",
          "\n";
  }
  else {
    print "no match\n";
  }
}

Output:

416
TEST Box åäö!"''*#

If you know the string you work on is UTF-8 and Net::OpenSSH doesn't (and hence doesn't mark it as such), you can convert it to an internal representation Perl can work on with one of:

use Encode;
decode_utf8( $in_place );
$decoded = decode_utf8( $raw );

So you have make sure, that Perl understand those names as UTF-8 encoded strings. So far I don't think it has. A comprehensive overview about UTF-8 in Perl.

You can test your strings unicodeness with Encode::is_utf8 and decode them with Encode::decode('UTF-8', $your_string).

UTF-8 is pretty messy still in Perl, IMHO. You must have pretty patient with it.

To print UTF-8 strings out in pretty way, you should use something like that in your script:

BEGIN {
   binmode(STDOUT, ':encoding(UTF-8)');
   binmode(STDERR, ':encoding(UTF-8)');  # Error messages
}

If you got Perl understand your UTF-8 names, you could regex them properly too.

Recent Net::OpenSSH releases have native support for charset encoding/decoding in capture methods:

my @getallvms = $ssh->capture({stream_encoding => 'utf8'},
                              'vim-cmd vmsvc/getallvms');

继续阅读：perl regex utf-8

How can I change my regular expression to read UTF-8?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？