How can Perl split a line on whitespace except when the whitespace is in double quotes?
I have the following string:
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
I need a regular expression to split this line but ignore spaces in double quotes in Perl.
The following is what I tried but it does not work.
(".*?"|\S+)
Once upon a time I also tried to re-invent the wheel, and solve this myself.
Now I just use Text::ParseWords and let it do the job for me.
Update: It looks like the fields are actually tab separated, not space separated. If that is guaranteed, just split on \t.
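As a minimal sketch of the tab-splitting suggestion: the line below is a hypothetical tab-separated rendering of the string from the question (built with join purely for illustration), so treat it as an assumption about the real input rather than a confirmed format.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the real input separates fields with single tab characters.
# This line is a hypothetical reconstruction of the question's string.
my $line = join "\t",
    'StartProgram', 1, '""C:\Program Files\ABC\ABC XYZ""',
    'CleanProgramTimeout', 1, 30;

# Spaces inside the path are left alone because only tabs delimit fields.
my @fields = split /\t/, $line;

print "<$_>\n" for @fields;
```

If the file mixes tabs and spaces as separators, this will not be enough and one of the regex-based approaches is the safer choice.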
First, let's see why (".*?"|\S+) "does not work". Specifically, look at ".*?". That means zero or more characters enclosed in double quotes. Well, the field that is giving you problems is ""C:\Program Files\ABC\ABC XYZ"". Note that each "" at the beginning and end of that field will match ".*?" because "" consists of zero characters surrounded by double quotes.
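To see that behaviour directly, here is a small sketch that applies the pattern from the question with the /g flag and prints each token in angle brackets. The variable names are only illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# Collect every token the original pattern captures, in order.
my @tokens = $line =~ /(".*?"|\S+)/g;

# Bracket each token so an empty pair of quotes is easy to spot.
print "<$_>\n" for @tokens;
```

The opening "" is consumed on its own as an empty quoted string, so the path falls through to \S+ and is broken at every space, and the closing "" ends up glued to XYZ.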
It is better to match as specifically as possible rather than splitting. So, if you have a configuration file with directives and a fixed format, form a regular expression match that is as close to the format you are trying to match as possible.
Move the quotation marks outside of the capturing parentheses if you don't want them.
#!/usr/bin/perl
use strict;
use warnings;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my @parts = $s =~ m{\A(\w+) ([0-9]) (""[^"]+"") (\w+) ([0-9]) ([0-9]{2})};
use Data::Dumper;
print Dumper \@parts;
Output:
$VAR1 = [
'StartProgram',
'1',
'""C:\\Program Files\\ABC\\ABC XYZ""',
'CleanProgramTimeout',
'1',
'30'
];
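As mentioned above, the quotation marks can be left out of the result by moving them outside the capturing parentheses. A minimal sketch of that variation, using the same string and otherwise the same pattern:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};

# The doubled quotes are still required to match, but they now sit
# outside the third capture group, so only the path itself is captured.
my @parts = $s =~ m{\A(\w+) ([0-9]) ""([^"]+)"" (\w+) ([0-9]) ([0-9]{2})};

print "$parts[2]\n";
```

Note that this version only matches lines where the path is wrapped in doubled quotes. An unquoted path would need an alternation like the one in the longer script.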
In that vein, here is a more involved script:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @strings = split /\n/, <<'EO_TEXT';
StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30
EO_TEXT
my $re = qr{
(?<directive>StartProgram)\s+
(?<instance>[0-9][0-9]?)\s+
(?<path>"".+?""|\S+)\s+
(?<timeout_directive>CleanProgramTimeout)\s+
(?<timeout_instance>[0-9][0-9]?)\s+(?<timeout_seconds>[0-9]{2})
}x;
for (@strings) {
if ( $_ =~ $re ) {
print Dumper \%+;
}
}
Output:
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => '""C:\\Program Files\\ABC\\ABC XYZ""',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
$VAR1 = {
'timeout_directive' => 'CleanProgramTimeout',
'timeout_seconds' => '30',
'path' => 'c:\\opt\\perl',
'directive' => 'StartProgram',
'timeout_instance' => '1',
'instance' => '1'
};
Update: I cannot get Text::Balanced or Text::ParseWords to parse this correctly. I suspect the problem is the repeated quotation marks that delimit the substring that should not be split. The following code is my best (not very good) attempt at solving the generic problem by using split and then selectively re-gathering parts of the string.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $s = q{StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30};
my $t = q{StartProgram 1 c:\opt\perl CleanProgramTimeout 1 30};
print Dumper parse_line($s);
print Dumper parse_line($t);
sub parse_line {
my ($line) = @_;
my @parts = split /(\s+)/, $line;
my @real_parts;
for (my $i = 0; $i < @parts; $i += 1) {
unless ( $parts[$i] =~ /^""/ ) {
push @real_parts, $parts[$i] if $parts[$i] =~ /\S/;
next;
}
my $part;
do {
$part .= $parts[$i++];
} until ( $part =~ /""$/ or $i >= @parts );  # stop at end of input if the closing "" is missing
push @real_parts, $part;
}
return \@real_parts;
}
my $x = 'StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';
my @parts = $x =~ /("".*?""|[^\s]+?(?>\s|$))/g;
my $str = 'StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30';
print "str:$str\n";
@A = $str =~ /(".+"|\S+)/g;
foreach my $l (@A) {
print "<$l>\n";
}
That gives me:
$ ./test.pl
str:StartProgram 1 ""C:\Program Files\ABC\ABC XYZ"" CleanProgramTimeout 1 30
<StartProgram>
<1>
<""C:\Program Files\ABC\ABC XYZ"">
<CleanProgramTimeout>
<1>
<30>