Regex: Repeated capturing groups
I have to parse some tables from an ASCII text file. Here's a partial sample:
QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212
RECKITTBEN 192.50 209.00 192.50 201.80 5.21 34 2850 5.707
RUPALIINS 150.00 159.00 150.00 156.25 6.29 4 80 .125
SALAMCRST 164.00 164.75 163.00 163.25 -.45 80 8250 13.505
SINGERBD 779.75 779.75 770.00 773.00 -.89 8 95 .735
SONARBAINS 68.00 69.00 67.50 68.00 .74 11 3050 2.077
The table consists of 1 column of text and 8 columns of floating point numbers. I'd like to capture each column via regex.
I'm pretty new to regular expressions. Here's the faulty regex pattern I came up with:
(\S+)\s+(\s+[\d\.\-]+){8}
But the pattern captures only the first and the last columns. RegexBuddy also emits the following warning:
You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations.
I've consulted their help file, but I don't have a clue as to how to solve this.
How can I capture each column separately?
In C# (modified from this example):
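    // Requires: using System; using System.Text.RegularExpressions;
    string input = "QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212";
    // The repeated group already starts with \s+, so no separate \s+ precedes it;
    // otherwise single-space separators would not match.
    string pattern = @"^(\S+)(\s+[\d.-]+){8}$";
    Match match = Regex.Match(input, pattern, RegexOptions.Multiline);
    if (match.Success) {
        Console.WriteLine("Matched text: {0}", match.Value);
        for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
            // Group.Value holds only the last iteration of a repeated group.
            Console.WriteLine(" Group {0}: {1}", ctr, match.Groups[ctr].Value);
            int captureCtr = 0;
            // Group.Captures holds every iteration, in order.
            foreach (Capture capture in match.Groups[ctr].Captures) {
                Console.WriteLine(" Capture {0}: {1}",
                                  captureCtr, capture.Value);
                captureCtr++;
            }
        }
    }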
Output:
Matched text: QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212
...
Group 2: 1.212
Capture 0: 11.00
Capture 1: 11.10
Capture 2: 11.00
...etc.
The warning appears because your capturing group matches multiple times (8, as you specified), but a group can hold only one value. It keeps the last value matched.
As described in question 1313332, retrieving these multiple matches is generally not possible with a regular expression, although .NET and Perl 6 have some support for it.
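To see the "last iteration only" behaviour outside RegexBuddy, here is a minimal Python sketch. It uses a simplified variant of the question's pattern (the extra `\s+` before the repeated group is dropped so that single-space separators match); that simplification is my assumption, not part of the original pattern.

```python
import re

line = "QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212"

# Simplified variant of the question's pattern: one text column followed by
# a capturing group that is repeated 8 times.
m = re.match(r"(\S+)(\s+[-.\d]+){8}$", line)

name = m.group(1)         # the text column
last_column = m.group(2)  # only the 8th iteration survives, with its leading space

print(repr(name), repr(last_column))
```

Python's `re` module, unlike .NET, keeps no record of the earlier iterations, so the seven other columns are not recoverable from this match object.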
The warning suggests that you could put another group around the whole set, like this:
(\S+)\s+((\s+[\d\.\-]+){8})
You would then be able to see all the columns, but of course they would not be separated. Because it's generally not possible to capture them separately, the more common intention is to capture all of it, and the warning helps remind you of this.
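A quick Python check of what the extra group buys you might look like the following. Again, the pattern is a simplified variant (no separate `\s+` before the groups) so that it matches the single-space sample line.

```python
import re

line = "QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212"

# Group 2 wraps the whole repetition; group 3 is the repeated group itself.
m = re.match(r"(\S+)((\s+[-.\d]+){8})$", line)

all_numbers = m.group(2)  # all 8 columns as one unseparated string
last_only = m.group(3)    # still just the final iteration

print(repr(all_numbers))
print(repr(last_only))
```

Group 2 contains every numeric column, but as a single string that you still have to split yourself.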
Unfortunately, you need to repeat the (…) 8 times to get each column separately:
^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$
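If typing the group out 8 times feels error-prone, the same pattern can be assembled in code. This is only a sketch in Python; the names `NUM`, `pattern`, `sample` and `rows` are mine, and the sample is two lines taken from the question.

```python
import re

NUM = r"([-.\d]+)"  # one numeric column, same character class as above

# Join 8 copies of the group with \s+ to rebuild the long pattern.
pattern = re.compile(r"^(\S+)\s+" + r"\s+".join([NUM] * 8) + r"$", re.M)

sample = (
    "QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212\n"
    "RECKITTBEN 192.50 209.00 192.50 201.80 5.21 34 2850 5.707"
)

# Each match yields 9 groups: the name plus the 8 numeric columns.
rows = [m.groups() for m in pattern.finditer(sample)]

for row in rows:
    print(row)
```

The compiled pattern is character-for-character the one written out by hand, so each column still ends up in its own group.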
If you can use code around the regex, you can first match the numeric columns as a whole:
>>> import re
>>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
>>> allres = rx1.findall(theAsciiText)  # theAsciiText: the file contents as one string
then split the columns on whitespace:
>>> [[p] + q.split() for p, q in allres]
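For completeness, a self-contained version of the same match-then-split idea that also converts the numeric columns to floats. The variable names and the two-line sample are placeholders of mine. Note that `[-.\d]+` would also accept strings such as `-` or `..`, on which `float()` raises `ValueError`, so treat this as a sketch for well-formed input.

```python
import re

# Two lines from the question's sample, standing in for the real file contents.
the_ascii_text = (
    "QSMDRYCELL 11.00 11.10 11.00 11.00 -.90 11 11000 1.212\n"
    "RUPALIINS 150.00 159.00 150.00 156.25 6.29 4 80 .125\n"
)

# Group 1: the text column. Group 2: all 8 numeric columns as one string.
rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)

# Split the numeric part on whitespace and convert each piece to float.
table = [
    (name, [float(x) for x in numbers.split()])
    for name, numbers in rx1.findall(the_ascii_text)
]

for name, values in table:
    print(name, values)
```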