开发者

Regex: Repeated capturing groups

I have to parse some tables from an ASCII text file. Here's a partial sample:

QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707
RUPALIINS   150.00  159.00  150.00  156.25    6.29       4        80      .125
SALAMCRST   164.00  164.75  163.00  163.25    -.45      80      8250    13.505
SINGERBD    779.75  779.75  770.00  773.00    -.89       8        95      .735
SONARBAINS   68.00   69.00   67.50   68.00     .74      11      3050     2.077

The table consists of 1 column of text and 8 columns of floating point numbers. I'd like to capture each column via regex.

I'm pretty new to regular expressions. Here's the faulty regex pattern I came up with:

(\S+)\s+(\s+[\d\.\-]+){8}

But the pattern captures only the first and the last columns. RegexBuddy also emits the following warning:

You repeated the capturing group开发者_Python百科 itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations.

I've consulted their help file, but I don't have a clue as to how to solve this.

How can I capture each column separately?


In C# (modified from this example):

string input = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212";
string pattern = @"^(\S+)\s+(\s+[\d.-]+){8}$";
Match match = Regex.Match(input, pattern, RegexOptions.MultiLine);
if (match.Success) {
   Console.WriteLine("Matched text: {0}", match.Value);
   for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
      Console.WriteLine("   Group {0}:  {1}", ctr, match.Groups[ctr].Value);
      int captureCtr = 0;
      foreach (Capture capture in match.Groups[ctr].Captures) {
         Console.WriteLine("      Capture {0}: {1}", 
                           captureCtr, capture.Value);
         captureCtr++; 
      }
   }
}

Output:

Matched text: QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
...
    Group 2:      1.212
         Capture 0:  11.00
         Capture 1:    11.10
         Capture 2:    11.00
...etc.


If you want to know what the warning is appearing for, it's because your capture group matches multiple times (8, as you specified) but the capture variable can only have one value. It is assigned the last value matched.

As described in question 1313332, retrieving these multiple matches is generally not possible with a regular expression, although .NET and Perl 6 have some support for it.

The warning suggests that you could put another group around the whole set, like this:

(\S+)\s+((\s+[\d\.\-]+){8})

You would then be able to see all the columns, but of course they would not be separated. Because it's generally not possible to capture them separately, the more common intention is to capture all of it, and the warning helps remind you of this.


Unfortunately you need to repeat the (…) 8 times to get each column separately.

^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$

If code is possible, you can first match those numeric columns as a whole

>>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
>>> allres = rx1.findall(theAsciiText)

then split the columns by spaces

>>> [[p] + q.split() for p, q in allres]
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜