Regex: Repeated capturing groups

2023-01-06 02:12

I have to parse some tables from an ASCII text file. Here's a partial sample:

QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707
RUPALIINS   150.00  159.00  150.00  156.25    6.29       4        80      .125
SALAMCRST   164.00  164.75  163.00  163.25    -.45      80      8250    13.505
SINGERBD    779.75  779.75  770.00  773.00    -.89       8        95      .735
SONARBAINS   68.00   69.00   67.50   68.00     .74      11      3050     2.077

The table consists of 1 column of text and 8 columns of floating point numbers. I'd like to capture each column via regex.

I'm pretty new to regular expressions. Here's the faulty regex pattern I came up with:

(\S+)\s+(\s+[\d\.\-]+){8}

But the pattern captures only the first and the last columns. RegexBuddy also emits the following warning:

You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations.

I've consulted their help file, but I don't have a clue as to how to solve this.

How can I capture each column separately?

In C# (modified from this example):

string input = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212";
string pattern = @"^(\S+)\s+(\s+[\d.-]+){8}$";
Match match = Regex.Match(input, pattern, RegexOptions.Multiline);
if (match.Success) {
   Console.WriteLine("Matched text: {0}", match.Value);
   for (int ctr = 1; ctr < match.Groups.Count; ctr++) {
      Console.WriteLine("   Group {0}:  {1}", ctr, match.Groups[ctr].Value);
      int captureCtr = 0;
      foreach (Capture capture in match.Groups[ctr].Captures) {
         Console.WriteLine("      Capture {0}: {1}", 
                           captureCtr, capture.Value);
         captureCtr++; 
      }
   }
}

Output:

Matched text: QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212
...
    Group 2:      1.212
         Capture 0:  11.00
         Capture 1:    11.10
         Capture 2:    11.00
...etc.

The warning appears because your capturing group matches multiple times (8, as you specified), but a capturing group can hold only one value. It is assigned the last value matched.

As described in question 1313332, retrieving these multiple matches is generally not possible with a regular expression, although .NET and Perl 6 have some support for it.
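The "last iteration wins" behaviour is easy to reproduce. Here is a minimal sketch in Python, whose `re` module keeps only the final iteration of a repeated group; the variable names are just for illustration:

```python
import re

line = "QSMDRYCELL   11.00   11.10   11.00   11.00    -.90      11     11000     1.212"

# The question's pattern: group 2 is repeated eight times.
m = re.match(r"(\S+)\s+(\s+[\d.\-]+){8}", line)

name = m.group(1)
# Group 2 holds only the text of its final iteration (with leading spaces).
last_column = m.group(2).strip()
print(name, last_column)
```

The seven earlier iterations did take part in the match, but Python's `re` gives no way to read them back from the match object.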

The warning suggests that you could put another group around the whole set, like this:

(\S+)\s+((\s+[\d\.\-]+){8})

You would then be able to see all the columns, but of course they would not be separated. Because it's generally not possible to capture them separately, the more common intention is to capture all of it, and the warning helps remind you of this.
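As a rough illustration of what the wrapped pattern gives you, assuming Python's `re` module and one row from the question's sample:

```python
import re

line = "RECKITTBEN  192.50  209.00  192.50  201.80    5.21      34      2850     5.707"

# Group 2 is the new outer group; it wraps the repeated inner group 3.
m = re.match(r"(\S+)\s+((\s+[\d.\-]+){8})", line)

# All eight numeric columns, but as one string with the padding still in it.
all_columns = m.group(2)
print(repr(all_columns))
```

The outer group spans everything the eight iterations consumed, so the columns are all there; they just have not been separated yet.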

Unfortunately you need to repeat the (…) 8 times to get each column separately.

^(\S+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)\s+([-.\d]+)$
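If typing the group out eight times feels error-prone, the same pattern can be assembled in code. This is only a sketch in Python; `NUM` and `pattern` are names chosen here, not anything from the original answer:

```python
import re

NUM = r"([-.\d]+)"  # one capturing group per numeric column

# Join eight copies of the group with \s+ instead of writing them by hand.
pattern = re.compile(r"^(\S+)\s+" + r"\s+".join([NUM] * 8) + r"$")

line = "SINGERBD    779.75  779.75  770.00  773.00    -.89       8        95      .735"
m = pattern.match(line)

# groups() returns the name followed by the eight numeric columns.
columns = m.groups()
print(columns)
```

The resulting regex is the same string as the hand-written one above, so each column still lands in its own numbered group.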

If you can post-process the match in code, you can first match the numeric columns as a whole

>>> import re
>>> rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)
>>> allres = rx1.findall(theAsciiText)

then split the columns by spaces

>>> [[p] + q.split() for p, q in allres]
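Put together as a standalone script, the two steps might look like the sketch below. The two sample rows stand in for the real file contents, and converting the numbers to `float` is an extra step assumed here, not part of the original answer:

```python
import re

# Two rows from the question's sample stand in for the real file contents.
the_ascii_text = (
    "RUPALIINS   150.00  159.00  150.00  156.25    6.29       4        80      .125\n"
    "SONARBAINS   68.00   69.00   67.50   68.00     .74      11      3050     2.077\n"
)

# Group 1: the text column. Group 2: all eight numeric columns as one string.
rx1 = re.compile(r'^(\S+)\s+((?:[-.\d]+\s+){7}[-.\d]+)$', re.M)

rows = []
for name, numbers in rx1.findall(the_ascii_text):
    # split() with no argument splits on any run of whitespace.
    rows.append([name] + [float(n) for n in numbers.split()])

for row in rows:
    print(row)
```

Splitting in code keeps the regex short and makes it easy to change the number of columns later.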
