Regex: match, but not if in a comment

2023-03-21 17:15 Q&A author:

I have a file of data fields, which may contain comments, like below:

id, data, data, data
101 a, b, c
102 d, e, f
103 g, h, i // has to do with 101 a, b, c
104 j, k, l
//105 m, n, o
// 106 p, q, r

As you can see in the first comment above, there are direct references to a matching pattern. Now, I want to capture 103 and its three data fields, but I don't want to capture what's in the comments.

I've tried negative lookbehind to exclude 105 and 106, but I can't come up with a regex to capture both.

(?<!//)(\b\d+\b),\s(data),\s(data),\s(data)

This will capture everything except 105, but specifying

(?<!//\s*) or (?<!//.*)

as my attempt to exclude a comment with any whitespace or any characters invalidates my entire regex.

I have a feeling I need a crafty use of an anchor, or I need to wrap what I want in a capture group and make a reference to it (like with $1) in my lookbehind.

If this is another case of "regular expressions don't support recursion" because it's a regular language (a la automata theory), please point that out.

Is it possible to exclude the comments in 103, and lines 105 and 106, using a regular expression? If so, how?

The easy way out is to replace \s*//.* with the empty string before you begin.

This will remove all the (single-line) comments from your input, and you can go on with a simple expression to match what you actually want.
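A minimal Python sketch of this strip-then-match idea. It assumes every row has a comma after the id, as the patterns in this thread expect (the sample as pasted shows none), and the names SAMPLE, strip_comments and parse_rows are made up for illustration.

```python
import re

# Assumed input: the question's sample, with a comma after each id.
SAMPLE = """\
id, data, data, data
101, a, b, c
102, d, e, f
103, g, h, i // has to do with 101, a, b, c
104, j, k, l
//105, m, n, o
// 106, p, q, r
"""

# Unanchored on purpose: once comments are gone, nothing else can match.
ROW = re.compile(r"(\d+),\s(\w+),\s(\w+),\s(\w+)")


def strip_comments(text):
    """Replace optional whitespace plus a // comment with the empty string."""
    return re.sub(r"\s*//.*", "", text)


def parse_rows(text):
    """Return (id, field, field, field) tuples found outside comments."""
    return ROW.findall(strip_comments(text))


if __name__ == "__main__":
    for row in parse_rows(SAMPLE):
        print(row)
```

Because \s* can also match a newline, lines that are entirely comments disappear together with their line break, which is harmless here.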

The alternative would be to use look-ahead instead of look-behind:

^(?!//)(\b\d+\b),\s(data),\s(data),\s(data)

In your case it would even work to just anchor the regex because it is clear that the first thing on a line must be a digit:

^(\b\d+\b),\s(data),\s(data),\s(data)

Some regex engines (the one in .NET, for example) support variable-length look-behinds. Yours does not seem to, which is why (?<!//\s*) fails for you.
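The question does not say which engine is in use, but Python's built-in re module is one example with this restriction, so it can illustrate the failure. The helper name compiles is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A look-behind of exactly two characters is accepted.
fixed_width_ok = compiles(r"(?<!//)\b\d+\b")

# A look-behind of unbounded length is rejected at compile time.
variable_width_ok = compiles(r"(?<!//\s*)\b\d+\b")

if __name__ == "__main__":
    print(fixed_width_ok, variable_width_ok)
```

In such an engine the whole pattern is rejected when it is compiled, which matches the "invalidates my entire regex" symptom described in the question.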

You could simply anchor the regex to the start of the line:

(?m)^(\d+),\s(\S+),\s(\S+),\s(\S+)
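A short Python sketch comparing the anchored pattern with an unanchored one. As before, it assumes a comma after each id; SAMPLE, ANCHORED and UNANCHORED are illustrative names.

```python
import re

# Assumed input: the question's sample, with a comma after each id.
SAMPLE = """\
id, data, data, data
101, a, b, c
102, d, e, f
103, g, h, i // has to do with 101, a, b, c
104, j, k, l
//105, m, n, o
// 106, p, q, r
"""

# (?m) makes ^ match at the start of every line, not only at the start of the text.
ANCHORED = re.compile(r"(?m)^(\d+),\s(\S+),\s(\S+),\s(\S+)")

# The same pattern without the anchor, for comparison.
UNANCHORED = re.compile(r"(\d+),\s(\S+),\s(\S+),\s(\S+)")

anchored_rows = ANCHORED.findall(SAMPLE)
unanchored_rows = UNANCHORED.findall(SAMPLE)

if __name__ == "__main__":
    print(len(anchored_rows), "rows with the anchor")
    print(len(unanchored_rows), "rows without it")
```

The anchored pattern skips lines 105 and 106 because they start with a slash, and it ignores the reference inside the comment on line 103 because that text is not at the start of a line.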

It seems to me you could just anchor the expression at the beginning of the line (to get all the data):

^(\d+),\s(data),\s(data),\s(data)\s*(?://|$)

Or maybe you can use a proper CSV parser which can handle comments.
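As a rough illustration of the parser route in Python: the standard csv module has no comment option of its own, so this sketch cuts each line at the first // before handing it to csv.reader. The function name read_rows and the comma-after-id layout are assumptions.

```python
import csv

# Assumed input: the question's sample, with a comma after each id.
SAMPLE = """\
id, data, data, data
101, a, b, c
103, g, h, i // has to do with 101, a, b, c
//105, m, n, o
// 106, p, q, r
"""


def read_rows(lines):
    """Parse comma-separated rows, ignoring // comments and empty lines."""
    # Keep only the part of each line before the first "//".
    cleaned = (line.split("//", 1)[0].strip() for line in lines)
    non_empty = (line for line in cleaned if line)
    # skipinitialspace=True ignores the space that follows each comma.
    return list(csv.reader(non_empty, skipinitialspace=True))


rows = read_rows(SAMPLE.splitlines())

if __name__ == "__main__":
    for row in rows:
        print(row)
```

The header line comes back as the first row, so a caller would skip it or use it for field names.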

Another way, which I have just used in a text editor whose regex engine has no look-ahead/look-behind support, is to use a sequence like this:

^[^\r\n/]*(/[^/])?[^\r\n/]*(/[^/])?my_search_sequence

It tolerates single / characters separated by non-/ characters, up to a maximum of 2 of them. If you want more, then just add more:

^[^\r\n/]*(/[^/])?[^\r\n/]*(/[^/])?[^\r\n/]*(/[^/])?my_search_sequence

and so on.

The longer the regex, the less likely it is that your search term sits behind a sequence of slashes the pattern cannot handle.
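A minimal Python sketch of this look-around-free approach. The helper name outside_comment and its max_slashes parameter are hypothetical. The sketch also appends a final run of non-slash characters, which is not in the pattern above, so that ordinary text may follow the last single slash.

```python
import re


def outside_comment(search, max_slashes=2):
    """Build a pattern matching `search` only when no // precedes it on the line."""
    # One unit: a run of non-slash characters, optionally followed by a lone slash.
    unit = r"[^\r\n/]*(?:/[^\r\n/])?"
    # Repeat the unit once per tolerated slash, then allow a final run of text.
    prefix = unit * max_slashes + r"[^\r\n/]*"
    return re.compile(r"(?m)^" + prefix + re.escape(search))


if __name__ == "__main__":
    pattern = outside_comment("101")
    for line in ("101, a, b, c",
                 "103, g, h, i // has to do with 101, a, b, c",
                 "a/b/c 101"):
        print(repr(line), bool(pattern.search(line)))
```

A line holding more single slashes than max_slashes before the search term is not matched, which is the limitation the answer describes.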
