How to extract regex comment
I have a regex like this:
(?<!(\w/))$#Cannot end with a word and slash
I would like to extract the comment from the end. While the example does not reflect this case, there could be a regex that includes matching on hashes:
\##value must be a hash
What regex would extract the comment while staying safe against patterns that contain #'s that are not comments?
Here's a .NET-flavored regex for partly parsing .NET-flavored patterns, which should get pretty close:
\A
(?>
    \\.                          # Capture an escaped character
  |                              # OR
    \[\^?                        # a character class
    (?:\\.|[^\]])*               # which may also contain escaped characters
    \]
  |                              # OR
    \(\?(?# inline comment!)\#
    (?<Comment>[^)]*)
    \)
  |                              # OR
    \#(?<Comment>.*$)            # a common comment!
  |                              # OR
    [^\[\\#]                     # capture any regular character - not \, [ or #
)*
\z
Luckily, in .NET each capturing group remembers all of its captures, not just the last one, so we can find every capture of the Comment group in a single parse. The regex parses regular expressions only partially: just enough to find the comments.
Here's how you use the result:
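// pattern is the regex text to inspect; commentParser (a name assumed here)
// holds the comment-finding regex shown above.
Match parsed = Regex.Match(pattern, commentParser,
                           RegexOptions.IgnorePatternWhitespace |
                           RegexOptions.Multiline);
if (parsed.Success)
{
    // Every capture of the Comment group is kept, not just the last one.
    foreach (Capture capture in parsed.Groups["Comment"].Captures)
    {
        Console.WriteLine(capture.Value);
    }
}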
Working example: http://ideone.com/YP3yt
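For anyone who wants to try this without following the link, here is a minimal self-contained sketch that puts the pattern and the usage code together. It is written as C# top-level statements (C# 9 or later). The names CommentParser and ExtractComments are made up for this illustration, and the sample input with an inline comment is an assumption added to show multiple captures.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// The comment-finding pattern from this answer, stored as a verbatim string.
// CommentParser is a name chosen for this sketch only.
const string CommentParser = @"
\A
(?>
    \\.                    # an escaped character
  |
    \[\^?                  # a character class
    (?:\\.|[^\]])*         # which may also contain escaped characters
    \]
  |
    \(\?\#                 # an inline comment group
    (?<Comment>[^)]*)
    \)
  |
    \#(?<Comment>.*$)      # a comment running to the end of the line
  |
    [^\[\\#]               # any other regular character
)*
\z";

// Hypothetical helper: returns every comment found in a regex pattern,
// assuming the whole pattern is in IgnorePatternWhitespace mode.
List<string> ExtractComments(string pattern)
{
    var comments = new List<string>();
    Match parsed = Regex.Match(pattern, CommentParser,
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline);
    if (parsed.Success)
    {
        foreach (Capture capture in parsed.Groups["Comment"].Captures)
        {
            comments.Add(capture.Value);
        }
    }
    return comments;
}

// The two patterns from the question, plus one with an inline comment.
foreach (string pattern in new[]
{
    @"(?<!(\w/))$#Cannot end with a word and slash",
    @"\##value must be a hash",
    @"a(?#first)b#second",
})
{
    Console.WriteLine(string.Join(" | ", ExtractComments(pattern)));
}
```

The escaped \# in the second pattern is consumed by the "escaped character" branch, so only the text after the unescaped # is reported as a comment.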
One last word of caution: this regex assumes the whole pattern is in IgnorePatternWhitespace mode. When it isn't set, all # are matched literally. Keep in mind the flag might change multiple times in a single pattern. In (?-x)#(?x)#comment, for example, regardless of IgnorePatternWhitespace, the first # is matched literally, (?x) turns the IgnorePatternWhitespace flag back on, and the second # is ignored.
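The flag switching can be checked directly. The sketch below is only an illustration: the \A and \z anchors are not part of the example above and are added here so that the effect is easy to observe. With them, the pattern should behave like \A#\z.

```csharp
using System;
using System.Text.RegularExpressions;

// (?-x) switches IgnorePatternWhitespace off, so the first # is a literal.
// (?x) switches it back on, so "#comment" is treated as a comment.
// The \A and \z anchors are added for this sketch only.
var regex = new Regex(@"\A(?-x)#(?x)\z#comment",
                      RegexOptions.IgnorePatternWhitespace);

// A single hash is the only input the pattern accepts.
bool matchesHash = regex.IsMatch("#");

// The word "comment" is not part of the pattern, so this input is rejected.
bool matchesHashComment = regex.IsMatch("#comment");

Console.WriteLine($"{matchesHash} {matchesHashComment}");
```

A comment extractor that assumes a single global mode would read everything after the first # as a comment here, which is exactly the case the caution is about.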
If you want a robust solution you can use a regex-language parser.
You can probably adapt the .NET source code and extract a parser:
- Reference Source - RegexParser.cs
- GitHub - RegexParser.cs
Something like this should work (if you run it separately on each line of the regex). The comment itself (if it exists) will be in the third capturing group.
/^((\\.)|[^\\\#])*\#(.*)/
(\\.) matches an escaped character, and [^\\\#] matches any character that is neither a backslash nor a hash; together with the * quantifier they match the entire line before the comment. The rest of the regex, \#(.*), detects the comment marker and captures the text after it.
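As a rough sketch of how this could be used from C# (top-level statements, C# 9 or later), the pattern is applied to each line separately and the third group is read. The helper name TryGetComment and the sample lines are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The per-line pattern from above, without the surrounding / delimiters.
var lineComment = new Regex(@"^((\\.)|[^\\\#])*\#(.*)");

// Hypothetical helper: reports whether the line has a comment and, if so,
// returns the text after the first unescaped #.
bool TryGetComment(string line, out string comment)
{
    Match m = lineComment.Match(line);
    comment = m.Success ? m.Groups[3].Value : "";
    return m.Success;
}

// Run the pattern separately on each line of the regex.
string[] lines =
{
    @"(?<!(\w/))$#Cannot end with a word and slash",
    @"\##value must be a hash",
    @"\d+",
};

foreach (string line in lines)
{
    if (TryGetComment(line, out string comment))
    {
        Console.WriteLine(comment);
    }
    else
    {
        Console.WriteLine("(no comment)");
    }
}
```

Note that this pattern knows nothing about character classes, so a line such as [#]x would be reported as having the comment ]x.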
One of the overlooked options in regex parsing is the RightToLeft mode.
extract the comment from the end.
One can simplify the pattern by working from the end of the line toward the beginning, for example:
^
.+? # Workable regex
(?<Comment> # Comment group
(?<!\\) # Not a comment if escaped.
\# # Anchor for actual comment
[^#]+ # The actual commented text to stop at #
)? # We may not have a comment
$
Use the above pattern in C# with these options: RegexOptions.RightToLeft | RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline.
there could be a regex that includes matching on hashes
The line (?<!\\) # Not a comment if escaped. handles that situation: if there is a preceding \, we do not have a comment.
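Here is a small sketch of the right-to-left pattern in use, written as C# top-level statements (C# 9 or later). The variable names and the second sample line are assumptions for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// The right-to-left pattern from above with the three suggested options.
var rightToLeft = new Regex(@"
^
.+?              # Workable regex
(?<Comment>      # Comment group
  (?<!\\)        # Not a comment if escaped.
  \#             # Anchor for actual comment
  [^#]+          # The actual commented text to stop at #
)?               # We may not have a comment
$",
    RegexOptions.RightToLeft |
    RegexOptions.IgnorePatternWhitespace |
    RegexOptions.Multiline);

// First line: an escaped hash followed by a real comment.
// Second line: only an escaped hash, so no comment is expected.
foreach (string line in new[] { @"\##value must be a hash", @"a\#b" })
{
    Group comment = rightToLeft.Match(line).Groups["Comment"];
    Console.WriteLine(comment.Success ? comment.Value : "(no comment)");
}
```

Two things to keep in mind with this version: the captured value includes the leading #, and because of [^#]+ the comment text itself cannot contain a #.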