
Regex to match quoted strings with negative lookbehind (.NET)

I am trying to create a .NET Regex that will match quoted strings in VB.NET source code while excluding certain unwanted strings, such as those in XML comments and region labels.

Here's a data sample, representing some VB.NET source code that the Regex might execute against:

#Region "Class Constructors"

''' <summary>
''' Initializes a new instance of the <see cref="MyClass" /> class.
''' </summary>
Public Sub New()
    Debug.WriteLine("This string should be matched by the Regex")
End Sub

#End Region

The Regex should match the quoted string in the Debug.WriteLine method call, but should ignore the string in the region label and the XML comment. It should also support VB.NET's quote escaping syntax that uses two consecutive double quotes to represent an embedded (escaped) quote character:

"This is a string containing an escaped quote "" character"

As a starting point, I have experimented with the following Regex, but the negative lookbehind causes it to match subsequent closing quotes as if they were opening quotes.

(?<!Region\s+)"(?<Literal>(?:[^"]|"")*)"

As an additional finesse, it would be helpful if the Regex could completely ignore empty strings represented by a pair of quote characters.

Any suggestions please?

Thanks in advance, Tim


I think this is one of those cases where a single regex won't solve all of your problems. I assume that #Region directives can span multiple lines, as in:

#Region \
  "MyRegion"

or with some other line-continuation character, in which case your lookbehind is not enough. Selectively extracting matches from text with a complex syntax really calls for a lexer, or at least for parsing the whole thing differently.

You might be able to find a shortcut, though. For example, you know that you don't want anything between the <summary> and </summary> tags, so you can loop through the lines, skip everything after <summary> until you find the closing tag, and then resume matching strings. Take special care when writing a regex to strip comments and preprocessor directives (i.e. ', REM and #). Those markers only count when they appear outside a string, so stripping comments is a bit involved, and even there a single regex might not be enough.

To skip empty strings (a bare pair of double quotes), this seems to do the trick for me:

"((?:[^"]|"")+)"
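One caveat with the pattern above: run unanchored over a line such as x = "" & "abc", it cannot match at the first quote, so it starts at the second quote of the empty string and matches " & " instead. A rough way around this, following the line-by-line idea described earlier, is to consume every literal (empty ones included) together with comments, and to filter afterwards. The sketch below is written in Python only so that it can be run quickly. The names TOKEN and string_literals are made up for illustration, and it assumes that directives and string literals never span more than one line. In .NET the named groups would be spelled (?<name>...) rather than (?P<name>...).

```python
import re

# Alternatives are tried at each position from left to right, so a quote is
# only treated as an opening quote when it is not already inside a literal
# or a comment that was consumed earlier on the line.
TOKEN = re.compile(r'''
    "(?P<literal>(?:[^"]|"")*)"      # string literal; "" is an escaped quote
  | (?P<comment>'.*|\bREM\b.*)       # comment: runs to the end of the line
''', re.VERBOSE | re.IGNORECASE)


def string_literals(source):
    """Yield the contents of non-empty string literals found in code lines."""
    for line in source.splitlines():
        # Assumption: a line starting with # is a directive such as #Region.
        if line.lstrip().startswith('#'):
            continue
        for token in TOKEN.finditer(line):
            if token.group('comment') is not None:
                break  # nothing after a comment marker is code
            text = token.group('literal')
            if text:  # empty strings are consumed first, then dropped here
                yield text.replace('""', '"')  # undo VB.NET quote escaping


sample = """#Region "Class Constructors"

''' <summary>
''' Initializes a new instance of the <see cref="MyClass" /> class.
''' </summary>
Public Sub New()
    Debug.WriteLine("This string should be matched by the Regex") ' not "this"
    Dim s = "" & "a ""quoted"" word"
End Sub

#End Region
"""

for literal in string_literals(sample):
    print(literal)
```

Because the comment alternative is only tried where a string literal does not start, a ' inside a string is never mistaken for a comment marker, and quotes inside a comment are never mistaken for a string. Multi-line directives, as assumed above, would still need separate handling.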