
RegEx with strange behaviour: matching String with back reference to allow escaping and single and double quotes

Matching a string that allows escaping is not that difficult; see http://ad.hominem.org/log/2005/05/quoted_strings.php. For the sake of simplicity I chose the approach in which a string is built from two kinds of "atoms": either a character that is not a quote or backslash, or a backslash followed by any character.

"(([^"\\]|\\.)*)"

The obvious improvement is to allow different quote characters and use a backreference for the closing quote.

(["'])((\\.|[^\1\\])*?)\1

Also multiple backslashes are interpreted correctly.

Now to the part where it gets weird: I have to parse some variables like this (note the missing backslash in the first variable value):

test = 'foo'bar'
var = 'lol'
int = 7

So I wrote quite an expression. I found out that the following part of it does not work as expected (the only difference from the expression above is the appended "([\r\n]+)"):

(["'])((\\.|[^\1\\])*?)\1([\r\n]+)

Despite the missing backslash, 'foo'bar' is matched. I used RegExr by gskinner for this (online tool) but PHP (PCRE) has the same behaviour.

To fix this, you can hardcode the quote by replacing the backreferences with '. Then it works as expected. Does this mean the backreference does not actually work in this case? And what does this have to do with the linebreak characters, given that it worked without them?


You can't use a backreference inside a character class; there, \1 will be interpreted as the character with octal code 1 (at least in some regex engines, I don't know if this is universally true).
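As a rough illustration, the behaviour can be reproduced with Python's re module, which also treats \1 inside a character class as a numeric character escape. This is only a sketch: the variable names are made up for the example, and Python stands in for PCRE here. It also suggests why the appended linebreak group matters: without it the lazy quantifier stops at the first quote and hides the problem, while with it the engine has to backtrack and [^\1\\] happily consumes the quote.

```python
import re

# Inside a character class, Python's re treats \1 as the character with
# octal code 1, not as a backreference to group 1.
broken_class = re.compile(r"[^\1\\]")
assert broken_class.fullmatch("'") is not None   # a quote is accepted
assert broken_class.fullmatch("\x01") is None    # chr(1) is what gets excluded

# The two expressions from the question, unchanged.
original = re.compile(r"""(["'])((\\.|[^\1\\])*?)\1""")
with_linebreak = re.compile(r"""(["'])((\\.|[^\1\\])*?)\1([\r\n]+)""")

text = "'foo'bar'\n"

# Lazy matching stops at the first quote, so the bug stays hidden.
short_match = original.match(text).group(0)

# The linebreak group fails after 'foo', so the engine backtracks and
# the character class swallows the unescaped quote.
long_match = with_linebreak.match(text).group(0)

print(repr(short_match))  # "'foo'"
print(repr(long_match))   # "'foo'bar'\n"
```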

So instead try the following:

(["'])(?:\\.|(?!\1).)*\1(?:[\r\n]+)

or, as a verbose regex:

(["'])       # match a quote
(?:          # either match...
 \\.         # an escaped character
 |           # or
 (?!\1).     # any character except the previously matched quote
)*           # any number of times
\1           # then match the previously matched quote again
(?:[\r\n]+)  # plus one or more linebreak characters.
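For anyone who wants to try the verbose form, here is a minimal sketch using Python's re.VERBOSE flag. The name quoted is hypothetical, Python is an assumption (the question used PCRE), and the trailing linebreak group is left out so that the quoting logic can be checked on its own with fullmatch.

```python
import re

# Verbose form of the quoted-string pattern, without the linebreak group.
quoted = re.compile(r"""
    (["'])       # match a quote
    (?:          # either match...
      \\.        # an escaped character
      |          # or
      (?!\1).    # any character except the previously matched quote
    )*           # any number of times
    \1           # then match the previously matched quote again
""", re.VERBOSE)

# An unescaped quote inside the value is rejected as a complete string.
unescaped = quoted.fullmatch("'foo'bar'")

# An escaped quote, or the other kind of quote, is accepted.
escaped = quoted.fullmatch(r"'foo\'bar'")
mixed = quoted.fullmatch('"it\'s"')

print(unescaped is None)      # True
print(escaped is not None)    # True
print(mixed is not None)      # True
```

Note that fullmatch (or explicit anchors) is used deliberately: an unanchored search over "'foo'bar'" followed by a linebreak could still find 'bar' starting at the second quote.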

Edit: Removed some unnecessary parentheses and changed some into non-capturing parentheses.

Your regex insists on finding at least one linebreak character (carriage return or line feed) after the matched string - why? What if it's the last line of your file? Or if there is a comment or whitespace after the string? You probably should drop that part completely.

Also note that you don't have to make the * lazy for this to work - the regex can't cross an unescaped quote character - and that you don't have to check for backslashes in the second part of the alternation since all backslashes have already been scooped up by the first part of the alternation (?:\\.|(?!\1).). That's why this part has to be first.
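The two points above can be checked with a small experiment. This is a sketch in Python rather than PCRE, and the pattern names are invented for the example: one pattern has the escape alternative first, one has it last, and one uses a lazy quantifier.

```python
import re

escaped_first = re.compile(r"""(["'])(?:\\.|(?!\1).)*\1""")
escaped_last = re.compile(r"""(["'])(?:(?!\1).|\\.)*\1""")
lazy_variant = re.compile(r"""(["'])(?:\\.|(?!\1).)*?\1""")

text = r"'a\'b' tail"

# Escape alternative first: the backslash and the quote are consumed together.
good = escaped_first.match(text).group(0)

# Escape alternative last: the backslash is consumed alone, so the
# escaped quote is mistaken for the closing quote.
bad = escaped_last.match(text).group(0)

# Lazy quantifier: same result as the greedy version for this input.
lazy = lazy_variant.match(text).group(0)

print(good)  # 'a\'b'
print(bad)   # 'a\'
print(lazy)  # 'a\'b'
```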
