
Translate the intent of this PHP regex for multiline strings into Python/Perl

Below is a PHP regex intended to match (multiline) strings inside PHP or JavaScript source code (from this post), but I suspect it has issues. What is the literal Python (or else Perl) equivalent of this?

~'(\\.|[^'])*'|"(\\.|[^"])*"~s
  • the s modifier means dot matches all characters, including newline; in Python that's re.compile(..., re.DOTALL)
  • I totally don't get the intent of the leading \\. ? Does that reduce to . ? Are double backslashes needed to escape it twice in PHP?
  • allowing in every position a match of either \\. or [^'] (any non-quote character) seems like total overkill to me, and maybe explains why this person's regex blows up. Doesn't the [^'] class already match everything that . with the s modifier does? Surely it should match newlines?

  • for constructing two versions of the regex in Python, one with single quotes and one with double quotes, one can use this two-step approach

  • NB a simpler version of this regex can also be found in this list of PHP regex examples, under Programming: String.
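
As a minimal sketch of what such a two-step construction could look like (the helper name quoted_string_pattern is hypothetical, and this is not necessarily the approach the linked post describes): build one sub-pattern per quote character, then join them with |.

```python
import re

def quoted_string_pattern(quote):
    """Return regex source for a string delimited by `quote` (hypothetical helper)."""
    q = re.escape(quote)
    # Either an escaped character (backslash + any character) or any non-quote character.
    return rf"{q}(\\.|[^{q}])*{q}"

# Step 1: one sub-pattern per quote character; step 2: join them with '|'.
pattern = re.compile(
    "|".join(quoted_string_pattern(q) for q in ("'", '"')),
    re.DOTALL,  # Python counterpart of PHP's `s` modifier: '.' also matches newlines
)

match = pattern.search("x = 'a\\'b\nc' + 1")
print(repr(match.group(0)))
```

This mirrors the original pattern as closely as possible, capturing groups included; it is meant as a literal translation, not an improvement.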


The \\. is meant to match a literal backslash in the pattern, and swallow the following character. Note that since patterns in PHP (and Python) are contained in strings, it would actually need to be \\\\. in the string, so that it ends up as \\. in the regex.
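
A small Python sketch of the two escaping levels, under the assumption that only the single-quote half of the pattern is of interest; the variable names are illustrative:

```python
import re

# Ordinary string: four backslashes in the literal yield two in the regex.
plain = "'(\\\\.|[^'])*'"
# Raw string: backslashes are passed through to the regex unchanged.
raw = r"'(\\.|[^'])*'"
assert plain == raw

# With only two backslashes in an ordinary string the regex receives \. instead,
# which matches a literal dot rather than "backslash + any character".
under_escaped = "'(\\.|[^'])*'"
assert under_escaped == r"'(\.|[^'])*'"

source = r"msg = 'it\'s here'"
short_match = re.search(under_escaped, source, re.DOTALL).group(0)
full_match = re.search(raw, source, re.DOTALL).group(0)
```

Here short_match stops at the escaped quote, while full_match runs to the real closing quote.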

It's important to match the backslash and swallow the following character because it could be used to escape a quote which would otherwise end the match prematurely.
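
To illustrate, here is a hedged comparison in Python between a pattern that lacks the backslash alternative and one that has it; the names naive and aware are made up for this sketch:

```python
import re

source = r"msg = 'it\'s here' + tail"

# Without the escape alternative, the first quote character ends the match,
# even though it is preceded by a backslash.
naive = re.compile(r"'[^']*'", re.DOTALL)
# With it, a backslash and the character after it are consumed as one unit.
aware = re.compile(r"'(\\.|[^'])*'", re.DOTALL)

naive_match = naive.search(source).group(0)
aware_match = aware.search(source).group(0)
```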

This pattern looks like it should work fine, and I can't think of a more succinct way to express it.

It should also work fine in Python (as you say, with re.DOTALL). In Python you could use the raw string notation to save the extra escaping of the backslash although you'd still need to escape the single quote. This should be equivalent:

re.search(r'\'(\\.|[^\'])*\'|"(\\.|[^"])*"', text, re.DOTALL)


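The regex is mostly okay, except it doesn't reliably handle escaped quotes (i.e., \" and \'): [^'] can also match a backslash, so after backtracking the match can end at an escaped quote. That's easy enough to fix by excluding the backslash from the character classes: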

'(?:\\.|[^'\\]+)*'|"(?:\\.|[^"\\]+)*"

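That's a "generic" regex; in Python you would usually write it in the form of a raw string. Triple single quotes are used here because the pattern ends in a double quote, which would otherwise run into a closing """ delimiter:

r''''(?:\\.|[^'\\]+)*'|"(?:\\.|[^"\\]+)*"'''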
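
A minimal usage sketch in Python, assuming the goal is to list every string literal in a chunk of source text. The names STRING_LITERAL and sample are illustrative, and re.VERBOSE is used only so that the two alternatives can be laid out and commented:

```python
import re

STRING_LITERAL = re.compile(
    r"""
    '(?:\\.|[^'\\]+)*'      # single-quoted: escaped char, or a run of ordinary chars
    |
    "(?:\\.|[^"\\]+)*"      # double-quoted: same idea
    """,
    re.DOTALL | re.VERBOSE,
)

# Sample input containing a multiline double-quoted string with escaped quotes
# and a single-quoted string with an escaped apostrophe.
sample = r'''a = "first line
second \"quoted\" line"; b = 'it\'s' '''

literals = [m.group(0) for m in STRING_LITERAL.finditer(sample)]
for literal in literals:
    print(repr(literal))
```

Because the character classes are negated, they match newlines on their own; re.DOTALL only matters for the character that follows a backslash.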

In PHP you have to escape the backslashes to get them past PHP's string processing:

'~\'(?:\\\\.|[^\'\\\\]+)*\'|"(?:\\\\.|[^"\\\\]+)*"~s'

Most of the currently-popular languages have either a string type that requires less escaping, support for regex literals, or both. Here's how your regex would look as a C# verbatim string:

@"'(?:\\.|[^'\\]+)*'|""(?:\\.|[^""\\]+)*"""

But, formatting considerations aside, the regex itself should work in any Perl-derived flavor (and many other flavors as well).


p.s.: Notice how I added the + quantifier to your character classes. Your intuition about matching one character at a time is correct; adding the + makes a huge difference in performance. But don't let that fool you; when you're dealing with regexes, intuition seems to be wrong more often than not. :/
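
One caveat worth sketching: a + nested inside a * is fast when a closing quote exists, but in a backtracking engine such as Python's re it can backtrack heavily on an unterminated string. A common alternative, which is not part of the answer above, is the "unrolled loop" form shown here. The names SINGLE, DOUBLE and UNROLLED are illustrative.

```python
import re

# Unrolled form: a run of ordinary characters, then any number of
# (escaped character + run of ordinary characters) repetitions.
SINGLE = r"'[^'\\]*(?:\\.[^'\\]*)*'"
DOUBLE = r'"[^"\\]*(?:\\.[^"\\]*)*"'
UNROLLED = re.compile(SINGLE + "|" + DOUBLE, re.DOTALL)

escaped = UNROLLED.search(r"x = 'it\'s'").group(0)
multiline = UNROLLED.search('say "a\nb"').group(0)

# An unterminated string fails quickly: there is only one way to split the
# run of ordinary characters, so there is little to backtrack over.
unterminated = "'" + "a" * 5000
assert UNROLLED.search(unterminated) is None
```

It matches the same string literals as the pattern with the nested + while leaving the engine far fewer ways to retry on failure.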
