
Regex PHP Only Match if Not Surrounded By Quotes

I have a regex that I run over an entire HTML page, looking for strings and replacing them. However, if the string is in single or double quotes, I do not want it to match.

Current Regex: ([a-zA-Z_][a-zA-Z0-9_]*)

I would like to match steve, john, cathie and john likes to walk (x3), but not "steve", 'sophie' or "john"'likes'"cake".

I have tried (^")([a-zA-Z_][a-zA-Z0-9_]*)(^") but get no matches.

Test Cases:

(steve=="john") would return steve
("test"=="test") would not return anything
(boob==lol==cake) would return all three


Try this one:

(\b(?<!['"])[a-zA-Z_][a-zA-Z_0-9]*\b(?!['"]))

Against this string:

john "michael" michael 'michael elt0n_john 'elt0n_j0hn'
 1      2        3        4       5            6

It matches no. 1 (john), no. 3 (michael) and no. 5 (elt0n_john). The other three are rejected because each has a quote directly before or after it.
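
As a quick check, here is a minimal PHP sketch that runs this pattern with preg_match_all against the sample string above. The ~ delimiter and the variable names are my own choices, not part of the answer.

```php
<?php
// Pattern from the answer, wrapped in '~' delimiters (the delimiter is an assumption).
// The quotes inside the character classes are escaped only for the PHP string.
$pattern = '~\b(?<![\'"])[a-zA-Z_][a-zA-Z_0-9]*\b(?![\'"])~';

// The sample string from the answer.
$subject = 'john "michael" michael \'michael elt0n_john \'elt0n_j0hn\'';

preg_match_all($pattern, $subject, $matches);

// $matches[0] holds the full matches in order of appearance.
print_r($matches[0]);
```

On this input the result should be john, michael and elt0n_john, in that order.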


To do that you probably need some dark magic:

'~(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*+)*+"|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*+)*+\')(*SKIP)(*F)|([a-zA-Z_][a-zA-Z0-9_]*)~'

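The (?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*+)*+"|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*+)*+\') part matches a string in either single or double quotes and implements backslash-escaping. The (*SKIP)(*F) skips the quoted string and forces a fail. ([a-zA-Z_][a-zA-Z0-9_]*) is your regex. Note that the pattern is written as a PHP single-quoted string, so \\\\ reaches the regex engine as \\ (one literal backslash) and \' as a plain '.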
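
To see how this behaves, here is a small sketch that runs the pattern over the three test cases from the question. The fourth input, with backslash-escaped quotes, is a made-up example of mine to exercise the escaping part, and the variable names are arbitrary.

```php
<?php
// The pattern from this answer, unchanged.
$pattern = '~(?:"[^"\\\\]*+(?:\\\\.[^"\\\\]*+)*+"|\'[^\'\\\\]*+(?:\\\\.[^\'\\\\]*+)*+\')(*SKIP)(*F)|([a-zA-Z_][a-zA-Z0-9_]*)~';

$inputs = [
    '(steve=="john")',                       // from the question
    '("test"=="test")',                      // from the question
    '(boob==lol==cake)',                     // from the question
    'say "he said \"hi\" to bob" alice',     // hypothetical: escaped quotes inside a string
];

$results = [];
foreach ($inputs as $input) {
    preg_match_all($pattern, $input, $m);
    $results[] = $m[0];   // full matches only; quoted strings were skipped
    echo $input, ' => ', implode(', ', $m[0]), PHP_EOL;
}
```

The quoted parts are consumed as a whole and discarded, so only the identifiers outside them remain.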

PS: If you are using this on PHP scripts, you may want to use the Tokenizer instead. That way you could for example exclude keywords (like class or abstract, I don't know whether you need this) and you will have much better handling of edge cases (like HEREDOC).
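
A rough sketch of the Tokenizer route, assuming the tokenizer extension is available (it usually is). The sample source string is hypothetical, and collecting only T_STRING and T_VARIABLE tokens is just one possible choice of what counts as a name.

```php
<?php
// Hypothetical PHP source to scan.
$code = '<?php $name = "steve"; echo strlen($name), \'john\';';

$names = [];
foreach (token_get_all($code) as $token) {
    // Single-character tokens such as '=' or ';' come back as plain strings.
    if (!is_array($token)) {
        continue;
    }
    // Keep identifiers and variables; string literals are T_CONSTANT_ENCAPSED_STRING
    // and keywords such as echo have their own token types, so both are excluded.
    if (in_array($token[0], [T_STRING, T_VARIABLE], true)) {
        $names[] = $token[1];
    }
}

print_r($names);
```

Here "steve" and 'john' are never seen as names because the tokenizer reports each string literal as a single token.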


You could try with:

preg_match_all('#(?<!["\']) \b \w+ \b (?!["\'])#x', $str, $matches);

The \w+ matches word characters, but would allow 0123sophie for example. The \b matches word boundaries and thus ensures that the anti-quote assertions do not terminate too early.

However, this regex will also fail to find words that have a quote on only one side, whether before or after them.
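
Both caveats can be seen in a short sketch. The input string is one I made up for illustration.

```php
<?php
// Pattern from this answer; the x flag lets the spaces in it be ignored.
$pattern = '#(?<!["\']) \b \w+ \b (?!["\'])#x';

// Hypothetical input: a plain word, a quoted word, a word starting with digits,
// and a word containing an apostrophe.
$subject = 'steve "john" 0123sophie it\'s';

preg_match_all($pattern, $subject, $found);
print_r($found[0]);
```

This should give steve and 0123sophie. The quoted john is rejected as intended, 0123sophie is accepted because \w includes digits, and both halves of it's are lost because each touches the apostrophe.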


Pez, resurrecting this ancient question because the current answer is not quite correct (and I'm not sure any solution can be).

It will fail to match john when it is in incomplete quotes, for instance in "john, john", 'john and john' (that is, with a quote on one side only; this can happen with john's birthday, etc.). See this demo.

This alternate solution just skips any content in quotes:

(?:'[^'\n]*'|"[^"\n]*")(*SKIP)(*F)|\b[a-zA-Z_][a-zA-Z_0-9]*\b

See demo

Either way, with quotes, no solution is perfect because you always run the risk of having unbalanced quotes. In this case I have tried to mitigate the problem by assuming that if it's on another line, it's a different string.
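
To illustrate both the mitigation and the remaining risk, here is a sketch with two inputs of my own: one where the stray quote sits on a different line from the real quoted strings, and one where everything is on the same line.

```php
<?php
// Pattern from this answer; \n is left for the regex engine to interpret.
$pattern = '~(?:\'[^\'\n]*\'|"[^"\n]*")(*SKIP)(*F)|\b[a-zA-Z_][a-zA-Z_0-9]*\b~';

// Hypothetical input 1: the stray quote after john is alone on its line.
$twoLines = "john' and cathie\n\"steve\" likes 'cake'";
preg_match_all($pattern, $twoLines, $a);

// Hypothetical input 2: the same stray quote shares a line with 'cake'.
$oneLine = "john' and cathie 'cake'";
preg_match_all($pattern, $oneLine, $b);

print_r($a[0]);
print_r($b[0]);
```

The first input should give john, and, cathie, likes, because the stray quote cannot pair up across the newline. The second should give john and cake: the stray quote pairs with the opening quote of 'cake', so " and cathie " is skipped and cake ends up outside any quotes.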



Ok I think I have it and it works for your test cases:

(?<!"|'|\w)(\w+)(?!"|'|\w)

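This uses the regex look-ahead and look-behind features: a word is matched only if it is neither preceded nor followed by a quote or another word character.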
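
A minimal sketch that checks this pattern against the three test cases from the question. The ~ delimiter and the variable names are my own choices.

```php
<?php
// Pattern from this answer; only the single quotes are escaped for the PHP string.
$pattern = '~(?<!"|\'|\w)(\w+)(?!"|\'|\w)~';

// Test cases taken from the question.
$cases = ['(steve=="john")', '("test"=="test")', '(boob==lol==cake)'];

$out = [];
foreach ($cases as $case) {
    preg_match_all($pattern, $case, $m);
    $out[$case] = $m[1];   // group 1 is the word itself
    echo $case, ' => ', implode(', ', $m[1]), PHP_EOL;
}
```

Like the other look-around answers, this one will still drop a word that has a quote on only one side.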
