
Regex to find an instance of a word or phrase, except when that word or phrase is in braces

First, a disclaimer. I know a little about regexes, but I'm no expert. They seem to be something I need only about twice a year, so they just don't stay "on top" of my brain.

The situation: I'd like to write a regex to match a certain word, let's call it "Ostrich". Easy. Except Ostrich can sometimes appear inside of a curly brace. If it's inside of a curly brace it's not a match. The trick here is that there can be spaces inside the curly braces. Also the text is typically inside of a paragraph.

This should match: I have an Ostrich.

This should not match: My Emu went to the {Ostrich Race Name}.

This should be a match: My Ostrich went to the {Ostrich Race Name}.

This should not be a match: My Emu went to the {Race Ostrich Place}. My Emu went to the {Race Place Ostrich}.

It seems like this is possible with a regex, but I sure don't see it.


I'll offer an alternative solution to doing this, which is a bit more robust (not using regex assertions).

First, remove all the braced items, using a regex like {[^}]+} (use replace to change each match to an empty string; escape the braces as \{[^}]+\} if your engine requires it).

Now you can just search for Ostrich (using regex or simple string matching, depending on your needs).
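
The two steps above can be sketched in Python as follows. This is only an illustration, assuming braces are not nested; the names BRACED and contains_outside_braces are made up for the example, and the search step uses simple string matching.

```python
import re

# One brace group such as "{Ostrich Race Name}" (nesting is not handled).
BRACED = re.compile(r"\{[^}]+\}")

def contains_outside_braces(text: str, word: str = "Ostrich") -> bool:
    """Return True if `word` occurs somewhere outside a {...} group."""
    stripped = BRACED.sub("", text)  # step 1: drop every brace group
    return word in stripped          # step 2: plain substring search

if __name__ == "__main__":
    print(contains_outside_braces("I have an Ostrich."))
    print(contains_outside_braces("My Emu went to the {Ostrich Race Name}."))
```

Because the brace groups are deleted before searching, this tells you whether a match exists but not where it sits in the original text.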


While regular expressions can certainly be written to do what you ask, they're probably not the best tool for this particular type of thing.

One major problem with regular expressions is that they're very good at pattern matching for things that are there, but not so much when you start adding "except" into the mix.

Regular expressions are not stateful enough to handle this properly without a lot of work, so I would try to find a different path towards a solution.

A character tokenizer that handles the braces would be easy enough to write.
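
A minimal sketch of such a scanner in Python, under the assumption that a stray closing brace should simply be ignored. The function name find_outside_braces is hypothetical, and the depth counter is one possible way to keep the state a regex lacks.

```python
def find_outside_braces(text: str, word: str = "Ostrich") -> list[int]:
    """Return the start index of each `word` that is not inside {...}."""
    hits = []
    depth = 0  # number of currently open braces
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)  # ignore a stray closing brace
        elif depth == 0 and text.startswith(word, i):
            hits.append(i)
            i += len(word)  # skip past the word just recorded
            continue
        i += 1
    return hits
```

Unlike the strip-and-search approach, this keeps the positions of the matches, and an unclosed opening brace hides everything after it.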


I believe this will work, using lookahead and lookbehind assertions:

(?<!{[^}]*)Ostrich(?![^{]*})

I also tested the case "My {Ostrich} went to the Ostrich Race." (where only the second "Ostrich" matches).

Note that the lookahead assertion (?![^{]*}) is optional, but without it:

  • "My {Ostrich has a missing bracket" won't match
  • "My Ostrich also} has a missing bracket" will match

which may or may not be desirable.

This works in the .NET regex engine. However, it is not PCRE-compatible, because the lookbehind assertion is variable-length, which PCRE does not support.
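
For engines that reject variable-length lookbehind, one possible fallback is to keep only the lookahead half of the pattern. The sketch below uses Python, whose built-in re module also requires fixed-width lookbehind, and assumes braces are balanced and not nested. The names OUTSIDE and match_positions are illustrative only.

```python
import re

# Lookahead only: reject "Ostrich" when a "}" can be reached without first
# crossing a "{". Assumes balanced, non-nested braces.
OUTSIDE = re.compile(r"Ostrich(?![^{]*\})")

def match_positions(text: str) -> list[int]:
    """Return the start index of every "Ostrich" accepted by OUTSIDE."""
    return [m.start() for m in OUTSIDE.finditer(text)]
```

With unbalanced input the behaviour is the reverse of the lookbehind-only variant: "My Ostrich also} has a missing bracket" is rejected, while "My {Ostrich has a missing bracket" matches.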


Here's a very large regex that almost works.

It will return each "raw" occurrence of the word in a group.
However, the group for the last one will be empty; I'm not sure why. (Tested with .NET.)

Parse it with the option that ignores whitespace in the pattern (RegexOptions.IgnorePatternWhitespace in .NET):

^(?:

    (?:
        [^{]
        |
        (?:\{.*?\})
    )*?

    (?:\W(Ostrich)\W)?
)*$


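Using a positive lookahead with a negated character class appears to properly match all the test cases, as well as multiple occurrences of Ostrich:

(?<!{[^}]*)Ostrich(?=[^}]*)

Note that (?=[^}]*) can match zero characters, so it always succeeds; the lookbehind does the actual filtering.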
