Regex to find an HTML element without a particular phrase within its tags

I need to match <output_channels> elements which don't contain the phrase 'Story' between the opening <output_channels> and closing </output_channels> tags. <output_channels> elements are never nested, so I think I should be able to do this with regex - please don't reply that it's impossible unless it genuinely is!

Here's an example of the text I'll be searching in, using either Perl or Vim (I find it easier to test regexes in Vim):

<output_channels>
  <output_channel>RSS</output_channel>
  <output_channel>Story</output_channel> 
</output_channels>

<output_channels>
  <output_channel>RSS</output_channel>
</output_channels>

I'm thinking I need to run something like the following, but this matches both <output_channels> blocks:

<output_channels>.*?((?!Story).)*?<\/output_channels>


Use search term:

<output_channels>\_s\{-}\(\(<output_channel>\_s\{-}Story\_s\{-}<\/output_channel>\)\@!\_.\)\{-}\_s\{-}<\/output_channels>

This matches only the second <output_channels> element in your example, since it is the one without <output_channel>Story</output_channel>.

\_s matches any whitespace character, including a newline
\_. matches any character, including a newline
\{-} makes the preceding item non-greedy in Vim
\@! is a negative lookahead: it succeeds only where the preceding group does not match
\( and \) group a pattern


This might need some adjustment depending on what your whole XML file looks like, but it works with your example:

<output_channels>(?:\s*<output_channel>(?!Story)[^<]+<\/output_channel>\s*)+<\/output_channels>
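A quick way to check this pattern against the example is a short script. The sketch below uses Python's re module purely as an illustration (the question mentions Perl and Vim, so the choice of Python and the variable names are assumptions); the pattern itself is unchanged. Note that (?!Story) here only rejects channel values that begin with Story, which is enough for the sample data.

```python
import re

# Sample text from the question: the first block contains "Story", the second does not.
text = """<output_channels>
  <output_channel>RSS</output_channel>
  <output_channel>Story</output_channel>
</output_channels>

<output_channels>
  <output_channel>RSS</output_channel>
</output_channels>
"""

# Each <output_channel> child must not start with "Story"; one or more children are required.
pattern = re.compile(
    r"<output_channels>"
    r"(?:\s*<output_channel>(?!Story)[^<]+</output_channel>\s*)+"
    r"</output_channels>"
)

matches = [m.group(0) for m in pattern.finditer(text)]
for block in matches:
    print(block)
```

With the sample text, only the RSS-only block should be printed.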


You need to get rid of that first .*?. What's happening is that, after the ((?!Story).)*? part correctly fails to match content with Story in it, the regex engine backtracks and lets the .*? consume that content instead, and of course it succeeds. This assumes you're matching in /s (single-line, or dot-matches-all) mode. Without the leading .*?, the pattern becomes <output_channels>((?!Story).)*?<\/output_channels>.
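The difference is easy to see by running both patterns side by side. This is a minimal sketch in Python rather than Perl (an assumption made for illustration; re.DOTALL plays the role of the /s flag, and the variable names are hypothetical):

```python
import re

# Sample text from the question: the first block contains "Story", the second does not.
text = """<output_channels>
  <output_channel>RSS</output_channel>
  <output_channel>Story</output_channel>
</output_channels>

<output_channels>
  <output_channel>RSS</output_channel>
</output_channels>
"""

# Pattern from the question: the leading .*? is free to consume "Story".
original = re.compile(
    r"<output_channels>.*?((?!Story).)*?</output_channels>", re.DOTALL
)

# Same pattern with the leading .*? removed: every character between the
# tags now has to pass the (?!Story) check.
fixed = re.compile(
    r"<output_channels>((?!Story).)*?</output_channels>", re.DOTALL
)

original_matches = [m.group(0) for m in original.finditer(text)]
fixed_matches = [m.group(0) for m in fixed.finditer(text)]

print(len(original_matches), "match(es) with the leading .*?")
print(len(fixed_matches), "match(es) without it")
```

On the sample text, the original pattern matches both blocks, while the version without the leading .*? matches only the block that has no Story channel.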
