Regular expression to cherry pick a multiline component of a paragraph sitting between tags (Not html)

In the text below, I need a regular expression that captures the part between <tagstart> and </tagstart>.

Please note that this is not HTML.

* real time results: shows results as you type 
* code hinting: roll over your expression to see info on specific elements 
* detailed results: roll over a match to see details & view group info below 
* built in regex guide: doub<tagstart>le click entries to insert them into your expression 
* online & desktop: regexr.com or download the desktop version for Mac, Windows, or Linux 
* save your expressions: My Saved expr</tagstart>essions are saved locally 
* search Community expressions and add your own

Thanks


EDIT: As @Kobi correctly points out in the comments, the much simpler version of the original post below is of course:

<(tagstart)>(.*?)</\1>

Since the original version also works and all the other statements remain true, I'll leave it as it is.
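
As a rough illustration, here is how the simpler pattern might be applied in Python (my choice of language; the question does not name one). The sample string is a shortened copy of the text from the question, and re.DOTALL is what lets . run across the line breaks.

```python
import re

# Shortened copy of the sample text from the question.
text = (
    "* built in regex guide: doub<tagstart>le click entries\n"
    "* online & desktop: regexr.com\n"
    "* save your expressions: My Saved expr</tagstart>essions are saved locally\n"
)

# re.DOTALL is Python's name for "single line" mode: . also matches newlines.
pattern = re.compile(r"<(tagstart)>(.*?)</\1>", re.DOTALL)

match = pattern.search(text)
# Group 1 is the tag name, group 2 is the text between the tags.
between = match.group(2) if match else None
print(between)
```

Checking for a failed match before calling group() avoids an AttributeError when the input has no tags at all.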


If (and only if) the tags cannot be nested:

<(tagstart)>((?:(?!</\1>).)*)</\1>

Explanation:

<(tagstart)>      # matches "<tagstart>" and stores "tagstart" in group 1
(                 # begin group 2
  (?:             #   begin non-capturing group
    (?!           #     begin negative look-ahead (... not followed by)
      </\1>       #       a closing tag with the same name as group 1
    )             #     end negative look-ahead
    .             #     if ok, match the next character
  )*              #   end non-capturing group, repeat
)                 # end group 2 (stores everything between the tags)
</\1>             # a closing tag with the same name as group 1
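
The commented layout above maps fairly directly onto a "verbose" pattern. The following is a sketch in Python, assuming its re module; the helper name between_tags is made up for this example.

```python
import re

# re.VERBOSE ignores whitespace and # comments inside the pattern;
# re.DOTALL makes . match newlines as well.
TEMPERED = re.compile(
    r"""
    <(tagstart)>      # opening tag; "tagstart" is stored in group 1
    (                 # group 2: everything between the tags
      (?:
        (?!</\1>)     #   not at the start of the matching closing tag
        .             #   so consume one character
      )*
    )
    </\1>             # closing tag with the same name as group 1
    """,
    re.VERBOSE | re.DOTALL,
)


def between_tags(text):
    """Return the text between the first tag pair, or None if there is none."""
    m = TEMPERED.search(text)
    return m.group(2) if m else None


print(between_tags("doub<tagstart>le click\nMy Saved expr</tagstart>essions"))
```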

The regex needs to be applied in "single line" mode (sometimes called "dotall" mode), so that . also matches line breaks. Alternatively, replace the . with [\s\S].
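
To make the difference concrete, here is a small Python comparison (Python is an assumption on my part; other flavors spell the flag differently). Without the flag the pattern fails on multiline input, while the flag, the inline (?s) form, and the [\s\S] class all give the same result.

```python
import re

text = "a<tagstart>line one\nline two</tagstart>b"

# Default mode: . stops at the newline, so nothing matches.
plain = re.search(r"<(tagstart)>(.*?)</\1>", text)

# Three equivalent ways to let the match cross line breaks.
dotall = re.search(r"<(tagstart)>(.*?)</\1>", text, re.DOTALL)
inline = re.search(r"(?s)<(tagstart)>(.*?)</\1>", text)
char_class = re.search(r"<(tagstart)>([\s\S]*?)</\1>", text)

print(plain)
print(dotall.group(2) == inline.group(2) == char_class.group(2))
```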

To generically match text between any two equally named tags, use <(\w+)> instead of <(tagstart)>.
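
A possible sketch of the generic variant in Python, collecting every tag pair rather than only the first. The function name tagged_spans and the sample string are invented for the example, and the non-nesting restriction still applies.

```python
import re

# <(\w+)> accepts any tag name; \1 forces the closing tag to use the same name.
ANY_TAG = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def tagged_spans(text):
    """Return (tag name, inner text) for each non-nested tag pair found."""
    return [(m.group(1), m.group(2)) for m in ANY_TAG.finditer(text)]


sample = "x<a>one\ntwo</a>y<b>three</b>z<c>unclosed"
print(tagged_spans(sample))
```

The unclosed <c> at the end is skipped, because no matching </c> follows it.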

Depending on your regex flavor, some things may work differently, like $1 instead of \1 for back-references, or meta-characters that need additional escaping.
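
As one example of such a difference, Python (assumed here purely for illustration) uses \1 inside the pattern and \2 or \g<2> in a replacement string, whereas a $2 in the replacement is treated as literal text.

```python
import re

text = "keep <tagstart>inner\ntext</tagstart> keep"
pattern = re.compile(r"<(tagstart)>(.*?)</\1>", re.DOTALL)

# \g<2> refers to capture group 2 in a Python replacement template.
unwrapped = pattern.sub(r"\g<2>", text)

# $2 has no special meaning in Python and is inserted as-is.
literal = pattern.sub("$2", text)

print(unwrapped)
print(literal)
```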

See a Rubular demo.


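Maybe this regexp would help you: /(\<tagstart\>)(.+)(\<\/tagstart\>)/s. The second capture group would be what you are searching for. Note that .+ is greedy, so if the text contains several tagged spans it runs to the last closing tag; use .+? to stop at the first one. See demo for details.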
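
The effect of greedy versus lazy matching only shows up when there is more than one tagged span. A small Python sketch with a made-up input string:

```python
import re

text = "<tagstart>first</tagstart> middle <tagstart>second</tagstart>"

# Greedy .+ grabs as much as possible and ends at the LAST closing tag.
greedy = re.search(r"(<tagstart>)(.+)(</tagstart>)", text, re.DOTALL).group(2)

# Lazy .+? grabs as little as possible and ends at the FIRST closing tag.
lazy = re.search(r"(<tagstart>)(.+?)(</tagstart>)", text, re.DOTALL).group(2)

print(repr(greedy))
print(repr(lazy))
```

With a single pair of tags, as in the question's sample, both forms return the same text.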


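#!/usr/bin/perl
use strict;
use warnings;

undef $/;    # no input record separator: read everything in one go
$_ = <>;     # slurp the whole file (or STDIN) into $_

# /s lets . match newlines; print group 2 only if the match succeeded
print $2 if m|<(.*?)>(.*)</\1>|s;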

If you really need just <tagstart>, replace <(.*?)> with <tagstart>, and </\1> with </tagstart>. The undef $/ line lets you slurp the whole input with a single read, and $2 selects the second capture group. The s at the end of the regex asks for . to match newline characters as well.
