Regex: Skip/Ignore pattern

Given that the following string is embedded in text, how can I extract the whole line without the match stopping at the inner "<" and ">"?

<test type="yippie<innertext>" />

EDIT:

To be more specific, we need to handle both cases below, where the "type" value may or may not contain "<" and ">" characters.

<h:test type="yippie<innertext>" />
<h:test type="yippie">

Group 1: 'h:test'
Group 2: ' type="yippie<innertext>" '  -or-  ' type="yippie"'   (i.e., the remaining content before ">" or "/>")

So far, I have something like this, but it's a little off: Group 2 stops at the first ">". I am still tweaking the first part of Group 2's condition.

(<([a-zA-Z0-9_:-]+)([^>"]*|[^>]*?)\s*(/)?>)

Thanks for your help.


Try this:

<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>

Example usage (Python):

>>> import re
>>> x = '<h:test type="yippie<innertext>" />'
>>> re.search(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>', x).groups()
('h:test', ' type="yippie<innertext>" ')
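
To check both cases from the question with the tag embedded in surrounding text, a minimal sketch could look like this. The names `TAG_RE` and `extract_tag` are made up for illustration; only the pattern itself comes from the answer.

```python
import re

# Pattern from the answer, as a raw string so \w and \s reach the regex engine unchanged.
# Group 1: tag name. Group 2: whitespace plus everything up to the closing ">" or "/>",
# where a double-quoted value is consumed as a unit (so "<" and ">" inside it are skipped).
TAG_RE = re.compile(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>')


def extract_tag(text):
    """Return (whole_tag, name, attributes) for the first tag found, or None."""
    m = TAG_RE.search(text)
    if m is None:
        return None
    return m.group(0), m.group(1), m.group(2)


samples = [
    'some text <h:test type="yippie<innertext>" /> more text',
    'some text <h:test type="yippie"> more text',
]

for sample in samples:
    print(extract_tag(sample))
```

Note that the pattern requires at least one whitespace character after the tag name, so a tag without attributes such as `<br>` is not matched, and single-quoted attribute values are not treated as a unit.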

Also note that if your document is HTML or XML then you should use an HTML or XML parser instead of trying to do this with regular expressions.
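
As a rough illustration of the parser route, here is a sketch using `html.parser.HTMLParser` from the Python standard library, which is lenient about markup like the question's. The `TagCollector` class is a hypothetical helper, not part of the library, and the sketch assumes the input is HTML-like enough for this parser; it also lowercases tag names.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records (tag, attributes) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags, because the default
        # handle_startendtag calls handle_starttag and then handle_endtag.
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed('before <h:test type="yippie<innertext>" /> after <h:test type="yippie">')
collector.close()

for tag, attrs in collector.tags:
    print(tag, attrs)
```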


It looks like you are trying to parse XML/HTML with a regex. I would say that your approach is fundamentally wrong. A sufficiently advanced regex is not indistinguishable from an XML parser. After all, what if you needed to parse:

<test type="yippie<inner\"text\"_with_quotes,_literal_slash_and_quote\\\">" />

Furthermore, you probably need to escape the inner < and > as &lt; and &gt;. In well-formed XML, a literal < is not allowed inside an attribute value.
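
To make the escaping point concrete, here is a small sketch with the standard library's `xml.etree.ElementTree` and `xml.sax.saxutils.quoteattr`. It uses the unprefixed `test` element from the question, since an `h:` prefix would also need a namespace declaration; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

raw_value = 'yippie<innertext>'

# quoteattr escapes &, < and > and wraps the result in quotes.
escaped_doc = '<test type=%s />' % quoteattr(raw_value)
print(escaped_doc)

# With the value escaped, a real XML parser returns the original text.
elem = ET.fromstring(escaped_doc)
print(elem.tag, elem.attrib)

# The unescaped form from the question is rejected as not well-formed.
rejected = False
try:
    ET.fromstring('<test type="yippie<innertext>" />')
except ET.ParseError as exc:
    rejected = True
    print('not well-formed:', exc)
```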

For further reasons why you should not parse XML with a regex, I can only yield to this superior answer:

RegEx match open tags except XHTML self-contained tags
