\"?" />

Regex: Skip/Ignore pattern

Given that the following string is embedded in text, how can I match the whole tag without the match stopping at the inner "<" and ">"?

<test type="yippie<innertext>" />


Being more specific, we need to handle both use cases below where "type" has or does not have "<" and ">" chars.

<h:test type="yippie<innertext>" />
<h:test type="yippie">

Group 1: 'h:test'
Group 2: ' type="yippie<innertext>" '  -or-  ' type="yippie"'   (ie, remaining content before ">" or "/>")

So far, I have something like this, but it's a little off: Group 2 stops at the first ">". I am tweaking the first part of Group 2's condition.


Thanks for your help.

Try this:

<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>

Example usage (Python):

>>> import re
>>> x = '<h:test type="yippie<innertext>" />'
>>> # Raw string so that \w and \s are passed to the regex engine unchanged.
>>> re.search(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>', x).groups()
('h:test', ' type="yippie<innertext>" ')
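
To check both use cases from the question at once, here is a minimal sketch that runs the same pattern over a longer string. The surrounding words in `text` and the names `TAG` and `matches` are made up for illustration; only the pattern and the two tags come from the thread.

```python
import re

# Same pattern as in the example above, compiled once.
# A double-quoted attribute value is consumed as a single unit,
# so any "<" or ">" inside the quotes cannot end the match.
TAG = re.compile(r'<([:\w]+)(\s(?:"[^"]*"|[^/>"])+)/?>')

# Hypothetical surrounding text containing both forms from the question.
text = 'before <h:test type="yippie<innertext>" /> middle <h:test type="yippie"> after'

# One (group 1, group 2) tuple per tag found in the text.
matches = [m.groups() for m in TAG.finditer(text)]

for name, rest in matches:
    print(repr(name), repr(rest))
```

Group 2 keeps the trailing space before "/>" in the self-closing form, so strip it if that matters for your use.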

Also note that if your document is HTML or XML then you should use an HTML or XML parser instead of trying to do this with regular expressions.
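
As a rough sketch of the parser route, assuming the input is HTML-like enough for a lenient parser, Python's standard `html.parser` module can pull out the tag name and attributes. The class name `TagCollector` and the sample `text` are hypothetical, chosen only for this illustration.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects a (tag, attributes) pair for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called for <tag ...>; the default handle_startendtag also
        # routes self-closing <tag ... /> through this method.
        self.tags.append((tag, dict(attrs)))


# Hypothetical surrounding text containing both forms from the question.
text = 'before <h:test type="yippie<innertext>" /> middle <h:test type="yippie"> after'

collector = TagCollector()
collector.feed(text)
collector.close()

for tag, attrs in collector.tags:
    print(tag, attrs)
```

Unlike the regex, this gives you the attribute value already unquoted, but you lose the raw text of the remainder of the tag, so it only fits if the attributes are what you are after.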

It looks like you are trying to parse XML/HTML with a regex. I would say that your approach is fundamentally wrong. A sufficiently advanced regex is not indistinguishable from an XML parser. After all, what if you needed to parse:

<test type="yippie<inner\"text\"_with_quotes,_literal_slash_and_quote\\\">" />

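Furthermore, you probably need to escape the inner < and > as &lt; and &gt;. Well-formed XML does not allow a literal < inside an attribute value, so a strict XML parser will reject the input as it stands.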
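
To illustrate the escaping, here is a small sketch using Python's standard library, assuming you control how the tag is produced. The variable names are hypothetical, and the unprefixed `test` element from the question is used because the `h:` prefix would need a namespace declaration before a strict parser accepts it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

raw_value = 'yippie<innertext>'

# quoteattr escapes &, < and > and wraps the result in quotes,
# so the value can be dropped straight into an attribute.
element_text = '<test type=%s />' % quoteattr(raw_value)
print(element_text)

# A strict XML parser now accepts the element and hands back
# the original, unescaped value.
parsed = ET.fromstring(element_text)
print(parsed.tag, parsed.get('type'))
```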

For further reasons why you should not parse XML with a regex, I can only yield to this superior answer:

RegEx match open tags except XHTML self-contained tags






