How do I ensure I always get a list of matches from Python's Regular Expressions?
I'm trying to pull some information (no recursion necessary) from a JSP page (malformed XML) similar to this:
<td>
<html:button ...></html:button>
<html:submit ...></html:submit></td>
And a regex:
<html:(button|submit|cancel)[\s\S]*?</html:(button|submit|cancel)>
re.findall() is giving me a list of tuples, like so:
[('button', 'button'), ('button', 'button')]
Which I understand from the documentation is correct, but I'm looking to get something more like:
["<html:button ...>","<html:button ...>"]
What is the appropriate way to get the outcome I expect?
Aside from the fact that a regex probably isn't the right tool for this, you need to put the part you want returned in a group using parentheses. If you want everything up to the closing </html:whatever> tag, then you want something like this:
(<html:(button|submit|cancel)[\s\S]*?)</html:(button|submit|cancel)>
If you just want the <html:button> part, use:
(<html:(button|submit|cancel)>)[\s\S]*?</html:(button|submit|cancel)>
For example, from
<html:button>foobar</html:submit>
you get:
('<html:button>', 'button', 'submit')
If you also want to get the foobar from above, use:
(<html:(button|submit|cancel)>)([\s\S]*?)</html:(button|submit|cancel)>
to get:
('<html:button>', 'button', 'foobar', 'submit')
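As a rough check, the three patterns above can be run through re.findall() against the same sample string. This is only a sketch: the variable names (sample, up_to_close, open_only, with_body) are made up for illustration and are not part of the original answer.

```python
import re

# The sample string used in the answer above.
sample = "<html:button>foobar</html:submit>"

# Group 1 holds everything up to (not including) the closing tag.
up_to_close = re.findall(
    r"(<html:(button|submit|cancel)[\s\S]*?)</html:(button|submit|cancel)>",
    sample,
)

# Group 1 holds only the opening tag.
open_only = re.findall(
    r"(<html:(button|submit|cancel)>)[\s\S]*?</html:(button|submit|cancel)>",
    sample,
)

# An extra group (group 3) captures the text between the tags.
with_body = re.findall(
    r"(<html:(button|submit|cancel)>)([\s\S]*?)</html:(button|submit|cancel)>",
    sample,
)

# With several groups, findall returns one tuple per match,
# so the whole-tag text is the first element of each tuple.
first_groups = [match[0] for match in open_only]

print(up_to_close)
print(open_only)
print(with_body)
print(first_groups)
```

Because every pair of parentheses is a capturing group, each match comes back as a tuple; picking element 0 is one way to keep only the outer group.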
Note that it is not, in general, possible to pair up opening and closing tags with a regex (in the example above, <html:button> is opened but </html:submit> closes it, and the pattern still matches). If you need to do that, use a proper parser.
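Your (button|submit|cancel) groups are capturing, so re.findall() returns the groups instead of the whole match. Make them non-capturing by adding ?: right after the opening parenthesis, like (?:...):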
>>> re.findall(r'<html:(?:button|submit|cancel)[\s\S]*?</html:(?:button|submit|cancel)>', TheHTMLWhichShouldntParseWithRegex)
['<html:button ...></html:button>', '<html:submit ...></html:submit>']
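For comparison, here is a small self-contained sketch that runs the capturing and non-capturing versions against the snippet from the question. It also shows re.finditer() with group(0), which is another way to get whole matches even when the pattern keeps its capturing groups. The variable names (page, capturing, non_capturing and so on) are illustrative only.

```python
import re

# The malformed snippet from the question, kept verbatim (including the "...").
page = """<td>
<html:button ...></html:button>
<html:submit ...></html:submit></td>"""

capturing = r"<html:(button|submit|cancel)[\s\S]*?</html:(button|submit|cancel)>"
non_capturing = r"<html:(?:button|submit|cancel)[\s\S]*?</html:(?:button|submit|cancel)>"

# Capturing groups: findall returns a tuple of groups for each match.
groups_only = re.findall(capturing, page)

# No capturing groups: findall returns the full text of each match.
whole_matches = re.findall(non_capturing, page)

# group(0) is always the entire match, regardless of how many groups exist.
via_finditer = [m.group(0) for m in re.finditer(capturing, page)]

print(groups_only)
print(whole_matches)
print(via_finditer)
```

The finditer() variant may be handy when you want the full match and the tag names at the same time, since each match object exposes both group(0) and the individual groups.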