What is the problem with this regex?
First, I'm not a regex expert, so I'm pretty sure I'm doing something wrong.
Here is my regular expression:
<(list)(\b[^>]*)>(<\1\b[^>]*>.*?<\/\1>|.)*?<\/\1>
This is the input string:
...
<list title="Lorem ipsum dolor sit amet, consectetur adipiscing elit...">
<li>
<list title="Lorem adipiscing...">
<li>Lorem ipsum dolor sit amet, consectetur adipiscing elit</li>
<li>Lorem ipsum dolor sit amet, consectetur adipiscing elit</li>
</list>
</li>
<li>
<list title="Lorem ipsum...">
<li>Lorem ipsum dolor sit amet, consectetur adipiscing elit</li>
</list>
</li>
<li>Lorem ipsum dolor sit amet, consectetur adipiscing elit
</li>
<li>Lorem ipsum dolor sit amet, consectetur adipiscing elit
</li>
</list>
...
I want to match the external <list> and capture all of its content, including the internal <list> elements, but when I try to read group \3 it is empty, although groups \1 and \2 are fine.
Any ideas would be very much appreciated.
This problem cannot be solved with a regular expression match. Seriously. I'm not just repeating the "don't parse HTML with regex" dogma; regular expressions are logically incapable of handling arbitrarily nested tags (which is why everyone says "don't parse HTML with regex").
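To see the nesting problem concretely, here is a small sketch in Python (my choice of language; the question does not say which regex engine is in use). The patterns and the two sample strings are simplified stand-ins, not the ones from the question. A lazy quantifier stops at the first closing tag, and a greedy one runs to the last closing tag, so neither one pairs tags up correctly.

```python
import re

# Simplified, hypothetical inputs: one nested list, and two sibling lists.
nested = "<list><list>x</list></list>"
siblings = "<list>a</list><list>b</list>"

# Lazy: stops at the FIRST </list>, so the outer list is cut short.
lazy = re.search(r"<list>(.*?)</list>", nested, re.S).group(1)

# Greedy: runs to the LAST </list>, so two separate lists are merged.
greedy = re.search(r"<list>(.*)</list>", siblings, re.S).group(1)

print(repr(lazy))    # '<list>x'
print(repr(greedy))  # 'a</list><list>b'
```

Neither result corresponds to the content of a single, complete list element.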
The best idea I can give you is to use an XML parser. If you insist on solving this problem using regular expressions, you will wind up writing your own recursive-descent parser anyway, so you might as well take advantage of the work others have done on that problem already.
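As a rough illustration of the parser route, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the markup is well-formed XML; the `SAMPLE` string is a shortened stand-in for the input in the question, and `outer_list` is a helper name made up for this example.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the markup in the question (assumed well-formed).
SAMPLE = """
<list title="Outer">
<li>
<list title="Inner A">
<li>one</li>
<li>two</li>
</list>
</li>
<li>
<list title="Inner B">
<li>three</li>
</list>
</li>
<li>four
</li>
</list>
"""


def outer_list(markup):
    """Return the outermost <list> element, or None if there is none."""
    root = ET.fromstring(markup.strip())
    if root.tag == "list":
        return root
    # First <list> descendant in document order.
    return root.find(".//list")


outer = outer_list(SAMPLE)

# Roughly what groups \2 and \3 were meant to capture:
title = outer.get("title")
content = "".join(ET.tostring(child, encoding="unicode") for child in outer)

# The nested lists are reachable too, at any depth.
nested_titles = [el.get("title") for el in outer.iter("list") if el is not outer]

print(title)          # Outer
print(nested_titles)  # ['Inner A', 'Inner B']
```

If the real input is HTML rather than strict XML, an HTML parser would be the better fit, but the idea is the same: let the parser track the nesting and then walk the tree.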