How do we create such a regular expression to extract data?
<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>
I'd like to create a regexp that safely matches these:
<br>Aggie<br>
<br>John<br>
<br>Mary<br>
<br><b>Peter</b><br>
It is possible that there are other tags (e.g. <i>, <strike>, etc.) between each pair of <br>, and they have to be collected just like <br><b>Peter</b><br>.
What should the regexp look like?
If you learn one thing on SO, let it be: "Do not parse HTML with a regex". Use an HTML parser.
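As a rough sketch of the parser route, here is one way it could look with Python's standard-library `html.parser`. The `BrSplitter` class is a hypothetical helper made up for this illustration. It returns the markup between `<br>` tags without the surrounding `<br>`, and it also returns the `<p>Hello world</p>` fragment, so you would still have to filter for the pieces you want.

```python
from html.parser import HTMLParser

html = ("<br>Aggie<br><br>John<br><p>Hello world</p>"
        "<br>Mary<br><br><b>Peter</b><br>")


class BrSplitter(HTMLParser):
    """Hypothetical helper: collects the markup found between <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []   # finished, non-empty segments
        self._current = []   # pieces of the segment being built

    def _flush(self):
        text = "".join(self._current)
        if text:
            self.segments.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._flush()    # a <br> ends the current segment
        else:
            # Keep other tags (<b>, <i>, ...) exactly as written.
            self._current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "br":
            self._current.append(f"</{tag}>")

    def handle_data(self, data):
        # Note: character references are decoded by default.
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()        # emit whatever follows the last <br>


parser = BrSplitter()
parser.feed(html)
parser.close()
segments = parser.segments
print(segments)
```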
<br>.*?<br> will match anything from one <br> tag to the closest following one.
The main problem with parsing HTML using regexes is that regexes can't handle arbitrarily nested structures. This is not a problem in your example.
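A quick check of that pattern in Python, using the sample string from the question (the `re.DOTALL` flag is my addition, in case the real markup spans several lines):

```python
import re

html = ("<br>Aggie<br><br>John<br><p>Hello world</p>"
        "<br>Mary<br><br><b>Peter</b><br>")

# Lazy match from one <br> to the nearest following <br>.
# re.DOTALL lets "." cross newlines if the markup is multi-line.
pattern = re.compile(r"<br>.*?<br>", re.DOTALL)

matches = pattern.findall(html)
for m in matches:
    print(m)
```

Each match consumes its closing `<br>`, so this relies on every item having its own opening and closing `<br>`, as in the sample. It also assumes the tags are written exactly as `<br>`; variants such as `<BR>` or `<br />` would need a looser pattern.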
Split the string at (<br>)+. You'll get empty strings at the beginning and the end of the result, so you need to remove them, too.
If you want to preserve the <br>, then this is not possible unless you know that there is one before and after each element in the result.
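A minimal sketch of the split approach in Python. I'm using a non-capturing group, `(?:<br>)+`, because Python's `re.split` would otherwise include the captured `<br>` in the result; other regex flavours may behave differently.

```python
import re

html = ("<br>Aggie<br><br>John<br><p>Hello world</p>"
        "<br>Mary<br><br><b>Peter</b><br>")

# Split on runs of one or more <br> and drop the empty strings
# that appear at the beginning and the end of the result.
parts = [p for p in re.split(r"(?:<br>)+", html) if p]
print(parts)
```

Note that the result also contains `<p>Hello world</p>`, since splitting cannot tell which `<br>` opens an item and which one closes it.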