
How do we create a regular expression to extract data like this?

<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>

I'd like to create a regexp that safely matches these:

<br>Aggie<br>
<br>John<br>
<br>Mary<br>
<br><b>Peter</b><br>

It is possible that there are other tags (e.g. <i>, <strike>, etc.) between each pair of <br>, and they have to be collected just like <br><b>Peter</b><br>.

What should the regexp look like?


If you learn one thing on SO, let it be: "Do not parse HTML with a regex". Use an HTML parser.
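
For illustration only, here is a rough sketch of the parser route using Python's standard `html.parser` module. The question does not name a language, so Python is an assumption, and the class name `BrSegmenter` and its `segments` attribute are made up for this example. It collects whatever markup sits between consecutive <br> tags and drops empty pieces.

```python
from html.parser import HTMLParser


class BrSegmenter(HTMLParser):
    """Hypothetical helper: collects the markup between consecutive <br> tags."""

    def __init__(self):
        super().__init__()
        self.segments = []  # finished, non-empty segments
        self._buf = []      # pieces of the segment currently being built

    def _flush(self):
        text = "".join(self._buf)
        if text:
            self.segments.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self._flush()  # a <br> ends the current segment
        else:
            # Keep the start tag exactly as it appeared in the input.
            self._buf.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "br":  # ignore the end-tag event produced by <br/>
            self._buf.append(f"</{tag}>")

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any text after the last <br>


html = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"
parser = BrSegmenter()
parser.feed(html)
parser.close()
print(parser.segments)
```

Note that this also returns `<p>Hello world</p>`, since it too sits between two <br> tags, and that character references such as `&amp;` are decoded rather than kept verbatim. Filter or adapt as needed.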


<br>.*?<br>

will match anything from one <br> tag to the closest following one.
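
As a quick check, the pattern can be tried against the sample string from the question. The snippet below is a minimal sketch in Python (an assumption, since the question names no language); `re.DOTALL` is added so that a segment may span line breaks.

```python
import re

html = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"

# Non-greedy: each match runs from one <br> to the nearest following <br>.
pattern = re.compile(r"<br>.*?<br>", re.DOTALL)
matches = pattern.findall(html)

for m in matches:
    print(m)
# <br>Aggie<br>
# <br>John<br>
# <br>Mary<br>
# <br><b>Peter</b><br>
```

Each match consumes both of its <br> delimiters, so this works on the sample because neighbouring items are separated by two <br> tags. With a single <br> between items, every second item would be skipped.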

The main problem with parsing HTML using regexes is that regexes can't handle arbitrarily nested structures. This is not a problem in your example, because <br> tags do not nest.


Split the string at (<br>)+. You'll get empty strings at the beginning and the end of the result, so you need to remove them, too.

If you want to preserve the <br>, then this is not possible unless you know that there is one before and after each element in the result.
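
The splitting approach might look like the following sketch, again assuming Python. A non-capturing group `(?:<br>)+` is used here because `re.split` would otherwise include the captured <br> text in its result.

```python
import re

html = "<br>Aggie<br><br>John<br><p>Hello world</p><br>Mary<br><br><b>Peter</b><br>"

# Split on runs of one or more <br>, then drop the empty strings
# produced at the beginning and the end of the result.
parts = [p for p in re.split(r"(?:<br>)+", html) if p]
print(parts)
# ['Aggie', 'John', '<p>Hello world</p>', 'Mary', '<b>Peter</b>']
```

Unlike the list in the question, the result also contains `<p>Hello world</p>`, because that fragment is delimited by <br> on both sides as well.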
