
Regex to pull out HTML items

Given the following HTML block, what would be the best regex pattern to produce the list below, while keeping the URL links in the Matches collection?

Abdominal Aortic Aneurysm see Aortic Aneurysm
Abdominal Pain
Abdominal Pregnancy see Ectopic Pregnancy
Abnormalities see Birth Defects
ABO Blood Groups see Blood and Blood Disorders

Abortion
About Your Medicines see Medicines; Over-the-Counter Medicines
ABPA see Aspergillosis
Abscess
Abuse see Child Abuse; Domestic Violence; Elder Abuse 

Here is the raw input:

<li><span class="formod5">&nbsp;</span></li>
<li class="item">Abdominal Aortic Aneurysm see <a href="http://www.nlm.nih.gov/medlineplus/aorticaneurysm.html">Aortic Aneurysm</a></li>
<li class="item"><a href="http://www.nlm.nih.gov/medlineplus/abdominalpain.html">Abdominal Pain</a></li>
<li class="item">Abdominal Pregnancy see <a href="http://www.nlm.nih.gov/medlineplus/ectopicpregnancy.html">Ectopic Pregnancy</a></li>
<li class="item">Abnormalities see <a href="http://www.nlm.nih.gov/medlineplus/birthdefects.html">Birth Defects</a></li>
<li class="item">ABO Blood Groups see <a href="http://www.nlm.nih.gov/medlineplus/bloodandblooddisorders.html">Blood and Blood Disorders</a></li> 
<li><span class="formod5">&nbsp;</span></li>
<li class="item"><a href="http://www.nlm.nih.gov/medlineplus/abortion.html">Abortion</a></li>
<li class="item">About Your Medicines see <a href="http://www.nlm.nih.gov/medlineplus/medicines.html">Medicines</a>; <a href="http://www.nlm.nih.gov/medlineplus/overthecountermedicines.html">Over-the-Counter Medicines</a></li>
<li class="item">ABPA see <a href="http://www.nlm.nih.gov/medlineplus/aspergillosis.html">Aspergillosis</a></li>
<li class="item"><a href="http://www.nlm.nih.gov/medlineplus/abscess.html">Abscess</a></li>
<li class="item">Abuse see <a href="http://www.nlm.nih.gov/medlineplus/childabuse.html">Child Abuse</a>; <a href="http://www.nlm.nih.gov/medlineplus/domesticviolence.html">Domestic Violence</a>; <a href="http://www.nlm.nih.gov/medlineplus/elderabuse.html">Elder Abuse</a></li> 
<li><span class="formod5">&nbsp;</span></li>

TIA


Ignore these DOM guys. They don’t know what they’re talking about, and even if they do, they haven’t answered your question, which is rude.

If all you’re really trying to do is strip the tags and leave the rest, which is what I believe you’re after, then for those particular tags up there, which don’t contain anything fancy, a simple substitution will do:

s/<.*?>//g;

and you’ll have to handle the entities too, for example by dropping the &nbsp; entity with

s/&nbsp;//g

On arbitrary HTML, you have to be a lot more careful than this of course, because you have <script> tags and <style> tags and CDATA sections and alt=">" and all that jazz, but on the sample you presented, this will work just fine.
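For anyone who wants to try this outside Perl, here is a rough Python sketch of the same two substitutions, run on three lines copied from the question. The helper name strip_tags is made up for illustration, and the href pattern at the end is my own assumption about how to keep the URLs; it is not something the substitutions above do. The same caveat applies: it only holds up on simple markup like this sample.

```python
import re

# Three lines copied from the raw input in the question.
sample = (
    '<li><span class="formod5">&nbsp;</span></li>\n'
    '<li class="item">Abdominal Aortic Aneurysm see '
    '<a href="http://www.nlm.nih.gov/medlineplus/aorticaneurysm.html">'
    'Aortic Aneurysm</a></li>\n'
    '<li class="item">'
    '<a href="http://www.nlm.nih.gov/medlineplus/abdominalpain.html">'
    'Abdominal Pain</a></li>\n'
)


def strip_tags(html):
    """Hypothetical helper: remove tags and &nbsp;, return non-empty lines."""
    text = re.sub(r"<.*?>", "", html)   # same idea as s/<.*?>//g
    text = text.replace("&nbsp;", "")   # same idea as s/&nbsp;//g
    # Spacer <li> rows collapse to empty lines, so drop those.
    return [line.strip() for line in text.splitlines() if line.strip()]


items = strip_tags(sample)

# Assumption: the URLs can be pulled separately with a simple href pattern,
# which works only because every href here is double-quoted.
links = re.findall(r'href="([^"]*)"', sample)

print(items)
print(links)
```

Each entry in items is one line of the wanted list, and links holds the URLs in document order.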

Don’t you have better ways of converting HTML to text than this, though?


Do not use regex for this kind of thing (you wouldn't reach for a hammer instead of a wrench to tighten a bolt, would you?). Use a tool built for the job: an HTML DOM parser such as http://simplehtmldom.sourceforge.net/ or something similar.
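To give an idea of what the parser route looks like, here is a minimal sketch. It uses Python's standard-library html.parser rather than the PHP library linked above, purely so the example is self-contained; the class name ItemParser and its fields are made up for illustration. It collects the text and the link URLs of every li with class "item" and skips the spacer rows.

```python
from html.parser import HTMLParser

# Two lines copied from the raw input in the question.
sample = (
    '<li><span class="formod5">&nbsp;</span></li>\n'
    '<li class="item">About Your Medicines see '
    '<a href="http://www.nlm.nih.gov/medlineplus/medicines.html">'
    'Medicines</a>; '
    '<a href="http://www.nlm.nih.gov/medlineplus/overthecountermedicines.html">'
    'Over-the-Counter Medicines</a></li>\n'
)


class ItemParser(HTMLParser):
    """Hypothetical helper: gathers (text, urls) per <li class="item">."""

    def __init__(self):
        super().__init__()
        self.items = []        # finished (text, [urls]) pairs
        self._in_item = False  # True while inside an item <li>
        self._text = []
        self._urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._in_item = True
            self._text, self._urls = [], []
        elif tag == "a" and self._in_item and attrs.get("href"):
            self._urls.append(attrs["href"])

    def handle_data(self, data):
        if self._in_item:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_item:
            self.items.append(("".join(self._text).strip(), self._urls))
            self._in_item = False


parser = ItemParser()
parser.feed(sample)
parser.close()

for text, urls in parser.items:
    print(text, urls)
```

This keeps each list entry paired with its own URLs, which is closer to what the question asks for than stripping the tags outright.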
