Regex to replace HTML links with plain-text URLs

2022-12-21 05:14

I need to replace links in HTML:

<a href="http://example.com"></a>

with just the plain-text URL:

http://example.com

UPD. Some clarification here: I need this to strip HTML tags from text while preserving link locations. It's purely for internal use, so there won't be any crazy edge cases. The language is Python in this case, but I don't see how that's relevant.

As I said before, if you are OK with some mistakes and/or have some control over the input, you can compromise on completeness and use a regex. Since your update says this is the case, here's a regex that should work for you:

/<a\s(?:.(?!=href))*?href="([^"]*)"[^>]*?>(.*?)</a>/gi

$1: The HREF
$2: Everything inside the tag.

This will handle all the test cases below except the last three lines:

Hello this is some text <a href="/test">This is a link</a> and this is some more text.
<a href="/test">Just a link on this line.</a>
There are <a href="/test">two links </a> on <a href="http://www.google.com">this line</a>!
Now we need to test some <a href="http://www.google.com" class="test">other attributes.</a>. They can be <a class="test" href="http://www.google.com">before</a> or after.
Or they can be <a rel="nofollow" href="http://www.google.com" class="myclass">both</a>
Also we need to deal with <a href="/test" class="myclass" style=""><span class="something">Nested tags and empty attributes</span></a>.
Make sure that we don't do anything with <a name="marker">anchors with no href</a>
Make sure we skip other <address href="/test">tags that start with a even if they are closed with an a</a>
Lastly try some other <a href="#">types</a> of <a href="">href</a> attributes.

Also we need to skip <a malformed tags.  </a>.  But <a href="#">this</a> is where regex fails us.
We will also fail if the user has used <a href='javascript:alert("the reason"))'>single quotes for some reason</a>
Other invalid HTML such as <a href="/link1" href="/link2">links with two hrefs</a> will have problems for obvious reasons.
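
Since the question mentions Python, here is a rough sketch of how the pattern above might be applied there. The pattern text is copied as-is from the answer; the names LINK_RE and replace_links are made up for illustration, and the /gi flags are approximated with re.IGNORECASE plus re.sub (which already replaces every match).

```python
import re

# Pattern copied verbatim from the answer above. As written, the lookahead
# only rejects a character followed by the literal "=href", so in practice
# this group behaves much like a lazy ".*?" that skips earlier attributes.
LINK_RE = re.compile(
    r'<a\s(?:.(?!=href))*?href="([^"]*)"[^>]*?>(.*?)</a>',
    re.IGNORECASE,  # the "i" flag; "g" is implied by re.sub
)


def replace_links(text):
    """Replace each matched <a ...>...</a> with its href (group 1)."""
    return LINK_RE.sub(r"\1", text)


if __name__ == "__main__":
    line = 'They can be <a class="test" href="http://www.google.com">before</a> or after.'
    print(replace_links(line))
```

Group 2 (the link text) is simply dropped here; use r"\2 (\1)" as the replacement instead if the link text should be kept next to the URL.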

>>> s = """blah <a href="http://example.com"></a> blah <a href="http://www.google.com">test</a>"""
>>> import re
>>> # Raw string so \s reaches the regex engine as-is; DOTALL lets .*? span newlines.
>>> pat = re.compile(r'<a\s+href="(.*?)">.*?</a>', re.DOTALL | re.I)
>>> pat.findall(s)
['http://example.com', 'http://www.google.com']
>>> pat.sub(r"\1", s)
'blah http://example.com blah http://www.google.com'

For more complex operations, use BeautifulSoup.
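
BeautifulSoup is a third-party package, so as a dependency-free illustration of the same parser-based idea, the sketch below swaps in the standard library's html.parser instead. It is not BeautifulSoup code. The class LinkFlattener and the function flatten_links are hypothetical names, and the behaviour (drop every tag, put the href where each link was, discard the link text) is an assumption based on the question's update.

```python
from html.parser import HTMLParser


class LinkFlattener(HTMLParser):
    """Hypothetical helper: strips tags and emits the href in place of each link."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []       # collected output fragments
        self.in_link = False  # True while inside an <a> that had an href

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self.parts.append(href)
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Text inside a replaced link is dropped; all other text is kept.
        if not self.in_link:
            self.parts.append(data)


def flatten_links(html):
    parser = LinkFlattener()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


if __name__ == "__main__":
    print(flatten_links('blah <a href="http://example.com"></a> blah <a href=\'/x\'>test</a>'))
```

Unlike the regexes above, this also copes with single-quoted href values, and anchors without an href keep their text.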

Instead of using a regex, you could try minidom and its unlink method.
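
One possible reading of the minidom suggestion, sketched under a big assumption: the input has to be well-formed XML (XHTML-style), because minidom is an XML parser and raises an error on typical sloppy HTML. The function name links_to_text and the wrapping <root> element are made up for this example.

```python
from xml.dom import minidom


def links_to_text(fragment):
    """Replace every <a href="..."> element in a well-formed fragment with its href."""
    # Wrap the fragment so it parses as a single XML document (assumption:
    # the fragment itself is well-formed).
    doc = minidom.parseString("<root>" + fragment + "</root>")
    # Copy the result list before changing the tree.
    for a in list(doc.getElementsByTagName("a")):
        href = a.getAttribute("href")  # empty string when the attribute is missing
        if not href:
            continue
        a.parentNode.replaceChild(doc.createTextNode(href), a)
        a.unlink()  # release the detached element
    result = "".join(child.toxml() for child in doc.documentElement.childNodes)
    doc.unlink()
    return result


if __name__ == "__main__":
    s = 'blah <a href="http://example.com"></a> blah <a href="http://www.google.com">test</a>'
    print(links_to_text(s))
```

Other tags are left in place here; only the links are replaced. Note that toxml() re-escapes characters such as & in the output.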
