
Optimizing a Python link-matching regular expression

I have a regular expression:

links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)

It finds links in some HTML, but it takes a very long time on certain pages. Any optimization advice?

One that it chokes on is http://freeyourmindonline.net/Blog/


Is there any reason you aren't using an HTML parser? With something like BeautifulSoup, you can get all the links without an ugly regex like that.


I'd suggest using BeautifulSoup for this task.
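As a minimal sketch of the BeautifulSoup approach (assuming beautifulsoup4 is installed; the sample HTML and the prefix filter mirroring the original regex's https?:// or / restriction are illustrative):

```python
from bs4 import BeautifulSoup

html = '<p><a href="https://example.com/page">Example</a> and <a href="/local">Local</a></p>'
soup = BeautifulSoup(html, "html.parser")

# Collect (href, link text) pairs for absolute and root-relative links only,
# matching the https?:// or / prefix filter from the original regex.
links = [(a["href"], a.get_text())
         for a in soup.find_all("a", href=True)
         if a["href"].startswith(("http://", "https://", "/"))]
print(links)
```

Because the parser tokenizes the document once instead of backtracking over it, this stays fast even on pages that make the nested-quantifier regex blow up.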


How about handling the hrefs more directly?

re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

That takes about 0.007 seconds, compared with your findall, which takes 38.694 seconds on my computer.
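A quick sketch of why this pattern is fast: every quantifier is bounded by a character class (`[^>]`, `[^"]`, `[^']`), so the engine never has to backtrack the way nested `(.+?)` groups do. The sample HTML below is illustrative:

```python
import re

# The pattern from the answer above: bounded character classes avoid the
# catastrophic backtracking that nested (.+?) groups can trigger on large pages.
re_href = re.compile(
    r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""",
    re.I)

html = '<a class="nav" href="https://example.com/page">Example</a>'
match = re_href.search(html)
print(match.group(1))  # the href attribute value, quotes included
```

Note that group 1 captures the value with its surrounding quotes, so you may want to strip them afterwards.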

