Optimizing a Python link-matching regular expression
I have a regular expression to find links in some HTML:

import re

links = re.compile('<a(.+?)href=(?:"|\')?((?:https?://|/)[^\'"]+)(?:"|\')?(.*?)>(.+?)</a>', re.I).findall(data)

It takes a very long time on certain pages. Any optimization advice?
One page it chokes on is http://freeyourmindonline.net/Blog/
Is there any reason you aren't using an HTML parser? With something like BeautifulSoup you can get all the links without an ugly regex like that.
I'd suggest using BeautifulSoup for this task.
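For example, a minimal sketch (assuming the page source is already in a string called data, as in the question; the bs4 import and the startswith filter that mirrors the original pattern's https?://-or-/ requirement are my additions):

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")

# keep only absolute http(s) links and root-relative paths,
# which is what the original regex accepted
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].startswith(("http://", "https://", "/"))]

Because the parser tokenizes the document once, its running time grows roughly linearly with page size instead of degrading badly on malformed markup.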
How about more direct handling of hrefs?
re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)
That takes about 0.007 seconds, compared with your findall, which takes 38.694 seconds on my computer. The speedup comes from keeping the match inside a single tag ([^>]* can never cross a >), whereas your pattern's (.+?)>(.+?)</a> has to scan ahead through the document for a closing </a> and backtracks heavily when anchors are malformed or unclosed.
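A rough sketch of how the comparison could be reproduced (the fetch step and timing code are my assumptions, not the original benchmark; be warned that running the question's pattern on that page may take tens of seconds):

import re
import time
import urllib.request

# fetch the problem page mentioned in the question
data = urllib.request.urlopen("http://freeyourmindonline.net/Blog/").read().decode("utf-8", "replace")

re_href = re.compile(r"""<\s*a(?:[^>]+?)href=("[^"]*(\\"[^"]*)*"|'[^']*(\\'[^']*)*'|[^\s>]*)[^>]*>""", re.I)

start = time.time()
matches = re_href.findall(data)
print(len(matches), "hrefs found in", time.time() - start, "seconds")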