Python: How do you use re to ignore links in parentheses?

2023-04-04 19:31 Q&A author:

The relevant part of the code is:

import re

# content holds the page HTML; group 1 captures the href value quoted with ' or ".
reargs = r'<a\s+href=[\'"](.*?)[\'"].*?>'
link = re.search(reargs, content, flags=re.IGNORECASE)

I'm building a crawler, and the web pages I'm working with have links in parentheses that I don't want. The text looks like this:

Foo foo foo foo (link) foo foo foo foo link foo foo foo foo (foo link foo) foo foo link foo foo link... and so on

If there can be multiple sets of nested parentheses like "((foo) link)", I don't think this is possible with regular expressions. In particular, note that parentheses can appear inside URLs (as on Wikipedia), so there may still be nested parentheses even if the text itself doesn't contain any. So, in the general case, I don't think this can be done with regex.

In order to solve it, I will assume you can have parentheses at most 1 level deep, and that no URLs contain parentheses.

The regex you're looking for is something like the following:

(\([^\)]*\)|[^\(<])*_link_

Where _link_ is a regular expression matching a link (the one in your question, though it might need some tweaking). To summarize the first part of my regex: it matches zero or more of either a parenthetical (an opening parenthesis, any characters other than ")", and a closing parenthesis) or a single character that is neither "(" nor "<". Because that prefix adds a capture group of its own, the URL is now in the second group (link.group(2) in your example).
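As a minimal sketch under the same assumptions (parentheses at most one level deep, no parentheses inside URLs), here is a variant of this idea rather than the exact regex above: alternate "a parenthetical" with "a link" in one pattern, let the parenthetical branch swallow anything inside parentheses, and keep only the matches where the link group was captured. The names PAREN, LINK, TOKEN_RE and links_outside_parens are made up for illustration, and the link pattern is adapted from the question.

```python
import re

# Hypothetical helper names; the link pattern is adapted from the question.
PAREN = r'\([^()]*\)'                      # one non-nested parenthetical
LINK = r'<a\s+href=[\'"](.*?)[\'"].*?>'    # group 1 captures the URL
TOKEN_RE = re.compile(PAREN + '|' + LINK, flags=re.IGNORECASE)


def links_outside_parens(content):
    """Return URLs of links that are not inside a parenthetical."""
    urls = []
    for m in TOKEN_RE.finditer(content):
        # When the parenthetical branch matched, group 1 is None: any link
        # inside those parentheses was consumed along with them.
        if m.group(1) is not None:
            urls.append(m.group(1))
    return urls


sample = (
    'foo (<a href="http://example.com/in">in</a>) foo '
    '<a href="http://example.com/out">out</a> '
    "(foo <a href='http://example.com/in2'>x</a> foo) "
    '<A HREF="http://example.com/out2">y</A>'
)
print(links_outside_parens(sample))
```

Because finditer scans left to right and never overlaps matches, a link that sits inside parentheses is never seen on its own. Nested parentheses, or a "(" with no closing ")", still fall outside what this sketch handles.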

In general, parsing HTML with regex is a bad idea. But because you asked, and the general question has merit (how to ignore cases where your match is surrounded by parentheses), I'll tell you what I think.

Now, because I don't know what your page looks like, I'll just say that, in general, you can exclude matches by adding [^x], where x is the character you don't want. The brackets define a character class, which matches any one of the characters listed inside it, and a leading ^ negates the class so that it matches any one character except those listed.

So you can exclude parentheses by surrounding your match string like this: [^(]foo[^)]. If there are other characters between the parentheses and your match, you'll have to account for that separately.
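To make the behavior of [^(]foo[^)] concrete, here is a small sketch on a made-up string. It also shows a lookaround form, which is an alternative and not part of the answer above: each negated class must consume one real character, so a match touching the start or end of the string is lost, while lookarounds only inspect the neighbors.

```python
import re

# Made-up sample text: "foo" at the start, in the middle, in parentheses, at the end.
text = "foo, a foo b (foo) c foo"

# Character-class form: each class consumes one neighboring character,
# so the matches at the very start and very end of the string are missed
# and the surviving match includes its neighbors.
char_class = re.findall(r'[^(]foo[^)]', text)

# Lookaround form (an alternative technique): the neighbors are checked
# but not consumed, so the matches at the string edges survive.
lookaround = re.findall(r'(?<!\()foo(?!\))', text)

print(char_class)   # [' foo ']
print(lookaround)   # ['foo', 'foo', 'foo']
```

Both forms look only at the characters immediately next to the match, so "(bar foo)" is not excluded by either one.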

With lxml you could do something like this:

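import re

import lxml.html

# Parse the page straight from the URL and collect every <a> element.
tree = lxml.html.parse("http://pastehtml.com/view/b7604in99.html")
links = tree.xpath("//a")

for link in links:
    # link.text is None when the anchor has no leading text.
    text = (link.text or "").strip()
    # Skip links whose text is wrapped in parentheses; print the rest.
    if not re.match(r'^\(.*\)$', text):
        print(link.get('href'))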
