Parsing html for domain links

I have a script that parses an HTML page for all the links within it. I am getting all of them fine, but I have a list of domains I want to compare them against. A sample list contains

list=['www.domain.com', 'sub.domain.com']

But I may have a list of links that look like

http://domain.com
http://sub.domain.com/some/other/page

I can strip off the http:// just fine, but both of the example links above should match: the first against www.domain.com in the list, and the second against the subdomain entry.

Right now I am using urllib2 for parsing the HTML. What are my options for completing this task?


You might consider stripping 'www.' from the list and doing something as simple as:

url = 'domain.com/'
for domain in list:
    if url.startswith(domain):
        ... do something ...

Or trying both won't hurt either, I suppose:

url = 'domain.com/'
for domain in list:
    domain_minus_www = domain
    if domain_minus_www.startswith('www.'):
        domain_minus_www = domain_minus_www[4:]
    if url.startswith(domain) or url.startswith(domain_minus_www):
        ... do something ...
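Alternatively, rather than stripping the scheme by hand, you could parse each link with the standard library and compare hostnames directly. A minimal sketch (using Python 3's `urllib.parse`; the helper name `matches` and the treatment of `www.` as equivalent to the bare domain are my assumptions, not from the original code):

```python
from urllib.parse import urlparse

domains = {'www.domain.com', 'sub.domain.com'}

def matches(url, domains):
    """Return True if url's hostname matches any listed domain,
    treating 'www.domain.com' and bare 'domain.com' as equivalent."""
    host = urlparse(url).hostname or ''
    # Build the host plus its www-variant so either form matches.
    if host.startswith('www.'):
        variants = {host, host[4:]}
    else:
        variants = {host, 'www.' + host}
    return bool(variants & domains)

print(matches('http://domain.com', domains))                      # True
print(matches('http://sub.domain.com/some/other/page', domains))  # True
print(matches('http://other.com/page', domains))                  # False
```

Using `urlparse().hostname` sidesteps edge cases like ports or userinfo in the URL that plain `startswith` checks on the raw string would mishandle.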
