Parsing HTML for domain links
I have a script that parses an HTML page for all the links within it. I am getting all of them fine, but I have a list of domains I want to compare them against. A sample list contains:
list = ['www.domain.com', 'sub.domain.com']
But I may have a list of links that look like:
http://domain.com
http://sub.domain.com/some/other/page
I can strip off the http:// just fine, but both of the example links above should match. I would like the first to match www.domain.com, and the second to match the subdomain entry (sub.domain.com) in the list.
Right now I am using urllib2 for parsing the HTML. What are my options for completing this task?
You might consider stripping 'www.' from the entries in the list and doing something as simple as:
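# `list` is the domain list from the question, with 'www.' already stripped
url = 'domain.com/'
for domain in list:
    if url.startswith(domain):
        pass  # do something with the matching domain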
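The stripping step itself is not shown, so here is a minimal sketch of how it might look. The helper name `strip_www` and the variable names `domains` and `normalized` are assumptions made for illustration; they do not appear in the question.

```python
def strip_www(domain):
    """Return the domain without a leading 'www.' (hypothetical helper)."""
    return domain[4:] if domain.startswith('www.') else domain

# Sample list from the question, renamed to avoid shadowing the built-in `list`
domains = ['www.domain.com', 'sub.domain.com']
normalized = [strip_www(d) for d in domains]

# A link with the 'http://' prefix already removed
url = 'domain.com/'
matches = [d for d in normalized if url.startswith(d)]
print(matches)
```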
Or trying both won't hurt either, I suppose:
url = 'domain.com/'
for domain in list:
    # Also try the domain without a leading 'www.'
    domain_minus_www = domain
    if domain_minus_www.startswith('www.'):
        domain_minus_www = domain_minus_www[4:]
    if url.startswith(domain) or url.startswith(domain_minus_www):
        pass  # do something with the matching domain
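One caveat with prefix matching is that a link such as domain.com.evil.example would also start with domain.com. As a possible alternative, the sketch below compares hostnames exactly instead of using startswith. It assumes Python 3 and the standard library's urllib.parse.urlsplit (the question's urllib2 is Python 2); `normalize_host` and `match_domain` are hypothetical helper names, not part of the original answer.

```python
from urllib.parse import urlsplit

def normalize_host(host):
    """Lowercase a hostname and drop a leading 'www.' (hypothetical helper)."""
    host = host.lower()
    return host[4:] if host.startswith('www.') else host

def match_domain(link, domains):
    """Return the entry of `domains` whose host equals the link's host, else None."""
    host = urlsplit(link).hostname  # None when the link has no '//' authority part
    if host is None:
        return None
    host = normalize_host(host)
    for domain in domains:
        if normalize_host(domain) == host:
            return domain
    return None

domains = ['www.domain.com', 'sub.domain.com']
print(match_domain('http://domain.com', domains))
print(match_domain('http://sub.domain.com/some/other/page', domains))
print(match_domain('http://domain.com.evil.example/', domains))
```

Note that this sketch expects full links including the scheme (for example http://), so the http:// prefix should not be stripped before calling it.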