Parsing HTML for domain links

2022-12-29 17:50

I have a script that parses an HTML page for all the links within it. I am getting all of them fine, but I have a list of domains I want to compare them against. So a sample list contains

list=['www.domain.com', 'sub.domain.com']

But I may have a list of links that look like

http://domain.com
http://sub.domain.com/some/other/page

I can strip off the http:// just fine, but both of the example links above should match. The first should match www.domain.com, and the second should match the subdomain in the list.

Right now I am using urllib2 for parsing the HTML. What are my options for completing this task?

You might consider stripping 'www.' from the list and doing something as simple as:

url = 'domain.com/'  # link with the http:// already stripped
for domain in list:  # the list with 'www.' stripped from each entry
    if url.startswith(domain):
        pass  # ... do something ...

Or trying both won't hurt either, I suppose:

url = 'domain.com/'  # link with the http:// already stripped
for domain in list:
    # try each entry both as written and without a leading 'www.'
    domain_minus_www = domain
    if domain_minus_www.startswith('www.'):
        domain_minus_www = domain_minus_www[4:]
    if url.startswith(domain) or url.startswith(domain_minus_www):
        pass  # ... do something ...
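
One thing to watch with a plain startswith check is that a link such as domain.com.example.org/ would also match 'domain.com'. If that matters, comparing whole hostnames may be safer. Below is a minimal sketch of that idea, assuming Python 3's urllib.parse (on Python 2 the same function lives in the urlparse module). The helper names normalize_host and match_domain are made up for illustration, and treating 'www.domain.com' as equivalent to 'domain.com' is an assumption taken from the question.

```python
from urllib.parse import urlparse


def normalize_host(host):
    """Lowercase a hostname and drop one leading 'www.'.

    Assumption: 'www.domain.com' and 'domain.com' should count as the same site.
    """
    host = host.lower()
    if host.startswith("www."):
        host = host[4:]
    return host


def match_domain(url, domains):
    """Return the entry of domains whose hostname equals the URL's, else None."""
    # urlparse only fills in the hostname when the URL has a scheme or
    # starts with '//', so prefix links whose scheme was already stripped.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname
    if host is None:
        return None
    host = normalize_host(host)
    for domain in domains:
        # Compare whole hostnames rather than string prefixes.
        if normalize_host(domain) == host:
            return domain
    return None


domains = ["www.domain.com", "sub.domain.com"]
print(match_domain("http://domain.com", domains))
print(match_domain("http://sub.domain.com/some/other/page", domains))
print(match_domain("http://domain.com.example.org/", domains))
```

With the sample list from the question, the first two links resolve to 'www.domain.com' and 'sub.domain.com' respectively, while the last one is rejected because its hostname only begins with 'domain.com'.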
