Partial matches in a dictionary

Assume I have the following dictionary mapping domain names to their human-readable descriptions:

domain_info = {"google.com" : "A Search Engine", 
               "facebook.com" : "A Social Networking Site", 
               "stackoverflow.com" : "Q&A Site for Programmers"}

I would like to get the description from response.url, which returns an absolute path like http://www.google.com/reader/view/.

My current approach:

url = urlparse.urlparse(response.url)
domain = url.netloc        # 'www.google.com'
domain = domain.split(".") # ['www', 'google', 'com']
domain = domain[-2:]       # ['google', 'com']
domain = ".".join(domain)  # 'google.com'
info = domain_info[domain]

seems to be too slow for a large number of invocations. Can anyone suggest an alternate way to speed things up?

An ideal solution would handle any subdomain and be case-insensitive.


What does "too slow for large number of operations" mean? It's still going to work in constant time (for each URL) and you can't get any better than that. The above seems to be a perfectly good way to do it.

If you need it to be a bit faster (though it won't be dramatically faster), you could write your own regex, something like "[a-zA-Z]+://([a-zA-Z0-9.]+)". That captures the full host, subdomains included, so you would still need to do the domain splitting unless you can use lookahead in the regex to get just the last two segments. Be sure to use re.compile so the pattern is only compiled once.
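A minimal sketch of that idea (the helper name is hypothetical, and note that the suggested character class would not match hosts containing hyphens):

import re

# Compile once at import time so repeated invocations skip compilation.
HOST_RE = re.compile(r'[a-zA-Z]+://([a-zA-Z0-9.]+)')

def extract_host(url):  # hypothetical helper, just for illustration
    m = HOST_RE.match(url)
    return m.group(1) if m else None

print extract_host('http://www.google.com/reader/view/')  # www.google.com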

Note that taking domain[-2:] is likely not going to be what you want. The logic of finding an appropriate "company level domain" is pretty complicated. For example, if the domain is google.com.au, this will give you "com.au", which is unlikely to be what you want -- you probably want "google.com.au".
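For illustration, the split/join approach on such a domain:

>>> ".".join("www.google.com.au".split(".")[-2:])
'com.au'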

Since you say an ideal solution would handle any subdomain, you probably want to iterate over all the splits:

url = urlparse.urlparse(response.url)
domain = url.netloc        # 'www.google.com'
domain = domain.split(".") # ['www', 'google', 'com']
info = None
for i in range(len(domain)):
    subdomain = ".".join(domain[i:]) # 'www.google.com', 'google.com', 'com'
    try:
        info = domain_info[subdomain]
        break
    except KeyError:
        pass

With the above code, you will find a match at any subdomain level. As for case sensitivity, that is easy: ensure all the keys in the dictionary are lowercase, and apply .lower() to the domain before all the other processing.
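Putting the loop and the lowercasing together, a sketch (the function name is just for illustration):

import urlparse

def lookup(url):  # illustrative helper
    # Lowercase first so matching is case-insensitive, then try
    # progressively shorter suffixes of the hostname.
    parts = urlparse.urlparse(url).netloc.lower().split(".")
    for i in range(len(parts)):
        suffix = ".".join(parts[i:])
        if suffix in domain_info:
            return domain_info[suffix]
    return None

print lookup('http://WWW.Google.COM/reader/view/')  # A Search Engine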


It seems that urlparse.py in the Python 2.6 standard library does quite a bit of work when you call urlparse(). It may be possible to speed things up by writing a little URL parser that does only what is absolutely necessary and no more.
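A minimal sketch of such a parser, assuming every input is an absolute URL with a scheme (anything less regular still needs the real urlparse):

def fast_netloc(url):  # hypothetical minimal parser
    # Skip past "scheme://", then take everything up to "/", "?" or "#".
    start = url.index("://") + 3
    for i in range(start, len(url)):
        if url[i] in "/?#":
            return url[start:i]
    return url[start:]

print fast_netloc('http://www.google.com/reader/view/')  # www.google.com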

UPDATE: see this part of Wikipedia's page about DNS for information on the syntax of domain names, it may give some ideas for the parser.


You may consider extracting the domain without subdomains using a regular expression:

'http://([^.]+\.)*([^.][a-zA-Z0-9-]+\.[a-zA-Z]{2,6})(/?|/.*)'

import re
m = re.search(r'http://([^.]+\.)*([^.][a-zA-Z0-9-]+\.[a-zA-Z]{2,6})(/?|/.*)',
              'http://www.google.com/asd?#a')
print m.group(2)  # google.com
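Note that this pattern has the same country-code pitfall mentioned above: for http://www.google.com.au/ it captures 'com.au', not 'google.com.au'.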


You can use some of the work that urlparse does. Try to look things up directly by the netloc it returns and only fall back on the split/join if you must:

def normalize( domain ):
    domain = domain.split(".") # ['www', 'google', 'com']
    domain = domain[-2:]       # ['google', 'com']
    return ".".join(domain)  # 'google.com'


# caches the netlocs that are not "normal"
aliases = {}

def getinfo( url ):
    netloc = urlparse.urlparse(url).netloc

    if netloc in aliases:
        return domain_info[aliases[netloc]]

    if netloc in domain_info:
        return domain_info[netloc]

    main = normalize(netloc)
    if main in domain_info:
        aliases[netloc] = main
        return domain_info[main]

Same thing with a caching lib:

from beaker.cache import CacheManager
netlocs = CacheManager(namespace='netloc')

@netlocs.cache()
def getloc( domain ):
    try:
        return domain_info[domain]
    except KeyError:
        domain = domain.split(".")
        domain = domain[-2:]
        domain = ".".join(domain)
        return domain_info[domain]

def getinfo( url ):
    netloc = urlparse.urlparse(url).netloc
    return getloc( netloc )

Maybe it helps a bit, but it really depends on the variety of URLs you have.
