Partial matches in a dictionary

Assume I have the following dictionary mapping domain names to their human-readable descriptions:

domain_info = {"google.com" : "A Search Engine", 
               "facebook.com" : "A Social Networking Site", 
               "stackoverflow.com" : "Q&A Site for Programmers"}

I would like to get the description from response.url, which returns an absolute path like http://www.google.com/reader/view/.

My current approach:

url = urlparse.urlparse(response.url)
domain = url.netloc        # 'www.google.com'
domain = domain.split(".") # ['www', 'google', 'com']
domain = domain[-2:]       # ['google', 'com']
domain = ".".join(domain)  # 'google.com'
info = domain_info[domain]

seems to be too slow for a large number of invocations. Can anyone suggest an alternate way to speed things up?

An ideal solution would handle any subdomain and be case-insensitive.


What does "too slow for large number of operations" mean? It's still going to work in constant time (for each URL) and you can't get any better than that. The above seems to be a perfectly good way to do it.

If you need it to be a bit faster (though it won't be dramatically faster), you could write your own regex, something like "[a-zA-Z]+://([a-zA-Z0-9.]+)". That captures the full host, subdomains included, so you would still need to do the domain splitting unless you can use lookahead in the regex to get just the last two segments. Be sure to use re.compile so the pattern is only compiled once.
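A minimal sketch of that idea (the helper name is hypothetical, and note that the suggested character class would not match hosts containing hyphens):

import re

# Compile once at import time so repeated invocations skip compilation.
HOST_RE = re.compile(r'[a-zA-Z]+://([a-zA-Z0-9.]+)')

def extract_host(url):  # hypothetical helper, just for illustration
    m = HOST_RE.match(url)
    return m.group(1) if m else None

print extract_host('http://www.google.com/reader/view/')  # www.google.com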

Note that taking domain[-2:] is likely not going to be what you want. The logic of finding an appropriate "company level domain" is pretty complicated. For example, if the domain is google.com.au, this will give you "com.au", which is unlikely to be what you want -- you probably want "google.com.au".
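For illustration, the split/join approach on such a domain:

>>> ".".join("www.google.com.au".split(".")[-2:])
'com.au'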

Since you say an ideal solution would handle any subdomain, you probably want to iterate over all the splits:

url = urlparse.urlparse(response.url)
domain = url.netloc        # 'www.google.com'
domain = domain.split(".") # ['www', 'google', 'com']
info = None
for i in range(len(domain)):
    subdomain = ".".join(domain[i:]) # 'www.google.com', 'google.com', 'com'
    try:
        info = domain_info[subdomain]
        break
    except KeyError:
        pass

With the above code, you will find a match at any subdomain level. As for case sensitivity, that is easy: ensure all the keys in the dictionary are lowercase, and apply .lower() to the domain before all the other processing.
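Putting the loop and the lowercasing together, a sketch (the function name is just for illustration):

import urlparse

def lookup(url):  # illustrative helper
    # Lowercase first so matching is case-insensitive, then try
    # progressively shorter suffixes of the hostname.
    parts = urlparse.urlparse(url).netloc.lower().split(".")
    for i in range(len(parts)):
        suffix = ".".join(parts[i:])
        if suffix in domain_info:
            return domain_info[suffix]
    return None

print lookup('http://WWW.Google.COM/reader/view/')  # A Search Engine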


It seems that urlparse.py in the Python 2.6 standard library does quite a bit of work when you call urlparse(). It may be possible to speed things up by writing a little URL parser that does only what is absolutely necessary and no more.
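A minimal sketch of such a parser, assuming every input is an absolute URL with a scheme (anything less regular still needs the real urlparse):

def fast_netloc(url):  # hypothetical minimal parser
    # Skip past "scheme://", then take everything up to "/", "?" or "#".
    start = url.index("://") + 3
    for i in range(start, len(url)):
        if url[i] in "/?#":
            return url[start:i]
    return url[start:]

print fast_netloc('http://www.google.com/reader/view/')  # www.google.com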

UPDATE: see this part of Wikipedia's page about DNS for information on the syntax of domain names, it may give some ideas for the parser.


You may consider extracting the domain without subdomains using a regular expression:

'http://([^.]+\.)*([^.][a-zA-Z0-9-]+\.[a-zA-Z]{2,6})(/?|/.*)'

import re
m = re.search(r'http://([^.]+\.)*([^.][a-zA-Z0-9-]+\.[a-zA-Z]{2,6})(/?|/.*)',
              'http://www.google.com/asd?#a')
print m.group(2)  # google.com
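Note that this pattern has the same country-code pitfall mentioned above: for http://www.google.com.au/ it captures 'com.au', not 'google.com.au'.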


You can use some of the work that urlparse does. Try to look things up directly by the netloc it returns and only fall back on the split/join if you must:

def normalize( domain ):
    domain = domain.split(".") # ['www', 'google', 'com']
    domain = domain[-2:]       # ['google', 'com']
    return ".".join(domain)  # 'google.com'


# caches the netlocs that are not "normal"
aliases = {}

def getinfo( url ):
    netloc = urlparse.urlparse(url).netloc

    if netloc in aliases:
        return domain_info[aliases[netloc]]

    if netloc in domain_info:
        return domain_info[netloc]

    main = normalize(netloc)
    if main in domain_info:
        aliases[netloc] = main
        return domain_info[main]

Same thing with a caching lib:

from beaker.cache import CacheManager
netlocs = CacheManager(namespace='netloc')

@netlocs.cache()
def getloc( domain ):
    try:
        return domain_info[domain]
    except KeyError:
        domain = domain.split(".")
        domain = domain[-2:]
        domain = ".".join(domain)
        return domain_info[domain]

def getinfo( url ):
    netloc = urlparse.urlparse(url).netloc
    return getloc( netloc )

Maybe it helps a bit, but it really depends on the variety of URLs you have.
