
Searching for a string in a list in Python

Hello, I have a string which contains a mail address, for example user@foo.bar.com, and I have a list which contains only domains ('bar.com', 'stackoverflow.com', etc.).

I want to check whether the list contains my string's domain. Right now I am using code like this:

if tokens[1].partition("@")[2] in domainlist:

tokens[1] contains the mail address and domainlist contains the domains. But as you can see, tokens[1].partition("@")[2] returns foo.bar.com, while my list only has the domain bar.com. How can I make this if statement evaluate to true? It also needs to be very fast, because hundreds of mail addresses will come in every second.


It should work like this:

if any(tokens[1].endswith(domain) for domain in domainlist): 


If speed is really an issue for you, you can look into methods like Aho-Corasick. There are plenty of implementations available, like esmre/esm http://code.google.com/p/esmre/

As pointed out by @Riccardo Galli, simple string matching will produce some false positives (for example, 'foobar.com' ends with 'bar.com'), so you can try esmre first, adding a suitable regex for each domain to the index, something like index.enter("(^|\.){0}$".format(domain))
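
The boundary idea behind that pattern can be tried with the standard re module before reaching for esmre. This is only a rough sketch, not esmre's API: the helper name domain_matches and the sample domainlist are assumptions made for illustration.

```python
import re

domainlist = ['bar.com', 'stackoverflow.com']  # sample data, assumed

# One alternation over all domains; re.escape keeps the dots literal.
# (?:^|\.) forces the match to begin at the start of the host or right after a dot.
_pattern = re.compile(
    r'(?:^|\.)(?:' + '|'.join(re.escape(d) for d in domainlist) + r')$'
)

def domain_matches(address):
    # Hypothetical helper: everything after the last "@" is treated as the host.
    host = address.rpartition('@')[2]
    return _pattern.search(host) is not None
```

With this pattern, user@foo.bar.com and user@bar.com match, while user@foobar.com does not.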


Unlike the other answers, here 'foo.com' does not also match '@y.afoo.com':

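def mailInDomains(mail, domains):
    # True if mail ends with a listed domain that starts right after '.' or '@'.
    for domain in domains:
        dLen = len(domain)
        # The length check keeps mail[-dLen-1] from raising IndexError on short input.
        if len(mail) > dLen and mail[-dLen:] == domain and mail[-dLen-1] in ('.', '@'):
            return True

    return False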
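
To see the difference from a plain endswith test, here is a small self-contained comparison. The names naive_match and boundary_match are made up for this sketch; boundary_match expresses the same '.'-or-'@' boundary rule with endswith instead of slicing.

```python
def naive_match(mail, domains):
    # Plain suffix test: also accepts hosts that merely end with the same characters.
    return any(mail.endswith(domain) for domain in domains)

def boundary_match(mail, domains):
    # Suffix test that also requires "." or "@" immediately before the domain.
    return any(mail.endswith('.' + domain) or mail.endswith('@' + domain)
               for domain in domains)

domains = ['foo.com']
print(naive_match('x@y.afoo.com', domains))     # True (a false positive)
print(boundary_match('x@y.afoo.com', domains))  # False
print(boundary_match('x@y.foo.com', domains))   # True
```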


First, make domainlist a set. Membership tests on a set are faster than on a list.

Second, add all 'superdomains' into this set, such as 'bar.com' for 'foo.bar.com'.

domainlist = ['foo.bar.com', 'bar2.com', 'foo3.bar3.foobar.com']
domainset = set()
for domain in domainlist:
    parts = domain.split('.')
    domainset.update('.'.join(parts[i:]) for i in range(len(parts)-1))

#domainset is now:
set(['bar.com',
     'bar2.com',
     'bar3.foobar.com',
     'foo.bar.com',
     'foo3.bar3.foobar.com',
     'foobar.com'])

And now you can test

if tokens[1].partition("@")[2] in domainset:


Hundreds of mail addresses per second should not be an issue. The following is a one-liner:

any(domain.endswith(d) for d in MY_DOMAINS)

Here, domain comes from user,sep,domain = address.rpartition('@'). Use rpartition rather than partition: your current method will fail for email addresses such as "B@tm4n"@something.com, which are valid according to RFC 5322 (https://www.rfc-editor.org/rfc/rfc5322).
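
A quick check of the two methods on that address shows where the split lands:

```python
address = '"B@tm4n"@something.com'

# partition splits at the first "@", which sits inside the quoted local part.
print(address.partition('@')[2])    # tm4n"@something.com

# rpartition splits at the last "@", leaving only the host.
print(address.rpartition('@')[2])   # something.com
```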

If performance becomes a factor, you can use a Trie (a kind of data structure). If performance is still a factor, you can use other tricks.
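
One way the trie idea could look, as a rough sketch: store each listed domain label by label from right to left, then walk the host's labels the same way. The names build_trie, host_in_trie and _END, and the sample domains, are assumptions made for illustration.

```python
_END = object()  # sentinel key marking "a listed domain ends here"

def build_trie(domains):
    # Each node is a dict keyed by label; domains are inserted right to left.
    root = {}
    for domain in domains:
        node = root
        for label in reversed(domain.lower().split('.')):
            node = node.setdefault(label, {})
        node[_END] = True
    return root

def host_in_trie(host, root):
    # Walk the labels from the right ("com", then "bar", ...) and succeed
    # as soon as a listed domain ends on a label boundary.
    node = root
    for label in reversed(host.lower().split('.')):
        node = node.get(label)
        if node is None:
            return False
        if _END in node:
            return True
    return False

trie = build_trie(['bar.com', 'stackoverflow.com'])
print(host_in_trie('foo.bar.com', trie))  # True
print(host_in_trie('foobar.com', trie))   # False
```

The work per lookup depends on the number of labels in the host, not on the number of listed domains.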

The any() one-liner goes through each element in the domains you're checking, so if you have 1000 domains in your list, you need to do 1000 comparisons for each email address. If this is an issue, you can do the following to get O(1) per set lookup, with one lookup per suffix of the address's domain (you also probably want to make sure you're not checking more than 5 suffixes, to protect yourself from maliciously crafted email addresses).

MY_DOMAINS = set(MY_DOMAINS)

def suffixes(domain):
    """
        suffixes('foo.bar.com') -yields-> ['foo.bar.com', 'bar.com', 'com']
    """
    while True:
        yield domain
        parts = domain.split('.',1)
        if len(parts) > 1:
            domain = parts[1]
        else:
            break
def isInList(address):
    user,sep,domain = address.rpartition('@')
    return any(suffix in MY_DOMAINS for suffix in suffixes(domain))
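
The cap on the number of suffixes could be added along these lines. This is a sketch under assumptions: MAX_SUFFIXES, is_in_list_capped and the sample MY_DOMAINS are illustrative names and data, and keeping the shortest suffixes (the rightmost labels) is one reading of the "5 suffixes" remark.

```python
MY_DOMAINS = {'bar.com', 'stackoverflow.com'}  # sample data, assumed
MAX_SUFFIXES = 5  # assumed cap, taken from the "5 suffixes" remark

def is_in_list_capped(address):
    user, sep, domain = address.rpartition('@')
    # rsplit with a maxsplit leaves everything further left in one chunk;
    # the slice then keeps only the rightmost MAX_SUFFIXES labels.
    labels = domain.rsplit('.', MAX_SUFFIXES)[-MAX_SUFFIXES:]
    # At most MAX_SUFFIXES set lookups, however many labels the host has.
    return any('.'.join(labels[i:]) in MY_DOMAINS for i in range(len(labels)))

print(is_in_list_capped('user@foo.bar.com'))               # True
print(is_in_list_capped('user@foobar.com'))                # False
print(is_in_list_capped('u@' + 'a.' * 1000 + 'bar.com'))   # True
```

A listed domain with more than MAX_SUFFIXES labels would never match under this cap, so the limit should be at least as large as the longest entry in the list.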