Searching for a string in a list in Python
Hello, I have a string which contains a mail address, for example user@foo.bar.com, and I have a list which contains only domains ('bar.com', 'stackoverflow.com', etc.).
I want to check whether the list contains my string's domain. Right now I am using code like this:
if tokens[1].partition("@")[2] in domainlist:
tokens[1] contains the mail address and domainlist contains the domains.
But as you can see, tokens[1].partition("@")[2] returns foo.bar.com, while my list has the domain bar.com.
How can I make this if statement return true? It should also be very fast, because hundreds of mail addresses will come in every second.
It should work like this:
if any(tokens[1].endswith(domain) for domain in domainlist):
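A minimal sketch of that check on the question's example. The sample values for tokens and domainlist are assumptions for illustration. It also shows that str.endswith accepts a tuple of suffixes, which avoids the explicit loop, and that plain suffix matching ignores label boundaries.

```python
# Sample values are assumptions for illustration; the names follow the question.
tokens = ["MAIL", "user@foo.bar.com"]
domainlist = ["bar.com", "stackoverflow.com"]

address = tokens[1]

# Generator form, as in the line above.
matched_any = any(address.endswith(domain) for domain in domainlist)

# Equivalent form: str.endswith accepts a tuple of suffixes.
matched_tuple = address.endswith(tuple(domainlist))

# Plain suffix matching does not respect label boundaries:
# "notbar.com" also ends with "bar.com".
false_positive = "user@notbar.com".endswith(tuple(domainlist))

print(matched_any, matched_tuple, false_positive)
```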
If speed is really an issue for you, you can look into methods like Aho-Corasick. There are plenty of implementations available, like esmre/esm: http://code.google.com/p/esmre/
As pointed out by @Riccardo Galli, simple string matching will produce some false positives (for example, 'notbar.com' ends with 'bar.com'), so you can try esmre first, adding the corresponding regexes to the index, something like index.enter("(^|\.){0}$".format(domain))
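The anchored pattern above can be tried without esmre by using the standard re module. This is only a sketch under stated assumptions: it swaps the esmre index for a single compiled alternation, escapes each domain with re.escape so the dots are literal, and the helper name domain_matches is hypothetical.

```python
import re

domainlist = ["bar.com", "stackoverflow.com"]  # sample values from the question

# (?:^|\.) requires the match to begin at the start of the host part or right
# after a dot; $ requires it to end at the end of the host part.
pattern = re.compile(
    r"(?:^|\.)(?:" + "|".join(re.escape(d) for d in domainlist) + r")$"
)

def domain_matches(address):
    """Return True if the host part is a listed domain or a subdomain of one."""
    host = address.rpartition("@")[2]
    return pattern.search(host) is not None

print(domain_matches("user@foo.bar.com"))
print(domain_matches("user@notbar.com"))
```

For a very large domain list, a single alternation may become slow to match, which is where an index such as esmre or a set-based lookup would be worth measuring.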
Unlike the other answers, here 'foo.com' does not also match '@y.afoo.com':
def mailInDomains(mail, domains):
    for domain in domains:
        # Match only on a label boundary: the domain must be preceded by '.' or '@'.
        if mail.endswith(('.' + domain, '@' + domain)):
            return True
    return False
First, make domainlist a set. It will be faster to check whether something is contained in it.
Second, add all 'superdomains' into this set, such as 'bar.com' for 'foo.bar.com':
domainlist = ['foo.bar.com', 'bar2.com', 'foo3.bar3.foobar.com']
domainset = set()
for domain in domainlist:
    parts = domain.split('.')
    # Add the domain itself and every parent domain except the bare TLD.
    domainset.update('.'.join(parts[i:]) for i in range(len(parts) - 1))
# domainset is now:
# set(['bar.com',
#      'bar2.com',
#      'bar3.foobar.com',
#      'foo.bar.com',
#      'foo3.bar3.foobar.com',
#      'foobar.com'])
And now you can test
if tokens[1].partition("@")[2] in domainset:
Hundreds of mail addresses should not be an issue. The following is a one-liner:
any(domain.endswith(d) for d in MY_DOMAINS)
Here, you can get domain with user,sep,domain = address.rpartition('@'). Use rpartition rather than partition; otherwise, your current method will fail for email addresses such as "B@tm4n"@something.com, which are valid according to https://www.rfc-editor.org/rfc/rfc5322
If performance becomes a factor, you can use a Trie (a kind of data structure). If performance is still a factor, you can use other tricks.
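As a rough illustration of the trie idea, here is a sketch that stores each domain as a path of labels in reverse order ('com', then 'bar'), so one walk over the address's labels answers the question. The names build_trie, in_trie and END are hypothetical, and lowercasing the input is an added assumption based on domain names being case-insensitive.

```python
END = object()  # sentinel key marking the end of a listed domain

def build_trie(domains):
    """Build a nested-dict trie keyed on domain labels, right to left."""
    root = {}
    for domain in domains:
        node = root
        for label in reversed(domain.lower().split(".")):
            node = node.setdefault(label, {})
        node[END] = True
    return root

def in_trie(trie, address):
    """True if the address's domain is a listed domain or a subdomain of one."""
    host = address.rpartition("@")[2].lower()
    node = trie
    for label in reversed(host.split(".")):
        node = node.get(label)
        if node is None:
            return False  # no listed domain continues with this label
        if END in node:
            return True   # a listed domain ends exactly on a label boundary
    return False

trie = build_trie(["bar.com", "stackoverflow.com"])
print(in_trie(trie, "user@foo.bar.com"))
print(in_trie(trie, "user@notbar.com"))
```

The cost of a lookup depends on the number of labels in the address, not on how many domains are in the list.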
The one-liner above goes through each element in the domains you're checking, so if you have 1000 domains in your list, you need to do 1000 comparisons for each email address. If this is an issue, you can do the following instead: generate the suffixes of the address's domain and test each one against a set, which is O(1) per lookup (you also probably want to make sure you're not checking more than 5 suffixes, to protect yourself from maliciously crafted email addresses).
MY_DOMAINS = set(MY_DOMAINS)

def suffixes(domain):
    """
    suffixes('foo.bar.com') -yields-> ['foo.bar.com', 'bar.com', 'com']
    """
    while True:
        yield domain
        parts = domain.split('.', 1)
        if len(parts) > 1:
            domain = parts[1]
        else:
            break

def isInList(address):
    user, sep, domain = address.rpartition('@')
    return any(suffix in MY_DOMAINS for suffix in suffixes(domain))
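One possible way to apply the suffix cap mentioned above is to look only at the last few labels of the domain, so a crafted address with thousands of labels costs a bounded number of set lookups. This is a sketch, not the answer's own code: MAX_LABELS, capped_suffixes and is_in_list_capped are hypothetical names, and the sample MY_DOMAINS values are assumed.

```python
MY_DOMAINS = {"bar.com", "stackoverflow.com"}  # sample values, assumed
MAX_LABELS = 5  # cap taken from the suggestion above; adjust as needed

def capped_suffixes(domain, max_labels=MAX_LABELS):
    """Yield suffixes built from at most the last max_labels labels."""
    labels = domain.split(".")[-max_labels:]
    for i in range(len(labels)):
        yield ".".join(labels[i:])

def is_in_list_capped(address):
    # rpartition keeps everything after the last '@' as the domain.
    user, sep, domain = address.rpartition("@")
    return any(suffix in MY_DOMAINS for suffix in capped_suffixes(domain))

print(is_in_list_capped("user@foo.bar.com"))
print(is_in_list_capped("user@notbar.com"))
```

Keeping the rightmost labels, rather than the first five suffixes, means a long but legitimate subdomain of a listed domain is still found.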