开发者

How does Python handle checking 'if object in list'

I'm wondering because I need to have have a function that is disgustingly fast at checking if a word is in a dictionary list - I'm considering leaving the dictionary as a large st开发者_开发百科ring and running regex against instead. This needs to be absurdly fast. So I just need a basic overview of how python handles checking if a string is in a list of strings and if its beyond-reasonable fast.


If you want a blazingly fast membership test, then a list is the wrong data structure. Take a look at the implementation of list_contains in listobject.c, line 437. It iterates over the list in order, comparing the item with each element in turn. The later the item appears in the list, the longer it will take to find it, and if the item is missing, then the whole list must be scanned.

Use a set instead. Sets are implemented internally by a hash table, so looking up an object involves computing its hash and then scanning a few table entries (usually just one). For the particular case of looking up a string, see set_lookkey_string in setobject.c, line 156.


A set of strings will have O(1) lookup time: effectively constant regardless of the size of the set. Making a set from your list of strings is easy:

my_set = set(my_list)
if my_word in my_set:
    print "it's there!"


If you need real fast checking, use a set:

words = set(words_list)
if "hello" in words:
    print("hello found!"")

A set is faster because it uses a hash-algorithm, instead of a direct search approach.


According to this site, x in s is O(n). Therefore, it checks each entry (in the worst case).

At any rate, do not use a regex. Using sets or lists is a much more intuitive way to represent the data and regexes will not perform better than O(n).


If you're using a regular list, consider a set instead.

If you want to implement your own fine-tuned membership test for your container object, override __contains__.


You probably want to use a Set if you're worried about time. A Set is much like a list, but it checks for membership based on hashing.


Use a set. If you need case-insensitive checking, just store the words into the set downcased. Then when checking if a certain word is in the set, downcase the word before checking membership.

The general rule is: normalize entries when building the set, and normalize an item before checking against the set. Another example of normalization is collapsing consecutive whitespace chars into a single space and stripping leading/trailing whitespace.


Running a regex against your word list is a very bad idea; it scales very badly. Using dict(), set() or frozenset() will scale a lot better:

s = set(['one','two','three'])
'two' in s     ## true

b='four'
b in s         ## false

s.add('four')
b in s         ## true
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜