Fastest way to update a dictionary & check for keys
I am building a dictionary of a very long string (~1G), where key is a fixed-length k-mer, and value is all the occurrence positions. When k is large (>9) it makes no sense to pre-build the k-mer dictionary, since not all values will occur & it inflates the table.
Currently I'm doing the task like this:
def hash_string(st, mersize):
    stsize = len(st)
    hash = {}
    r = stsize-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        if mer in hash:
            hash[mer].append(i)
        else:
            hash[mer] = [i]
    return hash
# test for function hash_st above        
mer3 = hash_string("ABCDABBBBBAAACCCCABCDDDD", 3) 
The most time consuming step (I did cProfile) is looking up if a key encountered (as we move along the string), is a new key, or if it already exists. Wh开发者_运维百科at is the fastest way to do this?
(I am currently testing out a two-pass strategy that avoids this step (which is much faster for large sequences), where I am first building a list of keys by simply over-writing doubles. And then I don't have to check for key existence -- I seed my dict with these keys, and then on the second pass simply do appends as I encounter them along.)
But I'd still be interested in knowing, to sum up, the fastest way to look up a dict key in Python, since this is a common pattern:
if key exists, append new entry, else, create key & add first element.
What's the fastest implementation of this pattern?
I would use collections.defaultdict:
import collections
...
hash = collections.defaultdict(list)
r = stsize-mersize+1
for i in range(0, r):
    mer = st[i:i+mersize]
    hash[mer].append(i)
though have never profiled it vs if ... else.
Generally, the method used will be dependent upon your data. I've constructed some simple tests that use different types of data to illustrate how the timings can change.
Strings used:
- The string in the question.
- A larger, pseudorandom string (supposedly with more distinct mers/keys in the hash).
- A string that has very few distinct mers/keys in the hash.
Here is some quick code to test the various methods (I've up-voted the defaultdict answer since it seems to be the fastest).
import random
from timeit import Timer
from collections import defaultdict
def test_check_first(st, mersize):
    """ Look for the existance of the mer in the dict.
    """
    mer_hash = {}
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        if mer in mer_hash:
            mer_hash[mer].append(i)
        else:
            mer_hash[mer] = [i]
    return mer_hash
def test_throw_exception(st, mersize):
    """ Catch the KeyError thown if a mer doesn't exist in the dict.
    """
    mer_hash = {}
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        try:
            mer_hash[mer].append(i)
        except KeyError:
            mer_hash[mer] = [i]
    return mer_hash
def test_defaultdict(st, mersize):
    """ Use a defaultdict.
    """
    mer_hash = defaultdict(list)
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        mer_hash[mer].append(i)
    return mer_hash
def test_dict_setdefault(st, mersize):
    """ Use dict's setdefault method
    """
    mer_hash = {}
    r = len(st)-mersize+1
    for i in range(0, r):
        mer = st[i:i+mersize]
        mer_hash.setdefault(mer, []).append(i)
    return mer_hash
def gen_pseudorandom_string(size):
    """ Generate a larger, more "random" string of data.
    """
    # only use four letters
    letters = ('A', 'B', 'C', 'D')
    return ''.join(random.choice(letters) for i in range(size))
if __name__ == '__main__':
    # test functions
    test_strings = ('ABCDABBBBBAAACCCCABCDDDD', gen_pseudorandom_string(1000), 'A'*100 + 'B'*100 + 'C'*100 + 'D'*100)
    mer_size = 3
    passes = 10000
    for string in test_strings:
        display_string = string if len(string) <= 30 else string[:30] + '...'
        print 'Testing with string: "' + display_string + '" and mer size: ' + str(mer_size) + ' and number of passes: ' + str(passes)
        t1 = Timer("test_check_first(string, mer_size)", "from __main__ import test_check_first, string, mer_size")
        print '\tResults for test_check_first: ',
        print "%.2f usec/pass" % (1000000 * t1.timeit(number=passes)/passes)
        t2 = Timer("test_throw_exception(string, mer_size)", "from __main__ import test_throw_exception, string, mer_size")
        print '\tResults for test_throw_exception: ',
        print "%.2f usec/pass" % (1000000 * t2.timeit(number=passes)/passes)
        t3 = Timer("test_defaultdict(string, mer_size)", "from __main__ import test_defaultdict, string, mer_size")    
        print '\tResults for test_defaultdict: ',
        print "%.2f usec/pass" % (1000000 * t3.timeit(number=passes)/passes)
        t4 = Timer("test_dict_setdefault(string, mer_size)", "from __main__ import test_dict_setdefault, string, mer_size")    
        print '\tResults for test_dict_setdefault: ',
        print "%.2f usec/pass" % (1000000 * t4.timeit(number=passes)/passes)
Here are the results that I got when I ran it on my machine:
Testing with string: "ABCDABBBBBAAACCCCABCDDDD" and mer size: 3 and number of passes: 10000
    Results for test_check_first:  8.70 usec/pass
    Results for test_throw_exception:  22.78 usec/pass
    Results for test_defaultdict:  10.61 usec/pass
    Results for test_dict_setdefault:  8.88 usec/pass
Testing with string: "BACDDDADAAABADBDADDBBBCAAABBBC..." and mer size: 3 and number of passes: 10000
    Results for test_check_first:  305.19 usec/pass
    Results for test_throw_exception:  320.62 usec/pass
    Results for test_defaultdict:  254.56 usec/pass
    Results for test_dict_setdefault:  342.55 usec/pass
Testing with string: "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA..." and mer size: 3 and number of passes: 10000
    Results for test_check_first:  114.23 usec/pass
    Results for test_throw_exception:  107.96 usec/pass
    Results for test_defaultdict:  94.11 usec/pass
    Results for test_dict_setdefault:  125.72 usec/pass
Dictionaries have the detdefault method which does what you want, but not sure how much faster it'll be.
So your new pattern can be:
hash.setdefault(mer, []).append(i)
You can also try the exception-based method:
# Python2-3 compatibility
try: xrange
except NameError: xrange= range
for i in xrange(r):
    mer = st[i:i+mersize]
    try: hash[mer].append(i)
    except KeyError: # not there
        hash[mer]= [i]
Note that this will be slower than your method if most of the times mer is not found, but faster if most of the times it is found. You know your data and can make your choice.
Also, it might be better not to mask builtins like hash.
 
         加载中,请稍侯......
 加载中,请稍侯......
      
精彩评论