
Python: fast dictionary of big int keys

I have got a list of more than 10,000 int items. The values of the items can be very large, up to 10^27. Now I want to create all pairs of the items and calculate their sums. Then I want to look for different pairs with the same sum.

For example:

l[0] = 4
l[1] = 3
l[2] = 6
l[3] = 1
...

pairs[10] = [(0,2)] # 10 is the sum of the values of l[0] and l[2]
pairs[7] = [(0,1), (2,3)] # 7 is the sum of the values of l[0] and l[1] or l[2] and l[3]
pairs[5] = [(0,3)]
pairs[9] = [(1,2)]
...

The content of pairs[7] is what I am looking for: it gives me two pairs with the same value sum.

I have implemented it as follows, and I wonder whether it can be done faster. Currently, for 10,000 items it takes more than 6 hours on a fast machine. (As I said, the values of l, and thus the keys of pairs, are ints up to 10^27.)

l = [4, 3, 6, 1]
pairs = {}
for i in range(len(l)):
    for j in range(i + 1, len(l)):
        s = l[i] + l[j]
        if s not in pairs:
            pairs[s] = []
        pairs[s].append((i, j))

# pairs = {9: [(1, 2)], 10: [(0, 2)], 4: [(1, 3)], 5: [(0, 3)], 7: [(0, 1), (2, 3)]}

Edit: I want to add some background, as asked by Simon Stelling.

The goal is to find Formal Analogies like

lays : laid :: says : said

within a list of words like

[ lays, lay, laid, says, said, foo, bar ... ]

I already have a function analogy(a,b,c,d) that returns True if a : b :: c : d. However, I would need to check all possible quadruples created from the list, which would be a complexity of around O((n^4)/2).

As a pre-filter, I want to use the char-count property: every char has the same count in (a,d) as in (b,c). For instance, in "layssaid" there are two a's, and so there are in "laidsays".

So the idea until now was

  • for every word, create a "char count vector" and represent it as an integer (the items in the list l); a sketch of such an encoding follows below
  • create all pairings in pairs and see if there are "pair clusters", i.e. more than one pair for a particular char count vector sum.
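
For illustration, here is a minimal sketch of one possible encoding. The helper char_count_code and the digit width are my assumptions, not the original code: each letter gets a fixed-width "digit" inside one big integer, so adding two word codes adds their per-letter counts.

# Hypothetical encoding sketch: each of the 26 letters a-z gets an
# 8-bit "digit" inside one big integer, so adding two codes adds
# the per-letter counts digit by digit (assuming no single count
# overflows 8 bits; the width used in the original is unknown).
def char_count_code(word, width=8):
    code = 0
    for c in word:
        code += 1 << (width * (ord(c) - ord('a')))
    return code

# the char-count property: (a, d) and (b, c) have the same summed code
a, b, c, d = "lays", "laid", "says", "said"
print char_count_code(a) + char_count_code(d) == \
      char_count_code(b) + char_count_code(c)   # True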

And it works, it's just slow. The complexity is down to around O((n^2)/2), but that is still a lot of pairs, and in particular the dictionary lookup and insert are performed that often.
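
To make the pipeline concrete, here is a hypothetical sketch of how a pair cluster could feed back into analogy(). The glue loop and the name words (an index-to-word list) are my assumptions, and only one candidate ordering per cluster pair is tried:

from itertools import combinations

# hypothetical glue code (assumed names: `words` maps index -> word,
# `analogy` is the existing a : b :: c : d test): for every sum with
# more than one pair, test each combination of two pairs
for s, plist in pairs.iteritems():
    if len(plist) < 2:
        continue
    for (i, j), (k, m) in combinations(plist, 2):
        # the char counts of (words[i], words[j]) equal those of
        # (words[k], words[m]), so this is one candidate ordering
        if analogy(words[i], words[k], words[m], words[j]):
            print words[i], words[k], words[m], words[j]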


There are trivial optimizations like caching constant values in local variables and using xrange instead of range:

pairs = {}
len_l = len(l)
for i in xrange(len_l):
    for j in xrange(i+1, len_l):
        s = l[i] + l[j]
        res = pairs.setdefault(s, [])
        res.append((i,j))
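
A small variant: collections.defaultdict (in the standard library since Python 2.5) avoids building the throwaway empty list that setdefault creates on every call, which can shave a bit more off the inner loop:

from collections import defaultdict

pairs = defaultdict(list)
len_l = len(l)
for i in xrange(len_l):
    for j in xrange(i + 1, len_l):
        # no setdefault: missing keys get a fresh list automatically
        pairs[l[i] + l[j]].append((i, j))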

However, it is probably far wiser not to pre-calculate the list and instead to optimize the method at the conceptual level. What is the intrinsic goal you want to achieve? Do you really just want to calculate what you do, or are you going to use that result for something else? What is that something else?


Just a hint: have a look at itertools.combinations.

This is not exactly what you are looking for (it stores pairs of values, not of indexes), but it can serve as a starting point:

from itertools import combinations

pairs = {}
for (a, b) in combinations(l, 2):
    pairs.setdefault(a + b, []).append((a, b))
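
If the indexes are needed, one possible variant (my sketch, not the answer's code) combines combinations with enumerate:

from itertools import combinations

# combinations over (index, value) tuples keeps the original indexes
pairs = {}
for (i, vi), (j, vj) in combinations(enumerate(l), 2):
    pairs.setdefault(vi + vj, []).append((i, j))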


The above comment from SimonStelling is correct: generating all possible pairs is just fundamentally slow, and there's nothing you can do about it aside from altering your algorithm. The function to use from itertools is product; you can get some minor improvements from not creating extra variables or doing unnecessary list indexing, but under the hood these are still all O(n^2). Here's how I would do it:

from itertools import product
l = [4, 3, 6, 1]
pairs = {}
# note: product(l, repeat=2) yields ordered pairs, so it also
# produces (x, x) and both (m, n) and (n, m); use combinations
# if only unordered pairs of distinct items are wanted
for (m, n) in product(l, repeat=2):
    pairs.setdefault(m + n, []).append((m, n))


Finally, I have come up with my own solution, which on average takes only half the calculation time.

The basic idea: instead of reading and writing into the growing dictionary n^2 times, I first collect all the sums in a list. Then I sort the list and look for neighbouring items with the same sum.

This is the code:

from operator import itemgetter

def getPairClusters(l):

    # first, we just store all possible pairs sequentially;
    # clustering happens later
    pairs = []

    for i in xrange(len(l)):
        for j in xrange(i + 1, len(l)):
            pairs.append((l[i] + l[j], i, j))
    pairs.sort(key=itemgetter(0))

    # pairs = [(4, 1, 3), (5, 0, 3), (7, 0, 1), (7, 2, 3), (9, 1, 2), (10, 0, 2)]
    # a list item of pairs now contains a tuple (like (4, 1, 3)) with
    # * the sum of two l items: 4
    # * the indexes of the two l items: 1, 3

    # now clustering starts
    # we want to find neighbouring items such as
    # (7, 0, 1), (7, 2, 3)
    # (since 7 == 7)

    pairClusters = []

    # flag if we are within a cluster
    # while iterating over the pairs list
    withinCluster = False

    # iterate over the pair list
    for i in xrange(len(pairs) - 1):
        if not withinCluster:
            if pairs[i][0] == pairs[i + 1][0]:
                # not within a cluster and found two neighbouring
                # equal sums: init a new cluster
                pairCluster = [(pairs[i][1], pairs[i][2])]
                withinCluster = True
        else:
            # still within a cluster: keep collecting
            pairCluster.append((pairs[i][1], pairs[i][2]))
            if pairs[i][0] != pairs[i + 1][0]:
                # cluster has ended (next neighbouring item
                # has a different sum)
                pairClusters.append(pairCluster)
                withinCluster = False

    # bug fix: a cluster reaching the end of the list was
    # previously dropped; close it here
    if withinCluster:
        pairCluster.append((pairs[-1][1], pairs[-1][2]))
        pairClusters.append(pairCluster)

    return pairClusters

l = [4, 3, 6, 1]

print getPairClusters(l)
# [[(0, 1), (2, 3)]]
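
For comparison, the same sort-then-cluster idea can be written more compactly with itertools.groupby; this is a sketch of mine, not the original code:

from itertools import groupby
from operator import itemgetter

def getPairClustersGroupby(l):
    # build and sort the (sum, i, j) triples, then group by sum
    pairs = sorted((l[i] + l[j], i, j)
                   for i in xrange(len(l))
                   for j in xrange(i + 1, len(l)))
    clusters = []
    for s, group in groupby(pairs, key=itemgetter(0)):
        cluster = [(i, j) for (_, i, j) in group]
        if len(cluster) > 1:  # keep only sums shared by >1 pair
            clusters.append(cluster)
    return clusters

print getPairClustersGroupby([4, 3, 6, 1])
# [[(0, 1), (2, 3)]]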
