开发者

Massive Python set of integers using too much memory

Setup

  • Python 2.6
  • Ubuntu x64

I have a set of unique integers with values between 1 and 50 million. New integers are added at random e.g. numberset.add(random.randint(1, 50000000)). I need to be able to quickly add new integers and quickly check if an integer is already present.

Problem

After a while, the set grows too large for my low memory system and I experience MemoryErrors.

Question

How can I achieve this while using less memory? What's the fastest way to do this using the disk without reconfiguring the system e.g. swapfiles? Should I use a d开发者_StackOverflow中文版atabase file like sqlite? Is there a library that will compress the integers in memory?


You can avoid dependencies on 3rd-party bit-array modules by writing your own -- the functionality required is rather minimal:

import array

BITS_PER_ITEM = array.array('I').itemsize * 8

def make_bit_array(num_bits, initially=0):
    num_items = (num_bits + BITS_PER_ITEM - 1) // BITS_PER_ITEM
    return array.array('I', [initially]) * num_items

def set_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    bit_array[item_index] |= 1 << bit_index

def clear_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    bit_array[item_index] &= ~(1 << bit_index)

def get_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    return (bit_array[item_index] >> bit_index) & 1


Use a bit-array.This will reduce the need for huge space requirement.

Realted SO Question:

  • Python equivalent to Java's BitSet


Use an array of bits as flags for each integer - the memory needed will be only 50 million bits (about 6 MB). There are a few modules that can help. This example uses bitstring, another option is bitarray:

from bitstring import BitArray
i = BitArray(50000000) # initialise 50 million zero bits
for x in xrange(100):
    v = random.randint(1, 50000000)
    if not i[v]: # Test if it's already present
        i.set(1, v) # Set a single bit

Setting and checking bits is very fast and it uses very little memory.


Try to use array module.


Depending on your requirements, you might also consider a bloom filter. It is a memory-efficient data structure for testing if an element is in a set. The catch is that it it can give false-positives, though it will never give false-negatives.


If integers are unique then use bits. Example: binary 01011111 means that there are: 1, 3, 4, 5, 6 and 7. This way every bit is used to check if its integer index is used (value 1) or not (value 0).

It was described in one chapter of "Programming Pearls" by Jon Bentley (look for "The file contains at most ten million records; each record is a seven-digit integer.")

It seems that there is bitarray module mentioned by Emil that works this way.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜