Massive Python set of integers using too much memory
Setup
- Python 2.6
- Ubuntu x64
I have a set of unique integers with values between 1 and 50 million. New integers are added at random e.g. numberset.add(random.randint(1, 50000000))
. I need to be able to quickly add new integers and quickly check if an integer is already present.
Problem
After a while, the set grows too large for my low memory system and I experience MemoryError
s.
Question
How can I achieve this while using less memory? What's the fastest way to do this using the disk without reconfiguring the system e.g. swapfiles? Should I use a d开发者_StackOverflow中文版atabase file like sqlite? Is there a library that will compress the integers in memory?
You can avoid dependencies on 3rd-party bit-array modules by writing your own -- the functionality required is rather minimal:
import array
BITS_PER_ITEM = array.array('I').itemsize * 8
def make_bit_array(num_bits, initially=0):
num_items = (num_bits + BITS_PER_ITEM - 1) // BITS_PER_ITEM
return array.array('I', [initially]) * num_items
def set_bit(bit_array, offset):
item_index = offset // BITS_PER_ITEM
bit_index = offset % BITS_PER_ITEM
bit_array[item_index] |= 1 << bit_index
def clear_bit(bit_array, offset):
item_index = offset // BITS_PER_ITEM
bit_index = offset % BITS_PER_ITEM
bit_array[item_index] &= ~(1 << bit_index)
def get_bit(bit_array, offset):
item_index = offset // BITS_PER_ITEM
bit_index = offset % BITS_PER_ITEM
return (bit_array[item_index] >> bit_index) & 1
Use a bit-array.This will reduce the need for huge space requirement.
Realted SO Question:
- Python equivalent to Java's BitSet
Use an array of bits as flags for each integer - the memory needed will be only 50 million bits (about 6 MB). There are a few modules that can help. This example uses bitstring, another option is bitarray:
from bitstring import BitArray
i = BitArray(50000000) # initialise 50 million zero bits
for x in xrange(100):
v = random.randint(1, 50000000)
if not i[v]: # Test if it's already present
i.set(1, v) # Set a single bit
Setting and checking bits is very fast and it uses very little memory.
Try to use array module.
Depending on your requirements, you might also consider a bloom filter. It is a memory-efficient data structure for testing if an element is in a set. The catch is that it it can give false-positives, though it will never give false-negatives.
If integers are unique then use bits. Example: binary 01011111
means that there are: 1, 3, 4, 5, 6 and 7. This way every bit is used to check if its integer index is used (value 1) or not (value 0).
It was described in one chapter of "Programming Pearls" by Jon Bentley (look for "The file contains at most ten million records; each record is a seven-digit integer.")
It seems that there is bitarray
module mentioned by Emil that works this way.
精彩评论