Hash+mapping or index+mapping to condense use of strings
I have ~200K named properties and ~25K files. I extract whether the properties hold or not for each file using Python as a set of properties that hold, one set for each file.
To do this extraction I might run hundreds of individual python extraction scripts on a compute farm, in parallel. Each leaving behind some representation of the set extraction from each of the files.
Further processing involves reading these 20K sets and working on the accumulated data. to generate a report on this s开发者_如何学Goet of files/properties.
One issue I have is that if I store the extracted set as text then the long property name strings and file name strings will get repeated wasting disk space and increasing parse time.
I was thinking of creating a central index of the sorted property names and just saving the index - same for the file names, as probably .py files to import.
An alternative to using an index into the sorted list of names would be to use the str.hash() value as the index which would mean probably faster processing, but I am worried about the possibility of two unequal strings ending up with the same hash() value. Could this happen?
I would be using the same Python executable and OS version on all machines.
Do you know the properties in advance? If you do than you might want to consider Perfect hashing (i.e. you can distribute the settings of the hash instead of the full list of properties/files).
A very crude (but possibly working way) would be having a few different hash functions (h1, h2...); start e.g. with the str.hash() and compute the hashes. If there are collisions, try using a tuple (h1(property), h2(property)) as a hash. If there are still collisions, use (h1(property), h2(property), h3(property)) - etc., until there are no collisions. The h_x functions can be actually somewhat configurable function and the recomended way would be to try some random hash functions.
However it seems to me that it might be overkill, just distributing the list of files/properties might be a lot easier...
Hashes can collide. You will have to consider that.
精彩评论