Parallelizable hashing algorithm where size and order of sub-strings are irrelevant
EDIT
Here is the problem I am trying to solve:
I have a string broken up into multiple parts. These parts are not of equal, or predictable length. Each part will have a hash value. When I concatenate parts I want to be able to use the hash values from each part to quickly get the hash value for the parts together. In addition the hash generated by putting the parts together must match the hash generated if the string were hashed as a whole.
Basically I want a hashing algorithm where the parts of the data being hashed can be hashed in parallel, and I do not want the order or length of the pieces to matter. I am not breaking up the string, but rather receiving it in unpredictable chunks in an unpredictable order.
I am willing to accept an elevated collision rate, so long as it is not too elevated. I am also OK with a slightly slower algorithm, as the cost is hardly noticeable on small strings and the work is done in parallel for large strings.
I am familiar with a few hashing algorithms; however, I currently have a use case for a hash algorithm with the property that the sum of the hashes of two items equals the hash of the two items concatenated.
Requirements/givens
- This algorithm will be hashing byte-strings with length of at least 1 byte
- hash("ab") = hash('a') + hash('b')
- Collisions between strings with the same characters in different order is ok
- Generated hash should be an integer of native size (usually 32/64 bits)
- String may contain any byte value from 0-255 (length is known, not \0 terminated)
- The ascii alpha-numeric characters will be by far the most used
- A disproportionate number of strings will be 1-8 ASCII characters
- A very tiny percentage of the strings will actually contain bytes with values at or above 127
If this is a type of algorithm that has terminology associated with it, I would love to know that terminology. If I knew what a proper term/name for this type of hashing algorithm was it would be much easier to google.
I am thinking the simplest way to achieve this is the following (a rough sketch in code follows the list):
- Any byte's hash should be its value, normalized to <128 (if ≥128, subtract 128)
- To get the hash of a string, normalize each byte to <128 and add it to the running hash
- Depending on the hash size I may need to limit how many characters are used in the hash to avoid overflow
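A minimal sketch of that normalize-and-sum idea in C, assuming an unsigned 64-bit accumulator; the function names are purely illustrative:

```c
#include <stddef.h>
#include <stdint.h>

/* Hash of a single byte under the proposed scheme: its value, folded below 128. */
static uint64_t byte_hash(unsigned char b)
{
    return (b >= 128) ? (uint64_t)(b - 128) : (uint64_t)b;
}

/* Hash of one chunk: the sum of its normalized byte values. */
uint64_t chunk_hash(const unsigned char *data, size_t len)
{
    uint64_t h = 0;
    for (size_t i = 0; i < len; i++)
        h += byte_hash(data[i]);
    return h;
}

/* Combining two chunk hashes is plain addition, so chunk order and size are irrelevant. */
uint64_t combine_hashes(uint64_t h1, uint64_t h2)
{
    return h1 + h2;
}
```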
I don't see anything wrong with just adding each (unsigned) byte value to create a hash that is simply the sum of all the characters. There is nothing wrong with having an overflow: even if you reach the 32/64-bit limit (and it would take an extremely long string to do so), the wrap-around into a negative number won't matter in 2's complement arithmetic. As this is a linear process, it doesn't matter how you split your string.
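A minimal sketch of that simpler approach, assuming an unsigned 64-bit accumulator so wrap-around is well defined (names are illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Plain additive hash: the sum of the unsigned byte values.
 * Unsigned arithmetic wraps on overflow, which does not break additivity. */
uint64_t sum_hash(const unsigned char *data, size_t len)
{
    uint64_t h = 0;
    for (size_t i = 0; i < len; i++)
        h += data[i];
    return h;
}

int main(void)
{
    const unsigned char *s = (const unsigned char *)"hello world";
    size_t len = strlen((const char *)s);

    /* Hash the whole string, then hash two arbitrary pieces and add the results:
     * the values are identical, so chunks can be hashed in parallel and
     * combined in any order. */
    uint64_t whole = sum_hash(s, len);
    uint64_t split = sum_hash(s, 4) + sum_hash(s + 4, len - 4);

    printf("%llu %llu\n", (unsigned long long)whole, (unsigned long long)split);
    return 0;
}
```

Because addition is associative and commutative (modulo 2^64 here), any partition of the string into chunks produces the same combined hash.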