What is the default hash code that Mathematica uses?
The o开发者_如何学Cnline documentation says
Hash[expr]
gives an integer hash code for the expression expr.
Hash[expr,"type"]
gives an integer hash code of the specified type for expr.
It also gives "possible hash code types":
- "Adler32" Adler 32-bit cyclic redundancy check
- "CRC32" 32-bit cyclic redundancy check
- "MD2" 128-bit MD2 code
- "MD5" 128-bit MD5 code
- "SHA" 160-bit SHA-1 code
- "SHA256" 256-bit SHA code
- "SHA384" 384-bit SHA code
- "SHA512" 512-bit SHA code
Yet none of these correspond to the default returned by Hash[expr]
.
So my questions are:
- What method does the default
Hash
use? - Are there any other hash codes built in?
The default hash algorithm is, more-or-less, a basic 32-bit hash function applied to the underlying expression representation, but the exact code is a proprietary component of the Mathematica kernel. It's subject to (and has) change between Mathematica versions, and lacks a number of desirable cryptographic properties, so I personally recommend you use MD5 or one of the SHA variants for any serious application where security matters. The built-in hash is intended for typical data structure use (e.g. in a hash table).
The named hash algorithms you list from the documentation are the only ones currently available. Are you looking for a different one in particular?
I've been doing some reverse engeneering on 32 and 64 bit Windows version of Mathematica 10.4 and that's what I found:
32 BIT
It uses a Fowler–Noll–Vo hash function (FNV-1, with multiplication before) with 16777619 as FNV prime and 84696351 as offset basis. This function is applied on Murmur3-32 hashed value of the address of expression's data (MMA uses a pointer in order to keep one instance of each data). The address is eventually resolved to the value - for simple machine integers the value is immediate, for others is a bit trickier. The Murmur3-32 implementing function contains in fact an additional parameter (defaulted with 4, special case making behaving as in Wikipedia) which selects how many bits to choose from the expression struct in input. Since a normal expression is internally represented as an array of pointers, one can take the first, the second etc.. by repeatedly adding 4 (bytes = 32 bit) to the base pointer of the expression. So, passing 8 to the function will give the second pointer, 12 the third and so on. Since internal structs (big integers, machine integers, machine reals, big reals and so on) have different member variables (e.g. a machine integer has only a pointer to int, a complex 2 pointers to numbers etc..), for each expression struct there is a "wrapper" that combine its internal members in one single 32-bit hash (basically with FNV-1 rounds). The simplest expression to hash is an integer.
The murmur3_32()
function has 1131470165 as seed, n=0 and other params as in Wikipedia.
So we have:
hash_of_number = 16777619 * (84696351 ^ murmur3_32( &number ))
with " ^ " meaning XOR.
I really didn't try it - pointers are encoded using WINAPI EncodePointer()
, so they can't be exploited at runtime. (May be worth running in Linux under Wine with a modified version of EncodePonter
?)
64 BIT
It uses a FNV-1 64 bit hash function with 0xAF63BD4C8601B7DF as offset basis and 0x100000001B3 as FNV prime, along with a SIP64-24 hash (here's the reference code) with the first 64 bit of 0x0AE3F68FE7126BBF76F98EF7F39DE1521 as k0 and the last 64 bit as k1. The function is applied to the base pointer of the expression and resolved internally. As in 32-bit's murmur3, there is an additional parameter (defaulted to 8) to select how many pointers to choose from the input expression struct. For each expression type there is a wrapper to condensate struct members into a single hash by means of FNV-1 64 bit rounds.
For a machine integer, we have:
hash_number_64bit = 0x100000001B3 * (0xAF63BD4C8601B7DF ^ SIP64_24( &number ))
Again, I didn't really try it. Could anyone try?
Not for the faint-hearted
If you take a look at their notes on internal implementation, they say that "Each expression contains a special form of hash code that is used both in pattern matching and evaluation."
The hash code they're referring to is the one generated by these functions - at some point in the normal expression wrapper function there's an assignment that puts the computed hash inside the expression struct itself.
It would certainly be cool to understand HOW they can make use of these hashes for pattern matching purpose. So I had a try running through the bigInteger wrapper to see what happens - that's the simplest compound expression. It begins checking something that returns 1 - dunno what. So it executes
var1 = 16777619 * (67918732 ^ hashMachineInteger(1));
with hashMachineInteger() is what we said before - including values.
Then it reads the length in bytes of the bigInt from the struct (bignum_length
) and runs
result = 16777619 * (v10 ^ murmur3_32(v6, 4 * v4));
Note that murmur3_32()
is called if 4 * bignum_length
is greater than 8 (may be related to the max value of machine integers $MaxMachineNumber
2^32^32
and by converse to what a bigInt is supposed to be).
So, the final code is
if (bignum_length > 8){
result = 16777619 * (16777619 * (67918732 ^ ( 16777619 * (84696351 ^ murmur3_32( 1, 4 )))) ^ murmur3_32( &bignum, 4 * bignum_length ));
}
I've made some hypoteses on the properties of this construction. The presence of many XORs and the fact that 16777619 + 67918732 = 84696351
may make one think that some sort of cyclic structure is exploited to check patterns - i.e. subtracting the offset and dividing by the prime, or something like that. The software Cassandra uses the Murmur hash algorithm for token generation - see these images for what I mean with "cyclic structure". Maybe various primes are used for each expression - must still check.
Hope it helps
It seems that Hash calls the internal Data`HashCode function, then divides it by 2, takes the first 20 digits of N[..] and then the IntegerPart, plus one, that is:
IntegerPart[N[Data`HashCode[expr]/2, 20]] + 1
精彩评论