Is it a good approach to generate hash codes?
I have to write a hash function under the following two conditions:
- I don't know anything about the Object o that is passed to the method: it can be a String, an Integer, or an actual custom object;
- I am not allowed to call hashCode() at all.
The approach I am using now to calculate the hash code:
- Write the object to a byte stream;
- Convert the byte stream to a byte array;
- Loop through the byte array and calculate the hash by doing something like this: hash = hash * PRIME + byteArray[i]
My question: is this a passable approach, and is there a way to improve it? Personally I feel the scope of this function is too broad, since there is no information about what the objects are, but I have little say in this situation.
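For concreteness, the three steps above might look roughly like this in Java. This is only a sketch and not the asker's actual code: the class name SerializedHash, the helper names, and the value PRIME = 31 are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

// Hypothetical helper class; the name and the PRIME value are illustrative only.
class SerializedHash {
    private static final int PRIME = 31;

    // Steps 1 and 2: serialize the object and collect the bytes in an array.
    static byte[] toBytes(Object o) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            // Fails with NotSerializableException (an IOException) for
            // objects that do not implement Serializable.
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Step 3: fold the bytes into an int; overflow simply wraps around.
    static int hash(Object o) {
        int hash = 0;
        for (byte b : toBytes(o)) {
            hash = hash * PRIME + b;
        }
        return hash;
    }

    public static void main(String[] args) {
        System.out.println(hash("hello"));
        System.out.println(hash(42));
    }
}
```

Note that hashCode() is never called here; the hash depends only on the serialized bytes.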
You could use HashCodeBuilder.reflectionHashCode (from Apache Commons Lang) instead of implementing your own solution.
The serialization approach only works for objects that are in fact serializable, so it cannot really cover all types of objects.
Also, this compares objects by whether they have equivalent object graphs, which is not necessarily the same as being equal by .equals().
For example, StringBuilder objects created by the same code (with the same data) will have equal ObjectOutputStream output (and therefore an equal hash), while b1.equals(b2) is false; and an ArrayList and a LinkedList with the same elements will register as different, while list1.equals(list2) is true.
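The mismatch between serialized form and .equals() can be checked with a small experiment. The sketch below uses a hypothetical class GraphVersusEquals with illustrative helpers serialize and sameBytes; it only compares the serialized bytes, which is what any hash built on them sees.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Hypothetical demo class; all names are illustrative.
class GraphVersusEquals {
    static byte[] serialize(Object o) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // True when both objects serialize to exactly the same bytes.
    static boolean sameBytes(Object a, Object b) {
        return Arrays.equals(serialize(a), serialize(b));
    }

    public static void main(String[] args) {
        StringBuilder b1 = new StringBuilder("abc");
        StringBuilder b2 = new StringBuilder("abc");
        // Same serialized form, but StringBuilder does not override equals().
        System.out.println("builders, same bytes: " + sameBytes(b1, b2));
        System.out.println("builders, equals():   " + b1.equals(b2));

        List<String> list1 = new ArrayList<>(Arrays.asList("a", "b"));
        List<String> list2 = new LinkedList<>(Arrays.asList("a", "b"));
        // Equal as lists, but the streams name different classes.
        System.out.println("lists, same bytes: " + sameBytes(list1, list2));
        System.out.println("lists, equals():   " + list1.equals(list2));
    }
}
```

Whether this matters depends on how the hash is used: a hash table that relies on equals() needs equal objects to hash equally, which the second case violates.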
You can avoid the step of converting the byte stream to an array by creating a custom HashOutputStream, which simply takes the byte data and hashes it instead of saving it as an array for later iteration.
import java.io.OutputStream;

class HashOutputStream extends OutputStream {
    private static final int PRIME = 13;
    private int hash;

    // All the other write methods of OutputStream delegate to this one.
    @Override
    public void write(int b) {
        this.hash = this.hash * PRIME + b;
    }

    // Returns the hash of all bytes written so far.
    public int getHash() {
        return hash;
    }
}
Then wrap your ObjectOutputStream around an object of this class.
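A minimal sketch of that wrapping follows. The HashOutputStream class is repeated so the example compiles on its own; the StreamingHash class and its hash method are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Repeated from the answer so that this example is self-contained.
class HashOutputStream extends OutputStream {
    private static final int PRIME = 13;
    private int hash;

    @Override
    public void write(int b) {
        this.hash = this.hash * PRIME + b;
    }

    public int getHash() {
        return hash;
    }
}

// Hypothetical wrapper; the name is illustrative.
class StreamingHash {
    static int hash(Object o) {
        HashOutputStream sink = new HashOutputStream();
        // Closing the ObjectOutputStream flushes any buffered bytes into the sink.
        try (ObjectOutputStream out = new ObjectOutputStream(sink)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.getHash();
    }

    public static void main(String[] args) {
        System.out.println(hash("hello"));
    }
}
```

No intermediate byte array is allocated; each byte is folded into the hash as it is written.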
Instead of your y = y*13 + x method, you might look at other checksum algorithms. For example, java.util.zip contains Adler32 (used in the zlib format) and CRC32 (used in the gzip format).
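One way to plug these checksums into the serialization approach is java.util.zip.CheckedOutputStream, which updates a Checksum as bytes pass through it. The sketch below is an assumption about how this could be wired up; ChecksumHash, NullSink, and the method names are hypothetical, and truncating the 32-bit checksum (returned as a long) to an int is a choice made here, not something the answer prescribes.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;
import java.util.zip.CheckedOutputStream;
import java.util.zip.Checksum;

// Hypothetical helper class; all names are illustrative.
class ChecksumHash {
    // Discards every byte; only the checksum is of interest.
    private static final class NullSink extends OutputStream {
        @Override
        public void write(int b) {
            // intentionally empty
        }
    }

    // Serializes the object through a CheckedOutputStream and returns the checksum value.
    static long checksum(Object o, Checksum algorithm) {
        CheckedOutputStream checked = new CheckedOutputStream(new NullSink(), algorithm);
        try (ObjectOutputStream out = new ObjectOutputStream(checked)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return algorithm.getValue();
    }

    static int crc32Hash(Object o) {
        return (int) checksum(o, new CRC32());
    }

    static int adler32Hash(Object o) {
        return (int) checksum(o, new Adler32());
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(crc32Hash("hello")));
        System.out.println(Integer.toHexString(adler32Hash("hello")));
    }
}
```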
How about hash = (hash * PRIME + byteArray[i]) % MODULO?
Also, while you're at it, if you want to avoid collisions as much as possible, you can use a standardized hash function in step 3, such as SHA-2 (a cryptographic one if intentional collisions are an issue).
Have a look at DigestInputStream (or its counterpart DigestOutputStream, since the object is being written), which also spares you step 2.
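As a sketch of the digest idea on the writing side, the example below uses java.security.DigestOutputStream, the output counterpart of DigestInputStream, with SHA-256. The class DigestHash, the NullSink helper, and the decision to take the first four digest bytes as the int hash are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class; all names are illustrative.
class DigestHash {
    // Discards every byte; only the digest is of interest.
    private static final class NullSink extends OutputStream {
        @Override
        public void write(int b) {
            // intentionally empty
        }
    }

    // Serializes the object and returns the SHA-256 digest of the stream.
    static byte[] sha256(Object o) {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new DigestOutputStream(new NullSink(), digest))) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return digest.digest();
    }

    // Assumption: use the first four bytes of the digest as the int hash.
    static int hash(Object o) {
        return ByteBuffer.wrap(sha256(o)).getInt();
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(hash("hello")));
    }
}
```

A cryptographic digest is considerably slower than the multiply-and-add loop, so it is mainly worth it when deliberate collisions are a concern.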
Take a look at Bob Jenkins's article on non-cryptographic hashing. He walks through a number of approaches and discusses their strengths, weaknesses, and tradeoffs between speed and the probability of collisions.
If nothing else, it will allow you to justify your algorithm decision. Explain to your instructor why you chose speed over correctness or vice versa.
As a starting point, try his One-at-a-time hash:
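/* ub4 is the article's unsigned 4-byte integer type; mask is defined
   outside this function and reduces the result to the table size. */
ub4 one_at_a_time(char *key, ub4 len)
{
    ub4 hash, i;
    for (hash=0, i=0; i<len; ++i)
    {
        hash += key[i];
        hash += (hash << 10);
        hash ^= (hash >> 6);
    }
    hash += (hash << 3);
    hash ^= (hash >> 11);
    hash += (hash << 15);
    return (hash & mask);
}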
It's simple, but does surprisingly well against more complex algorithms.
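Since the question is about Java, here is a possible port of the function above. It is a sketch under a few assumptions: the class name OneAtATime is hypothetical, bytes are treated as unsigned (b & 0xFF), the unsigned right shift >>> stands in for C's shift on an unsigned type, and the final masking step is left to the caller.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical port of the one-at-a-time hash; the class name is illustrative.
class OneAtATime {
    static int hash(byte[] key) {
        int hash = 0;
        for (byte b : key) {
            hash += b & 0xFF;    // treat each byte as unsigned
            hash += hash << 10;
            hash ^= hash >>> 6;  // unsigned shift, as on C's unsigned type
        }
        // Final mixing of the accumulated state.
        hash += hash << 3;
        hash ^= hash >>> 11;
        hash += hash << 15;
        return hash;             // no mask applied; reduce to a table size if needed
    }

    public static void main(String[] args) {
        byte[] key = "a".getBytes(StandardCharsets.US_ASCII);
        System.out.println(Integer.toHexString(hash(key))); // prints ca2e9442
    }
}
```

To combine it with the serialization approach, the same three mixing lines could go into the write(int) method of a custom OutputStream, with the final three applied when the hash is read.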