开发者

How do I make a good hash function for large Strings?

Here开发者_运维知识库's my hash function for Strings

public class GoodHashFunctor implements HashFunctor {

    @Override
    public int hash(String item) {

        String binaryRepString = "";

        for(int i = 0; i < item.length(); i++){
            // Add the String version of the binary version of the integer  version of each character in item
            binaryRepString += Integer.toBinaryString((int)(item.charAt(i)));
        }


        long longVersion = Long.parseLong(binaryRepString, 2) % Integer.MAX_VALUE;


        return (int) longVersion;

    }

}

However, when I try hashing large Strings (around 10-15 characters), I'm getting errors because when it tries to parseLong, it dies because it's too big a number.

What do you all think I should do? And my professor said we can't use Java's hashCode()

I saw a similar post where the best answer was to hash this way:

int hash=7;

for (int i=0; i < strlen; i++) {
    hash = hash*31+charAt(i);
}

But wouldn't I run into the same problem? I guess it'd probably take a lot longer Strings to break it this new way. I dunno I'm fairly confused...


Why do you need to convert each character into a string (and that too in binary form) before converting it into a long? Why not just have a long value to which you add the char?

This is homework, so I'm not posting code. You can also see any good algorithm book or search the web) for more about hashing.

Edit: I understand you don't want to just sum them up because anagrams will all have the same hash value. But I think you already know how to avoid that. Notice how by concatenating bits, you are basically adding bits to a value after having shifted them by some positions. i.e. "10101"+"10001" is the same as 1010100000+10001 - 21<<5+17.

By shifting each character by an amount proportional to its position in the string, the value added to the hash depends on both the value and position of the character. Also, observe the same effect can be had by simply multiplying rather than scaling.

Another thing to watch out for is the fact that a long has only 64 bits. you can only pack so many chars into it before it starts to overflow. So most practical hash functions take the value modulo some number. Of course that means there is only a limited number of possible hash values for an unlimited number of input strings. Collisions are inevitable, but well chosen values for your shift/multiplier and mod can minimize the number of collisions.


What is a good hash-function depends heavily on what you mean by good. I know this sounds cliché BUT it is just so true. To identify which hash-function is best for your specific problem-domain you have to specify:

  • how long the input is

  • which letters the input contains (letters in a certain alphabet, or just the 4 possible letters in genetic sequences, and if you want a really good hash function you even need to specify the expected probability of each letter)

  • in which way you want to differentiate strings (your comment on MAK's answer shows that you want the hash to be different for permutations of the same string. So your += is not a candidate, but see the link below for some functions that satisfy this requirement)

The combination of these 3 considerations allows you to select a good hash-function, but you first have to specify these 3 points.

As a side-note: obviously your += into a Long only works for short strings. But even with a different hash-function you dont get unique hash values for every possible string that you can fit into the 64-bit Long (Java): You can distinguish only 2^64 strings even with a perfect hash function. In general if you have a hashtable that maps aKey->anObject you still store the original key (not just the hash-value that this bucket represents) so you can compare it with the requested key string.

Depending on your requirements you might want to take a look into the topic of cryptographic hash-functions to decide if those are what you want. However first take a look at the very good Wikipedia-entry which lists some good hash-functions and more importantly the situations for which they are good: http://en.wikipedia.org/wiki/Hash_function

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜