开发者

Creating A Better Hash Function

#include <iostream>
#include <iomanip>
#include <string>
#include <vector>

using namespace std;

class Item {
public:
    Item(const string & v): value(v), next(0) { }
    string value;
    Item * next;
};

int hash_function(const string & s)
{
    unsigned int hashval = 0;
    int i = s.length();
    while (i > 0)
{
        hashval += s[--i];
}       
return hashval%101;
}

main()
{
    string name;
    int index;
    Item * p;

    vector<Item *> bucket(101);

    for (index = 0; index < 101; index++)
        bucket[index] = 0;

    while (cin >> name) {
        p = new Item(name);
        index = hash_function(name);

        // push front
        if (bucket[index] != 0)
            p->next = bucket[index];
        bucket[index] = p;
    }

    for (index = 0; index < 101; index++)
        if (bucket[index] != 0) {
            cout << setw(3) << index << ": ";
            p = bucket[index];
            while (p != 0) {
                cout << p->value << " ";
                p = p->next;
            }
            cout << endl;
        }

    Item * temp;
    for (index = 0; index < 101; index++) {
        p = bucket[index];
        while (p != 0) {
            temp = p;
            p = p->next;
            delete temp;
        }
    }
}

which contains two very simple hash functions. I'm trying to work on the one which is not commented out, as it seems like the better of the two when tested. I want a set of names that is input to be distributed evenly in it's own bucket and so far, that seems to 开发者_StackOverflowbe working, with the exception of names which begin with the same letter. For example, Amy and Alice will appear in the same bucket and so on.

Here is a sample input/output:

Alice
Amy  
Barry
Carrie
David
Garret 
Edward
Henry
Ingrid
Fred
 65: Amy Alice 
 66: Barry 
 67: Carrie 
 68: David 
 69: Edward 
 70: Fred 
 71: Garret 
 72: Henry 
 73: Ingrid 

What can I add to my algorithm that would allow Amy and Alice to be placed in their own bucket?


Your function hash_function isn't actually returning a value. You should pay more attention to your compiler's warnings!

Apparently it happens to have the effect of returning the first character in the string. This is purely arbitrary. On another platform it might always return zero, or cause your computer to explode. (Probably not actually the latter.)

As for making a better hash function: once you fix this bug, you'll no longer find that the hash value depends only on the first character. However, you will find e.g. that "Brian" and "Brain" hash to the same value. That's the next thing you should think about.


Instead of blindly adding each letter, give some weight to each, so that cpp, pcp, ppc all could produce different hashvalue.

Here is little improved version:

int hash_function(const string & s)
{
    double hashval = 0;
    int i = s.length();
    double weight = 1.0;
    while (i > 0)
    {
        hashval +=  weight * s[--i];
        weight *= 1.5;
    }       
    return (int) hashval;
}

Assuming the string s is not too long, otherwise there will be overflow!


Check these out(suggested by the google sparsehash): Bob Jenkins: http://burtleburtle.net/bob/hash/ or Paul Hsieh: http://www.azillionmonkeys.com/qed/hash.html


Try weighting the different letters differently. In your current implementation (assuming it worked, as mentioned above), the name ab would hash to the same value as ba. Something like:

for (int i = 0 to str.len())
    hash = hash + hash + str[i]

would return different values for two strings with the same letters, yet is still very simple.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜