开发者

String hashing with linear probing

I am stuck trying to figure out how to do string hashing with linear probing.

Basically, the idea is to ha开发者_如何学运维sh every string from a dictionary (90000 words), and retrieve anagrams of selected words.

Here's what I did:

  1. created a hash table 2*90000 in size

  2. using a simple hash function, I hash each word from the dictionary, get a value

  3. check if that hash table index is empty, if it is, assign the value, if not, generate a new hash value.

  4. after every word is in the hash table, and I perform a search

  5. the search word will receive a hash value after the hash function, and it will be checked whether that value exists in the hash table or not.

  6. if it exists, it will compare the string using permutations. if the match is true, it will output it. if not, it will keep looking using a new hash value.

problem is, the whole process is extremely slow... it indexes fine, but searching takes REALLY long time.

I am out of ideas on how to make this faster..

Thank you for your time reading this.


Put all the letters in alphabetical order first, then hash the result with any hashing algorithm you please (crc32, md5sum, sha1, count the vowels, anything... though counting the vowels will lead to a less-efficient solution), and store the word as a leaf node to that hash entry (in a linked list, obviously) -- do a mod(x) on the hash result to limit the buckets to 2^x.

Then, when you go to find an anagram, do the exact same "insert" procedure on your test word: alphabetize the letters, then run it through your same hash function. Then for each leaf node, compare the alphabetized letter list with the saved word's alphabetized list. Each match is an anagram.

(I normally don't like to give homework help, but this one was too tempting. Now I kind of want to go write a fun little program to find all the anagrams in a given dictionary.)


Linear Probing is used in the case when the hash function you are using gives collision for some input string.In that case you search sequentially the hash table until you find your search key.

This method has a disadvantage in that if there is one collision,there will be more.

One way is that you can use Separate Chaining and use balanced trees as buckets to improve lookup.

If you just want to improve performance, make sure that there are no collisions (in that case lookup is perfectly O(1)) and if there are, increase the hashtable size.


If you are searching each permute of your input word, this is the source of your problem. The number of permutations of the letters of a word can get quite large.

Instead, pick a hash function which is the same for any permutation (anagram) of a word. For example, one based on the sum of character codes of the characters in the word.


Are you attempting to create Anagrams of a given string? In that case, just create an anagram on getting the string as input. What's the point in hashing those strings?

EDIT: First of all I would suggest you to get all the permutations of the given string and then loop through the dictionary file having all the words. This has the advantage that you don't need to have all the words in memory. If you want to optimize further, sort the entire file based on the string lengths by ascending or descending order and keep checking for the strings in the dictionary file till you are <= to the next string's length.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜