开发者

A Hash that finds results that are close

I have a table that has about 10000 entries each entry has almost 100 boo开发者_如何学Golean values. A user checkboxes a bunch of the booleans and hopes to get a result that matches their request. If that record doesn't exist, I want to show them maybe 5 records that are close(have only 1 or two values different). Is there a good hash system or data structure that can help me find these results.


Bitmap indices. Google for the paper if you want the complete background, it's not easy but worth a read. Basically build bitmpas for your boolean values like this:

010110101010
110100010100
000101001100

And then just XOR your filter through them, sort by number of matches, return. Since all operations are insanely fast (about one cycle per element, and the data structure uses (edit) 100 bits of memory per element), this will usually work even though it's linear.

Addendum: How to XOR. (fixed a bug)

000101001100 source
000101001010 target
000000000110 result of XOR

 int n = 0; if (v) do { n++; } while (v &= (v-1)); return(n);

The two 1's tell you that there are 2 errors and m-2 matches, where m is the number of bits.


What you describe is a nearest neighbor search: based on a record, find the 5 closest records based on an arbitrary distance function (such as the number of different values).

A hashing function intentionally discards any information except "these values are equal", so it's not really the way to go.

Consider using instead a data structure optimized for nearest neighbor searching, such as a kd-tree or vp-tree. If there's a high probability that a record already exists in the list, you could first use a hash table to look for it, and then fall back on the kd-tree if it does not exist.


This builds on the answer from Kdansky.

Create a dynamic array of entries.
Create a cache.

for each lookup,
   check the cache.
   return the cache entry if the value exists.
   otherwise for each value in the dynamic array,
       if hamming distance is less than threshold add to the result list
   cache the value against the result
   return the result

to find the hamming distance: xor and find the hamming weight http://en.wikipedia.org/wiki/Hamming_weight

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜