开发者

boolean search on an array

I have multiple arrays with about 100 possible values, ie:

a[0] = (a, b, c, d)
a[1] = (a, e)
a[2] = (d, f, g)

I want to FASTLY return which arrays contains (a || b) && (d || e)

in this example, 0 and 1

I was thinking about bitwise operations... like representing "abcd" by "1111"; "ad" by "1001", and so on. Then I could solve the "OR" with just a bitwise OR,开发者_运维技巧 and then check if both are non-zero

can anyone think on a better solution? this one isn't very pratical since it doesn't seem to be very escalable

are there any DBMS that can do that quickly? I tried with mongodb, but it seems they didn't add the "$and" function yet (doc says it's on version 1.9.1, but I can only download 1.9.0, and it's not stable anyway)

I suppose that's a "boolean search", similar to what google does all the time... so I'm guessing there's a better way (maybe not so fast, but more escalable) than that


Yes, a bitwise solution works quite nicely for this. Yes, some databases include such a capability, usually named a bitmapped column (or bitmapped index, depending). The usual advice is to apply it to a column that has relatively low cardinality (i.e., a fairly small number of possible values, such as sex).


In what sense is it not scalable? 16 bytes of data per (bit)array isn't bad! I'm not sure why you want a DBMS for this; you can put binary data in there if you need to (hopefully blocks of arrays), and pull it all out to query. Unless you're planning on having billions of arrays.

For small numbers of elements, bit logic is fastest. But if you start going far above 100 values, then keeping the arrays sorted and doing binary (or even linear!) search will be faster. You'll need to benchmark on your system to find the exact cutoff point, but if your arrays have ~4 elements each, I generally find linear search faster (counting up the occurrences of the elements you're looking for in the boolean logic as you go), and that it beats binary math at about the same point that binary representations also become larger.


Store your arrays as a trie, e.g.,

a
 b
  c
   d
 e
d
 f
  g

Create a trie from the expression as well, e.g.,

a
 b
  d
  e
 d
 e
b
 d
 e

You can match the latter trie against the former (ignoring any values that aren't in the expression, i.e., 'c', 'f', and 'g') to get the solutions. I leave the details of the trie representation and the matching algorithm up to you.


As you said the possible values are about 100, but you have lot of arrays, I think a hash table does better than bit level operation(s).
Eg.
have a hash table set with the values in expression, i.e a, b set to 1 and d, e set to 2.

for each array a in arrays      
  for each value v in array
    sum+= ht[v]
    if sum == 3
      print found
      break

(the above will not with duplicates though!)
the first for loop can be parallelized, probably with a map-reduce framework or even openMP.
(btw the second for can also be parallelized!)
This should be faster than constructing a bit representation of entire elements in the array and doing AND or OR. You basically benefit with the best case (for eg a and d are the first 2 elements!) worst case being same for both methods (may be the if done for every element be overhead)

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜