Finding Common Sets within noisy data

Context: Consider each set within G to be a collection of the files (contents or MD5 hashes, not names) that are found on a particular computer.

Suppose I have a giant list of giant sets G and an unknown to me list of sets H. Each individual set I in G was created by taking the union of some unknown number of sets from list H, then adding and removing an unknown number of elements.

Now, I could use other data to construct a few of the sets in list H. However, I feel like there might be some sort of technique involving Bayesian probability to do this. E.g. something like, "If finding X in a set within G means there is a high probability of also finding Y, then there is probably a set in H containing both X and Y."

Edit: My goal is to construct a set of sets that is, with high probability, very similar or equal to H.

Any thoughts?

Example usage:

Compress G by replacing chunks of it with pieces of H, e.g.

G[1]  = {1,2,3,5,6,7,9,10,11}
H[5]  = {1,2,3}
H[6]  = {5,6,7,8,9,10}
G[1]' = {H[5],H[6],-8,11}
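
For concreteness, here is a minimal Python sketch of that compression step, assuming H is already known. The greedy pick-the-biggest-overlap rule is just one possible choice, and all names are made up for illustration.

# Hypothetical sketch: compress one set from G against a known list H.
# Greedily picks the H-set covering the most still-uncovered elements,
# then records the leftover additions and removals, mirroring G[1]' above.
def compress(g_set, h_list):
    remaining = set(g_set)
    chosen, covered = [], set()
    while True:
        best_i = max(range(len(h_list)),
                     key=lambda i: len(h_list[i] & remaining),
                     default=None)
        if best_i is None or not (h_list[best_i] & remaining):
            break
        chosen.append(best_i)
        covered |= h_list[best_i]
        remaining -= h_list[best_i]
    additions = set(g_set) - covered    # elements no chosen H-set supplies (the "+11")
    removals = covered - set(g_set)     # elements the chosen H-sets bring that G lacks (the "-8")
    return chosen, additions, removals

G1 = {1, 2, 3, 5, 6, 7, 9, 10, 11}
H = [set(), set(), set(), set(), set(), {1, 2, 3}, {5, 6, 7, 8, 9, 10}]
print(compress(G1, H))   # -> ([6, 5], {11}, {8}), i.e. G[1]' = {H[5], H[6], -8, +11}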


Define the distance d(i,j) = 1/(number of sets in G which contain both i and j) and then run a cluster analysis (http://en.wikipedia.org/wiki/Cluster_analysis). The resulting clusters are your candidates for the elements of H.
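
As a rough illustration of this idea (not a definitive implementation), here is a sketch using SciPy's hierarchical clustering, assuming G is a list of Python sets; the cutoff max_dist and the "average" linkage are arbitrary assumptions.

# Minimal sketch: co-occurrence distance d(i,j) = 1/#{sets containing both},
# followed by agglomerative clustering. Pairs never seen together get d = 1.
from itertools import combinations
from collections import Counter
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_candidates(G, max_dist=0.5):
    elements = sorted(set().union(*G))
    index = {e: k for k, e in enumerate(elements)}
    co = Counter()
    for s in G:
        for i, j in combinations(sorted(s), 2):
            co[(i, j)] += 1
    n = len(elements)
    dist = np.ones((n, n))
    np.fill_diagonal(dist, 0.0)
    for (i, j), c in co.items():
        dist[index[i], index[j]] = dist[index[j], index[i]] = 1.0 / c
    labels = fcluster(linkage(squareform(dist), method='average'),
                      t=max_dist, criterion='distance')
    clusters = {}
    for e, lab in zip(elements, labels):
        clusters.setdefault(lab, set()).add(e)
    return list(clusters.values())   # candidate sets for H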


There are tons of non-brainy ad hoc ways to attack this. Here's one.

Start by taking a random sample from G, say 64 sets.

For each file in these sets, construct a 64-bit integer telling which sets it appears in.

Group the files by this 64-bit value; so all the files that always appear together end up in the same group. Find the group with maximum ((number of files in group - 1) × (number of bits set in the bit-vector - 1)) and call that H[0].

Now throw that sample back and take a new random sample. Reduce it as much as you can using the H[0] you've already defined. Then apply the same algorithm to find H[1]. Rinse. Repeat.

Stop when additional H's are no longer helping you compress the sets.
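
A rough sketch of one round of this, assuming G is a list of Python sets; the function name and the fallback for samples smaller than 64 are my own additions.

# One round: sample up to 64 sets, give each file a bit-vector of the sample
# sets it appears in, group files by that signature, and keep the best group.
import random

def find_candidate(G, sample_size=64):
    sample = random.sample(G, min(sample_size, len(G)))
    signature = {}                       # file -> bit-vector of sample sets containing it
    for bit, s in enumerate(sample):
        for f in s:
            signature[f] = signature.get(f, 0) | (1 << bit)
    groups = {}                          # bit-vector -> files that always appear together
    for f, sig in signature.items():
        groups.setdefault(sig, set()).add(f)
    def score(item):
        sig, files = item
        return (len(files) - 1) * (bin(sig).count('1') - 1)
    best_sig, best_files = max(groups.items(), key=score)
    return best_files                    # candidate for the next H[k]

Between rounds you would subtract the H's found so far from each sampled set before rerunning this.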

To improve on this algorithm:

  • You can easily choose a slightly different measure of the goodness of groups that promotes groups with lots of nearby neighbors--files that appear in nearly the same set of sets.
  • You can also pretty easily test your existing H's against random samples from G to see if there are files you should consider adding or removing.


Well, the current ad-hoc way, which seems to be good enough, is as follows (a rough code sketch appears after the list):

  • Remove from every G_x all elements that appear in fewer than 25 sets.
  • Create a mapping from element to set and from set to element.
  • For each element E in the element map, pick 3 sets that contain E and take their intersection. Make two copies of this intersection, A and B.
  • For each set S in the set map that does not contain E, remove all elements of S from A or B (alternating between them).
  • Add Union(A,B) to H.
  • Remove all elements of Union(A,B) from the element-to-set map (i.e. do not look for overlapping sets).
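
Here is roughly what that looks like in code, assuming G is a list of sets; the threshold of 25 and the 3 seed sets come from the steps above, while the names, iteration order, and random choice of seed sets are filled in for illustration.

# Rough sketch of the ad-hoc procedure described in the list above.
from collections import Counter
import random

def ad_hoc_H(G, min_support=25, seed_sets=3):
    counts = Counter(e for s in G for e in s)
    G = [{e for e in s if counts[e] >= min_support} for s in G]   # drop rare elements

    elem_to_sets = {}                                             # element -> set indices
    for idx, s in enumerate(G):
        for e in s:
            elem_to_sets.setdefault(e, set()).add(idx)

    H = []
    for e in list(elem_to_sets):
        if e not in elem_to_sets:            # already consumed by an earlier H-set
            continue
        containing = list(elem_to_sets[e])
        picked = random.sample(containing, min(seed_sets, len(containing)))
        seed = set.intersection(*(G[i] for i in picked))          # intersection of 3 sets with E
        A, B = set(seed), set(seed)
        toggle = True
        for s in G:                                               # prune by sets lacking E
            if e not in s:
                (A if toggle else B).difference_update(s)
                toggle = not toggle
        candidate = A | B
        if candidate:
            H.append(candidate)
            for x in candidate:                                   # do not look for overlaps
                elem_to_sets.pop(x, None)
    return H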


How about a deterministic way (if you do not wish the sets to intersect at all):

A) Turn the sets in H into vertices labeled 1, 2, 3, ..., size(H). Create a complete undirected graph between them all. Each vertex gets a value equal to the cardinality (size) of its set.

B) Go through all elements x in the sets in H and create a mapping x -> [x1, x2, ..., xm] if and only if x is in H[xi]. An array of sets will do. This helps you find overlapping sets.

C) Go through all the sets in this array; for every pair x1, x2 that appear within the same set, eliminate the edge between x1 and x2.

D) In the remaining graph, only non-overlapping sets (well, their indices in H) stay connected.

E) Now find the non-intersecting path within this graph with the highest total value. From this you can reconstruct the list of non-intersecting sets with the highest coverage. It is trivial to compute the missing elements.

F) If you want to minimize the cardinality of the remaining set, then subtract 0.5 from the value of each vertex. We know that 1 + 1 = 2, but 0.5 + 0.5 < 1.5, so the algorithm will prefer a set {a,b} over {a} and {b}. This may not be exactly what you want, but it might inspire you.
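
A sketch of steps A)-D), with a simple greedy pass standing in for step E (the answer leaves the exact search open), assuming the candidate sets are already collected in a list H; everything beyond the steps above is illustrative.

# Build the compatibility graph between candidate sets, then greedily keep
# pairwise non-overlapping sets, biggest first. Greedy is only a stand-in
# for step E, not an optimal search.
def non_overlapping_cover(H):
    n = len(H)
    value = [len(s) for s in H]                      # A) vertex weight = |set|
    compatible = [[True] * n for _ in range(n)]      #    start from a complete graph

    elem_to_sets = {}                                # B) element -> indices of sets containing it
    for i, s in enumerate(H):
        for x in s:
            elem_to_sets.setdefault(x, []).append(i)

    for indices in elem_to_sets.values():            # C) drop edges between overlapping sets
        for a in indices:
            for b in indices:
                if a != b:
                    compatible[a][b] = False

    chosen = []                                      # E) greedy: keep a set only if it overlaps
    for i in sorted(range(n), key=lambda i: -value[i]):  # none of those already kept
        if all(compatible[i][j] for j in chosen):
            chosen.append(i)
    return [H[i] for i in chosen]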
