
Maintain a top-k set in Java

Can't think of a neat way of doing this in java:

I'm streaming sets of strings from a file, line by line.

s1 s2 s3
s4 s5
s6 s7 s8 s9 s10
...

I load each line into a TreeSet, do some analysis, throw it away, and move on to the next line. I can fit the content of individual lines in memory, but not everything.

Now I want to maintain the top 5 biggest sets of strings I've encountered in the scan so far (storing nothing else).

I'm thinking a PriorityQueue with a SetSizeComparator, with add/poll when the queue reaches a size of 5. Anyone got a neater solution?

(I can't brain today. I have the dumb...)
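For what it's worth, the PriorityQueue-with-size-comparator idea from the question is only a few lines. A minimal sketch (class and method names are illustrative; `Comparator.comparingInt` assumes Java 8+):

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

public class Top5Sets {
    static final int K = 5;

    // Min-heap ordered by set size: the root is the smallest of the current
    // top K, so a new set only has to beat the root to get in.
    final PriorityQueue<Set<String>> top =
        new PriorityQueue<>(K, Comparator.comparingInt(Set::size));

    void offer(Set<String> line) {
        if (top.size() < K) {
            top.add(line);
        } else if (line.size() > top.peek().size()) {
            top.poll();   // evict the current smallest
            top.add(line);
        }
        // otherwise the line is discarded: it is not in the top K
    }

    public static void main(String[] args) {
        Top5Sets t = new Top5Sets();
        for (int n = 1; n <= 10; n++) {          // feed sets of size 1..10
            Set<String> s = new HashSet<>();
            for (int i = 0; i < n; i++) s.add("s" + i);
            t.offer(s);
        }
        // The heap now holds the five biggest sets, i.e. sizes 6..10.
        t.top.stream().map(Set::size).sorted().forEach(System.out::println);
    }
}
```

Note the comparison against `peek()` before the `poll`/`add` pair: replacing unconditionally would keep the most recent sets, not the biggest.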


  1. Create a tuple, say LineTuple, consisting of a line and its string frequency.

  2. Have a min heap of LineTuples, with comparator as the comparison of the frequency values.

  3. For first k lines, insert them into the heap.

  4. From (k+1)st line onwards,

    • compare the current line's frequency with the root, i.e. the tuple with minimum frequency in the heap; if the current line's frequency is larger, extract the root (this operation is O( lg k )), and
    • create a tuple with the current line and insert it into the heap in place of the extracted one (also O( lg k )); otherwise discard the line.
  5. At any point in time, the k tuples contained in the heap are the k biggest lines.

I am not fluent in Java, so I can't provide any code sample. But, check here, here.
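The steps above can be sketched in Java. This is a minimal, hypothetical sketch: `LineTuple` is modeled as a record (JDK 16+), "frequency" is taken to be the line's word count, and k and the input are hard-coded for illustration:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopKLines {
    // Step 1: the tuple of a line and its frequency (here: its word count).
    record LineTuple(String line, int freq) {}

    public static void main(String[] args) {
        int k = 3;
        String[] lines = {"s1 s2 s3", "s4 s5", "s6 s7 s8 s9 s10", "a", "b c d e"};

        // Step 2: a min-heap of LineTuples, compared by frequency.
        PriorityQueue<LineTuple> heap =
            new PriorityQueue<>(k, Comparator.comparingInt(LineTuple::freq));

        for (String line : lines) {
            LineTuple t = new LineTuple(line, line.split("\\s+").length);
            if (heap.size() < k) {
                heap.add(t);                 // step 3: first k lines go straight in
            } else if (t.freq() > heap.peek().freq()) {
                heap.poll();                 // step 4: evict the minimum, O(lg k)
                heap.add(t);                 // ...and insert the new tuple, O(lg k)
            }
        }

        // Step 5: the heap holds the k biggest lines.
        heap.stream().map(LineTuple::line).sorted().forEach(System.out::println);
    }
}
```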


Why doesn't the below work?

<T> T[] topK(Iterator<? extends T> items, int k, Class<T> clazz, Comparator<? super T> cmp) {
  @SuppressWarnings("unchecked")
  T[] topK = (T[]) Array.newInstance(clazz, k);  // java.lang.reflect.Array, not Arrays
  if (k == 0) { return topK; }
  for (int i = 0; i < k && items.hasNext(); ++i) {
    topK[i] = items.next();
  }
  // TODO: what is the appropriate output when there are fewer than k inputs?
  Arrays.sort(topK, cmp);
  // Note: this keeps the k smallest elements per cmp; pass a reversed
  // comparator to keep the k biggest.
  for (T item; items.hasNext();) {
    item = items.next();
    if (cmp.compare(item, topK[k - 1]) < 0) {
      int pos = Arrays.binarySearch(topK, item, cmp);
      if (pos < 0) { pos = ~pos; }
      System.arraycopy(topK, pos, topK, pos + 1, k - (pos + 1));
      topK[pos] = item;
    }
  }
  return topK;
}

The shifting around is O(k), which is less than ideal, but the number of successful comparisons should decrease as the topK entries get progressively greater, and the comparison at each step is O(log k), the same as in any heap-based approach built on PriorityQueue.
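The binarySearch-plus-arraycopy insertion is the fiddly part of the approach above, so here is that step in isolation (an `Integer` array with natural ordering, assumed purely for illustration):

```java
import java.util.Arrays;

public class SortedInsert {
    // Insert item into a sorted, full array, dropping the current last element.
    static void insertSorted(Integer[] a, Integer item) {
        int pos = Arrays.binarySearch(a, item);
        if (pos < 0) { pos = ~pos; }  // binarySearch encodes a missing key's slot as -(insertionPoint) - 1
        System.arraycopy(a, pos, a, pos + 1, a.length - (pos + 1));
        a[pos] = item;
    }

    public static void main(String[] args) {
        Integer[] top = {2, 5, 8, 11};
        insertSorted(top, 6);                     // 11 falls off the end
        System.out.println(Arrays.toString(top)); // [2, 5, 6, 8]
    }
}
```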


Here's an algorithm for randomly selecting k elements from a stream:

from random import randint

def rand_k(a, k):
  ret = []
  n = 0
  for e in a:
    n += 1
    if len(ret) < k:
      ret.append(e)
    else:
      if randint(1, n) <= k:
        ret[randint(0, k-1)] = e
  return ret

Note that each element will have probability k / n of being selected, where n is the total number of elements. Takes O(n) time and O(k) memory.
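Since the question is about Java, the same reservoir-sampling algorithm ports directly. A sketch, with `randK` and the `Iterable` input as illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSample {
    // Keep k uniformly random elements from a stream of unknown length.
    static <T> List<T> randK(Iterable<T> stream, int k, Random rnd) {
        List<T> ret = new ArrayList<>(k);
        int n = 0;
        for (T e : stream) {
            n++;
            if (ret.size() < k) {
                ret.add(e);                     // first k elements fill the reservoir
            } else if (rnd.nextInt(n) < k) {    // keep e with probability k/n
                ret.set(rnd.nextInt(k), e);     // ...overwriting a random slot
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        List<Integer> in = new ArrayList<>();
        for (int i = 0; i < 100; i++) in.add(i);
        List<Integer> sample = randK(in, 5, new Random());
        System.out.println(sample.size());
    }
}
```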

EDIT

The probability of choosing the element in position i (1-based), with i > k, is:

(k / i) * (1 - (k/(i+1))*(1/k)) * ... * (1 - (k/n)*(1/k))

That is, the probability of choosing the ith element, and not replacing it by any of the following elements. Simplifying each factor of the product:

= (k / i) * (i/(i+1)) * ((i+1)/(i+2)) * ... * ((n-1)/n)

Which after canceling, results in:

= k / n

The case of i <= k is similar.


You can use a TreeSet for this. See the reducer class in this Hadoop example:

https://trac.declarativity.net/browser/hadoop-0.19.2/src/examples/org/apache/hadoop/examples/TopK.java?rev=4870#L152
