
Maintain a top-k set in Java

Can't think of a neat way of doing this in java:

I'm streaming sets of strings from a file, line by line.

s1 s2 s3
s4 s5
s6 s7 s8 s9 s10
...

I load each line into a TreeSet, do some analysis, throw it away, and move on to the next line. I can fit the content of individual lines in memory, but not everything.

Now I want to maintain the top 5 biggest sets of strings I've encountered in the scan so far (storing nothing else).

I'm thinking a PriorityQueue with a SetSizeComparator, with add/poll when the queue reaches a size of 5. Anyone got a neater solution?

(I can't brain today. I have the dumb...)
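For what it's worth, the PriorityQueue-with-size-comparator idea from the question is only a few lines. A minimal sketch (class and method names are illustrative; `Comparator.comparingInt` assumes Java 8+):

```java
import java.util.Comparator;
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

public class Top5Sets {
    static final int K = 5;

    // Min-heap ordered by set size: the root is the smallest of the current
    // top K, so a new set only has to beat the root to get in.
    final PriorityQueue<Set<String>> top =
        new PriorityQueue<>(K, Comparator.comparingInt(Set::size));

    void offer(Set<String> line) {
        if (top.size() < K) {
            top.add(line);
        } else if (line.size() > top.peek().size()) {
            top.poll();   // evict the current smallest
            top.add(line);
        }
        // otherwise the line is discarded: it is not in the top K
    }

    public static void main(String[] args) {
        Top5Sets t = new Top5Sets();
        for (int n = 1; n <= 10; n++) {          // feed sets of size 1..10
            Set<String> s = new HashSet<>();
            for (int i = 0; i < n; i++) s.add("s" + i);
            t.offer(s);
        }
        // The heap now holds the five biggest sets, i.e. sizes 6..10.
        t.top.stream().map(Set::size).sorted().forEach(System.out::println);
    }
}
```

Note the comparison against `peek()` before the `poll`/`add` pair: replacing unconditionally would keep the most recent sets, not the biggest.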


  1. Create a tuple, say LineTuple, consisting of a line and its string frequency.

  2. Have a min heap of LineTuples, with comparator as the comparison of the frequency values.

  3. For first k lines, insert them into the heap.

  4. From (k+1)st line onwards,

    • compare the current line's frequency with the root, i.e. the tuple with minimum frequency in the heap; if the current line's frequency is larger, extract the root (this operation is O( lg k )), and
    • create a tuple with the current line and insert it into the heap in place of the extracted one (also O( lg k )); otherwise discard the line.
  5. At any point in time, the k tuples contained in the heap are the k biggest lines.

I am not fluent in Java, so I can't provide any code sample. But, check here, here.
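The steps above can be sketched in Java. This is a minimal, hypothetical sketch: `LineTuple` is modeled as a record (JDK 16+), "frequency" is taken to be the line's word count, and k and the input are hard-coded for illustration:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class TopKLines {
    // Step 1: the tuple of a line and its frequency (here: its word count).
    record LineTuple(String line, int freq) {}

    public static void main(String[] args) {
        int k = 3;
        String[] lines = {"s1 s2 s3", "s4 s5", "s6 s7 s8 s9 s10", "a", "b c d e"};

        // Step 2: a min-heap of LineTuples, compared by frequency.
        PriorityQueue<LineTuple> heap =
            new PriorityQueue<>(k, Comparator.comparingInt(LineTuple::freq));

        for (String line : lines) {
            LineTuple t = new LineTuple(line, line.split("\\s+").length);
            if (heap.size() < k) {
                heap.add(t);                 // step 3: first k lines go straight in
            } else if (t.freq() > heap.peek().freq()) {
                heap.poll();                 // step 4: evict the minimum, O(lg k)
                heap.add(t);                 // ...and insert the new tuple, O(lg k)
            }
        }

        // Step 5: the heap holds the k biggest lines.
        heap.stream().map(LineTuple::line).sorted().forEach(System.out::println);
    }
}
```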


Why doesn't the below work?

<T> T[] topK(Iterator<? extends T> items, int k, Class<T> clazz, Comparator<? super T> cmp) {
  @SuppressWarnings("unchecked")
  T[] topK = (T[]) Array.newInstance(clazz, k);  // java.lang.reflect.Array, not Arrays
  if (k == 0) { return topK; }
  for (int i = 0; i < k && items.hasNext(); ++i) {
    topK[i] = items.next();
  }
  // TODO: what is the appropriate output when there are fewer than k inputs?
  Arrays.sort(topK, cmp);
  // Note: this keeps the k smallest elements per cmp; pass a reversed
  // comparator to keep the k biggest.
  for (T item; items.hasNext();) {
    item = items.next();
    if (cmp.compare(item, topK[k - 1]) < 0) {
      int pos = Arrays.binarySearch(topK, item, cmp);
      if (pos < 0) { pos = ~pos; }
      System.arraycopy(topK, pos, topK, pos + 1, k - (pos + 1));
      topK[pos] = item;
    }
  }
  return topK;
}

The shifting around is O(k), which is less than ideal, but the number of successful comparisons should decrease as the topK entries get progressively greater, and the comparison at each step is O(log k), the same as in any heap-based approach built on PriorityQueue.
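The binarySearch-plus-arraycopy insertion is the fiddly part of the approach above, so here is that step in isolation (an `Integer` array with natural ordering, assumed purely for illustration):

```java
import java.util.Arrays;

public class SortedInsert {
    // Insert item into a sorted, full array, dropping the current last element.
    static void insertSorted(Integer[] a, Integer item) {
        int pos = Arrays.binarySearch(a, item);
        if (pos < 0) { pos = ~pos; }  // binarySearch encodes a missing key's slot as -(insertionPoint) - 1
        System.arraycopy(a, pos, a, pos + 1, a.length - (pos + 1));
        a[pos] = item;
    }

    public static void main(String[] args) {
        Integer[] top = {2, 5, 8, 11};
        insertSorted(top, 6);                     // 11 falls off the end
        System.out.println(Arrays.toString(top)); // [2, 5, 6, 8]
    }
}
```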


Here's an algorithm for randomly selecting k elements from a stream:

from random import randint

def rand_k(a, k):
  ret = []
  n = 0
  for e in a:
    n += 1
    if len(ret) < k:
      ret.append(e)
    else:
      if randint(1, n) <= k:
        ret[randint(0, k-1)] = e
  return ret

Note that each element will have probability k / n of being selected, where n is the total number of elements. Takes O(n) time and O(k) memory.
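Since the question is about Java, the same reservoir-sampling algorithm ports directly. A sketch, with `randK` and the `Iterable` input as illustrative names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class ReservoirSample {
    // Keep k uniformly random elements from a stream of unknown length.
    static <T> List<T> randK(Iterable<T> stream, int k, Random rnd) {
        List<T> ret = new ArrayList<>(k);
        int n = 0;
        for (T e : stream) {
            n++;
            if (ret.size() < k) {
                ret.add(e);                     // first k elements fill the reservoir
            } else if (rnd.nextInt(n) < k) {    // keep e with probability k/n
                ret.set(rnd.nextInt(k), e);     // ...overwriting a random slot
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        List<Integer> in = new ArrayList<>();
        for (int i = 0; i < 100; i++) in.add(i);
        List<Integer> sample = randK(in, 5, new Random());
        System.out.println(sample.size());
    }
}
```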

EDIT

The probability of choosing the element in position i (1-based), with i > k, is:

(k / i) * (1 - (k/(i+1))*(1/k)) * ... * (1 - (k/n)*(1/k))

That is, the probability of choosing the ith element, and not replacing it by any of the following elements. Simplifying each factor of the product:

= (k / i) * (i/(i+1)) * ((i+1)/(i+2)) * ... * ((n-1)/n)

Which after canceling, results in:

= k / n

The case of i <= k is similar.


You can use a TreeSet for this. See the reducer class in this Hadoop example:

https://trac.declarativity.net/browser/hadoop-0.19.2/src/examples/org/apache/hadoop/examples/TopK.java?rev=4870#L152
