
Vectorizing documents with Apache Mahout - MinLLR parameter

I'm working with Apache Mahout to vectorize and cluster a decent-sized set of documents (~500k). In working through the examples both on the project website and in the Mahout in Action book, I have seen the minLLR parameter of seq2sparse used a couple of times, but I'm unsure what kind of values it expects. Is there any kind of 'starting ground' or method for estimating a decent value for this parameter?


The LLR value isn't normalized, so I don't believe there is a single good answer; it also depends on how much pruning you want. LLR values grow linearly with the size of your corpus (more precisely, with the number of n-grams). The default value of 1.0 is a reasonable starting point. I'd advise finding the right value experimentally on one input, then scaling it linearly to other inputs based on their size.
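To make the scaling argument concrete, here is a minimal sketch in Python. It assumes the n-gram score is the standard log-likelihood ratio (G²) over a 2x2 contingency table of co-occurrence counts; this is not Mahout's own code. The names `llr` and `scale_threshold` and the example counts are hypothetical, chosen only for illustration.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed layout (illustrative, not taken from Mahout):
      k11 = count of A followed by B      k12 = count of A followed by not-B
      k21 = count of not-A followed by B  k22 = count of neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing to the sum
            g2 += k * math.log(k * total / (rows[i] * cols[j]))
    return 2.0 * g2


def scale_threshold(min_llr, tuned_ngrams, new_ngrams):
    """Hypothetical helper: rescale a tuned threshold linearly with n-gram count."""
    return min_llr * new_ngrams / tuned_ngrams


if __name__ == "__main__":
    small = llr(20, 80, 80, 820)
    large = llr(200, 800, 800, 8200)  # same proportions, 10x the counts
    print(small, large, large / small)
    print(scale_threshold(50.0, 1_000_000, 4_000_000))
```

Multiplying every count by the same factor leaves the ratios inside the logarithms unchanged, so the score grows by exactly that factor. That is why a threshold tuned on one corpus can be carried over to another by scaling it with the number of n-grams, at least as a first guess.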
