Vectorizing documents with Apache Mahout - MinLLR parameter
I'm working with Apache Mahout to vectorize and cluster a decent-sized set of documents (~500k). In working through the examples both on the project website and in the Mahout in Action book, I have seen the minLLR parameter of seq2sparse used a couple of times, but I'm unsure of what kind of values it expects. Is there any kind of 'starting ground' or method for estimating a decent value for this parameter?
The LLR value isn't normalized, so I don't believe there is a single good answer, and the right value also depends on how much pruning you want. LLR values increase linearly with the size of your corpus (more precisely, with the number of n-grams), so a threshold that works on one corpus will not carry over unchanged to a much larger or smaller one. The default value of 1.0 is a reasonable starting point. I'd advise finding the right value experimentally, then scaling it linearly with input size when you move to other input.
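To make the linear scaling concrete, here is a minimal Python sketch of the standard log-likelihood ratio (G²) for a 2x2 co-occurrence table, written in the entropy form. I believe this matches what Mahout computes for n-gram pruning, but treat that as an assumption; the function names `llr` and `scale_min_llr` are hypothetical helpers for illustration, not part of Mahout's API.

```python
import math


def x_log_x(x):
    # By convention 0 * log(0) is taken as 0.
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of raw counts: xlogx(total) - sum(xlogx(count)).
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of counts.

    k11: A and B together, k12: A without B,
    k21: B without A,      k22: neither A nor B.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def scale_min_llr(tuned_min_llr, tuned_ngram_count, new_ngram_count):
    # Hypothetical helper: rescale a threshold tuned on one corpus
    # linearly to a corpus with a different number of n-grams.
    return tuned_min_llr * new_ngram_count / tuned_ngram_count


if __name__ == "__main__":
    small = llr(10, 20, 30, 4000)
    large = llr(100, 200, 300, 40000)  # same proportions, 10x the counts
    print(small, large, large / small)
    print(scale_min_llr(50.0, 1_000_000, 4_000_000))
```

Multiplying every count by 10 multiplies the score by 10, which is why a threshold tuned on a sample has to be rescaled before it is applied to the full corpus.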