I have 100,000+ text documents. I'd like to find a way to answer this (somewhat ambiguous) question:
We are trying to implement a WEKA classifier from inside a Java program. So far so good, everything works well; however, when building the classifier from the training set in the Weka GUI we use
I am implementing the tf-idf algorithm in a web application using Python; however, it runs extremely slowly. What I basically do is:
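Whatever the exact loop looks like, the usual culprit is recomputing document frequencies per request in pure Python. A minimal sketch of one common fix, moving the work into scikit-learn's vectorized implementation (the corpus below is illustrative, not from the question):

    # Build the vocabulary and IDF table once, then transform all
    # documents in a single vectorized call instead of a Python loop.
    from sklearn.feature_extraction.text import TfidfVectorizer

    documents = [
        "the quick brown fox",
        "the lazy dog",
        "quick brown dogs and lazy foxes",
    ]

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(documents)  # sparse (n_docs x n_terms)
    print(tfidf.shape)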
I am trying to do some pattern 'mining' on pieces of multi-word text, one per line. I have done an N-gram analysis using the Text::Ngrams module in Perl, which gives me the frequency of each word. I am now
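For readers without the Perl module at hand, here is a rough Python equivalent of that counting step; the function name, the sample lines, and the choice of n=2 are illustrative, not from the question:

    # Count word n-grams across lines with a plain Counter.
    from collections import Counter

    def ngram_counts(lines, n=2):
        counts = Counter()
        for line in lines:
            words = line.split()
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
        return counts

    lines = ["new york city", "new york state", "york city hall"]
    for gram, freq in ngram_counts(lines, n=2).most_common(3):
        print(" ".join(gram), freq)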
I have a DB containing tf-idf vectors of about 30,000 documents. I would like to return, for a given document, a set of similar documents, about 4 or so.
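A minimal sketch of the lookup itself, assuming the 30,000 vectors have been loaded from the DB into a matrix (the random data is a stand-in for the stored vectors):

    # L2-normalize rows so a dot product equals cosine similarity,
    # then take the k highest-scoring rows, excluding the query doc.
    import numpy as np

    def top_k_similar(matrix, doc_index, k=4):
        norms = np.linalg.norm(matrix, axis=1, keepdims=True)
        unit = matrix / np.clip(norms, 1e-12, None)
        sims = unit @ unit[doc_index]
        sims[doc_index] = -1.0          # exclude the document itself
        return np.argsort(sims)[::-1][:k]

    vectors = np.random.rand(30000, 200)
    print(top_k_similar(vectors, doc_index=0, k=4))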
I'm trying to implement the naive Bayes classifier for sentiment analysis. I plan to use the TF-IDF weighting measure. I'm just a little stuck now. NB generally uses the word (feature
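One way the two pieces fit together, sketched with scikit-learn rather than a from-scratch NB; the toy training data is illustrative, not from the question:

    # TF-IDF weights feed MultinomialNB as fractional "counts".
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    texts  = ["great movie", "terrible plot", "loved it", "hated it"]
    labels = ["pos", "neg", "pos", "neg"]

    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(["what a great plot"]))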
I am confused by the following comment about TF-IDF and Cosine Similarity. I was reading up on both, and then on the wiki page for Cosine Similarity I found this sentence: "In the case of information retrieva
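The point that sentence is making can be checked numerically: TF-IDF weights are never negative, so the cosine of two document vectors lands in [0, 1]. A small illustration (the toy vectors are made up):

    # Cosine of two non-negative vectors: dot product over norms.
    import math

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    doc1 = [0.0, 0.4, 0.6]   # toy TF-IDF weights, all >= 0
    doc2 = [0.5, 0.0, 0.5]
    print(cosine(doc1, doc2))  # falls between 0 and 1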
The goal is to assess semantic relatedness between terms in a large text corpus, e.g. 'police' and 'crime' should have a stronger semantic relatedness than 'police' and
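One common baseline for this task is pointwise mutual information over document co-occurrence; a minimal sketch under that assumption, with a made-up five-document corpus:

    # PMI: log of observed co-occurrence over what independence predicts.
    import math

    docs = [
        {"police", "crime"},
        {"police", "crime"},
        {"police", "traffic"},
        {"crime", "court"},
        {"weather", "sun"},
    ]

    def pmi(term_a, term_b, docs):
        n = len(docs)
        p_a  = sum(term_a in d for d in docs) / n
        p_b  = sum(term_b in d for d in docs) / n
        p_ab = sum(term_a in d and term_b in d for d in docs) / n
        return math.log(p_ab / (p_a * p_b)) if p_ab else float("-inf")

    print(pmi("police", "crime", docs))   # higher => more related
    print(pmi("police", "court", docs))   # never co-occur => -inf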
I would like to have, in addition to standard term search with tf-idf similarity over the text content field, scoring based on "similarity" of numeric fields. This similarity will depend on the distance
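The question concerns Lucene, but the scoring idea itself is easy to state in a few lines: blend the text score with a numeric contribution that decays with distance from a target value. A sketch in Python, where the weight and decay scale are assumptions:

    # Exponential decay: 1.0 at the target value, approaching 0 far away.
    import math

    def combined_score(text_score, field_value, target, weight=0.5, scale=10.0):
        numeric_score = math.exp(-abs(field_value - target) / scale)
        return text_score + weight * numeric_score

    print(combined_score(text_score=2.3, field_value=42.0, target=40.0))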
I just wonder how Lucene manages it, and from the source code I know that it opens and loads the segment files when initializing a searcher with an IndexReader, but could some kind person tell me how