Java API: downloading a web page and calculating TF-IDF for it

I am new to IR techniques.

I am looking for a Java-based API or tool that does the following:

  1. Download the given set of URLs
  2. Extract the tokens
  3. Remove the stop words
  4. Perform Stemming
  5. Create Inverted Index
  6. Calculate the TF-IDF

Kindly let me know how Lucene can help me with this.

Regards Yuvi


You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.


Actually, TF-IDF is a score given to a term in a document, rather than to the whole document. If you just want the TF-IDF of each term in a document, you could use this method without ever touching Lucene. If you want to create a search engine, you need to do a bit more, such as extracting text from the given URLs, since the documents behind them would probably not contain raw text. If this is the case, consider using Solr.
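For the per-term case, here is a minimal sketch in plain Java with no Lucene involved. It assumes the page text has already been downloaded and stripped of markup, uses a tiny made-up stop-word list, skips stemming entirely, and uses one common TF-IDF variant (term count divided by document length, times the natural log of N over document frequency). The class and method names (`TfIdfSketch`, `tokenize`, `tfIdf`) are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class TfIdfSketch {
    // Tiny illustrative stop-word list (an assumption; real lists are much longer).
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "the", "of", "is", "in", "on", "to"));

    // Lowercase, split on anything that is not a letter or digit, drop stop words.
    // No stemming is done here; a stemmer would be applied to each token at this point.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Returns one map per input document: term -> TF-IDF score in that document.
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<Map<String, Integer>> countsPerDoc = new ArrayList<>();
        List<Integer> lengths = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>(); // term -> number of docs containing it

        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            Map<String, Integer> counts = new HashMap<>();
            for (String t : tokens) {
                counts.merge(t, 1, Integer::sum);
            }
            countsPerDoc.add(counts);
            lengths.add(tokens.size());
            for (String term : counts.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> e : countsPerDoc.get(i).entrySet()) {
                double tf = (double) e.getValue() / lengths.get(i);
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                scores.put(e.getKey(), tf * idf);
            }
            result.add(scores);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat on the mat", "the dog sat");
        List<Map<String, Double>> scores = tfIdf(docs);
        for (int i = 0; i < scores.size(); i++) {
            System.out.println("doc " + i + ": " + scores.get(i));
        }
    }
}
```

With this variant, a term that occurs in every document gets a score of zero, because its IDF is log(1). Other weighting schemes smooth the IDF or take the log of the term frequency, so scores from a library may differ from these.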
