Java API: downloading and calculating TF-IDF for a given web page
I am new to IR techniques.
I am looking for a Java-based API or tool that does the following:
- Download the given set of URLs
- Extract the tokens
- Remove the stop words
- Perform Stemming
- Create Inverted Index
- Calculate the TF-IDF
Kindly let me know how Lucene can help me with this.
Regards Yuvi
You could try the Word Vector Tool - it's been a while since the latest release, but it works fine here. It should be able to perform all of the steps you mention. I've never used the crawler part myself, however.
Actually, TF-IDF is a score given to a term in a document, rather than to the whole document. If you just want the TF-IDF of each term in each document, you could use this method without ever touching Lucene. If you want to build a search engine, you need to do a bit more (such as extracting the text from the given URLs, since the pages behind them will probably not be raw text). If that is the case, consider using Solr.
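To make the "score per term per document" point concrete, here is a minimal plain-Java sketch with no Lucene involved. It is not necessarily the method referred to above. It assumes the page text has already been downloaded and stripped to plain strings, uses a tiny made-up stop-word list, skips stemming, and takes tf as the term count divided by the document length and idf as ln(N / df). The class and method names (`TfIdfSketch`, `tokenize`, `tfIdf`) are just illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class TfIdfSketch {

    // Tiny illustrative stop-word list (an assumption); a real list is much longer.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "on", "the", "to"));

    // Lowercase the text, split on anything that is not a-z or 0-9,
    // and drop stop words. No stemming is done in this sketch.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // For each document, returns a map term -> tf-idf where
    //   tf  = (occurrences of term in doc) / (number of tokens in doc)
    //   idf = ln(number of docs / number of docs containing term)
    static List<Map<String, Double>> tfIdf(List<String> docs) {
        List<List<String>> tokenized = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            List<String> tokens = tokenize(doc);
            tokenized.add(tokens);
            // Count each term once per document for the document frequency.
            for (String term : new HashSet<>(tokens)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }

        List<Map<String, Double>> result = new ArrayList<>();
        for (List<String> tokens : tokenized) {
            Map<String, Integer> counts = new HashMap<>();
            for (String term : tokens) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / tokens.size();
                double idf = Math.log((double) docs.size() / docFreq.get(e.getKey()));
                scores.put(e.getKey(), tf * idf);
            }
            result.add(scores);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "The cat sat on the mat",
                "The dog sat on the log");
        List<Map<String, Double>> scores = tfIdf(docs);
        for (int i = 0; i < docs.size(); i++) {
            System.out.println("doc " + i + ": " + scores.get(i));
        }
    }
}
```

With these two sample documents, "sat" appears in both, so its idf is ln(2/2) = 0 and its score is 0 in each document, while terms unique to one document such as "cat" get (1/3) * ln 2. Other tf and idf weightings are in common use, so treat the formula here as one choice among several.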