Nutch - Lucene - capture the content of the pages

I have crawled a few pages with Nutch, and I have written a Java module with Lucene that runs queries against the indexed documents. I know Nutch created fields such as url, weight and title, but I am interested in capturing the content of each page. How can I do this with Lucene, given that I crawled with Nutch?

Thanks


You need to give more details about what you want to achieve. Nutch already includes a Lucene index, so I wonder why you want another one. Nutch has a JSP front end that you can look at to find out how to query for the content of a field. There is also a cache system, so you can retrieve the cached data of a page, but then you have to parse it and index it again.
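
To show why re-indexing may be needed, here is a toy model in plain Java. It deliberately uses neither the Lucene nor the Nutch API, and `ToyIndex` with all of its methods is hypothetical. It only sketches the general idea that a field can be indexed (searchable) without being stored (retrievable). Whether your Nutch index actually stores the content field depends on your version and configuration, so treat that part as an assumption to check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy index: NOT the Lucene API, just the stored-vs-indexed idea.
public class ToyIndex {
    // term -> ids of the documents that contain the term in any field
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> values of the fields that were added with store = true
    private final List<Map<String, String>> stored = new ArrayList<>();

    // Creates an empty document and returns its id.
    public int newDocument() {
        stored.add(new HashMap<>());
        return stored.size() - 1;
    }

    // Every field is tokenized and indexed; the raw value is kept only if store is true.
    public void addField(int docId, String name, String value, boolean store) {
        for (String term : value.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.get(docId).put(name, value);
        }
    }

    // Returns the ids of the documents that contain the given term.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
    }

    // Returns the stored value of a field, or null if the field was not stored.
    public String get(int docId, String field) {
        return stored.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();

        // First pass: content is searchable but its text is not kept.
        int first = index.newDocument();
        index.addField(first, "title", "Example page", true);
        index.addField(first, "content", "Text fetched by the crawler", false);

        // Second pass (re-indexing the parsed page): content is stored as well.
        int second = index.newDocument();
        index.addField(second, "title", "Example page", true);
        index.addField(second, "content", "Text fetched by the crawler", true);

        for (int id : index.search("crawler")) {
            System.out.println(id + " -> " + index.get(id, "content"));
        }
    }
}
```

In this sketch both documents match the query, but only the second one can return its text. That mirrors the advice above: if the page text is not retrievable from the existing index, get the page data from the crawl, parse it, and index it again with the content kept.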
