
Apache Nutch to index only part of page content

I am going to use Apache Nutch v1.3 to extract only some specific content from web pages. I checked the parse-html plugin, and it seems to normalize each HTML page using TagSoup or NekoHTML. This is good. I need to extract only the text inside <span class='xxx'> and <span class='yyy'> elements on the page. It would be great if the extracted texts were saved into different fields (e.g. content_xxx, content_yyy). My question is: should I write my own plugin, or can this be done in some standard way?

The best way would be to apply XSLT to the normalized page and take the result. Is that possible?


Building your own parse filter (HtmlParseFilter) and IndexingFilter is easy. Nutch provides you with the DOM document, which you only need to traverse to find your elements (the two spans, in your case). Then you simply add the new fields to your index and schema and you're done.
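The traversal step can be sketched in plain Java against the standard org.w3c.dom API. This is only a minimal sketch, not a Nutch plugin: the class name SpanExtractor and the parseXhtml helper are made up for illustration, the content_ field prefix comes from the question, and the class attribute is assumed to hold exactly one class name. Inside a real parse filter you would pass in the DOM node that Nutch hands you instead of parsing a string.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class SpanExtractor {

    /** Collects the text of every <span> whose class attribute equals one of wantedClasses. */
    public static Map<String, String> extract(Node root, String... wantedClasses) {
        Map<String, String> fields = new LinkedHashMap<>();
        collect(root, wantedClasses, fields);
        return fields;
    }

    private static void collect(Node node, String[] wanted, Map<String, String> fields) {
        // HTML parsers may report element names in upper case, so compare ignoring case.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "span".equalsIgnoreCase(node.getNodeName())) {
            String cls = ((Element) node).getAttribute("class"); // "" when absent
            for (String w : wanted) {
                if (w.equals(cls)) {
                    // Several matching spans are joined with a space.
                    fields.merge("content_" + w, node.getTextContent().trim(),
                            (a, b) -> a + " " + b);
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), wanted, fields);
        }
    }

    /** Stand-in for the already normalized page: parses well-formed XHTML into a DOM. */
    public static Document parseXhtml(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXhtml("<html><body><span class='xxx'>first</span>"
                + "<p>skip me</p><span class='yyy'>second</span></body></html>");
        System.out.println(extract(doc, "xxx", "yyy"));
        // prints {content_xxx=first, content_yyy=second}
    }
}
```

The resulting map keys are the field names you would then add in the indexing filter and declare in the schema.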

There are some examples on how to do this:

http://wiki.apache.org/nutch/HowToMakeCustomSearch

http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

Good luck


By default the content is flat after parsing. So I don't think you can do what you want, unless you can extract your content at the indexing step, i.e. once the content has been flattened.


Instead of writing your own plugins, you can also use these custom plugins which can be configured to extract parts of pages:

  • https://github.com/BayanGroup/nutch-custom-search
  • http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
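As for the XSLT idea from the question: independently of the plugins above, a stylesheet can be applied to a DOM with the JDK's own javax.xml.transform. The sketch below is an assumption about how one might do that inside a custom parse filter; it is not confirmed to be how either plugin works. The class name XsltExtractor, the stylesheet, and its cls parameter are hypothetical, and the XPath `//span` assumes lower-case element names without a namespace, which may not hold for the DOM a given HTML parser produces.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch.
public class XsltExtractor {

    // Emits the normalized text of each <span class=$cls>, one per line.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:param name='cls'/>"
      + "<xsl:template match='/'>"
      + "<xsl:for-each select='//span[@class=$cls]'>"
      + "<xsl:value-of select='normalize-space(.)'/>"
      + "<xsl:text>&#10;</xsl:text>"
      + "</xsl:for-each>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Runs the stylesheet over the DOM and returns the matching span texts. */
    public static String extract(Node dom, String cssClass) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            transformer.setParameter("cls", cssClass);
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(dom), new StreamResult(out));
            return out.toString().trim();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transformation failed", e);
        }
    }

    /** Stand-in for the already normalized page: parses well-formed XHTML into a DOM. */
    public static Document parseXhtml(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXhtml("<html><body><span class='xxx'>first</span>"
                + "<span class='yyy'>second</span></body></html>");
        System.out.println("content_xxx = " + extract(doc, "xxx"));
        System.out.println("content_yyy = " + extract(doc, "yyy"));
    }
}
```

Running the transformation once per class keeps the two values separate, so each can go into its own field.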