Custom Parser for Nutch (or open source .NET Crawler)

Posted 2023-03-07 19:55

I have been using Nutch/Solr/SolrNet for my search solutions and, I must say, it works a treat. On a new site I'm working on, I am using Master pages. As a result, content in the header and footer is getting indexed and distorts the results. For example, I have a link to the Contact Us page in the header, so when I search for 'Contact' the results include every page on the site.

Is there a customizable Nutch parser to which I can pass a div id, so that it indexes only the content inside that div?

Alternatively, are there any .NET-based crawlers that I can customize?

See https://issues.apache.org/jira/browse/NUTCH-585 and https://issues.apache.org/jira/browse/NUTCH-961

BTW, you'd get a more relevant audience by posting to the Nutch user list.

You can implement a Nutch filter (I like Jericho HTML Parser for this) that uses DOM manipulation to extract only the parts of the page you need to index. Jericho's TextExtractor class will give you clean text (sans HTML tags) to use in your index. I usually save that data in custom fields.
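To make the "only index what is inside one div" idea concrete, here is a minimal sketch in Python using only the standard library's html.parser. It is not a Nutch plugin and it does not use Jericho; the names DivTextExtractor and extract_text and the div id "content" are made up for this example, and it assumes reasonably well-formed HTML (every non-void tag is closed).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DivTextExtractor(HTMLParser):
    """Collects the text found inside the element whose id is target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # nesting depth inside the target; 0 means outside
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            # Already inside the target: track nested elements.
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            # Entering the target element.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside the target element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, target_id):
    """Return the tag-free text inside the element with the given id."""
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: header and footer both mention "Contact".
page = (
    '<html><body>'
    '<div id="header"><a href="/contact">Contact Us</a></div>'
    '<div id="content"><h1>Products</h1>'
    '<p>Our <b>widgets</b> ship worldwide.<br>Order today.</p></div>'
    '<div id="footer">Contact</div>'
    '</body></html>'
)
print(extract_text(page, "content"))
```

Only the text inside the "content" div would reach the index, so the header and footer links no longer match a search for 'Contact'. In a real Nutch setup the same logic would live in a parse filter and the extracted text would go into a custom field, as described above.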

Related topics: asp.net, nutch, solr, solrnet, web-crawler
