Custom Parser for Nutch (or open source .NET Crawler)
I have been using Nutch/Solr/SolrNet for my search solutions, I must say, it works a treat. On a new site I'm working on, I am using Master pages, a开发者_运维百科s a result, content in the header and footer is getting indexed and distorts the results. For example, I have a link to the Contact Us page in the header. Now, when I search for 'Contact' the result returns all the pages in the site.
Is there a customizable Nutch parser that i can maybe pass a div id and then it only indexes content inside the div.
Or if there are .NET based crawlers that I can customize.
See https://issues.apache.org/jira/browse/NUTCH-585 and https://issues.apache.org/jira/browse/NUTCH-961
BTW you'd get a more relevant audience by posting to the Nutch user list
You can implement a Nutch filter (I like Jericho HTML Parser) to extract only the parts of the page you need to index using DOM manipulation. You can use the TextExtractor class to grab clean text (sans HTML tags) to be used in your index. I usually save that data in custom fields.
精彩评论