Solr and Nutch - How to take control over Facets?
Sorry if this question might be too general. I'd be happy with good links to documentation, if there are any. Google won't help me find them.
I need to understand how facets can be extracted from a web site crawled by Nutch and then indexed by Solr. On the web site, pages have meta tags such as <meta name="price" content="123.45"/> or <meta name="categories" content="category1, category2"/>. Can I tell Nutch to extract those and Solr to treat them as facets?
In the example above, I want to specify manually that the meta name "categories" is to be treated as a facet, but the content should be dynamically used as categories.
Does it make sense? Is it possible to do with Nutch and Solr, or should I rethink my way of using it?
I haven't used Nutch (I use Heritrix), but at the end of the day Nutch needs to extract the meta tag values and index them in Solr (using SolrJ, for example) as separate Solr fields: "price", "categories", and so on.
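To make the extraction step concrete, here is a minimal sketch in plain Python (not Nutch or SolrJ code) of pulling the two meta tags from the question out of a page and shaping them into a document with one field per tag. The names MetaTagExtractor and build_solr_doc are hypothetical, and it assumes "categories" would be a multi-valued field in the Solr schema so that each comma-separated value becomes its own facet value.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collects <meta name="..." content="..."> pairs for the wanted names."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name in self.wanted and content is not None:
            self.found[name] = content


def build_solr_doc(url, html):
    """Hypothetical helper: turn a page's meta tags into a dict of Solr fields."""
    extractor = MetaTagExtractor(["price", "categories"])
    extractor.feed(html)
    doc = {"id": url}
    if "price" in extractor.found:
        doc["price"] = float(extractor.found["price"])
    if "categories" in extractor.found:
        # Assumption: "categories" is multi-valued, one entry per category.
        raw = extractor.found["categories"]
        doc["categories"] = [c.strip() for c in raw.split(",") if c.strip()]
    return doc
```

A dict shaped like this could then be sent to Solr by whatever client you use; the splitting step is what lets each category be counted separately when faceting.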
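Then you run a facet query on the "categories" field (facet=true with facet.field=categories) to get facets per category. The Solr documentation has a page on faceting.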
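As a rough illustration of such a facet request, the sketch below builds the query URL in Python. The base URL http://localhost:8983/solr and the helper name facet_query_url are assumptions for the example; your Solr host, port, and core path may differ.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, facet_field, query="*:*"):
    """Hypothetical helper: build a Solr select URL that facets on one field."""
    params = [
        ("q", query),                  # match all documents by default
        ("rows", 0),                   # only the facet counts are of interest
        ("facet", "true"),             # faceting must be switched on explicitly
        ("facet.field", facet_field),  # the field to count values for
        ("wt", "json"),                # ask for a JSON response
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr instance; adjust to your installation.
    print(facet_query_url("http://localhost:8983/solr", "categories"))
```

Setting rows to 0 is a design choice for the case where you only want the category counts and not the matching documents themselves.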
One option is to use Nutch with the metadata plugin.
Although it is presented as an example, it is included with the distribution. Assuming you already know how to configure Nutch and crawl data with it, you need to enable the metadata plugin before indexing. Edit conf/nutch-site.xml:
<property>
<name>plugin.includes</name>
<value>urlmeta|(rest of the plugins)</value>
</property>
The metadata tags that need to be indexed, such as price, can be supplied in another property:
<property>
<name>urlmeta.tags</name>
<value>price</value>
</property>
Now you can run the Nutch crawl command. After crawling and indexing with Solr, you should see a price field in the index. To facet on it, add facet.field (together with facet=true) to your query.
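Once a facet query returns, the counts have to be read out of the response. In Solr's default JSON format, each entry under facet_counts.facet_fields is a flat list that alternates value and count. The sketch below, with a hypothetical helper name and a made-up sample response, turns that list into a dictionary.

```python
import json


def facet_counts(response_text, field):
    """Hypothetical helper: map each facet value to its document count."""
    data = json.loads(response_text)
    flat = data["facet_counts"]["facet_fields"][field]
    # The flat list alternates value, count, value, count, ...
    return dict(zip(flat[0::2], flat[1::2]))


if __name__ == "__main__":
    # Made-up sample response for illustration only, not real crawl output.
    sample = json.dumps({
        "facet_counts": {
            "facet_fields": {"categories": ["category1", 7, "category2", 3]}
        }
    })
    print(facet_counts(sample, "categories"))
```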
Here are some links of interest.
- Using Solr to index Nutch data
- Help on Solr faceting queries