How does (carrot) clustering work in solr?

2023-03-16 14:28 Asked by:

I am running Lucene/Solr 4 to test different features, including clustering. Currently, 1 million documents are indexed. Every document has the following fields:

ID (unique key):
    Example1: 10245
    Example2: 24974
TOPIC (keywords of the document):
    Example1: "disaster/japan/nuclear power station"
    Example2: "world/japan/nuclear power"
HEADLINE (1 line of text):
    Example1: "explosion at nuclear power plant in japan"
    Example2: "news about japans nuclear power plant"
TEXT (the full text):
    "In the Japanese nuclear power plant in Fukushima..."

All the fields are indexed and stored, except TEXT, which is only indexed, not stored. I use the following specific configuration:

  <str name="carrot.title">TOPIC</str>
   <str name="carrot.snippet">HEADLINE</str>

If you look at the examples, you can see that the TOPIC values differ, but "japan" appears in both. Is it possible to configure Solr/Carrot so that Example1 and Example2 end up in one cluster because of the matching "japan"?

Furthermore, there could be a third TOPIC such as "news/nuclear power", with no "japan" in it, while the HEADLINE and TEXT use the words "japans power plant". Which Solr/Carrot configuration is relevant in order to get all three news items into one cluster?

Thank you!

Carrot2 is designed to cluster natural / unstructured text and such algorithms will very rarely produce results that a human would find perfect. Unfortunately, such algorithms are also hard to "debug" -- the clusters they produce depend on many factors, such as the frequencies with which words occur in your documents. In your specific example, the word Japan may not have been chosen to form a cluster because it's too frequent -- it appears in all of the documents you quoted.

Here are a few tips you may want to try to tweak the clusters:

Try separating keywords with a period followed by a space rather than a slash, e.g. "disaster. japan. nuclear power station". If you do that, Carrot2 will treat word sequences, such as "nuclear power station", as phrases rather than individual words.
Try a different Carrot2 clustering algorithm, e.g. STC.
If there is a chance to get your full story text field stored (or maybe part of it, such as the first paragraph), use the HEADLINE for carrot.title and the full text / excerpt for carrot.snippet.
Play with the specific settings of Carrot2 algorithms. The best tool for this would be Carrot2 Clustering Workbench. Here's how to connect it to Solr: http://wiki.apache.org/solr/ClusteringComponent#Tuning_Carrot2_clustering
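
To make the first three tips concrete, here is a minimal Python sketch, not a definitive recipe. The helper names (topic_to_phrases, clustering_query), the base URL, and the engine name "stc" are assumptions for illustration: the engine name must match one declared in your solrconfig.xml, and the sketch assumes carrot.title and carrot.snippet are accepted as request parameters (they can equally be set as handler defaults, as in the question). Using TEXT as the snippet only makes sense once that field is stored.

```python
from urllib.parse import urlencode


def topic_to_phrases(topic: str) -> str:
    """Rewrite slash-separated keywords as period-separated ones.

    'a/b c' becomes 'a. b c', so each keyword reads as its own sentence.
    """
    parts = [part.strip() for part in topic.split("/") if part.strip()]
    return ". ".join(parts)


def clustering_query(base_url: str, query: str, engine: str = "stc") -> str:
    """Build a Solr request URL carrying clustering parameters.

    base_url and engine are hypothetical values; adjust them to your setup.
    """
    params = {
        "q": query,
        "rows": 100,
        "clustering": "true",
        "clustering.results": "true",
        # Assumption: an engine with this name is declared in solrconfig.xml.
        "clustering.engine": engine,
        # Headline as title, full text (or an excerpt) as snippet.
        "carrot.title": "HEADLINE",
        "carrot.snippet": "TEXT",  # requires TEXT to be a stored field
    }
    return base_url + "?" + urlencode(params)


# Prints: disaster. japan. nuclear power station
print(topic_to_phrases("disaster/japan/nuclear power station"))
print(clustering_query("http://localhost:8983/solr/clustering", "japan"))
```

The keyword rewrite would be applied to TOPIC before indexing, so existing documents need to be reindexed for it to take effect.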
