Efficiently selecting a title (the center of the cluster) for a cluster of strings

2023-04-05 23:20 Asked by:

I have (imperfectly) clustered string data, where the items in one cluster might look like this:

[ 
  Yellow ripe banana very tasty,
  Yellow ripe banana with little dots,
  Green apple with little dots,
  Green ripe banana - from the market, 
  Yellow ripe banana,
  Nice yellow ripe banana,
  Cool yellow ripe banana - my favourite,
  Yellow ripe,
  Yellow ripe
],

where the optimal title would be 'Yellow ripe banana'.

Currently, I am using a simple heuristic with the help of SQL GROUP BY: choose the most common name, or the shortest one in case of a tie. My data contains a large number of such clusters, they change frequently, and every time a fruit is added to or removed from a cluster, the title of that cluster has to be recalculated.
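For reference, the current heuristic can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the actual SQL; the function name `most_common_name` is made up for the example.

```python
from collections import Counter

cluster = [
    "Yellow ripe banana very tasty",
    "Yellow ripe banana with little dots",
    "Green apple with little dots",
    "Green ripe banana - from the market",
    "Yellow ripe banana",
    "Nice yellow ripe banana",
    "Cool yellow ripe banana - my favourite",
    "Yellow ripe",
    "Yellow ripe",
]

def most_common_name(names):
    """Return the most frequent complete name; ties go to the shortest name."""
    counts = Counter(names)
    # Smallest key wins: highest count first, then fewest characters.
    return min(counts, key=lambda name: (-counts[name], len(name)))

title = most_common_name(cluster)
print(title)
```

On the sample cluster this selects 'Yellow ripe', because it is the only complete name that occurs more than once.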

I would like to improve two things:

(1) Efficiency - e.g., compare the new fruit name to the title of the cluster only, and avoid grouping / phrase clustering of all fruit titles each time.

(2) Precision - instead of looking for the most common complete name, I would like to extract the most common phrase. The current algorithm would choose 'Yellow ripe', which repeats 2 times and is the most common complete name; however, as a phrase, 'Yellow ripe banana' is the most common in the given set.

I am thinking of using Solr + Carrot2 (got no experience with the second). At this point, I do not need to cluster the documents - they are already clustered based on other parameters - I only need to choose the central phrase as the center/title of the cluster.

Any input is much appreciated, thanks!

Solr provides an analysis component called ShingleFilter that creates tokens from groups of adjacent words. If you put it in your analysis chain (i.e. apply it to incoming documents when you index them) and then compute facets for the resulting field with a query restricted to the "fruit cluster", you will get a list of all distinct shingles along with their occurrence frequencies. I think you can even retrieve them sorted by frequency, which should make it easy to derive the title you want. Then, when you add a new fruit, its shingles will automatically be included in the facet computation the next time around.
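To see what such shingle frequencies look like without setting up Solr, here is a rough Python imitation. It is a sketch, not Solr's implementation: the helper names `shingles` and `shingle_counts` are made up, lowercasing and the `\w+` tokenization are assumptions (in Solr that would be up to the tokenizer and filters you configure), and each shingle is counted once per name to mimic a facet count over matching documents.

```python
import re
from collections import Counter

def shingles(text, min_size=1, max_size=3):
    """Yield every run of min_size..max_size adjacent words in text."""
    words = re.findall(r"\w+", text.lower())  # assumption: lowercase, split on non-word chars
    for size in range(min_size, max_size + 1):
        for start in range(len(words) - size + 1):
            yield " ".join(words[start:start + size])

def shingle_counts(names, max_size=3):
    """Count, for each shingle, how many names contain it."""
    counts = Counter()
    for name in names:
        # set() so a shingle counts once per name, like a per-document facet count
        counts.update(set(shingles(name, max_size=max_size)))
    return counts

cluster = [
    "Yellow ripe banana very tasty",
    "Yellow ripe banana with little dots",
    "Green apple with little dots",
    "Green ripe banana - from the market",
    "Yellow ripe banana",
    "Nice yellow ripe banana",
    "Cool yellow ripe banana - my favourite",
    "Yellow ripe",
    "Yellow ripe",
]

counts = shingle_counts(cluster)
for shingle, count in counts.most_common(6):
    print(count, shingle)
```

Note that short shingles such as 'ripe' or 'yellow ripe' are contained in more names than 'yellow ripe banana', so raw frequency alone will not pick the three-word phrase; some weighting by length is needed.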

A slightly more concrete version of this proposal:

Create two fields: fruit_shingle and cluster_id.

Configure fruit_shingle with the ShingleFilter and any other processing you might want (like tokenizing at word boundaries with maybe StandardTokenizer, prior to the ShingleFilter).

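Configure cluster_id as an identifier field, using whatever data you use to identify the clusters. Since all fruits in a cluster share the same value, it should not be the schema's unique key.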

For each new fruit, index a document with its text in fruit_shingle and the id of its cluster in cluster_id.

Then retrieve facets for a query: "cluster_id:", and you will get a list of words, word pairs, word triplets, etc. (shingles). You can configure the ShingleFilter to have a max length, I believe. Sort the facets by some combination of length and/or frequency that you deem appropriate and use the top-ranked shingle as the "title" of the fruit cluster.
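The last step might look roughly like the Python sketch below. Several things here are assumptions rather than part of the proposal: the cluster id 42, the helper names `facet_params` and `pick_title`, and the scoring rule (count multiplied by the number of words), which is just one possible "combination of length and frequency". The request parameters are standard Solr faceting parameters, and the sample list is hand-written in the flat [term, count, term, count, ...] layout that Solr's JSON response uses for a facet field by default.

```python
from urllib.parse import urlencode

def facet_params(cluster_id, field="fruit_shingle", limit=50):
    """Build the query string for a facet request restricted to one cluster."""
    return urlencode({
        "q": f"cluster_id:{cluster_id}",
        "rows": 0,               # only the facet counts are needed, not the documents
        "facet": "true",
        "facet.field": field,
        "facet.sort": "count",   # most frequent shingles first
        "facet.mincount": 2,     # skip shingles that occur in a single fruit
        "facet.limit": limit,
        "wt": "json",
    })

def pick_title(flat_facets):
    """Pick a title from a flat [term, count, term, count, ...] facet list."""
    pairs = zip(flat_facets[0::2], flat_facets[1::2])
    # Assumed scoring: count * number of words; ties go to the more frequent shingle.
    best = max(pairs, key=lambda tc: (tc[1] * len(tc[0].split()), tc[1]))
    return best[0]

# Hand-written sample of what facet_counts -> facet_fields -> fruit_shingle
# could contain for the cluster in the question (illustrative only).
sample = [
    "ripe", 8, "yellow", 7, "yellow ripe", 7, "banana", 6,
    "ripe banana", 6, "yellow ripe banana", 5, "with little dots", 2,
]

query = facet_params(42)
title = pick_title(sample)
print(query)
print(title)
```

With this scoring, 'yellow ripe banana' (5 x 3 = 15) beats 'yellow ripe' (7 x 2 = 14). If the field is lowercased during analysis, the title comes back lowercased too, so you may want to map it back to one of the original names for display.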
