How can I cluster short messages [Tweets] based on topic? [Topic Based Clustering]

2023-01-01 04:30

I am planning an application that will cluster short messages/tweets by topic. The number of topics will be limited, for example Sports [NBA, NFL, Cricket, Soccer], Entertainment [movies, music] and so on.

I can think of two approaches to this:

1. Ask users to tag messages the way Stack Overflow asks users to tag questions. Users select tags from a predefined list, and the server groups messages by tag. Pros: simple design, less complexity in code. Cons: users' choices are restricted, and clusters are not dynamic; if a new event occurs, the predefined tags will miss it.
2. Take the message, delete the stopwords [predefined in a dictionary], stem the remaining words, and apply a clustering algorithm to the stemmed message to form clusters. A cluster is displayed depending on its popularity, and only for as long as it remains popular [many messages/minute]. New messages are skimmed and assigned to the corresponding clusters. Pros: dynamic clustering based on the popularity of the event/accident. Cons: increased complexity, more server resources required.
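The second approach can be sketched roughly as below. This is only an illustration under several assumptions that are not in the question: the stopword list is a tiny placeholder, `stem` is a crude suffix stripper standing in for a real stemmer, and the clustering is a simple single-pass ("leader") scheme using Jaccard similarity with an arbitrary threshold of 0.2. The names `LeaderClusterer`, `tokens` and `jaccard` are made up for this example.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller dictionary.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "in", "on", "at",
             "to", "of", "and", "for", "what", "this", "that", "by"}


def stem(word):
    # Crude suffix stripping, standing in for a real stemmer (assumption).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(message):
    # Lowercase, split into words, drop stopwords, stem what is left.
    found = re.findall(r"[a-z0-9]+", message.lower())
    return {stem(w) for w in found if w not in STOPWORDS}


def jaccard(a, b):
    # Overlap between two term sets, from 0.0 (disjoint) to 1.0 (identical).
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class LeaderClusterer:
    """Single-pass clustering: each message joins the most similar
    existing cluster, or starts a new one if nothing is similar enough."""

    def __init__(self, threshold=0.2):
        self.threshold = threshold  # hypothetical value, needs tuning
        self.clusters = []          # each: {"terms": set, "messages": list}

    def add(self, message):
        terms = tokens(message)
        best, best_sim = None, 0.0
        for i, cluster in enumerate(self.clusters):
            sim = jaccard(terms, cluster["terms"])
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= self.threshold:
            self.clusters[best]["terms"] |= terms
            self.clusters[best]["messages"].append(message)
            return best
        self.clusters.append({"terms": set(terms), "messages": [message]})
        return len(self.clusters) - 1


clusterer = LeaderClusterer()
first = clusterer.add("Lakers win the NBA finals")
second = clusterer.add("NBA finals: Lakers winning again")
third = clusterer.add("New movie trailer released today")
```

Here the two basketball messages end up in the same cluster and the movie message starts a new one. Popularity could then be estimated by storing a timestamp with each message and counting messages per minute per cluster; that part is left out of the sketch.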

I would like to know whether there are any other approaches to this problem. Or are there any ways of improving the above mentioned methods?

Also, please suggest some good clustering algorithms. I think the "K-Nearest Clustering" algorithm is apt for this situation.

Check out Carrot2. This tool extracts tags from the text and clusters on them. You can download it from here and check the algorithms implemented (Lingo, mainly) here.

Hope this helps.

Use Bayesian classification. Train the filter with some predefined corpus, and (optionally) provide a way for users to further refine it by flagging things that were incorrectly categorized.

Here are some examples of using the Bayesian classifier in NLTK.
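To make the idea concrete, here is a minimal from-scratch sketch of a naive Bayes text classifier with add-one smoothing. It does not use NLTK, so it should not be read as NLTK's API; the class name `NaiveBayes`, its methods and the four training messages are all invented for illustration. Calling `train` again with a corrected label is one way user flagging could feed back into the model.

```python
import math
import re
from collections import Counter, defaultdict


def words(text):
    # Lowercase word tokens; no stopword removal or stemming in this sketch.
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> training messages
        self.vocab = set()

    def train(self, text, label):
        # Also usable for user corrections: call again with the right label.
        self.doc_counts[label] += 1
        for w in words(text):
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of log likelihoods of each word.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words(text):
                # Add-one smoothing so unseen words do not zero the score.
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


nb = NaiveBayes()
nb.train("Lakers beat Celtics in NBA finals", "sports")
nb.train("Great goal in the soccer match", "sports")
nb.train("New movie trailer is out", "entertainment")
nb.train("This album has great music", "entertainment")

sports_guess = nb.classify("NBA match tonight")
music_guess = nb.classify("new music album")
```

With this toy corpus the first message is labelled `sports` and the second `entertainment`. A real deployment would need a much larger training corpus per topic.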

I am working on something similar. I think hashtags are a good way to go if you are talking specifically about Twitter. You could also perform some classification, but it should be enriched with an external knowledge base such as Wikipedia. Anyway, if you find a better solution, please post it here.
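As a rough illustration of the hashtag idea, the snippet below groups messages by the hashtags they contain. The function name `group_by_hashtag` and the sample messages are hypothetical, and hashtags are compared case-insensitively as a simplifying assumption.

```python
import re
from collections import defaultdict


def group_by_hashtag(messages):
    # Map each lowercased hashtag to the messages that mention it.
    groups = defaultdict(list)
    for msg in messages:
        # set() avoids adding a message twice if it repeats a tag.
        for tag in set(re.findall(r"#(\w+)", msg.lower())):
            groups[tag].append(msg)
    return dict(groups)


groups = group_by_hashtag([
    "Go Lakers #NBA",
    "What a game #nba #Finals",
    "no tags here",
])
```

Messages without any hashtag are simply left out, which is where a classifier or clustering step would have to take over.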
