Text Mining on huge list of strings

2023-04-02 15:26

I have a list of strings: a fairly large set of IDs and strings spread across 4-5 big files of around 1 GB each. The strings are formatted like this:

1,Hi

2,Hi How r u?

2,How r u?

3,where r u?

3,what does this mean

3,what it means

Now I want to do text mining on these strings and build a dendrogram that displays them in the following way:

1-Hi

2-Hi How r u?

 ----How r u?

3-What does this mean?

 ----what it means?

3-Where are you?

This output is based on the similarity of the strings that follow the comma after an ID (suppose it is the ID of the person who used those strings), computed separately for each person. If some other person used the same words, those strings should be grouped according to the strings that person used.

This seems like a simple task, but I want to do it on Hadoop/Mahout or something else that can handle a huge data set on a cluster of Linux machines. How should I approach this problem? I have already tried different approaches in Mahout, in which I created a sequence file, ran seq2sparse to get vectors, and then tried to do clustering, but it didn't work for me. Any help or pointers in the right direction would be great.

Thanks & Regards, Atul

I think what you really need is hierarchical clustering. One implementation was proposed for Mahout, and another is available in the Shogun Toolbox (which is also designed for large-scale computation). It is hard to guarantee that either will work, though, because this input looks difficult to cluster.
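To make the idea concrete, here is a minimal single-machine sketch of hierarchical (agglomerative) clustering applied to the sample lines from the question. It is not Mahout or Shogun code. The helper names (`tokens`, `jaccard`, `cluster`, `group_by_id`), the choice of Jaccard similarity over word sets, single linkage, and the 0.15 cut-off are all assumptions made for illustration, and the cut-off was picked only so that the sample groups the way the question describes. The pairwise comparison is quadratic per ID, so on the real data you would first partition the lines by ID (for example with a MapReduce job keyed on the ID) and cluster each partition separately.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased set of words, with surrounding punctuation stripped."""
    return {w.strip("?.!,").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(strings, threshold=0.15):
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two most similar clusters and stops once the
    best remaining similarity falls below `threshold` (an arbitrary value).
    """
    clusters = [[s] for s in strings]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: similarity of the closest pair of members.
                sim = max(jaccard(tokens(x), tokens(y))
                          for x in clusters[i] for y in clusters[j])
                if sim > best:
                    best, pair = sim, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i].extend(clusters.pop(j))
    return clusters


def group_by_id(lines):
    """Split 'id,text' lines by id, then cluster each id's strings."""
    by_id = defaultdict(list)
    for line in lines:
        ident, _, text = line.partition(",")
        by_id[ident.strip()].append(text.strip())
    return {ident: cluster(texts) for ident, texts in by_id.items()}


sample = [
    "1,Hi", "2,Hi How r u?", "2,How r u?",
    "3,where r u?", "3,what does this mean", "3,what it means",
]
for ident, groups in group_by_id(sample).items():
    for group in groups:
        print(f"{ident}-{group[0]}")       # first string heads the group
        for other in group[1:]:
            print(f" ----{other}")         # similar strings nested below
```

With this cut-off, "Hi How r u?" and "How r u?" end up in one group for ID 2, while for ID 3 "where r u?" stays on its own and the two "what ..." strings are grouped together. Recording the similarity at which each merge happens would give the heights needed to draw an actual dendrogram.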

Tags: data-mining mahout text-mining
