Text mining on a huge list of strings
I have a list of strings: a large set of IDs and strings spread across 4-5 big files, around 1 GB each. The strings are formatted like this:
1,Hi
2,Hi How r u?
2,How r u?
3,where r u?
3,what does this mean
3,what it means
Now I want to do text mining on these strings and build a dendrogram that displays them in the following way:
1-Hi
2-Hi How r u?
----How r u?
3-What does this mean?
----what it means?
3-Where are you?
This output is based on the similarity of the strings that follow the comma after an ID (say, the ID of the person who used those strings), computed separately for each person. If some other person used the same words, those strings should be grouped according to the strings that person used.
This seems like a simple task, but I want to do it on Hadoop/Mahout or something else that can handle a huge data set on clustered Linux machines. How should I approach this problem? I have already tried different approaches in Mahout: I created sequence files, converted them to vectors with seq2sparse, and then tried clustering, but it didn't work for me. Any help or pointers in the right direction would be great.
Thanks & Regards, Atul
I think what you really need is hierarchical clustering. An implementation was proposed for Mahout, and one is also available in the Shogun Toolbox (which is also designed for large-scale computation). It is hard to guarantee that it will work, though, because the input looks difficult to cluster.
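To make the idea concrete, here is a minimal single-machine sketch in plain Python. It is not Mahout or Shogun code. It groups the sample lines by ID and then runs a naive single-linkage agglomerative clustering inside each ID. The similarity measure (Jaccard overlap of lowercased word sets), the threshold of 0.15, and all function names are my own assumptions, picked only so that the toy data from the question groups roughly as requested.

```python
import re
from collections import defaultdict

SAMPLE = [
    "1,Hi",
    "2,Hi How r u?",
    "2,How r u?",
    "3,where r u?",
    "3,what does this mean",
    "3,what it means",
]

def tokens(text):
    # Lowercased word set; punctuation is dropped (an assumption, not a requirement).
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    # Jaccard similarity of the two token sets, in [0, 1].
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def parse(lines):
    # Group strings by the ID before the first comma, keeping input order.
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        ident, _, text = line.partition(",")
        groups[ident].append(text)
    return groups

def single_link(c1, c2):
    # Single linkage: similarity of the closest pair across two clusters.
    return max(jaccard(a, b) for a in c1 for b in c2)

def cluster(strings, threshold=0.15):
    # Naive agglomerative clustering: repeatedly merge the most similar
    # pair of clusters until no pair reaches the (arbitrary) threshold.
    clusters = [[s] for s in strings]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = single_link(clusters[i], clusters[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

def render(groups, threshold=0.15):
    # First string of each cluster gets "ID-", the rest are indented with "----".
    out = []
    for ident, strings in groups.items():
        for members in cluster(strings, threshold):
            out.append(f"{ident}-{members[0]}")
            out.extend(f"----{m}" for m in members[1:])
    return out

for row in render(parse(SAMPLE)):
    print(row)
```

The pairwise loop is quadratic or worse per ID, so this only illustrates the grouping logic. On Hadoop, the natural split would be to key records by ID in the map phase and cluster each ID's strings in the reducer, which works as long as no single ID has too many strings.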