algorithm to calculate similarity between texts
I am trying to score similarity between posts from social networks, but I haven't found any good algorithms for that. Any thoughts?
I have tried Levenshtein, Jaro-Winkler, and others, but those are better suited to comparing texts without regard to sentiment. In posts, one text may say "I really love dogs" and another "I really hate dogs", and we need to classify this case as totally different.
Thanks
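To make the problem concrete, here is a minimal sketch of a plain Levenshtein comparison on the two example sentences. The `similarity` normalization (one minus distance divided by the longer length) is an assumption chosen for illustration; other normalizations exist.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 means identical, 0.0 means nothing shared."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


a = "I really love dogs"
b = "I really hate dogs"
print(levenshtein(a, b))           # 3
print(round(similarity(a, b), 2))  # 0.83
```

Only three characters differ, so a character-level metric rates the two posts as about 83% similar even though their sentiment is opposite.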
Ahh... but "I really love dogs" and "I really hate dogs" are totally similar ;), both discuss one's feelings towards dogs. It seems that you're missing a step in there:
- Run your algorithm and get the general topic groups (e.g. "feelings towards dogs").
- Run your algorithm again, but this time on each previously "discovered" group, and let it further classify them into subgroups (e.g. "I hate dogs"/"I love dogs").
If your algorithm adjusts itself based on its experience (i.e. there is some learning involved), then make sure you run one instance of the algorithm for the first classification and a new instance for each sub-classification. If you don't, you may end up in a case where you find some groups, and any time you run your algorithm on those same groups the results are nearly identical or nothing changes at all.
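The two passes described above can be sketched roughly as follows. This is only a toy illustration, not a recommended algorithm: `GreedyClusterer`, the `OPINION_WORDS` lexicon, the Jaccard measure, and the 0.5 thresholds are all hypothetical choices made for the example. The point is that the clusterer keeps state, so a fresh instance is created for the topic pass and for each sub-classification.

```python
# Hypothetical toy lexicon, invented for this example only.
OPINION_WORDS = {"love", "hate", "great", "terrible"}


def tokens(text):
    return set(text.lower().split())


def topic_features(text):
    # Everything except opinion words: what the post is about.
    return tokens(text) - OPINION_WORDS


def opinion_features(text):
    # Only the opinion words: how the author feels about it.
    return tokens(text) & OPINION_WORDS


def jaccard(a, b):
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


class GreedyClusterer:
    """Stateful toy clusterer: a text joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""

    def __init__(self, features, threshold):
        self.features = features
        self.threshold = threshold
        self.clusters = []

    def fit(self, texts):
        for text in texts:
            feats = self.features(text)
            for cluster in self.clusters:
                if jaccard(feats, self.features(cluster[0])) >= self.threshold:
                    cluster.append(text)
                    break
            else:
                self.clusters.append([text])
        return self.clusters


def two_stage(posts):
    # Pass 1: one instance groups posts by topic.
    topic_groups = GreedyClusterer(topic_features, 0.5).fit(posts)
    # Pass 2: a NEW instance per topic group splits it by opinion.
    return [GreedyClusterer(opinion_features, 0.5).fit(group)
            for group in topic_groups]


posts = [
    "I really love dogs",
    "I really hate dogs",
    "I truly love dogs",
    "the new phone battery is great",
    "the new phone battery is terrible",
]
for topic_group in two_stage(posts):
    print(topic_group)
```

With these toy inputs, the first pass yields a "dogs" group and a "phone battery" group, and the second pass splits each of them by opinion word.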
Update
Apache Mahout provides a lot of useful algorithms and examples for clustering, classification, genetic programming, decision forests, and recommendation mining. Here are some of the text classification examples from Mahout:
- Wikipedia classification
- Twenty Newsgroups classification
- Creating Vectors from Text
- Document Similarity with Mahout
- Item Based Recommender
I'm not sure which one would best apply to your problem, but maybe if you look them over you'll figure out which one is the most suitable for your specific application.
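For a rough idea of what "creating vectors from text" and "document similarity" mean in general, here is a minimal bag-of-words and cosine-similarity sketch in plain Python. It does not use Mahout's API; the function names `vectorize` and `cosine` are made up for this example, and real pipelines typically add weighting such as TF-IDF.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: token -> count (whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0


a = vectorize("I really love dogs")
b = vectorize("I really hate dogs")
print(cosine(a, b))  # 0.75
```

Note that the two example posts still score 0.75 here, so vector similarity on its own groups them by topic rather than by sentiment.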
My research is about sentiment analysis, and I agree with Pierre: it's a hard problem and, given its subjective nature, no general algorithm exists. One of the approaches I first tried was to map sentences into an emotional space and decide on each sentence's sentiment based on its distance to the sentiment centroids. You can have a look at it at:
http://dtminredis.housing.salle.url.edu:8080/EmoLib/
The sentences above work well ;)
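The centroid idea can be sketched like this. It is not EmoLib's actual implementation: the two-dimensional (valence, arousal) space, the word coordinates in `WORD_COORDS`, and the centroid positions in `CENTROIDS` are all invented here purely to illustrate "nearest sentiment centroid".

```python
import math

# Hypothetical (valence, arousal) coordinates, invented for illustration.
WORD_COORDS = {
    "love": (0.9, 0.6),
    "hate": (-0.9, 0.7),
    "really": (0.0, 0.3),
}

# Hypothetical sentiment centroids in the same space.
CENTROIDS = {
    "positive": (0.7, 0.5),
    "negative": (-0.7, 0.5),
    "neutral": (0.0, 0.0),
}


def embed(sentence):
    """Average the coordinates of the known words; unknown words are skipped."""
    points = [WORD_COORDS[w] for w in sentence.lower().split() if w in WORD_COORDS]
    if not points:
        return (0.0, 0.0)
    return (sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points))


def classify(sentence):
    """Return the label of the centroid closest to the sentence's point."""
    x, y = embed(sentence)
    return min(CENTROIDS,
               key=lambda label: math.hypot(x - CENTROIDS[label][0],
                                            y - CENTROIDS[label][1]))


print(classify("I really love dogs"))  # positive
print(classify("I really hate dogs"))  # negative
```

The quality of such an approach depends almost entirely on the lexicon and on where the centroids are placed, which is part of why the problem is hard.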
You might want to have a look at Opinion mining and sentiment analysis to give you an idea of the complexity of the task.
Short answer: there are no "good algorithms" for this, only mediocre ones. This is a very hard problem. Good luck.