
Question on sentiment analysis

I have a question regarding sentiment analysis that I need help with.

Right now, I have a bunch of tweets I've gathered through the Twitter search API. Because I chose the search terms, I know which subjects or entities (person names) I want to look at. I want to know how others feel about these people.

For starters, I downloaded a list of English words with known valence/sentiment scores and calculated the sentiment (+/-) of each tweet based on which of these words appear in it. The problem is that sentiment calculated this way reflects the tone of the tweet rather than the opinion ABOUT the person.

For instance, I have this tweet:

"lol... Person A is a joke. lmao!"

The message is obviously in a positive tone, but person A should get a negative.
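
To make the mismatch concrete, here is a minimal sketch of the word-list scoring described above. The `VALENCE` table and its scores are made up for illustration; the real downloaded list is much larger and its values will differ.

```python
import re

# Hypothetical valence lexicon (illustration only, not the real word list).
VALENCE = {"lol": 1.0, "lmao": 1.0, "great": 2.0, "awful": -2.0}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tone_score(text):
    """Sum the valence of every known word, ignoring who the words are about."""
    return sum(VALENCE.get(tok, 0.0) for tok in tokenize(text))


tweet = "lol... Person A is a joke. lmao!"
score = tone_score(tweet)
print(score)  # positive, although the tweet is negative about Person A
```

With this toy lexicon the tweet scores positive because only "lol" and "lmao" are matched, which is exactly the tone-versus-target problem.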

To improve my sentiment analysis, I can probably take negation and modifiers from my word list into account. But how exactly can I get my sentiment analysis to look at the subject of the message (and possibly sarcasm) instead?

It would be great if someone could direct me towards some resources.


While you wait for answers from researchers in the AI field, I will give you some clues on what you can do quickly.

Even though this topic requires knowledge from natural language processing, machine learning and even psychology, you don't have to start from scratch unless you're desperate or have no trust in the quality of research going on in the field.

One possible approach to sentiment analysis is to treat it as a supervised learning problem: you have a small training corpus that includes human-made annotations (more about that later) and a testing corpus on which you measure how well your approach/system performs. For training you will need a classifier, such as an SVM, an HMM or some other, but keep it simple. I would start with binary classification: good, bad. You could do the same for a continuous spectrum of opinion, from positive to negative, that is, to get a ranking, like Google, where the most valuable results come out on top.

For a start, check out the libsvm classifier; it can do both classification {good, bad} and regression (ranking). The quality of the annotations will have a massive influence on the results you get, but where do you get them from?
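
To show the shape of the supervised setup without any dependencies, here is a rough sketch of a binary good/bad classifier. Note that it uses a plain perceptron over bag-of-words counts as a stand-in for an SVM; it does not call libsvm. The four training sentences are made up, and a real corpus would need many more annotated examples.

```python
import re
from collections import defaultdict


def features(text):
    """Bag-of-words features: each lowercase token maps to its count."""
    counts = defaultdict(int)
    for tok in re.findall(r"[a-z']+", text.lower()):
        counts[tok] += 1
    return counts


def train_perceptron(examples, epochs=10):
    """Learn one weight per word from (text, label) pairs; label is +1 or -1."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            feats = features(text)
            activation = bias + sum(weights[w] * c for w, c in feats.items())
            if label * activation <= 0:
                # Misclassified (or undecided): move the weights toward the label.
                for w, c in feats.items():
                    weights[w] += label * c
                bias += label
    return weights, bias


def predict(model, text):
    """Return "good" if the linear score is positive, otherwise "bad"."""
    weights, bias = model
    score = bias + sum(weights.get(w, 0.0) * c for w, c in features(text).items())
    return "good" if score > 0 else "bad"


# Made-up annotated examples, for illustration only.
TRAIN = [
    ("the food was great", 1),
    ("great service and friendly staff", 1),
    ("the food was awful", -1),
    ("awful service and rude staff", -1),
]
model = train_perceptron(TRAIN)
print(predict(model, "great food"), predict(model, "awful food"))
```

The point is that the weights come out of the annotated data rather than a hand-made word list; swapping the perceptron for a real SVM keeps the same train/predict structure.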

I found one project on sentiment analysis that deals with restaurants. There is both data and code, so you can see how they extracted features from natural language and which features scored high in classification or regression. The corpus consists of opinions from customers about restaurants they recently visited, with feedback about the food, service or atmosphere. The connection between their opinions and the numerical world is expressed as the number of stars they gave the restaurant. You have natural language on one side and the restaurant's rating on the other.

Looking at this example, you can devise your own approach to the problem stated. Take a look at nltk as well. With nltk you can do part-of-speech tagging and, with some luck, get names as well. Having done that, you can add a feature to your classifier that assigns a score to a name if within n words (skip n-gram) there are words expressing opinions (look at the restaurant corpus), or use the weights you already have, but it's best to rely on a classifier to learn the weights; that's its job.
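
The "opinion words within n words of a name" feature could look roughly like the sketch below. It assumes the name is already known (here it is passed in as `target`), so no nltk tagging is involved, and the `OPINION` weights are hypothetical placeholders for what a classifier would learn.

```python
import re

# Hypothetical opinion weights; in practice a classifier would learn these.
OPINION = {"joke": -1.0, "great": 1.0, "awful": -1.0, "brilliant": 1.0}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def target_score(text, target, window=3):
    """Sum opinion weights of words within `window` tokens of each mention of `target`."""
    tokens = tokenize(text)
    target_tokens = tokenize(target)
    n = len(target_tokens)
    total = 0.0
    counted = set()  # token positions already scored, so overlapping windows don't double count
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] != target_tokens:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + n + window)
        for j in range(lo, hi):
            if j not in counted and tokens[j] in OPINION:
                counted.add(j)
                total += OPINION[tokens[j]]
    return total


print(target_score("lol... Person A is a joke. lmao!", "Person A"))
```

A fixed window is crude: in a tweet that mentions two people close together, an opinion word aimed at one of them can land in the other's window, which is one more reason to feed this in as a feature and let the classifier weigh it.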


In the current state of technology this is impossible.

English (and any other language) is VERY complicated and cannot be "parsed" yet by programs. Why? Because EVERYTHING has to be special-cased. Saying that someone is a joke is a special case of a joke, which is another exception in your program. Et cetera, etc, etc.

A good example (posted by ScienceFriction somewhere here on SO):

Similarly, the sentiment word "unpredictable" could be positive in the context of a thriller but negative when describing the brake system of a Toyota.

If you are willing to spend +/-40 years of your life on this subject, go ahead, it will be much appreciated :)


I don't entirely agree with what nightcracker said. I agree that it is a hard problem, but we are making good progress towards a solution.

For example, part-of-speech tagging might help you figure out the subject, verb and object in the sentence. And n-grams might help you figure out the context in the Toyota vs. thriller example. Look at TagHelperTools. It is built on top of Weka and provides part-of-speech and n-gram tagging.
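
As a rough illustration of the n-gram idea (plain Python, not TagHelperTools or Weka), the sketch below extracts bigrams and looks them up in a small context-dependent table. The `BIGRAM_POLARITY` entries are invented for the example; a real system would learn such weights from annotated data.

```python
import re


def ngrams(text, n=2):
    """Return the list of n-token tuples in the lowercased text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical polarities keyed on bigrams, so the same word can score
# differently depending on its neighbour.
BIGRAM_POLARITY = {
    ("unpredictable", "thriller"): 1.0,
    ("unpredictable", "brakes"): -1.0,
}


def contextual_score(text):
    """Sum the polarity of every known bigram in the text."""
    return sum(BIGRAM_POLARITY.get(bg, 0.0) for bg in ngrams(text))


print(contextual_score("an unpredictable thriller"))
print(contextual_score("a Toyota with unpredictable brakes"))
```

Here "unpredictable" contributes a positive score next to "thriller" and a negative one next to "brakes", which a single-word lexicon cannot express.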

Still, it is difficult to get the results that OP wants, but it won't take 40 years.
