Twitter Subjectivity Training Sets
I need a reliable and accurate method to filter twe开发者_如何转开发ets as subjective or objective. In other words I need to build a filter in something like Weka using a training set.
Are there any training sets available which could be used as a subjective/objective classifier for Twitter messages or other domains which may be transferable?
For research and non-profit purposes, SentiWordNet gives you exactly what you want. A commercial license is available too.
SentiWordNet : http://sentiwordnet.isti.cnr.it/
Sample Jave Code: http://sentiwordnet.isti.cnr.it/code/SWN3.java
Related Paper: http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf
The other approach I would try:
Example
Tweet 1: @xyz u should see the dark knight. Its awesme.
1) First a dictionary lookup for the for meanings.
"u" and "awesme" will not return anything.
2) Then go against the known abbreviations/shorthands and substitute matches with the expansions (Some resources: netlingo http://www.netlingo.com/acronyms.php or smsdictionary http://www.smsdictionary.co.uk/abbreviations)
Now the original tweet will look like:
Tweet 1: @xyz you should see the dark knight. Its awesme.
3) Then feed the remaining words in the spell checker and substitute with the best match (not always ideal and error prone for small words)
Related Link: Looking for Java spell checker library
Now the original tweet will look like:
Tweet 1: @xyz you should see the dark knight. Its awesome.
4) Split and feed the tweet into SWN3, aggregate the result
The problem with this approach is that
a) Negations should be handled outside SWN3.
b) Information in emoticons and exaggerated punctuations will be lost or they need to be handled separately.
There is sentiment training data at CMU somewhere. I can't remember the link. CMU has done a lot on twitter and sentiment analysis:
- From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
- Carnegie Mellon Study of Twitter Sentiments Yields Results Similar to Public Opinion Polls
I wrote an english vs. not english Naive Bayes classifier for twitter and made a ~example dev/test set and it was 98% accurate. I think that sort of thing is always pretty good if you are just trying to understand the problem, but a package like SentiWordNet might give you a head start.
The problem is defining what makes a tweet subjective or objective! It's important to understand that machine learning is less about the algorithm and more about the quality of the data.
You mention 75% accuracy is all you need.... what about recall? If you provide the right training data you might be able to get that, at the cost of lower recall.
The DynamicLMClassifier
in LingPipe works pretty good.
http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html
精彩评论