Training data for sentiment analysis [closed]

2023-04-07 05:09 Q&A author:

Closed. This question is seeking recommendations for books, tools, software libraries, and more. It does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.

Where can I get a corpus of documents that have already been classified as positive/negative for sentiment in the corporate domain? I want a large corpus of documents that provide reviews for companies, like reviews of companies provided by analysts and media.

I have found corpora with reviews of products and movies. Is there a corpus for the business domain, including reviews of companies, that matches the language of business?

http://www.cs.cornell.edu/home/llee/data/

http://mpqa.cs.pitt.edu/corpora/mpqa_corpus

You can use Twitter, with its smileys, as described in this paper: http://web.archive.org/web/20111119181304/http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf
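
As a rough illustration of the smiley idea, here is a minimal sketch of labeling tweets by the emoticons they contain. The emoticon lists and the function name `label_by_emoticon` are assumptions made for this example and are not taken from the linked paper.

```python
import re

# Illustrative emoticon sets (an assumption); real systems use longer lists.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(", "=(")


def _contains_any(text, emoticons):
    """Return True if any of the given emoticons occurs in the text."""
    return any(e in text for e in emoticons)


def label_by_emoticon(tweet):
    """Return (label, cleaned_text), or None if the tweet gives no clear signal."""
    has_pos = _contains_any(tweet, POSITIVE_EMOTICONS)
    has_neg = _contains_any(tweet, NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        # Either no emoticon at all or mixed signals: skip the tweet.
        return None
    # Strip the emoticons so a classifier cannot simply memorize them.
    cleaned = tweet
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        cleaned = cleaned.replace(emoticon, " ")
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return ("pos" if has_pos else "neg", cleaned)
```

Labels obtained this way are noisy, so it is probably worth checking a sample by hand before training on them.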

Hope that gets you started. There's more in the literature, if you're interested in specific subtasks like negation, sentiment scope, etc.

To focus on companies, you might pair a sentiment method with topic detection or, more cheaply, just collect a lot of mentions of a given company. Or you could get your data annotated by Mechanical Turk workers.
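
The cheap "lots of mentions" option could look something like the sketch below, which keeps only documents that mention a company by one of its aliases. The function names and the alias list are hypothetical; a real pipeline would likely need a better alias list or an entity recognizer.

```python
import re


def build_mention_pattern(aliases):
    """Compile a case-insensitive pattern matching any alias as a whole word."""
    # Longest aliases first, so "Acme Corp" is preferred over "Acme".
    escaped = sorted((re.escape(a) for a in aliases), key=len, reverse=True)
    return re.compile(r"(?<!\w)(?:" + "|".join(escaped) + r")(?!\w)", re.IGNORECASE)


def filter_mentions(documents, aliases):
    """Return the documents that mention at least one of the aliases."""
    pattern = build_mention_pattern(aliases)
    return [doc for doc in documents if pattern.search(doc)]
```

For example, `filter_mentions(docs, ["Acme Corp", "Acme"])` keeps texts about the made-up company Acme and skips words that merely start with the same letters.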

This is a list I wrote a few weeks ago, from my blog. Some of these datasets have been recently included in the NLTK Python platform.

Lexicons

Opinion Lexicon by Bing Liu
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
- PAPERS: Mining and summarizing customer reviews
- NOTES: Included in the NLTK Python platform
MPQA Subjectivity Lexicon
- URL: http://mpqa.cs.pitt.edu/#subj_lexicon
- PAPERS: Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis (Theresa Wilson, Janyce Wiebe, and Paul Hoffmann, 2005).
SentiWordNet
- URL: http://sentiwordnet.isti.cnr.it
- NOTES: Included in the NLTK Python platform
Harvard General Inquirer
- URL: http://www.wjh.harvard.edu/~inquirer
- PAPERS: The General Inquirer: A Computer Approach to Content Analysis (Stone, Philip J; Dexter C. Dunphy; Marshall S. Smith; and Daniel M. Ogilvie. 1966)
Linguistic Inquiry and Word Count (LIWC)
- URL: http://www.liwc.net
Vader Lexicon
- URLs: https://github.com/cjhutto/vaderSentiment, http://comp.social.gatech.edu/papers
- PAPERS: Vader: A parsimonious rule-based model for sentiment analysis of social media text (Hutto, Gilbert. 2014)
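
To show how a word-list lexicon like the ones above is typically used, here is a minimal sketch that counts positive and negative hits in a tokenized text. The two small word sets are toy stand-ins invented for this example; with NLTK installed and its `opinion_lexicon` corpus downloaded, they could presumably be built from `nltk.corpus.opinion_lexicon.positive()` and `.negative()` instead.

```python
# Toy stand-ins for a real lexicon (assumption, for illustration only).
POSITIVE = {"strong", "growth", "profit", "beat"}
NEGATIVE = {"weak", "loss", "lawsuit", "missed"}


def lexicon_score(tokens, positive, negative):
    """Return the number of positive hits minus the number of negative hits."""
    pos_hits = sum(1 for t in tokens if t.lower() in positive)
    neg_hits = sum(1 for t in tokens if t.lower() in negative)
    return pos_hits - neg_hits


def classify(tokens, positive, negative):
    """Map the lexicon score to a coarse label."""
    score = lexicon_score(tokens, positive, negative)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

This ignores negation and intensity, which is where rule-based lexicons such as VADER go further.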

Datasets

MPQA Datasets
- URL: http://mpqa.cs.pitt.edu
- NOTES: GNU Public License.
  - Political Debate data
  - Product Debate data
  - Subjectivity Sense Annotations
Sentiment140 (Tweets)
- URL: http://help.sentiment140.com/for-students
- PAPERS: Twitter Sentiment Classification using Distant Supervision (Go, Alec, Richa Bhayani, and Lei Huang)
- URLs: http://help.sentiment140.com, https://groups.google.com/forum/#!forum/sentiment140
STS-Gold (Tweets)
- URL: http://www.tweenator.com/index.php?page_id=13
- PAPERS: Evaluation datasets for twitter sentiment analysis (Saif, Fernandez, He, Alani)
- NOTES: Like Sentiment140, but the dataset is smaller and labeled by human annotators. It comes with 3 files: tweets, entities (with their sentiment) and an aggregate set.
Customer Review Dataset (Product reviews)
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
- PAPERS: Mining and summarizing customer reviews
- NOTES: Title of review, product feature, positive/negative label with opinion strength, other info (comparisons, pronoun resolution, etc.)
  
  Included in the NLTK Python platform
Pros and Cons Dataset (Pros and cons sentences)
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
- PAPERS: Mining Opinions in Comparative Sentences (Ganapathibhotla, Liu 2008)
- NOTES: A list of sentences tagged <pros> or <cons>
  
  Included in the NLTK Python platform
Comparative Sentences (Reviews)
- URL: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#datasets
- PAPERS: Identifying Comparative Sentences in Text Documents (Nitin Jindal and Bing Liu), Mining Opinion Features in Customer Reviews (Minqing Hu and Bing Liu)
- NOTES: Sentence, POS-tagged sentence, entities, comparison type (non-equal, equative, superlative, non-gradable)
  
  Included in the NLTK Python platform
Sanders Analytics Twitter Sentiment Corpus (Tweets)
- URL: http://www.sananalytics.com/lab/twitter-sentiment
- NOTES: 5513 hand-classified tweets on 4 different topics. Because of Twitter’s ToS, a small Python script is included to download all of the tweets. The sentiment classifications themselves are provided free of charge and without restrictions. They may be used for commercial products. They may be redistributed. They may be modified.
Spanish tweets (Tweets)
- URL: http://www.daedalus.es/TASS2013/corpus.php
SemEval 2014 (Tweets)
- URL: http://alt.qcri.org/semeval2014/task9
- NOTES: You MUST NOT re-distribute the tweets, the annotations or the corpus obtained (from the readme file)
Various Datasets (Reviews)
- URL: https://personalwebs.coloradocollege.edu/~mwhitehead/html/opinion_mining.html
- PAPERS: Building a General Purpose Cross-Domain Sentiment Mining Model (Whitehead and Yaeger), Sentiment Mining Using Ensemble Classification Models (Whitehead and Yaeger)
Various Datasets #2 (Reviews)
- URL: http://www.text-analytics101.com/2011/07/user-review-datasets_20.html
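
For tagged-sentence files such as the Pros and Cons Dataset listed above, turning the tags into training labels might look like the sketch below. The exact markup (one sentence per line, an opening `<pros>` or `<cons>` tag, an optional closing tag) and the mapping of pros to "positive" and cons to "negative" are assumptions; check them against the actual files.

```python
import re

# Assumed line format: <pros>some text</pros> or <cons>some text</cons>.
TAGGED_LINE = re.compile(r"^\s*<(pros|cons)>(.*?)(?:</\1>)?\s*$", re.IGNORECASE)


def parse_tagged_sentences(lines):
    """Return (text, label) pairs for every line carrying a pros/cons tag."""
    examples = []
    for line in lines:
        match = TAGGED_LINE.match(line)
        if not match:
            continue  # Untagged line: ignore it.
        tag = match.group(1).lower()
        text = match.group(2).strip()
        if text:
            label = "positive" if tag == "pros" else "negative"
            examples.append((text, label))
    return examples
```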

References:

Keenformatics - Sentiment Analysis lexicons and datasets (my blog)
Personal experience

Here are a few more:

http://inclass.kaggle.com/c/si650winter11

http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html

If you have some resources (media channels, blogs, etc.) about the domain you want to explore, you can create your own corpus. I do this in Python:

Use Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) to parse the content that I want to classify.
Separate out the sentences that express positive/negative opinions about companies.
Use NLTK to process these sentences: tokenize words, POS tagging, etc.
Use NLTK's PMI measure to find the bigrams or trigrams that are most frequent in only one class.
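
The last step can be sketched in plain Python as follows. This is not the NLTK implementation (NLTK offers `BigramCollocationFinder` and `BigramAssocMeasures.pmi` for the same purpose); it is a standard-library stand-in that assumes the sentences are already tokenized and split by class. The function names and the `min_count` cutoff are choices made for this example.

```python
import math
from collections import Counter


def bigram_pmi(sentences, min_count=2):
    """Return {bigram: PMI} for bigrams seen at least min_count times."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # Rare bigrams give unreliable PMI estimates.
        p_bigram = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_bigram / (p_w1 * p_w2))
    return scores


def all_bigrams(sentences):
    """Return the set of every bigram occurring in the sentences."""
    return {bg for tokens in sentences for bg in zip(tokens, tokens[1:])}


def class_exclusive_bigrams(pos_sentences, neg_sentences, min_count=2):
    """Return PMI-scored bigrams that occur in only one of the two classes."""
    pos_seen = all_bigrams(pos_sentences)
    neg_seen = all_bigrams(neg_sentences)
    pos_only = {bg: s for bg, s in bigram_pmi(pos_sentences, min_count).items()
                if bg not in neg_seen}
    neg_only = {bg: s for bg, s in bigram_pmi(neg_sentences, min_count).items()
                if bg not in pos_seen}
    return pos_only, neg_only
```

The surviving bigrams can then serve as class-specific features; trigrams would follow the same pattern with three-token windows.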

Creating a corpus is hard work (pre-processing, checking, tagging, etc.), but it lets you build a model for a specific domain, which often increases accuracy. If you can get an already prepared corpus, just go ahead with the sentiment analysis ;)

I'm not aware of any such corpus being freely available, but you could try an unsupervised method on an unlabeled dataset.

You can get a large selection of online reviews from Datafiniti. Most of the reviews come with rating data, which would provide more granularity on sentiment than positive/negative. Here's a list of businesses with reviews, and here's a list of products with reviews.
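
If you only need coarse labels, ratings can be collapsed into classes. The sketch below assumes a 1-to-5 scale, a midpoint threshold, and review records with `text` and `rating` fields; all of these are assumptions for illustration and not Datafiniti's actual schema.

```python
def rating_to_label(rating, scale_max=5):
    """Map a numeric rating to a coarse label; the midpoint counts as neutral."""
    if not 1 <= rating <= scale_max:
        raise ValueError(f"rating {rating!r} is outside 1..{scale_max}")
    midpoint = (1 + scale_max) / 2
    if rating > midpoint:
        return "positive"
    if rating < midpoint:
        return "negative"
    return "neutral"


def label_reviews(reviews):
    """Return (text, label) pairs, skipping reviews that have no rating."""
    return [
        (review["text"], rating_to_label(review["rating"]))
        for review in reviews
        if review.get("rating") is not None
    ]
```

Keeping the raw rating alongside the label preserves the extra granularity in case you later want a regression or multi-class setup.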
