SA: Sentiment Analysis Resources (Corpora and Dictionaries)

2023-06-25

The first part is excerpted mainly from a Chinese-language survey:
http://wenku.baidu.com/view/0c33af946bec0975f465e277.html

 

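4.2 Building sentiment analysis resources

4.2.1 Corpora for sentiment analysis

Besides the corpora provided by the three international and domestic evaluation campaigns covered in Section 4.1, a number of research groups and individuals have also released corpora of considerable size.

1. The movie review datasets from Cornell University (http://www.cs.cornell.edu/people/pabo/movie-review-data/): made up of movie reviews, with 1,000 positive and 1,000 negative reviews. They also include 5,331 sentences for each polarity (positive and negative) and 5,000 sentences for each subjectivity label (subjective and objective). The movie review collection is now widely used in sentiment analysis research at every granularity: word, sentence, and document level.

2. The product review corpus from Hu and Liu at the University of Illinois at Chicago (UIC): mainly online reviews of five electronic products downloaded from Amazon and Cnet (two brands of digital camera, a mobile phone, an MP3 player, and a DVD player). The corpus is annotated sentence by sentence with the opinion target and with the polarity and strength of each opinion sentence, among other information. It is therefore suited to research on opinion target extraction, sentence-level subjectivity identification, and sentiment classification. Liu has also contributed a corpus for research on comparative sentences [74].

3. The MPQA (Multiple-Perspective QA) corpus developed by Janyce Wiebe et al.: contains 535 news commentary articles written from different perspectives, and is a deeply annotated corpus. For each clause, annotators manually labeled sentiment information such as the opinion holder, the opinion target, the subjective expression, and its polarity and intensity. Reference [75] describes the full annotation process. The MPQA corpus is suited to tasks in the news commentary domain.

4. The multi-aspect restaurant review corpus built by Barzilay et al. at the Massachusetts Institute of Technology (MIT): 4,488 reviews in total, each rated from 1 to 5 on five aspects (food, ambience, service, price, and overall experience). This corpus provides a research platform for aspect-based sentiment summarization of single documents.

5. The relatively large Chinese hotel review corpus provided by Dr. Tan Songbo of the Institute of Computing Technology, Chinese Academy of Sciences: about 10,000 reviews labeled as positive or negative, which offers a basis for document-level sentiment classification in Chinese.

4.2.2 Lexicon resources for sentiment analysis

Sentiment analysis has matured enough that many sentiment resources compiled by earlier researchers are available, most of them in the form of opinion word lexicons.

1. The GI (General Inquirer) opinion lexicon (English, http://www.wjh.harvard.edu/~inquirer/). It collects 1,914 positive words and 2,293 negative words, and tags each word with labels such as polarity, strength, and part of speech, which makes it flexible to use in sentiment analysis tasks.

2. The NTU opinion lexicon (Traditional Chinese). Compiled by National Taiwan University, it contains 2,812 positive words and 8,276 negative words [76].

3. The subjectivity lexicon (English, http://www.cs.pitt.edu/mpqa/). Its subjective words come from the OpinionFinder system. It contains 8,221 subjective words, each annotated with part of speech, lemmatization information, and sentiment polarity.

4. The HowNet opinion lexicon (Simplified Chinese and English, http://www.keenage.com/html/e_index.html). It contains 9,193 Chinese opinion words and phrases and 9,142 English opinion words and phrases, divided into positive and negative classes. Because it also covers opinion phrases, it offers a richer resource for sentiment analysis.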

Plus the resources I summarized in an earlier note:
http://site.douban.com/204776/widget/notes/12599608/note/284723117/
##Datasets for SA:
###Lexicons:
[1]
The General Inquirer Lexicon
•Homepage: http://www.wjh.harvard.edu/~inquirer
•Categories
–Positive (1,915 words) and Negative (2,291 words)
–Strong vs Weak, Active vs Passive, Overstated versus Understated
–Pleasure, Pain, Virtue, Vice, Motivation, Cognitive Orientation, etc
•Free for research use
Philip J. Stone, Dexter C. Dunphy, Marshall S. Smith, Daniel M. Ogilvie. 1966. The General Inquirer: A Computer Approach to Content Analysis. MIT Press.

[2]
LIWC (Linguistic Inquiry and Word Count)
•Homepage: http://www.liwc.net/
•2,300 words, > 70 classes
–Affective Processes
•negative emotion (bad, weird, hate, problem, tough)
•positive emotion (love, nice, sweet)
–Cognitive Processes
•Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
–Pronouns, Negation (no, never), Quantifiers (few, many)
•$30 or $90 fee
Pennebaker, J.W., Booth, R.J., & Francis, M.E. (2007). Linguistic Inquiry and Word Count: LIWC 2007.

[3]
MPQA Subjectivity Cues Lexicon
•Homepage: http://www.cs.pitt.edu/mpqa/subj_lexicon.html
•6,885 words from 8,221 lemmas
–2,718 positive
–4,912 negative
•Each word annotated for intensity (strong, weak)
•GNU GPL
Theresa Wilson, Janyce Wiebe, and Paul Hoffmann (2005). Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. HLT-EMNLP-2005.
Riloff and Wiebe (2003). Learning extraction patterns for subjective expressions. EMNLP-2003
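
A minimal sketch of reading such a lexicon, assuming each entry is one line of space-separated `key=value` fields with names like `type`, `word1`, `pos1`, and `priorpolarity`. That layout and the two sample lines are assumptions made for illustration, so check them against the README that ships with the lexicon:

```python
from typing import Dict, List


def parse_lexicon_line(line: str) -> Dict[str, str]:
    """Split one 'key=value key=value ...' line into a dict."""
    entry: Dict[str, str] = {}
    for field in line.split():
        if "=" not in field:
            continue  # skip anything that is not a key=value pair
        key, value = field.split("=", 1)
        entry[key] = value
    return entry


# Hypothetical sample lines in the assumed layout (not copied from the lexicon).
SAMPLE = """\
type=strongsubj len=1 word1=hate pos1=verb stemmed1=y priorpolarity=negative
type=weaksubj len=1 word1=nice pos1=adj stemmed1=n priorpolarity=positive
"""

entries: List[Dict[str, str]] = [
    parse_lexicon_line(line) for line in SAMPLE.splitlines() if line.strip()
]

for e in entries:
    # 'type' carries the strong/weak intensity mentioned above.
    print(e["word1"], e["type"], e["priorpolarity"])
```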

[4]
Opinion Lexicon
•Homepage: http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
•6,789 words
–2,006 positive
–4,783 negative
•Bing Liu's Page on Opinion Mining
Minqing Hu and Bing Liu. Mining and Summarizing Customer Reviews. ACM SIGKDD-2004
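
Word lists like this one are often used for a simple counting baseline. The sketch below is only an illustration of that idea, not Hu and Liu's method: the tiny `POSITIVE` and `NEGATIVE` sets are placeholders standing in for the real lists, and the tokenizer is a deliberately crude regular expression.

```python
import re

# Placeholder word lists; in practice these would be loaded from the lexicon files.
POSITIVE = {"love", "nice", "sweet", "great"}
NEGATIVE = {"bad", "hate", "problem", "tough"}


def lexicon_polarity(text: str) -> str:
    """Label text by comparing counts of positive and negative lexicon hits."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"  # tie, or no lexicon word found


print(lexicon_polarity("I love this camera, the lens is great"))
print(lexicon_polarity("Bad battery and a tough menu"))
```

A baseline like this ignores negation and context, which is part of why the annotated corpora listed further down are useful.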

[5]
SentiWordNet
•Homepage: http://sentiwordnet.isti.cnr.it/
•All WordNet synsets automatically annotated for degrees of positivity, negativity, and neutrality/objectiveness
–[estimable(J,3)] “may be computed or estimated”
•Pos 0 Neg 0 Obj 1
–[estimable(J,1)] “deserving of respect or high regard”
•Pos .75 Neg 0 Obj .25
Stefano Baccianella, Andrea Esuli, and Fabrizio Sebastiani. 2010. SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining. LREC-2010
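
In the two example senses above, Pos, Neg, and Obj sum to 1, so Obj can be derived from the other two. A small sketch of working with such per-sense scores follows. The `SenseScore` class and the averaging heuristic are illustrative assumptions, not part of SentiWordNet itself; only the two score triples come from the examples above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SenseScore:
    """Scores for one word sense (hypothetical container for illustration)."""
    sense: str
    pos: float
    neg: float

    @property
    def obj(self) -> float:
        # Objectivity is whatever mass is left after Pos and Neg.
        return 1.0 - (self.pos + self.neg)

    @property
    def net(self) -> float:
        # Net polarity: positive minus negative score.
        return self.pos - self.neg


senses: List[SenseScore] = [
    SenseScore("estimable(J,3)", pos=0.0, neg=0.0),
    SenseScore("estimable(J,1)", pos=0.75, neg=0.0),
]


def average_net(scores: List[SenseScore]) -> float:
    """One simple heuristic: average net polarity over all senses of a word."""
    return sum(s.net for s in scores) / len(scores)


for s in senses:
    print(s.sense, "Obj =", s.obj, "net =", s.net)
print("word-level score:", average_net(senses))
```

Averaging over senses sidesteps word sense disambiguation; other choices, such as using only the first sense, are equally plausible.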

Sentiment Classification of Reviews Using SentiWordNet
http://arrow.dit.ie/cgi/viewcontent.cgi?article=1000&context=ittpapnin

###Corpus and Reviews:
[1]
Movie reviews
–Internet Movie Database (IMDb)
•http://www.cs.cornell.edu/people/pabo/movie-review-data/
•http://reviews.imdb.com/Reviews/
–700 positive / 700 negative

[2]
MOVIEREVIEWSET (Pang and Lee 2004)
[3]
MPQACORPUS (Wiebe et al. 2005)
[4]
PRODUCTREVIEWSET (Yi et al. 2003)

[2]-[4]
http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
http://www.cs.pitt.edu/mpqa/
http://ai.stanford.edu/~amaas/data/sentiment
http://people.csail.mit.edu/jrennie/20Newsgroups

[5]
BOOKREVIEWSET (Aue and Gamon, 2005)
[6]
SENTENCESET (Kim and Hovy 2004)

[7]
The J.D. Power and Associates Sentiment Corpus
http://verbs.colorado.edu/jdpacorpus/
The JDPA Corpus consists of user-generated content (blog posts) containing opinions about automobiles and digital cameras. They have been manually annotated for named, nominal, and pronominal mentions of entities. Entities are marked with the aggregate sentiment expressed toward them in the document. Mentions of each entity are marked as co-referential. Mentions are assigned semantic types consisting of the Automatic Content Extraction (ACE) mention types and additional domain-specific types. Meronymy (part-of and feature-of) and instance relations are also annotated. Expressions which convey sentiment toward an entity are annotated with the polarity of their prior and contextual sentiments as well as the mentions they target. The following modifiers are annotated. These may target other modifiers or sentiment expressions:

negators (expressions which invert the polarity of a sentiment expression or modifier)
neutralizers (expressions that do not commit the speaker to the truth of the target sentiment expression or modifier)
committers (expressions which shift the commitment of the speaker toward the truth of a sentiment expression or modifier)
intensifiers (expressions which shift the intensity of a sentiment expression or modifier)
Additionally, we have annotated when the opinion holder of a sentiment expression is someone other than the author of the blog by linking the expression to the holder. We also annotate when two entities are compared on a particular dimension.
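
To make the prior versus contextual distinction concrete, here is a toy sketch of how negators and intensifiers can change the polarity of the sentiment expression they target. It is not how the JDPA corpus was built (that annotation is manual); the word lists, weights, and the adjacency rule are all assumptions chosen for illustration.

```python
from typing import List, Tuple

# Hypothetical modifier lists and prior polarities.
NEGATORS = {"not", "never", "no"}
INTENSIFIERS = {"very": 2.0, "extremely": 3.0}
PRIOR = {"good": 1.0, "bad": -1.0}


def contextual_scores(tokens: List[str]) -> List[Tuple[str, float]]:
    """Return (word, contextual score) for each sentiment word in tokens.

    Modifiers only count when they directly precede the sentiment word,
    possibly stacked (for example "not very good").
    """
    results: List[Tuple[str, float]] = []
    scale = 1.0
    for tok in tokens:
        t = tok.lower()
        if t in NEGATORS:
            scale *= -1.0            # invert polarity
        elif t in INTENSIFIERS:
            scale *= INTENSIFIERS[t]  # shift intensity
        elif t in PRIOR:
            results.append((t, PRIOR[t] * scale))
            scale = 1.0
        else:
            scale = 1.0              # any other word breaks the modifier chain
    return results


print(contextual_scores("the zoom is not very good".split()))
print(contextual_scores("a very bad lens".split()))
```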

The data, organized into training and testing sets, consists of 515 documents (blog posts) covering 330,762 tokens which make up 19,322 sentences. 87,532 mentions, 15,637 sentiment expressions, and 22,662 relations between entities (co-reference groups) are annotated.

Please see the included README file for more information about this data. For a more detailed explanation of the preparation of the corpus, please read The JDPA Sentiment Corpus Annotation Guidelines or The ICWSM 2010 JDPA Sentiment Corpus for the Automotive Domain.

##Packages and APIs for SA: 
http://stackoverflow.com/questions/10233087/sentiment-analysis-using-r
https://sites.google.com/site/miningtwitter/questions/sentiment

##Apps for SA:
Twitrratr
Tweetfeel
Twitter sentiment / Sentiment140
