Arabic Sentiment Analysis

These lexicons were created by measuring the extent to which the words in a corpus of tweets co-occurred with a set of seed positive and seed negative terms. This is based on the idea that positive words co-occur more with positive seed terms and less with negative seed terms, and negative words co-occur more with negative seed terms and less with positive seed terms.

Lexicon	# positive	# negative	Seeds
Arabic Emoticon Lexicon	22,962	20,342	A set of twenty-three emoticons such as :) and :(.
Arabic Hashtag Lexicon	13,118	8,846	A set of 230 Arabic words that were manually selected for being highly positive or highly negative.
Arabic Hashtag Lexicon (dialectal)	11,941	8,179	A set of 483 dialectal Arabic words compiled by Refaee and Rieser (2014) from tweets.
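
The co-occurrence idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the exact procedure used to build the lexicons: it assumes that association is scored as the difference between a word's pointwise mutual information (PMI) with the positive seeds and with the negative seeds, that co-occurrence is counted at the tweet level, and that add-one smoothing is applied. The function name `association_scores` and the `smoothing` parameter are hypothetical.

```python
import math
from collections import Counter


def association_scores(tweets, pos_seeds, neg_seeds, smoothing=1.0):
    """Score words by how much more they co-occur with positive than negative seeds.

    tweets: iterable of token lists (already tokenized).
    Returns {word: score}; score > 0 suggests positive, score < 0 negative.
    Assumption: score = PMI(word, positive) - PMI(word, negative).
    """
    pos_seeds, neg_seeds = set(pos_seeds), set(neg_seeds)
    word_pos, word_neg = Counter(), Counter()
    n_pos = n_neg = 0

    for tokens in tweets:
        toks = set(tokens)
        has_pos = bool(toks & pos_seeds)
        has_neg = bool(toks & neg_seeds)
        if has_pos == has_neg:
            # Skip tweets with no seed, or with seeds of both polarities.
            continue
        words = toks - pos_seeds - neg_seeds
        if has_pos:
            n_pos += 1
            word_pos.update(words)
        else:
            n_neg += 1
            word_neg.update(words)

    scores = {}
    for word in set(word_pos) | set(word_neg):
        # PMI(w, pos) - PMI(w, neg) reduces to the log-ratio of the two
        # conditional relative frequencies; smoothing avoids log(0).
        p = (word_pos[word] + smoothing) / (n_pos + smoothing)
        q = (word_neg[word] + smoothing) / (n_neg + smoothing)
        scores[word] = math.log2(p / q)
    return scores


toy_tweets = [
    ["happy", ":)"],
    ["great", "happy", ":)"],
    ["sad", ":("],
    ["awful", "sad", ":("],
]
toy_scores = association_scores(toy_tweets, pos_seeds={":)"}, neg_seeds={":("})
```

In this toy example, `toy_scores["happy"]` is positive and `toy_scores["sad"]` is negative, which is how entries would be split into the "# positive" and "# negative" columns under these assumptions.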

Translated Lexicon	# positive	# negative	# neutral	Original English Lexicon
Arabic translation of Bing Liu’s Lexicon	2,006	4,783	-	Link to Bing Liu’s Lexicon
Arabic translation of MPQA Subjectivity Lexicon	2,718	4,911	570	Link to MPQA Subjectivity Lexicon
Arabic translation of NRC Emotion Lexicon	2,317	3,338	8,527	Link to NRC Emotion Lexicon
Arabic translation of NRC Emoticon Lexicon	38,312	24,156	-	Link to NRC Emoticon Lexicon
Arabic translation of NRC Hashtag Sentiment Lexicon	32,048	22,081	-	Link to NRC Hashtag Sentiment Lexicon

A random subset of 1200 Levantine dialectal sentences chosen from the BBN Arabic-Dialect–English Parallel Text. The sentences are extracted from social media posts and are provided with their translations. Apart from manual translations, the dataset also includes automatic translations into English using the Portage MT system. We manually annotated this subset and its translations (both manual and automatic) for sentiment (positive, negative, or neutral).

This is the first such resource where text in one language and its translations into another language (both manually and automatically produced) are each manually labeled for sentiment.

A dataset of 2000 tweets originating from Syria (a country where Levantine dialectal Arabic is commonly spoken). These tweets were collected in May 2014 by polling the Twitter API. The dataset also includes automatic translations into English using the Portage MT system. (This dataset is not provided with manual English translations.) We manually annotated the tweets and their automatic translations for sentiment (positive, negative, or neutral).

Further details about these resources (lexicons and corpora) are available in these publications:

Sentiment Lexicons for Arabic Social Media. Saif M. Mohammad, Mohammad Salameh, and Svetlana Kiritchenko. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, May 2016, Portorož (Slovenia).
Paper (pdf)    BibTeX    Presentation    Video

How Translation Alters Sentiment. Saif M. Mohammad, Mohammad Salameh, and Svetlana Kiritchenko, Journal of Artificial Intelligence Research, January 2016, Volume 55, pages 95-130.
Paper (pdf) Pre-print version (pdf)    BibTeX

Sentiment After Translation: A Case-Study on Arabic Social Media Posts. Mohammad Salameh, Saif M Mohammad and Svetlana Kiritchenko, In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-2015), June 2015, Denver, Colorado.
Paper (pdf)    BibTeX

This study has been approved by the NRC Research Ethics Board (NRC-REB) under protocol number 2014-26. REB review seeks to ensure that research projects involving humans as participants meet Canadian standards of ethics.

FAQ:

Q. When we look at the tokens on a browser, there are some tokens like Ø§Ù…Ø§Ù†ØªÙ† which are not meaningful. How can we resolve this?

A. This is an encoding issue that occurs when using a browser. If you are opening the lexicons in a web browser, for example Chrome, just click on "Customize Chrome button" -> More Tools -> Encoding (and choose UTF-8). If you download the lexicons and want to open them with other software (like Notepad++), please make sure to set the encoding correctly. Also make sure to set the encoding to UTF-8 if you open the files using any programming language.
Please let us know if you have any other issues in using the lexicons on this page.
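
As an illustration of the advice above, here is a minimal Python sketch of opening a lexicon file with an explicit UTF-8 encoding, together with a small demonstration of how such garbled tokens can arise. The function names and the file path argument are hypothetical, and the sketch assumes the garbling comes from UTF-8 bytes being decoded as Windows-1252, which is one common cause but not necessarily the only one.

```python
def read_lexicon_lines(path):
    """Read a downloaded lexicon file, decoding it explicitly as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def garble(text):
    """Reproduce the problem: UTF-8 bytes wrongly decoded as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled):
    """Undo garble() by reversing the wrong decoding step.

    Assumption: the text was garbled exactly as in garble(). Some byte values
    have no Windows-1252 character, so this does not work for every token;
    re-reading the original file as UTF-8 is the reliable fix.
    """
    return garbled.encode("cp1252").decode("utf-8")


word = "امانتن"
broken = garble(word)    # unreadable Latin-looking characters
fixed = repair(broken)   # the original Arabic token again
```

Reading the file with `encoding="utf-8"` from the start avoids the problem entirely; `repair` is only useful for text that has already been mis-decoded.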

Terms of use:

The datasets mentioned on this page are available for direct download and can be used freely for research purposes.
The papers listed at the bottom of this page provide details of their creation and use. If you use a dataset, then please cite the associated papers.
If you use a dataset in a product or application, then please credit the authors and NRC appropriately. Also, if you send us an email, we will be thrilled to hear how you have used the dataset.
Rather than redistributing the data, please direct interested parties to this page.
National Research Council Canada (NRC) disclaims any responsibility for the use of the datasets listed here and does not provide technical support. However, the contact listed above will be happy to respond to queries and clarifications.

Updated: March, 2016

SemEval-2016 Task 7: Determining Sentiment Intensity of English and Arabic Phrases

Arabic Sentiment (Valence) Lexicons (see this Video)

Arabic Corpora Annotated for Sentiment (Valence)
