Building Large Arabic Multi-Domain Resources for Sentiment Analysis

Building Large Arabic
Multi-domain Resources
for Sentiment Analysis
Hady ElSahar and Samhaa R. El-Beltagy
Center for Informatics Science, Nile University
CICLing 2015 – April 19, 2014
hadyelsahar@gmail.com

Agenda
• Problem Statement
• Building Multi-Domain Datasets for Sentiment Analysis
• Building Multi-Domain lexicons
• Experiments and Evaluation
• Mining experiments results

Problem Statement
• Small size
• Domain Specificity
• Not publicly available
• Insufficient coverage of different Arabic dialects and non standard
terms
Current resources for sentiment analysis suffer many deficiencies:

Problem Statement
Author Dataset name Size Multi Domain Publicly Available
Rushdi-Saleh et al. OCA 500 NO YES
Abdul-Mageed & Diab AWATIF < 10K Yes NO
Aly, M. & Atiya, A. LABR 63K NO YES
Eshrag Refaee et al. Twitter Corpus 8,868 N/A YES
Sentiment Datasets related work

Problem Statement
Author Size MSA / Dialect Multi Domain Publicly Available
El-Beltagy et al. 4K MSA + Dialect N/A YES
Abdul-Mageed & Diab (SANA) 225K MSA + Dialect Yes NO
Badaro et al. 150K MSA only N/A YES
Sentiment lexicons related work

Proposed solution
• Building large Arabic datasets and lexicons for sentiment analysis
• Large size
• Multi-domain
• Arabic dialects
• Well documented, tested for sentiment classification
• Publicly available for every one to use

Building Datasets
Building datasets from reviewing content on the internet

Building Datasets
• Lack of Arabic reviewing content on the internet:
• Less Arabic based e-commerce & reviewing websites
• Arabic speakers use the English language to write their reviews
English *** , Do you Speak it !!!!

Domain Reviewing Websites Scrapped
Hotel reviews
Restaurant reviews
Product Reviews
Movie Reviews
Building Datasets
Scrapping Arabic Reviewing content on the Internet

Building Datasets
• Normalize different ratings systems into ( positive, negative and neutral )
classes using heuristics.
• Automatic labeling of reviews.

Building Datasets
• Removing redundant and spamming reviews
• Removing contradicting reviews ( Similar Text Different polarity )
• Remove duplicate reviews

Datasets Statistics
Hotels Restaurants Movies Products ALL
#Reviews 15579 11310 1524 14279 42692
#Unique Reviews 15562 10940 1522 5092 33116
#Users 13407 1639 416 7465 24653
#Items 8100 4654 933 5906 19593
Sizes of Extracted Datasets

Datasets Statistics
Number of reviews for each class

Datasets Statistics
Number of tokens per review for each of the datasets

Building multi domain lexicons
• Manually hand crafting sentiment lexicons is a tedious task
• Proposed approach  utilizes feature selection and ranking of Support Vector
Machines (SVM)
• SVM with L1 regularization penalty results in sparse coefficient vectors
doesn't
deserve
bad failure happen scene better wonderful enjoyable
‫يستحق‬ ‫ال‬ ‫سئ‬ ‫فشل‬ ……. ‫حصل‬ ‫مشهد‬ …….. ‫افضل‬ ‫رائع‬ ‫ممتع‬
-0.532 -0.52 -0.4 ……. 0 0 …….. 0.270 0.272 0.357
Coefficient vector of a trained support vector machine

Datasets
Training
L1-norm SVM
Selecting Top
Features
Manually
Verification
Multi domain
Lexicons
Building multi domain lexicons
• Train SVM classifier on each of the generated datasets using a unigram + bigram
model
• Omit features corresponding to zero coefficients
• Label features with positive coefficient values as positive lexicon entries
• Label features with negative coefficient values as Negative lexicon entries
• Manually filter and verify resulting lexicon ( a lot easier ! )

Building multi domain lexicon from the
datasets
Hotels Restaurants Movies Products LABR / Books ALL
# non-zero coef.
features
556 1413 526 661 3552 6708
# Manually filtered 218 734 87 369 874 1913
Size of built multi-domain lexicons before and after manual filtration

Building multi domain lexicon from the
datasets
Selected examples from the Generated lexicons:
Hotels Restaurants Movies
‫أعود‬ ‫لن‬
not coming back
‫بارد‬
cold
‫المشاهدة‬ ‫يستحق‬
worth watching
‫المياه‬‫ضعيفة‬
low water pressure
‫يشبع‬
Enough portions
‫برافو‬
Bravo

Experiments and bench marking Datasets
• Verify the viability of using the datasets for sentiment analysis
• Test the effectiveness of the generated lexicon
• Export the results of all experiments publicly for further analysis
• Provide easy benchmarking framework for future sentiment classifiers
Experiments Benchmarking the datasets for the task of sentiment analysis :

Datasets setups :
• 2 Class sentiment Classification (Positive or Negative)
• 3 Class Sentiment Classification problem (Positive, Negative or Mixed/Neutral )
• Balanced / Unbalanced Setups
• 20%-80% Splits (testing generated lexicons on unseen data)
• Cross validation

Feature building Methods :
• Standard feature building methods :
• Count, TF-IDF, Delta-TFIDF
• Features built from generated lexicons :
• (term existence, term count, weighted count )
• Domain specific lexicon, domain general lexicon
• Merging Lexicon based features with other features
Classifiers : Linear SVM, Logistic regression, BNB , KNN and SGD

• 3075 experiments, resulted from using all classifiers, features and
Datasets setups combinations together.
• Results are publicly available for further analysis and as benchmarks

Mining experiments results
Mining the experiments results to answer questions like :
• What are the top performing classifiers and features combinations ?
• Can we rely only on lexicons for sentiment analysis ?
• What is the effect of combining lexicon based features with other
features ?
• Are shorter documents easier to classify ?
• Are documents richer with subjective words easier to classify ?

Can we rely only on lexicon based
features for sentiment
classification?
Can features generated from lexicons provide an adequate accuracy relative to
other feature generating methods.

Features Number of features Average Accuracy
2Class
Lex-domain ~ 500 0.768
Lex-all 1913 0.782
Count ~ 50K features 0.783

Features Number of features Average Accuracy
3Class
Lex-domain ~ 500 0.549
Lex-all 1913 0.554
Count ~ 50K features 0.570

Effect of merging lexicon based
features with other features?
Can features generated from lexicons provide an adequate accuracy relative to
other feature generating methods.

Features Aggregated Lexicon Average Accuracy Enhancement
2Class
Count
None 0.783
Lex-domain 0.790 + 1 %
Lex-all 0.796 + 1.6 %
TFIDF
None 0.7
Lex-domain 0.791 + 9.1 %
Lex-all 0.8 +10 %
Delta-TFIDF
None 0.692
Lex-domain 0.789 + 9.7 %
Lex-all 0.798 + 10.6 %

Shorter documents are easier to
classify?
Or longer ones?, How about longer ones rich with subjective terms ?

Small Space

Storyline : Patch Adams was desperate and attempt to commit a
suicide many times, until he was sent to a mental hospital….
……..
Then he started unintentionally helping others through socializing
with them until they have become better

• Document length : No. of tokens in per document (log scale)
• Subjectivity score
• Sum of polarities of words that appear in the document (using generated
lexicons)
• Error Rate
• Number of misclassified documents of this specific group (doc. Length and
subjectivity score )

The error rate for various document lengths and subjectivity score groups (the Darker the worse)

Conclusion
• Built a large multi-domain datasets for sentiment Analysis ( 33K
reviews)
• Proposed an approach for semi-automatically learning multi-domain
lexicons (~2K)
• Everything is publicly available :
• Datasets (raw + processed)
• Lexicons
• Web Scrappers (to rerun for more recent reviews)
• Experiments code and results

Questions ?
Slides : bit.ly/cicling2015_elsahar_slides
Datasets : bit.ly/cicling2015_elsahar_resources

Building Large Arabic Multi-Domain Resources for Sentiment Analysis

More Related Content

Viewers also liked (20)

Similar to Building Large Arabic Multi-Domain Resources for Sentiment Analysis (20)

Recently uploaded (20)

Building Large Arabic Multi-Domain Resources for Sentiment Analysis