Building Large Arabic Multi-domain Resources for Sentiment Analysis
Hady ElSahar and Samhaa R. El-Beltagy
Center for Informatics Science, Nile University
CICLing 2015 – April 19, 2015
hadyelsahar@gmail.com
Agenda
• Problem Statement
• Building Multi-Domain Datasets for Sentiment Analysis
• Building Multi-Domain lexicons
• Experiments and Evaluation
• Mining experiments results
Problem Statement
Problem Statement
Current resources for sentiment analysis suffer from many deficiencies:
• Small size
• Domain specificity
• Not publicly available
• Insufficient coverage of different Arabic dialects and non-standard terms
Problem Statement
Related work on sentiment datasets:
Author | Dataset name | Size | Multi-domain | Publicly available
Rushdi-Saleh et al. | OCA | 500 | No | Yes
Abdul-Mageed & Diab | AWATIF | < 10K | Yes | No
Aly, M. & Atiya, A. | LABR | 63K | No | Yes
Eshrag Refaee et al. | Twitter Corpus | 8,868 | N/A | Yes
Problem Statement
Related work on sentiment lexicons:
Author | Size | MSA / Dialect | Multi-domain | Publicly available
El-Beltagy et al. | 4K | MSA + Dialect | N/A | Yes
Abdul-Mageed & Diab (SANA) | 225K | MSA + Dialect | Yes | No
Badaro et al. | 150K | MSA only | N/A | Yes
Proposed solution
• Building large Arabic datasets and lexicons for sentiment analysis
• Large size
• Multi-domain
• Arabic dialects
• Well documented, tested for sentiment classification
• Publicly available for everyone to use
Building Datasets
Building datasets from review content on the Internet
Building Datasets
• Lack of Arabic review content on the Internet:
  • Fewer Arabic-based e-commerce and review websites
  • Arabic speakers often write their reviews in English
Building Datasets
Scraping Arabic review content from the Internet. Domains of the review websites scraped:
• Hotel reviews
• Restaurant reviews
• Product reviews
• Movie reviews
Building Datasets
• Normalize the different rating systems into positive, negative and neutral classes using heuristics.
• Automatic labeling of reviews.
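The slides do not list the heuristics themselves, so the following is only a minimal sketch of how ratings on different scales could be mapped to the three classes. The function name `normalize_rating` and both cut-off values are assumptions chosen for illustration, not the authors' thresholds.

```python
def normalize_rating(rating, scale_min, scale_max):
    """Map a rating on [scale_min, scale_max] to a sentiment class label."""
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    # Rescale to [0, 1] so that different rating systems become comparable.
    score = (rating - scale_min) / (scale_max - scale_min)
    if score >= 0.75:   # assumed cut-off, e.g. 4 or 5 stars out of 5
        return "positive"
    if score <= 0.25:   # assumed cut-off, e.g. 1 or 2 stars out of 5
        return "negative"
    return "neutral"


# Automatic labeling: each scraped review gets the class of its rating.
five_star_labels = [normalize_rating(r, 1, 5) for r in (1, 2, 3, 4, 5)]
ten_point_labels = [normalize_rating(r, 1, 10) for r in (3, 7, 8)]
```

Rescaling first keeps a single pair of cut-offs usable across sites with different rating systems.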
Building Datasets
• Remove redundant and spam reviews
• Remove contradicting reviews (similar text, different polarity)
• Remove duplicate reviews
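A minimal sketch of the last two cleaning steps. How the authors detect "similar" text is not stated; this example assumes that two reviews match when they are identical after whitespace normalization. The function name `clean_reviews` is hypothetical.

```python
from collections import defaultdict


def clean_reviews(reviews):
    """reviews: list of (text, label) pairs. Returns the cleaned list."""
    def normalize(text):
        # Assumed notion of "similar text": equal after collapsing whitespace.
        return " ".join(text.split())

    labels_by_text = defaultdict(set)
    for text, label in reviews:
        labels_by_text[normalize(text)].add(label)

    cleaned, seen = [], set()
    for text, label in reviews:
        key = normalize(text)
        if len(labels_by_text[key]) > 1:
            continue  # contradicting: same text with different polarities
        if key in seen:
            continue  # duplicate of a review that was already kept
        seen.add(key)
        cleaned.append((key, label))
    return cleaned


sample = [
    ("good hotel", "positive"),
    ("good   hotel", "positive"),  # duplicate after normalization
    ("ok", "positive"),
    ("ok", "negative"),            # contradicts the previous review
    ("bad", "negative"),
]
cleaned_sample = clean_reviews(sample)
```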
Datasets Statistics
Sizes of the extracted datasets:
Statistic | Hotels | Restaurants | Movies | Products | ALL
#Reviews | 15579 | 11310 | 1524 | 14279 | 42692
#Unique Reviews | 15562 | 10940 | 1522 | 5092 | 33116
#Users | 13407 | 1639 | 416 | 7465 | 24653
#Items | 8100 | 4654 | 933 | 5906 | 19593
Datasets Statistics
Number of reviews for each class
Datasets Statistics
Number of tokens per review for each of the datasets
Building multi domain lexicons
• Hand-crafting sentiment lexicons manually is a tedious task
• The proposed approach uses the feature selection and ranking ability of Support Vector Machines (SVM)
• An SVM with an L1 regularization penalty yields sparse coefficient vectors
Coefficient vector of a trained support vector machine:
Feature | English gloss | Coefficient
لا يستحق | doesn't deserve | -0.532
سئ | bad | -0.52
فشل | failure | -0.4
… | … | …
حصل | happen | 0
مشهد | scene | 0
… | … | …
افضل | better | 0.270
رائع | wonderful | 0.272
ممتع | enjoyable | 0.357
Building multi domain lexicons
Pipeline: Datasets → Training L1-norm SVM → Selecting top features → Manual verification → Multi-domain lexicons
• Train an SVM classifier on each of the generated datasets using a unigram + bigram model
• Omit features with zero coefficients
• Label features with positive coefficients as positive lexicon entries
• Label features with negative coefficients as negative lexicon entries
• Manually filter and verify the resulting lexicon (a lot easier than building one by hand!)
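The selection steps above can be sketched as follows. To keep the example dependency-free, it starts from an already trained coefficient vector and uses the example values from the earlier slide. In practice the coefficients would come from an L1-regularized linear SVM, for instance the `coef_` attribute of scikit-learn's `LinearSVC(penalty="l1", dual=False)`; that library choice is an assumption, not something the slides state. The function name `build_lexicon` is hypothetical.

```python
def build_lexicon(coefficients):
    """Split a {feature: coefficient} mapping into positive and negative entries.

    Features with a zero coefficient are omitted from both lists.
    """
    positive = [(f, w) for f, w in coefficients.items() if w > 0]
    negative = [(f, w) for f, w in coefficients.items() if w < 0]
    # Strongest entries first, so manual verification starts with the clearest cases.
    positive.sort(key=lambda item: item[1], reverse=True)
    negative.sort(key=lambda item: item[1])
    return positive, negative


# Example coefficient values taken from the slide above.
slide_coefficients = {
    "لا يستحق": -0.532,  # doesn't deserve
    "سئ": -0.52,         # bad
    "فشل": -0.4,         # failure
    "حصل": 0.0,          # happen
    "مشهد": 0.0,         # scene
    "افضل": 0.270,       # better
    "رائع": 0.272,       # wonderful
    "ممتع": 0.357,       # enjoyable
}
positive_entries, negative_entries = build_lexicon(slide_coefficients)
```

The two sorted lists are the candidates that a human annotator would then filter in the manual verification step.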
Building multi domain lexicon from the datasets
Size of the built multi-domain lexicons before and after manual filtration:
Lexicon size | Hotels | Restaurants | Movies | Products | LABR / Books | ALL
# Features with non-zero coefficients | 556 | 1413 | 526 | 661 | 3552 | 6708
# Manually filtered | 218 | 734 | 87 | 369 | 874 | 1913
Building multi domain lexicon from the datasets
Selected examples from the generated lexicons:
Hotels | Restaurants | Movies
لن أعود (not coming back) | بارد (cold) | يستحق المشاهدة (worth watching)
المياه ضعيفة (low water pressure) | يشبع (enough portions) | برافو (bravo)
Experiments and Benchmarking Datasets
The experiments benchmark the datasets for the task of sentiment analysis:
• Verify the viability of using the datasets for sentiment analysis
• Test the effectiveness of the generated lexicons
• Publish the results of all experiments for further analysis
• Provide an easy benchmarking framework for future sentiment classifiers
Experiments and Benchmarking Datasets
Dataset setups:
• 2-class sentiment classification (positive or negative)
• 3-class sentiment classification (positive, negative or mixed/neutral)
• Balanced / unbalanced setups
• 20%-80% splits (testing the generated lexicons on unseen data)
• Cross-validation
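A minimal sketch of the balanced setup and the 20%-80% split. The slides do not say how the balanced setups were produced or which side of the split is used for training. This example assumes undersampling to the smallest class and an 80% training / 20% test split. The function names are hypothetical.

```python
import random
from collections import defaultdict


def balance(reviews, seed=0):
    """Undersample every class to the size of the smallest one (assumed method)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in reviews:
        by_label[label].append((text, label))
    smallest = min(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)
    return balanced


def split_80_20(reviews, seed=0):
    """Shuffle and split into 80% training and 20% held-out test data."""
    rng = random.Random(seed)
    shuffled = list(reviews)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 4 // 5
    return shuffled[:cut], shuffled[cut:]


toy = [(f"review {i}", "positive") for i in range(8)]
toy += [(f"review {i}", "negative") for i in range(8, 10)]
train, test = split_80_20(toy)
balanced_toy = balance(toy)
```

Fixing the seed keeps the split reproducible, which matters when results are published as benchmarks.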
Experiments and Benchmarking Datasets
Feature building methods:
• Standard feature building methods:
  • Count, TF-IDF, Delta-TFIDF
• Features built from the generated lexicons:
  • Term existence, term count, weighted count
  • Domain-specific lexicon, domain-general lexicon
• Merging lexicon-based features with other features
Classifiers: Linear SVM, Logistic Regression, Bernoulli Naive Bayes (BNB), k-nearest neighbors (KNN) and stochastic gradient descent (SGD)
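A minimal sketch of the three lexicon-based feature types (term existence, term count, weighted count) for one document. The exact feature layout is not given on the slides. Splitting the features by polarity, matching bigrams as adjacent token pairs, and the lexicon entries and weights below are all assumptions for illustration.

```python
def lexicon_features(tokens, lexicon):
    """tokens: list of words; lexicon: dict mapping a term to a signed weight."""
    # Candidate terms are unigrams plus bigrams of adjacent tokens.
    terms = list(tokens) + [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    weights = [lexicon[t] for t in terms if t in lexicon]
    pos = [w for w in weights if w > 0]
    neg = [w for w in weights if w < 0]
    return {
        "pos_exists": int(bool(pos)),   # term existence
        "neg_exists": int(bool(neg)),
        "pos_count": len(pos),          # term count
        "neg_count": len(neg),
        "pos_weighted": sum(pos),       # weighted count
        "neg_weighted": sum(neg),
    }


# Hypothetical English stand-ins for lexicon entries.
toy_lexicon = {"wonderful": 0.272, "bad": -0.52, "not worth": -0.4}
features = lexicon_features(
    "wonderful view but bad food not worth it".split(), toy_lexicon
)
```

A vector of this kind has only a handful of dimensions, so it can be used alone or appended to Count or TF-IDF features.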
Experiments and Benchmarking Datasets
• 3,075 experiments, resulting from combining all classifiers, feature types and dataset setups.
• Results are publicly available for further analysis and as benchmarks
Mining experiments results
Mining the experiment results to answer questions such as:
• What are the top-performing combinations of classifiers and features?
• Can we rely only on lexicons for sentiment analysis?
• What is the effect of combining lexicon-based features with other features?
• Are shorter documents easier to classify?
• Are documents richer in subjective words easier to classify?
Can we rely only on lexicon-based features for sentiment classification?
Can features generated from lexicons provide adequate accuracy relative to other feature generation methods?
Mining experiments results
2-class setup:
Features | Number of features | Average accuracy
Lex-domain | ~ 500 | 0.768
Lex-all | 1913 | 0.782
Count | ~ 50K | 0.783
Mining experiments results
3-class setup:
Features | Number of features | Average accuracy
Lex-domain | ~ 500 | 0.549
Lex-all | 1913 | 0.554
Count | ~ 50K | 0.570
What is the effect of merging lexicon-based features with other features?
Can features generated from lexicons provide adequate accuracy relative to other feature generation methods?
Mining experiments results
2-class setup:
Features | Aggregated lexicon | Average accuracy | Enhancement
Count | None | 0.783 |
Count | Lex-domain | 0.790 | + 1 %
Count | Lex-all | 0.796 | + 1.6 %
TFIDF | None | 0.7 |
TFIDF | Lex-domain | 0.791 | + 9.1 %
TFIDF | Lex-all | 0.8 | + 10 %
Delta-TFIDF | None | 0.692 |
Delta-TFIDF | Lex-domain | 0.789 | + 9.7 %
Delta-TFIDF | Lex-all | 0.798 | + 10.6 %
Are shorter documents easier to classify?
Or longer ones? How about longer ones rich in subjective terms?
Mining experiments results
Small Space
Mining experiments results
Storyline: Patch Adams was desperate and attempted suicide many times, until he was sent to a mental hospital….
……..
Then he started unintentionally helping others by socializing with them until they got better
Mining experiments results
• Document length: number of tokens per document (log scale)
• Subjectivity score: sum of the polarities of the words that appear in the document (using the generated lexicons)
• Error rate: number of misclassified documents in each group (grouped by document length and subjectivity score)
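A minimal sketch of how these three quantities could be computed per group. Several details are assumptions: polarities are taken as +1 / -1 and summed with their sign as the slide states, document length is binned by powers of ten, and the error rate is normalized by group size so that groups of different sizes stay comparable. The function name `error_rate_by_group` is hypothetical.

```python
import math
from collections import defaultdict


def error_rate_by_group(docs, polarity):
    """docs: list of (tokens, true_label, predicted_label) triples.

    Returns {(length_bin, subjectivity_score): misclassified / total}.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for tokens, truth, predicted in docs:
        # Log-scale length bin: 0 for 1-9 tokens, 1 for 10-99 tokens, and so on.
        length_bin = math.floor(math.log10(max(len(tokens), 1)))
        # Subjectivity score: sum of the lexicon polarities of the document's words.
        subjectivity = sum(polarity.get(t, 0) for t in tokens)
        group = (length_bin, subjectivity)
        totals[group] += 1
        if truth != predicted:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


toy_polarity = {"good": 1, "bad": -1}
toy_docs = [
    (["good", "good"], "positive", "positive"),
    (["good", "good"], "positive", "negative"),   # misclassified
    (["bad"] * 12, "negative", "negative"),
]
rates = error_rate_by_group(toy_docs, toy_polarity)
```

The resulting mapping is the kind of data behind a heat map indexed by length and subjectivity.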
Mining experiments results
The error rate for various groups of document length and subjectivity score (the darker, the worse)
Conclusion
• Built large multi-domain datasets for sentiment analysis (33K reviews)
• Proposed an approach for semi-automatically learning multi-domain lexicons (~2K entries)
• Everything is publicly available:
  • Datasets (raw + processed)
  • Lexicons
  • Web scrapers (to rerun for more recent reviews)
  • Experiment code and results
Questions ?
Slides : bit.ly/cicling2015_elsahar_slides
Datasets : bit.ly/cicling2015_elsahar_resources
