Sentiment analysis: Incremental learning to build domain-models

Raimon Bosch (@raimonbosch)
TALN, DTIC, UPF

What is sentiment analysis?
[Liu, 2010] proposes a quintuple (oj, fjk, ooijkl, hi, tj) that turns
unstructured text into structured data.
oj: Object
fjk: Object features (Aspect)
ooijkl: Opinion orientations (positive/negative),
(calm/anger/joy/happiness), intensity, ...
hi: Opinion holder
tj: Time frame

What is sentiment analysis?
(oj, fjk, ooijkl, hi, tj) examples:
("easyjet", "baggage", "too expensive" => -5, "John", "01-07-2013")
("rentaz", "house rent", "horrible people" => -10, "John", "02-07-2013")
...
("jazztel", "internet", "no problems" => +4, "John", "03-07-2013")
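
The quintuple can be held in a small record type. The following is a minimal sketch in Python; the class name `Opinion`, its field names, and the helper `negative_aspects` are hypothetical and only illustrate the structure, using the three examples from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """One (oj, fjk, ooijkl, hi, tj) quintuple (field names are illustrative)."""
    obj: str          # oj: object being discussed
    aspect: str       # fjk: feature (aspect) of the object
    expression: str   # text span that carries the opinion
    orientation: int  # ooijkl: signed score, negative means a negative opinion
    holder: str       # hi: opinion holder
    date: str         # tj: time frame, kept as the string shown in the examples


opinions = [
    Opinion("easyjet", "baggage", "too expensive", -5, "John", "01-07-2013"),
    Opinion("rentaz", "house rent", "horrible people", -10, "John", "02-07-2013"),
    Opinion("jazztel", "internet", "no problems", 4, "John", "03-07-2013"),
]


def negative_aspects(items):
    """Return (object, aspect) pairs whose orientation is negative."""
    return [(o.obj, o.aspect) for o in items if o.orientation < 0]
```

Once text is in this form, questions such as "which aspects receive complaints?" become simple filters over structured records.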

State-of-the-art
- Twitter as a corpus [Pak and Paroubek, 2010]: Sentiment analysis as a
text-classification problem. Features for machine-learning
techniques:
- Emoticons :)
- N-grams
- Negations
- POS tagging
- Syntax
- Twitter specific features.
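
Some of the listed features (emoticons, N-grams, negations) can be extracted with very little code. This is a rough sketch, not the feature set of the cited paper: the emoticon table, the negation list, and the `NOT_` prefix convention are all assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny lookup tables.
EMOTICONS = {":)": "EMO_POS", ":-)": "EMO_POS", ":(": "EMO_NEG", ":-(": "EMO_NEG"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Split into emoticons, hashtags, mentions and plain words."""
    return re.findall(r":-?[()]|[#@]?\w+", text.lower())


def extract_features(text, n=2):
    """Return emoticon markers plus all 1..n-grams, with negation folded in."""
    words = []
    features = set()
    negate = False
    for tok in tokenize(text):
        if tok in EMOTICONS:
            features.add(EMOTICONS[tok])
            continue
        if tok in NEGATIONS:
            negate = True  # mark the next word instead of keeping the negation
            continue
        words.append("NOT_" + tok if negate else tok)
        negate = False
    for size in range(1, n + 1):
        for i in range(len(words) - size + 1):
            features.add(" ".join(words[i:i + size]))
    return features
```

The resulting feature set can be fed to any standard text classifier.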

State-of-the-art
- Pointwise Mutual Information [Su and Xiang, 2006]: The
probability that a word in a phrase is positive or negative can be
estimated from its co-occurrences on the WWW.
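
For reference, PMI can be computed directly from co-occurrence counts. The sketch below uses the standard definition of PMI; the orientation score (PMI with a positive seed minus PMI with a negative seed) is a commonly used formulation and is an assumption here, not necessarily the exact method of the cited paper. The counts are invented, and no smoothing is applied, so zero counts would raise an error.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from raw counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


def orientation(counts, total):
    """PMI with a positive seed word minus PMI with a negative seed word."""
    pos = pmi(counts["with_pos"], counts["word"], counts["pos_seed"], total)
    neg = pmi(counts["with_neg"], counts["word"], counts["neg_seed"], total)
    return pos - neg


# Invented counts: the word co-occurs 4x more often with the positive seed.
example = {"word": 100, "pos_seed": 100, "neg_seed": 100,
           "with_pos": 40, "with_neg": 10}
score = orientation(example, total=1000)
```

A score above zero suggests a positive orientation, a score below zero a negative one.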

State-of-the-art
- Sentiment dictionaries: SentiWordNet [Baccianella and Esuli,
2010]. A positive score and a negative score for each meaning
(#N), calculated with a random-walk algorithm.

State-of-the-art
- Cross-domain models [Pan, 2010]: Bipartite graph.

State-of-the-art
- Twitter prediction [O’Connor, 2010]: Correlation between
tweets and polls. Real-time information.

Not developed in state-of-the-art
- Structured N-grams. Most of the work is done with N-grams.
- Buzz detection.
- Aspect identification is not a main focus.

Technology stack
- Simplicity. Ruby.
- Integration with Java (JRuby, Hadoop Streaming).
- Big Data ready. Hadoop.

Hypotheses
H1: We can create groups of N-grams that specifically influence
one aspect with a negative or a positive orientation. We call
these groups sentigrams.
H2: By using incremental learning, the system improves in
each iteration. User interaction increases precision.
H3: After a certain number of iterations, we can assign
sentigrams to a tweet automatically.

Hypothesis (H1) - Sentigrams
We define a sentigram as the relation between sentiwords and
aspects that determines whether a tweet is positive or negative.
- A sentigram is an evolution of N-grams. It can be
considered a structured N-gram.
- Detect aspects and sentiwords inside a text.

Hypothesis (H1) - Sentigrams
- Mark opinion orientations. Not only whether they are positive or
negative, but also which aspect they refer to.
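
One way to picture a sentigram is as an (aspect, sentiword, orientation) record. The sketch below is only an illustration: the toy lexicons are hypothetical, and pairing each aspect with its nearest sentiword is an assumption, since the slides do not say how the relation is established.

```python
from dataclasses import dataclass

# Hypothetical toy lexicons, not taken from the slides.
ASPECTS = {"baggage", "delay", "staff"}
SENTIWORDS = {"expensive": -1, "rude": -1, "friendly": 1}


@dataclass(frozen=True)
class Sentigram:
    aspect: str
    sentiword: str
    orientation: int  # +1 positive, -1 negative


def find_sentigrams(text):
    """Pair every aspect in the text with the closest sentiword (assumed rule)."""
    tokens = text.lower().split()
    senti_positions = [i for i, t in enumerate(tokens) if t in SENTIWORDS]
    found = []
    for i, tok in enumerate(tokens):
        if tok in ASPECTS and senti_positions:
            j = min(senti_positions, key=lambda p: abs(p - i))
            found.append(Sentigram(tok, tokens[j], SENTIWORDS[tokens[j]]))
    return found


tweet = "the baggage fee is expensive but the staff was friendly"
sentigrams = find_sentigrams(tweet)
```

Unlike a plain N-gram, each record states both the orientation and the aspect it applies to.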

Hypothesis (H2) - Incremental learning
By using incremental learning, the system improves in each
iteration, increasing precision.
- The original SentiWordNet was not well adapted to our
domain.
- We add new sentiwords from annotations to our dictionary
with initial scores (pos_score: 0, neg_score: 0).
- A random walk updates word scores until accuracy converges.
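
The update loop can be sketched as random perturbation with greedy acceptance: change one score at random, keep the change if accuracy does not drop. This is an assumed reading of the procedure, not the author's implementation; the classifier rule, step size, number of steps, and the tiny data set are all hypothetical.

```python
import random


def classify(tweet, lexicon):
    """Label a tweet by the summed (pos - neg) score of its known words."""
    score = sum(lexicon[w]["pos"] - lexicon[w]["neg"]
                for w in tweet.split() if w in lexicon)
    return "pos" if score >= 0 else "neg"


def accuracy(labeled, lexicon):
    hits = sum(classify(text, lexicon) == label for text, label in labeled)
    return hits / len(labeled)


def random_walk(labeled, lexicon, steps=500, step_size=0.1, seed=0):
    """Randomly nudge scores in place; revert any nudge that lowers accuracy."""
    rng = random.Random(seed)
    best = accuracy(labeled, lexicon)
    words = sorted(lexicon)
    for _ in range(steps):
        word = rng.choice(words)
        field = rng.choice(["pos", "neg"])
        old = lexicon[word][field]
        moved = old + rng.uniform(-step_size, step_size)
        lexicon[word][field] = min(1.0, max(0.0, moved))  # clamp to [0, 1]
        new = accuracy(labeled, lexicon)
        if new >= best:
            best = new
        else:
            lexicon[word][field] = old
    return best


# New sentiwords start at (pos: 0, neg: 0), as on the slide.
lexicon = {w: {"pos": 0.0, "neg": 0.0} for w in ("joke", "great", "late")}
labeled = [("ryanair is a joke", "neg"), ("great flight", "pos"),
           ("always late", "neg")]
initial = accuracy(labeled, lexicon)
final = random_walk(labeled, lexicon)
```

Because a change is only kept when accuracy does not decrease, `final` can never be lower than `initial`.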

Hypothesis (H3) - Automatization
After a certain number of iterations, we can assign
sentigrams to a tweet automatically, without manual
intervention.
- Multiclass problem!! Each tweet has several words to guess.
Text-classification problem!!

Hypothesis (H3) - ML
- Convert a multiclass problem into a binary problem
(e.g. "ryanair is a joke"):
0,801829636,-545403680,1561023766,2119008529,1
1,801829636,-545403680,1561023766,2119008529,0
2,801829636,-545403680,1561023766,2119008529,0
3,801829636,-545403680,1561023766,2119008529,2
- Focus the problem by position (0..N): N partial observations
from each tweet.
- Numerical codes for words. Three classes available: {0,1,2}.
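
The rows above can be read as: position, one numeric code per word of the tweet, then the class of the word at that position. The following sketch builds rows of that shape. It rests on assumptions: `zlib.crc32` is a stand-in for whatever word encoding produced the codes on the slide, and the class meaning (0 = other, 1 = aspect, 2 = sentiword) is inferred from the example rather than stated in the source.

```python
import zlib

# Hypothetical lexicons for the single example tweet.
ASPECTS = {"ryanair"}
SENTIWORDS = {"joke"}


def word_code(word):
    """Stable numeric code for a word (crc32 is a stand-in encoding)."""
    return zlib.crc32(word.encode("utf-8"))


def label(word):
    """Assumed class scheme: 1 = aspect, 2 = sentiword, 0 = anything else."""
    if word in ASPECTS:
        return 1
    if word in SENTIWORDS:
        return 2
    return 0


def rows_for(tweet):
    """One row per position: [position, code_0 .. code_N-1, class]."""
    words = tweet.lower().split()
    codes = [word_code(w) for w in words]
    return [[pos, *codes, label(w)] for pos, w in enumerate(words)]


rows = rows_for("ryanair is a joke")
```

Each tweet thus yields N training instances, one per word position, all sharing the same word codes.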

Hypothesis (H3) - Dependency parsing
- Mate Tools
1 ryanair _ ryanair _ NN _ _ -1 2 _ SBJ _ _
2 is _ be _ VBZ _ _ -1 0 _ ROOT _ _
3 a _ a _ DT _ _ -1 4 _ NMOD _ _
4 joke _ joke _ NN _ _ -1 2 _ PRD _ _
- Still noisy. Work in progress.
- ML approach: Accuracy is 85% against our gold standard.
Focusing only on aspects we can get 94% accuracy.
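
The parser output above can be read into simple records to look for patterns such as subject plus predicative complement. This is a sketch under assumptions: the column positions are inferred from the sample rows (form in column 2, lemma in 4, POS in 6, head in 10, relation in 12), and the SBJ/PRD pairing rule is only an illustration, not the pattern detection used in this work.

```python
CONLL = """\
1 ryanair _ ryanair _ NN _ _ -1 2 _ SBJ _ _
2 is _ be _ VBZ _ _ -1 0 _ ROOT _ _
3 a _ a _ DT _ _ -1 4 _ NMOD _ _
4 joke _ joke _ NN _ _ -1 2 _ PRD _ _
"""


def parse(conll):
    """Turn whitespace-separated parser rows into token dictionaries."""
    tokens = []
    for line in conll.strip().splitlines():
        cols = line.split()
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "lemma": cols[3],
            "pos": cols[5],
            "head": int(cols[9]),
            "deprel": cols[11],
        })
    return tokens


def subject_predicate_pairs(tokens):
    """Pair each SBJ token with PRD tokens attached to the same head."""
    pairs = []
    for subj in tokens:
        if subj["deprel"] != "SBJ":
            continue
        for pred in tokens:
            if pred["deprel"] == "PRD" and pred["head"] == subj["head"]:
                pairs.append((subj["form"], pred["form"]))
    return pairs


pairs = subject_predicate_pairs(parse(CONLL))
```

For the example tweet, the pair links the candidate aspect ("ryanair") to the candidate sentiword ("joke").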

Conclusions
- The original SentiWordNet was not well adapted to our domain.
Accuracy 47%. Random walk necessary.
- Design of an interface to perform interactive annotations.
Semi-supervised approach.
- With words from annotations, pos scores and neg scores are
changed randomly until accuracy is optimized. Convergence
reached. Accuracy 89%.

Conclusions
- Focus on aspect identification. Not only +/-. We detect what
the user is complaining about.
- Convert a multiclass problem into a binary problem. Divide &
conquer!!
- Machine learning & dependency parsing of tweets to detect
patterns. Accuracy 85%.

What's next?
- Finish integration with dependency parsing.
- Data visualization. Comparison between several topics.
Positive aspects and negative aspects of each topic.
- Train the system for several domains: airlines, politics, TV,
telecommunications, etc.
