RIP Boris Strugatski
Science Fiction will never be the same
Implicit Sentiment Mining
     (do you tweet like Hamas?)

          Maksim Tsvetovat
           Jacqueline Kazil
        Alexander Kouznetsov
My book
Twitter predicts stock market
Sentiment Mining, old-school

• Start with a corpus of words that have sentiment
  orientation (bad/good):
     • “awesome” : +1
     • “horrible”: -1
     • “donut” : 0 (neutral)

• Compute sentiment of a text by averaging all
  words in text
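A minimal sketch of this averaging approach; the three-word lexicon below is illustrative, not a real corpus:

```python
# Old-school sentiment: average the polarity of known words over all tokens.
LEXICON = {"awesome": 1, "horrible": -1, "donut": 0}  # toy corpus

def naive_sentiment(text):
    """Average sentiment of a text; words outside the lexicon count as 0."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(LEXICON.get(w, 0) for w in words) / len(words)
```

Note how mixed texts wash out toward zero under averaging, which is exactly the weakness the next slides illustrate.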
…however…
• This doesn’t quite work (not reliably, at least).

• Human emotions are actually quite complex




• … Anyone surprised?
We do things like this:



“This restaurant would deserve highest praise if
      you were a cockroach” (a real Yelp review ;-)
We do things like this:



  “This is only a flesh wound!”
We do things like this:



“This concert was f**ing awesome!”
We do things like this:



“My car just got rear-ended! F**ing awesome!”
We do things like this:



“A rape is a gift from God” (he lost! Good ;-)
To sum up…

• Ambiguity is rampant

• Context matters

• Homonyms are everywhere

• Neutral words become charged as discourse
 changes, charged words lose their meaning
More Sentiment Analysis

• We can parse text using POS (part-of-speech)
  identification

• This helps with homonyms and some
  ambiguity
More Sentiment Analysis

• Create rules with amplifier words and inverter
  words:
   – “This concert (np) was (v) f**ing (AMP) awesome (+1)” = +2

   – “But the opening act (np) was (v) not (INV) great (+1)” = -1

   – “My car (np) got (v) rear-ended (v)! F**ing (AMP)
     awesome (+1)” = +2??
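A hedged sketch of these rules: AMP doubles the score of the next sentiment word, INV flips its sign. The word lists and weights are assumptions for illustration, not the deck's actual rule set:

```python
# Illustrative rule tables (assumptions, not the authors' lexicon).
SCORES = {"awesome": 1, "great": 1}
AMPLIFIERS = {"f**ing"}   # AMP: double the next sentiment word
INVERTERS = {"not"}       # INV: flip the next sentiment word

def rule_sentiment(tokens):
    """Score a token list with amplifier/inverter rules."""
    total, mult = 0, 1
    for tok in tokens:
        t = tok.lower()
        if t in AMPLIFIERS:
            mult *= 2
        elif t in INVERTERS:
            mult *= -1
        elif t in SCORES:
            total += mult * SCORES[t]
            mult = 1  # modifiers apply only to the next sentiment word
    return total
```

As the slide's last example shows, these rules still mis-score the rear-ended car at +2: context is beyond them.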
To do this properly…
• Valence (good vs. bad)

• Relevance (me vs. others)

• Immediacy (now/later)

• Certainty (definitely/maybe)
•   … and about 9 more, less significant dimensions


        Samsonovich, A., Ascoli, G.: Cognitive map dimensions of the human value
        system extracted from the natural language. In: Goertzel, B. (ed.), Advances in
        Artificial General Intelligence (Proc. 2006 AGIRI Workshop), IOS Press,
        pp. 111–124 (2007).
This is hard



• But worth it?
  Michelle de Haaff (2010), Sentiment Analysis, Hard But Worth It!, CustomerThink
Sentiment, Gangnam Style!
Hypothesis


• Support for a political candidate, party, brand,
  country, etc. can be detected by observing
  indirect indicators of sentiment in text
Mirroring – unconscious copying
  of words or body language




 Fay, W. H.; Coleman, R. O. (1977). "A human sound transducer/reproducer: Temporal
 capabilities of a profoundly echolalic child". Brain and Language 4 (3): 396–402.
Marker words
• All speakers have some words and
  expressions in common (e.g.
  conservative, liberal, party designation,
  etc)
• However, everyone has a set of
  trademark words and expressions that
  make them unique.
GOP Presidential Candidates
Israel vs. Hamas on Twitter
Observing Mirroring

• We detect marker words and expressions in
 social media speech and compute sentiment
 by observing and counting mirrored phrases
The research question


• Is media biased towards Israel or Hamas in
  the current conflict?

• What is the slant of various media sources?
Data harvest
• Get Twitter feeds for:
   – @IDFSpokesperson
   – @AlQuassam
   – Twitter feeds for CNN, BBC, CNBC, NPR, Al-Jazeera,
     FOX News – all filtered to only include articles on
     Israel and Gaza

• (more text == more reliable results)
Fast Computational Linguistics
Text Cleaning

• Tweet text is dirty (RT, VIA, #this and @that, ROFL, etc)

• Use a stoplist to produce a stripped-down tweet

import string

stoplist_str="""
a
a's
able
About
...
z
zero
rt
via
"""

stoplist = [w.strip() for w in stoplist_str.split('\n') if w != '']
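Applying the stoplist might look like this; a tiny illustrative stoplist stands in for the full one above:

```python
import string

stoplist = {"a", "able", "about", "rt", "via"}  # illustrative subset

def clean_tweet(text):
    """Lowercase, strip punctuation (#, @, etc.), and drop stoplist words."""
    table = str.maketrans("", "", string.punctuation)
    tokens = text.lower().translate(table).split()
    return [t for t in tokens if t not in stoplist]
```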
Language ID

• Language identification is pretty easy…

• Every language has a characteristic
  distribution of trigrams (3-letter sequences)
  – E.g., English is heavy on the “the” trigram

• Use open-source library “guess-language”
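The trigram idea can be sketched directly. A real system (such as guess-language) trains profiles on large corpora; the two tiny profiles below are only stand-ins:

```python
from collections import Counter

def trigrams(text):
    """Count 3-letter sequences, padding with spaces at the ends."""
    t = " " + text.lower() + " "
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

# Tiny illustrative profiles; real profiles come from large corpora.
PROFILES = {
    "en": trigrams("the quick brown fox jumps over the lazy dog the end"),
    "de": trigrams("der schnelle braune fuchs springt ueber den faulen hund"),
}

def guess_lang(text):
    """Pick the profile sharing the most trigram mass with the text."""
    tg = trigrams(text)
    return max(PROFILES, key=lambda lang: sum((tg & PROFILES[lang]).values()))
```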
Stemming
• Stemming identifies root of a word, stripping
  away:
  – Suffixes, prefixes, verb tense, etc

• “stemmer”, “stemming”, “stemmed” → “stem”
• “go”, “going”, “gone” → “go”
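A toy suffix-stripper in the spirit of Porter stemming; real code would use, e.g., NLTK's PorterStemmer, and the suffix list and length checks here are simplifications:

```python
def toy_stem(word):
    """Strip one common suffix, then undouble a trailing consonant."""
    w = word.lower()
    for suf in ("ing", "ed", "er", "s"):
        if w.endswith(suf) and len(w) - len(suf) >= 2:
            w = w[: -len(suf)]
            break
    if len(w) >= 4 and w[-1] == w[-2]:  # "stemm" -> "stem"
        w = w[:-1]
    return w
```

Irregular forms like “gone” → “go” need a real stemmer or lemmatizer, not suffix rules.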
Term Networks
• Output of the cleaning step is a term
   vector
• Union of term vectors is a term network
• 2-mode network linking speakers with
   bigrams
• 2-mode network linking locations with
   bigrams
• Edge weight = number of occurrences
   of edge bigram/location or
   candidate/location
Build a larger net

• Periodically purge single co-occurrences
  – Edge weights are power-law distributed
  – Single co-occurrences account for ~ 90% of data

• Periodically discount and purge old co-occurrences
  – Discourse changes; data should reflect it.
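Both purges can be one pass over the edge list; the decay and threshold values below are assumptions:

```python
def prune(network, decay=0.5, threshold=1.0):
    """Discount every edge weight, then drop edges at or below the threshold.
    This removes single co-occurrences (the bulk of the data) and lets
    stale edges from old discourse fade out over repeated calls."""
    for edge in list(network):
        network[edge] *= decay
        if network[edge] <= threshold:
            del network[edge]
```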
Israel vs. Hamas on Twitter
Israel, Hamas and Media
Metrics computation

• Extract ego-networks for IDF and HAMAS
• Extract ego-networks for media organizations
• Compute Hamming distance H(c,l)
   – Cardinality of the intersection set between the two networks
   – Or… how much does CNN mirror Hamas? What about FOX?

• Normalize to percentage of support
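As a sketch, with each ego-network reduced to a set of bigrams (the sets in the test are illustrative stand-ins for real ego-networks):

```python
def support_shares(media_terms, idf_terms, hamas_terms):
    """Overlap of a media outlet's bigram set with each side's set,
    normalized so the two shares sum to 1."""
    idf_overlap = len(media_terms & idf_terms)
    hamas_overlap = len(media_terms & hamas_terms)
    total = idf_overlap + hamas_overlap
    if total == 0:
        return (0.5, 0.5)  # no evidence either way
    return (idf_overlap / total, hamas_overlap / total)
```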
Aggregate & Normalize


• Aggregate speech
  differences and
  similarities by
  media source
• Normalize values
Media Sources, Hamas and IDF
              IDF           Hamas
  NPR         0.579395354   0.420604646
  AlJazeera   0.530344094   0.469655906
  CNN         0.585616438   0.414383562
  BBC         0.537492158   0.462507842
  FOX         0.49329523    0.50670477
  CNBC        0.601137576   0.398862424
Ron Paul, Romney, Gingrich, Santorum
         March 2012 (based on Twitter Support)
[Bar chart: Twitter support shares by state (MT, MN, UT, MD, ID, IA, IL, AR, AK, PA, LA, HI, SD, KY, KS, OK, GA, CO, RI, NE, NC, NJ, WY, WV, WA); x-axis from 0 to 1.2]
Conclusions

• This works pretty well! ;-)

• However – it only works in
  aggregates, especially on Twitter.

• More text == better accuracy.
Conclusions

• The algorithm is cheap:
  – O(n) for words on ingest – real-time on a stream

  – O(n^2) for storage (pruning helps a lot)

• Storage can go to Redis
  – make use of built-in set operations
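Redis sets map directly onto the intersection metric: each ego-network is stored as a set and SINTER does the overlap server-side. A sketch that mimics the two commands in plain Python so it runs without a server; the key names are made up, and the commented lines show the assumed redis-py calls:

```python
# With redis-py (assumed API), the same computation would be:
# r = redis.Redis()
# r.sadd("terms:cnn", *cnn_bigrams)        # store an ego-network as a set
# r.sadd("terms:hamas", *hamas_bigrams)
# overlap = r.sinter("terms:cnn", "terms:hamas")

store = {}  # in-memory stand-in for Redis

def sadd(key, *members):
    """Add members to the set stored at key (like Redis SADD)."""
    store.setdefault(key, set()).update(members)

def sinter(*keys):
    """Intersect the sets stored at the given keys (like Redis SINTER)."""
    sets = [store.get(k, set()) for k in keys]
    return set.intersection(*sets) if sets else set()
```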
Implicit Sentiment Mining in Twitter Streams
