NAACL Tutorial
Social Media Predictive Analytics
Svitlana Volkova1, Benjamin Van Durme1,2,
David Yarowsky1 and Yoram Bachrach3
1Center for Language and Speech Processing,
Johns Hopkins University,
2Human Language Technology Center of Excellence,
3Microsoft Research Cambridge
Tutorial Schedule
Part I: Theoretical Session (2:00 – 4:30pm)
Batch Prediction
Online Inference
Coffee Break (3:30 – 4:00pm)
Dynamic Learning and Prediction
Part II: Practice Session (4:30 – 5:30pm)
Code and Data
Tutorial Materials
• Slides:
– http://www.cs.jhu.edu/~svitlana/slides.pptx
• Code and Data:
– https://bitbucket.org/svolkova/queryingtwitter
– https://bitbucket.org/svolkova/attribute
– https://bitbucket.org/svolkova/psycho-demographics
• References:
– http://www.cs.jhu.edu/~svitlana/references.pdf
Social Media Obsession
Diverse
Billions of messages
Millions of users
What do they
think and feel?
Where do
they go?
What are their
demographics and
personality?
What do
they like?
What do
they buy?
First: a comment on privacy and ethics…
Why is language in social media so
interesting?
• Very Short – 140 chars
• Lexically divergent
• Abbreviated
• Multilingual
Why is language in social media so
challenging?
• Data drift
• User activeness => generalization
• Topical sparsity => relationship, politics
• Dynamic streaming nature
DEMO
Predictive Analytics Services
• Social Network Prediction –
https://apps.facebook.com/snpredictionapp/
• Twitter Psycho-Demographic Profile and Affect Inference –
http://twitterpredictor.cloudapp.net (pswd: twitpredMSR2014)
• myPersonality Project – http://mypersonality.org/wiki/doku.php
• You Are What You Like – http://youarewhatyoulike.com/
• Psycho-demographic trait predictions –
http://applymagicsauce.com/
• IBM Personality – https://watson-pi-demo.mybluemix.net
• World Well Being Project – http://wwbp.org
Applications: Retail
Personalized marketing
• Detecting opinions and emotions
users express about products or
services within targeted
populations
Personalized
recommendations and search
• Making recommendations based on
user emotions, demographics and
personality
Applications: Advertising
Online targeted advertising
• Targeting ads based on
predicted user demographics
• Matching the emotional tone
the user expects
• Deliver ads fast
• Deliver ads to a true crowd
Applications: Polling
Real-time live polling
• Mining political opinions
• Voting predictions within certain demographics
Large-scale passive polling
• Passive polling regarding products and services
Applications: Health
Large-scale real-time healthcare analytics
• Identifying smokers, drug addicts, healthy eaters,
people into sports (Paul and Dredze 2011)
• Monitoring flu trends, food poisoning, chronic
illnesses (Culotta et al. 2015)
Applications: HR
Recruitment and human resource
management
• Estimating emotional stability and
personality of potential and
current employees
• Measuring the overall well-being of
employees, e.g., life satisfaction,
happiness (Schwartz et al. 2013;
Volkova et al. 2015)
• Monitoring depression and stress levels
(Coppersmith et al. 2014)
User Attribute Prediction Task
Political Preference
Rao et al., 2010; Conover et al., 2011;
Pennacchiotti and Popescu, 2011;
Zamal et al., 2012; Cohen and Ruths,
2013; Volkova et al., 2014
Communications
Gender
Garera and Yarowsky, 2009; Rao et
al., 2010; Burger et al., 2011; Van
Durme, 2012; Zamal et al., 2012;
Bergsma and Van Durme, 2013
Age
Rao et al., 2010; Zamal et al., 2012;
Cohen and Ruths, 2013; Nguyen et al.,
2011, 2013; Sap et al., 2014
AAAI 2015 Demo (joint work with Microsoft Research)
Income, Education Level, Ethnicity, Life Satisfaction, Optimism,
Personality, Showing Off, Self-Promoting
Tweets Revealing User Attributes
Supervised Models
Classification: binary (SVM) – gender, age, political, ethnicity
• Goswami et al. 2009; Rao et al. 2010; Burger et al. 2011; Mislove et al.
2012; Nguyen et al. 2011, 2013;
• Pennacchiotti and Popescu 2011; Conover et al. 2011; Filippova et al.
2012; Van Durme 2012; Bergsma et al. 2012, 2013; Bergsma and Van
Durme 2013;
• Zamal et al. 2012; Ciot et al. 2013; Cohen and Ruths 2013;
• Schwartz et al. 2013; Sap et al. 2014; Kern et al. 2014;
Golbeck et al. 2011; Kosinski et al. 2013;
• Volkova et al. 2014; Volkova et al. 2015.
Unsupervised and Generative Models
• name morphology for gender & ethnicity prediction – Rao et al. 2011;
• large-scale clustering – Bergsma et al. 2013; Culotta et al. 2015;
• demographic language variation – Eisenstein et al. 2010; O'Connor et
al. 2010; Eisenstein et al. 2014.
*Rely on more than lexical features, e.g., network, streaming
Existing Approaches ~1K Tweets*
Does an average Twitter user produce
thousands of tweets?
*Rao et al., 2010; Conover et al., 2011; Pennacchiotti and
Popescu, 2011a; Burger et al., 2011; Zamal et al., 2012; Nguyen
et al., 2013
Tweets as a
document
How Active are Twitter Users?
Attributed Social Network
User Local Neighborhoods a.k.a. Social Circles
Approaches
Static (Batch) Prediction
• Offline training, offline predictions
• + Neighbor content

Streaming (Online) Inference
• Offline training + online predictions over time
• Exploring 6 types of neighborhoods

Dynamic (Iterative) Learning and Prediction
• Online predictions, relying on neighbors
• + Iterative re-training, active learning, rationale annotation

Challenges addressed: topical sparsity, data drift, streaming nature, model generalization
Part I Outline
I. Batch Prediction
i. How to collect and annotate data?
ii. What models and features to use?
iii. Which neighbors are the most predictive?
II. Online Inference
i. How to predict from a stream?
III. Dynamic (Iterative) Learning and Prediction
i. How to learn and predict on the fly?
How to get data? Twitter API
• Twitter API: https://dev.twitter.com/overview/api
• Twitter API Status: https://dev.twitter.com/overview/status
• Twitter API Rate Limits:
https://dev.twitter.com/rest/public/rate-limits
Querying Twitter API
• Twitter Developer Account => access key and token
https://dev.twitter.com/oauth/overview/application-
owner-access-tokens
from twython import Twython
twitter = Twython(APP_KEY, APP_SECRET,
OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
I. Access the 1% Twitter Firehose and sample from it
II. Query Twitter API to get:
 user timelines (up to 3200 tweets) from userIDs
 tweet json objects from tweetIDs
 lists of friendIDs (5000 per query) from userIDs
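The per-call limits above can be handled by batching IDs before querying. A minimal sketch, assuming a hypothetical `chunk_ids` helper and a batch size of 100 tweetIDs per statuses/lookup request:

```python
# Sketch: batch IDs so each Twitter API call stays within its limit
# (e.g., 100 tweetIDs per statuses/lookup request).

def chunk_ids(ids, batch_size=100):
    """Split a list of IDs into batches of at most batch_size."""
    return [ids[i:i + batch_size] for i in range(0, len(ids), batch_size)]

tweet_ids = list(range(250))      # stand-in for real tweetIDs
batches = chunk_ids(tweet_ids)
print(len(batches))               # 3 batches: 100 + 100 + 50

# Each batch then becomes one API call, e.g. with Twython:
# for batch in batches:
#     tweets = twitter.lookup_status(id=','.join(map(str, batch)))
```

Batching this way also makes it easy to sleep between calls when a rate-limit window is exhausted.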
JSON Objects
MongoDB: http://docs.mongodb.org/manual/tutorial/getting-started/
Add predictions: sentiment,
attributes, emotions
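Before adding predictions, the relevant fields are pulled out of each tweet's JSON object. A minimal sketch with the stdlib `json` module (the sample object and the `record` layout are illustrative; field names follow the Twitter v1.1 tweet object):

```python
# Sketch: extract the fields used downstream (text, time, friend count)
# from a tweet JSON object before storing it, e.g., in MongoDB.
import json

raw = '''{"text": "Happy 21st birthday to me!",
          "created_at": "Mon Jun 01 12:00:00 +0000 2015",
          "user": {"screen_name": "example_user", "friends_count": 42}}'''

tweet = json.loads(raw)
record = {
    "text": tweet["text"],
    "time": tweet["created_at"],
    "friends": tweet["user"]["friends_count"],
}
print(record["friends"])  # 42
# The MongoDB insert would then be: db.tweets.insert_one(record)
```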
How to get labeled data?
• Supervised classification in a new domain:
– Labeled data ≈ ground truth
– Costly and time consuming to get!
• Ways to get ≈"ground truth" annotations:
 Fun psychological tests (voluntary): myPersonality project
 Profile info: Facebook, e.g., relationship, gender, age (but sparse for Twitter)
 Self-reports: "I am a republican…" (Volkova et al. 2013), "Happy
##th/st/nd/rd birthday to me" (Zamal et al. 2012), "I have been diagnosed
with …" (Coppersmith et al. 2014), "I am a writer …" (Beller et al. 2014)
 Distant supervision: following Obama vs. Romney (Zamal et al. 2012),
emotion hashtags (Mohammad et al. 2014), user name (Burger et al. 2011)
 Crowdsourcing: subjective perceived annotations (Volkova et al. 2015),
rationales (Bergsma et al. 2013; Volkova et al. 2014, 2015)
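Self-reports like the birthday pattern above can be matched with a simple regular expression. A hedged sketch (the `BIRTHDAY` pattern and `self_reported_age` helper are illustrative, not the exact rules used in the cited papers):

```python
# Sketch: distant supervision from the "Happy ##th/st/nd/rd birthday
# to me" self-report used for age labeling (Zamal et al. 2012).
import re

BIRTHDAY = re.compile(
    r"happy\s+(\d{1,2})(?:th|st|nd|rd)\s+birthday\s+to\s+me", re.I)

def self_reported_age(tweet):
    """Return the self-reported age if the tweet matches, else None."""
    m = BIRTHDAY.search(tweet)
    return int(m.group(1)) if m else None

print(self_reported_age("Happy 21st birthday to me!!"))  # 21
print(self_reported_age("Happy birthday, Ann!"))         # None
```

A real pipeline would add filters for retweets and quoted text, since those re-broadcast someone else's self-report.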
[Diagram: attribute model Φ_A(u), user sets U_L and U_P; Twitter social graph with neighbor types: friend, follower, @mention, reply, retweet, hashtag]
retweet
I. Candidate-Centric (distant supervision): 1,031 users
II. Geo-Centric (self-reports): 270 users
III. Politically Active (distant supervision)*: 371 users (Dem; Rep)
IV. Age (self-reports)*: 387 users (18–23; 23–25)
V. Gender (name)*: 384 users (Male; Female)
Balanced datasets
*Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013
Code, data and trained models for gender, age, political preference prediction: http://www.cs.jhu.edu/~svitlana/
10 - 20 neighbors
of 6 types per user
What types of neighbors lead to the best
attribute predictions?
Part I Outline
I. Batch Prediction
i. How to collect and annotate data?
ii. What models and features to use?
iii. Which neighbors are the most predictive?
II. Online Inference
i. How to predict from a stream?
III. Dynamic (Iterative) Learning and Prediction
i. How to learn and predict on the fly?
Classification Model
• Logistic regression = max entropy = log-linear models
– Maps discrete inputs w to a binary output y
• Other options: SVM, NB

w_i ∈ {0, 1}, y ∈ {M, F}

Labeled users (training), feature vectors over the vocabulary:

           hair  eat  cool  work  …  xbox
Female       1    1    0     0    …    0
Male         0    1    0     1    …    1
Male         0    0    1     1    …    1

Test user feature vector:

?            0    1    0     0    …    1

http://scikit-learn.org/stable/modules/generated/
sklearn.linear_model.LogisticRegression.html
Features (I)
• Lexical:
– normalized counts/binary ngrams (Goswami et al. 2010; Rao et al.
2010; Pennacchiotti and Popescu 2011; Nguyen et al. 2013; Ciot et al.
2013; Van Durme 2012; Kern et al. 2014; Volkova et al. 2014;
Volkova and Van Durme 2015)
– class-based highly predictive (Bergsma and Van Durme 2013),
rationales (Volkova and Yarowsky 2014); character-based
(Peersman et al. 2011), stems, co-stems, lemmas (Zamal et al.
2012; Cohen et al. 2014)
• Socio-linguistic, syntactic and stylistic:
– syntax and style (Schler et al. 2006; Cheng et al. 2011), smiles,
excitement, emoticons and psycho-linguistic (Rao et al. 2010;
Marquardt et al. 2014; Kokkos et al. 2014; Hovy 2015)
– lexicon features (Sap et al. 2014); linguistic inquiry and word
count (LIWC) (Mukherjee et al. 2010; Fink et al. 2012)
Features (II)
• Communication behavior: response/retweet/tweet frequency,
retweeting tendency (Conover et al. 2011; Golbeck et al. 2011;
Pennacchiotti and Popescu 2011; Preoţiuc-Pietro et al. 2015)
• Network structure: follower-following ratio, neighborhood size,
in/out degree, degree of connectivity (Bamman et al. 2012;
Filippova 2012; Zamal et al. 2012; Culotta et al. 2015)
• Other: likes (Bachrach et al. 2012; Kosinski et al. 2014), name or
census (Burger et al. 2011; Liu and Ruths 2013), links/images
(Rosenthal and McKeown 2011)
• Topics: word embeddings, LDA topics, word clusters
(Preoţiuc-Pietro et al. 2015)

           hair  eat  cool  work  …  xbox  RT   neigh  images  …
Female       1    1    0     0    …    0   0.3    30    0.5    …
Batch Experiments
• Log-linear word unigram models:
(I) Users vs. (II) Neighbors and (III) User-Neighbor
• Evaluate different neighborhood types:
– varying neighborhood size n=[1, 2, 5, 10] and
content amount t=[5, 10, 15, 25, 50, 100, 200]
– 10-fold cross validation with 100 random
restarts for every n and t parameter combination
F = argmax_a P(A = a | T)
User Model
F_u = D if 1 / (1 + e^(−θ·f_u)) ≥ 0.5, R otherwise.
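The user model's decision rule is a thresholded sigmoid over the user's feature vector. A minimal pure-Python sketch (the weights θ here are toy values, not a trained model):

```python
# Sketch of the thresholded log-linear decision rule:
# predict D if sigmoid(theta · f_u) >= 0.5, else R.
import math

def predict(theta, f_u):
    """Return (label, probability) for one user's feature vector."""
    score = sum(t * f for t, f in zip(theta, f_u))
    p = 1.0 / (1.0 + math.exp(-score))
    return ("D" if p >= 0.5 else "R"), p

theta = [1.2, -0.7, 0.3]              # toy weights, not a trained model
label, p = predict(theta, [1, 0, 1])  # score = 1.5, sigmoid(1.5) ≈ 0.82
print(label, round(p, 2))             # D 0.82
```

The same rule reappears below for the neighbor and joint user-neighbor models; only the feature vector changes.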
Train Graph / Test Graph
v_i: t: "Washington Post Columnist…" → f_t(v_i) = [w_1 = 1, w_2 = 1, …, w_n = 0]
v_j: t: "…Ron Paul not a fan of Chris Christie" → f_t(v_j) = [w_1 = 0, w_2 = 0, …, w_n = 0]
v_k (?): t: "We're watching you House @GOP" → f_t(v_k) = [w_1 = 1, w_2 = 1, …, w_n = 0]
Neighbor Model
HLTCOE Text Meeting, June 09 2014
Train Graph / Test Graph
N(v_i): t: "Obama: I'd defend @MajorCBS" → f_t(N(v_i)) = [w_1, w_2, …, w_n]
N(v_j): t: "The Lyin King #RepMovies" → f_t(N(v_j)) = [w_1, w_2, …, w_n]
N(v_k) (?): t: "@FoxNews: WATCH LIVE" → f_t(N(v_k)) = [w_1, w_2, …, w_n]

F_N(u) = D if 1 / (1 + e^(−θ·f_N(u))) ≥ 0.5, R otherwise.
Joint User-Neighbor Model
Train Graph / Test Graph
v_i: t: "Washington Post Columnist… / Obama: I'd defend @MajorCBS" → f_t(v_i + N(v_i)) = [w_1, w_2, …, w_n]
v_j: t: "…Ron Paul not a fan of Christie / The Lyin King #RepublicanMovies" → f_t(v_j + N(v_j)) = [w_1, w_2, …, w_n]
v_k (?): t: "@FoxNews: WATCH LIVE / We're watching you House @GOP" → f_t(v_k + N(v_k)) = [w_1, w_2, …, w_n]

Learning on user and neighbor features jointly (not prefixing features)

F_(u+N(u)) = D if 1 / (1 + e^(−θ·f_(u+N(u)))) ≥ 0.5, R otherwise.
Part I Outline
I. Batch Prediction
i. How to collect and annotate data?
ii. What models and features to use?
iii. Which neighbors are the most predictive?
II. Online Inference
i. How to predict from a stream?
III. Dynamic (Iterative) Learning and Prediction
i. How to learn and predict on the fly?
Gender Prediction
[Figures: accuracy vs. tweets per user for user-only models (unigram, binary, bigram, trigram, UserOnlyZLR); heatmaps of accuracy over neighborhood size and tweets for retweet, usermention and friend neighborhoods]
User: 0.82; User-Neighbor: 0.73; Neighbor: 0.63
Lexical Markers for Gender
Gender Prediction Quality
Approach Users Tweets Features Accuracy
Rao et al., 2010 1K 405 BOW+socioling 0.72
Burger et al., 2011 184K 22 username, BOW 0.92
Zamal et al., 2012 384 10K neighbor BOW 0.80
Bergsma et al., 2013 33.8K − BOW, clusters 0.90
JHU models 383 200/2K BOW user/neigh 0.82/0.73
• This is not a direct comparison (Twitter data-sharing
restrictions prevent using identical datasets)
• Poor generalization: different datasets = different
sampling and annotation biases
Age Prediction
[Figures: accuracy vs. tweets per user for user-only models; heatmaps of accuracy over neighborhood size and tweets for follower, friend and retweet neighborhoods]
User: 0.77; User-Neighbor: 0.77; Neighbor: 0.72 (classes: 18–23 vs. 23–25)
Lexical Markers for Age
Age Prediction Quality
Approach Users Tweets Groups Features Accuracy
Rao et al., 2010 2K 1183 <=30; >30 BOW+socioling 0.74
Zamal et al., 2012 386 10K 18–23; 23–25 neighbor BOW 0.80
JHU models 381 200/2K 18–23; 23–25 BOW/neighbors 0.77/0.74
• This is not a direct comparison!
• Performance varies across age groups
• Sampling and annotation biases
Political Preference
[Figures: accuracy vs. tweets per user for user-only models; heatmaps of accuracy over neighborhood size and tweets for friend, retweet and usermention neighborhoods]
User: 0.89; User-Neighbor: 0.92; Neighbor: 0.91
Lexical Markers for Political
Preferences
Model Generalization
• Political preference classification is not easy!
• Topical sparsity: average users rarely tweet about politics
[Bar chart] Accuracy by graph (User / Neighbor / User-Neighbor):
Geo-centric: 0.57 / 0.67 / 0.69
Cand-centric: 0.72 / 0.75 / 0.87
Active: 0.89 / 0.91 / 0.92
Political Preference Prediction Quality

Politically Active Users (sampling/annotation bias):
Approach Users Tweets Features Accuracy
Bergsma et al., 2013 400 5K BOW, clusters 0.82
Pennacchiotti 2011 10.3K − BOW, network 0.89
Conover et al., 2011 1K 1K BOW, network 0.95
Zamal et al., 2012 400 1K neighbor BOW 0.91
JHU active 371 200 BOW user/neigh 0.89/0.92

Random/Average Users:
JHU cand-centric 1,051 200 BOW user/neigh 0.72/0.75
JHU geo-centric 270 200 BOW user/neigh 0.57/0.67
Cohen et al., 2013 262 1K BOW, network 0.68
Querying more neighbors with fewer
tweets is better than querying more
tweets from the existing neighbors
(given limited Twitter API calls)
Optimizing Twitter API Calls
Cand-Centric Graph: Friend Circle
?
Summary: Static Prediction
• Features: Binary (political) vs. count-based features (age, gender)
• Homophily: “neighbors give you away” => users with no content
• Attribute assortativity: similarity with neighbors depends on
attribute types
• Content from more neighbors per user >> additional content from
the existing neighbors
• Generalization of the classifiers
Part I Outline
I. Batch Prediction
i. How to collect and annotate data?
ii. What models and features to use?
iii. Which neighbors are the most predictive?
II. Online Inference
i. How to predict from a stream?
III. Dynamic (Iterative) Learning and Prediction
i. How to learn and predict on the fly?
Iterative Bayesian Predictions

Time: t_0 → t_1 → t_2 → … → t_(k-1) → t_k
P_t1(R | T_t1) = 0.52, P_t2(R | T_t2) = 0.62, P_t(k-1)(R | T_t(k-1)) = 0.65, P_tk(R | T_tk) = 0.77

Posterior from class prior and per-tweet likelihoods:

P(a = R | T) = P(a = R) ∏_k P(t_k | a = R) / [ P(a = R) ∏_k P(t_k | a = R) + P(a = D) ∏_k P(t_k | a = D) ]
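The iterative update above can be sketched in a few lines: each arriving tweet multiplies the running posterior by its class-conditional likelihood, then the result is renormalized over {R, D}. The per-tweet likelihoods below are toy numbers; a real system would take them from the per-tweet classifier:

```python
# Sketch: iterative Bayesian posterior updates over a tweet stream.

def update(posterior, likelihoods):
    """One Bayesian step: posterior[a] *= P(t_k | a), then normalize."""
    new = {a: posterior[a] * likelihoods[a] for a in posterior}
    z = sum(new.values())
    return {a: p / z for a, p in new.items()}

belief = {"R": 0.5, "D": 0.5}            # uniform class prior
stream = [{"R": 0.6, "D": 0.4},          # toy P(t_k | a) per tweet
          {"R": 0.7, "D": 0.3},
          {"R": 0.55, "D": 0.45}]
for lik in stream:
    belief = update(belief, lik)
print(round(belief["R"], 2))             # 0.81: confidence grows per tweet
```

Normalizing after every step keeps the running product from underflowing; log-space sums are the other standard fix for long streams.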
Cand-Centric Graph: Posterior Updates
[Figures: p(Republican | T) over the tweet stream T, rising from 0.5 toward 1.0 for one example user and falling toward 0.0 for another]
Cand-Centric: Prediction Time (1)
[Figures: number of users classified correctly over time (weeks), user stream vs. joint user-neighbor stream, Dem vs. Rep]
Prediction confidence: 0.95 vs. 0.75
Democrats are easier to predict than Republicans
Cand-Centric Graph: Prediction Time (2)
How much time does it take to classify 100 users with
75% confidence?
Compare: User Stream vs. Joint User-Neighbor Stream
[Bar chart, weeks on log scale, for the cand-centric, geo-centric and active graphs]
Batch vs. Online Performance
[Bar charts] Accuracy:
Batch – User: Cand 0.72, Geo 0.57, Active 0.75; Neighbor: Cand 0.75, Geo 0.67, Active 0.86
Online – User Stream: Cand 0.99, Geo 0.84, Active 0.89; User-Neighbor Stream: Cand 0.99, Geo 0.88, Active 0.99
Summary: Online Inference
• Homophily: Neighborhood content is useful*
• Lessons learned from batch predictions:
– Age: user-follower or user-mention joint stream
– Gender: user-friend joint stream
– Political: user-mention and user-retweet joint stream
• Streaming models >> batch models
• Activeness: tweeting frequency matters a lot!
• Generalization of the classifiers: data sampling and
annotation biases
*Pennacchiotti and Popescu, 2011a, 2011b; Conover et al., 2011a, 2011b;
Golbeck et al., 2011; Zamal et al., 2012; Volkova et al., 2014
Part I Outline
I. Batch Prediction
i. How to collect and annotate data?
ii. What models and features to use?
iii. Which neighbors are the most predictive?
II. Online Inference
i. How to predict from a stream?
III. Dynamic (Iterative) Learning and Prediction
i. How to learn and predict on the fly?
Iterative Batch Learning
[Diagram: labeled and unlabeled users over time t_0, t_1, t_2, …, t_k;
P_t1(R | t_1) = 0.52 → P_tk(R | t_1…t_m) = 0.77]
 Iterative Batch Retraining (IB)
 Iterative Batch with Rationale Filtering (IBR)
Active Learning
[Diagram: labeled and unlabeled users; user u and neighbors n_i ∈ N with features Φ(u, t_1), Φ(n, t_1) over time 1-Jan-2011 … 1-Dec-2011;
P_t0(R | t_1) = 0.5 → P_t1(R | t_1…t_5) = 0.55 → P_t(k-1)(R | t_1…t_100) = 0.77 > θ]
 Active Without Oracle (AWOO)
 Active With Rationale Filtering (AWR)
 Active With Oracle (AWO)
Annotator Rationales
Rationales are explicitly highlighted ngrams in tweets that best
justify why the annotators made their labeling decisions
(related: feature norms in psychology, feature sparsity)
Bergsma and Van Durme, 2013; Volkova and Yarowsky, 2014; Volkova and Van Durme, 2015
Alternative: Rationale Weighting
• Annotator rationales for gender, age and political:
http://www.cs.jhu.edu/~svitlana/rationales.html
• Multiple languages: English, Spanish
• Portable to other languages
Improving Gender Prediction of Social Media Users via Weighted Annotator Rationales. Svitlana
Volkova and David Yarowsky. NIPS Workshop on Personalization: Methods and Applications 2014.
Performance Metrics
• Accuracy over time:
• Find optimal models:
– Data stream type (user, friend, user + friend)
– Time (more correctly classified users faster)
– Prediction quality (better accuracy over time)
A_(θ,t) = #correctly classified / #above threshold = (T_D + T_R) / (D + R)
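The metric above can be computed directly: among users whose posterior confidence exceeds the threshold θ, what fraction is classified correctly? A minimal sketch (the `accuracy_at` helper and the triple format are illustrative):

```python
# Sketch: accuracy over the users whose prediction confidence
# exceeds threshold theta at a given point in time.

def accuracy_at(users, theta=0.75):
    """users: list of (confidence, predicted, true) triples."""
    above = [(pred, true) for conf, pred, true in users if conf >= theta]
    if not above:
        return 0.0
    correct = sum(1 for pred, true in above if pred == true)
    return correct / len(above)

users = [(0.90, "R", "R"), (0.80, "D", "R"),
         (0.95, "D", "D"), (0.60, "R", "D")]
print(accuracy_at(users))   # 2 of the 3 above-threshold users are correct
```

Note the denominator counts only above-threshold users, so raising θ trades coverage for precision, exactly the tension the recommendations below exploit.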
Results: Iterative Batch Learning
[Figures: accuracy and number of correctly classified users over time (Mar–Sep), user vs. user + friend streams]
IB: higher recall; IBR: higher precision
Time: # correctly classified users increases over time (IB faster, IBR slower)
Data stream selection: user + friend stream > user stream
Results: Active Learning
[Figures: accuracy and number of correctly classified users over time (Mar–Sep) for IB/IBR and AWOO/AWR models, user and user + friend streams]
AWOO: higher recall; AWR: higher precision
Time: unlike IB/IBR models, AWOO/AWR models classify
more users correctly faster (by Mar) but then plateau
batch < active; user + friend > user

Results: Model Quality
Active with Oracle Annotations
[Figures: cumulative requests to the oracle and correctly classified users over time (Feb–Dec), user vs. user + friend streams; thousands of tweets in training; the oracle is 100% correct]
Summary: Dynamic Learning and
Prediction
• Active learning > iterative batch
• N, UN > U: “neighbors give you away”
• Higher confidence => higher precision,
lower confidence => higher recall (as expected)
• Rationales significantly improve results
Practical Recommendations:
Models for Targeted Advertising
• Data stream (user, friend or joint): User + Friend > User
• Time (correctly classified users faster): models without
rationale filtering (IB, AWOO); lower confidence threshold 0.55
• Prediction quality (better accuracy over time): models with
rationale filtering (IBR, AWR); higher confidence threshold 0.95
Recap: Why are these models good?
• Models streaming nature of social media
• Limited user content => take advantage of
neighbor content
• Actively learn from crowdsourced rationales
• Learn on the fly => data drift
• Predict from multiple streams => topical sparsity
• Flexible extendable framework:
– More features: word embeddings, interests, profile info,
tweeting behavior
Software Requirements
• Python: https://www.python.org/downloads/ python -V
• Pip: https://pip.pypa.io/en/latest/installing.html
python get-pip.py
• Twython: https://pypi.python.org/pypi/twython/
pip install twython
• matplotlib 1.3.1:
http://sourceforge.net/projects/matplotlib/files/matplotlib/
• numpy 1.8.0: http://sourceforge.net/projects/numpy/files/NumPy/
• scipy 0.13: http://sourceforge.net/projects/scipy/files/scipy/
• scikit-learn 0.14.1: http://sourceforge.net/projects/scikit-learn/files/
python -c "import sklearn; print(sklearn.__version__)"
python -c "import numpy; print(numpy.version.version)"
python -c "import scipy; print(scipy.version.version)"
python -c "import matplotlib; print(matplotlib.__version__)"
Part II. Practice Session Outline
• Details on data collection and annotation
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference
– Tweet-based: emotions
– User-based: psycho-demographic attributes
• Python examples for online inference
– Bayesian updates from multiple data streams
JHU: Data Overview and Annotation
Scheme
[Diagram: neighbor types: friend, follower, @mention, reply, retweet, hashtag]
Political Preferences:
– Candidate-Centric = 1,031 users (follow candidates)
– Geo-Centric = 270 users (self-reports in DE, MD, VA)
– Politically Active* = 371 users (active & follow cand)
Age (self-reports)*: 387 users
Gender (name)*: 384 users
10 - 20 neighbors of each of 6 types
Details on Twitter data collection:
http://www.cs.jhu.edu/~svitlana/data/data_collection.pdf
*Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013
Links to Download JHU Attribute Data
• What does the data look like?
– graph_type.neighbor_type.tsv, e.g., cand-centric.follower.tsv
• JHU gender and age:
http://www.cs.jhu.edu/~svitlana/data/graph_gender_age.tar.gz
• JHU politically active*:
http://www.cs.jhu.edu/~svitlana/data/graph_zlr.tar.gz
• JHU candidate-centric:
http://www.cs.jhu.edu/~svitlana/data/graph_cand.tar.gz
• JHU geo-centric:
http://www.cs.jhu.edu/~svitlana/data/geo_cand.tar.gz
Code to query Twitter API
• Repo: https://bitbucket.org/svolkova/queryingtwitter
– get lists of friends/followers for a user
– 200 recent tweets for k randomly sampled
retweeted or mentioned users
– tweets for a list of userIDs
[Pipeline: userIDs/tweetIDs → Twitter API → JSON objects → extract text fields, time, #friends → tweet collection]
Part II. Practice Session Outline
• Data and annotation schema description
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference:
– Tweet-based: emotions
– User-based: psycho-demographic attributes
• Python examples for streaming inference:
– Bayesian updates from multiple data streams
MSR: Psycho-Demographic Annotations
via Crowdsourcing
• 5K profiles annotated by a trusted crowd ($6/hour, with quality control)
[Bar chart: Cohen's kappa on a 2% random sample, per attribute: intelligence, relationship, religion, political, education, optimism, income, life satisfaction, age, children, gender, ethnicity]
• Attribute models Φ_A(u): trained on the 5K labeled users (U_L), applied to millions (U_P)
MSR: Emotion Annotations via
Distant Supervision
Hashtags for Ekman's 6 emotions (Mohammad et al. '14)
+ emotion-synonym hashtags
Part II. Practice Session
• Data and annotation schema description
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference:
– Tweet-based: emotions
– User-based: psycho-demographic attributes
• Python examples for streaming inference:
– Bayesian updates from multiple data streams
How to get MSR models and code?
https://bitbucket.org/svolkova/psycho-demographics
1. Load models for 15 psycho-demographic attributes + emotions
2. Extract features from input tweets
3. Apply pre-trained models to make predictions for input tweets
Predictive Models
Supervised text classification
Log-linear models
User-based:
• Lexical: normalized binary/count-based ngrams
• Affect: emotions, sentiments
Tweet-based:
• BOW + negation, stylistic (+0.3 F1)
• Socio-linguistic and stylistic:
• Elongations: Yaay, woooow
• Capitalization: COOL; mixed punctuation: ???!!!
• Hashtags and emoticons
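The stylistic cues listed above are easy to detect with regular expressions. A hedged sketch (the patterns in `stylistic_features` are illustrative, not the exact ones used in the tutorial models):

```python
# Sketch: socio-linguistic/stylistic feature extraction for one tweet
# (elongations, capitalization, mixed punctuation, hashtags, emoticons).
import re

def stylistic_features(tweet):
    return {
        "elongation": bool(re.search(r"(\w)\1{2,}", tweet)),   # woooow
        "all_caps": bool(re.search(r"\b[A-Z]{3,}\b", tweet)),  # COOL
        "mixed_punct": bool(re.search(r"[?!]{2,}", tweet)),    # ???!!!
        "hashtags": len(re.findall(r"#\w+", tweet)),
        "emoticons": len(re.findall(r"[:;]-?[)(DP]", tweet)),
    }

print(stylistic_features("Yaay woooow COOL ???!!! #happy :)"))
```

These binary/count features would then be appended to the BOW vector before training the log-linear model.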
F(u) = a_0 if 1 / (1 + e^(−θ·f)) ≥ 0.5, a_1 otherwise.
Tweet-based: Emotion Prediction
6 classes: joy, sadness, fear, surprise, disgust, anger
[Bar chart, F1 score (higher is better): sadness 0.62, surprise 0.64, fear 0.77, joy 0.79, anger 0.80, disgust 0.92]
F1 = 0.78 (Roberts '12: 0.67, Qadir '13: 0.53, Mohammad '14: 0.49)
User-Based: Attribute Prediction
[Bar charts, ROC AUC per attribute]
Chart 1: religion 0.63, relationship 0.63, age 0.66, political 0.72, children 0.72, optimism 0.72, life satisfaction 0.72, income 0.73, intelligence 0.75, education 0.77, gender 0.90, race 0.93
Chart 2 (EmoSentOut, EmoSentDiff, above + lexical): relationship 0.74, religion 0.74, children 0.80, political 0.82, age 0.83, intelligence 0.83, optimism 0.83, life satisfaction 0.84, income 0.85, education 0.88, gender 0.95, ethnicity 0.97; gains over BOW from +0.04 to +0.17

F(u) = a_0 if 1 / (1 + e^(−θ·f)) ≥ 0.5, a_1 otherwise.
Predicting Demographics from User
Outgoing Emotions and Opinions
[Heatmap: emotion and opinion features (disgust, negative, sadness, fear, sentiment score, anger, surprise, emotion score, joy, positive, neutral) vs. predicted attributes (intelligence, race, education, income, gender, age, children, political, religion, relationship, optimism, life satisfaction), column z-scores]
AUC ROC per attribute ranges 0.58–0.76; for 1/3 of attributes AUC >= 75%
Example contrasts: satisfied vs. dissatisfied, optimist vs. pessimist, no kids, below 25 y.o., female vs. male
How to get JHU models and code?
 Ex1: Train and test batch models
 Ex2: Train a model from a training file and save it
 Ex3: Predict an attribute using a pre-trained model and plot
iterative updates
 Ex4: Predict and plot iterative updates for multiple attributes
using pre-trained models from a single communication stream
 Ex5: Predict and plot iterative updates for multiple attributes
from multiple communication streams
https://bitbucket.org/svolkova/attribute
Ex1. Train/Test Batch Models
• Run as e.g., for gender:
• Customize features and model type/parameters:
Accuracy
Ex2. Save Pre-trained Models
• Run as e.g., age:
• Customize features (process.py), model type and
parameters (predict.py)
Part II. Practice Session
• Data and annotation schema description
– JHU: gender, age and political preferences
– MSR: emotions, opinions and psycho-demographics
• Python examples for static inference:
– User-based: psycho-demographics
– Tweet-based: emotions, opinions
• Python examples for streaming inference:
– Bayesian updates from multiple data streams
Recap: Iterative Bayesian Updates

Time: t_0 → t_1 → t_2 → … → t_(k-1) → t_k
P_t1(R | T_t1) = 0.52, P_t2(R | T_t2) = 0.62, P_t(k-1)(R | T_t(k-1)) = 0.65, P_tk(R | T_tk) = 0.77

Posterior from class prior and per-tweet likelihoods:

P(a = R | T) = P(a = R) ∏_k P(t_k | a = R) / [ P(a = R) ∏_k P(t_k | a = R) + P(a = D) ∏_k P(t_k | a = D) ]
Ex3. Iterative Updates for a Single
Attribute from a Single Stream
Ex4. Iterative Updates for Multiple
Attributes from a Single Stream
Steps:
1. Loading Models
2. Processing data
3. Setting up train/test priors
4. Making Predictions
5. Plotting results
Joint User-Neighbor Streams
[Diagram: friend, follower, @mention, reply, retweet, hashtag streams]
Ex5. Iterative Updates for Multiple
Attributes from Joint Streams
Questions?
http://www.cs.jhu.edu/~svitlana/
svitlana@jhu.edu
References: http://www.cs.jhu.edu/~svitlana/references.pdf
Slides: http://www.cs.jhu.edu/~svitlana/slides.pptx

NAACL Tutorial
Social Media Predictive Analytics

  • 1. NAACL Tutorial Social Media Predictive Analytics Svitlana Volkova1, Benjamin Van Durme1,2, David Yarowsky1 and Yoram Bachrach3 1Center for Language and Speech Processing, Johns Hopkins University, 2Human Language Technology Center of Excellence, 3Microsoft Research Cambridge
  • 2. Tutorial Schedule Part I: Theoretical Session (2:00 – 4:30pm) Batch Prediction Online Inference Coffee Break (3:30 – 4:00pm) Dynamic Learning and Prediction Part II: Practice Session (4:30 – 5:30pm) Code and Data
  • 3. Tutorial Materials • Slides: – http://www.cs.jhu.edu/~svitlana/slides.pptx • Code and Data: – https://bitbucket.org/svolkova/queryingtwitter – https://bitbucket.org/svolkova/attribute – https://bitbucket.org/svolkova/psycho-demographics • References: – http://www.cs.jhu.edu/~svitlana/references.pdf
  • 4. Social Media Obsession Diverse Billions of messages Millions of users
  • 5. What do they think and feel? Where do they go? What is their demographics and personality? What do they like? What do they buy?
  • 6. First: a comment on privacy and ethics…
  • 7. Why is language in social media so interesting? • Very Short – 140 chars • Lexically divergent • Abbreviated • Multilingual
  • 8. Why is language in social media so challenging? • Data drift • User activeness => generalization • Topical sparsity => relationship, politics • Dynamic streaming nature
  • 10. Predictive Analytics Services • Social Network Prediction – https://apps.facebook.com/snpredictionapp/ • Twitter Psycho-Demographic Profile and Affect Inference – http://twitterpredictor.cloudapp.net (pswd: twitpredMSR2014) • My personality Project – http://mypersonality.org/wiki/doku.php • You Are What You Like – http://youarewhatyoulike.com/ • Psycho-demographic trait predictions – http://applymagicsauce.com/ • IBM Personality – https://watson-pi-demo.mybluemix.net • World Well Being Project – http://wwbp.org
  • 11. Applications: Retail Personalized marketing • Detecting opinions and emotions users express about products or services within targeted populations Personalized recommendations and search • Making recommendations based on user emotions, demographics and personality
  • 12. Applications: Advertising Online targeted advertising • Targeting ads based on predicted user demographics • Matching the emotional tone the user expects Deliver adds fast Deliver adds to a true crowd vs. vs. vs.
  • 13. Applications: Polling Real-time live polling • Mining political opinions • Voting predictions within certain demographics Large-scale passive polling • Passive poling regarding products and services vs.
  • 14. Applications: Health Large-scale real-time healthcare analytics • Identifying smokers, drug addicts, healthy eaters, people into sports (Paul and Dredze 2011) • Monitoring flue-trends, food poisonings, chronic illnesses (Culotta et. al. 2015)
  • 15. Applications: HR Recruitment and human resource management • Estimating emotional stability and personality of the potential and current employees • Measuring the overall well-being of the employees e.g., life satisfaction, happiness (Schwartz et. al. 2013; Volkova et. al., 2015) • Monitor depression and stress level (Coppersmith et. al. 2014)
  • 16. User Attribute Prediction Task Political Preference Rao et al., 2010; Conover et al., 2011, Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013; Volkova et. al, 2014 . . . Communications Gender Garera and Yarowsky, 2009; Rao et al., 2010; Burger et al., 2011; Van Durme, 2012; Zamal et al., 2012; Bergsma and Van Durme, 2013 Age Rao et al., 2010; Zamal et al., 2012; Cohen and Ruth, 2013; Nguyen et al., 2011, 2013; Sap et al., 2014 … … … … AAAI 2015 Demo (joint work with Microsoft Research) Income, Education Level, Ethnicity, Life Satisfaction, Optimism, Personality, Showing Off, Self-Promoting
  • 17. Tweets Revealing User Attributes ? ? ? ?
  • 18. Supervised Models Classification: binary (SVM) – gender, age, political, ethnicity • Goswami et. al., 2009; Rao et al. 2010; Burger et al. 2011; Mislove et al. 2012; Nguyen et al. 2011; Nguyen et al. 2013; • Pennacchiotti and Popescu 2011; Connover et. al. 2011; Filippova et. al. 2012; Van Durme 2012; Bergsma et. al. 2012, 2013; Bergsma and Van Durme 2013; • Zamal et al. 2012; Ciot et. al. 2013; Cohen and Ruths 2013; • Schwartz et. al. 2013; Sap et. al., 2014; Kern et. al., 2014; Schwartz et. al. 2013; Golbeck et. al. 2011; Kosinski et. al. 2013; • Volkova et. al. 2014; Volkova et al. 2015. Unsupervised and Generative Models • name morphology for gender & ethnicity prediction - Rao et al. 2011; • large-scale clustering - Bergsma et. al. 2013; Culotta et. al. 2015; • demographic language variations - Eisenstein et al. 2010; O’Connor et al. 2010; Eisenstein et. al. 2014. *Rely on more than lexical features e.g., network, streaming
  • 19. Existing Approaches ~1K Tweets* …. … …. … …. … …. … …. … …. … …. … …. … Does an average Twitter user produce thousands of tweets? *Rao et al., 2010; Conover et al., 2011; Pennacchiotti and Popescu, 2011a; Burger et al., 2011; Zamal et al., 2012; Nguyen et al., 2013 Tweets as a document
  • 20. How Active are Twitter Users?
  • 21. Attributed Social Network User Local Neighborhoods a.k.a. Social Circles
  • 22. Approaches Static (Batch) Prediction Streaming (Online) Inference Dynamic (Iterative) Learning and Prediction • Offline training • Offline predictions + Neighbor content • Offline training + Online predictions over time • Exploring 6 types of neighborhoods • Online predictions • Relying on neighbors + Iterative re-training + Active learning + Rationale annotation Topical sparsity Data drift Streaming nature Model generalization
  • 23. Part I Outline I. Batch Prediction i. How to collect and annotate data? ii. What models and features to use? iii. Which neighbors are the most predictive? II. Online Inference i. How to predict from a stream? I. Dynamic (Iterative) Learning and Prediction i. How to learn and predict on the fly?
  • 24. Part I Outline I. Batch Prediction i. How to collect and annotate data? ii. What models and features to use? iii. Which neighbors are the most predictive? II. Online Inference i. How to predict from a stream? I. Dynamic (Iterative) Learning and Prediction i. How to learn and predict on the fly?
  • 25. How to get data? Twitter API • Twitter API: https://dev.twitter.com/overview/api • Twitter API Status:https://dev.twitter.com/overview/status • Twitter API Rate Limits: https://dev.twitter.com/rest/public/rate-limits
  • 26. Querying Twitter API • Twitter Developer Account => access key and token https://dev.twitter.com/oauth/overview/application- owner-access-tokens twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET) I. Access 1% Twitter Firehouse and sample from it II. Query Twitter API to get:  user timelines (up to 3200 tweets) from userIDs  tweet json objects from tweetIDs  lists of friendIDs (5000 per query) from userIDs
  • 28. How to get labeled data? • Supervised classification in a new domain: – Labeled data ≈ ground truth – Costly and time consuming to get! • Ways to get ≈“ground truth” annotations:  Fun psychological tests (voluntarily): myPersonality project  Profile info: Facebook e.g., relationship, gender, age but sparse for Twitter  Self reports: “I am a republican…” (Volkova et al. 2013), “Happy ##th/st/nd/rd birthday to me” (Zamal et. al. 2012), “I have been diagnosed with …” (Coppersmith et. al. 2014), “I am a writer …” (Beller at. al., 2014)  Distant supervision: following Obama vs. Romney (Zamal et. al. 2012), emotion hashtags (Mohammad et. al, 2014), user name (Burger et. al., 2011)  Crowdsourcing: subjective perceived annotations (Volkova et. al.2015), rationales (Bergsma et. al., 2013, Volkova et. al, 2014; 2015) Attribute Model ΦA(u) UL UP
  • 29. Twitter Social Graph friend hashtag reply @mentionfollower retweet I. Candidate-Centric (distant supervision) 1,031 users II. Geo-Centric (self-reports) 270 users III. Politically Active (distant supervision)* 371 users (Dem; Rep) IV. Age (self-reports)* 387 users (18 – 23; 23 - 25) V. Gender (name)* 384 users (Male; Female) Balanced datasets *Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013 Code, data and trained models for gender, age, political preference prediction http://www.cs.jhu.edu/~svitlana/ 10 - 20 neighbors of 6 types per user What types of neighbors lead to the best attribute predictions?
  • 30. Part I Outline I. Batch Prediction i. How to collect and annotate data? ii. What models and features to use? iii. Which neighbors are the most predictive? II. Online Inference i. How to predict from a stream? I. Dynamic (Iterative) Learning and Prediction i. How to learn and predict on the fly?
  • 31. Classification Model • Logistic regression = max entropy = log linear models – Map discrete inputs w to binary output y • Other options: SVM, NB wi = 0,1{ } y = M,F{ } hair eat co ol wor k … xbo x Femal e 1 1 0 0 … 0 Male 0 1 0 1 … 1 Male 0 0 1 1 … 1 http://scikit-learn.org/stable/modules/generated/ sklearn.linear_model.LogisticRegression.html Labeledusers (Training) Vocabulary size hair eat co ol wor k … xbo x ? 0 1 0 0 … 1 Feature vector Test user
  • 32. Features (I) • Lexical: – normalized counts/binary ngrams (Goswami el. al. 2010; Rao et. al. 2010; Pennacchiotti and Popescu 2011; Ngyen et. al. 2013; Ciot et. al. 2013; Van Durme 2012; Kern et. al. 2014; Volkova et. al. 2014; Volkova and Van Durme 2015) – class-based highly predictive (Bergsma and Van Durme 2013), rationales (Volkova and Yarowsky 2014); character-based (Peersman et. al. 2011), stems, co-stems, lemmas (Zamal et. al. 2012; Cohen et. al. 2014) • Socio-linguistic, syntactic and stylistic: – syntax and style (Shler et. al. 2006; Cheng at. al., 2011), smiles, excitement, emoticons and psycho-linguistic (Rao et. al. 2010; Marquardt et. al. 2014; Kokkos et. sl. 2014; Hovy 2015) – lexicon features (Sap et. al. 2014); linguistic inquiry and word count (LIWC) (Mukherjee et. al. 2010; Fink et. al. 2012)
  • 33. Features (II) • Communication behavior: response/retweet/tweet frequency, retweeting tendency (Connover et. al. 2011; Golbeck et. al. 2011; Pennacchiotti and Popescu 2011; Preotic at. al. 2015) • Network structure: follower-following ratio, neighborhood size, in/out degree, degree of connectivity (Bamman et. al. 2012; Filippova 2012; Zamal et. al. 2012, Culotta et. al. 2015) • Other: likes (Bachrach et. al. 2012; Kosinski et. al. 2014), name or census (Burger et. al. 2011; Liu and Ruths 2013), links/images (Rosenthal and McKeown 2011) • Topics: word embeddings, LDA topics, word clusters (Preotic at. al. 2015) hair eat coo l wor k … xbo x RT neig h image s …. Female 1 1 0 0 … 0 0.3 30 0.5 ….
  • 34. Batch Experiments • Log-linear word unigram models: (I) Users vs. (II) Neighbors and (III) User-Neighbor • Evaluate different neighborhood types: – varying neighborhood size n=[1, 2, 5, 10] and content amount t=[5, 10, 15, 25, 50, 100, 200] – 10-fold cross validation with 100 random restarts for every n and t parameter combination F = argmaxa P A = a T( )
  • 35. User Model Fu = D if 1 1+e-q f u ³ 0.5 R otherwise. ì í ï î ï Train Graph vi Test Graph t :…Ron Paul not a fan of Chris Christie ft vj : w1 = 0,w2 = 0,…,wn = 0[ ] t : Washington Post Columnist… ft vi : w1 =1,w2 =1,…,wn = 0[ ] vj t : We're watching you House @GOP ft vk = w1 =1,w2 =1,…,wn = 0[ ] vk -?
  • 36. Neighbor Model HLTCOE Text Meeting, June 09 2014 Train Graph vi Test Graph t :Obama: I'd defend @MajorCBS ft N vi( ) = w1,w2,…,wn[ ] vj t :@FoxNews: WATCH LIVE ft N vk( ) = w1,w2,…,wn[ ] t : The Lyin King #RepMovies ft N vj( ) = w1,w2,…,wn[ ] vk -? F N u( ) = D if 1 1+e-q f N u( ) ³ 0.5 R otherwise. ì í ï î ï
  • 37. Joint User-Neighbor Model Train Graph vi Test Graph t :…Ron Paul not a fan of Christie The Lyin King #RepublicanMovies ft vj +N vj( ) = w1,w2,…,wn[ ] t : Washington Post Columnist… Obama: I'd defend @MajorCBS ft vi+N vi( ) = w1,w2,…,wn[ ] vj t :@FoxNews: WATCH LIVE We're watching you House @GOP ft vk +N vk( ) = w1,w2,…,wn[ ] vk -? Learning on user and neighbor features jointly (not prefixing features) F u+N u( ) = D if 1 1+e-q f u+N u( ) ³ 0.5 R otherwise. ì í ï î ï
  • 38. Part I Outline I. Batch Prediction i. How to collect and annotate data? ii. What models and features to use? iii. Which neighbors are the most predictive? II. Online Inference i. How to predict from a stream? I. Dynamic (Iterative) Learning and Prediction i. How to learn and predict on the fly?
  • 39. Gender Prediction ? 5 10 15 20 0 50 100 150 200 Tweets Neighbors retweet.counts usermention.counts 0.73 5 10 20 50 100 500 0.500.600.700.80 Tweets Per User Accuracy Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR 5 10 15 20 0 50 100 150 200 Tweets Neighbors friend.counts usermention.binary Neighbor: 0.63 User-Neigh: 0.73 User: 0.82 40
  • 41. Gender Prediction Quality Approach Users Tweets Features Accuracy Rao et al., 2010 1K 405 BOW+socioling 0.72 Burger et al., 2011 184K 22 username, BOW 0.92 Zamal et al., 2012 384 10K neighbor BOW 0.80 Bergsma et al., 2013 33.8K − BOW, clusters 0.90 JHU models 383 200/2K BOW user/neigh 0.82/0.73 • This is not a direct comparison => Twitter data sharing restrictions • Poor generalization: different datasets = different sampling and annotation biases
  • 42. Age Prediction 5 10 15 20 0 50 100 150 200 Tweets Neighbors follower.counts friend.counts 5 10 15 20 0 50 100 150 200 Tweets Neighbors friend.counts retweet.counts 5 10 20 50 100 500 0.500.600.70 Tweets Per User Accuracy Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR ? User: 0.77 Neighbor: 0.72 User-Neigh: 0.77 18 – 23 23 – 25
  • 44. Age Prediction Quality ? Approach Users Tweets Groups Features Accuracy Rao et al., 2010 2K 1183 <=30; > 30 BOW+socioling 0.74 Zamal et al., 2012 386 10K 18 – 23; 23 - 25 neighbor BOW 0.80 JHU models 381 200/2K 18 – 23; 23 - 25 BOW/neighbors 0.77/0.74 • This is not a direct comparison! • Performance for different age groups • Sampling and annotation biases
  • 45. Political Preference 5 10 15 20 0 50 100 150 200 Tweets Neighbors friend.counts retweet.binary 5 10 15 20 0 50 100 150 200 Tweets Neighbors friend.counts retweet.binary usermention.binary 5 10 20 50 100 500 0.550.650.750.85 Tweets Per User Accuracy Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR Useruni Userbin User bi Usertri UserOnlyZLR ? 0.91 User: 0.89 User-Neigh: 0.92 Neighbor: 0.91
  • 46. Lexical Markers for Political Preferences
  • 47. Model Generalization • Political preference classification is not easy! • Topical sparsity: average users rarely tweet about politics 0.57 0.67 0.690.72 0.75 0.870.89 0.91 0.92 0.00 0.20 0.40 0.60 0.80 1.00 User Neighbor User-Neighbor Accuracy Geo-centric Cand-centric Active
  • 48. Approach Users Tweets Features Accuracy Bergsma et al., 2013 400 5K BOW, clusters 0.82 Pennacchiiotti 2011 10.3K − BOW, network 0.89 Conover et al., 2011 1K 1K BOW, network 0.95 Zamal et al., 2012 400 1K neighbor BOW 0.91 JHU active 371 200 BOW user/neigh 0.89/0.92 JHU cand centric 1,051 200 BOW user/neigh 0.72/0.75 Political Preference Prediction Quality JHU geo-centric 270 200 BOW user/neigh 0.57/0.67 Cohen et al., 2013 262 1K BOW, network 0.68 Politically Active Users (sampling/annotation bias) Random /Average Users
  • 49. Querying more neighbors with less tweets is better than querying more tweets from the existing neighbors Limited Twitter API Calls
  • 50. Optimizing Twitter API Calls Cand-Centric Graph: Friend Circle ?
  • 51. Optimizing Twitter API Calls Cand-Centric Graph: Friend Circle ?
  • 52. Optimizing Twitter API Calls Cand-Centric Graph: Friend Circle ?
  • 53. Optimizing Twitter API Calls Cand-Centric Graph: Friend Circle ?
  • 54. Summary: Static Prediction • Features: Binary (political) vs. count-based features (age, gender) • Homophily: “neighbors give you away” => users with no content • Attribute assortativity: similarity with neighbors depends on attribute types • Content from more neighbors per user >> additional content from the existing neighbors • Generalization of the classifiers FollowerFriend Retweet Mention MentionFriend N UN
  • 55. Part I Outline I. Batch Prediction i. How to collect and annotate data? ii. What models and features to use? iii. Which neighbors are the most predictive? II. Online Inference i. How to predict from a stream? I. Dynamic (Iterative) Learning and Prediction i. How to learn and predict on the fly?
  • 56. Iterative Bayesian Predictions Time t1 t2 tk… Pt1 R Tt1( )= 0.52 Ptk R Ttk( )= 0.77 tk-1 Pt2 R Tt2( )= 0.62 ? P a = R T( )= P tk a = R( )×P a = R( ) k Õ P tk a = R( )×P a = R( ) k Õ + P tk a = D( )×P a = D( ) k Õ Pt2 R Ttk-1( )= 0.65 t0 ? Class prior Likelihood Posterior
  • 57. Cand-Centric Graph: Posterior Updates 0.5 0.6 0.7 0.8 0.9 1.0 0 20 40 60 p(Republican|T) 0.3 0.4 0.5 blican|T) t2 ? t0 … Time t1 tk-1 t2 ? t0 … Time t1 tk-1 0.5 0.6 0 20 40 60 0.0 0.1 0.2 0.3 0.4 0.5 0 20 40 60 Tweet Stream (T) p(Republican|T)
  • 58. Cand-Centric: Prediction Time (1) 300 400 500 0 1 2 3 4 5 Time in Weeks Users User-Neighbor 300 400 500 0 5 10 15 Time in Weeks Users 0.75 0.95 User Stream Dem Rep Prediction confidence: 0.95 vs. 0.75 Democrats are easier to predict than republicans Dem Rep Usersclassified correctly
  • 59. Cand-Centric Graph: Prediction Time (2) 0.02 12 20 0.01 19 8.9 0.002 1.2 3.2 0.001 3.5 1.1 0.001 0.01 0.1 1 10 100 Weeks(logscale) How much time does it take to classify 100 users with 75% confidence? Compare: User Stream vs. Joint User-Neighbor Stream Cand-centric Geo-Centric Active 60
  • 60. Batch vs. Online Performance 0.99 0.84 0.89 0.99 0.88 0.99 0.0 0.2 0.4 0.6 0.8 1.0 Cand Geo Active User Stream User-Neighbor Stream 0.72 0.57 0.75 0.75 0.67 0.86 0.0 0.2 0.4 0.6 0.8 1.0 Cand Geo Active Accuracy User Batch Neighbor Batch ? 61
  • 61. Summary: Online Inference • Homophily: Neighborhood content is useful* • Lessons learned from batch predictions: – Age: user-follower or user-mention joint stream – Gender: user-friend joint stream – Political: user-mention and user-retweet joint stream • Streaming models >> batch models • Activeness: tweeting frequency matters a lot! • Generalization of the classifiers: data sampling and annotation biases *Pennacchiotti and Popescu, 2011a, 2001b; Conover et al., 2011a, 2001b; Golbeck et al., 2011; Zamal et al., 2012; Volkova et. al., 2014
  • 62. Part I Outline I. Batch Prediction i. How to collect and annotate data? ii. What models and features to use? iii. Which neighbors are the most predictive? II. Online Inference i. How to predict from a stream? I. Dynamic (Iterative) Learning and Prediction i. How to learn and predict on the fly?
  • 63. Iterative Batch Learning Time R D ? ? t1 t0 t1 tkt2 … t1 LabeledUnlabeled t1 t1 Pt1 R t1( )= 0.52 Ptk R t1…tm( )= 0.77  Iterative Batch Retraining (IB)  Iterative Batch with Rationale Filtering (IBR) ? tm… tm t2 … t2 … tm t2 …
  • 64. Active Learning LabeledUnlabeled F u,t1( ) F n,t1( ) 1-Jan-2011 1-Feb-2011 1-Nov-2011 1-Dec-2011 Time … … t0 t1 tk-1 tk u ni Î N Pt0 R t1( )= 0.5 Pt1 R t1…t5( )= 0.55 Ptk-1 R t1…t100( )= 0.77 >q  Active Without Oracle (AWOO)  Active With Rationale Filtering (AWR)  Active With Oracle (AWO)
  • 65. Annotator Rationales Rationales are explicitly highlighted ngrams in tweets that best justified why the annotators made their labeling decisions feature norms (psychology), feature sparsity Bergsma and Van Durme, 2013; Volkova and Yarowsky, 2014; Volkova and Van Durme, 2015
  • 66. Alternative: Rationale Weighting • Annotator rationales for gender, age and political: http://www.cs.jhu.edu/~svitlana/rationales.html • Multiple languages: English, Spanish • Portable to other languages Improving Gender Prediction of Social Media Users via Weighted Annotator Rationales. Svitlana Volkova and David Yarowsky. NIPS Workshop on Personalization: Methods and Applications 2014.
  • 67. Performance Metrics • Accuracy over time: • Find optimal models: – Data steam type (user, friend, user + friend) – Time (more correctly classified users faster) – Prediction quality (better accuracy over time) Aq,t = #correctly classified #above threshold = TD+TR D+ R
  • 68. Results: Iterative Batch Learning 0.0 0.2 0.4 0.6 0.8 1.0 50 100 150 200 250 300 Mar Jun Sep Accuracy Correctlyclassified user user + friend 0.0 0.2 0.4 0.6 0.8 1.0 50 100 150 200 250 300 Mar Jun Sep Accuracy Correctlyclassified user user + friend IB: higher recall IBR: higher precision Time: # correctly classified users increases over time IB faster, IBR slower Data stream selection: User + friend stream > user stream
  • 69. Results: Active Learning AWOO: higher recall AWR: higher precision Time: Unlike IB/IBR models, AWOO/AWR models classify more users correctly faster (in Mar) but then plateaus 0.0 0.2 0.4 0.6 0.8 1.0 50 100 150 200 250 300 Mar Jun Sep Accuracy Correctlyclassified user user + friend 0.0 0.2 0.4 0.6 0.8 1.0 50 100 150 200 250 300 Mar Jun Sep Accuracy Correctlyclassified user user + friend
  • 70. 0.5 0.6 0.7 0.8 0.9 1.0 Mar Jun Sep Accuracy IB: user IBR: user 0.5 0.6 0.7 0.8 0.9 1.0 Mar Jun Sep Accuracy AWOO: user AWR: user 0.5 0.6 0.7 0.8 0.9 1.0 Mar Jun Sep Accuracy IBR: user + friend IB: user + friend 0.5 0.6 0.7 0.8 0.9 1.0 Mar Jun Sep Accuracy AWR: user + friend AWOO: user + friend batch < active user+friend>user Results: Model Quality
  • 71. Active with Oracle Annotations 50 160 182 198 213 234 50 125 200 275 350 Feb Apr Jun Aug Oct Dec Cumul.requests toOracle Users in training for user only model user user + friend 1.0 1.7 16.8 34.0 30.5 63.9 47.9 103.0 71.3 157.4 122.9 271.0 50 100 150 200 250 Feb Apr Jun Aug Oct Dec Correctlyclassified user friend Oracle is 100% correct Thousands of tweets in training
  • 72. Summary: Dynamic Learning and Prediction • Active learning > iterative batch • N, UN > U: “neighbors give you away” • Higher confidence => higher precision, lower confidence => higher recall (as expected) • Rationales significantly improve results
  • 73. Practical Recommendations: Models for Targeted Advertising Prediction quality (better accuracy over time) Time (correctly classified users faster) Data steam (user, friend or joint) Models with rationale filtering IBR, AWR Higher confidence threshold 0.95 Models without rationale filtering IB, AWOO Lower confidence threshold 0.55 User + Friend > User
  • 74. Recap: Why these models are good? • Models streaming nature of social media • Limited user content => take advantage of neighbor content • Actively learn from crowdsourced rationales • Learn on the fly => data drift • Predict from multiple streams => topical sparsity • Flexible extendable framework: – More features: word embeddings, interests, profile info, tweeting behavior
  • 75. Software Requirements • Python: https://www.python.org/downloads/ python –V • Pip: https://pip.pypa.io/en/latest/installing.html python get-pip.py • Twython: https://pypi.python.org/pypi/twython/ pip install twython • matplotlib 1.3.1: http://sourceforge.net/projects/matplotlib/files/matplotlib/ • numpy 1.8.0: http://sourceforge.net/projects/numpy/files/NumPy/ • scipy 0.13: http://sourceforge.net/projects/scipy/files/scipy/ • scikit-learn 0.14.1: http://sourceforge.net/projects/scikit-learn/files/ python -c "import sklearn; print sklearn.__version__" python -c "import numpy; print numpy.version.version" python -c "import scipy; print scipy.version.version" python -c "import matplotlib; print matplotlib.__version__"
  • 76. Part II. Practice Session Outline • Details on data collection and annotation – JHU: gender, age and political preferences – MSR: emotions, opinions and psycho-demographics • Python examples for static inference – Tweet-based: emotions – User-based: psycho-demographic attributes • Python examples for online inference – Bayesian updates from multiple data streams
  • 77. Part II. Practice Session Outline • Details on data collection and annotation – JHU: gender, age and political preferences – MSR: emotions, opinions and psycho-demographics • Python examples for static inference – Tweet-based: emotions – User-based: psycho-demographic attributes • Python examples for online inference – Bayesian updates from multiple data streams
  • 78. JHU: Data Overview and Annotation Scheme friend hashtag reply @mentionfollower retweet Political Preferences: – Candidate-Centric = 1,031 users (follow candidates) – Geo-Centric = 270 users (self-reports in DE, MD, VA) – Politically Active* = 371 users (active & follow cand) Age (self-reports)*: 387 users Gender (name)*: 384 users 10 - 20 neighbors of each of 6 types Details on Twitter data collection: http://www.cs.jhu.edu/~svitlana/data/data_collection.pdf *Pennacchiotti and Popescu, 2011; Zamal et al., 2012; Cohen and Ruths, 2013 Explain relationships
  • 79. Links to Download JHU Attribute Data • How does the data look like? – graph_type.neighbor_type.tsv e.g., cand-centric.follower.tsv • JHU gender and age: http://www.cs.jhu.edu/~svitlana/data/graph_gender_age.tar.gz • JHU politically active*: http://www.cs.jhu.edu/~svitlana/data/graph_zlr.tar.gz • JHU candidate- centric:http://www.cs.jhu.edu/~svitlana/data/graph_cand.tar.gz • JHU geo- centric:http://www.cs.jhu.edu/~svitlana/data/geo_cand.tar.gz
  • 80. Code to query Twitter API • Repo: https://bitbucket.org/svolkova/queryingtwitter – get lists of friends/followers for a user – 200 recent tweets for k randomly sampled retweeted or mentioned users – tweets for a list of userIDs JSON Objects Extract text fields time, #friends Tweet Collection userIDs/tweetIDs
  • 81. Part II. Practice Session Outline • Data and annotation schema description – JHU: gender, age and political preferences – MSR: emotions, opinions and psycho-demographics • Python examples for static inference: – Tweet-based: emotions – User-based: psycho-demographic attributes • Python examples for streaming inference: – Bayesian updates from multiple data streams
  • 82. MSR: Psycho-Demographic Annotations via Crowdsourcing 5K profiles 0.0 0.5 1.0 Intelligence Relationship Religion Political Education Optimism Income Life… Age Children Gender Ethnicity Cohen's Kappa (2% random sample) Attribute Models ΦA(u) UL UP 5K Millions! Trusted crowd $6/hour quality control
  • 83. MSR: Emotion Annotations via Distant Supervision. Tweets are labeled with the 6 Ekman emotions via hashtags (Mohammad et al., 2014), extended with emotion-synonym hashtags.
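The hashtag-based distant supervision above can be sketched as follows. The seed hashtag lists here are illustrative assumptions (the tutorial follows Mohammad et al.'s hashtag sets plus synonyms); only tweets ending in a seed hashtag are labeled, and the label hashtag is stripped so it cannot leak into the features.

```python
# Hypothetical seed hashtags per Ekman emotion (illustrative only).
EMOTION_HASHTAGS = {
    "joy": {"#joy", "#happy"},
    "sadness": {"#sadness", "#sad"},
    "fear": {"#fear", "#scared"},
    "anger": {"#anger", "#angry"},
    "surprise": {"#surprise", "#surprised"},
    "disgust": {"#disgust", "#disgusted"},
}

def distant_label(tweet):
    """Label a tweet with an emotion if it ends in a seed hashtag,
    and strip that hashtag from the text."""
    tokens = tweet.strip().split()
    if not tokens:
        return None, tweet
    last = tokens[-1].lower()
    for emotion, tags in EMOTION_HASHTAGS.items():
        if last in tags:
            return emotion, " ".join(tokens[:-1])
    return None, tweet

label, text = distant_label("stuck in traffic again #angry")
```

Restricting to trailing hashtags mirrors the cleaning step mentioned in the speaker notes (short tweets removed, hashtags only at the end, English only).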
  • 84. Part II. Practice Session • Data and annotation schema description – JHU: gender, age and political preferences – MSR: emotions, opinions and psycho-demographics • Python examples for static inference: – Tweet-based: emotions – User-based: psycho-demographic attributes • Python examples for streaming inference: – Bayesian updates from multiple data streams
  • 85. How to get MSR models and code? https://bitbucket.org/svolkova/psycho-demographics 1. Load models for 15 psycho-demographic attributes + emotions 2. Extract features from input tweets 3. Apply pre-trained models to make predictions for input tweets
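The three steps above (load model, extract features, predict) can be sketched with scikit-learn. This is not the repo's actual code: the training texts, labels, and the choice of binary unigram features are toy assumptions, and a real run would unpickle the repo's pre-trained models instead of fitting one in place.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Step 1 in the repo would be loading a pickled pre-trained model;
# here we fit a toy one so the sketch is self-contained.
train_texts = ["love shopping with my girls",
               "watching the game with the guys"]
train_labels = ["female", "male"]

# Step 2: lexical features (normalized binary ngrams, unigrams here).
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(train_texts)
model = LogisticRegression().fit(X, train_labels)

# Step 3: apply the model to new input tweets.
new_tweets = ["shopping with my girls today"]
predictions = model.predict(vectorizer.transform(new_tweets))
```

The repo's models cover 15 psycho-demographic attributes plus emotions; each would be applied to the same extracted feature vectors.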
  • 86. Predictive Models. Supervised text classification with log-linear models. User-based features: • Lexical: normalized binary/count-based ngrams • Affect: emotions, sentiments. Tweet-based features: • BOW + negation + stylistic (+0.3 F1) • Socio-linguistic and stylistic cues: elongations (Yaay, woooow), capitalization (COOL), mixed punctuation (???!!!), hashtags and emoticons. Decision rule: F(u) = a0 if 1/(1 + e^{-θ·φ}) ≥ 0.5, a1 otherwise.
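The socio-linguistic and stylistic cues listed above can be extracted with a few regular expressions. A minimal sketch (the exact patterns, e.g. requiring three repeated letters for an elongation, are assumptions, not the tutorial's exact definitions):

```python
import re

def stylistic_features(tweet):
    """Binary stylistic cues: elongations, all-caps words,
    mixed punctuation, hashtags, emoticons."""
    return {
        "has_elongation": bool(re.search(r"(\w)\1{2,}", tweet)),   # woooow
        "has_all_caps": bool(re.search(r"\b[A-Z]{3,}\b", tweet)),  # COOL
        "has_mixed_punct": bool(re.search(r"[?!]{2,}", tweet)),    # ???!!!
        "has_hashtag": "#" in tweet,
        "has_emoticon": bool(re.search(r"[:;]-?[)(DP]", tweet)),   # :) ;-D
    }

feats = stylistic_features("woooow this is COOL ???!!! #win :)")
```

These binary indicators would be appended to the BOW feature vector before training the log-linear model.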
  • 87. Tweet-Based: Emotion Prediction. 6 classes: joy, sadness, fear, surprise, disgust, anger. F1 scores (higher is better): sadness 0.62, surprise 0.64, fear 0.77, joy 0.79, anger 0.80, disgust 0.92; overall F1 = 0.78 (vs. Roberts'12: 0.67, Qadir'13: 0.53, Mohammad'14: 0.49)
  • 88. User-Based: Attribute Prediction (ROC AUC, higher is better). BOW baseline: religion 0.63, relationship 0.63, age 0.66, political 0.72, children 0.72, optimism 0.72, life satisfaction 0.72, income 0.73, intelligence 0.75, education 0.77, gender 0.90, race 0.93. With emotion/sentiment features (EmoSentOut, EmoSentDiff) plus lexical features: relationship 0.74, religion 0.74, children 0.80, political 0.82, age 0.83, intelligence 0.83, optimism 0.83, life satisfaction 0.84, income 0.85, education 0.88, gender 0.95, ethnicity 0.97, i.e. gains of +0.04 to +0.17 AUC over BOW. Same decision rule: F(u) = a0 if 1/(1 + e^{-θ·φ}) ≥ 0.5, a1 otherwise.
  • 89. Predicting Demographics from User Outgoing Emotions and Opinions. Heatmap (column z-scores) of emotion and opinion features (joy, sadness, fear, anger, surprise, disgust, positive, negative, neutral, emotion and sentiment scores) against predicted attributes (intelligence, race, education, income, gender, age, children, political, religion, relationship, optimism, life satisfaction). Per-attribute AUC ROC ranges from 0.58 to 0.76; 1/3 of the attributes reach AUC >= 75% from these features alone (satisfied vs. dissatisfied, optimist vs. pessimist, no kids, below 25 y.o., female vs. male).
  • 90. How to get JHU models and code?  Ex1: Train and test batch models  Ex2: Train a model from a training file and save it  Ex3: Predict an attribute using a pre-trained model and plot iterative updates  Ex4: Predict and plot iterative updates for multiple attributes using pre-trained models from a single communication stream  Ex5: Predict and plot iterative updates for multiple attributes from multiple communication streams https://bitbucket.org/svolkova/attribute
  • 91. Ex1. Train/Test Batch Models • Run as shown, e.g., for gender • Customize features and model type/parameters • Reports accuracy
  • 92. Ex2. Save Pre-trained Models • Run as shown, e.g., for age • Customize features (process.py) and model type/parameters (predict.py)
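Saving a pre-trained model so the later streaming examples can reload it can be sketched with a scikit-learn `Pipeline` and `pickle`. The file name, training texts, and age labels below are toy assumptions, not the repo's actual data or file layout.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["prom was amazing", "my mortgage payment is due"]
ages = ["below25", "above25"]

# Train once; the Pipeline bundles the feature extractor with the
# classifier so both are serialized together.
pipeline = Pipeline([("feats", CountVectorizer()),
                     ("clf", LogisticRegression())]).fit(texts, ages)

path = os.path.join(tempfile.gettempdir(), "age.model")
with open(path, "wb") as f:
    pickle.dump(pipeline, f)

# Later, in another script (e.g., Ex3-Ex5), reload and predict:
with open(path, "rb") as f:
    reloaded = pickle.load(f)
pred = reloaded.predict(["prom pictures are up"])[0]
```

Bundling the vectorizer in the pickle avoids feature-index mismatches between training and prediction time.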
  • 93. Part II. Practice Session • Data and annotation schema description – JHU: gender, age and political preferences – MSR: emotions, opinions and psycho-demographics • Python examples for static inference: – User-based: psycho-demographics – Tweet-based: emotions, opinions • Python examples for streaming inference: – Bayesian updates from multiple data streams
  • 94. Recap: Iterative Bayesian Updates. As tweets arrive at times t1, t2, ..., tk, the posterior belief that the user is Republican grows, e.g.: P_t1(R | T_t1) = 0.52, P_t2(R | T_t2) = 0.62, ..., P_tk-1(R | T_tk-1) = 0.65, P_tk(R | T_tk) = 0.77. The posterior is the product of per-tweet likelihoods and the class prior, normalized over both classes: P(a=R | T) = Π_k P(t_k | a=R) · P(a=R) / [ Π_k P(t_k | a=R) · P(a=R) + Π_k P(t_k | a=D) · P(a=D) ]
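The update rule on the slide can be implemented in a few lines. A minimal sketch, where the per-tweet likelihood values are made-up numbers for illustration:

```python
def bayesian_update(prior_r, likelihoods_r, likelihoods_d):
    """Iterative posterior P(a=R | t1..tk): fold each tweet's class
    likelihoods into the running products and renormalize after
    every tweet, returning the posterior trajectory over time."""
    score_r, score_d = prior_r, 1.0 - prior_r
    posteriors = []
    for lr, ld in zip(likelihoods_r, likelihoods_d):
        score_r *= lr   # running product of P(t_k | a=R), times prior
        score_d *= ld   # running product of P(t_k | a=D), times prior
        posteriors.append(score_r / (score_r + score_d))
    return posteriors

# Each tweet slightly favours R, so belief drifts upward over time:
traj = bayesian_update(0.5, [0.6, 0.6, 0.6], [0.4, 0.4, 0.4])
```

In practice the products are usually accumulated as log-likelihood sums to avoid numerical underflow over long tweet streams.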
  • 95. Ex3. Iterative Updates for a Single Attribute from a Single Stream
  • 96. Ex4. Iterative Updates for Multiple Attributes from a Single Stream Steps: 1. Loading Models 2. Processing data 3. Setting up train/test priors 4. Making Predictions 5. Plotting results
  • 97. Joint User-Neighbor Streams. Neighbor stream types: friend, follower, @mention, reply, retweet, hashtag.
  • 98. Ex5. Iterative Updates for Multiple Attributes from Joint Streams: per-stream likelihoods are combined into a single posterior.
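Combining several communication streams uses the same Bayesian update as before, only with more evidence per step. A minimal sketch with made-up likelihood values for a user stream and a friend stream:

```python
def joint_stream_posterior(prior_r, streams_r, streams_d):
    """Fold the likelihoods from every stream (user tweets, friend
    tweets, retweets, ...) into one posterior P(a=R): the update rule
    is unchanged, each stream simply contributes more factors."""
    score_r, score_d = prior_r, 1.0 - prior_r
    for stream_r, stream_d in zip(streams_r, streams_d):
        for lr, ld in zip(stream_r, stream_d):
            score_r *= lr
            score_d *= ld
    return score_r / (score_r + score_d)

# Likelihoods P(tweet | a=R) and P(tweet | a=D) from two streams:
user_r, user_d = [0.6, 0.7], [0.4, 0.3]    # the user's own tweets
friend_r, friend_d = [0.55], [0.45]        # a friend's tweets
p = joint_stream_posterior(0.5, [user_r, friend_r], [user_d, friend_d])
```

For multiple attributes, the same combination would be run once per attribute with that attribute's pre-trained likelihood model.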

Editor's Notes

  • #5: University of Cambridge: You Are What You Like - http://youarewhatyoulike.com/ Psycho-demographic trait prediction engine - http://applymagicsauce.com/
  • #7: stress explicitly that there may be rules and regulations governing how you use different sorts of social media data the audience should know that privacy concerns are a thing they should keep in mind
  • #8: not enough evidence per user
  • #9: not enough evidence per user annotation cost
  • #10: http://twitterpredictor.cloudapp.net
  • #13: Reaching your crowd fast and easy!
  • #14: http://mashable.com/2015/02/16/1-800-flowers-valentines-day/
  • #17: What is social media predictive analytics?
  • #18: The signal that can reveal user attributes is in their language
  • #21: Limitations of the existing “unrealistic and unfair” models trained on thousands of tweets!
  • #34: Add income paper!
  • #35: Add income paper!
  • #43: Different datasets. But if the Twitter users are average, the models should generalize
  • #44: Slide about: regression, different age buckets (18 – 23, <25 or >25)
  • #47: Also run the experiments with prefixed features but got lower performance compared to these experiments where we treat user and neighbor features equally
  • #49: Mention data sampling and annotation biases again!
  • #56: When we rely on neighbor content the prediction accuracy varies across the attributes=> assortativity level
  • #58: Talk about the unbalanced priors and test/train class prior differences
  • #59: As time goes by and we get more evidence (tweets) our belief about a user being R or D improves
  • #60: Dashed line = 75%
  • #61: Dotted line = 75%
  • #62: Add HMM comparison slide
  • #67: Early months => huge boost compared to later months. User tweets are more relevant than UN and N. Mention other ways to incorporate rationales! Scalability to other languages!
  • #69: Threshold: 0.95; 0.75; 0.5
  • #72: Accuracy IBR > IB, AWR > AWOO IBR: higher precision, IB: higher recall
  • #73: Upper bound for classification: UN > N > U. UN and N are better but require more requests to Turk (more money) from the very beginning. Tradeoff between faster predictions with more resources vs. a longer wait to save money
  • #74: Always use a joint stream (explore network content). Deliver ads faster: AWOO, IB. Deliver ads to a more accurate crowd: AWR. Models without rationale filtering require more computation (high-dimensional feature vectors)
  • #76: What we can and can't do! Can predict, but want to predict better. Moving from binary to multiclass is very hard. Changing attributes
  • #77: Scikit-learn: http://scikit-learn.org/stable/install.html pip install -U numpy scipy matplotlib scikit-learn python -c "import sklearn; print sklearn.__version__" python -c "import numpy; print numpy.version.version" python -c "import scipy; print scipy.version.version"
  • #78: loading models, extracting features, making predictions
  • #79: loading models, extracting features, making predictions
  • #80: Explain social relationship types
  • #83: loading models, extracting features, making predictions
  • #84: Quality control!!!
  • #85: Hashtag-based labeling was previously used successfully for sentiment; for sentiment we used 2 other existing datasets. No no-emotion class! Cleaning: remove short tweets, hashtags only at the end, English only
  • #86: loading models, extracting features, making predictions
  • #88: Add income paper!
  • #89: For emotion detection, we focus on 6 Ekman emotions: joy, sadness, fear, surprise, anger and disgust. We learn the models using a more advanced feature set in addition to lexical word ngram features, such as stylistic cues: elongated words, capitalization, mixed punctuation, hashtags and emoticons. We also take into account clause-level negation.
  • #90: Report gender, political preference results for Volkova’s data (Zamal)
  • #91: Pessimists: Anger and Negative Optimists, Satisfied with Life: Joy and Positive Liberal: Disgust, Negative Conservative: Fear Older, No Kids: Sadness and Negative With Kids: Fear Female: Joy, Positive, Anger and Sadness Male: Neutral
  • #93: If you want user-based, just parse tsv files differently
  • #94: Tweet-based models for attribute predictions
  • #95: loading models, extracting features, making predictions
  • #96: Talk about the unbalanced priors and test/train class prior differences