From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series
Brendan O’Connor† Ramnath Balasubramanyan† Bryan R. Routledge§ Noah A. Smith†
brenocon@cs.cmu.edu rbalasub@cs.cmu.edu routledge@cmu.edu nasmith@cs.cmu.edu

† School of Computer Science, Carnegie Mellon University
§ Tepper School of Business, Carnegie Mellon University

Abstract

We connect measures of public opinion measured from polls with sentiment measured from text. We analyze several surveys on consumer confidence and political opinion over the 2008 to 2009 period, and find they correlate to sentiment word frequencies in contemporaneous Twitter messages. While our results vary across datasets, in several cases the correlations are as high as 80%, and capture important large-scale trends. The results highlight the potential of text streams as a substitute and supplement for traditional polling.

Introduction

If we want to know, say, the extent to which the U.S. population likes or dislikes Barack Obama, an obvious thing to do is to ask a random sample of people (i.e., poll). Survey and polling methodology, extensively developed through the 20th century (Krosnick, Judd, and Wittenbrink 2005), gives numerous tools and techniques to accomplish representative public opinion measurement.

With the dramatic rise of text-based social media, millions of people broadcast their thoughts and opinions on a great variety of topics. Can we analyze publicly available data to infer population attitudes in the same manner that public opinion pollsters query a population? If so, then mining public opinion from freely available text content could be a faster and less expensive alternative to traditional polls. (A standard telephone poll of one thousand respondents easily costs tens of thousands of dollars to run.) Such analysis would also permit us to consider a greater variety of polling questions, limited only by the scope of topics and opinions people broadcast. Extracting the public opinion from social media text provides a challenging and rich context to explore computational models of natural language, motivating new research in computational linguistics.

In this paper, we connect measures of public opinion derived from polls with sentiment measured from analysis of text from the popular microblogging site Twitter. We explicitly link measurement of textual sentiment in microblog messages through time, comparing to contemporaneous polling data. In this preliminary work, summary statistics derived from extremely simple text analysis techniques are demonstrated to correlate with polling data on consumer confidence and political opinion, and can also predict future movements in the polls. We find that temporal smoothing is a critically important issue to support a successful model.

Data

We begin by discussing the data used in this study: Twitter for the text data, and public opinion surveys from multiple polling organizations.

Twitter Corpus

Twitter is a popular microblogging service in which users post messages that are very short: less than 140 characters, averaging 11 words per message. It is convenient for research because there are a very large number of messages, many of which are publicly available, and obtaining them is technically simple compared to scraping blogs from the web.

We use 1 billion Twitter messages posted over the years 2008 and 2009, collected by querying the Twitter API,1 as well as archiving the “Gardenhose” real-time stream. This comprises a roughly uniform sample of public messages, in the range of 100,000 to 7 million messages per day. (The primary source of variation is growth of Twitter itself; its message volume increased by a factor of 50 over this two-year time period.)

Most Twitter users appear to live in the U.S., but we made no systematic attempt to identify user locations or even message language, though our analysis technique should largely ignore non-English messages.

There probably exist many further issues with this text sample; for example, the demographics and communication habits of the Twitter user population probably changed over this time period, which should be adjusted for given our desire to measure attitudes in the general population. There are clear opportunities for better preprocessing and stratified sampling to exploit these data.

1 This scraping effort was conducted by Brendan Meeder.

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

To appear in: Proceedings of the International AAAI Conference on Weblogs and Social Media, Washington, DC, May 2010.

Public Opinion Polls

We consider several measures of consumer confidence and political opinion, all obtained from telephone surveys to participants selected through random-digit dialing, a standard technique in traditional polling (Chang and Krosnick 2003).

Consumer confidence refers to how optimistic the public feels, collectively, about the health of the economy and their personal finances. It is thought that high consumer confidence leads to more consumer spending; this line of argument is often cited in the popular media and by policymakers (Greenspan 2002), and further relationships with economic activity have been studied (Ludvigson 2004; Wilcox 2007). Knowing the public’s consumer confidence is of great utility for economic policy making as well as business planning.

Two well-known surveys that measure U.S. consumer confidence are the Consumer Confidence Index from the Conference Board, and the Index of Consumer Sentiment (ICS) from the Reuters/University of Michigan Surveys of Consumers.2 We use the latter, as it is more extensively studied in economics, having been conducted since the 1950s. The ICS is derived from answers to five questions administered monthly in telephone interviews with a nationally representative sample of several hundred people; responses are combined into the index score. Two of the questions, for example, are:

“We are interested in how people are getting along financially these days. Would you say that you (and your family living there) are better off or worse off financially than you were a year ago?”

“Now turning to business conditions in the country as a whole -- do you think that during the next twelve months we’ll have good times financially, or bad times, or what?”

We also use another poll, the Gallup Organization’s “Economic Confidence” index,3 which is derived from answers to two questions that ask interviewees to rate the overall economic health of the country. This only addresses a subset of the issues that are incorporated into the ICS. We are interested in it because, unlike the ICS, it is administered daily (reported as three-day rolling averages). Frequent polling data are more convenient for our comparison purpose, since we have fine-grained, daily Twitter data, but only over a two-year period. Both datasets are shown in Figure 1.

Figure 1: Monthly Michigan ICS and daily Gallup consumer confidence poll. (Plot omitted; panels: Gallup Econ. Conf. and Michigan ICS, monthly ticks from 2008-01 to 2009-11.)

For political opinion, we use two sets of polls. The first is Gallup’s daily tracking poll for the presidential job approval rating for Barack Obama over the course of 2009, which is reported as 3-day rolling averages.4 These data are shown in Figure 2.

The second is a set of tracking polls during the 2008 U.S. presidential election cycle, asking potential voters whether they would vote for Barack Obama or John McCain. Many different organizations administered them throughout 2008; we use a compilation provided by Pollster.com, consisting of 491 data points from 46 different polls.5 The data are shown in Figure 3.

Text Analysis

From text, we are interested in assessing the population’s aggregate opinion on a topic. Immediately, the task can be broken down into two subproblems:

1. Message retrieval: identify messages relating to the topic.
2. Opinion estimation: determine whether these messages express positive or negative opinions or news about the topic.

If there is enough training data, this could be formulated as a topic-sentiment model (Mei et al. 2007), in which the topics and sentiment of documents are jointly inferred. Our dataset, however, is asymmetric, with millions of text messages per day (and millions of distinct vocabulary items) but only a few hundred polling data points in each problem. It is a challenging setting to estimate a useful model over the vocabulary and messages. The signal-to-noise ratio is typical of information retrieval problems: we are only interested in information contained in a small fraction of all messages.

We therefore opt to use a transparent, deterministic approach based on prior linguistic knowledge, counting instances of positive-sentiment and negative-sentiment words in the context of a topic keyword.

2 Downloaded from http://www.sca.isr.umich.edu/.
3 Downloaded from http://www.gallup.com/poll/122840/gallup-daily-economic-indexes.aspx.
4 Downloaded from http://www.gallup.com/poll/113980/Gallup-Daily-Obama-Job-Approval.aspx.
5 Downloaded from http://www.pollster.com/polls/us/08-us-pres-ge-mvo.php


Figure 2: 2009 presidential job approval (Barack Obama). (Plot omitted; y-axis: Approve−Disapprove perc. diff., monthly ticks from 2009-01 to 2009-12.)

Figure 3: 2008 presidential elections, Obama vs. McCain (blue and red). Each poll provides separate Obama and McCain percentages (one blue and one red point); lines are 7-day rolling averages. (Plot omitted; y-axis: Percent Support for Candidate, monthly ticks from 2008-02 to 2008-11.)

Figure 4: Fraction of Twitter messages containing various topic keywords, per day. (Plot omitted; panels: obama, mccain, jobs, job, economy, over 2008 to 2010.)

Message Retrieval

We only use messages containing a topic keyword, manually specified for each poll:

• For consumer confidence, we use economy, job, and jobs.
• For presidential approval, we use obama.
• For elections, we use obama and mccain.

Each topic subset contained around 0.1–0.5% of all messages on a given day, though with occasional spikes, as seen in Figure 4. These appear to be driven by news events. All terms have a weekly cyclical structure, occurring more frequently on weekdays, especially in the middle of the week, compared to weekends. (In the figure, this is most apparent for the term job since it has fewer spikes.) Nonetheless, these fractions are small. In the earliest and smallest part of our dataset, the topic samples sometimes come out to just several hundred messages per day; but by late 2008, there are thousands of messages per day for most datasets.

Opinion Estimation

We derive day-to-day sentiment scores by counting positive and negative messages. Positive and negative words are defined by the subjectivity lexicon from OpinionFinder, a word list containing about 1,600 and 1,200 words marked as positive and negative, respectively (Wilson, Wiebe, and Hoffmann 2005).6 We do not use the lexicon’s distinctions between weak and strong words.

A message is defined as positive if it contains any positive word, and negative if it contains any negative word. (This allows for messages to be both positive and negative.) This gives similar results to simply counting positive and negative words on a given day, since Twitter messages are so short (about 11 words).

We define the sentiment score $x_t$ on day $t$ as the ratio of positive versus negative messages on the topic, counting from that day’s messages:

$$x_t = \frac{\mathrm{count}_t(\text{pos. word} \wedge \text{topic word})}{\mathrm{count}_t(\text{neg. word} \wedge \text{topic word})} = \frac{p(\text{pos. word} \mid \text{topic word}, t)}{p(\text{neg. word} \mid \text{topic word}, t)} \tag{1}$$

where the likelihoods are estimated as relative frequencies.

We performed casual inspection of the detected messages and found many examples of falsely detected sentiment. For example, the lexicon has the noun will as a weak positive word, but since we do not use a part-of-speech tagger, this causes thousands of false positives when it matches the verb sense of will.7 Furthermore, recall is certainly very low, since the lexicon is designed for well-written standard English, but many messages on Twitter are written in an informal social media dialect of English, with different and alternately spelled words, and emoticons as potentially useful signals. Creating a more comprehensive lexicon with distributional similarity techniques could improve the system; Velikovich et al. (2010) find that such a web-derived lexicon substantially improves a lexicon-based sentiment classifier.

6 Available at http://www.cs.pitt.edu/mpqa.
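The retrieval and counting steps behind Equation 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the tiny `POSITIVE` and `NEGATIVE` sets stand in for the OpinionFinder lexicon, and `tokenize`, `daily_sentiment_ratio`, and the `(day, text)` input format are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Stand-in word lists (assumption); the paper uses the OpinionFinder
# subjectivity lexicon with roughly 1,600 positive and 1,200 negative words.
POSITIVE = {"good", "great", "hope", "love"}
NEGATIVE = {"bad", "lost", "fear", "worst"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    return [tok.strip(".,!?;:\"'()") for tok in text.lower().split()]


def daily_sentiment_ratio(messages, topic_words):
    """Return {day: x_t} with x_t = (# positive topic msgs) / (# negative topic msgs).

    `messages` is an iterable of (day, text) pairs. A message is positive if it
    contains any positive word and negative if it contains any negative word,
    so a single message may count toward both totals. Days without any negative
    topic message are skipped because the ratio is undefined there.
    """
    pos = defaultdict(int)
    neg = defaultdict(int)
    for day, text in messages:
        tokens = set(tokenize(text))
        if not tokens & topic_words:  # message retrieval: topic messages only
            continue
        if tokens & POSITIVE:
            pos[day] += 1
        if tokens & NEGATIVE:
            neg[day] += 1
    return {day: pos[day] / count for day, count in neg.items() if count > 0}


messages = [
    ("2009-03-01", "Great news on jobs today"),
    ("2009-03-01", "I fear more jobs will be lost"),
    ("2009-03-01", "Good jobs report, but bad outlook"),
    ("2009-03-01", "Love this sandwich"),  # no topic keyword, ignored
]
ratios = daily_sentiment_ratio(messages, topic_words={"jobs"})
```

In this toy input two topic messages are positive and two are negative (the third counts toward both), so the ratio for the day is 1.0. Matching on exact tokens, as here, keeps job and jobs distinct.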
Comparison to Related Work

The sentiment analysis literature often focuses on analyzing individual documents, or portions thereof (for a review, see Pang and Lee, 2008). Our problem is related to work on sentiment information retrieval, such as the TREC Blog Track competitions that have challenged systems to find and classify blog posts containing opinions on a given topic (Ounis, MacDonald, and Soboroff 2008).

The sentiment feature we consider, presence or absence of sentiment words in a message, is one of the most basic ones used in the literature. If we view this system in the traditional light (as subjectivity and polarity detection for individual messages), it makes many errors, like all natural language processing systems. However, we are only interested in aggregate sentiment. A high error rate merely implies the sentiment detector is a noisy measurement instrument. With a fairly large number of measurements, these errors will cancel out relative to the quantity we are interested in estimating, aggregate public opinion.8 Furthermore, as Hopkins and King (2010) demonstrate, it can actually be inaccurate to naïvely use standard text analysis techniques, which are usually designed to optimize per-document classification accuracy, when the goal is to assess aggregate population proportions.

Several prior studies have estimated and made use of aggregated text sentiment. The informal study by Lindsay (2008) focuses on lexical induction in building a sentiment classifier for a proprietary dataset of Facebook wall posts (a web conversation/microblog medium broadly similar to Twitter), and demonstrates correlations to several polls conducted during part of the 2008 presidential election. We are unaware of other research validating text analysis against traditional opinion polls, though a number of companies offer text sentiment analysis basically for this purpose (e.g., Nielsen Buzzmetrics). There are at least several other studies that use time series of either aggregate text sentiment or good vs. bad news, including analyzing stock behavior based on text from blogs (Gilbert and Karahalios 2010), news articles (Lavrenko et al. 2000; Koppel and Shtrimberg 2004) and investor message boards (Antweiler and Frank 2004; Das and Chen 2007). Dodds and Danforth (2009) use an emotion word counting technique for purely exploratory analysis of several corpora.

Figure 5: Moving average $\mathrm{MA}_t$ of sentiment ratio for jobs, under different windows $k \in \{1, 7, 30\}$: no smoothing (gray), past week (magenta), and past month (blue). The unsmoothed version spikes as high as 10, omitted for space. (Plot omitted; y-axis: Sentiment Ratio, monthly ticks from 2008-01 to 2009-11.)

Moving Average Aggregate Sentiment

Day-to-day, the sentiment ratio is volatile, much more than most polls.9 Just like in the topic volume plots (Figure 4), the sentiment ratio rapidly rises and falls each day. In order to derive a more consistent signal, and following the same methodology used in public opinion polling, we smooth the sentiment ratio with one of the simplest possible temporal smoothing techniques, a moving average over a window of the past k days:

$$\mathrm{MA}_t = \frac{1}{k}\left(x_{t-k+1} + x_{t-k+2} + \dots + x_t\right)$$

Smoothing is a critical issue. It causes the sentiment ratio to respond more slowly to recent changes, thus forcing consistent behavior to appear over longer periods of time. Too much smoothing, of course, makes it impossible to see fine-grained changes to aggregate sentiment. See Figure 5 for an illustration of different smoothing windows for the jobs topic.

Correlation Analysis: Is text sentiment a leading indicator of polls?

Figure 6 shows the jobs sentiment ratio compared to the two different measures of consumer confidence, Gallup Daily and Michigan ICS. It is apparent that the sentiment ratio captures the broad trends in the survey data. With 15-day smoothing, it is reasonably correlated with Gallup at r = 73.1%. The most glaring difference is a region of high positive sentiment in May-June 2008. But otherwise, the sentiment ratio seems to pick up on the downward slide of consumer confidence through 2008, and the rebound in February/March of 2009.

7 We tried manually removing this and several other frequently mismatching words, but it had little effect.
8 There is an issue if errors correlate with variables relevant to public opinion; for example, if certain demographics speak in dialects that are harder to analyze.
9 That the reported poll results are less volatile does not imply that they are more accurate reflections of true population opinion than the text.


Figure 6: Sentiment ratio and consumer confidence surveys. Sentiment information captures broad trends in the survey data. (Plot omitted; panels: Sentiment Ratio with k=15, lead=0 and k=30, lead=50; Gallup Economic Confidence; Michigan ICS; monthly ticks from 2008-01 to 2009-11.)

Figure 7: Cross-correlation plots: sensitivity to lead and lag for different smoothing windows. L > 0 means the text window completely precedes the poll, and L < −k means the poll precedes the text. (The window straddles the poll for −k < L < 0.) The L = −k positions are marked on each curve. The two parameter settings shown in Figure 6 are highlighted with boxes. (Plot omitted; panels: Corr. against Gallup for k=7, 15, 30 and Corr. against ICS for k=30, 60; x-axis: Text lead / poll lag from −90 to 90, with regions labeled "Text leads poll" and "Poll leads text".)
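One way curves like those in Figure 7 might be computed is to sweep the lag L for a fixed window k and record the correlation at each setting. The sketch below is a hedged illustration, assuming the text series `x` and poll series `y` are daily lists on the same day index with no gaps; `lagged_correlation` and `cross_correlation_curve` are hypothetical names.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def lagged_correlation(x, y, k, lag):
    """Correlate poll value y[t + lag] with the mean of the k-day text window ending on day t.

    lag > 0 means the text window precedes the poll (text leads poll).
    """
    text, poll = [], []
    for t in range(k - 1, len(x)):
        target = t + lag
        if 0 <= target < len(y):
            text.append(sum(x[t - k + 1 : t + 1]) / k)
            poll.append(y[target])
    return pearson_r(text, poll)


def cross_correlation_curve(x, y, k, lags):
    """Map each lag in `lags` to its text-poll correlation for window k."""
    return {lag: lagged_correlation(x, y, k, lag) for lag in lags}


# Synthetic check: the "poll" is the text series delayed by three days.
x = [float((i * 7) % 11) for i in range(40)]
y = [0.0, 0.0, 0.0] + x[:-3]
curve = cross_correlation_curve(x, y, k=1, lags=range(-5, 6))
best_lag = max(curve, key=curve.get)  # the lag with the highest correlation
```

On this synthetic input the curve peaks at a lag of three days, which is the delay built into `y`.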


When consumer confidence changes, can this first be seen in the text sentiment measure, or in polls? If text sentiment responds faster to news events, a sentiment measure may be useful for economic researchers and policymakers. We can test this by looking at leading versions of text sentiment.

First note that the text-poll correlation reported above is the goodness-of-fit metric for fitting slope and bias parameters a, b in a one variable linear least-squares model:

$$y_t = b + a \sum_{j=0}^{k-1} x_{t-j} + \epsilon_t$$

for poll outcomes $y_t$, daily sentiment ratios $x_j$, Gaussian noise $\epsilon_t$, and a fixed hyperparameter k. A poll outcome is compared to the k-day text sentiment window that ends on the same day as the poll.

We introduce a lag hyperparameter L into the model, so the poll is compared against the text window ending L days before the poll outcome.

$$y_{t+L} = b + a \sum_{j=0}^{k-1} x_{t-j} + \epsilon_t$$

Graphically, this is equivalent to taking one of the text sentiment lines on Figure 6 and shifting it to the right by L days, then examining the correlation against the consumer confidence polls below.

Polls are typically administered over an interval. The ICS is reported once per month (at the end of the month), and Gallup is reported for 3-day windows. We always consider the last day of the poll’s window to be the poll date, which is the earliest possible day that the information could actually be used. Therefore, we would expect both daily measures, Gallup and text sentiment, to always lead ICS, since it measures phenomena occurring over the previous month.

The sensitivity of text-poll correlation to smoothing window and lag parameters (k, L) is shown in Figure 7. The regions corresponding to text preceding or following the poll are marked. Correlation is higher for text leading the poll and not the other way around, so text seems to be a leading indicator. Gallup correlations fall off faster for poll-leads-text than text-leads-poll, and the ICS has similar properties. If text and polls moved at random relative to each other, these cross-correlation curves would stay close to 0. The fact they have peaks at all strongly suggests that the text sentiment measure captures information related to the polls.

Also note that more smoothing increases the correlation: for Gallup, 7-, 15-, and 30-day windows peak at r = 71.6%, 76.3%, and 79.4% respectively. The 7-day and 15-day windows have two local peaks for correlation, corresponding to shifts that give alternate alignments of two different humps against the Gallup data, but the better-correlating 30-day window smooths over these entirely. Furthermore, for the ICS, a 60-day window often achieves higher correlation than the 30-day window. These facts imply that the text sentiment information is volatile, and if polls are believed to be a gold standard, then it is best used to detect long-term trends.

It is also interesting to consider ICS a gold standard and compare correlations with Gallup and text sentiment. ICS and Gallup are correlated (best correlation is r = 86.4% if Gallup is given its own smoothing and alignment at k = 30, L = 20), which supports the hypothesis that they are measuring similar things, and that Gallup is a leading indicator for ICS. Fixed to 30-day smoothing, the sentiment ratio only achieves r = 63.5% under optimal lead L = 50. So it is a weaker indicator than Gallup.

Finally, we also experimented with sentiment ratios for the terms job and economy, which both correlate very poorly with the Gallup poll: 10% and 7% respectively (with the default k = 15, L = 0).10

This is a cautionary note on the common practice of stemming words, which in information retrieval can have mixed effects on performance (Manning, Raghavan, and Schütze 2008, ch. 2). Here, stemming would have conflated job and jobs, severely degrading results.

Forecasting Analysis

As a further validation, we can evaluate the model in a rolling forecast setting, by testing how well the text-based model can predict future values of the poll. For a lag L, and a target forecast date t + L, we train the model only on historical data through day t − 1, then predict using the window ending on day t. The lag parameter L is how many days in the future the forecasts are for. We repeat this model fit and prediction procedure for most days. (We cannot forecast early days in the data, since L + k initial days are necessary to cover the start of the text sentiment window, plus at least several days for training.)

Figure 8: Rolling text-based forecasts (above), and the text sentiment ($\mathrm{MA}_t$) coefficients a for each of the text forecasting models over time (below). (Plot omitted; upper panel: Gallup Economic Confidence with Gallup poll, text forecasts at lead=30, and poll self-forecasts at lead=30; lower panel: Text coef.; monthly ticks from 2008-01 to 2009-11.)

Results are shown in Figure 8. Forecasts for one month in the future (that is, using past text from 44 through 30 days before the target date) achieve 77.5% correlation. This is slightly worse than a baseline to predict the poll from its lagged self ($y_{t+L} \approx b_0 + b_1 y_t$), which has r = 80.4%. Adding the sentiment score to historical poll information as a bivariate model ($y_{t+L} \approx b_0 + b_1 y_t + a\,\mathrm{MA}_{t..t-k+1}$), yields a very small improvement (r = 81.0%).

Inspecting the rolling forecasts and text model coefficient a is revealing. In 2008 and early 2009, text sentiment is a poor predictor of consumer confidence; for example, it fails to reflect a hump in the polls in August and September 2008. The model learns a coefficient near zero (even negative), and makes predictions similar to the poll’s self-predictions, which is possible since the poll’s most recent values are absorbed into the bias term of the text-only model. However, starting in mid-2009, text sentiment becomes a much better predictor, as it captures the general rise in consumer confidence starting then (see Figure 6). This suggests qualitatively different phenomena are being captured by the text sentiment measure at different times. From the perspective of time series modeling, future work should investigate techniques for deciding the importance of different historical signals and time periods, such as vector autoregressions (e.g. Hamilton 1994).

It is possible that the effectiveness of text changes over this time period for reasons described earlier: Twitter itself changed substantially over this time period. In 2008, the site had far fewer users who were probably less representative of the general population, and were using the site differently than users would later.

Figure 9: The sentiment ratio for obama (15-day window), and fraction of all Twitter messages containing obama (day-by-day, no smoothing), compared to election polls (2008) and job approval polls (2009). (Plot omitted; panels: Sentiment Ratio for "obama"; Frac. Messages with "obama"; % Support Obama (Election) and % Pres. Job Approval; monthly ticks from 2008-01 to 2009-12.)

10 We inspected some of the matching messages to try to understand this result, but since the sentiment detector is very noisy at the message level, it was difficult to understand what was happening.
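The rolling forecast procedure from the Forecasting Analysis section can be sketched as a loop that refits a one-variable least-squares line at every step. This is an illustrative reading of the procedure, not the authors' implementation: it assumes the training pairs for day t are those whose poll value was already observed by day t − 1, it uses the smoothed series (the window mean rather than the window sum, which only rescales the slope a), and `fit_line`, `rolling_forecast`, and `min_train` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b + a * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0  # flat line if the inputs do not vary
    return a, mean_y - a * mean_x


def rolling_forecast(ma, y, lag, min_train=5):
    """Forecast y[t + lag] from ma[t], refitting on past data at every day t.

    `ma` is the smoothed sentiment series (None where no full window exists)
    and `y` is the daily poll series on the same day index. Training pairs are
    (ma[s], y[s + lag]) with s + lag <= t - 1, i.e. poll values already
    observed before day t. Returns {target_day_index: forecast}.
    """
    if lag < 0:
        raise ValueError("lag must be non-negative in a forecasting setting")
    forecasts = {}
    for t in range(len(ma)):
        target = t + lag
        if ma[t] is None or target >= len(y):
            continue
        train = [(ma[s], y[s + lag]) for s in range(t - lag) if ma[s] is not None]
        if len(train) < min_train:
            continue  # not enough history to fit a line yet
        a, b = fit_line([p[0] for p in train], [p[1] for p in train])
        forecasts[target] = b + a * ma[t]
    return forecasts


# Synthetic check: the poll is an exact linear function of text two days earlier.
ma = [float(i) for i in range(20)]
y = [0.0, 0.0] + [2.0 * v + 1.0 for v in ma[:-2]]
forecasts = rolling_forecast(ma, y, lag=2, min_train=3)
```

Because the synthetic poll is exactly linear in the lagged text, every forecast in this example matches the corresponding poll value up to rounding. The poll's lagged-self baseline and the bivariate model mentioned in the text would need a second regressor and are left out of this sketch.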

Obama 2009 Job Approval and 2008 Elections

We analyze the sentiment ratio for obama and compare it to two series of polls, presidential job approval in 2009, and presidential election polls in 2008, as seen in Figure 9. The job approval poll is the most straightforward, being a steady decline since the start of the Obama presidency, perhaps with some stabilization in September or so. The sentiment ratio also generally declines during this period, with r = 72.5% for k = 15.

However, in 2008 the sentiment ratio does not substantially correlate to the election polls (r = −8%); we compare to the percent of support for Obama, averaged over a 7-day window of tracking polls (the same information displayed in Figure 3). Lindsay (2008) found that his daily sentiment score was a leading indicator to one particular tracking poll (Rasmussen) over a 100-day period from June-October 2008. Our measure also roughly correlates to the same data, though less strongly (r = 44% vs. r = 57%) and only at different lag parameters.

The elections setting may be structurally more complex than presidential job approval. In many of the tracking polls, people can choose to answer any of Obama, McCain, undecided, not planning to vote, and third-party candidates. Furthermore, the name of every candidate has its own sentiment ratio scores in the data. We might expect the sentiment for mccain to vary inversely with obama, but they in fact slightly correlate. It is also unclear how they should interact as part of a model of voter preferences.

We also found that the topic frequencies correlate with polls much more than the sentiment scores. First note that the message volume for obama, shown in Figure 9, has the usual daily spikes like other words on Twitter shown in Figure 4. Some of these spikes are very dramatic; for example, on November 5th, nearly 15% of all Twitter messages (in our sample) mentioned the word obama.

Furthermore, the obama message volume substantially correlates to the poll numbers. Even the raw volume has a 52% correlation to the polls, and the 15-day window version is up to r = 79%. Simple attention seems to be associated with popularity, at least for Obama. But the converse is not true for mccain; this word’s 15-day message volume also correlates to higher Obama ratings in the polls (r = 74%). A simple explanation may be that frequencies of either term mccain or obama are general indicators of elections news and events, and most 2008 elections news and events were favorable toward or good for Obama. Certainly, topic frequency may not have a straightforward relationship to public opinion in a more general text-driven methodology for public opinion measurement, but given the marked effects it has in these data, it is worthy of further exploration.

Conclusion

In the paper we find that a relatively simple sentiment detector based on Twitter data replicates consumer confidence and presidential job approval polls. While the results do not come without caution, it is encouraging that expensive and


time-intensive polling can be supplemented or supplanted with the
simple-to-gather text data that is generated from online social
networking. The results suggest that more advanced NLP techniques to
improve opinion estimation may be very useful.

The textual analysis could be substantially improved. Besides the clear
need for a better-suited lexicon, the modes of communication should be
considered. When messages are retweets (forwarded messages), should they
be counted? What about news headlines? Note that Twitter is rapidly
changing, and the experiments on recent (2009) data performed best, which
suggests that it is evolving in a direction compatible with our approach,
which uses no Twitter-specific features at all.

In this work, we treat polls as a gold standard. Of course, they are
noisy indicators of the truth (as is evident in Figure 3), just like
extracted textual signals. Future work should seek to understand how
these different signals reflect public opinion either as a hidden
variable, or as measured from more reliable sources like face-to-face
interviews.

Many techniques from traditional survey methodology can also be reused
for automatic opinion measurement. For example, polls routinely use
stratified sampling and weighted designs to ask questions of a
representative sample of the population. Given that many social media
sites include user demographic information, such a design is a sensible
next step.

Eventually, we see this research progressing to align with the more
general goal of query-driven sentiment analysis where one can ask more
varied questions of what people are thinking based on text they are
already writing. Modeling traditional survey data is a useful application
of sentiment analysis. But it is also a stepping stone toward larger and
more sophisticated applications.

Acknowledgments
This work is supported by the Center for Applied Research in Technology
at the Tepper School of Business, and the Berkman Faculty Development
Fund at Carnegie Mellon University. We would like to thank the reviewers
for helpful suggestions, Charles Franklin for advice in interpreting
election polling data, and Brendan Meeder for contribution of the Twitter
scrape.

References
Antweiler, W., and Frank, M. Z. 2004. Is all that talk just noise? The information content of internet stock message boards. Journal of Finance 59(3):1259–1294.
Chang, L. C., and Krosnick, J. A. 2003. National surveys via RDD telephone interviewing vs. the internet: Comparing sample representativeness and response quality. Unpublished manuscript.
Das, S. R., and Chen, M. Y. 2007. Yahoo! for Amazon: Sentiment extraction from small talk on the web. Management Science 53(9):1375–1388.
Dodds, P. S., and Danforth, C. M. 2009. Measuring the happiness of large-scale written expression: Songs, blogs, and presidents. Journal of Happiness Studies 1–16.
Gilbert, E., and Karahalios, K. 2010. Widespread worry and the stock market. In Proceedings of the International Conference on Weblogs and Social Media.
Greenspan, A. 2002. Remarks at the Bay Area council conference, San Francisco, California. http://www.federalreserve.gov/boarddocs/speeches/2002/20020111/default.htm.
Hamilton, J. D. 1994. Time Series Analysis. Princeton University Press.
Hopkins, D., and King, G. 2010. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1):229–247.
Koppel, M., and Shtrimberg, I. 2004. Good news or bad news? Let the market decide. In AAAI Spring Symposium on Exploring Attitude and Affect in Text: Theories and Applications.
Krosnick, J. A.; Judd, C. M.; and Wittenbrink, B. 2005. The measurement of attitudes. The Handbook of Attitudes 21–76.
Lavrenko, V.; Schmill, M.; Lawrie, D.; Ogilvie, P.; Jensen, D.; and Allan, J. 2000. Mining of concurrent text and time series. In Proceedings of the 6th ACM SIGKDD Int’l Conference on Knowledge Discovery and Data Mining Workshop on Text Mining.
Lindsay, R. 2008. Predicting polls with Lexicon. http://languagewrong.tumblr.com/post/55722687/predicting-polls-with-lexicon.
Ludvigson, S. C. 2004. Consumer confidence and consumer spending. The Journal of Economic Perspectives 18(2):29–50.
Manning, C. D.; Raghavan, P.; and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press, 1st edition.
Mei, Q.; Ling, X.; Wondra, M.; Su, H.; and Zhai, C. X. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of the 16th International Conference on World Wide Web.
Ounis, I.; MacDonald, C.; and Soboroff, I. 2008. On the TREC blog track. In Proceedings of the International Conference on Weblogs and Social Media.
Pang, B., and Lee, L. 2008. Opinion Mining and Sentiment Analysis. Now Publishers Inc.
Velikovich, L.; Blair-Goldensohn, S.; Hannan, K.; and McDonald, R. 2010. The viability of web-derived polarity lexicons. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics.
Wilcox, J. 2007. Forecasting components of consumption with components of consumer sentiment. Business Economics 42(4):22–32.
Wilson, T.; Wiebe, J.; and Hoffmann, P. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing.
