Introduction to automated text analyses in the Political Sciences

Atelier documents, archives, discours 2018-2019
Université de Lausanne
20.11.2018
1 pm – 5 pm
DR. CHRISTIAN RAUH
www.christian-rauh.eu

WZB
Our plan for today
▪ Promises and pitfalls of automated text analysis
(or: automated content analysis, text mining, corpus analytics, ...)
▪ Your expectations?
▪ Teaching goal: Enable informed decisions on whether and which
automated content analysis methods are suitable for your research
▪ Discuss the intuition and pragmatic challenges of the most
common political science text analysis methods:
o Corpus construction and discovery
o Dictionary-based analyses
o Text scaling procedures
o Briefly: topic models, natural language processing, machine learning
▪ Running example and tutorials implemented in R:
Climate change in United Nations General Assembly speeches
▪ Please bear your research questions in mind, apply the discussed
ideas to them, and interrupt me whenever something is unclear!

Part 1
Qualitative, quantitative, and automated
120 most frequent words in Krippendorff 2004: chps. 1-2
Size ~ Relative term frequency

Content analysis is ...
− the analysis of (text) documents
▪ Politics usually happens through written or spoken text
▪ Which documents matter for your research question?
▪ Do you cover all of them or only a sample?
− the analysis of messages
▪ Sender → Text → Recipient
▪ What is your object of inference?
− always context-dependent!
▪ All texts are produced for a purpose
▪ How does this purpose relate to your inferences?
▪ Which assumptions do you apply when interpreting the texts?

Content analysis in between ...
− Positivist & interpretative approaches
to scientific inquiry
− Qualitative & quantitative approaches
to social science measurement

Content analysis as a methodology
− Content analysts vs. newspaper readers
− Reliability, replicability, validity
▪ Specify assumptions and benchmarks you apply to texts
▪ Detail interpretation / coding / categorization schemes
▪ If possible: Validate with external data / information
− Unobtrusive / non-reactive measurement

Content analysis: A working definition
“Content analysis is a research technique for
making replicable and valid inferences from
texts (or other meaningful matter) to the
contexts of their use.”
Source: Krippendorff 2004: p. 18

Why automate?
▪ Political texts increasingly available in digitized formats
▪ The challenge: Volume!
o Risk of sampling bias
o Human coding is time- and resource-intensive
− The promise: Automated analyses retrieve theoretically relevant
concepts from complete text corpora at comparatively low cost
− Automated text analyses...
... rely on quantitative representations of source texts
... often apply statistical models based on assumptions about text generation
... are extremely reliable, but have to be validated
... cannot replace careful and close reading of source texts!

Four principles of automated text analysis
(Grimmer and Stewart 2013)
1. All quantitative models of language are wrong – but some are
useful (sometimes)
2. Automated text analyses augment and amplify human
interpretation but do not replace it
3. There is no globally best method for automated text analysis
4. Validate, validate, validate!
▪ Applicability of an automated content analysis can only be
judged against your particular research question and theory!

Part 2
Corpus construction and discovery

Acquiring documents
▪ Which kinds of texts are suitable for automated content analysis?
o Unit of analysis is usually on document level (other units possible)
o Focussed documents preferable (depending on your theoretical concepts)
o Sufficient number of words required (depending on the applied method)
▪ Typical text sources
o Existing corpora: Other social science projects or linguistic resources
o Online databases: e.g. LexisNexis, Factiva, Gale Cengage (newspapers and
press agencies); governments, parliaments, international orgs; etc.
o Web scraping: Press releases, news sites, blogs, Twitter, etc.
(see Munzert et al., 2015, Wiley)
o Scan / OCR of printed matter
▪ Store documents with consistent formats and document names
o Plain txt files work best (converters freely available)
o UTF-8 encoding standard for Latin alphabet

Pre-processing: Turning text into data
(Typical, but not universally applicable steps!)
▪ Remove...
... document “boilerplate” (info not part of the analysed message)
... punctuation, capitalization, numbers
... very common terms (“stop words”) and very uncommon terms (e.g. in <1% of docs)
▪ Lemmatization / Stemming
o Words referring to the same concept mapped to a single root
o {economy, economic, economically} → economi
▪ Turn documents into “bags of words”
o Discards the order in which words occur!
o Unigrams, bigrams ... n-grams
▪ Document frequency matrix
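The pre-processing steps above can be sketched in a few lines. The course tutorials use R; this is only a standalone Python illustration, and the stop word list, the helper names and the two example sentences are assumptions made for the sketch (stemming is left out).

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "a", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, drop punctuation and numbers, remove stop words."""
    # Keeps only runs of the letters a-z, so accented letters would be split.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row of term counts per document."""
    bags = [Counter(preprocess(d)) for d in docs]  # word order is discarded here
    vocab = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocab] for bag in bags]
    return vocab, matrix

docs = [
    "The economy is growing.",
    "Growing emissions threaten the economy and the climate!",
]
vocab, dtm = document_term_matrix(docs)
print(vocab)
print(dtm)
```

Each row of `dtm` is one document's bag of words, each column one term of `vocab`.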

“Bags of words”
(illustration w/out stemming)
Source: python-course.eu

Document frequency matrix (illustration)
Source: http://web.eecs.utk.edu/~mberry/order/node4.html

The potential of discovery
▪ Even if you do not want to apply statistical analyses to your
corpus, a look at the aggregated term frequencies may:
o ... show unknown temporal patterns
o ... provide contextual information for specific concepts
(co-locations, keyword-in-context, synonyms, ...)
o ... guide selection of individual texts for further
human interpretation and coding
o ... give you an aggregated perspective on the discourse
helping to contextualize individual documents therein

Introducing our running example
▪ Speeches in the United Nations General Assembly
Based on the UNGD corpus assembled by Baturo, Dasandi, Mikhaylov (2017, R&P)
▪ What is the context of these documents we need to have in
mind along the sender-message-recipient framework?
o Who speaks when? With what purpose?
o The examples will (try to) make inferences about the ‘senders’,
assuming that speeches reflect state positions along the words used
▪ Climate change as the political issue of interest
o Identified by speeches referring literally to
‘climate(-| )change’ or ‘global(-| )warming’
o 100-term window around these references to see how and what
national delegates say about or associate with climate change
o For a ‘real’ analysis, more fine-tuning will most likely be needed,
bear with me...
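A minimal Python sketch of how such term windows could be cut out of a speech. The regular expression is the one given above; the placeholder token, the function name and the reading of "100-term window" as up to 100 terms on either side of the reference are assumptions of this sketch, not the tutorial's R code.

```python
import re

# Pattern from the slide: 'climate(-| )change' or 'global(-| )warming'.
PATTERN = re.compile(r"climate(-| )change|global(-| )warming", re.IGNORECASE)
MARKER = "CLIMATEREF"  # hypothetical placeholder standing in for each reference

def keyword_windows(text, size=100):
    """Return one token window per reference, up to `size` terms on each side."""
    # Collapse every (possibly two-word) reference into a single token first.
    marked = PATTERN.sub(f" {MARKER} ", text)
    tokens = marked.split()
    windows = []
    for i, tok in enumerate(tokens):
        if tok == MARKER:
            start = max(0, i - size)
            windows.append(tokens[start:i + size + 1])
    return windows

speech = ("We must act now because climate change threatens "
          "small island states everywhere.")
windows = keyword_windows(speech, size=3)
print(windows)
```

The same windows double as a simple keyword-in-context view when printed one per line.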

[Figure; legend: Island states / All others]

Part 3
Dictionary-based text analysis

Basic idea of dictionary-based text analyses
▪ The presence of a set of predefined keywords in a document,
or the rate at which they occur, is used to classify or scale
the document into/along theoretically relevant categories
+ Intuitive and easy to apply
+ Replicable and expandable to various theoretical concepts

Example I: Debating the EU in national
parliaments (Rauh & De Wilde 2018, EJPR)
▪ Question: Do national parliamentary debates enhance the
public accountability of EU decision-making?
(If so, they should mirror EU authority, decision-making, public demand, and
feature a balance of government and opposition...but party strategies…)
▪ Text analysis approach
o Build a full-text corpus of parliamentary debates in various EU
member states (get it here: www.bit.ly/ParlSpeech)
o Find typical ways of referencing the EU polity, politics, and policies
on n-gram level (= reading lots of speeches!)
o Generalize these examples by regular expressions
and build a respective dictionary
o Count, normalize and aggregate EU references to party-month level
o Relate this to relevant external data
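The counting and normalizing step can be sketched as follows. This is an illustrative Python sketch only: the three patterns are a hypothetical mini-dictionary, not the published Rauh & De Wilde term list, and normalizing by the number of whitespace-separated words is an assumption.

```python
import re

# Hypothetical mini-dictionary; NOT the published Rauh & De Wilde term list.
EU_PATTERNS = [
    r"\beuropean (union|commission|parliament)\b",
    r"\beu\b",
    r"\bbrussels\b",
]
EU_REGEX = re.compile("|".join(EU_PATTERNS), re.IGNORECASE)

def eu_reference_rate(speech):
    """Dictionary hits relative to the number of words in the speech."""
    n_words = len(speech.split())
    if n_words == 0:
        return 0.0
    hits = sum(1 for _ in EU_REGEX.finditer(speech))
    return hits / n_words

speech = ("The European Commission proposed it, but the EU budget "
          "is decided in Brussels.")
rate = eu_reference_rate(speech)
print(round(rate, 3))
```

Averaging such rates over all speeches of one party in one month would give the party-month aggregation mentioned above.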

Rauh & De Wilde (2018): Text data

Rauh & De Wilde (2018): Dictionary
(English version, other languages more complex)

Rauh & De Wilde (2018): Descriptive results

Rauh & De Wilde (2018): Multivariate results

Example II: NGOs in the public discourse
on the WTO (Rauh & Bödeker 2014 mimeo)
▪ Question: Which (type of) non-governmental actors participate
in the public discourse on the World Trade Organization?
▪ Approach
o Financial Times (UK), Straits Times (Singapore), New York Times (US)
o Download, parse and clean all 11,388 articles from LexisNexis that
mention the WTO in headline or lead (1985-2012)
o Generate encompassing list of all transnational NGOs
(Sources: WTO stakeholder directory and UN ECOSOC database)
o Tag and count the occurrence of each of these NGOs in the articles
o Manually classify tagged NGOs as ‘business’ or ‘public interest’
o Aggregate and analyse the data

Typical dictionary application:
Sentiment analysis
− Typical application: Sentiment analysis
o Does the analysed message convey information positively or
negatively (tone)?
o Sentiment dictionary: List of terms with individual tone
scores; usually ranging between -1 (negative) and 1 (positive)
o Sentiment at document level: rate at which positively or
negatively connoted words occur (often: relative to the overall
number of terms in document)

Exemplary sentiment dictionary
(Young and Soroka 2011)
Exemplary terms (stemmed) from the Lexicoder Sentiment Dictionary (LSD)
Positive connotation    Negative connotation
ALLEVIAT*               ABSURD*
BENEFIT*                BELLIGEREN*
COOPERAT*               CONFRONT*
DESERV*                 CONTAGIOUS*
EXCITE*                 FALTER*
FAIR*                   HELPLESS*
OUTSTAND*               IDEOLOGUE*
PERFECT*                LOSE*
RESOLV*                 NEGLECT*
USEFUL*                 SCANDAL*
All in all, the LSD has 4,567 unique entries
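A minimal Python sketch of a document-level sentiment score built from the twenty exemplary stems listed above. Treating `*` as simple prefix matching, scoring each hit as +1 or -1 and dividing by the total number of terms are assumptions of this sketch; it ignores negation and is not the Lexicoder implementation used in the tutorials.

```python
import re

# The exemplary stems shown above; the full dictionary is far larger.
POSITIVE = ["alleviat", "benefit", "cooperat", "deserv", "excite",
            "fair", "outstand", "perfect", "resolv", "useful"]
NEGATIVE = ["absurd", "belligeren", "confront", "contagious", "falter",
            "helpless", "ideologue", "lose", "neglect", "scandal"]

def sentiment_score(text):
    """(positive hits - negative hits) / number of terms, between -1 and 1."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    # A trailing '*' in the dictionary is read as "token starts with this stem".
    pos = sum(any(t.startswith(stem) for stem in POSITIVE) for t in tokens)
    neg = sum(any(t.startswith(stem) for stem in NEGATIVE) for t in tokens)
    return (pos - neg) / len(tokens)

positive_example = sentiment_score("Cooperation benefits all, and a fair deal is useful.")
negative_example = sentiment_score("Neglect is absurd.")
print(positive_example, negative_example)
```

Prefix matching also shows why validation matters: `fair*` would count "fairy" as positive, and "unfair" is not recognized at all.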

A sentiment analysis
applied to our running example
▪ How positively or negatively do national delegates in the
United Nations General Assembly speak about climate change?
▪ And: Does this meaningfully capture expressed political
positions on climate change issues?
▪ Approach
▪ Apply the Lexicoder sentiment dictionary with the respective
functions in the quanteda R package to the corpus created above
▪ Normalized sentiment score in 100-term window around climate
change references

Pitfalls of dictionary approaches
− Validity of the derived measures is not guaranteed
o Can the theoretical concepts/objects of interest be captured
at term/document level?
o Do term level scores closely align with the typical word usage in
the analysed context?
▪ Use ‘off-the-shelf’ term lists developed in other contexts
only with extreme caution
▪ Ideally: Develop your own dictionaries tailored to your
research question
▪ Apply/calculate context-specific baselines
▪ In any case: Validate your results!
(e.g. against human coders or external data related to your concepts)
▪ Validated sentiment dictionary for English political language: Young and Soroka (2011, PC)
▪ Validated sentiment dictionary for German political language: Rauh (2018, JTIP)

Automated scaling of texts
− Scaling techniques …
… automatically distribute documents across a latent (underlying) scale
(dimension)
… are used to infer the position of a document’s author
… were mainly developed in studying the ideological positions that drive
party manifestos or political speeches (left-right dimension)
… are increasingly applied to other questions such as lobbying success
− Basic idea
Estimate text positions by focussing on language that discriminates most
strongly among the texts (i.e. give strong weight to terms that occur very
frequently in some texts but only very infrequently in others)

Prominent PolSci scaling approaches
▪ Unsupervised scaling: Wordfish (Slapin and Proksch 2008)
o Assumes that exactly one dimension structures the
text corpus!
o Algorithm weights term frequencies so that there is a
maximum distance between the texts in the corpus
o Rare terms influence the results strongly
o Resulting positions can only be interpreted relative to each other
o Content of the scale has to be interpreted ex-post
▪ Supervised scaling: Wordscores (Laver, Benoit and Garry 2003)
o Researcher supplies reference texts with ‘known’ values
across the latent scale
o Algorithm retrieves and weights the relative term frequencies
in these texts
o Virgin texts are then positioned on the latent dimension along the
weights of the terms they contain
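The Wordscores logic can be sketched in plain Python. This is a simplified illustration of the basic scoring idea in Laver, Benoit and Garry (2003), without the rescaling of virgin-text scores or uncertainty estimates; the function names and the two toy reference texts are assumptions.

```python
from collections import Counter

def relative_freqs(tokens):
    """Relative frequency of each term within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def train_wordscores(reference_texts, reference_scores):
    """Word score = reference scores averaged with weights P(reference | word)."""
    freqs = [relative_freqs(t) for t in reference_texts]
    vocab = set().union(*freqs)
    scores = {}
    for w in vocab:
        f = [fr.get(w, 0.0) for fr in freqs]
        total = sum(f)
        scores[w] = sum(fi / total * a for fi, a in zip(f, reference_scores))
    return scores

def score_text(tokens, word_scores):
    """Mean word score over the tokens that also occur in the reference texts."""
    scored = [word_scores[t] for t in tokens if t in word_scores]
    return sum(scored) / len(scored) if scored else None

# Toy reference texts with 'known' positions -1 and +1 (illustrative only).
sceptic = "climate alarmism hoax climate".split()
activist = "climate crisis action climate".split()
word_scores = train_wordscores([sceptic, activist], [-1.0, 1.0])

virgin_score = score_text("urgent climate action".split(), word_scores)
print(word_scores["climate"], virgin_score)
```

Terms used equally by both reference texts ("climate") get a score of zero and pull virgin texts towards the middle, while terms unseen in the reference texts ("urgent") are ignored.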

Wordfish example
Source: The Monkey Cage / Benjamin Lauderdale

Applying Wordfish
to our running example
▪ What differentiates national delegates in the United Nations
General Assembly according to the relative frequency of
words they use when speaking about climate change?
▪ Approach
▪ Apply the Wordfish algorithm (as implemented in quanteda) to the
corpus of 100-term windows around climate change references
aggregated to country (note: pre-processing matters!)
▪ Scrutinize term weights (‘betas’) and document positions (‘thetas’)
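For orientation, the ‘betas’ and ‘thetas’ refer to the parameters of the Poisson model behind Wordfish. Written in the theta/beta notation used above (Slapin and Proksch use slightly different symbols), the count of term j in document i is modelled roughly as:

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\lambda_{ij} = \exp\left(\alpha_i + \psi_j + \beta_j \, \theta_i\right)
```

Here \alpha_i is a document fixed effect (absorbing document length), \psi_j a term fixed effect (absorbing overall term frequency), \beta_j the weight with which term j discriminates along the latent dimension, and \theta_i the position of document i on that dimension. Because only the product \beta_j \theta_i enters the model, positions are identified only up to direction and scale, which is why they can only be read relative to each other.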

Applying Wordscores
to our running example
▪ To what extent do speeches of national delegates in the UNGA use
the language of climate sceptics or climate activists?
▪ Approach
o Corpus of 3000+ reference texts: scrape climate-change related news (!)
from websites of The Heartland Institute (climate change sceptics or
deniers; reference score: -1) and The Ecologist (climate activists; +1)
o Train a Wordscores model via quanteda on this corpus
and analyze the resulting term weights
o Scale UNGA speeches (pooled by country) along this model and see
whether we find something meaningful

Pitfalls of automated scaling
− Scaling works only:
o with documents that are very focussed on the theorized
dimension (cf. party manifestos vs. newspaper articles)
o if documents come from the same context in which the
language is used identically (political speeches vs. news outlets?)
▪ Scaling procedures make strong assumptions!
▪ Scaling procedures require particularly careful validation!

Part 5
Machine learning and topic models (briefly)

Supervised machine learning - intuition
▪ Basic idea of supervised classification
Algorithm ‘learns’ from (a few) human-coded documents before it
automatically classifies (many) ‘virgin’ texts
▪ Achieved along four (iterative) steps:
1. Construct a training and a test set from your documents
o Human coders apply a coding scheme to two subsets of docs
(-> session 2)
o Size depends on doc length, unique language, number of categories etc.
but usually a small fraction of the overall corpus is enough
2. ‘Learn’ classifier function from the training set
o Training documents used to find a statistical function that best predicts
the human-coded categories along the document-term frequencies
o Different algorithms come with different assumptions
o The RTextTools package implements different algorithms, e.g.

Supervised machine learning - intuition
3. Validate the classifier in the test set
o Use the classifier function from the training set to predict the
categories of documents in the test set
o Does your classifier live up to the ‘gold standard’ of human coding?
o If precision is insufficient, go back to step 1: Either your coding scheme
has to be re-worked, or the training set has to be expanded
4. Classify the ‘virgin’ texts
o If precision is satisfactory, you can reliably and validly classify
all remaining documents of so-far unknown categories
▪ Validation is part of the method!
▪ Required size of human-coded sets decreases with fewer
categories, more discriminatory language and longer documents
▪ ‘Representative’ samples of training and test documents needed
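The train, validate and classify cycle can be sketched with a small classifier written from scratch. The sketch uses multinomial Naive Bayes with add-one smoothing, which is one possible algorithm chosen here for brevity and not necessarily one used in the tutorials; the toy documents and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Step 2: learn class priors and per-class term counts from coded docs."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
    vocab = {t for counts in term_counts.values() for t in counts}
    return class_docs, term_counts, vocab

def predict(model, tokens):
    """Return the class with the highest (log) posterior for one document."""
    class_docs, term_counts, vocab = model
    n_docs = sum(class_docs.values())
    best_label, best_logp = None, -math.inf
    for label in sorted(class_docs):
        total = sum(term_counts[label].values())
        logp = math.log(class_docs[label] / n_docs)
        for t in tokens:
            if t in vocab:  # terms never seen in training are ignored
                logp += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

def precision(model, docs, labels, target):
    """Step 3: share of docs predicted as `target` that coders also coded so."""
    predicted = [predict(model, d) for d in docs]
    gold_of_hits = [gold for pred, gold in zip(predicted, labels) if pred == target]
    if not gold_of_hits:
        return 0.0
    return sum(g == target for g in gold_of_hits) / len(gold_of_hits)

# Step 1: human-coded training and test sets (toy data, illustrative only).
train_docs = [d.split() for d in ["emissions warming climate", "climate treaty emissions",
                                  "trade tariffs exports", "tariffs trade dispute"]]
train_labels = ["climate", "climate", "trade", "trade"]
test_docs = [d.split() for d in ["warming emissions rise", "exports tariffs fall"]]
test_labels = ["climate", "trade"]

model = train_naive_bayes(train_docs, train_labels)
test_precision = precision(model, test_docs, test_labels, target="climate")
# Step 4: classify a so-far uncoded ('virgin') text.
virgin_label = predict(model, "new tariffs on exports".split())
print(test_precision, virgin_label)
```

With real data the test set would be far larger, and recall would usually be reported alongside precision.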

Unsupervised learning
▪ General idea …
Algorithm ‘learns’ both categories and categorization from the
distribution of characteristics in the supplied data
▪ … applied to text analysis
o Which words tend to co-occur? Which clusters can be optimized?
How can documents be distributed over clusters in a statistically
optimal way?
o Researcher does not supply any theoretical categories a priori
(only stopping criteria, e.g. number of clusters, in some approaches)
o Results can only be interpreted ex post
▪ Validation
o Assessing semantic validity requires much contextual knowledge!
o Models not generalizable beyond the data and parameters supplied!

A prominent unsupervised approach:
Topic models (e.g. Blei 2012)
▪ Typical application
Identification and distribution of abstract ‘topics’ in large amounts of
‘documents’ without prior knowledge/assumptions on these topics
▪ Assumptions
o Topics are defined by frequency distributions of co-occurring terms
o Texts are assumed to be generated by first choosing a topic composition
(possibly several topics per text) and only then choosing words accordingly
▪ Estimation
o Algorithm reverse-engineers this assumed text creation process by
asking: Which latent topic distribution would explain the observed
word frequency distribution best?
o Cluster-analysis on term level, probabilistic distribution of documents
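The assumed text creation process can be made concrete by running it forwards. The Python sketch below only generates a document from given topics; it does not estimate a topic model. The two topics, their word probabilities, the Dirichlet parameter `alpha` and the function name are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical topics: each is a probability distribution over terms.
TOPICS = {
    "climate": {"emissions": 0.5, "warming": 0.3, "sea": 0.2},
    "security": {"conflict": 0.5, "peace": 0.3, "weapons": 0.2},
}

def sample_document(topics, alpha, n_words, rng):
    """Generate one document under the assumed two-stage process."""
    # Stage 1: draw the document's topic composition from a symmetric
    # Dirichlet(alpha), built here from normalised gamma draws.
    names = list(topics)
    gammas = [rng.gammavariate(alpha, 1.0) for _ in names]
    total = sum(gammas)
    mixture = [g / total for g in gammas]
    # Stage 2: for every word position, pick a topic, then a word from it.
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]
        vocab, probs = zip(*topics[topic].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return mixture, words

rng = random.Random(42)  # fixed seed so the sketch is reproducible
mixture, words = sample_document(TOPICS, alpha=0.5, n_words=8, rng=rng)
print(dict(zip(TOPICS, mixture)))
print(words)
```

Estimation works in the opposite direction: given only many such word lists, the algorithm searches for the topics and mixtures that would most plausibly have produced them.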

The logic of topic models presented by
their inventor (Blei 2012)

Typical output of topic models
(Blei 2012)

Pitfalls of topic models
▪ Caution!
o High interpretation demand after the analysis!
o Results are not very robust and strongly depend on the specific
data set and model parameters (seed and number of topics)
o What exactly is a topic (cf. “frame”, “issue”, “narrative”, “event”)?
▪ Useful tool to explore very large collections and
to narrow down more targeted samples
▪ Systematic analysis and especially comparisons across topics
only with greatest caution (robustness and model fit of topic
models are a current research frontier).

Outlook and conclusions

What we could not speak about…
▪ Word vector models
Representation of words in high-dimensional spaces allows
analysing proximity of concepts and evolution of narratives
over time …
▪ Text similarity / plagiarism measures …
Analysing changes of word order e.g. highly useful to study
consecutive drafts of policies, treaties, etc. (e.g. Rauh, 2018) …
▪ Part-of-speech tagging and grammatical parsing
Retaining grammatical structure (contrast to bags of words)
allows more targeted study of subject-object relations (e.g.
predicting conflict intensity from newswires, Schrodt 2011)…

Promises and pitfalls of
automated content analyses
+ A more complete and reliable analysis of social phenomena
o Analysis of very large document sets achievable at low cost
o Reduced / removed sampling bias
+/- Human resources remain significant
o Dictionary development, coding of reference texts and
especially validation require intense human engagement
- Context dependency more pronounced
o Quantitative representations of language cannot abstract from
varying contexts (human coders can)
- Reliability is partially traded against validity
o Power of automated analyses declines quickly with the
complexity of theoretical concepts

Conclusions
▪ Automated text analyses are a powerful, yet not a definitive
tool for content analysis in the Political and Social Sciences
▪ The computer allows us to digest larger amounts of
information and uncovers patterns on much more aggregated
levels, but interpretation, contextualisation, and validation
remain key responsibilities of the researcher!
Thank you for your attention!
Slides and tutorials available at www.christian-rauh.eu/teaching
