Introduction to
automated text analyses
in the political sciences
Atelier documents, archives, discours 2018-2019
Université de Lausanne
20.11.2018
1 pm – 5 pm
DR. CHRISTIAN RAUH
www.christian-rauh.eu
WZB
Our plan for today
 Promises and pitfalls of automated text analysis
(or: automated content analysis, text mining, corpus analytics, ...)
 Your expectations?
 Teaching goal: Enable informed decisions on whether and which
automated content analysis methods are suitable for your research
 Discuss the intuition and pragmatic challenges of the most
common political science text analysis methods:
o Corpus construction and discovery
o Dictionary-based analyses
o Text scaling procedures
o Briefly: topic models, natural language processing, machine learning
 Running example and tutorials implemented in R
Climate change in United Nations General Assembly speeches
 Please bear your research questions in mind, apply the discussed
ideas to them, and interrupt me whenever something is unclear!
Part 1
Qualitative, quantitative, and automated
120 most frequent words in Krippendorff 2004: chs. 1-2
Size ~ Relative term frequency
Content analysis is ...
− the analysis of (text) documents
 Politics usually happens through written or spoken text
 Which documents matter for your research question?
 Do you cover all of them or only a sample?
− the analysis of messages
 Sender → Text → Recipient
 What is your object of inference?
− always context-dependent!
 All texts are produced for a purpose
 How does this purpose relate to your inferences?
 Which assumptions do you apply when interpreting the texts?
Content analysis in between ...
−Positivist & interpretative approaches
to scientific inquiry
−Qualitative & quantitative approaches
to social science measurement
Content analysis as a methodology
− Content analysts vs. newspaper readers
− Reliability, replicability, validity
 Specify assumptions and benchmarks you apply to texts
 Detail interpretation / coding / categorization schemes
 If possible: Validate with external data / information
− Unobtrusive / non-reactive measurement
Content analysis:
A working definition
“Content analysis is a research technique for
making replicable and valid inferences from
text (or other meaningful matter) to the
contexts of their use.”
Source: Krippendorff 2004: p. 18
Why automate?
 Political texts increasingly available in digitized formats
 The challenge: Volume!
o Risk of sampling bias
o Human coding is time- and resource-intensive
− The promise: Automated analyses retrieve theoretically relevant
concepts from complete text corpora at comparatively low cost
− Automated text analyses...
... rely on quantitative representations of source texts
... often apply statistical models based on assumptions about text generation
... are extremely reliable, but have to be validated
... cannot replace careful and close reading of source texts!
Four principles of automated text analysis
(Grimmer and Stewart 2013)
1. All quantitative models of language are wrong – but some are
useful (sometimes)
2. Automated text analyses augment and amplify human
interpretation but do not replace it
3. There is no globally best method for automated text analysis
4. Validate, validate, validate!
 Applicability of an automated content analysis can only be
judged against your particular research question and theory!
Part 2
Corpus construction and discovery
Acquiring documents
 Which kinds of texts are suitable for automated content analysis?
o Unit of analysis is usually on document level (other units possible)
o Focussed documents preferable (depending on your theoretical concepts)
o Sufficient number of words required (depending on the applied method)
 Typical text sources
o Existing corpora: Other social science projects or linguistic resources
o Online databases: e.g. LexisNexis, Factiva, Gale Cengage (newspapers and
press agencies); governments, parliaments, international orgs; etc...
o Web scraping: Press releases, news sites, blogs, Twitter, etc.
(see Munzert et al., 2015, Wiley)
o Scan / OCR of printed matter
 Store documents with consistent formats and document names
o Plain txt files work best (converters freely available)
o UTF-8 encoding standard for Latin alphabet
Pre-processing: Turning text into data
(Typical, but not universally applicable steps!)
 Remove...
... document “boilerplate” (info not part of the analysed message)
... punctuation, capitalization, numbers
... very common terms (“stop words”) and very uncommon terms (e.g. those in <1% of docs)
 Lemmatization / Stemming
o Words referring to the same concept mapped to a single root
o {economy, economic, economically} → economi
 Turn documents into “bags of words”
o Discards the order in which words occur!
o Unigrams, bigrams ... n-grams
 Document-term frequency matrix (one row per document, one column per term)
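
The pre-processing steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the R tutorial code that accompanies the slides; the stop word list, the function names and the two example sentences are assumptions made for the example, and stemming is left out on purpose.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in"}

def tokenize(text):
    """Lowercase, drop punctuation and numbers, remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    bags = [Counter(tokenize(d)) for d in docs]  # bags of words: order is discarded
    vocab = sorted(set().union(*bags))
    matrix = [[bag[term] for term in vocab] for bag in bags]
    return vocab, matrix

docs = [
    "The climate is changing, and the sea is rising.",
    "Rising seas threaten the islands.",
]
vocab, dtm = document_term_matrix(docs)
for term, counts in zip(vocab, zip(*dtm)):
    print(f"{term:10s} {counts}")
```

Without stemming, "sea" and "seas" end up in separate columns, which is exactly the kind of split that lemmatization or stemming is meant to remove.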
Source: python-course.eu
“Bags of words”
(illustration w/out stemming)
Source: http://web.eecs.utk.edu/~mberry/order/node4.html
Document frequency matrix (illustration)
The potential of discovery
 Even if you do not want to apply statistical analyses to your
corpus, a look at the aggregated term frequencies may:
o ... show unknown temporal patterns
o ... provide contextual information for specific concepts
(co-locations, keyword-in-context, synonyms, ...)
o ... guide selection of individual texts for further
human interpretation and coding
o ... give you an aggregated perspective on the discourse
helping to contextualize individual documents therein
Introducing our running example
 Speeches in the United Nations General Assembly
Based on the UNGD corpus assembled by Baturo, Dasandi, Mikhaylov (2017, R&P)
 What is the context of these documents we need to have in
mind along the sender-message-recipient framework?
o Who speaks when? With what purpose?
o The examples will (try to) make inferences about the ‘senders’,
assuming that speeches reflect state positions along the words used
 Climate change as the political issue of interest
o Identified by speeches referring literally to
‘climate(-| )change’ or ‘global(-| )warming’
o 100-term window around these references to see how and what
national delegates say about or associate with climate change
o For a ‘real’ analysis, more fine-tuning will most likely be needed,
bear with me...
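
A rough sketch of how such windows could be cut out of a speech. The regular expression is the one quoted above; the function name, the whitespace tokenisation and the choice of `half_width` tokens on either side are assumptions (the slides do not say whether the 100 terms are counted per side or in total).

```python
import re

# Pattern quoted on the slide; everything else is an illustrative assumption.
CLIMATE = re.compile(r"climate(-| )change|global(-| )warming", re.IGNORECASE)

def climate_windows(text, half_width=50):
    """Return one text window per climate reference.

    Each window holds up to `half_width` whitespace-separated tokens
    before and after the matched reference.
    """
    windows = []
    for match in CLIMATE.finditer(text):
        left = text[:match.start()].split()
        right = text[match.end():].split()
        left = left[max(0, len(left) - half_width):]
        right = right[:half_width]
        windows.append(" ".join(left + [match.group(0)] + right))
    return windows

speech = ("We small islands know that global warming threatens "
          "our very existence today")
print(climate_windows(speech, half_width=3))
```

Overlapping windows from neighbouring references are kept as separate entries here; a real analysis would have to decide whether to merge them.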
[Figures for the running example (six slides); the last one contrasts island states with all other states]
Part 3
Dictionary-based text analysis
Basic idea of
dictionary-based text analyses
 Presence of or rate at which a set of predefined key words
occurs in a document is used to classify or scale
the document into/along theoretically relevant categories
+ Intuitive and easy to apply
+ Replicable and expandable to various theoretical concepts
Example I: Debating the EU in national
parliaments (Rauh & De Wilde 2018, EJPR)
 Question: Do national parliamentary debates enhance the
public accountability of EU decision-making?
(If so, they should mirror EU authority, decision-making, public demand, and
feature a balance of government and opposition...but party strategies…)
 Text analysis approach
o Build a full-text corpus of parliamentary debates in various EU
member states (get it here: www.bit.ly/ParlSpeech)
o Find typical ways of referencing the EU polity, politics, and policies
on n-gram level (= reading lots of speeches!)
o Generalize these examples by regular expressions
and build a respective dictionary
o Count, normalize and aggregate EU references to party-month level
o Relate this to relevant external data
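
The count, normalize and aggregate steps could look roughly like this. The three patterns are a hypothetical mini-dictionary invented for illustration and are not the dictionary used in the cited study; the record layout (`party`, `month`, `text`) is likewise an assumption.

```python
import re
from collections import defaultdict

# Hypothetical mini-dictionary of EU references (NOT the published dictionary).
EU_PATTERNS = [
    r"\beuropean union\b",
    r"\beu\b",
    r"\beuropean (?:commission|parliament)\b",
]
EU_REGEX = re.compile("|".join(EU_PATTERNS), re.IGNORECASE)

def count_references(text):
    """Return (number of dictionary hits, number of tokens) for one speech."""
    hits = sum(1 for _ in EU_REGEX.finditer(text))
    return hits, len(text.split())

def aggregate(speeches):
    """EU references per token, pooled by (party, month)."""
    hits = defaultdict(int)
    tokens = defaultdict(int)
    for speech in speeches:
        key = (speech["party"], speech["month"])
        h, n = count_references(speech["text"])
        hits[key] += h
        tokens[key] += n
    return {key: hits[key] / tokens[key] for key in hits if tokens[key]}

speeches = [
    {"party": "A", "month": "2018-01", "text": "The EU must act"},
    {"party": "A", "month": "2018-01",
     "text": "The European Union and the European Commission agree"},
    {"party": "B", "month": "2018-01", "text": "Taxes are too high"},
]
shares = aggregate(speeches)
print(shares)
```

Normalising by the pooled token count rather than averaging per-speech rates keeps long and short speeches from being weighted equally; either choice is defensible and should be stated.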
Rauh & De Wilde (2018): Text data
Rauh & De Wilde (2018): Dictionary
(English version, other languages more complex)
Rauh & De Wilde (2018):
Descriptive results
Rauh & De Wilde (2018): Multivariate results
Example II: NGOs in the public discourse
on the WTO (Rauh & Bödeker 2014 mimeo)
 Question: Which (type of) non-governmental actors participate
in the public discourse on the World Trade Organization?
 Approach
o Financial Times (UK), Straits Times (Singapore), New York Times (US)
o Download, parse and clean all 11,388 articles from LexisNexis that
mention the WTO in headline or lead (1985-2012)
o Generate encompassing list of all transnational NGOs
(Sources: WTO stakeholder directory and UN ECOSOC database)
o Tag and count the occurrence of each of these NGOs in the articles
o Manually classify tagged NGOs as ‘business’ or ‘public interest’
o Aggregate and analyse the data
Example II (cont.)
Typical dictionary application:
Sentiment analysis
− Typical application: Sentiment analyses
o Does the analysed message convey information positively or
negatively (tone)?
o Sentiment dictionary: List of terms with individual tone
scores; usually ranging between -1 (negative) and 1 (positive)
o Sentiment at document level: rate at which positively or
negatively connoted words occur (often: relative to the overall
number of terms in document)
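
As a sketch of the document-level score, the example below counts positive and negative matches and divides the difference by the number of tokens. It uses only a handful of the wildcard entries from the Lexicoder excerpt shown in this deck; the scoring formula (positive minus negative, per token) is one common normalisation and an assumption here, not necessarily the one used in the tutorials.

```python
import re

# A few wildcard entries taken from the Lexicoder excerpt in this deck.
POSITIVE = ["BENEFIT*", "COOPERAT*", "FAIR*", "USEFUL*"]
NEGATIVE = ["CONFRONT*", "HELPLESS*", "NEGLECT*", "SCANDAL*"]

def compile_entries(entries):
    """Turn entries such as 'BENEFIT*' into one case-insensitive prefix regex."""
    stems = [re.escape(entry.rstrip("*")) for entry in entries]
    return re.compile(r"^(?:" + "|".join(stems) + r")", re.IGNORECASE)

POS_RE = compile_entries(POSITIVE)
NEG_RE = compile_entries(NEGATIVE)

def sentiment(text):
    """(positive hits - negative hits) / number of tokens; 0.0 for empty text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    pos = sum(1 for t in tokens if POS_RE.match(t))
    neg = sum(1 for t in tokens if NEG_RE.match(t))
    return (pos - neg) / len(tokens)

print(sentiment("Cooperation benefits all nations"))
print(sentiment("We neglected this scandal"))
```

Prefix matching is blunt: an entry like FAIR* would also catch "fairground", and negation ("not useful") is ignored, which is one reason the validation steps discussed later matter.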
Exemplary sentiment dictionary
(Young and Soroka 2011)
Positive connotation    Negative connotation
ALLEVIAT*               ABSURD*
BENEFIT*                BELLIGEREN*
COOPERAT*               CONFRONT*
DESERV*                 CONTAGIOUS*
EXCITE*                 FALTER*
FAIR*                   HELPLESS*
OUTSTAND*               IDEOLOGUE*
PERFECT*                LOSE*
RESOLV*                 NEGLECT*
USEFUL*                 SCANDAL*
Exemplary terms (stemmed) from the Lexicoder Sentiment dictionary.
All in all, the LSD has 4,567 unique entries.
A sentiment analysis
applied to our running example
 How positively or negatively do national delegates in the
United Nations General Assembly speak about climate change?
 And: Does this meaningfully capture expressed political
positions on climate change issues?
 Approach
 Apply the Lexicoder sentiment dictionary with the respective
functions in the quanteda R package to the corpus created above
 Normalized sentiment score in 100-term window around climate
change references
Pitfalls of dictionary approaches
− Validity of the derived measures is not guaranteed
o Can the theoretical concepts/objects of interest be captured
at term/document level?
o Do term level scores closely align with the typical word usage in
the analysed context?
 Use ‘off-the-shelf’ term lists developed in other contexts
only with extreme caution
 Ideally: Develop your own dictionaries tailored to your
research question
 Apply/calculate context-specific baselines
 In any case: Validate your results!
(e.g. against human coders or external data related to your concepts)
 Validated sentiment dictionary for English political language: Young and Soroka (2011, PC)
 Validated sentiment dictionary for German political language: Rauh (2018, JITP)
Part 4
Text scaling
Automated scaling of texts
− Scaling techniques …
… automatically distribute documents across a latent (underlying) scale
(dimension)
… are used to infer the position of a document’s author
… were mainly developed in studying the ideological positions that drive
party manifestos or political speeches (left-right dimension)
… are increasingly applied to other questions such as lobbying success
− Basic idea
Estimate text positions by focussing on language that discriminates most
strongly among the texts (i.e. give strong weight to terms that occur very
frequently in some texts but only very infrequently in others)
Prominent PolSci scaling approaches
 Unsupervised scaling: Wordfish (Slapin and Proksch 2008)
o Assumes that exactly one dimension structures the
text corpus!
o Algorithm weights term frequencies so that there is a
maximum distance between the texts in the corpus
o Rare terms influence the results strongly
o Resulting positions can only be interpreted relative to each other
o Content of the scale has to be interpreted ex-post
 Supervised scaling: Wordscores (Laver, Benoit and Garry 2003)
o Researcher supplies reference texts with ‘known’ values
across the latent scale
o Algorithm retrieves and weights the relative term frequencies
in these texts
o Virgin texts are then positioned on the latent dimension along the
weights of the terms they contain
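
The Wordscores logic can be written down compactly. The sketch below follows the basic scoring steps as described by Laver, Benoit and Garry (2003) but omits their rescaling of virgin-text scores and any uncertainty estimates; the function names and the toy texts are assumptions made for illustration.

```python
from collections import Counter

def relative_freqs(tokens):
    """Relative frequency of each word within one text."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def word_scores(reference_texts):
    """reference_texts: list of (tokens, known_position). Returns {word: score}."""
    freqs = [(relative_freqs(tokens), pos) for tokens, pos in reference_texts]
    vocab = set().union(*(f for f, _ in freqs))
    scores = {}
    for word in vocab:
        denom = sum(f.get(word, 0.0) for f, _ in freqs)
        # P(reference text | word), weighted by that reference's known position
        scores[word] = sum(f.get(word, 0.0) / denom * pos for f, pos in freqs)
    return scores

def score_text(tokens, scores):
    """Mean word score over the scored words of a virgin text (None if none)."""
    scored = [scores[t] for t in tokens if t in scores]
    return sum(scored) / len(scored) if scored else None

references = [
    ("climate hoax alarmism hoax".split(), -1.0),
    ("climate crisis emergency crisis".split(), 1.0),
]
scores = word_scores(references)
virgin = "the climate crisis is an emergency".split()
print(score_text(virgin, scores))
```

Words that appear equally often in both reference texts ("climate" here) get a score in the middle of the scale, and words unseen in the reference texts are simply ignored, which is why raw virgin-text scores tend to cluster towards the centre.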
Source: The Monkey Cage / Benjamin Lauderdale
Wordfish example
Applying Wordfish
to our running example
 What differentiates national delegates in the United Nations
General Assembly according to the relative frequency of
words they use when speaking about climate change?
 And: Does this meaningfully capture expressed political
positions on climate change issues?
 Approach
 Apply the Wordfish algorithm (as implemented in quanteda) to the
corpus of 100-term window around climate change references
aggregated to country (! pre-processing!)
 Scrutinize term weights (‘betas’) and document positions (‘thetas’)
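
For orientation, the ‘betas’ and ‘thetas’ refer to the parameters of the Poisson model behind Wordfish. Written in the notation commonly used in software implementations (the symbols are a notational assumption, the original article labels the document position differently), the count of term j in document i is modelled as:

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \, \theta_i
```

Here \alpha_i is a document fixed effect (absorbing document length), \psi_j a term fixed effect (absorbing overall term frequency), \beta_j the weight of term j in discriminating between positions, and \theta_i the position of document i on the single latent dimension. Scale and direction of \theta are not identified by the data and have to be fixed by convention, which is why positions can only be read relative to each other.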
Applying Wordscores
to our running example
 To what extent do speeches of national delegates in the UNGA use
language of climate sceptics or climate activists?
 And: Does this meaningfully capture expressed political
positions on climate change issues?
 Approach
o Corpus of 3000+ reference texts: scrape climate-change related news (!)
from websites of The Heartland Institute (climate change sceptics or
deniers; reference score: -1) and The Ecologist (climate activists; +1)
o Train a Wordscores model via quanteda on this corpus
and analyze the resulting term weights
o Scale UNGA speeches (pooled by country) along this model and see
whether we find something meaningful
Pitfalls of automated scaling
− Scaling works only:
o with documents that are very focussed on the theorized
dimension (cf. party manifestos vs. newspaper articles)
o if documents come from the same context in which the
language is used identically (political speeches vs. news outlets?)
 Scaling procedures make strong assumptions!
 Scaling procedures require particularly careful validation!
Part 5
Machine learning and topic models (briefly)
Supervised machine learning - intuition
 Basic idea of supervised classification
Algorithm ‘learns’ from (a few) human-coded documents before it
automatically classifies (many) ‘virgin’ texts
 Achieved along four (iterative) steps:
1. Construct a training and a test set from your documents
o Human coders apply a coding scheme to two subsets of docs
(-> session 2)
o Size depends on doc length, unique language, number of categories etc.
but usually a small fraction of the overall corpus is enough
2. ‘Learn’ classifier function from the training set
o Training documents used to find a statistical function that best predicts
the human-coded categories along the document-term frequencies
o Different algorithms come with different assumptions
o The RTextTools package implements different algorithms, e.g.
Supervised machine learning - intuition
3. Validate the classifier in the test set
o Use the classifier function from the training set to predict the
categories of documents in the test set
o Does your classifier live up to the ‘gold standard’ of human coding?
o If precision is insufficient, go back to step 1: Either your coding scheme
has to be re-worked, or the training set has to be expanded
4. Classify the ‘virgin’ texts
o If precision is satisfying, you can in a reliable and valid manner classify
all remaining documents of so-far unknown categories
 Validation is part of the method!
 Required size of human-coded sets decreases with fewer
categories, more discriminatory language and longer documents
 ‘Representative’ samples of training and test documents needed
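
The train, validate and classify loop can be illustrated with a small multinomial Naive Bayes classifier written from scratch. Naive Bayes is chosen here only because it is short; it is a stand-in, not an algorithm named on the slides, and the toy documents and category labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Learn class counts and per-class word counts from human-coded documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the category with the highest (Laplace-smoothed) log probability."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best_label, best_lp = None, -math.inf
    for label in class_counts:
        total = sum(word_counts[label].values())
        lp = math.log(class_counts[label] / n_docs)
        for t in tokens:
            if t in vocab:  # words never seen in training are skipped
                lp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

def precision(predicted, truth, category):
    """Share of documents predicted as `category` that coders also put there."""
    flagged = [t for p, t in zip(predicted, truth) if p == category]
    return sum(t == category for t in flagged) / len(flagged) if flagged else None

train_docs = [["tax", "budget", "deficit"], ["budget", "growth", "tax"],
              ["emissions", "climate", "warming"], ["climate", "carbon", "emissions"]]
train_labels = ["economy", "economy", "environment", "environment"]
test_docs = [["tax", "growth"], ["carbon", "climate"], ["budget", "emissions", "tax"]]
test_labels = ["economy", "environment", "economy"]

model = train_naive_bayes(train_docs, train_labels)            # step 2: learn
predicted = [predict(model, doc) for doc in test_docs]         # step 3: validate
print(predicted, precision(predicted, test_labels, "economy"))
```

Only if the test-set figures are acceptable would `predict` be applied to the remaining uncoded documents (step 4); with real data the test set must not overlap with the training set.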
Unsupervised learning
 General idea ….
Algorithm ‘learns’ both categories and categorization from the
distribution of characteristics in the supplied data
 … applied to text analysis
o Which words tend to co-occur? Which clusters can be optimized?
How can documents be distributed over clusters in a statistically
optimal way?
o Researcher does not supply any theoretical categories a priori
(only stopping criteria, e.g. number of clusters, in some approaches)
o Results can only be interpreted ex post
 Validation
o Assessing semantic validity requires much contextual knowledge!
o Models not generalizable beyond the data and parameters supplied!
A prominent unsupervised approach:
Topic models (e.g. Blei 2012)
 Typical application
Identification and distribution of abstract ‘topics’ in large amounts of
‘documents’ without prior knowledge/assumptions on these topics
 Assumptions
o Topics are defined by frequency distributions of co-occurring terms
o Texts are assumed to be generated by first choosing a topic composition (possibly
several topics per text) and only then choosing words accordingly
 Estimation
o Algorithm reverse-engineers this assumed text creation process by
asking: Which latent topic distribution would explain the observed
word frequency distribution best?
o Cluster-analysis on term level, probabilistic distribution of documents
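
The assumed text creation process can be simulated directly, which may help to see what the estimation later tries to reverse. The sketch follows the generative story of LDA-type models; the two topics, their term probabilities and the Dirichlet parameter are invented for illustration.

```python
import random

# Hypothetical topics: each is a frequency distribution over terms.
TOPICS = {
    "climate": {"emissions": 0.5, "warming": 0.3, "sea": 0.2},
    "security": {"conflict": 0.5, "peace": 0.4, "sea": 0.1},
}

def sample_dirichlet(alpha, rng):
    """Draw shares that sum to 1 (Dirichlet via normalised gamma draws)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, rng, alpha=(0.5, 0.5)):
    """Generate one document: topic composition first, words second."""
    names = list(TOPICS)
    shares = sample_dirichlet(alpha, rng)  # step 1: topic mix of this text
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=shares)[0]  # topic for this word
        terms = TOPICS[topic]
        # step 2: draw a word from the chosen topic's term distribution
        words.append(rng.choices(list(terms), weights=list(terms.values()))[0])
    return shares, words

rng = random.Random(42)
shares, words = generate_document(20, rng)
print([round(s, 2) for s in shares], words)
```

A topic model only observes `words` across many documents and has to infer both `TOPICS` and each document's `shares`; note that a term such as "sea" can belong to several topics at once.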
The logic of topic models presented by
their inventor (Blei 2012)
Typical output of topic models
(Blei 2012)
Pitfalls of topic models
 Caution!
o High interpretation demand after the analysis!
o Results are not very robust and strongly depend on the specific
data set and model parameters (seed and number of topics)
o What exactly is a topic (cf. “frame”, “issue”, “narrative”; “event” )?
 Useful tool to explore very large collections and
to narrow down more targeted samples
 Systematic analysis and especially comparisons across topics
only with greatest caution (robustness and model fit of topic
models are a current research frontier).
Outlook and conclusions
What we could not speak about…
 Word vector models
Representation of words in high-dimensional spaces allows
analysing proximity of concepts and evolution of narratives
over time …
 Text similarity / plagiarism measures …
Analysing changes of word order e.g. highly useful to study
consecutive drafts of policies, treaties, etc (e.g. Rauh, 2018) …
 Part-of-speech tagging and grammatical parsing
Retaining grammatical structure (contrast to bags of words)
allows more targeted study of subject-object relations (e.g.
predicting conflict intensity from newswires, Schrodt 2011)…
Promises and pitfalls of
automated content analyses
+ A more complete and reliable analysis of social phenomena
o Analysis of very large document sets achievable at low cost
o Reduced / removed sampling bias
+/- Human resources remain significant
o Dictionary development, coding of reference texts and
especially validation require intense human engagement
- Context dependency more pronounced
o Quantitative representations of language cannot abstract from
varying contexts (human coders can)
- Reliability is partially traded against validity
o Power of automated analyses declines quickly with the
complexity of theoretical concepts
Conclusions
 Automated text analyses are a powerful, yet not a definitive
tool for content analysis in the Political and Social Sciences
 The computer allows us to digest larger amounts of
information, uncovers patterns on much more aggregated
levels, but interpretation, contextualisation, and validation
remain the key responsibility of the researcher!
Thank you for your attention!
Slides and tutorials available at www.christian-rauh.eu/teaching

More Related Content

PDF
14. Michael Oakes (UoW) Natural Language Processing for Translation
PDF
17. Anne Schuman (USAAR) Terminology and Ontologies 2
PPTX
3. introduction to text mining
PDF
16. Anne Schumann (USAAR) Terminology and Ontologies 1
POT
Processing Parallel Text Corpora for Three South African Language Pairs in th...
PDF
Lecture 2: Computational Semantics
PDF
Open learning- Text analysis basics
PPTX
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval
14. Michael Oakes (UoW) Natural Language Processing for Translation
17. Anne Schuman (USAAR) Terminology and Ontologies 2
3. introduction to text mining
16. Anne Schumann (USAAR) Terminology and Ontologies 1
Processing Parallel Text Corpora for Three South African Language Pairs in th...
Lecture 2: Computational Semantics
Open learning- Text analysis basics
Keystone Summer School 2015: Mauro Dragoni, Ontologies For Information Retrieval

What's hot (12)

PPTX
Detecting and Describing Historical Periods in a Large Corpora
PDF
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
PDF
7 probability and statistics an introduction
PPTX
Text data mining1
PPT
Dimensions of Media Object Comprehensibility
PPTX
Topic modeling using big data analytics
PDF
OUTDATED Text Mining 5/5: Information Extraction
PDF
Ontology learning
ODP
Corpora, Blogs and Linguistic Variation (Paderborn)
PPTX
Introduction to Text Mining and Topic Modelling
PPT
Week12
PDF
Semantics and Computational Semantics
Detecting and Describing Historical Periods in a Large Corpora
Language Combinatorics: A Sentence Pattern Extraction Architecture Based on C...
7 probability and statistics an introduction
Text data mining1
Dimensions of Media Object Comprehensibility
Topic modeling using big data analytics
OUTDATED Text Mining 5/5: Information Extraction
Ontology learning
Corpora, Blogs and Linguistic Variation (Paderborn)
Introduction to Text Mining and Topic Modelling
Week12
Semantics and Computational Semantics
Ad

Similar to Introduction to automated text analyses in the Political Sciences (20)

PDF
TOPIC BASED ANALYSIS OF TEXT CORPORA
PDF
Topic models, vector semantics and applications
PDF
Data triangulation on newspapers articles using different softwarei
PDF
content analysis and discourse analysis
PDF
Semantic engagement handouts
PDF
Standards, technology and europe
PDF
Standards, terminology and Europe
PPT
Lri Owl And Ontologies 04 04
PPTX
Content analysis
PPTX
Content analysis
PDF
Argument Structures Of Political Debates
PDF
Bauman and Miller_creating-framework-global-refugee-policy-2012
DOCX
Wicked Problem Urban Street Planning in London.docx
PPTX
Document similarity
PPTX
Teaching How to Use Discourse Analysis.pptx
PDF
Science communication workshop at PBE2021
PPTX
Introduction to Nvivo
PDF
Bornmann, L., Leydesdorff, L. & Krampen, G. (2012). Which are the »best« citi...
PPT
NLP Introduction.ppt machine learning presentation
PPTX
Corpus linguistics, ch6
TOPIC BASED ANALYSIS OF TEXT CORPORA
Topic models, vector semantics and applications
Data triangulation on newspapers articles using different softwarei
content analysis and discourse analysis
Semantic engagement handouts
Standards, technology and europe
Standards, terminology and Europe
Lri Owl And Ontologies 04 04
Content analysis
Content analysis
Argument Structures Of Political Debates
Bauman and Miller_creating-framework-global-refugee-policy-2012
Wicked Problem Urban Street Planning in London.docx
Document similarity
Teaching How to Use Discourse Analysis.pptx
Science communication workshop at PBE2021
Introduction to Nvivo
Bornmann, L., Leydesdorff, L. & Krampen, G. (2012). Which are the »best« citi...
NLP Introduction.ppt machine learning presentation
Corpus linguistics, ch6
Ad

Recently uploaded (20)

PDF
Phytochemical Investigation of Miliusa longipes.pdf
PPTX
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
PDF
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
PDF
. Radiology Case Scenariosssssssssssssss
PPTX
neck nodes and dissection types and lymph nodes levels
PDF
Warm, water-depleted rocky exoplanets with surfaceionic liquids: A proposed c...
PPTX
Introduction to Fisheries Biotechnology_Lesson 1.pptx
PPTX
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
PPTX
2. Earth - The Living Planet earth and life
PDF
HPLC-PPT.docx high performance liquid chromatography
PPTX
Classification Systems_TAXONOMY_SCIENCE8.pptx
PPT
POSITIONING IN OPERATION THEATRE ROOM.ppt
PDF
lecture 2026 of Sjogren's syndrome l .pdf
PPTX
2. Earth - The Living Planet Module 2ELS
PPT
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
PDF
An interstellar mission to test astrophysical black holes
PPTX
7. General Toxicologyfor clinical phrmacy.pptx
PDF
Lymphatic System MCQs & Practice Quiz – Functions, Organs, Nodes, Ducts
PDF
Sciences of Europe No 170 (2025)
PPTX
Microbiology with diagram medical studies .pptx
Phytochemical Investigation of Miliusa longipes.pdf
EPIDURAL ANESTHESIA ANATOMY AND PHYSIOLOGY.pptx
SEHH2274 Organic Chemistry Notes 1 Structure and Bonding.pdf
. Radiology Case Scenariosssssssssssssss
neck nodes and dissection types and lymph nodes levels
Warm, water-depleted rocky exoplanets with surfaceionic liquids: A proposed c...
Introduction to Fisheries Biotechnology_Lesson 1.pptx
cpcsea ppt.pptxssssssssssssssjjdjdndndddd
2. Earth - The Living Planet earth and life
HPLC-PPT.docx high performance liquid chromatography
Classification Systems_TAXONOMY_SCIENCE8.pptx
POSITIONING IN OPERATION THEATRE ROOM.ppt
lecture 2026 of Sjogren's syndrome l .pdf
2. Earth - The Living Planet Module 2ELS
The World of Physical Science, • Labs: Safety Simulation, Measurement Practice
An interstellar mission to test astrophysical black holes
7. General Toxicologyfor clinical phrmacy.pptx
Lymphatic System MCQs & Practice Quiz – Functions, Organs, Nodes, Ducts
Sciences of Europe No 170 (2025)
Microbiology with diagram medical studies .pptx

Introduction to automated text analyses in the Political Sciences

  • 1. Introduction to automated text analyses in the political sciences Atelier documents, archives, discours 2018-2019 Université de Lausanne 20.11.2018 1 pm – 5 pm DR. CHRISTIAN RAUH www.christian-rauh.eu
  • 2. WZB christian-rauh.eu Our plan for today  Promises and pitfalls of automated text analysis (or: automated content analysis, text mining, corpus analytics, ...)  Your expectations?  Teaching goal: Enable informed decisions on whether and which automated content analysis methods are suitable for your research  Discuss the intuition and pragmatic challenges of the most common political science text analysis methods, o Corpus construction and discovery o Dictionary-based analyses o Text scaling procedures o Briefly: topic models, natural language processing, machine learning  Running example and tutorials implemented in R Climate change in United Nations General Assembly speeches  Please bear your research questions in mind, apply the discussed ideas to them, and interrupt me whenever something is unclear! 2
  • 3. WZB Part 1 Qualitative, quantitative, and automated 3120 most frequent words in Krippendorf 2004: chps. 1-2 Size ~ Relative term frequency
  • 4. WZB Content analysis is ... − the analysis of (text) documents  Politics usually happens through written or spoken text  Which documents matter for your research question?  Do you cover all of them or only a sample? − the analysis of messages  Sender → Text → Recipient  What is your object of inference? − always context-dependent!  All texts are produced for a purpose  How does this purpose relate to your inferences?  Which assumptions do you apply when interpreting the texts? 4
  • 5. WZB Content analysis in between ... −Positivist & interpretative approaches to scientific inquiry −Qualitative & quantitative approaches to social science measurement 5
  • 6. WZB Content analysis as a methodology − Content analysts vs. newspaper readers − Reliability, replicability, validity  Specify assumptions and benchmarks you apply to texts  Detail interpretation / coding / categorization schemes  If possible: Validate with external data / information − Unobstrusive / non-reactive measurement 6
  • 7. WZB Content analysis: A working definition “Content analysis is a research technique for making replicable and valid inferences from text (or other meaningful matter) to the contexts of their use.” 7 Source: Krippendorf 2004: p. 18
  • 8. WZB Why automate?  Political texts increasingly available in digitized formats  The challenge: Volume! o Risk of sampling bias o Human coding is time- and resource intensive − The promise: Automated analyses retrieve theoretically relevant concepts from complete text corpora at comparatively low cost − Automated text analyses... ... rely on quantitative representations of source texts ... often apply statistical models based on assumptions about text generation ... are extremely reliable, but have to be validated ... cannot replace careful and close reading of source texts! 8
  • 9. WZB Four principles of automated text analysis (Grimmer and Stewart 2013) 1. All quantitative models of language are wrong – but some are useful (sometimes) 2. Automated text analyses augment and amplify human interpretation but do not replace it 3. There is no globally best method for automated text analysis 4. Validate, validate, validate!  Applicability of an automated content analysis can only be judged against your particular research question and theory! 9
  • 10. WZB Part 2 Corpus construction and discovery 10
  • 11. WZB Acquiring documents  Which kinds of texts are suitable for automated content analysis? o Unit of analysis is usually on document level (other units possible) o Focussed documents preferable (depending on your theoretical concepts) o Sufficient number of words required (depending on the applied method)  Typical text sources o Existing corpora: Other social science projects or linguistic resources o Online databases: e.g. LexisNexis, Factiva, Gale Cengage (newspapers and press agencies); governments, parliaments, international orgs; etc... o Web scraping: Press releases, news sites, blogs, Twitter, etc. (see Munzert et al., 2015, Wiley) o Scan / OCR of printed matter  Store documents with consistent formats and document names o Plain txt files work best (converters freely available) o UTF-8 encoding standard for Latin alphabet 11
  • 12. WZB Pre-processing: Turning text into data (Typical, but not universally applicable steps!)  Remove... ... document “boilerplate” (info not part of the analysed message) ... punctuation, capitalization, numbers ... very common and very uncommon terms (“stop words”, <1% of docs)  Lemmatization / Stemming o Words referring to the same concept mapped to a single root o {economy, economic , economically} → economi  Turn documents into “bags of words” o Discards the order in which words occur! o Unigrams, bigrams ... n-grams  Document frequency matrix 12
  • 13. WZB Source: python-course.eu “Bags of words” (illustration w/out stemming) 13
  • 15. WZB The potential of discovery  Even if you do not want to apply statistical analyses to your corpus, a look at the aggregated term frequencies may: o ... show unknown temporal patterns o ... provide contextual information for specific concepts (co-locations, keyword-in-context, synonyms, ...) o ... guide selection of individual texts for further human interpretation and coding o ... give you an aggregated perspective on the discourse helping to contextualize individual documents therein 15
  • 16. WZB Introducing our running example  Speeches in the United Nations General Assembly Based on the UNGD corpus assembled by Baturo, Dasandi, Mikhaylov (2017, R&P)  What is the context of these documents we need to have in mind along the sender-message-recipient framework? o Who speaks when? With what purpose? o The examples will (try to) make inferences about the ‘senders’, assuming that speeches reflect state positions along the words used  Climate change as the political issue of interest o Identified by speeches referring literally to ‘climate(-| )change’ or ‘global(-| )warming’ o 100-term window around these references to see how and what national delegates say about or associate with climate change o For a ‘real’ analysis, more fine-tuning will most likely be needed, bear with me... 16
  • 24. WZB Basic idea of dictionary-based text analyses  Presence of or rate at which a set of predefined key words occurs in a document is used to classify or scale the document into/along theoretically relevant categories + Intuitive and easy to apply + Replicable and expandable to various theoretical concepts 24
  • 25. Example I: Debating the EU in national parliaments (Rauh & De Wilde 2018, EJPR)
      − Question: Do national parliamentary debates enhance the public accountability of EU decision-making? (If so, they should mirror EU authority, decision-making, public demand, and feature a balance of government and opposition ... but party strategies …)
      − Text analysis approach
          o Build a full-text corpus of parliamentary debates in various EU member states (get it here: www.bit.ly/ParlSpeech)
          o Find typical ways of referencing the EU polity, politics, and policies at the n-gram level (= reading lots of speeches!)
          o Generalize these examples with regular expressions and build a corresponding dictionary
          o Count, normalize and aggregate EU references to the party-month level
          o Relate this to relevant external data
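The count, normalize, and aggregate steps can be sketched as follows in plain Python. This is not the authors' dictionary or code: the three regular expressions, the toy speeches, and the normalization per 1,000 tokens are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical patterns for illustration; NOT the published dictionary.
EU_PATTERNS = [
    r"\beuropean (union|commission|parliament|council)\b",
    r"\beu\b",
    r"\bbrussels\b",
]
EU_REGEX = re.compile("|".join(EU_PATTERNS), re.IGNORECASE)

# Invented toy speeches with party and month metadata.
speeches = [
    {"party": "A", "month": "2010-01",
     "text": "The EU budget matters. Brussels decides too much."},
    {"party": "A", "month": "2010-01",
     "text": "We need better schools and roads."},
    {"party": "B", "month": "2010-01",
     "text": "The European Commission proposed new rules for the EU."},
]


def count_refs(text):
    """Number of dictionary matches (EU references) in a text."""
    return sum(1 for _ in EU_REGEX.finditer(text))


def n_tokens(text):
    """Number of word tokens, used as the normalization base."""
    return len(re.findall(r"\w+", text))


# Aggregate raw counts to party-month level: [references, tokens].
totals = defaultdict(lambda: [0, 0])
for s in speeches:
    key = (s["party"], s["month"])
    totals[key][0] += count_refs(s["text"])
    totals[key][1] += n_tokens(s["text"])

# Normalize: EU references per 1,000 tokens spoken by the party in that month.
rates = {key: 1000 * refs / toks for key, (refs, toks) in totals.items()}
for key, rate in sorted(rates.items()):
    print(key, round(rate, 1))
```

Aggregating counts before dividing, rather than averaging per-speech rates, keeps very short speeches from dominating the party-month value.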
  • 26. Rauh & De Wilde (2018): Text data
  • 27. Rauh & De Wilde (2018): Dictionary (English version, other languages more complex)
  • 28. Rauh & De Wilde (2018): Descriptive results
  • 29. Rauh & De Wilde (2018): Multivariate results
  • 30. Example II: NGOs in the public discourse on the WTO (Rauh & Bödeker 2014, mimeo)
      − Question: Which (types of) non-governmental actors participate in the public discourse on the World Trade Organization?
      − Approach
          o Financial Times (UK), Straits Times (Singapore), New York Times (US)
          o Download, parse and clean all 11,388 articles from LexisNexis that mention the WTO in headline or lead (1985-2012)
          o Generate an encompassing list of all transnational NGOs (sources: WTO stakeholder directory and UN ECOSOC database)
          o Tag and count the occurrences of each of these NGOs in the articles
          o Manually classify tagged NGOs as ‘business’ or ‘public interest’
          o Aggregate and analyse the data
  • 34. Typical dictionary application: Sentiment analysis
      − Does the analysed message convey information positively or negatively (tone)?
      − Sentiment dictionary: list of terms with individual tone scores, usually ranging between -1 (negative) and 1 (positive)
      − Sentiment at document level: rate at which positively or negatively connoted words occur (often relative to the overall number of terms in the document)
  • 35. Exemplary sentiment dictionary (Young and Soroka 2011)
      Positive connotation    Negative connotation
      ALLEVIAT*               ABSURD*
      BENEFIT*                BELLIGEREN*
      COOPERAT*               CONFRONT*
      DESERV*                 CONTAGIOUS*
      EXCITE*                 FALTER*
      FAIR*                   HELPLESS*
      OUTSTAND*               IDEOLOGUE*
      PERFECT*                LOSE*
      RESOLV*                 NEGLECT*
      USEFUL*                 SCANDAL*
      Exemplary terms (stemmed) from the Lexicoder Sentiment Dictionary (LSD). All in all, the LSD has 4,567 unique entries.
  • 36. A sentiment analysis applied to our running example
      − How positively or negatively do national delegates in the United Nations General Assembly speak about climate change?
      − And: Does this meaningfully capture expressed political positions on climate change issues?
      − Approach
          o Apply the Lexicoder sentiment dictionary, using the corresponding functions in the quanteda R package, to the corpus created above
          o Normalized sentiment score in the 100-term window around climate change references
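To make the idea of a normalized sentiment score concrete, here is a small Python sketch (the tutorial itself relies on quanteda in R). It uses only the twenty exemplary stems shown on the Lexicoder slide, treats the trailing `*` as "starts with", and assumes one common normalization, (positive matches − negative matches) / number of tokens. The full dictionary and its handling of negations are not reproduced here.

```python
import re

# Exemplary stems from the slide; the real dictionary has thousands of entries.
POSITIVE = ("alleviat", "benefit", "cooperat", "deserv", "excite",
            "fair", "outstand", "perfect", "resolv", "useful")
NEGATIVE = ("absurd", "belligeren", "confront", "contagious", "falter",
            "helpless", "ideologue", "lose", "neglect", "scandal")


def sentiment(text):
    """(positive - negative matches) / number of tokens; 0.0 for empty text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    # str.startswith accepts a tuple, which mimics the trailing wildcard.
    pos = sum(t.startswith(POSITIVE) for t in tokens)
    neg = sum(t.startswith(NEGATIVE) for t in tokens)
    return (pos - neg) / len(tokens)


print(sentiment("Fair cooperation benefits everyone"))
print(sentiment("We must not lose hope"))
```

The second example scores negative although the message is hopeful: a plain term count ignores the negation, which is exactly the kind of validity problem that calls for checking dictionary scores against human judgement.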
  • 40. Pitfalls of dictionary approaches
      − The validity of the derived measures cannot be taken for granted
          o Can the theoretical concepts/objects of interest be captured at term/document level?
          o Do term-level scores closely align with the typical word usage in the analysed context?
      − Use ‘off-the-shelf’ term lists developed in other contexts only with extreme caution
      − Ideally: Develop your own dictionaries tailored to your research question
      − Apply/calculate context-specific baselines
      − In any case: Validate your results! (e.g. against human coders or external data related to your concepts)
      − Validated sentiment dictionary for English political language: Young and Soroka (2012, PC)
      − Validated sentiment dictionary for German political language: Rauh (2018, JITP)
  • 42. Automated scaling of texts
      − Scaling techniques …
          o … automatically distribute documents along a latent (underlying) scale (dimension)
          o … are used to infer the position of a document’s author
          o … were mainly developed to study the ideological positions that drive party manifestos or political speeches (left-right dimension)
          o … are increasingly applied to other questions such as lobbying success
      − Basic idea: Estimate text positions by focussing on the language that discriminates most strongly among the texts (i.e. give strong weight to terms that occur very frequently in some texts but only very infrequently in others)
  • 43. Prominent PolSci scaling approaches
      − Unsupervised scaling: Wordfish (Slapin and Proksch 2008)
          o Assumes that exactly one dimension structures the text corpus!
          o Algorithm weights term frequencies so that the distance between the texts in the corpus is maximized
          o Rare terms influence the results strongly
          o Resulting positions can only be interpreted relative to each other
          o Content of the scale has to be interpreted ex post
      − Supervised scaling: Wordscores (Laver, Benoit and Garry 2003)
          o Researcher supplies reference texts with ‘known’ values on the latent scale
          o Algorithm retrieves and weights the relative term frequencies in these texts
          o Virgin texts are then positioned on the latent dimension according to the weights of the terms they contain
  • 44. Wordfish example. Source: The Monkey Cage / Benjamin Lauderdale
  • 45. Applying Wordfish to our running example
      − What differentiates national delegates in the United Nations General Assembly according to the relative frequency of the words they use when speaking about climate change?
      − And: Does this meaningfully capture expressed political positions on climate change issues?
      − Approach
          o Apply the Wordfish algorithm (as implemented in quanteda) to the corpus of 100-term windows around climate change references, aggregated to country level (! pre-processing !)
          o Scrutinize term weights (‘betas’) and document positions (‘thetas’)
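For orientation, the ‘betas’ and ‘thetas’ mentioned above are parameters of the Poisson model that Wordfish is usually described as estimating. The statement below uses the parameter names as they appear in the slide; the original paper may use different symbols, so treat the notation as an assumption.

```latex
y_{ij} \sim \mathrm{Poisson}(\lambda_{ij}), \qquad
\lambda_{ij} = \exp\left(\alpha_i + \psi_j + \beta_j \, \theta_i\right)
```

Here $y_{ij}$ is the count of term $j$ in document $i$, $\alpha_i$ is a document fixed effect (absorbing document length), $\psi_j$ is a term fixed effect (absorbing overall term frequency), $\beta_j$ is the weight of term $j$ in discriminating between positions, and $\theta_i$ is the position of document $i$ on the single latent dimension. Because only the product $\beta_j \theta_i$ enters the model, the scale has no natural direction or unit, which is why positions can only be read relative to each other.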
  • 49. Applying Wordscores to our running example
      − To what extent do speeches of national delegates in the UNGA use the language of climate sceptics or climate activists?
      − And: Does this meaningfully capture expressed political positions on climate change issues?
      − Approach
          o Corpus of 3000+ reference texts: scrape climate-change-related news (!) from the websites of The Heartland Institute (climate change sceptics or deniers; reference score: -1) and The Ecologist (climate activists; reference score: +1)
          o Train a Wordscores model via quanteda on this corpus and analyze the resulting term weights
          o Scale UNGA speeches (pooled by country) along this model and see whether we find something meaningful
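The core Wordscores logic can be sketched in plain Python as follows (the tutorial uses quanteda in R). The two mini "reference texts" and the virgin text are invented, the function names are hypothetical, and the sketch omits the rescaling of virgin-text scores and the uncertainty estimates that a full implementation offers.

```python
from collections import Counter


def wordscores(ref_texts, ref_scores):
    """Score each word from reference texts with known positions.

    ref_texts: dict name -> list of tokens; ref_scores: dict name -> position.
    """
    # Relative frequency of each word within each reference text.
    rel = {
        name: {w: c / len(tokens) for w, c in Counter(tokens).items()}
        for name, tokens in ref_texts.items()
    }
    vocab = set().union(*(freqs.keys() for freqs in rel.values()))
    scores = {}
    for w in vocab:
        total = sum(rel[name].get(w, 0.0) for name in rel)
        # P(reference | word) weights the reference positions.
        scores[w] = sum(
            rel[name].get(w, 0.0) / total * ref_scores[name] for name in rel
        )
    return scores


def score_text(tokens, scores):
    """Mean word score over the tokens that occur in the reference vocabulary."""
    scored = [scores[t] for t in tokens if t in scores]
    return sum(scored) / len(scored) if scored else None


# Invented toy data; -1 / +1 mirror the reference scores on the slide.
refs = {
    "sceptic": "warming hoax alarmism warming".split(),
    "activist": "warming crisis action crisis".split(),
}
positions = {"sceptic": -1.0, "activist": 1.0}

word_scores = wordscores(refs, positions)
virgin = "we need action on the warming crisis".split()
print(score_text(virgin, word_scores))
```

Words that appear in both reference sets, such as "warming" here, receive scores near the middle and pull virgin texts towards the centre, which is one reason why raw Wordscores estimates tend to cluster and why the choice of reference texts needs careful validation.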
  • 53. Pitfalls of automated scaling
      − Scaling works only ...
          o ... with documents that are very focussed on the theorized dimension (cf. party manifestos vs. newspaper articles)
          o ... if documents come from the same context, in which language is used identically (political speeches vs. news outlets?)
      − Scaling procedures make strong assumptions!
      − Scaling procedures require particularly careful validation!
  • 54. Part 5: Machine learning and topic models (briefly)
  • 55. Supervised machine learning - intuition
      − Basic idea of supervised classification: Algorithm ‘learns’ from (a few) human-coded documents before it automatically classifies (many) ‘virgin’ texts
      − Achieved along four (iterative) steps:
          1. Construct a training and a test set from your documents
              o Human coders apply a coding scheme to two subsets of docs (-> session 2)
              o Size depends on doc length, unique language, number of categories etc., but usually a small fraction of the overall corpus is enough
          2. ‘Learn’ a classifier function from the training set
              o Training documents are used to find a statistical function that best predicts the human-coded categories from the document-term frequencies
              o Different algorithms come with different assumptions
              o The RTextTools package implements different algorithms, e.g.
  • 56. Supervised machine learning - intuition
          3. Validate the classifier in the test set
              o Use the classifier function from the training set to predict the categories of documents in the test set
              o Does your classifier live up to the ‘gold standard’ of human coding?
              o If precision is insufficient, go back to step 1: Either your coding scheme has to be reworked, or the training set has to be expanded
          4. Classify the ‘virgin’ texts
              o If precision is satisfactory, you can classify all remaining documents of so-far unknown categories in a reliable and valid manner
      − Validation is part of the method!
      − Required size of human-coded sets decreases with fewer categories, more discriminatory language and longer documents
      − ‘Representative’ samples of training and test documents needed
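The train, validate, classify cycle can be illustrated with a hand-rolled multinomial naive Bayes classifier in plain Python. This is only one of many possible algorithms and not the RTextTools implementation referred to in the slides; the function names and the human-coded toy documents are invented.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Step 2: learn word counts per category from (tokens, label) pairs."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab


def predict(model, tokens):
    """Return the most probable category (add-one smoothing, log scale)."""
    label_counts, word_counts, vocab = model
    n_docs = sum(label_counts.values())
    best_label, best_logp = None, -math.inf
    for label in label_counts:
        total = sum(word_counts[label].values())
        logp = math.log(label_counts[label] / n_docs)
        for t in tokens:
            if t in vocab:  # words unseen in training are ignored
                logp += math.log((word_counts[label][t] + 1) / (total + len(vocab)))
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label


def precision(model, test_docs, target):
    """Step 3: share of documents predicted as `target` that truly are `target`."""
    hits = [label for tokens, label in test_docs if predict(model, tokens) == target]
    return sum(label == target for label in hits) / len(hits) if hits else None


# Step 1: human-coded training and test sets (invented toy data).
train = [
    ("tax cuts boost growth".split(), "economy"),
    ("budget deficit and tax reform".split(), "economy"),
    ("emissions drive global warming".split(), "climate"),
    ("carbon emissions and sea levels".split(), "climate"),
]
test = [
    ("tax reform and growth".split(), "economy"),
    ("rising sea levels and emissions".split(), "climate"),
    ("carbon tax on emissions".split(), "climate"),
    ("green growth and tax incentives".split(), "climate"),
]

model = train_nb(train)
print(precision(model, test, "economy"), precision(model, test, "climate"))

# Step 4: classify a 'virgin' text only if validation was satisfactory.
print(predict(model, "new budget and tax plans".split()))
```

In this toy run one climate document is misread as an economy document, so precision for the economy category falls to one half: the signal to return to step 1 and expand the training set or rework the coding scheme.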
  • 57. Unsupervised learning
      − General idea: Algorithm ‘learns’ both categories and categorization from the distribution of characteristics in the supplied data
      − … applied to text analysis
          o Which words tend to co-occur? Which clusters can be optimized? How can documents be distributed over clusters in a statistically optimal way?
          o Researcher does not supply any theoretical categories a priori (only stopping criteria, e.g. the number of clusters, in some approaches)
          o Results can only be interpreted ex post
      − Validation
          o Assessing semantic validity requires much contextual knowledge!
          o Models are not generalizable beyond the data and parameters supplied!
  • 58. A prominent unsupervised approach: Topic models (e.g. Blei 2012)
      − Typical application: Identification and distribution of abstract ‘topics’ in large amounts of ‘documents’ without prior knowledge of or assumptions about these topics
      − Assumptions
          o Topics are defined by frequency distributions of co-occurring terms
          o Texts were generated by first choosing a topic composition (possibly several topics per text) and only then choosing words accordingly
      − Estimation
          o Algorithm reverse-engineers this assumed text creation process by asking: Which latent topic distribution would best explain the observed word frequency distribution?
          o Cluster analysis at term level, probabilistic distribution of documents over topics
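The assumed text creation process can be simulated directly, which may help to see what estimation later has to "reverse-engineer". The Python sketch below is a forward simulation only, not an estimation routine; the two topics, their word probabilities, and the function names are invented for illustration.

```python
import random

# Two invented topics, each a probability distribution over terms.
TOPICS = {
    "climate": {"emissions": 0.5, "warming": 0.3, "energy": 0.2},
    "trade": {"tariffs": 0.5, "exports": 0.3, "energy": 0.2},
}


def dirichlet(alpha, rng):
    """Draw a probability vector from a Dirichlet via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, alpha, rng):
    """Step 1: draw the topic composition. Step 2: draw each word from a topic."""
    names = list(TOPICS)
    theta = dirichlet([alpha] * len(names), rng)  # topic shares of this text
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=theta)[0]
        terms = TOPICS[topic]
        words.append(rng.choices(list(terms), weights=list(terms.values()))[0])
    return theta, words


rng = random.Random(42)  # fixed seed so the simulation is repeatable
theta, words = generate_document(20, alpha=0.5, rng=rng)
print([round(share, 2) for share in theta])
print(" ".join(words))
```

A topic model is handed only the generated words of many such documents and has to recover both `TOPICS` and each document's `theta`. The term "energy" belongs to both topics here, a small reminder of why estimated topics can be hard to tell apart and interpret.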
  • 59. The logic of topic models presented by their inventor (Blei 2012)
  • 60. Typical output of topic models (Blei 2012)
  • 61. Pitfalls of topic models
      − Caution!
          o High interpretation demand after the analysis!
          o Results are not very robust and strongly depend on the specific data set and model parameters (seed and number of topics)
          o What exactly is a topic (cf. “frame”, “issue”, “narrative”, “event”)?
      − Useful tool to explore very large collections and to narrow them down to more targeted samples
      − Systematic analysis and especially comparisons across topics only with the greatest caution (robustness and model fit of topic models are a current research frontier)
  • 63. What we could not speak about…
      − Word vector models: Representing words in high-dimensional spaces allows analysing the proximity of concepts and the evolution of narratives over time …
      − Text similarity / plagiarism measures: Analysing changes of word order is e.g. highly useful to study consecutive drafts of policies, treaties, etc. (e.g. Rauh, 2018) …
      − Part-of-speech tagging and grammatical parsing: Retaining grammatical structure (in contrast to bags of words) allows a more targeted study of subject-object relations (e.g. predicting conflict intensity from newswires, Schrodt 2011) …
  • 64. Promises and pitfalls of automated content analyses
      + A more complete and reliable analysis of social phenomena
          o Analysis of very large document sets achievable at low cost
          o Reduced / removed sampling bias
      +/- Human resources remain significant
          o Dictionary development, coding of reference texts and especially validation require intense human engagement
      - Context dependency more pronounced
          o Quantitative representations of language cannot abstract from varying contexts (human coders can)
      - Reliability is partially traded against validity
          o Power of automated analyses declines quickly with the complexity of theoretical concepts
  • 65. Conclusions
      − Automated text analyses are a powerful, yet not a definitive, tool for content analysis in the political and social sciences
      − The computer allows us to digest larger amounts of information and uncovers patterns at much more aggregated levels, but interpretation, contextualisation, and validation remain key responsibilities of the researcher!
      Thank you for your attention! Slides and tutorials available at www.christian-rauh.eu/teaching