Big Data and Automated Content Analysis
Week 8 – Wednesday
»Supervised Machine Learning«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
18 May 2016
Today

1 Recap: Types of Automated Content Analysis
2 Supervised Machine Learning
    You have done it before!
    Applications
    An implementation
3 Exercise
4 Next meetings
Recap: Types of Automated Content Analysis
Top-down vs. bottom-up
From concrete to abstract, from manifest to latent

Our approaches until now:
• Top-down: counting predefined lists of words or predefined regular expressions
• Bottom-up: frequency counts, co-occurrence networks, LDA

But there is some middle ground: predefined categories, but no explicit rules.
Enter supervised machine learning

The three families of approaches range from deductive to inductive:

• Counting and Dictionary (deductive). Typical research interests and content features: visibility analysis, sentiment analysis, subjectivity analysis. Common statistical procedures: string comparisons, counting.
• Supervised Machine Learning. Typical research interests and content features: frames, topics, gender bias. Common statistical procedures: support vector machines, naive Bayes.
• Unsupervised Machine Learning (inductive). Typical research interests and content features: frames, topics. Common statistical procedures: principal component analysis, cluster analysis, latent Dirichlet allocation, semantic network analysis.

Boumans, J. W., & Trilling, D. (2016). Taking stock of the toolkit: An overview of relevant automated content analysis approaches and techniques for digital journalism scholars. Digital Journalism, 4(1), 8–23.
Recap: supervised vs. unsupervised

Unsupervised
• No manually coded data
• We want to identify patterns or to group the most similar cases

Example: We have a dataset of Facebook messages on an organization’s page. We use clustering to group them and later interpret these clusters (e.g., as complaints, questions, praise, . . . )

Supervised
• We code a small dataset by hand and use it to “train” a machine
• The machine codes the rest

Example: We have 2,000 of these messages grouped into such categories by human coders. We then use these data to group all remaining messages as well.
Supervised Machine Learning
You have done it before!

Regression
1 Based on your data, you estimate some regression equation: $y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i$
2 Even with new, unseen data, you can estimate your expected outcome $\hat{y}$!
3 Example: You estimated a regression equation where $y$ is newspaper reading in days/week: $y = -.8 + .4 \times \text{man} + .08 \times \text{age}$
4 You could now calculate $\hat{y}$ for a man of 20 years and a woman of 40 years, even if no such person exists in your dataset:
$\hat{y}_{\text{man},20} = -.8 + .4 \times 1 + .08 \times 20 = 1.2$
$\hat{y}_{\text{woman},40} = -.8 + .4 \times 0 + .08 \times 40 = 2.4$
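A minimal Python sketch of the same idea: the estimated equation is the “model”, and prediction is just plugging in new values. (The function name predict_reading is ours, purely for illustration.)

# The regression equation from above, treated as a "trained model"
def predict_reading(man, age):
    """Predicted newspaper reading in days/week."""
    return -.8 + .4 * man + .08 * age

print(round(predict_reading(man=1, age=20), 2))  # 1.2
print(round(predict_reading(man=0, age=40), 2))  # 2.4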
This is
Supervised Machine Learning!
. . . but . . .
• We use only half (or some other fraction) of our data to estimate the model, so that we can use the other half to check whether our predictions match the manual coding (“labeled data” or “annotated data” in SML lingo)
• e.g., 2,000 labeled cases: 1,000 for training, 1,000 for testing; if successful, run on 100,000 unlabeled cases
• We use many more independent variables (“features”)
• Typically, the IVs are word frequencies, often weighted (e.g., tf×idf), i.e., a bag-of-words (BOW) representation
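As a minimal sketch of this train/test logic (assuming labeled data as a list of (text, label) tuples, like the reviews list used below), scikit-learn can do the random split for us. Note that in older scikit-learn versions, train_test_split lives in sklearn.cross_validation rather than sklearn.model_selection.

from sklearn.model_selection import train_test_split

# Hypothetical labeled data: (text, label) tuples
labeled = [("This is a great movie", 1), ("Bad movie", -1), ("Nice film", 1)] * 100

texts = [text for text, label in labeled]
labels = [label for text, label in labeled]

# Half for training, half for testing, assigned at random
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42)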
Applications

In other fields
Many different applications:
• from recognizing hand-written characters to recommendation systems

In our field
It is becoming popular as a way to measure latent variables:
• frames
• topics
SML to code frames and topics

Some work by Burscher and colleagues
• Humans can code generic frames (human interest, economic, . . . )
• Humans can code topics from a pre-defined list
• But it is very hard to formulate an explicit rule (as in: code as ‘Human Interest’ if regular expression R is matched)
⇒ This is where you need supervised machine learning!

Burscher, B., Odijk, D., Vliegenthart, R., De Rijke, M., & De Vreese, C. H. (2014). Teaching the computer to code frames in news: Comparing two supervised machine learning approaches to frame analysis. Communication Methods and Measures, 8(3), 190–206. doi:10.1080/19312458.2014.937527
Burscher, B., Vliegenthart, R., & De Vreese, C. H. (2015). Using supervised machine learning to code policy issues: Can classifiers generalize across contexts? Annals of the American Academy of Political and Social Science, 659(1), 122–131.
Some measures of accuracy
• Recall
• Precision
• F1 = 2 · (precision · recall) / (precision + recall)
• AUC (area under the curve): ranges from 0 to 1, where 0.5 equals random guessing

(Illustration: http://commons.wikimedia.org/wiki/File:Precisionrecall.svg)
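A minimal sketch of computing these measures with scikit-learn, using hypothetical lists of true and predicted labels (the same format as the actual and predictions lists in the implementation below):

from sklearn import metrics

# Hypothetical true and predicted labels (1 = positive, -1 = negative)
actual      = [1, -1, 1, 1, -1, 1, -1, -1]
predictions = [1, -1, -1, 1, -1, 1, 1, -1]

print(metrics.precision_score(actual, predictions, pos_label=1))
print(metrics.recall_score(actual, predictions, pos_label=1))
print(metrics.f1_score(actual, predictions, pos_label=1))

fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print(metrics.auc(fpr, tpr))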
What does this mean for our research?

If we have 2,000 documents with manually coded frames and topics . . .
• we can use them to train an SML classifier
• which can code an unlimited number of new documents
• with acceptable accuracy

Some easier tasks even need only 500 training documents; see Hopkins, D. J., & King, G. (2010). A method of automated nonparametric content analysis for social science. American Journal of Political Science, 54(1), 229–247.
An implementation
Let’s say we have a list of tuples with movie reviews and their rating:

reviews=[("This is a great movie",1),("Bad movie",-1), ... ...]

And a second list with an identical structure:

test=[("Not that good",-1),("Nice film",1), ... ...]

Both are drawn from the same population; it is pure chance whether a specific review is on the one list or the other.

Based on an example from http://blog.dataquest.io/blog/naive-bayes-movies/
Training a Naïve Bayes classifier

from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

# This is just an efficient way of computing word counts
vectorizer = CountVectorizer(stop_words='english')
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])

# Fit a Naive Bayes model to the training data.
nb = MultinomialNB()
nb.fit(train_features, [r[1] for r in reviews])

# Now we can use the model to predict classifications for our test features.
predictions = nb.predict(test_features)
actual = [r[1] for r in test]

# Compute the error.
fpr, tpr, thresholds = metrics.roc_curve(actual, predictions, pos_label=1)
print("Multinomial naive bayes AUC: {0}".format(metrics.auc(fpr, tpr)))
And it works!

Using 50,000 IMDB movie reviews that are classified as either negative or positive,
• I created a list with 25,000 training tuples and another one with 25,000 test tuples and
• trained a classifier
• that achieved an AUC of .82.

Dataset obtained from http://ai.stanford.edu/~amaas/data/sentiment; Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).
Playing around with new data

newdata = vectorizer.transform(["What a crappy movie! It sucks!", "This is awsome. I liked this movie a lot, fantastic actors", "I would not recomment it to anyone.", "Enjoyed it a lot"])
predictions = nb.predict(newdata)
print(predictions)

This returns, as you would expect and hope:

[-1 1 -1 1]
But we can do even better
We can use different vectorizers and different classifiers.
Different vectorizers
• CountVectorizer (= simple word counts)
• TfidfVectorizer (word counts (“term frequency”) weighted down by the number of documents in which a word occurs at all (“inverse document frequency”))
• additional options: stopwords, thresholds for minimum frequencies, etc.
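A minimal sketch of swapping in a tf×idf vectorizer with some of these options; the parameter values shown are illustrative, not recommendations:

from sklearn.feature_extraction.text import TfidfVectorizer

# tf-idf weighting instead of raw counts
vectorizer = TfidfVectorizer(
    stop_words='english',  # remove English stopwords
    min_df=2,              # ignore words occurring in fewer than 2 documents
    max_df=0.5)            # ignore words occurring in more than 50% of documents
train_features = vectorizer.fit_transform([r[0] for r in reviews])
test_features = vectorizer.transform([r[0] for r in test])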
Different classifiers
• Naïve Bayes
• Logistic Regression
• Support Vector Machine (SVM)
• . . .
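Because scikit-learn estimators share the same fit/predict interface, trying another classifier is essentially a one-line change. A minimal sketch, assuming the train_features, test_features, and label lists from the Naïve Bayes example above:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

train_labels = [r[1] for r in reviews]

# Same interface as MultinomialNB: fit on training data, predict on test data
logreg = LogisticRegression()
logreg.fit(train_features, train_labels)
predictions_logreg = logreg.predict(test_features)

svm = LinearSVC()
svm.fit(train_features, train_labels)
predictions_svm = svm.predict(test_features)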
Let’s look at the source code together and find out which setup
performs best.
Next (last. . . ) meetings
Monday
pandas & matplotlib — doing statistics in Python
Wednesday
OpenLab
Questions regarding final project