Co-occurrences Networks Other co-occurrence based methods Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Co-occurring words«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
9 May 2016
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: Word co-occurrences
The idea
A real-life example
2 Other co-occurrence based methods
PCA
LDA
3 Next meetings, & final project
Integrating word counts and network analysis:
Word co-occurrences
The idea
Simple word count
We already know this.
from collections import Counter
tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
c = Counter(tekst.split())
print("The top 5 are:")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
Simple word count
The output:
The top 5 are:
4 is
3 test
2 a
2 this
2 it
What if we could. . .
. . . count the frequency of combinations of words?
As in: Which words typically occur together in the same tweet (or paragraph, or sentence, . . . )?
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print([e for e in combinations(words, 2)])
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'), ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'), ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'), ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'), ('is', 'it'), ('is', 'is'), ('a', 'test'), ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'test'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'it'), ('test', 'is'), ('it', 'is')]
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee", "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    for a, b in set(combinations(words, 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
Count co-occurrences
The output:
3 ('i', 'coffee')
3 ('i', 'like')
2 ('i', 'beer')
2 ('like', 'beer')
2 ('like', 'coffee')
1 ('coffee', 'beer')
1 ('and', 'beer')
...
From a list of co-occurrences to a network
Let's conceptualize each word as a node and each co-occurrence as an edge:
• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file offers all of this and looks like this:
nodedef>name VARCHAR, width DOUBLE
coffee,3
beer,2
i,4
and,1
with,1
friend,1
having,1
like,3
am,1
my,1
edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
coffee,beer,1
i,beer,2
and,beer,1
with,friend,1
coffee,with,1
i,and,1
having,friend,1
like,beer,2
am,friend,1
i,am,1
i,coffee,3
i,with,1
am,having,1
i,having,1
coffee,and,1
like,coffee,2
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so
we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network
analysis
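The first step, writing such a GDF file from Python, could be sketched like this (a minimal sketch; the function name and output filename are assumptions, reusing the counting code from the previous slides):

```python
from collections import Counter, defaultdict
from itertools import combinations

def save_gdf(wordcounts, cooc, filename):
    # Write nodes (word, frequency) and edges (word pair, co-occurrence
    # count) in the GDF format shown on the previous slide.
    with open(filename, "w", encoding="utf-8") as f:
        f.write("nodedef>name VARCHAR, width DOUBLE\n")
        for word, count in wordcounts.items():
            f.write("{},{}\n".format(word, count))
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n")
        for (a, b), weight in cooc.items():
            f.write("{},{},{}\n".format(a, b, weight))

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]

# Node weights: simple word frequencies across all tweets.
wordcounts = Counter(word for tweet in tweets for word in tweet.split())

# Edge weights: number of tweets in which two words co-occur.
cooc = defaultdict(int)
for tweet in tweets:
    words = tweet.split()
    for a, b in set(combinations(words, 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

save_gdf(wordcounts, cooc, "cooccurrences.gdf")
```

The resulting file can be opened directly in Gephi.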
Gephi
• Install (NOT in the VM) from https://gephi.org
• For problems on macOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, 33, 259–276. doi: 10.1177/0894439314537886
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d'Heer, 2012)
• User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
Research Questions
To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?
RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated
on Twitter?
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30-22.00
The analysis
• Series of self-written Python scripts:
1 preprocessing (stemming, stopword removal)
2 word counts
3 word log likelihood (corpus comparison)
• Stata: regression analysis
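The corpus-comparison step, word log likelihood, can be sketched as follows (a minimal sketch of the Dunning-style log-likelihood statistic; the function name and the simplified two-term formula are assumptions, not the paper's exact code):

```python
from math import log

def log_likelihood(a, b, n1, n2):
    # a, b: frequency of one word in corpus 1 and corpus 2
    # n1, n2: total number of words in each corpus
    e1 = n1 * (a + b) / (n1 + n2)   # expected frequency in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)   # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * log(a / e1)
    if b > 0:
        ll += b * log(b / e2)
    return 2 * ll

# A word used equally often in two equally large corpora is not distinctive:
print(log_likelihood(5, 5, 1000, 1000))
# A word that occurs only in one of the corpora gets a high score:
print(log_likelihood(5, 0, 1000, 1000))
```

Ranking all words by this score yields lists of the most distinctive words per corpus, as on the following slides.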
[Figure: tweet volume over time (y-axis 0 to 8,000), x-axis from 60 minutes before to 150 minutes after the debate start; start and end of the debate are marked]
Relationship between words on TV and on Twitter
[Figure: scatterplot of ln(word on Twitter + 1), y-axis 0 to 10, against ln(word on TV + 1), x-axis 0 to 3]
Word frequency TV ⇒ word frequency Twitter

                    Model 1            Model 2            Model 3
                    ln(Twitter +1)     ln(Twitter +1)     ln(Twitter +1)
                                       together w/ M.     together w/ S.
                    b (SE) / beta      b (SE) / beta      b (SE) / beta
ln (TV M. +1)       1.59 (.052) ***    1.54 (.041) ***    .77 (.037) ***
                    .21                .26                .14
ln (TV S. +1)       1.29 (.051) ***    .88 (.041) ***     1.25 (.037) ***
                    .17                .15                .24
intercept           1.64 (.008) ***    .87 (.007) ***     .60 (.006) ***
R2                  .100               .115               .100
b M. & S. differ?   F(1, 21408) =      F(1, 21408) =      F(1, 21408) =
                    12.29, p < .001    96.69, p < .001    63.38, p < .001

M = Merkel; S = Steinbrück
Most distinctive words on TV
LL      word                      Frequency Merkel   Frequency Steinbrück
27.73   merkel                    0                  20
19.41   arbeitsplatz [job]        14                 0
15.25   steinbruck                11                 0
9.70    koalition [coalition]     7                  0
9.70    international             7                  0
9.70    gemeinsam [together]      7                  0
8.55    griechenland [Greece]     10                 1
8.32    investi [investment]      6                  0
6.93    uberzeug [belief]         5                  0
6.93    okonom [economic]         0                  5
Most distinctive words on Twitter
LL         word                        Frequency Merkel   Frequency Steinbrück
32443.39   merkel                      29672              0
30751.65   steinbrueck                 0                  17780
1507.08    kett [necklace]             1628               34
1241.14    vertrau [trust]             1240               12
863.84     fdp [a coalition partner]   985                29
775.93     nsa                         1809               298
626.49     wikipedia                   40                 502
574.65     twittert [tweets]           40                 469
544.87     koalition [coalition]       864                77
517.99     gold                        669                34
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• nsa affair
• coalition partners
Steinbrück
• suggestion to look sth. up
on Wikipedia
• tweets from his account
during the debate
BDACA1516s2 - Lecture7
Other (non-network-based, statistical) co-occurrence based methods

Enter unsupervised machine learning
(something you already did in your Bachelor – no kidding.)
Some terminology

Supervised machine learning
You have a dataset with both predictor and outcome (independent and dependent variables): a labeled dataset.
Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured.

Unsupervised machine learning
You have no labels. (You did not measure y.)
Again, you already know some techniques from other courses to find out how x1, x2, . . . xi co-occur:
• Principal Component Analysis (PCA)
• Cluster analysis
• . . .
PCA

Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression.

PCA in ACA
• Find out which words co-occur (inductive frame analysis)
• Basically, transform each document into a vector of word frequencies and do a PCA
A so-called term-document matrix

w1,w2,w3,w4,w5,w6 ...
text1, 2, 0, 0, 1, 2, 3 ...
text2, 0, 0, 1, 2, 3, 4 ...
text3, 9, 0, 1, 1, 0, 0 ...
...

These can be simple counts, but also more advanced metrics, like tf-idf scores (where you weight a word's frequency by the inverse of the number of documents in which it occurs), cosine distances, etc.
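Such a weighting can be sketched in a few lines (a minimal sketch of one common tf-idf variant; the toy documents are made up, and real toolkits use slightly different formulas):

```python
from math import log

# Three toy documents, already tokenized into lists of words.
docs = [["coffee", "beer", "coffee"],
        ["coffee", "tea"],
        ["beer", "beer", "gin"]]

def tf_idf(word, doc, docs):
    tf = doc.count(word)                       # raw term frequency in this doc
    df = sum(1 for d in docs if word in d)     # number of docs containing the word
    return tf * log(len(docs) / df)            # frequent-everywhere words score low

# "coffee" appears in 2 of 3 documents, twice in the first one:
print(round(tf_idf("coffee", docs[0], docs), 3))
# "gin" appears in only 1 of 3 documents:
print(round(tf_idf("gin", docs[2], docs), 3))
```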
PCA: implications and problems
• given a term-document matrix, easy to do with any tool
• probably extremely skewed distributions
• some problematic assumptions: does the goal of PCA, namely to find a solution in which each word loads on one component, match real life, where a word can belong to several topics or frames?
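The basic mechanics can be sketched with plain NumPy (a toy example under assumed made-up documents, not the course's tooling):

```python
import numpy as np

texts = ["economy tax deficit economy",
         "football worldcup germany football",
         "tax economy budget deficit"]

# Build a small term-document matrix: one row per document, one column per word.
vocab = sorted(set(w for t in texts for w in t.split()))
tdm = np.array([[t.split().count(w) for w in vocab] for t in texts], dtype=float)

# PCA by hand: center the columns, then take the eigenvectors of the
# covariance matrix that belong to the largest eigenvalues.
centered = tdm - tdm.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(centered, rowvar=False))
components = eigvec[:, ::-1][:, :2]   # the two components with the most variance
scores = centered @ components        # each document expressed as two scores
print(scores.shape)                   # (3, 2)
```

Documents with similar word use end up close together in this low-dimensional space; the component loadings show which words co-occur.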
LDA
Enter topic modeling with Latent Dirichlet Allocation (LDA)
LDA, what's that?
No mathematical details here, but the general idea:
• There are k topics, T1 . . . Tk
• Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk
• On the next level, each topic consists of a specific probability distribution of words
• Thus, based on the frequencies of words in Di, one can infer its distribution of topics
• Note that LDA (like PCA) is a Bag-of-Words (BOW) approach
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this.

sudo pip3 install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles=['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
texts=[art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the
LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

# Print the topics.
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Output: Topics (below) & topic scores (next slide)
1 0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
2 0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
3 0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
4 0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
5 0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
6 0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
7 0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
8 0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
9 0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
11 ...
Next meetings
Wednesday, 11–5
Lab session
Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance!

No meeting on Monday (Pentecost)

Wednesday, 18–5
Supervised machine learning
More Related Content

What's hot (20)

PDF
Analyzing social media with Python and other tools (2/4)
PDF
Analyzing social media with Python and other tools (4/4)
DOCX
Algorithm
PPTX
VLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
PDF
Distributed Natural Language Processing Systems in Python
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (4/4)
Algorithm
VLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
Distributed Natural Language Processing Systems in Python
Ad

Viewers also liked (13)

PPTX
Viennapharmacia ltd
PDF
Reunió pedagògica 1r 2016 2017 definitiva
PPT
e0101
PDF
Taller de refuerzo 1
PPTX
Presentacion
PPTX
People, Culture, & Perceptions
PDF
Conceptualizing and measuring news exposure as network of users and news items
PPTX
6.1 the roman republic
PDF
Andean Summit Mini Agenda
PPTX
Ejercicios de retroalimentacion
PPTX
Tipos de cromosomas
Viennapharmacia ltd
Reunió pedagògica 1r 2016 2017 definitiva
e0101
Taller de refuerzo 1
Presentacion
People, Culture, & Perceptions
Conceptualizing and measuring news exposure as network of users and news items
6.1 the roman republic
Andean Summit Mini Agenda
Ejercicios de retroalimentacion
Tipos de cromosomas
Ad

Similar to BDACA1516s2 - Lecture7 (20)

PPTX
Co word analysis
PDF
1026 telling story from text 2
PPT
PDF
TF-IDF.pdf yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
PPTX
Implicit Sentiment Mining in Twitter Streams
PDF
Bisociative Knowledge Discovery Michael R Berthold
PPTX
Extracting Semantic
PPTX
Topical_Facets
PPT
PPTX
Proposal defense
PDF
Text Analysis Methods for Digital Humanities
PPTX
Natural Language Processing in R (rNLP)
PPTX
Introduction to Text Mining
PPTX
Tg noh jeju_workshop
PPTX
Extracting Semantic User Networks from Informal Communication Exchanges
PDF
II-SDV 2013 The Analytics Challenges Posed by Big Data
PPTX
Content Analysis
PDF
Twitter data analysis using R
PPTX
Trend Analysis
PPTX
Interpreting Embeddings with Comparison
Co word analysis
1026 telling story from text 2
TF-IDF.pdf yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Implicit Sentiment Mining in Twitter Streams
Bisociative Knowledge Discovery Michael R Berthold
Extracting Semantic
Topical_Facets
Proposal defense
Text Analysis Methods for Digital Humanities
Natural Language Processing in R (rNLP)
Introduction to Text Mining
Tg noh jeju_workshop
Extracting Semantic User Networks from Informal Communication Exchanges
II-SDV 2013 The Analytics Challenges Posed by Big Data
Content Analysis
Twitter data analysis using R
Trend Analysis
Interpreting Embeddings with Comparison

More from Department of Communication Science, University of Amsterdam (8)

PDF
Media diets in an age of apps and social media: Dealing with a third layer of...
PDF
Data Science: Case "Political Communication 2/2"
PDF
Data Science: Case "Political Communication 1/2"
PPTX

Recently uploaded (20)

PDF
IGGE1 Understanding the Self1234567891011
PPTX
UNIT III MENTAL HEALTH NURSING ASSESSMENT
PDF
Classroom Observation Tools for Teachers
PDF
Hazard Identification & Risk Assessment .pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
PPTX
Lesson notes of climatology university.
PDF
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
PDF
Empowerment Technology for Senior High School Guide
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PPTX
History, Philosophy and sociology of education (1).pptx
PPTX
Introduction to Building Materials
PPTX
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
PDF
LDMMIA Reiki Yoga Finals Review Spring Summer
PPTX
Final Presentation General Medicine 03-08-2024.pptx
DOC
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PPTX
Digestion and Absorption of Carbohydrates, Proteina and Fats
PDF
Complications of Minimal Access Surgery at WLH
IGGE1 Understanding the Self1234567891011
UNIT III MENTAL HEALTH NURSING ASSESSMENT
Classroom Observation Tools for Teachers
Hazard Identification & Risk Assessment .pdf
Final Presentation General Medicine 03-08-2024.pptx
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
Lesson notes of climatology university.
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
Empowerment Technology for Senior High School Guide
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
History, Philosophy and sociology of education (1).pptx
Introduction to Building Materials
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
LDMMIA Reiki Yoga Finals Review Spring Summer
Final Presentation General Medicine 03-08-2024.pptx
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
Supply Chain Operations Speaking Notes -ICLT Program
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
Digestion and Absorption of Carbohydrates, Proteina and Fats
Complications of Minimal Access Surgery at WLH

BDACA1516s2 - Lecture7

  • 1. Co-occurrences Networks Other co-occurrence based methods Next meetings Big Data and Automated Content Analysis Week 7 – Monday »Co-occurring words« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 9 May 2016 Big Data and Automated Content Analysis Damian Trilling
  • 2. Co-occurrences Networks Other co-occurrence based methods Next meetings Today 1 Integrating word counts and network analysis: Word co-occurrences The idea A real-life example 2 Other co-occurrence based methods PCA LDA 3 Next meetings, & final project Big Data and Automated Content Analysis Damian Trilling
  • 3. Co-occurrences Networks Other co-occurrence based methods Next meetings Integrating word counts and network analysis: Word co-occurrences Big Data and Automated Content Analysis Damian Trilling
  • 4. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Simple word count We already know this. 1 from collections import Counter 2 tekst="this is a test where many test words occur several times this is because it is a test yes indeed it is" 3 c=Counter(tekst.split()) 4 print "The top 5 are: " 5 for woord,aantal in c.most_common(5): 6 print (aantal,woord) Big Data and Automated Content Analysis Damian Trilling
  • 5. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Simple word count The output: 1 The top 5 are: 2 4 is 3 3 test 4 2 a 5 2 this 6 2 it Big Data and Automated Content Analysis Damian Trilling
  • 6. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea What if we could. . . . . . count the frequency of combinations of words? Big Data and Automated Content Analysis Damian Trilling
  • 7. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea What if we could. . . . . . count the frequency of combinations of words? As in: Which words do typical occur together in the same tweet (or paragraph, or sentence, . . . ) Big Data and Automated Content Analysis Damian Trilling
  • 8. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea We can — with the combinations() function 1 >>> from itertools import combinations 2 >>> words="Hoi this is a test test test a test it is".split() 3 >>> print ([e for e in combinations(words,2)]) 4 [(’Hoi’, ’this’), (’Hoi’, ’is’), (’Hoi’, ’a’), (’Hoi’, ’test’), (’Hoi’, ’test’), (’Hoi’, ’test’), (’Hoi’, ’a’), (’Hoi’, ’test’), (’Hoi’, ’ it’), (’Hoi’, ’is’), (’this’, ’is’), (’this’, ’a’), (’this’, ’test ’), (’this’, ’test’), (’this’, ’test’), (’this’, ’a’), (’this’, ’ test’), (’this’, ’it’), (’this’, ’is’), (’is’, ’a’), (’is’, ’test’) , (’is’, ’test’), (’is’, ’test’), (’is’, ’a’), (’is’, ’test’), (’is ’, ’it’), (’is’, ’is’), (’a’, ’test’), (’a’, ’test’), (’a’, ’test’) , (’a’, ’a’), (’a’, ’test’), (’a’, ’it’), (’a’, ’is’), (’test’, ’ test’), (’test’, ’test’), (’test’, ’a’), (’test’, ’test’), (’test’, ’it’), (’test’, ’is’), (’test’, ’test’), (’test’, ’a’), (’test’, ’ test’), (’test’, ’it’), (’test’, ’is’), (’test’, ’a’), (’test’, ’ test’), (’test’, ’it’), (’test’, ’is’), (’a’, ’test’), (’a’, ’it’), (’a’, ’is’), (’test’, ’it’), (’test’, ’is’), (’it’, ’is’)] Big Data and Automated Content Analysis Damian Trilling
  • 9. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Count co-occurrences 1 from collections import defaultdict 2 from itertools import combinations 3 4 tweets=["i am having coffee with my friend","i like coffee","i like coffee and beer","beer i like"] 5 cooc=defaultdict(int) 6 7 for tweet in tweets: 8 words=tweet.split() 9 for a,b in set(combinations(words,2)): 10 if (b,a) in cooc: 11 a,b = b,a 12 if a!=b: 13 cooc[(a,b)]+=1 14 15 for combi in sorted(cooc,key=cooc.get,reverse=True): 16 print (cooc[combi],"t",combi) Big Data and Automated Content Analysis Damian Trilling
  • 10. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Count co-occurrences The output: 1 3 (’i’, ’coffee’) 2 3 (’i’, ’like’) 3 2 (’i’, ’beer’) 4 2 (’like’, ’beer’) 5 2 (’like’, ’coffee’) 6 1 (’coffee’, ’beer’) 7 1 (’and’, ’beer’) 8 ... 9 ... 10 ... Big Data and Automated Content Analysis Damian Trilling
  • 11. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea From a list of co-occurrences to a network Let’s conceptualize each word as a node and each cooccurrence as an edge • node weight = word frequency • edge weight = number of coocurrences A GDF file offers all of this and looks like this: Big Data and Automated Content Analysis Damian Trilling
  • 12. 1 nodedef>name VARCHAR, width DOUBLE 2 coffee,3 3 beer,2 4 i,4 5 and,1 6 with,1 7 friend,1 8 having,1 9 like,3 10 am,1 11 my,1 12 edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE 13 coffee,beer,1 14 i,beer,2 15 and,beer,1 16 with,friend,1 17 coffee,with,1 18 i,and,1 19 having,friend,1 20 like,beer,2 21 am,friend,1 22 i,am,1 23 i,coffee,3 24 i,with,1 25 am,having,1 26 i,having,1 27 coffee,and,1 28 like,coffee,2
  • 13. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea How to represent the cooccurrences graphically? A two-step approach 1 Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python) 2 Open the GDF file in Gephi for visualization and/or network analysis Big Data and Automated Content Analysis Damian Trilling
  • 14. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Gephi • Install (NOT in the VM) from https://gephi.org • By problems on MacOS, see what I wrote about Gephi here: http://www.damiantrilling.net/ setting-up-my-new-macbook/ • I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/ t2KWKVZtQWZIe2Cj8qXcW5KF • Further: see the materials I mailed to you Big Data and Automated Content Analysis Damian Trilling
  • 15. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example A real-life example Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review,33, 259–276. doi: 10.1177/0894439314537886 Big Data and Automated Content Analysis Damian Trilling
  • 16. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Commenting the TV debate on Twitter The viewers • Commenting television programs on social networks has become a regular pattern of behavior (Courtois & d’Heer, 2012) • User comments have shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009) • Topic and speaker effect more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014) Big Data and Automated Content Analysis Damian Trilling
  • 17. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Research Questions To which extent are the statements politicians make during a TV debate reflected in online live discussions of the debate? RQ1 Which topics are emphasized by the candidates? RQ2 Which topics are emphasized by the Twitter users? RQ3 With which topics are the two candidates associated on Twitter? Big Data and Automated Content Analysis Damian Trilling
  • 18. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Method Big Data and Automated Content Analysis Damian Trilling
  • 19. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Method The data • debate transcript • tweets containing #tvduell • N = 120, 557 tweets by N = 24, 796 users • 22-9-2013, 20.30-22.00 Big Data and Automated Content Analysis Damian Trilling
  • 20. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Method The data • debate transcript • tweets containing #tvduell • N = 120, 557 tweets by N = 24, 796 users • 22-9-2013, 20.30-22.00 The analysis • Series of self-written Python scripts: 1 preprocessing (stemming, stopword removal) 2 word counts 3 word log likelihood (corpus comparison) • Stata: regression analysis Big Data and Automated Content Analysis Damian Trilling
  • 21. 02000400060008000 −60 −50 −40 −30 −20 −10 10 20 30 40 50 60 70 80 100 110 120 130 140 150 start end
  • 22. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Relationship between words on TV and on Twitter 0246810 ln(wordonTwitter+1) 0 1 2 3 ln (word on TV +1) Big Data and Automated Content Analysis Damian Trilling
• 23. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Word frequency TV ⇒ word frequency Twitter
DV: ln(Twitter + 1)          Model 1            Model 2 (together w/ M.)   Model 3 (together w/ S.)
ln (TV M. + 1)   b (SE)      1.59 (.052) ***    1.54 (.041) ***            .77 (.037) ***
                 beta        .21                .26                        .14
ln (TV S. + 1)   b (SE)      1.29 (.051) ***    .88 (.041) ***             1.25 (.037) ***
                 beta        .17                .15                        .24
intercept                    1.64 (.008) ***    .87 (.007) ***             .60 (.006) ***
R2                           .100               .115                       .100
b M. & S. differ?            F(1, 21408) = 12.29, p < .001   F(1, 21408) = 96.69, p < .001   F(1, 21408) = 63.38, p < .001
M = Merkel; S = Steinbrück Big Data and Automated Content Analysis Damian Trilling
• 24. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Most distinctive words on TV
LL      word                       Frequency Merkel   Frequency Steinbrück
27,73   merkel                     0                  20
19,41   arbeitsplatz [job]         14                 0
15,25   steinbruck                 11                 0
9,70    koalition [coalition]      7                  0
9,70    international              7                  0
9,70    gemeinsam [together]       7                  0
8,55    griechenland [Greece]      10                 1
8,32    investi [investment]       6                  0
6,93    uberzeug [belief]          5                  0
6,93    okonom [economic]          0                  5
Big Data and Automated Content Analysis Damian Trilling
• 25. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Most distinctive words on Twitter
LL         word                        Frequency Merkel   Frequency Steinbrück
32443,39   merkel                      29672              0
30751,65   steinbrueck                 0                  17780
1507,08    kett [necklace]             1628               34
1241,14    vertrau [trust]             1240               12
863,84     fdp [a coalition partner]   985                29
775,93     nsa                         1809               298
626,49     wikipedia                   40                 502
574,65     twittert [tweets]           40                 469
544,87     koalition [coalition]       864                77
517,99     gold                        669                34
Big Data and Automated Content Analysis Damian Trilling
  • 26. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Putting the pieces together Merkel • necklace • trust (sarcastic) • nsa affair • coalition partners Steinbrück • suggestion to look sth. up on Wikipedia • tweets from his account during the debate Big Data and Automated Content Analysis Damian Trilling
• 28. Other (non-network-based, statistical) co-occurrence based methods
• 30. Enter unsupervised machine learning (something you already did in your Bachelor’s – no kidding.)
  • 31. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Big Data and Automated Content Analysis Damian Trilling
  • 32. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured Big Data and Automated Content Analysis Damian Trilling
  • 33. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Unsupervised machine learning You have no labels. Big Data and Automated Content Analysis Damian Trilling
  • 34. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Unsupervised machine learning You have no labels. (You did not measure y) Big Data and Automated Content Analysis Damian Trilling
• 35. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Unsupervised machine learning You have no labels. Again, you already know some techniques from other courses to find out how x1, x2, . . . xi co-occur: • Principal Component Analysis (PCA) • Cluster analysis • . . . Big Data and Automated Content Analysis Damian Trilling
  • 36. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA Principal Component Analysis? How does that fit in here? Big Data and Automated Content Analysis Damian Trilling
  • 37. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA Principal Component Analysis? How does that fit in here? In fact, PCA is used everywhere, even in image compression Big Data and Automated Content Analysis Damian Trilling
• 38. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA Principal Component Analysis? How does that fit in here? PCA in ACA • Find out which words co-occur (inductive frame analysis) • Basically, transform each document into a vector of word frequencies and do a PCA Big Data and Automated Content Analysis Damian Trilling
  • 39. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA A so-called term-document-matrix 1 w1,w2,w3,w4,w5,w6 ... 2 text1, 2, 0, 0, 1, 2, 3 ... 3 text2, 0, 0, 1, 2, 3, 4 ... 4 text3, 9, 0, 1, 1, 0, 0 ... 5 ... Big Data and Automated Content Analysis Damian Trilling
• 40. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA A so-called term-document matrix 1 w1,w2,w3,w4,w5,w6 ... 2 text1, 2, 0, 0, 1, 2, 3 ... 3 text2, 0, 0, 1, 2, 3, 4 ... 4 text3, 9, 0, 1, 1, 0, 0 ... 5 ... The cells can contain simple counts, but also more advanced metrics, like tf-idf scores (where you weight a word’s frequency by the inverse of the number of documents in which it occurs), cosine distances, etc. Big Data and Automated Content Analysis Damian Trilling
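Building such a term-document matrix can be sketched in a few lines. The slides do not prescribe a specific library; this example assumes scikit-learn, and the three toy texts are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# invented toy corpus
texts = ["the tax deficit is higher than expected",
         "germany won the world cup",
         "the deficit shrinks as the economy grows"]

# simple counts: one row per document, one column per word
count_vec = CountVectorizer()
tdm = count_vec.fit_transform(texts)
print(tdm.toarray())            # documents x terms matrix of raw counts
print(count_vec.vocabulary_)    # maps each word to its column index

# the same matrix with tf-idf weights instead of raw counts
tfidf_vec = TfidfVectorizer()
tdm_weighted = tfidf_vec.fit_transform(texts)
```

Both vectorizers return a sparse matrix, which matters in practice: with thousands of documents and tens of thousands of words, most cells are zero.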
• 41. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA PCA: implications and problems • given a term-document matrix, easy to do with any statistical tool • but the word frequency distributions are probably extremely skewed • some problematic assumptions: PCA seeks a solution in which each word loads on one component, but does that match real life, where a word can belong to several topics or frames? Big Data and Automated Content Analysis Damian Trilling
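The "transform each document into a vector of word frequencies and do a PCA" recipe can be sketched as follows. Again, scikit-learn is an assumption (the slides name no library), and the six single-topic toy documents are invented so that the two clusters are easy to see.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# invented toy corpus: three "economy" and three "football" documents
texts = ["tax deficit budget economy",
         "economy budget tax growth",
         "deficit growth economy budget",
         "football match goal team",
         "team match goal win",
         "goal team football win"]

# term-document matrix with tf-idf weights, made dense for PCA
vec = TfidfVectorizer()
X = vec.fit_transform(texts).toarray()

# reduce the word space to two components
pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # each document's position on the components

# inspect which words load highest on each component
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
for i, component in enumerate(pca.components_):
    top = component.argsort()[::-1][:3]
    print("Component", i, [terms[j] for j in top])
```

With a clean toy corpus like this, the first component separates the two topics; with real news data, expect the skew and single-loading problems listed above.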
  • 42. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA Enter topic modeling with Latent Dirichlet Allocation (LDA) Big Data and Automated Content Analysis Damian Trilling
• 43. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA LDA, what’s that? No mathematical details here, but the general idea • There are k topics, T1 . . . Tk • Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk • On the next level, each topic consists of a specific probability distribution of words • Thus, based on the frequencies of words in Di, one can infer its distribution of topics • Note that LDA (like PCA) is a Bag-of-Words (BOW) approach Big Data and Automated Content Analysis Damian Trilling
• 44. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA Doing an LDA in Python You can use gensim (Řehůřek & Sojka, 2010) for this.

sudo pip3 install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles = ['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
texts = [art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]

Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA. Big Data and Automated Content Analysis Damian Trilling
• 45.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

# Print the topics
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
• 46. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA Output: Topics (below) & topic scores (next slide)
1 0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
2 0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
3 0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
4 0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
5 0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
6 0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
7 0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
8 0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
9 0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
11 ...
Big Data and Automated Content Analysis Damian Trilling
  • 48. Co-occurrences Networks Other co-occurrence based methods Next meetings Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 49. Co-occurrences Networks Other co-occurrence based methods Next meetings Next meetings Wednesday, 11–5 Lab session Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance! No meeting on Monday (Pentecost) Wednesday, 18–5 Supervised machine learning Big Data and Automated Content Analysis Damian Trilling