Co-occurrences Networks Other co-occurrence based methods Next meetings
Big Data and Automated Content Analysis
Week 7 – Monday
»Co-occurring words«
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
9 May 2016
Big Data and Automated Content Analysis Damian Trilling
Today
1 Integrating word counts and network analysis: Word co-occurrences
The idea
A real-life example
2 Other co-occurrence based methods
PCA
LDA
3 Next meetings, & final project
Integrating word counts and network analysis:
Word co-occurrences
The idea
Simple word count
We already know this.
from collections import Counter
tekst = "this is a test where many test words occur several times this is because it is a test yes indeed it is"
c = Counter(tekst.split())
print("The top 5 are:")
for woord, aantal in c.most_common(5):
    print(aantal, woord)
Simple word count
The output:
The top 5 are:
4 is
3 test
2 a
2 this
2 it
What if we could. . .
. . . count the frequency of combinations of words?
As in: Which words typically occur together in the same tweet (or paragraph, or sentence, . . . )?
We can — with the combinations() function
>>> from itertools import combinations
>>> words = "Hoi this is a test test test a test it is".split()
>>> print([e for e in combinations(words, 2)])
[('Hoi', 'this'), ('Hoi', 'is'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'test'), ('Hoi', 'a'), ('Hoi', 'test'), ('Hoi', 'it'), ('Hoi', 'is'), ('this', 'is'), ('this', 'a'), ('this', 'test'), ('this', 'test'), ('this', 'test'), ('this', 'a'), ('this', 'test'), ('this', 'it'), ('this', 'is'), ('is', 'a'), ('is', 'test'), ('is', 'test'), ('is', 'test'), ('is', 'a'), ('is', 'test'), ('is', 'it'), ('is', 'is'), ('a', 'test'), ('a', 'test'), ('a', 'test'), ('a', 'a'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'test'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'test'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('test', 'a'), ('test', 'test'), ('test', 'it'), ('test', 'is'), ('a', 'test'), ('a', 'it'), ('a', 'is'), ('test', 'it'), ('test', 'is'), ('it', 'is')]
Count co-occurrences
from collections import defaultdict
from itertools import combinations

tweets = ["i am having coffee with my friend", "i like coffee", "i like coffee and beer", "beer i like"]
cooc = defaultdict(int)

for tweet in tweets:
    words = tweet.split()
    for a, b in set(combinations(words, 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

for combi in sorted(cooc, key=cooc.get, reverse=True):
    print(cooc[combi], "\t", combi)
Count co-occurrences
The output:
3 ('i', 'coffee')
3 ('i', 'like')
2 ('i', 'beer')
2 ('like', 'beer')
2 ('like', 'coffee')
1 ('coffee', 'beer')
1 ('and', 'beer')
...
From a list of co-occurrences to a network
Let's conceptualize each word as a node and each co-occurrence as an edge:
• node weight = word frequency
• edge weight = number of co-occurrences
A GDF file offers all of this and looks like this:
nodedef>name VARCHAR, width DOUBLE
coffee,3
beer,2
i,4
and,1
with,1
friend,1
having,1
like,3
am,1
my,1
edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE
coffee,beer,1
i,beer,2
and,beer,1
with,friend,1
coffee,with,1
i,and,1
having,friend,1
like,beer,2
am,friend,1
i,am,1
i,coffee,3
i,with,1
am,having,1
i,having,1
coffee,and,1
like,coffee,2
How to represent the co-occurrences graphically?
A two-step approach
1 Save as a GDF file (the format seems easy to understand, so
we could write a function for this in Python)
2 Open the GDF file in Gephi for visualization and/or network
analysis
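The first step, writing such a GDF file from Python, could be sketched like this (a minimal sketch; the function name and output filename are assumptions, reusing the counting code from the previous slides):

```python
from collections import Counter, defaultdict
from itertools import combinations

def save_gdf(wordcounts, cooc, filename):
    # Write nodes (word, frequency) and edges (word pair, co-occurrence
    # count) in the GDF format shown on the previous slide.
    with open(filename, "w", encoding="utf-8") as f:
        f.write("nodedef>name VARCHAR, width DOUBLE\n")
        for word, count in wordcounts.items():
            f.write("{},{}\n".format(word, count))
        f.write("edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE\n")
        for (a, b), weight in cooc.items():
            f.write("{},{},{}\n".format(a, b, weight))

tweets = ["i am having coffee with my friend", "i like coffee",
          "i like coffee and beer", "beer i like"]

# Node weights: simple word frequencies across all tweets.
wordcounts = Counter(word for tweet in tweets for word in tweet.split())

# Edge weights: number of tweets in which two words co-occur.
cooc = defaultdict(int)
for tweet in tweets:
    words = tweet.split()
    for a, b in set(combinations(words, 2)):
        if (b, a) in cooc:
            a, b = b, a
        if a != b:
            cooc[(a, b)] += 1

save_gdf(wordcounts, cooc, "cooccurrences.gdf")
```

The resulting file can be opened directly in Gephi.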
Gephi
• Install (NOT in the VM) from https://gephi.org
• For problems on macOS, see what I wrote about Gephi here: http://www.damiantrilling.net/setting-up-my-new-macbook/
• I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/t2KWKVZtQWZIe2Cj8qXcW5KF
• Further: see the materials I mailed to you
A real-life example
Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review, 33, 259–276. doi: 10.1177/0894439314537886
Commenting on the TV debate on Twitter
The viewers
• Commenting on television programs on social networks has become a regular pattern of behavior (Courtois & d'Heer, 2012)
• User comments have been shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009)
• Topic and speaker effects are more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014)
Research Questions
To what extent are the statements politicians make during a TV debate reflected in online live discussions of the debate?
RQ1 Which topics are emphasized by the candidates?
RQ2 Which topics are emphasized by the Twitter users?
RQ3 With which topics are the two candidates associated
on Twitter?
Method
The data
• debate transcript
• tweets containing #tvduell
• N = 120,557 tweets by N = 24,796 users
• 22-9-2013, 20.30-22.00
The analysis
• Series of self-written Python scripts:
1 preprocessing (stemming, stopword removal)
2 word counts
3 word log likelihood (corpus comparison)
• Stata: regression analysis
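The corpus-comparison step, word log likelihood, can be sketched as follows (a minimal sketch of the Dunning-style log-likelihood statistic; the function name and the simplified two-term formula are assumptions, not the paper's exact code):

```python
from math import log

def log_likelihood(a, b, n1, n2):
    # a, b: frequency of one word in corpus 1 and corpus 2
    # n1, n2: total number of words in each corpus
    e1 = n1 * (a + b) / (n1 + n2)   # expected frequency in corpus 1
    e2 = n2 * (a + b) / (n1 + n2)   # expected frequency in corpus 2
    ll = 0.0
    if a > 0:
        ll += a * log(a / e1)
    if b > 0:
        ll += b * log(b / e2)
    return 2 * ll

# A word used equally often in two equally large corpora is not distinctive:
print(log_likelihood(5, 5, 1000, 1000))
# A word that occurs only in one of the corpora gets a high score:
print(log_likelihood(5, 0, 1000, 1000))
```

Ranking all words by this score yields lists of the most distinctive words per corpus, as on the following slides.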
[Figure: tweet volume over time (y-axis 0 to 8,000), x-axis from 60 minutes before to 150 minutes after the debate start; start and end of the debate are marked]
Relationship between words on TV and on Twitter
[Figure: scatterplot of ln(word on Twitter + 1), y-axis 0 to 10, against ln(word on TV + 1), x-axis 0 to 3]
Word frequency TV ⇒ word frequency Twitter

                    Model 1            Model 2            Model 3
                    ln(Twitter +1)     ln(Twitter +1)     ln(Twitter +1)
                                       together w/ M.     together w/ S.
                    b (SE) / beta      b (SE) / beta      b (SE) / beta
ln (TV M. +1)       1.59 (.052) ***    1.54 (.041) ***    .77 (.037) ***
                    .21                .26                .14
ln (TV S. +1)       1.29 (.051) ***    .88 (.041) ***     1.25 (.037) ***
                    .17                .15                .24
intercept           1.64 (.008) ***    .87 (.007) ***     .60 (.006) ***
R2                  .100               .115               .100
b M. & S. differ?   F(1, 21408) =      F(1, 21408) =      F(1, 21408) =
                    12.29, p < .001    96.69, p < .001    63.38, p < .001

M = Merkel; S = Steinbrück
Most distinctive words on TV
LL      word                      Frequency Merkel   Frequency Steinbrück
27.73   merkel                    0                  20
19.41   arbeitsplatz [job]        14                 0
15.25   steinbruck                11                 0
9.70    koalition [coalition]     7                  0
9.70    international             7                  0
9.70    gemeinsam [together]      7                  0
8.55    griechenland [Greece]     10                 1
8.32    investi [investment]      6                  0
6.93    uberzeug [belief]         5                  0
6.93    okonom [economic]         0                  5
Most distinctive words on Twitter
LL         word                        Frequency Merkel   Frequency Steinbrück
32443.39   merkel                      29672              0
30751.65   steinbrueck                 0                  17780
1507.08    kett [necklace]             1628               34
1241.14    vertrau [trust]             1240               12
863.84     fdp [a coalition partner]   985                29
775.93     nsa                         1809               298
626.49     wikipedia                   40                 502
574.65     twittert [tweets]           40                 469
544.87     koalition [coalition]       864                77
517.99     gold                        669                34
Putting the pieces together
Merkel
• necklace
• trust (sarcastic)
• nsa affair
• coalition partners
Steinbrück
• suggestion to look sth. up
on Wikipedia
• tweets from his account
during the debate
BDACA1516s2 - Lecture7
Other (non-network-based, statistical) co-occurrence based methods

Enter unsupervised machine learning
(something you already did in your Bachelor – no kidding.)
Some terminology

Supervised machine learning
You have a dataset with both predictor and outcome (independent and dependent variables): a labeled dataset.
Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured.

Unsupervised machine learning
You have no labels. (You did not measure y.)
Again, you already know some techniques from other courses to find out how x1, x2, . . . xi co-occur:
• Principal Component Analysis (PCA)
• Cluster analysis
• . . .
PCA

Principal Component Analysis? How does that fit in here?
In fact, PCA is used everywhere, even in image compression.

PCA in ACA
• Find out which words co-occur (inductive frame analysis)
• Basically, transform each document into a vector of word frequencies and do a PCA
A so-called term-document matrix

w1,w2,w3,w4,w5,w6 ...
text1, 2, 0, 0, 1, 2, 3 ...
text2, 0, 0, 1, 2, 3, 4 ...
text3, 9, 0, 1, 1, 0, 0 ...
...

These can be simple counts, but also more advanced metrics, like tf-idf scores (where you weight a word's frequency by the inverse of the number of documents in which it occurs), cosine distances, etc.
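Such a weighting can be sketched in a few lines (a minimal sketch of one common tf-idf variant; the toy documents are made up, and real toolkits use slightly different formulas):

```python
from math import log

# Three toy documents, already tokenized into lists of words.
docs = [["coffee", "beer", "coffee"],
        ["coffee", "tea"],
        ["beer", "beer", "gin"]]

def tf_idf(word, doc, docs):
    tf = doc.count(word)                       # raw term frequency in this doc
    df = sum(1 for d in docs if word in d)     # number of docs containing the word
    return tf * log(len(docs) / df)            # frequent-everywhere words score low

# "coffee" appears in 2 of 3 documents, twice in the first one:
print(round(tf_idf("coffee", docs[0], docs), 3))
# "gin" appears in only 1 of 3 documents:
print(round(tf_idf("gin", docs[2], docs), 3))
```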
PCA: implications and problems
• given a term-document matrix, easy to do with any tool
• probably extremely skewed distributions
• some problematic assumptions: does the goal of PCA, namely to find a solution in which each word loads on one component, match real life, where a word can belong to several topics or frames?
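The basic mechanics can be sketched with plain NumPy (a toy example under assumed made-up documents, not the course's tooling):

```python
import numpy as np

texts = ["economy tax deficit economy",
         "football worldcup germany football",
         "tax economy budget deficit"]

# Build a small term-document matrix: one row per document, one column per word.
vocab = sorted(set(w for t in texts for w in t.split()))
tdm = np.array([[t.split().count(w) for w in vocab] for t in texts], dtype=float)

# PCA by hand: center the columns, then take the eigenvectors of the
# covariance matrix that belong to the largest eigenvalues.
centered = tdm - tdm.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(centered, rowvar=False))
components = eigvec[:, ::-1][:, :2]   # the two components with the most variance
scores = centered @ components        # each document expressed as two scores
print(scores.shape)                   # (3, 2)
```

Documents with similar word use end up close together in this low-dimensional space; the component loadings show which words co-occur.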
LDA
Enter topic modeling with Latent Dirichlet Allocation (LDA)
LDA, what's that?
No mathematical details here, but the general idea:
• There are k topics, T1 . . . Tk
• Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk
• On the next level, each topic consists of a specific probability distribution of words
• Thus, based on the frequencies of words in Di, one can infer its distribution of topics
• Note that LDA (like PCA) is a Bag-of-Words (BOW) approach
Doing an LDA in Python
You can use gensim (Řehůřek & Sojka, 2010) for this.

sudo pip3 install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles=['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
texts=[art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]
Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the
LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model.
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

# Print the topics.
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
Output: Topics (below) & topic scores (next slide)
1 0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
2 0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
3 0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
4 0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
5 0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
6 0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
7 0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
8 0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
9 0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
11 ...
Next meetings
Wednesday, 11–5
Lab session
Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance!

No meeting on Monday (Pentecost)

Wednesday, 18–5
Supervised machine learning
More Related Content

What's hot (20)

PDF
Analyzing social media with Python and other tools (2/4)
PDF
Analyzing social media with Python and other tools (4/4)
DOCX
Algorithm
PPTX
VLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
PDF
Distributed Natural Language Processing Systems in Python
Analyzing social media with Python and other tools (2/4)
Analyzing social media with Python and other tools (4/4)
Algorithm
VLDB 2015 Tutorial: On Uncertain Graph Modeling and Queries
Distributed Natural Language Processing Systems in Python
Ad

Viewers also liked (13)

PPTX
Viennapharmacia ltd
PDF
Reunió pedagògica 1r 2016 2017 definitiva
PPT
e0101
PDF
Taller de refuerzo 1
PPTX
Presentacion
PPTX
People, Culture, & Perceptions
PDF
Conceptualizing and measuring news exposure as network of users and news items
PPTX
6.1 the roman republic
PDF
Andean Summit Mini Agenda
PPTX
Ejercicios de retroalimentacion
PPTX
Tipos de cromosomas
Viennapharmacia ltd
Reunió pedagògica 1r 2016 2017 definitiva
e0101
Taller de refuerzo 1
Presentacion
People, Culture, & Perceptions
Conceptualizing and measuring news exposure as network of users and news items
6.1 the roman republic
Andean Summit Mini Agenda
Ejercicios de retroalimentacion
Tipos de cromosomas
Ad

Similar to BDACA1516s2 - Lecture7 (20)

PPTX
Co word analysis
PDF
1026 telling story from text 2
PPT
PDF
TF-IDF.pdf yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
PPTX
Implicit Sentiment Mining in Twitter Streams
PDF
Bisociative Knowledge Discovery Michael R Berthold
PPTX
Extracting Semantic
PPTX
Topical_Facets
PPT
PPTX
Proposal defense
PDF
Text Analysis Methods for Digital Humanities
PPTX
Natural Language Processing in R (rNLP)
PPTX
Introduction to Text Mining
PPTX
Tg noh jeju_workshop
PPTX
Extracting Semantic User Networks from Informal Communication Exchanges
PDF
II-SDV 2013 The Analytics Challenges Posed by Big Data
PPTX
Content Analysis
PDF
Twitter data analysis using R
PPTX
Trend Analysis
PPTX
Interpreting Embeddings with Comparison
Co word analysis
1026 telling story from text 2
TF-IDF.pdf yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Implicit Sentiment Mining in Twitter Streams
Bisociative Knowledge Discovery Michael R Berthold
Extracting Semantic
Topical_Facets
Proposal defense
Text Analysis Methods for Digital Humanities
Natural Language Processing in R (rNLP)
Introduction to Text Mining
Tg noh jeju_workshop
Extracting Semantic User Networks from Informal Communication Exchanges
II-SDV 2013 The Analytics Challenges Posed by Big Data
Content Analysis
Twitter data analysis using R
Trend Analysis
Interpreting Embeddings with Comparison

More from Department of Communication Science, University of Amsterdam (8)

PDF
Media diets in an age of apps and social media: Dealing with a third layer of...
PDF
Data Science: Case "Political Communication 2/2"
PDF
Data Science: Case "Political Communication 1/2"
PPTX

Recently uploaded (20)

PDF
IGGE1 Understanding the Self1234567891011
PPTX
UNIT III MENTAL HEALTH NURSING ASSESSMENT
PDF
Classroom Observation Tools for Teachers
PDF
Hazard Identification & Risk Assessment .pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
PPTX
Lesson notes of climatology university.
PDF
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
PDF
Empowerment Technology for Senior High School Guide
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PPTX
History, Philosophy and sociology of education (1).pptx
PPTX
Introduction to Building Materials
PPTX
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
PDF
LDMMIA Reiki Yoga Finals Review Spring Summer
PPTX
Final Presentation General Medicine 03-08-2024.pptx
DOC
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PPTX
Digestion and Absorption of Carbohydrates, Proteina and Fats
PDF
Complications of Minimal Access Surgery at WLH
IGGE1 Understanding the Self1234567891011
UNIT III MENTAL HEALTH NURSING ASSESSMENT
Classroom Observation Tools for Teachers
Hazard Identification & Risk Assessment .pdf
Final Presentation General Medicine 03-08-2024.pptx
Radiologic_Anatomy_of_the_Brachial_plexus [final].pptx
Lesson notes of climatology university.
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
Empowerment Technology for Senior High School Guide
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
History, Philosophy and sociology of education (1).pptx
Introduction to Building Materials
Tissue processing ( HISTOPATHOLOGICAL TECHNIQUE
LDMMIA Reiki Yoga Finals Review Spring Summer
Final Presentation General Medicine 03-08-2024.pptx
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
Supply Chain Operations Speaking Notes -ICLT Program
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
Digestion and Absorption of Carbohydrates, Proteina and Fats
Complications of Minimal Access Surgery at WLH

BDACA1516s2 - Lecture7

  • 1. Co-occurrences Networks Other co-occurrence based methods Next meetings Big Data and Automated Content Analysis Week 7 – Monday »Co-occurring words« Damian Trilling d.c.trilling@uva.nl @damian0604 www.damiantrilling.net Afdeling Communicatiewetenschap Universiteit van Amsterdam 9 May 2016 Big Data and Automated Content Analysis Damian Trilling
  • 2. Co-occurrences Networks Other co-occurrence based methods Next meetings Today 1 Integrating word counts and network analysis: Word co-occurrences The idea A real-life example 2 Other co-occurrence based methods PCA LDA 3 Next meetings, & final project Big Data and Automated Content Analysis Damian Trilling
  • 3. Co-occurrences Networks Other co-occurrence based methods Next meetings Integrating word counts and network analysis: Word co-occurrences Big Data and Automated Content Analysis Damian Trilling
  • 4. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Simple word count We already know this. 1 from collections import Counter 2 tekst="this is a test where many test words occur several times this is because it is a test yes indeed it is" 3 c=Counter(tekst.split()) 4 print "The top 5 are: " 5 for woord,aantal in c.most_common(5): 6 print (aantal,woord) Big Data and Automated Content Analysis Damian Trilling
  • 5. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Simple word count The output: 1 The top 5 are: 2 4 is 3 3 test 4 2 a 5 2 this 6 2 it Big Data and Automated Content Analysis Damian Trilling
  • 6. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea What if we could. . . . . . count the frequency of combinations of words? Big Data and Automated Content Analysis Damian Trilling
  • 7. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea What if we could. . . . . . count the frequency of combinations of words? As in: Which words do typical occur together in the same tweet (or paragraph, or sentence, . . . ) Big Data and Automated Content Analysis Damian Trilling
  • 8. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea We can — with the combinations() function 1 >>> from itertools import combinations 2 >>> words="Hoi this is a test test test a test it is".split() 3 >>> print ([e for e in combinations(words,2)]) 4 [(’Hoi’, ’this’), (’Hoi’, ’is’), (’Hoi’, ’a’), (’Hoi’, ’test’), (’Hoi’, ’test’), (’Hoi’, ’test’), (’Hoi’, ’a’), (’Hoi’, ’test’), (’Hoi’, ’ it’), (’Hoi’, ’is’), (’this’, ’is’), (’this’, ’a’), (’this’, ’test ’), (’this’, ’test’), (’this’, ’test’), (’this’, ’a’), (’this’, ’ test’), (’this’, ’it’), (’this’, ’is’), (’is’, ’a’), (’is’, ’test’) , (’is’, ’test’), (’is’, ’test’), (’is’, ’a’), (’is’, ’test’), (’is ’, ’it’), (’is’, ’is’), (’a’, ’test’), (’a’, ’test’), (’a’, ’test’) , (’a’, ’a’), (’a’, ’test’), (’a’, ’it’), (’a’, ’is’), (’test’, ’ test’), (’test’, ’test’), (’test’, ’a’), (’test’, ’test’), (’test’, ’it’), (’test’, ’is’), (’test’, ’test’), (’test’, ’a’), (’test’, ’ test’), (’test’, ’it’), (’test’, ’is’), (’test’, ’a’), (’test’, ’ test’), (’test’, ’it’), (’test’, ’is’), (’a’, ’test’), (’a’, ’it’), (’a’, ’is’), (’test’, ’it’), (’test’, ’is’), (’it’, ’is’)] Big Data and Automated Content Analysis Damian Trilling
  • 9. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Count co-occurrences 1 from collections import defaultdict 2 from itertools import combinations 3 4 tweets=["i am having coffee with my friend","i like coffee","i like coffee and beer","beer i like"] 5 cooc=defaultdict(int) 6 7 for tweet in tweets: 8 words=tweet.split() 9 for a,b in set(combinations(words,2)): 10 if (b,a) in cooc: 11 a,b = b,a 12 if a!=b: 13 cooc[(a,b)]+=1 14 15 for combi in sorted(cooc,key=cooc.get,reverse=True): 16 print (cooc[combi],"t",combi) Big Data and Automated Content Analysis Damian Trilling
  • 10. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Count co-occurrences The output: 1 3 (’i’, ’coffee’) 2 3 (’i’, ’like’) 3 2 (’i’, ’beer’) 4 2 (’like’, ’beer’) 5 2 (’like’, ’coffee’) 6 1 (’coffee’, ’beer’) 7 1 (’and’, ’beer’) 8 ... 9 ... 10 ... Big Data and Automated Content Analysis Damian Trilling
  • 11. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea From a list of co-occurrences to a network Let’s conceptualize each word as a node and each cooccurrence as an edge • node weight = word frequency • edge weight = number of coocurrences A GDF file offers all of this and looks like this: Big Data and Automated Content Analysis Damian Trilling
  • 12. 1 nodedef>name VARCHAR, width DOUBLE 2 coffee,3 3 beer,2 4 i,4 5 and,1 6 with,1 7 friend,1 8 having,1 9 like,3 10 am,1 11 my,1 12 edgedef>node1 VARCHAR,node2 VARCHAR, weight DOUBLE 13 coffee,beer,1 14 i,beer,2 15 and,beer,1 16 with,friend,1 17 coffee,with,1 18 i,and,1 19 having,friend,1 20 like,beer,2 21 am,friend,1 22 i,am,1 23 i,coffee,3 24 i,with,1 25 am,having,1 26 i,having,1 27 coffee,and,1 28 like,coffee,2
  • 13. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea How to represent the cooccurrences graphically? A two-step approach 1 Save as a GDF file (the format seems easy to understand, so we could write a function for this in Python) 2 Open the GDF file in Gephi for visualization and/or network analysis Big Data and Automated Content Analysis Damian Trilling
  • 14. Co-occurrences Networks Other co-occurrence based methods Next meetings The idea Gephi • Install (NOT in the VM) from https://gephi.org • By problems on MacOS, see what I wrote about Gephi here: http://www.damiantrilling.net/ setting-up-my-new-macbook/ • I made a screencast on how to visualize the GDF file in Gephi: https://streamingmedia.uva.nl/asset/detail/ t2KWKVZtQWZIe2Cj8qXcW5KF • Further: see the materials I mailed to you Big Data and Automated Content Analysis Damian Trilling
  • 15. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example A real-life example Trilling, D. (2015). Two different debates? Investigating the relationship between a political debate on TV and simultaneous comments on Twitter. Social Science Computer Review,33, 259–276. doi: 10.1177/0894439314537886 Big Data and Automated Content Analysis Damian Trilling
  • 16. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Commenting the TV debate on Twitter The viewers • Commenting television programs on social networks has become a regular pattern of behavior (Courtois & d’Heer, 2012) • User comments have shown to reflect the structure of the debate (Shamma, Churchill, & Kennedy, 2010; Shamma, Kennedy, & Churchill, 2009) • Topic and speaker effect more influential than, e.g., rhetorical skills (Nagel, Maurer, & Reinemann, 2012; De Mooy & Maier, 2014) Big Data and Automated Content Analysis Damian Trilling
  • 17. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Research Questions To which extent are the statements politicians make during a TV debate reflected in online live discussions of the debate? RQ1 Which topics are emphasized by the candidates? RQ2 Which topics are emphasized by the Twitter users? RQ3 With which topics are the two candidates associated on Twitter? Big Data and Automated Content Analysis Damian Trilling
  • 18. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Method Big Data and Automated Content Analysis Damian Trilling
  • 19. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Method The data • debate transcript • tweets containing #tvduell • N = 120, 557 tweets by N = 24, 796 users • 22-9-2013, 20.30-22.00 Big Data and Automated Content Analysis Damian Trilling
  • 20. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Method The data • debate transcript • tweets containing #tvduell • N = 120, 557 tweets by N = 24, 796 users • 22-9-2013, 20.30-22.00 The analysis • Series of self-written Python scripts: 1 preprocessing (stemming, stopword removal) 2 word counts 3 word log likelihood (corpus comparison) • Stata: regression analysis Big Data and Automated Content Analysis Damian Trilling
  • 21. 02000400060008000 −60 −50 −40 −30 −20 −10 10 20 30 40 50 60 70 80 100 110 120 130 140 150 start end
  • 22. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Relationship between words on TV and on Twitter 0246810 ln(wordonTwitter+1) 0 1 2 3 ln (word on TV +1) Big Data and Automated Content Analysis Damian Trilling
• 23. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Word frequency TV ⇒ word frequency Twitter
DV: ln(Twitter + 1)          Model 1            Model 2 (together w/ M.)   Model 3 (together w/ S.)
ln (TV M. + 1)   b (SE)      1.59 (.052) ***    1.54 (.041) ***            .77 (.037) ***
                 beta        .21                .26                        .14
ln (TV S. + 1)   b (SE)      1.29 (.051) ***    .88 (.041) ***             1.25 (.037) ***
                 beta        .17                .15                        .24
intercept                    1.64 (.008) ***    .87 (.007) ***             .60 (.006) ***
R2                           .100               .115                       .100
b M. & S. differ?            F(1, 21408) = 12.29, p < .001   F(1, 21408) = 96.69, p < .001   F(1, 21408) = 63.38, p < .001
M = Merkel; S = Steinbrück Big Data and Automated Content Analysis Damian Trilling
• 24. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Most distinctive words on TV
LL      word                       Frequency Merkel   Frequency Steinbrück
27,73   merkel                     0                  20
19,41   arbeitsplatz [job]         14                 0
15,25   steinbruck                 11                 0
9,70    koalition [coalition]      7                  0
9,70    international              7                  0
9,70    gemeinsam [together]       7                  0
8,55    griechenland [Greece]      10                 1
8,32    investi [investment]       6                  0
6,93    uberzeug [belief]          5                  0
6,93    okonom [economic]          0                  5
Big Data and Automated Content Analysis Damian Trilling
• 25. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Most distinctive words on Twitter
LL         word                        Frequency Merkel   Frequency Steinbrück
32443,39   merkel                      29672              0
30751,65   steinbrueck                 0                  17780
1507,08    kett [necklace]             1628               34
1241,14    vertrau [trust]             1240               12
863,84     fdp [a coalition partner]   985                29
775,93     nsa                         1809               298
626,49     wikipedia                   40                 502
574,65     twittert [tweets]           40                 469
544,87     koalition [coalition]       864                77
517,99     gold                        669                34
Big Data and Automated Content Analysis Damian Trilling
  • 26. Co-occurrences Networks Other co-occurrence based methods Next meetings A real-life example Putting the pieces together Merkel • necklace • trust (sarcastic) • nsa affair • coalition partners Steinbrück • suggestion to look sth. up on Wikipedia • tweets from his account during the debate Big Data and Automated Content Analysis Damian Trilling
• 28. Other (non-network-based, statistical) co-occurrence based methods
• 30. Enter unsupervised machine learning (something you already did in your Bachelor’s – no kidding.)
  • 31. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Big Data and Automated Content Analysis Damian Trilling
  • 32. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Think of regression: You measured x1, x2, x3 and you want to predict y, which you also measured Big Data and Automated Content Analysis Damian Trilling
  • 33. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Unsupervised machine learning You have no labels. Big Data and Automated Content Analysis Damian Trilling
  • 34. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Supervised machine learning You have a dataset with both predictor and outcome (dependent and independent variables) — a labeled dataset. Unsupervised machine learning You have no labels. (You did not measure y) Big Data and Automated Content Analysis Damian Trilling
• 35. Co-occurrences Networks Other co-occurrence based methods Next meetings Some terminology Unsupervised machine learning You have no labels. Again, you already know some techniques from other courses to find out how x1, x2, . . . xi co-occur: • Principal Component Analysis (PCA) • Cluster analysis • . . . Big Data and Automated Content Analysis Damian Trilling
  • 36. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA Principal Component Analysis? How does that fit in here? Big Data and Automated Content Analysis Damian Trilling
  • 37. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA Principal Component Analysis? How does that fit in here? In fact, PCA is used everywhere, even in image compression Big Data and Automated Content Analysis Damian Trilling
• 38. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA Principal Component Analysis? How does that fit in here? PCA in ACA • Find out which words co-occur (inductive frame analysis) • Basically, transform each document into a vector of word frequencies and do a PCA Big Data and Automated Content Analysis Damian Trilling
  • 39. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA A so-called term-document-matrix 1 w1,w2,w3,w4,w5,w6 ... 2 text1, 2, 0, 0, 1, 2, 3 ... 3 text2, 0, 0, 1, 2, 3, 4 ... 4 text3, 9, 0, 1, 1, 0, 0 ... 5 ... Big Data and Automated Content Analysis Damian Trilling
• 40. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA A so-called term-document matrix 1 w1,w2,w3,w4,w5,w6 ... 2 text1, 2, 0, 0, 1, 2, 3 ... 3 text2, 0, 0, 1, 2, 3, 4 ... 4 text3, 9, 0, 1, 1, 0, 0 ... 5 ... The cells can contain simple counts, but also more advanced metrics, like tf-idf scores (where you weight a word’s frequency by the inverse of the number of documents in which it occurs), cosine distances, etc. Big Data and Automated Content Analysis Damian Trilling
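Building such a term-document matrix can be sketched in a few lines. The slides do not prescribe a specific library; this example assumes scikit-learn, and the three toy texts are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# invented toy corpus
texts = ["the tax deficit is higher than expected",
         "germany won the world cup",
         "the deficit shrinks as the economy grows"]

# simple counts: one row per document, one column per word
count_vec = CountVectorizer()
tdm = count_vec.fit_transform(texts)
print(tdm.toarray())            # documents x terms matrix of raw counts
print(count_vec.vocabulary_)    # maps each word to its column index

# the same matrix with tf-idf weights instead of raw counts
tfidf_vec = TfidfVectorizer()
tdm_weighted = tfidf_vec.fit_transform(texts)
```

Both vectorizers return a sparse matrix, which matters in practice: with thousands of documents and tens of thousands of words, most cells are zero.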
• 41. Co-occurrences Networks Other co-occurrence based methods Next meetings PCA PCA: implications and problems • given a term-document matrix, easy to do with any statistical tool • but the word frequency distributions are probably extremely skewed • some problematic assumptions: PCA seeks a solution in which each word loads on one component, but does that match real life, where a word can belong to several topics or frames? Big Data and Automated Content Analysis Damian Trilling
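The "transform each document into a vector of word frequencies and do a PCA" recipe can be sketched as follows. Again, scikit-learn is an assumption (the slides name no library), and the six single-topic toy documents are invented so that the two clusters are easy to see.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# invented toy corpus: three "economy" and three "football" documents
texts = ["tax deficit budget economy",
         "economy budget tax growth",
         "deficit growth economy budget",
         "football match goal team",
         "team match goal win",
         "goal team football win"]

# term-document matrix with tf-idf weights, made dense for PCA
vec = TfidfVectorizer()
X = vec.fit_transform(texts).toarray()

# reduce the word space to two components
pca = PCA(n_components=2)
scores = pca.fit_transform(X)   # each document's position on the components

# inspect which words load highest on each component
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
for i, component in enumerate(pca.components_):
    top = component.argsort()[::-1][:3]
    print("Component", i, [terms[j] for j in top])
```

With a clean toy corpus like this, the first component separates the two topics; with real news data, expect the skew and single-loading problems listed above.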
  • 42. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA Enter topic modeling with Latent Dirichlet Allocation (LDA) Big Data and Automated Content Analysis Damian Trilling
• 43. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA LDA, what’s that? No mathematical details here, but the general idea • There are k topics, T1 . . . Tk • Each document Di consists of a mixture of these topics, e.g. 80% T1, 15% T2, 0% T3, . . . 5% Tk • On the next level, each topic consists of a specific probability distribution of words • Thus, based on the frequencies of words in Di, one can infer its distribution of topics • Note that LDA (like PCA) is a Bag-of-Words (BOW) approach Big Data and Automated Content Analysis Damian Trilling
• 44. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA Doing an LDA in Python You can use gensim (Řehůřek & Sojka, 2010) for this.

sudo pip3 install gensim

Furthermore, let us assume you have a list of lists of words (!) called texts:

articles = ['The tax deficit is higher than expected. This said xxx ...', 'Germany won the World Cup. After a']
texts = [art.split() for art in articles]

which looks like this:

[['The', 'tax', 'deficit', 'is', 'higher', 'than', 'expected.', 'This', 'said', 'xxx', '...'], ['Germany', 'won', 'the', 'World', 'Cup.', 'After', 'a']]

Řehůřek, R., & Sojka, P. (2010). Software framework for topic modelling with large corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. Valletta, Malta: ELRA. Big Data and Automated Content Analysis Damian Trilling
• 45.
from gensim import corpora, models

NTOPICS = 100
LDAOUTPUTFILE = "topicscores.tsv"

# Create a BOW representation of the texts
id2word = corpora.Dictionary(texts)
mm = [id2word.doc2bow(text) for text in texts]

# Train the LDA model
lda = models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=NTOPICS, alpha="auto")

# Print the topics
for top in lda.print_topics(num_topics=NTOPICS, num_words=5):
    print("\n", top)

print("\nFor further analysis, a dataset with the topic score for each document is saved to", LDAOUTPUTFILE)

scoresperdoc = lda.inference(mm)

with open(LDAOUTPUTFILE, "w", encoding="utf-8") as fo:
    for row in scoresperdoc[0]:
        fo.write("\t".join(["{:0.3f}".format(score) for score in row]))
        fo.write("\n")
• 46. Co-occurrences Networks Other co-occurrence based methods Next meetings LDA Output: Topics (below) & topic scores (next slide)
1 0.069*fusie + 0.058*brussel + 0.045*europesecommissie + 0.036*europese + 0.023*overname
2 0.109*bank + 0.066*britse + 0.041*regering + 0.035*financien + 0.033*minister
3 0.114*nederlandse + 0.106*nederland + 0.070*bedrijven + 0.042*rusland + 0.038*russische
4 0.093*nederlandsespoorwegen + 0.074*den + 0.036*jaar + 0.029*onderzoek + 0.027*raad
5 0.099*banen + 0.045*jaar + 0.045*productie + 0.036*ton + 0.029*aantal
6 0.041*grote + 0.038*bedrijven + 0.027*ondernemers + 0.023*goed + 0.015*jaar
7 0.108*werknemers + 0.037*jongeren + 0.035*werkgevers + 0.029*jaar + 0.025*werk
8 0.171*bank + 0.122* + 0.041*klanten + 0.035*verzekeraar + 0.028*euro
9 0.162*banken + 0.055*bank + 0.039*centrale + 0.027*leningen + 0.024*financiele
10 0.052*post + 0.042*media + 0.038*nieuwe + 0.034*netwerk + 0.025*personeel
11 ...
Big Data and Automated Content Analysis Damian Trilling
  • 48. Co-occurrences Networks Other co-occurrence based methods Next meetings Next meetings Big Data and Automated Content Analysis Damian Trilling
  • 49. Co-occurrences Networks Other co-occurrence based methods Next meetings Next meetings Wednesday, 11–5 Lab session Conduct an analysis based on word co-occurrences (Chapter 8 and/or 9.2). Install Gephi in advance! No meeting on Monday (Pentecost) Wednesday, 18–5 Supervised machine learning Big Data and Automated Content Analysis Damian Trilling