Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org
Note to other teachers and users of these slides: We would be delighted if you found this
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
Nodes: Football teams
Edges: Games played
Can we identify
node groups?
(communities,
modules, clusters)
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
NCAA conferences
Nodes: Football teams
Edges: Games played
Can we identify
functional modules?
Nodes: Proteins
Edges: Physical interactions
Functional modules
Nodes: Proteins
Edges: Physical interactions
Can we identify
social communities?
Nodes: Facebook Users
Edges: Friendships
Social communities (groups shown in the figure: High school, Summer internship, Stanford (Squash), Stanford (Basketball))
Nodes: Facebook Users
Edges: Friendships
Non-overlapping vs. overlapping communities
[Figure: a network and its adjacency matrix, with nodes on both axes]
What is the structure of community overlaps:
Edge density in the overlaps is higher!
Communities as “tiles”
This is what we want!
Communities
in a network
1) Given a model, we generate the network
2) Given a network, find the “best” model
[Figure: a generative model for networks produces a network on nodes A to H; fitting goes in the reverse direction]
Goal: Define a model that can generate networks
• The model will have a set of “parameters” that we will later want to estimate (and detect communities)
• Q: Given a set of nodes, how do communities “generate” edges of the network?
Generative model B(V, C, M, {p_c}) for graphs (the Community-Affiliation Graph Model, AGM):
• Nodes V, communities C, memberships M
• Each community c has a single probability p_c
• Later we fit the model to networks to detect communities
[Figure: bipartite model linking communities C (with probabilities p_A, p_B) to nodes V through memberships M, next to the network it generates]
AGM generates the links:
• For each pair of nodes in community c, we connect them with prob. p_c
• The overall edge probability is:
  $P(u,v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$
• M_u … set of communities node u belongs to
• If u, v share no communities: P(u,v) = ε
• Think of this as an “OR” function: If at least 1 community says “YES” we create an edge
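The "OR" rule for the AGM edge probability can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout for the community probabilities, and the value used for the background probability ε are assumptions, not part of the slides.

```python
from math import prod

def agm_edge_probability(communities_u, communities_v, p, epsilon=1e-8):
    """P(u, v) = 1 - prod over shared communities c of (1 - p_c).

    communities_u, communities_v: sets of community labels (M_u and M_v).
    p: dict mapping a community label to its probability p_c.
    epsilon: assumed background probability when no community is shared.
    """
    shared = set(communities_u) & set(communities_v)
    if not shared:
        return epsilon
    # The edge is missing only if every shared community fails to create it.
    return 1.0 - prod(1.0 - p[c] for c in shared)

# Hypothetical model with two communities.
p = {"A": 0.5, "B": 0.8}
print(agm_edge_probability({"A"}, {"A"}, p))            # one shared community
print(agm_edge_probability({"A", "B"}, {"A", "B"}, p))  # two shared communities
print(agm_edge_probability({"A"}, {"B"}, p))            # no shared community
```

Nodes that share more communities get a higher edge probability, which is why overlaps come out denser than the individual communities.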
AGM can express a
variety of community
structures:
Non-overlapping,
Overlapping, Nested
Detecting communities with AGM:
Given a graph G(V, E), find the model B(V, C, M, {p_c}):
1) Affiliation graph M
2) Number of communities C
3) Parameters p_c
Maximum Likelihood Principle (MLE):
• Given: Data X
• Assumption: Data is generated by some model f(Θ)
  • f … model
  • Θ … model parameters
• Want to estimate $P_f(X \mid \Theta)$:
  • The probability that our model f (with parameters Θ) generated the data
• Now let’s find the most likely model that could have generated the data: $\arg\max_\Theta P_f(X \mid \Theta)$
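Imagine we are given a set of coin flips
• Task: Figure out the bias of a coin!
• Data: Sequence of coin flips: X = [1, 0, 0, 0, 1, 0, 0, 1]
• Model: f(Θ) = return 1 with prob. Θ, else return 0
• What is $P_f(X \mid \Theta)$? Assuming coin flips are independent
  • So, $P_f(X \mid \Theta) = P_f(1 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdots P_f(1 \mid \Theta)$
  • What is $P_f(1 \mid \Theta)$? Simple, $P_f(1 \mid \Theta) = \Theta$
  • Then, $P_f(X \mid \Theta) = \Theta^3 (1 - \Theta)^5$
  • For example: $P_f(X \mid \Theta = 0.5) = 0.003906$ and $P_f(X \mid \Theta = 3/8) = 0.005029$
• What did we learn? Our data was most likely generated by a coin with bias Θ* = 3/8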
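The coin example can be checked numerically. The sketch below assumes a sequence of eight flips with three 1s (the exact order does not affect the likelihood) and uses a simple grid search as a stand-in for the arg max; the function name and the grid resolution are illustrative choices.

```python
def coin_likelihood(flips, theta):
    """Likelihood of independent flips under a coin that returns 1 with prob. theta."""
    ones = sum(flips)
    zeros = len(flips) - ones
    return theta ** ones * (1 - theta) ** zeros

# Assumed data: eight flips, three of them 1.
flips = [1, 0, 0, 0, 1, 0, 0, 1]

# Closed-form maximizer for a Bernoulli model: the fraction of 1s.
theta_closed_form = sum(flips) / len(flips)

# Grid search over candidate biases, as a brute-force arg max.
grid = [i / 1000 for i in range(1001)]
theta_grid = max(grid, key=lambda t: coin_likelihood(flips, t))

print(coin_likelihood(flips, 0.5))
print(coin_likelihood(flips, theta_closed_form))
print(theta_closed_form, theta_grid)
```

The grid search and the closed form agree on the same bias, which is the point of the example: the likelihood peaks at the observed fraction of 1s.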
How do we do MLE for graphs?
• Model generates a probabilistic adjacency matrix
• We then flip all the entries of the probabilistic matrix to obtain the binary adjacency matrix A
• The likelihood of AGM generating graph G:
  $P(G \mid \Theta) = \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} \bigl(1 - P(u,v)\bigr)$

For every pair of nodes u, v AGM gives the prob. p_uv of them being linked:
0    0.10 0.10 0.04
0.10 0    0.02 0.06
0.10 0.02 0    0.06
0.04 0.06 0.06 0

Flip biased coins to obtain A:
0 1 0 0
1 0 1 1
0 1 0 1
0 1 1 0
Given graph G(V,E) and Θ, we calculate the likelihood that Θ generated G: P(G|Θ)

Θ = B(V, C, M, {p_c}) gives the edge probabilities:
0   0.9 0.9 0
0.9 0   0.9 0
0.9 0.9 0   0.9
0   0   0.9 0

G has the adjacency matrix:
0 1 1 0
1 0 1 0
1 1 0 1
0 0 1 0

$P(G \mid \Theta) = \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} \bigl(1 - P(u,v)\bigr)$
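As a worked instance of the graph likelihood, the sketch below multiplies P(u,v) over edges and 1 − P(u,v) over non-edges for the 4-node example with probabilities 0.9. It assumes an undirected graph, so each unordered pair is counted once; the function name is illustrative.

```python
from itertools import combinations

def graph_likelihood(P, A):
    """P(G | Theta) for an undirected graph.

    P: matrix of edge probabilities, A: binary adjacency matrix.
    Each unordered pair u < v contributes P[u][v] if the edge exists,
    and 1 - P[u][v] otherwise.
    """
    likelihood = 1.0
    for u, v in combinations(range(len(A)), 2):
        p_uv = P[u][v]
        likelihood *= p_uv if A[u][v] == 1 else 1.0 - p_uv
    return likelihood

P = [[0.0, 0.9, 0.9, 0.0],
     [0.9, 0.0, 0.9, 0.0],
     [0.9, 0.9, 0.0, 0.9],
     [0.0, 0.0, 0.9, 0.0]]
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]

# Four edges with probability 0.9 and two non-edges with probability 0.
print(graph_likelihood(P, A))
```

Here the likelihood reduces to 0.9 to the fourth power, because the two missing edges have probability 0 under the model and contribute a factor of 1 each.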
Our goal: Find B(V, C, M, {p_c}) such that:
  $\arg\max_{B(V, C, M, \{p_c\})} \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} \bigl(1 - P(u,v)\bigr)$
• How do we find B(V, C, M, {p_c}) that maximizes the likelihood?
• Problem: Finding B means finding the bipartite affiliation network.
  • There is no nice way to do this.
  • Fitting B(V, C, M, {p_c}) is too hard, let’s change the model (so it is easier to fit)!
Relaxation: Memberships have strengths
• F_uA: The membership strength of node u to community A (F_uA = 0: no membership)
• Each community A links nodes independently:
  $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
Community membership strength matrix F (rows: nodes, columns: communities)
• F_uA … strength of u’s membership to A
• F_u … vector of community membership strengths of u
• $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
  • The probability of connection grows with the product of strengths (for a small product x, 1 − exp(−x) ≈ x)
  • Notice: If one node doesn’t belong to the community (F_uC = 0) then P_C(u,v) = 0
• Prob. that at least one common community C links the nodes:
  $P(u,v) = 1 - \prod_C \bigl(1 - P_C(u,v)\bigr)$
Community A links nodes u, v independently:
  $P_A(u,v) = 1 - \exp(-F_{uA} \cdot F_{vA})$
• Then prob. at least one common C links them:
  $P(u,v) = 1 - \prod_C \bigl(1 - P_C(u,v)\bigr) = 1 - \exp\bigl(-\sum_C F_{uC} \cdot F_{vC}\bigr) = 1 - \exp(-F_u \cdot F_v^T)$
• Example F matrix (node community membership strengths):
  F_u: 0   1.2 0 0.2
  F_v: 0.5 0   0 0.8
  F_w: 0   1.8 1 0
  Then: $F_u \cdot F_v^T = 0.16$
  And: $P(u,v) = 1 - \exp(-0.16) \approx 0.15$
  But: $P(u,w) \approx 0.88$ and $P(v,w) = 0$
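The example can be reproduced directly from the rows of the membership strength matrix. The sketch below assumes the three rows belong to nodes u, v, and w in that order; the function and variable names are illustrative.

```python
from math import exp

def edge_probability(F_u, F_v):
    """P(u, v) = 1 - exp(-F_u . F_v) for two rows of the strength matrix F."""
    dot = sum(a * b for a, b in zip(F_u, F_v))
    return 1.0 - exp(-dot)

# Rows of the example matrix, assumed to be nodes u, v, w.
F = {
    "u": [0.0, 1.2, 0.0, 0.2],
    "v": [0.5, 0.0, 0.0, 0.8],
    "w": [0.0, 1.8, 1.0, 0.0],
}

for a, b in [("u", "v"), ("u", "w"), ("v", "w")]:
    print(a, b, round(edge_probability(F[a], F[b]), 2))
```

Nodes v and w share no community with non-zero strength on both sides, so their dot product is 0 and the model assigns them edge probability 0.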
Task: Given a network G(V, E), estimate F
• Find F that maximizes the likelihood:
  $\arg\max_F \prod_{(u,v) \in E} P(u,v) \prod_{(u,v) \notin E} \bigl(1 - P(u,v)\bigr)$
  • where: $P(u,v) = 1 - \exp(-F_u \cdot F_v^T)$
• Many times we take the logarithm of the likelihood, and call it log-likelihood: $l(F) = \log P(G \mid F)$
• Goal: Find F that maximizes l(F):
  $l(F) = \sum_{(u,v) \in E} \log\bigl(1 - \exp(-F_u F_v^T)\bigr) - \sum_{(u,v) \notin E} F_u F_v^T$
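A direct, unoptimized evaluation of the log-likelihood may help make the objective concrete. The sketch assumes an undirected graph given as a list of node-index pairs and a strength matrix stored as a list of rows; it uses the fact that log(1 − P(u,v)) equals −F_u · F_v for non-edges. Names and the example data are illustrative.

```python
from itertools import combinations
from math import exp, log

def log_likelihood(F, edges):
    """l(F) = sum over edges of log(1 - exp(-F_u . F_v))
              - sum over non-edges of F_u . F_v.

    F: list of non-negative rows, one per node.
    edges: iterable of (u, v) index pairs of an undirected graph.
    """
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u, v in combinations(range(len(F)), 2):
        dot = sum(a * b for a, b in zip(F[u], F[v]))
        if frozenset((u, v)) in edge_set:
            # Needs dot > 0: an edge between nodes with zero overlap
            # has probability 0 and would make the log undefined.
            total += log(1.0 - exp(-dot))
        else:
            total -= dot
    return total

F = [[0.0, 1.2, 0.0, 0.2],
     [0.5, 0.0, 0.0, 0.8],
     [0.0, 1.8, 1.0, 0.0]]
edges = [(0, 1), (0, 2)]
print(log_likelihood(F, edges))
```

This loops over all node pairs, so it is quadratic in the number of nodes; it is meant for checking small examples, not for fitting large networks.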
Compute gradient of a single row F_u of F:
  $\nabla l(F_u) = \sum_{v \in \mathcal{N}(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin \mathcal{N}(u)} F_v$
  where N(u) is the set of outgoing neighbors of u
• Coordinate gradient ascent:
  • Iterate over the rows of F:
    • Compute gradient ∇l(F_u) of row u (while keeping others fixed)
    • Update the row F_u: F_u ← F_u + η ∇l(F_u)
    • Project F_u back to a non-negative vector: If F_uC < 0: F_uC = 0
• This is slow! Computing ∇l(F_u) takes time linear in the number of nodes!
However, we notice:
  $\sum_{v \notin \mathcal{N}(u)} F_v = \sum_v F_v - F_u - \sum_{v \in \mathcal{N}(u)} F_v$
• We cache $\sum_v F_v$
• So, computing $\sum_{v \notin \mathcal{N}(u)} F_v$ now takes linear time in the degree |N(u)| of u
• In networks the degree of a node is much smaller than the total number of nodes in the network, so this is a significant speedup!
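The row-wise gradient ascent with the cached sum can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the step size, the number of sweeps, the random initialization, the numerical guard in the denominator, and the toy graph are all choices made here for the example.

```python
import random
from math import exp

def row_gradient(F, neighbors, u, col_sums):
    """Gradient of l(F) with respect to row F_u, in time linear in deg(u)."""
    k = len(F[u])
    grad = [0.0] * k
    nbr_sum = [0.0] * k
    for v in neighbors[u]:
        dot = sum(a * b for a, b in zip(F[u], F[v]))
        e = exp(-dot)
        # Guard against division by zero when dot == 0 (an assumption here).
        weight = e / max(1.0 - e, 1e-10)
        for c in range(k):
            grad[c] += F[v][c] * weight
            nbr_sum[c] += F[v][c]
    # Sum over non-neighbors = cached total - F_u - sum over neighbors.
    for c in range(k):
        grad[c] -= col_sums[c] - F[u][c] - nbr_sum[c]
    return grad

def fit(F, neighbors, step=0.01, sweeps=200):
    """Coordinate gradient ascent over the rows of F, with projection to >= 0."""
    n, k = len(F), len(F[0])
    col_sums = [sum(F[v][c] for v in range(n)) for c in range(k)]  # cached sum
    for _ in range(sweeps):
        for u in range(n):
            grad = row_gradient(F, neighbors, u, col_sums)
            new_row = [max(0.0, F[u][c] + step * grad[c]) for c in range(k)]
            for c in range(k):
                col_sums[c] += new_row[c] - F[u][c]  # keep the cache in sync
            F[u] = new_row
    return F

# Toy graph: two triangles joined by the edge 2-3.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
             3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
rng = random.Random(0)
F0 = [[rng.random() for _ in range(2)] for _ in range(6)]
F_hat = fit([row[:] for row in F0], neighbors)
for u, row in enumerate(F_hat):
    print(u, [round(x, 2) for x in row])
```

The cache is updated after every row change, so each gradient costs time proportional to the node's degree times the number of communities rather than to the number of nodes.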
BigCLAM takes 5 minutes for 300k node nets
• Other methods take 10 days
• Can process networks with 100M edges!
[Figure: running time (sec.) vs. number of nodes (×10^3) for Link Clustering, Clique Percolation, MMSB, BigCLAM, and Parallel BigCLAM]
Extension:
Make community membership edges directed!
• Outgoing membership: Node “sends” edges
• Incoming membership: Node “receives” edges
Everything is almost the same except now we have 2 matrices: F and H
• F … out-going community memberships
• H … in-coming community memberships
• Edge prob.: $P(u,v) = 1 - \exp(-F_u H_v^T)$
• Network log-likelihood:
  $l(F, H) = \sum_{(u,v) \in E} \log\bigl(1 - \exp(-F_u H_v^T)\bigr) - \sum_{(u,v) \notin E} F_u H_v^T$
  which we optimize the same way as before
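A small sketch shows how two membership matrices make the edge probability asymmetric. The node names and strength values are hypothetical; the only thing taken from the slides is the form of the edge probability, with outgoing strengths for the source and incoming strengths for the target.

```python
from math import exp

def directed_edge_probability(F_u, H_v):
    """P(u -> v) = 1 - exp(-F_u . H_v): outgoing strengths of u, incoming of v."""
    dot = sum(a * b for a, b in zip(F_u, H_v))
    return 1.0 - exp(-dot)

# Hypothetical strengths for one community: u only sends, v only receives.
F = {"u": [1.0], "v": [0.0]}  # outgoing memberships
H = {"u": [0.0], "v": [1.0]}  # incoming memberships

print(directed_edge_probability(F["u"], H["v"]))  # u -> v
print(directed_edge_probability(F["v"], H["u"]))  # v -> u
```

With a single symmetric matrix the two directions would always get the same probability; separating sending and receiving strengths is what lets the model express one-way communities.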
• Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2013.
• Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), 2014.
• Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference on Data Mining (ICDM), 2013.