Presentation on Graph Clustering (vldb 09)

Graph Clustering
Based on Structural/Attribute Similarities
Yang Zhou, Hong Cheng, Jeffrey Xu Yu

Proc. of the VLDB Endowment, France, 2009

Thursday, August 16, 2012

Presenter
Waqas Nawaz

Data and Knowledge Engineering Lab, Kyung Hee University, Korea

Agenda


Introduction
• X = {x_1, …, x_N}: a set of data points
• S = (s_ij), i, j = 1, …, N: the similarity matrix, in which each element s_ij gives the
similarity between two data points x_i and x_j

• The goal of clustering is to divide the data points into several groups such that
points in the same group are similar and points in different groups are dissimilar.

• Modeling the dataset as a graph (data points become vertices, similarities become edges)

• From the graph perspective, the clustering problem is then formulated as a partition of
the graph such that nodes in the same subgraph are densely connected (homogeneous)
and only sparsely connected to the rest of the graph (heterogeneous).

• Distances and similarities are inversely related. The following slides speak only of
similarities, but everything also works with distances.


Motivation

• Identifying clusters (well-connected components in a
graph) is useful in many applications, from biological
function prediction to social community detection

Attribute of Authors

from manyeyes.alphaworks.ibm.com

Objective

• A desirable clustering of an attributed graph should achieve a good
balance between the following:

  ◦ Structural cohesiveness: vertices within one cluster are close to each
  other in terms of structure, while vertices in different clusters are
  distant from each other

  ◦ Attribute homogeneity: vertices within one cluster have similar
  attribute values, while vertices in different clusters have quite different
  attribute values

[Figure: balance between Structural Cohesiveness and Attribute Homogeneity]


Related Work

• Structure-Based Clustering
  ◦ Normalized cuts [Shi and Malik, TPAMI 2000]
  ◦ Modularity [Newman and Girvan, Phys. Rev. 2004]
  ◦ SCAN [Xu et al., KDD'07]
The resulting clusters have a rather random distribution of vertex
properties within clusters

• Attribute-Based Clustering
  ◦ K-SNAP [Tian et al., SIGMOD'08]
  ◦ Attribute-compatible grouping
The resulting clusters have a rather loose intra-cluster structure

Is there a way to consider both factors (structure and attributes)
simultaneously while clustering? Yes.


Graph Clustering with Structure & Attribute (1/11)

• Structure-based Clustering
  ◦ Vertices with heterogeneous values in a cluster

• Attribute-based Clustering
  ◦ Loses much structure information

• Structural/Attribute Cluster
  ◦ Vertices with homogeneous values in a cluster
  ◦ Keeps most structure information


Example: A Coauthor Network
[Figure: a coauthor network of 11 authors labeled with topics (r1 to r8: XML, with r3: XML and Skyline; r9 to r11: Skyline), shown under three groupings: Structural Clustering, Attribute-based Cluster, and Structural/Attribute Cluster]



Proposed Idea: Flow Diagram
[Flow diagram. Box labels: G; Transform vertex attributes to attribute edges; Ga; A unified distance on edges; Clustering on Ga; Mapping onto the original graph; Clustering on G; Desired Clusters]



Attribute Augmented Coauthor Graph with Topics
[Figure: the coauthor graph of authors r1 to r11 with topics XML and Skyline, in two panels: Original and Modified]

Then we use neighborhood random walk distance on the augmented
graph to combine structural and attribute similarities

Neighborhood Random Walk (1/2)

[Figure: example graph on vertices A, B, C with edge labels 1 and 1/2, shown with its adjacency matrix A and its transition matrix P]
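
The step from an adjacency matrix to a transition matrix can be sketched as row normalization. This is a minimal illustration, not the slide's exact example: the 3-vertex graph below and the helper name `transition_matrix` are assumptions.

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-normalize an adjacency matrix into a random-walk transition matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    row_sums = adjacency.sum(axis=1, keepdims=True)
    # Rows with no outgoing edges stay all-zero instead of dividing by zero.
    return np.divide(adjacency, row_sums,
                     out=np.zeros_like(adjacency), where=row_sums > 0)

# Hypothetical graph (order A, B, C): A is linked to B and C.
A = [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]]
P = transition_matrix(A)
print(P)
```

Each row of `P` sums to 1, so entry (i, j) can be read as the probability that a walker at vertex i moves to vertex j in one step.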


Neighborhood Random Walk (2/2)

[Figure: snapshots of the random walk on the example graph (vertices A, B, C; edge probabilities 1 and 1/2) at steps t=0, t=1, t=2 and t=3]
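
The step-by-step walk can be simulated by repeatedly multiplying a probability vector by the transition matrix. The sketch below assumes a hypothetical 3-vertex transition matrix and a walker starting at A; the function name `walk_distribution` is illustrative.

```python
import numpy as np

# Hypothetical transition matrix for a 3-vertex graph (order A, B, C).
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

def walk_distribution(P, start, steps):
    """Return the probability of being at each vertex after `steps` moves."""
    p = np.zeros(P.shape[0])
    p[start] = 1.0            # t = 0: the walker sits on the start vertex
    for _ in range(steps):
        p = p @ P             # one step: push probability along outgoing edges
    return p

for t in range(4):
    print(t, walk_distribution(P, start=0, steps=t))
```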



The Kinds of Vertices and Edges
• Two kinds of vertices
  ◦ The structure vertex set V
  ◦ The attribute vertex set Va

• Two kinds of edges
  ◦ The structure edges E
  ◦ The attribute edges Ea

• The attribute augmented graph: Ga = (V ∪ Va, E ∪ Ea)
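
One way to picture the augmentation is as an adjacency list in which every (attribute, value) pair becomes an extra vertex. The sketch below is illustrative only: the function name `augment`, the tuple encoding of attribute vertices, and the toy data are assumptions.

```python
from collections import defaultdict

def augment(structure_edges, attributes):
    """Build an attribute-augmented adjacency list.

    structure_edges: iterable of (u, v) pairs between structure vertices.
    attributes: dict mapping vertex -> {attribute_name: value}.
    Attribute vertices are represented as (attribute_name, value) tuples.
    """
    neighbors = defaultdict(set)
    for u, v in structure_edges:            # structure edges E
        neighbors[u].add(v)
        neighbors[v].add(u)
    for v, attrs in attributes.items():     # attribute edges Ea
        for name, value in attrs.items():
            attr_vertex = (name, value)
            neighbors[v].add(attr_vertex)
            neighbors[attr_vertex].add(v)
    return neighbors

# Hypothetical toy data in the spirit of the coauthor example.
edges = [("r1", "r2"), ("r2", "r3"), ("r9", "r10")]
attrs = {"r1": {"topic": "XML"}, "r2": {"topic": "XML"}, "r3": {"topic": "XML"},
         "r9": {"topic": "Skyline"}, "r10": {"topic": "Skyline"}}
Ga = augment(edges, attrs)
```

Two authors with the same topic are now at most two hops apart through the shared attribute vertex, even when they never coauthored a paper.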



New Clustering Framework
1. Calculate the distance
2. Initialize the cluster centroids
3. Assign vertices to a cluster
4. Update the cluster centroids
5. Adjust edge weights automatically
6. Re-calculate the distance matrix
Repeat steps 3 to 6 until the objective function converges



Transition Probability Matrix on Attribute Augmented Graph

The transition matrix on the augmented graph has four blocks:

    | PV  A |
    | B   O |

• PV: probabilities from structure vertices to structure vertices
• A: probabilities from structure vertices to attribute vertices
• B: probabilities from attribute vertices to structure vertices
• O: probabilities from attribute vertices to attribute vertices; all entries are zero
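
The four blocks can be assembled with a few array operations. The sketch below assumes that edge weights enter through weighted row normalization and that every vertex has at least one edge; the function name, argument layout, and toy data are hypothetical.

```python
import numpy as np

def augmented_transition(adj, attr_incidence, attr_group, weights, w0=1.0):
    """Assemble the block transition matrix [[PV, A], [B, O]].

    adj:            (n, n) 0/1 structure adjacency matrix.
    attr_incidence: (n, k) 0/1 matrix; entry (i, j) is 1 if structure vertex i
                    is linked to attribute vertex j.
    attr_group:     length-k list; attr_group[j] is the index of the attribute
                    that attribute vertex j belongs to.
    weights:        per-attribute edge weights; w0 is the structure edge weight.
    """
    adj = np.asarray(adj, dtype=float)
    inc = np.asarray(attr_incidence, dtype=float)
    w = np.asarray(weights, dtype=float)[np.asarray(attr_group)]
    k = inc.shape[1]
    weighted = np.hstack([w0 * adj, inc * w])       # outgoing weights of structure vertices
    top = weighted / weighted.sum(axis=1, keepdims=True)    # blocks PV and A
    B = inc.T / inc.T.sum(axis=1, keepdims=True)    # attribute vertex -> structure vertices
    bottom = np.hstack([B, np.zeros((k, k))])       # O: no attribute-to-attribute edges
    return np.vstack([top, bottom])

# Hypothetical path graph 0 - 1 - 2 with one attribute that has two values.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
inc = [[1, 0], [1, 0], [0, 1]]
PA = augmented_transition(adj, inc, attr_group=[0, 0], weights=[1.0])
```

Raising an attribute's weight shifts probability mass in the top rows from the PV block toward that attribute's columns in the A block.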



A Unified Distance Measure
• The unified neighborhood random walk distance:

• The matrix form of the neighborhood random walk distance:
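
A minimal sketch of one way to compute a neighborhood random walk distance in matrix form, assuming it sums the probabilities of walks of length 1 to l, each discounted by a restart probability c. The parameter names `length` and `restart` and the example matrix are assumptions.

```python
import numpy as np

def random_walk_distance(P, length, restart):
    """Sum restart-discounted walk probabilities over walks of 1..length steps.

    Computes R = sum over g = 1..length of restart * (1 - restart)**g * P**g.
    """
    P = np.asarray(P, dtype=float)
    R = np.zeros_like(P)
    power = np.eye(P.shape[0])
    for g in range(1, length + 1):
        power = power @ P                            # P**g
        R += restart * (1.0 - restart) ** g * power
    return R

# Hypothetical transition matrix (order A, B, C).
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
R = random_walk_distance(P, length=3, restart=0.5)
```

Under this form a larger entry of `R` means the target vertex is easier to reach, so "closest" corresponds to the largest value rather than the smallest.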

Cluster Centroid Initialization
• Identify good initial centroids from the density point of view
[Hinneburg and Keim, AAAI 1998]

• Influence function of vi on vj

• Density function of vi
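
Density-based initialization can be sketched as follows, assuming a Gaussian-style influence function of the random walk score and a density equal to the summed influence on all other vertices. The exact functional form, the `sigma` parameter, and the toy matrix are assumptions.

```python
import numpy as np

def initial_centroids(R, k, sigma=1.0):
    """Pick the k vertices with the highest density as initial centroids."""
    R = np.asarray(R, dtype=float)
    # Assumed influence of vertex i on vertex j, growing with the walk score R[i, j].
    influence = 1.0 - np.exp(-R ** 2 / (2.0 * sigma ** 2))
    density = influence.sum(axis=1)           # total influence of each vertex
    order = np.argsort(-density, kind="stable")
    return order[:k], density

# Hypothetical walk scores for three vertices.
R = [[0.0, 0.9, 0.9],
     [0.1, 0.0, 0.1],
     [0.5, 0.5, 0.0]]
centroids, density = initial_centroids(R, k=2)
```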



Clustering Process (K-means framework)
• Assign each vertex vi ∈ V to its closest centroid c*:

• Update the centroid with the most centrally located vertex in
each cluster:
  ◦ Compute the “average point” v̄i of a cluster Vi

  ◦ Find the new centroid whose random walk distance vector is the closest to
  the cluster average
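
The two steps above can be sketched on top of a matrix `R` of random walk scores. This is an illustration under assumptions: larger scores are taken to mean closer vertices, Euclidean distance is used to compare score vectors, and the function names and toy matrix are hypothetical.

```python
import numpy as np

def assign(R, centroids):
    """Assign every vertex to the centroid with the largest walk score."""
    return np.argmax(R[:, centroids], axis=1)

def update_centroids(R, labels, centroids):
    """Replace each centroid by the member closest to the cluster average."""
    new = []
    for j, old in enumerate(centroids):
        members = np.flatnonzero(labels == j)
        if members.size == 0:                  # empty cluster: keep the old centroid
            new.append(int(old))
            continue
        average = R[members].mean(axis=0)      # "average point" of cluster j
        gaps = np.linalg.norm(R[members] - average, axis=1)
        new.append(int(members[np.argmin(gaps)]))
    return new

# Hypothetical scores: vertices 0, 1 are close to each other, as are 2, 3.
R = np.array([[0.50, 0.40, 0.05, 0.05],
              [0.40, 0.50, 0.05, 0.05],
              [0.05, 0.05, 0.50, 0.40],
              [0.05, 0.05, 0.40, 0.50]])
labels = assign(R, [0, 2])
centroids = update_centroids(R, labels, [0, 2])
```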



Edge Weight Definition
• Different types of edges may have different degrees of importance
  ◦ Structure edge weight ω0: fixed to 1.0 throughout the clustering process
  ◦ Attribute edge weight ωi, for i = 1, 2, ..., m
  ◦ All weights are initialized to 1.0, but the attribute weights are automatically updated during clustering

[Callout: “Topic” has a more important role than “age”]



Weight Self-Adjustment
• A vote mechanism determines whether two vertices share an
attribute value:

• Weight increment:

• How does the weight adjustment affect clustering convergence?
  ◦ Objective function

  ◦ It can be shown that the weights are adjusted in the direction of
  clustering convergence as the clusters are iteratively refined.
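
A vote-based weight update could look like the sketch below. It assumes that a vertex votes for an attribute when it shares that attribute's value with its cluster centroid, that increments are proportional to the vote counts, and that the new weight averages the old weight with the increment. These details, the function names, and the toy data are assumptions.

```python
def vote(attr_values, i, p, q):
    """1 if vertices p and q share the same value of attribute i, else 0."""
    return 1 if attr_values[p][i] == attr_values[q][i] else 0

def adjust_weights(weights, attr_values, labels, centroids):
    """Shift weight toward attributes on which members agree with their centroid."""
    m = len(weights)
    votes = [sum(vote(attr_values, i, centroids[labels[v]], v)
                 for v in range(len(labels)))
             for i in range(m)]
    total = sum(votes)
    if total == 0:
        return list(weights)                      # no agreement: leave weights unchanged
    increments = [m * c / total for c in votes]   # increments sum to m
    return [0.5 * (w + d) for w, d in zip(weights, increments)]

# Hypothetical data: attribute 0 is topic, attribute 1 is age group.
attr_values = [("XML", "young"), ("XML", "old"),
               ("Skyline", "young"), ("Skyline", "old")]
labels = [0, 0, 1, 1]
centroids = [0, 2]
new_weights = adjust_weights([1.0, 1.0], attr_values, labels, centroids)
```

In this toy run every vertex agrees with its centroid on topic but only half agree on age, so the topic weight rises and the age weight falls while their sum stays at m.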


Experimental Evaluation (1/5)

• Datasets
  ◦ Political Blogs Dataset: 1490 vertices, 19090 edges, one
  attribute: political leaning
  ◦ DBLP Dataset: 5000 vertices, 16010 edges, two attributes:
  prolific and topic

• Methods
  ◦ K-SNAP [Tian et al., SIGMOD'08]: attribute only
  ◦ S-Cluster: structure-based clustering
  ◦ W-Cluster: weighted function
  ◦ SA-Cluster: proposed method



Evaluation Metrics
• Density: intra-cluster structural cohesiveness

• Entropy: intra-cluster attribute homogeneity
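
Both metrics can be sketched in a few lines, assuming density is the fraction of edges that fall inside clusters and entropy is the cluster-size-weighted Shannon entropy of a single attribute. The exact normalization used in the evaluation is not shown on the slide, so these definitions and the toy data are assumptions.

```python
import math
from collections import Counter

def density(edges, labels):
    """Fraction of edges whose two endpoints fall in the same cluster."""
    inside = sum(1 for u, v in edges if labels[u] == labels[v])
    return inside / len(edges)

def entropy(values, labels):
    """Cluster-size-weighted Shannon entropy (in bits) of one attribute."""
    n = len(labels)
    total = 0.0
    for cluster in set(labels):
        members = [values[v] for v in range(n) if labels[v] == cluster]
        counts = Counter(members)
        h = -sum((c / len(members)) * math.log2(c / len(members))
                 for c in counts.values())
        total += len(members) / n * h
    return total

# Hypothetical clustering of four vertices on a path graph.
edges = [(0, 1), (1, 2), (2, 3)]
labels = [0, 0, 1, 1]
topics = ["XML", "XML", "Skyline", "XML"]
d = density(edges, labels)
e = entropy(topics, labels)
```

Higher density and lower entropy both indicate better clusters under these definitions.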



Cluster Quality Evaluation



Clustering Convergence


Conclusion
• Studied the problem of clustering a graph with multiple
attributes on the attribute augmented graph

• A unified neighborhood random walk distance measures vertex
closeness on an attribute augmented graph

• Theoretical analysis quantitatively estimates the
contributions of attribute similarity

• The degrees of contribution of different attributes are adjusted
automatically in the direction of clustering convergence


Critical Review
• Many algorithms have been proposed in the literature, but they
consider only the structural aspect or only the attribute aspect
when measuring similarity among nodes in the graph

• This paper considers both aspects simultaneously, which better
reflects the true nature of clusters and of similarity among
different objects

• It relies on random walks on the graph, which require matrix
operations (i.e. multiplication), so it becomes impractical for
huge datasets

• Because the similarity is calculated iteratively, it does not
scale to huge networks (graph datasets)

Feasible Improvements
• The iterative similarity calculation should be avoided
by incorporating other feasible methods for checking relevancy

• The method can scale to networks whose nodes are not
densely connected: such nodes have lower degree, so the
similarity calculation is cheaper

• The augmentation process can be remodeled or avoided to reduce
space complexity and running time


Questions

Suggestions…!
