Graph and Density Based Clustering

By – AYUSH
Netaji Subhash Engineering College, Kolkata

Introduction
• Clustering is the task of identifying groups of similar data points in a
dataset.
• It is one of the most popular techniques in data science.
• Entities in each group are more similar to one another than to
entities in the other groups.
• In this presentation, I will take you through the
types of clustering, different clustering algorithms, and a
brief view of two of the most commonly used clustering
methods, i.e.,
Graph Based Clustering and Density Based Clustering.

Graph Theory :
• Graph theory can be used to get detailed information
about the internal structure of a data set in terms of :
- cliques (subgraphs in which every pair of vertices is connected
by an edge)
- clusters (highly connected groups of nodes)
- centrality (measure of importance of a node in the network)
- outliers (unimportant nodes)
• Applications :
- Social Graphs (drawing edges between people and the things
they interact with)
- Path Optimization Algorithms (Minimal Spanning Tree, Kruskal’s, Prim’s)
- GPS Navigation Systems (shortest path APIs)

GRAPH BASED CLUSTERING
• Graph-based clustering is a method for identifying
groups of similar cells or samples.
• It makes no prior assumptions about the clusters in the
data.
• This means the number, size, density, and shape of
clusters do not need to be known or assumed prior to
clustering.
• Consequently, graph-based clustering is useful for
identifying clusters in complex data sets such as
scRNA-seq.

IDEA :
• Graph-Based clustering uses the proximity graph
– Start with the proximity matrix
– Consider each point as a node in a graph
– Each edge between two nodes has a weight which is the
proximity between the two points
– Initially the proximity graph is fully connected
– MIN (single-link) and MAX (complete-link) can be viewed as
starting with this graph
• In the simplest case, clusters are connected components in the graph.
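
The simplest case above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular library: it assumes proximity is measured as Euclidean distance, and the cut-off `max_distance` and the helper names `proximity_graph` and `connected_components` are made up for this example.

```python
from math import dist


def proximity_graph(points, max_distance):
    """Adjacency lists keeping only edges whose distance is <= max_distance."""
    graph = {i: [] for i in range(len(points))}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if dist(points[i], points[j]) <= max_distance:
                graph[i].append(j)
                graph[j].append(i)
    return graph


def connected_components(graph):
    """Return the connected components as sorted lists of node indices."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search from the start node
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
clusters = connected_components(proximity_graph(points, max_distance=1.5))
print(clusters)  # [[0, 1, 2], [3, 4]]
```

Dropping the edges longer than the cut-off turns the fully connected graph into two components, which are read off as two clusters.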

HIERARCHICAL METHOD :
1) Determine a minimal spanning tree (MST)
2) Delete branches iteratively
New connected components = clusters
MINIMAL SPANNING TREE :
A minimal spanning tree of a connected graph G = (V,E) is a
connected subgraph with minimal weight that contains all nodes of
G and has no cycles.

Minimal Spanning Trees can be calculated with :-
• Prim’s Algorithm
- Prim's (also known as Jarník's) algorithm is a greedy algorithm that finds a
minimum spanning tree for a connected, weighted, undirected graph.
- This means it finds a subset of the edges that forms a tree that includes
every vertex, where the total weight of all the edges in the tree is
minimized.
• Kruskal’s Algorithm
- Kruskal's algorithm is a minimum-spanning-tree algorithm that repeatedly adds
the edge of least possible weight that connects two different trees in the forest.
- It is a greedy algorithm: it builds a minimum spanning tree for a connected
weighted graph by adding edges in order of increasing cost.
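
Both algorithms can be sketched compactly in Python. The edge format `(weight, u, v)`, the function names `kruskal` and `prim`, and the small example graph are assumptions made for this illustration; the graph is assumed to be connected.

```python
import heapq


def kruskal(num_nodes, edges):
    """edges: list of (weight, u, v). Returns the edges of one MST."""
    parent = list(range(num_nodes))

    def find(x):
        # Root of the tree that currently contains x (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    for weight, u, v in sorted(edges):  # increasing cost
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two different trees
            parent[root_u] = root_v
            mst.append((weight, u, v))
    return mst


def prim(num_nodes, edges, start=0):
    """Grow a single tree from `start`, always taking the cheapest outgoing edge."""
    adjacency = {i: [] for i in range(num_nodes)}
    for weight, u, v in edges:
        adjacency[u].append((weight, u, v))
        adjacency[v].append((weight, v, u))
    visited = {start}
    heap = list(adjacency[start])
    heapq.heapify(heap)
    mst = []
    while heap and len(visited) < num_nodes:
        weight, u, v = heapq.heappop(heap)
        if v in visited:
            continue  # edge would close a cycle
        visited.add(v)
        mst.append((weight, u, v))
        for edge in adjacency[v]:
            if edge[2] not in visited:
                heapq.heappush(heap, edge)
    return mst


edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (7, 1, 3), (3, 2, 3)]
kruskal_weight = sum(w for w, _, _ in kruskal(4, edges))
prim_weight = sum(w for w, _, _ in prim(4, edges))
print(kruskal_weight, prim_weight)  # 6 6
```

The two algorithms may pick different edges when weights tie, but the total weight of the resulting tree is the same.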

Branch Deletion
Delete Branches – Different Strategies :-
I. Delete the branch with maximum weight.
II. Delete inconsistent branches.
III. Delete by analysis of weights.
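
Strategy I can be sketched as follows, assuming Euclidean distances as branch weights and a desired number of clusters chosen in advance. The function name `mst_clusters` and the parameter `num_clusters` are hypothetical; strategies II and III would need their own deletion criteria and are not shown.

```python
from itertools import combinations
from math import dist


def mst_clusters(points, num_clusters):
    """Cluster by building an MST and deleting its heaviest branches."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Kruskal's algorithm on the complete distance graph.
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    mst = []
    for weight, i, j in edges:
        if find(i) != find(j):
            parent[find(i)] = find(j)
            mst.append((weight, i, j))

    # Strategy I: drop the (num_clusters - 1) heaviest branches.
    kept = sorted(mst)[:len(mst) - (num_clusters - 1)]

    # The remaining branches define the connected components (clusters).
    parent = list(range(n))
    for _, i, j in kept:
        parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values())


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0)]
print(mst_clusters(points, num_clusters=3))  # [[0, 1, 2], [3, 4], [5]]
```

Each deleted branch splits one connected component into two, so deleting k - 1 branches of the tree leaves exactly k clusters.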

SUMMARY :-
In graph-based clustering, objects are represented as
nodes in a complete or connected graph.
The distance between two objects is given by the weight
of the corresponding branch.
Hierarchical Method :
(1) Determine a minimal spanning tree (MST).
(2) Delete branches iteratively.
Useful for visualizing information in large datasets.

DBSCAN :
• Density-Based Spatial Clustering of Applications with Noise.
• It is one of the most cited clustering algorithms in the literature.
Features :-
• Spatial data
(geomarketing, tomography, satellite images)
• Discovery of clusters with arbitrary shape
(spherical, drawn out, linear, elongated)
• Good efficiency on large databases
(parallel programming)
• Only two parameters required.
• No prior knowledge of the number of clusters is required.

IDEA :
Clusters have a high density of points.
In the area of noise the density is lower than in any of the
clusters.
Goal :
Formalize the notions of clusters and
noise.

Density based cluster : definition
• Relies on a density-based notion of cluster: a cluster is defined as
a maximal set of density-connected points.
• A cluster C is a subset of the database D satisfying
- For all p, q: if p is in C and q is density-reachable from p, then
q is also in C
- For all p, q in C: p is density-connected to q

DENSITY BASED CLUSTERING: DATA
● Two Parameters:
- Eps : Maximum radius of the neighbourhood
- MinPts : Minimum number of points required in the Eps-neighbourhood of a point
● Neps(p) = {q ∈ D | dist(p, q) <= Eps}
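
The Eps-neighbourhood translates directly into code. A minimal sketch, assuming points are coordinate tuples and dist is the Euclidean distance; the function name `eps_neighbourhood` is chosen for this example.

```python
from math import dist


def eps_neighbourhood(data, p, eps):
    """Neps(p): every point q in data with dist(p, q) <= eps (p itself included)."""
    return [q for q in data if dist(p, q) <= eps]


data = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(eps_neighbourhood(data, (0, 0), eps=1.0))  # [(0, 0), (0, 1), (1, 0)]
```

Note that p always belongs to its own neighbourhood, so it counts towards MinPts.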

Problem :
• In each cluster there are two kinds of points :
- points inside the cluster (core points)
- points on the border (border points)
• An Eps-neighbourhood of a border point contains significantly fewer
points than an Eps-neighbourhood of a core point.
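
The distinction can be made concrete with a short sketch. It assumes distinct points given as coordinate tuples, Euclidean distance, and the usual convention that a point whose Eps-neighbourhood holds at least MinPts points is a core point; the function name `classify_points` and the sample data are made up for this example.

```python
from math import dist


def classify_points(data, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise'."""
    neighbours = {p: [q for q in data if dist(p, q) <= eps] for p in data}
    core = {p for p in data if len(neighbours[p]) >= min_pts}
    labels = {}
    for p in data:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbours[p]):
            labels[p] = "border"  # sparse neighbourhood, but next to a core point
        else:
            labels[p] = "noise"
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(data, eps=1.0, min_pts=3)
print(labels[(1, 1)], labels[(2, 1)], labels[(8, 8)])  # core border noise
```

Here (2, 1) has only two points in its Eps-neighbourhood, fewer than any core point, yet it lies within Eps of the core point (1, 1).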

IDEA :
For every point p in a cluster C there is a point q ∈ C, so that
1) p is inside the Eps-neighbourhood of q and
2) Neps(q) contains at least MinPts points.

● Directly density-reachable: A point p is directly
density-reachable from a point q with regard to Eps and MinPts, if
1) p ∈ Neps(q) (reachability)
2) |Neps(q)| >= MinPts (core point condition)
DEFINITION :

Density-reachable:
• A point p is density-reachable from a point q wrt. Eps,
MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p
such that pi+1 is directly density-reachable from pi.
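
Direct reachability and reachability can be checked with a breadth-first search that only continues through core points. This is a sketch under stated assumptions: points are distinct coordinate tuples, distance is Euclidean, and the function names are chosen for this example.

```python
from collections import deque
from math import dist


def neighbourhood(data, p, eps):
    return [q for q in data if dist(p, q) <= eps]


def directly_reachable(data, p, q, eps, min_pts):
    """True if p is directly density-reachable from q."""
    neps_q = neighbourhood(data, q, eps)
    return p in neps_q and len(neps_q) >= min_pts


def density_reachable(data, p, q, eps, min_pts):
    """True if a chain q = p1, ..., pn = p of direct steps exists."""
    visited, queue = {q}, deque([q])
    while queue:
        current = queue.popleft()
        neps = neighbourhood(data, current, eps)
        if len(neps) < min_pts:
            continue  # not a core point: the chain cannot continue from here
        for nxt in neps:
            if nxt == p:
                return True
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return False


data = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
core, border = (0, 0), (2, 1)
print(density_reachable(data, border, core, eps=1.0, min_pts=3))  # True
print(density_reachable(data, core, border, eps=1.0, min_pts=3))  # False
```

The example shows that density-reachability is not symmetric: the border point is reachable from the core point, but nothing is reachable from the border point.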

Density-connected:
• A point p is density-connected
to a point q wrt. Eps, MinPts if there is
a point o such that both p and
q are density-reachable from
o wrt. Eps and MinPts.
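
Density-connectivity can be tested naively by trying every point of the data set as the intermediate point o. The sketch below assumes distinct coordinate tuples and Euclidean distance; `reachable_set` and `density_connected` are names chosen for this example, and the brute-force search is meant for illustration only, not for large data sets.

```python
from math import dist


def reachable_set(data, o, eps, min_pts):
    """All points density-reachable from o (empty if o is not a core point)."""
    reached, frontier, expanded = set(), [o], set()
    while frontier:
        current = frontier.pop()
        if current in expanded:
            continue
        expanded.add(current)
        neps = [q for q in data if dist(current, q) <= eps]
        if len(neps) >= min_pts:  # only core points extend the chain
            reached.update(neps)
            frontier.extend(neps)
    return reached


def density_connected(data, p, q, eps, min_pts):
    """True if some point o reaches both p and q."""
    return any({p, q} <= reachable_set(data, o, eps, min_pts) for o in data)


data = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (9, 9)]
print(density_connected(data, (0, 0), (4, 0), eps=1.0, min_pts=3))  # True
print(density_connected(data, (0, 0), (9, 9), eps=1.0, min_pts=3))  # False
```

The two end points (0, 0) and (4, 0) are both border points, so neither is density-reachable from the other, yet they are density-connected through the core points between them. Unlike reachability, this relation is symmetric.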


DBSCAN (algorithm) :
Start with an arbitrary point p from the database and
retrieve all points density-reachable from p with regard to
Eps and MinPts.
If p is a core point, the procedure yields a cluster with
regard to Eps and MinPts, and the points in it are classified.
If p is a border point, no points are density-reachable
from p and DBSCAN visits the next unclassified point in
the database.
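
The procedure can be sketched as a short Python function. This is a simplified illustration with a brute-force neighbourhood query rather than an optimized implementation: the names `dbscan`, `region_query`, and `NOISE` are chosen for this example, and Euclidean distance is assumed.

```python
from math import dist

NOISE = -1


def dbscan(data, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def region_query(i):
        return [j for j in range(len(data)) if dist(data[i], data[j]) <= eps]

    labels = [None] * len(data)  # None means "not yet classified"
    cluster_id = 0
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        neighbours = region_query(i)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        labels[i] = cluster_id  # i is a core point: start a new cluster
        seeds = [j for j in neighbours if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(j)
            if len(j_neighbours) >= min_pts:  # j is a core point: expand
                seeds.extend(j_neighbours)
        cluster_id += 1
    return labels


data = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
        (8, 8), (8, 9), (9, 8), (20, 20)]
labels = dbscan(data, eps=1.0, min_pts=3)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

A point first marked as noise is not final: if a later core point reaches it, it joins that cluster as a border point. The brute-force query makes this sketch quadratic in the number of points; spatial indexes are the usual way to speed it up.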

Density based clustering – application

CONCLUSION
Clustering is a descriptive technique.
The solution is not unique and it strongly depends
upon the analyst’s choices.
We described how it is possible to combine different
results in order to obtain stable clusters, not
depending too much on the criteria selected to
analyze data.
Clustering always provides groups, even if there is no
group structure.

REFERENCES :
• A big help from Eric Kropat.
• Wikipedia, Google searches
