ODSC India 2018: Topological space creation & Clustering at BigData scale

Topological space creation &
Clustering at BigData scale
Kuldeep Jiwani

Agenda
•Motivation
•Data Geometry: Analytical
• Topological spaces
• Curved spaces
• Manifolds
•Data Geometry: Applied Machine Learning
• Manifolds: Global vs Local
• Reference spaces (Probabilistic spaces)
• Clustering technique
•BigData computations
• Apache Spark code and optimizations

BigData Mining & Clustering
• BigData mining is concerned with
• Discovery of interesting patterns
• Uncovering unknown knowledge present in vast data lakes
• Cluster Analysis is the process of discovering homogeneous
groups called clusters
• Given some measure of similarity between data objects, the
goal in most clustering algorithms is
• Maximizing the homogeneity within each cluster
• Maximizing the heterogeneity between different clusters

Geometry and data
• Curse of dimensionality: Our intuitions, which come from a
three-dimensional world (Euclidean), often do not apply in
high-dimensional ones*
• In high dimensions, most of the mass of a multivariate Gaussian
distribution is not near the mean
• But in an increasingly distant “shell” around it
• “Most of the volume of a high-dimensional orange is in the skin, not the
pulp”
• Blessings of non-uniformity: Distribution of natural data is
non-uniform and concentrates around low-dimensional structures*
• The shape (geometry) of the distribution can be exploited for
efficient learning
*Source: "A few useful things to know about Machine Learning" by Pedro Domingos at the University of Washington

Clustering and metric spaces
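[Figure: two clusters plotted in Cartesian (x, y) coordinates and again in polar (r, θ) coordinates]
$r = \sqrt{x^2 + y^2}$
$\theta = \tan^{-1}\left(\frac{y}{x}\right)$
The two clusters are easily separable in (r, θ) space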
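
As a minimal sketch of the change of coordinates on this slide, the Scala snippet below maps Cartesian points to (r, θ). The two concentric rings and the threshold of 3.0 are hypothetical illustration data, not taken from the talk, and `atan2` is used as a quadrant-aware stand-in for tan⁻¹(y / x).

```scala
object PolarCoordinates {
  /** Map a Cartesian point (x, y) to polar coordinates (r, theta). */
  def toPolar(x: Double, y: Double): (Double, Double) = {
    val r = math.hypot(x, y)      // r = sqrt(x^2 + y^2)
    val theta = math.atan2(y, x)  // quadrant-aware version of atan(y / x)
    (r, theta)
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical data: an inner ring (radius 1) and an outer ring (radius 5)
    val angles = (0 until 8).map(i => 2 * math.Pi * i / 8)
    val inner = angles.map(a => (math.cos(a), math.sin(a)))
    val outer = angles.map(a => (5 * math.cos(a), 5 * math.sin(a)))

    // In (r, theta) space a single threshold on r separates the two rings
    val threshold = 3.0
    val innerOk = inner.forall { case (x, y) => toPolar(x, y)._1 < threshold }
    val outerOk = outer.forall { case (x, y) => toPolar(x, y)._1 > threshold }
    println(s"inner below threshold: $innerOk, outer above threshold: $outerOk")
  }
}
```

No straight line in (x, y) separates such rings, while a single cut on r does, which is the point the slide makes about choosing the right space.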

Data Geometry approaches
•Analytical
• Theoretical
•Empirical
•Applied Machine Learning ( BigData )

Topological spaces
• Is the most general notion of a mathematical space
• Is defined as a set of points
• Along with a set of neighborhoods for each point
• Satisfying a set of axioms relating points & neighborhood
• Arbitrary (finite or infinite) unions and finite intersections of
open sets belong to the topology
• It allows definition of concepts such as
• Connectedness
• Compactness
• Dimensionality
• Continuity
• Presence of holes

Metric spaces
• Metric space is a special type of topological space
• With the added structure of a distance function d over the metric space M
• For any x, y, z ∈ M, the following 4 properties should hold:
• $d(x, y) \ge 0$ (Non-negativity)
• $d(x, y) = 0$ if and only if $x = y$ (Identity of indiscernibles)
• $d(x, y) = d(y, x)$ (Symmetry)
• $d(x, z) \le d(x, y) + d(y, z)$ (Triangle inequality)
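
The four properties can be spot-checked numerically. The sketch below is an illustration only: `satisfiesAxioms` and `squaredEuclidean` are hypothetical names, and a check over a finite sample can refute a candidate distance but cannot prove it is a metric.

```scala
object MetricAxioms {
  type Point = Vector[Double]

  /** Euclidean distance, which is a true metric. */
  def euclidean(x: Point, y: Point): Double =
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

  /** Squared Euclidean distance: a common dissimilarity that is not a metric. */
  def squaredEuclidean(x: Point, y: Point): Double = {
    val d = euclidean(x, y)
    d * d
  }

  /** Check the four metric properties on every pair and triple of a finite sample. */
  def satisfiesAxioms(d: (Point, Point) => Double, pts: Seq[Point], tol: Double = 1e-9): Boolean = {
    val pairsOk = pts.forall(x => pts.forall { y =>
      val dxy = d(x, y)
      // non-negativity, identity of indiscernibles, symmetry
      dxy >= 0 && ((dxy <= tol) == (x == y)) && math.abs(dxy - d(y, x)) <= tol
    })
    // triangle inequality over all triples
    val triangleOk = pts.forall(x => pts.forall(y => pts.forall(z =>
      d(x, z) <= d(x, y) + d(y, z) + tol)))
    pairsOk && triangleOk
  }
}
```

On the collinear points 0, 1, 2 the squared distance fails the triangle inequality (4 > 1 + 1), while the Euclidean distance passes.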

Euclidean space
• Euclidean metric is defined by a norm on $\mathbb{R}^n$
• $d(x, y) = \lVert x - y \rVert = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Euclidean space is how the world visually appears to us
• That’s why, whenever a metric space needs to be defined, we
either explicitly assume it to be Euclidean or implicitly model it as such
• It may work well in many scenarios
• But is the assumption of Euclidean metric always valid?
• What if the underlying geometry is non-Euclidean
• What if the Euclidean metric is distorting the actual geometry

Curved spaces
• For points on the surface of the Earth, applying Euclidean
geometry distorts the shortest path (red lines)
• Geodesic distance: measures the shortest path between two points
along the curved surface, measured along great circles
• Great circles: circles on the sphere that have the same diameter as
the sphere
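
To make the distortion concrete, the sketch below compares the straight-line (Euclidean) chord through the Earth with the geodesic along a great circle. It assumes a perfectly spherical Earth with a mean radius of 6371 km; the object and function names are hypothetical.

```scala
object GreatCircle {
  // Mean Earth radius in km (an approximation; the Earth is not a perfect sphere)
  val EarthRadiusKm = 6371.0

  /** Unit vector on the sphere for a latitude/longitude given in degrees. */
  def toUnitVector(latDeg: Double, lonDeg: Double): Array[Double] = {
    val lat = math.toRadians(latDeg)
    val lon = math.toRadians(lonDeg)
    Array(math.cos(lat) * math.cos(lon), math.cos(lat) * math.sin(lon), math.sin(lat))
  }

  /** Euclidean (straight-line) distance through the Earth, in km. */
  def chordKm(a: Array[Double], b: Array[Double]): Double =
    EarthRadiusKm * math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /** Geodesic distance along the great circle through a and b, in km. */
  def geodesicKm(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    // Clamp to [-1, 1] to guard against rounding just outside acos's domain
    EarthRadiusKm * math.acos(math.max(-1.0, math.min(1.0, dot)))
  }

  def main(args: Array[String]): Unit = {
    val northPole = toUnitVector(90.0, 0.0)
    val equator = toUnitVector(0.0, 0.0)
    println(f"chord: ${chordKm(northPole, equator)}%.1f km")
    println(f"geodesic: ${geodesicKm(northPole, equator)}%.1f km")
  }
}
```

The chord is always shorter than the geodesic, and the gap grows with the angular separation of the two points.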

A complex Geodesic
If an insect is placed on a surface and continually walks "forward",
by definition it will trace out a geodesic.

Geometries and Distances
• Whenever a metric is used to measure distance between two
points
• It means that an assumption has been made about the geometry of the
surface
• As a metric is unique to a geometry
• The problem is that different metrics can be used to estimate
the distance and hence, different geometries can be imposed
• A useful measure for identifying the right geometry amongst the
infinitely many possible geometries between points is the Gaussian
curvature

Gaussian Curvature of a surface
• The Gaussian curvature (K)
informs us how curved a
specific surface is with
respect to a flat surface
• The magnitude of (K) tells us
how much the surface is
bending
• Curvature defines distances
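[Figure: example surfaces with K < 0, K = 0 and K > 0]
Distance Hyperboloid-n (K = -1)
$d_h(p, q) = \cosh^{-1}\left( \sqrt{1 + \sum_{i=1}^{n} p_i^2} \, \sqrt{1 + \sum_{i=1}^{n} q_i^2} - \sum_{i=1}^{n} p_i q_i \right)$
Distance Euclidean-n
$d_e(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
Distance Spherical-n
$d_s(p, q) = r \cos^{-1}\left( \sum_{i=1}^{n} p_i q_i \right)$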
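
The three distances on this slide can be sketched in plain Scala as follows. Two assumptions are made for illustration: the spherical formula is applied to unit vectors p and q, and cosh⁻¹ is written out by hand because `scala.math` does not provide it.

```scala
object CurvatureDistances {
  private def dot(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => a * b }.sum

  // Inverse hyperbolic cosine, defined for x >= 1
  private def acosh(x: Double): Double = math.log(x + math.sqrt(x * x - 1))

  /** K = 0: Euclidean distance. */
  def euclidean(p: Array[Double], q: Array[Double]): Double =
    math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

  /** K > 0: spherical distance for unit vectors p and q on a sphere of radius r. */
  def spherical(p: Array[Double], q: Array[Double], r: Double = 1.0): Double =
    r * math.acos(math.max(-1.0, math.min(1.0, dot(p, q))))

  /** K = -1: distance on the hyperboloid model. */
  def hyperbolic(p: Array[Double], q: Array[Double]): Double = {
    val arg = math.sqrt(1 + dot(p, p)) * math.sqrt(1 + dot(q, q)) - dot(p, q)
    acosh(math.max(1.0, arg)) // clamp guards against rounding slightly below 1
  }
}
```

For the same pair of coordinate vectors the three functions generally return different values, which is the sense in which the choice of metric imposes a geometry.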

Manifold
• A manifold is a topological space that is locally Euclidean
• The simplest well-known example of a manifold is the atlas of the Earth
[Figure: the 3-dimensional Earth and its 2-dimensional manifold (atlas)]

Manifold
• For every point of the manifold there is a neighborhood and a
continuous function to a Euclidean space, with a continuous inverse,
that maps points one-to-one between the two spaces
(Homeomorphism)
• These are called charts and a
combination of these creates an atlas
• For an atlas, two charts can
overlap on a manifold and their
intersection can map to the same
Euclidean space
• Transition maps are composite
functions which help in this mapping

Manifold: Geometry & Topology
• Geometry and topology both study the properties of
manifolds
• Topology primarily studies those problems that are inherently
global in nature
• Geometry studies properties of manifolds that have an
interesting local structure

Riemannian geometry
• It deals with a broad range of
geometries whose metric
properties vary from point to point
• It studies Riemannian manifolds
(smooth manifolds equipped with a
Riemannian metric)
• Riemannian metric: an inner product
on the tangent space at each point
that varies smoothly from point to
point
• This was also used in the general
theory of relativity

Data geometry: Applied
Machine Learning

Data analysis & Clustering: Case – 1
K-Means

K-Means
Euclidean metric is invariant to rotation, reflection and translation

Euclidean metric still works well on a plane
K-Means

K-Means
Although the overall geometry is curved,
in parts it can be easily approximated and separated by planes

K-Means (3D)
As curvature increases beyond a point, K-Means doesn’t work well

Data analysis & Clustering: Case – 4 (Global Manifold)
K-Means (3D)
Manifold - Global
(MDS)

Global Manifold (MDS)
K-Means (2D)

Data analysis & Clustering: Case – 4 (Local Manifold)
K-Means (3D)
Manifold - Local
(LLE)

Manifold: Global vs Local
[Figure panels: Global Manifold | Local Manifold]

Manifold construction: Global vs Local
Global Manifold Local Manifold
Clustering

DBSCAN: Clustering on local neighborhood
• Core points (A)
• Core edges
• Non-core points (B, C)
• Un-clustered points (N)
• Input: Distance matrix
• Critical parameter: Epsilon (ε)
• Radius or length of core edge
• The minimum distance
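
A minimal single-machine sketch of DBSCAN over a precomputed distance matrix is shown below, to illustrate the roles of core points, non-core points and un-clustered points. It is not the distributed implementation from the talk; `LocalDbscan`, `eps` and `minPts` are hypothetical names, and the neighbour count here excludes the point itself, which is one of several conventions.

```scala
object LocalDbscan {
  /** Returns one cluster label per point; -1 marks un-clustered points (noise). */
  def cluster(dist: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
    val n = dist.length
    // Neighbours of each point: every other point within distance eps
    val neighbours = Array.tabulate(n) { i =>
      (0 until n).filter(j => j != i && dist(i)(j) <= eps).toArray
    }
    // Core points have at least minPts neighbours (the point itself is not counted)
    val isCore = neighbours.map(_.length >= minPts)
    val labels = Array.fill(n)(-1)
    var nextLabel = 0
    for (start <- 0 until n) {
      if (isCore(start) && labels(start) == -1) {
        labels(start) = nextLabel
        var stack = List(start)
        while (stack.nonEmpty) {
          val p = stack.head
          stack = stack.tail
          // Only core points extend the cluster; non-core points are attached but not expanded
          if (isCore(p)) {
            for (q <- neighbours(p)) {
              if (labels(q) == -1) {
                labels(q) = nextLabel
                stack = q :: stack
              }
            }
          }
        }
        nextLabel += 1
      }
    }
    labels
  }
}
```

Because the only input is the distance matrix, the same routine works for any of the distance functions discussed in the talk.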

Reference Spaces
Non-linear metric space
Probability space

Bipartite probability graph
[Figure: bipartite graph linking users (U1 to U7) to movie categories (Action, Comedy, Drama, Thriller)]
User groups and their probability distributions over movie categories:
User  Age  Sex  Lang  Loc  |  Action  Comedy  Drama  Thriller
U1    26   M    EN    NY   |  0.1     0.2     0.3    0.4
U2    24   F    HI    CA   |  0.3     0.2     0.1    0.4
U3    36   M    EN    PA   |  0.1     0.3     0.2    0.4
U4    32   F    FR    CA   |  0.4     0.2     0.3    0.1
U5    28   M    HI    NY   |  0.0     0.3     0.3    0.4
U6    30   F    FR    CA   |  0.5     0.2     0.3    0.0
U7    33   M    EN    NY   |  0.0     0.7     0.3    0.0

Probability Spaces
Input data space: N-dimensional attribute tuples (A1, A2, A3, A4, A5, …, AN)
Probability distribution: P-dimensional probability vectors (p1, p2, p3, …, pP)
[Figure: each attribute tuple is mapped to one probability vector]

Distance metrics over probability spaces
• Various measures are available for measuring distances between
probability distributions:
• KL divergence
• Bhattacharyya distance
• Total variation distance
• Similarity measures like SimRank can be computed over the bipartite
graph
• True distance metric - Hellinger distance, for two discrete probability
distributions P = (p1, p2, …, pk) and Q = (q1, q2, …, qk):
$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$
• We can now obtain a distance matrix over probability
distributions
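
A minimal sketch of the Hellinger distance and of a small in-memory distance matrix built from it is given below. The names `hellinger` and `distanceMatrix` are hypothetical, and the inputs are assumed to be valid discrete distributions (non-negative entries that sum to 1).

```scala
object HellingerDistance {
  /** Hellinger distance between two discrete distributions of equal length. */
  def hellinger(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, "distributions must have the same length")
    val sumSq = p.zip(q).map { case (pi, qi) =>
      val d = math.sqrt(pi) - math.sqrt(qi)
      d * d
    }.sum
    math.sqrt(sumSq) / math.sqrt(2.0)
  }

  /** Full pairwise distance matrix; only practical for small inputs. */
  def distanceMatrix(dists: Array[Array[Double]]): Array[Array[Double]] =
    dists.map(p => dists.map(q => hellinger(p, q)))

  def main(args: Array[String]): Unit = {
    // Two category distributions in the style of the bipartite graph slide
    val u1 = Array(0.1, 0.2, 0.3, 0.4)
    val u7 = Array(0.0, 0.7, 0.3, 0.0)
    println(hellinger(u1, u7))
  }
}
```

The value always lies between 0 (identical distributions) and 1 (distributions with disjoint support), which makes thresholds comparable across data sets.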

Problem statement so far
• For the purpose of clustering we need to capture local
neighborhood distances
• We need to compute a distance matrix
• Distance matrix is the most general form of capturing
neighborhoods
• Works for any distance metric over high-dimensional data
• Works for metrics over probabilistic spaces
• Key challenges:
• Building distance matrix for BigData
• Finding the epsilon (radius or min. distance) for clustering

Understanding the clustering-epsilon
Frequency distribution (Histogram, density plot)

Understanding the clustering-epsilon
Frequency distribution (Histogram, density plot)
Higher-resolution density plots

Method proposed:
Finding optimal clustering-epsilon
• Assumption: Distribution of natural high-dimensional data is
non-uniform and concentrates around low-dimensional
structures
• A significant proportion of data is organised in clusters
• Compute the density function (frequency distribution) of the
entire distance matrix
• The intra-cluster distances of points within a cluster should
lie in a narrow range
• If we model the density function as multi-modal Gaussian,
then the peak of first mode is the clustering-epsilon
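
As a rough illustration of the idea, the sketch below finds the first peak of a plain equal-width histogram of distances. This is a simplified stand-in for the multi-modal Gaussian fit described above, not the method from the talk; `histogram` and `firstPeak` are hypothetical helpers, and on noisy data the histogram would need smoothing before the first local maximum is meaningful.

```scala
object FirstPeakEpsilon {
  /** Equal-width histogram: returns (bin centres, counts). */
  def histogram(values: Seq[Double], numBins: Int): (Array[Double], Array[Int]) = {
    val lo = values.min
    val hi = values.max
    require(hi > lo && numBins > 0, "need a non-degenerate range and at least one bin")
    val width = (hi - lo) / numBins
    val counts = new Array[Int](numBins)
    values.foreach { v =>
      // The maximum value falls into the last bin
      val idx = math.min(((v - lo) / width).toInt, numBins - 1)
      counts(idx) += 1
    }
    val centres = Array.tabulate(numBins)(i => lo + (i + 0.5) * width)
    (centres, counts)
  }

  /** Centre of the first non-empty bin that is a local maximum of the counts. */
  def firstPeak(centres: Array[Double], counts: Array[Int]): Double = {
    val n = counts.length
    val idx = (0 until n).find { i =>
      val left = if (i == 0) -1 else counts(i - 1)
      val right = if (i == n - 1) -1 else counts(i + 1)
      counts(i) > 0 && counts(i) > left && counts(i) >= right
    }.getOrElse(counts.indexOf(counts.max))
    centres(idx)
  }
}
```

The returned bin centre can serve as a first candidate for the clustering-epsilon, to be refined by a proper density estimate.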

Method proposed:
Finding optimal clustering-epsilon
• The problem comes down to finding the best-fitting curve
for the Gaussian kernel
• One of the ways to solve it algorithmically:
Grid Search (band_width, grid_size) → rFFT → Silverman Transform → I-rFFT → Score (logLoss, stdDev) → Minima (band_width, grid_size)

BigData: Distance Matrix
computation

Distance Matrix creation
• Inputs: DataFrame / RDD of Feature vectors
• Pairwise Distance function over 2 feature vectors
• BigData assumptions:
• Feature vectors could have from 10 to 1000 dimensions
• More than 1 million feature vectors
• Computation complexity:
• More than $10^{12}$ ($10^6 \times 10^6$) distance computations
• High storage requirements for a matrix with more than $10^{12}$ entries
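
A back-of-the-envelope estimate makes the storage requirement concrete. It assumes exactly N = 10^6 vectors and 8-byte double-precision entries; both are illustrative assumptions rather than figures from the talk.

```latex
\begin{align*}
N^2 &= 10^6 \times 10^6 = 10^{12} \text{ pairwise distances} \\
\text{dense storage} &\approx 10^{12} \times 8 \text{ bytes} = 8 \times 10^{12} \text{ bytes} \approx 8\ \text{TB} \\
\text{upper triangle only} &= \tfrac{N(N-1)}{2} \approx 5 \times 10^{11} \text{ entries} \approx 4\ \text{TB}
\end{align*}
```

Even when symmetry is exploited, the full matrix does not fit comfortably in memory, which motivates keeping only the distances below a cut-off.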

Distance Matrix creation: Optimization – 1
Reducing shuffle operations
• An intuitive way of doing pairwise distance computation
// cosineDist is assumed to be a UDF over two feature-vector columns (not shown on the slide)
val feature_pairs_df = feature_vec_df.crossJoin(
  feature_vec_df.withColumnRenamed("featureVector", "featureVector_2"))
val feature_dist_df = feature_pairs_df.withColumn("cosineDist",
  cosineDist($"featureVector", $"featureVector_2"))
• Drawbacks
• Huge shuffle cost of the cross join, as total pairs would be more than $10^{12}$
• Assuming E executors, the distance computations are only split E ways
($10^{12}$ / E per executor)

Reducing shuffle operations
• Optimal way to build a distance matrix
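• Instead of computing a pair at a time, compute an entire row at a time
• Convert data to array of feature vectors
• Compute dot product of each element with transpose of array
• Broadcast the transposed array to do a row computation entirely
in memory
// Collect (id, feature vector) pairs to the driver; array columns are read as Seq[Double]
val feature_array = feature_vec_df.rdd.map(row =>
  (row.getAs[Long]("id"), row.getAs[Seq[Double]]("prob_dist_arr").toArray)).collect
// Broadcast the collected array to every executor
val broad_feature_array = spark.sparkContext.broadcast(feature_array)
// getRowDist is assumed to be a UDF that reads broad_feature_array and
// returns one row of distances (not shown on the slide)
val distance_matrix = feature_vec_df.withColumn("rowDist",
  getRowDist($"id", $"prob_dist_arr"))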
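
The slides do not show the body of `getRowDist`. As a hedged illustration of what a row-at-a-time computation might look like, the plain-Scala sketch below computes one row of cosine distances against a collected array; `RowDistance`, `cosineDist` and `rowDist` are hypothetical names, and in the Spark version the `all` argument would come from the broadcast variable.

```scala
object RowDistance {
  /** Cosine distance between two dense vectors of equal length (NaN for zero vectors). */
  def cosineDist(a: Array[Double], b: Array[Double]): Double = {
    var i = 0
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    while (i < a.length) {
      dot += a(i) * b(i)
      normA += a(i) * a(i)
      normB += b(i) * b(i)
      i += 1
    }
    1.0 - dot / (math.sqrt(normA) * math.sqrt(normB))
  }

  /** One row of the distance matrix: distances from `vec` to every (id, vector) in `all`. */
  def rowDist(vec: Array[Double], all: Array[(Long, Array[Double])]): Array[(Long, Double)] =
    all.map { case (id, other) => (id, cosineDist(vec, other)) }
}
```

Each call touches only local memory, so no shuffle is needed once the array has been broadcast.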

Reducing feature vector size
• Broadcasting and retaining 1000 dimensional arrays creates
memory overhead
• Majority of high-dimensional feature vectors are sparse
• Convert dense arrays to
org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.functions.udf
val arrayToSparse = udf( (arr: Seq[Double]) =>
  new DenseVector(arr.toArray).toSparse )

Complexity tradeoffs: Time vs Space
• Assuming N observations and k as sparsity in feature vectors
• Cross Join complexity (per partition)
• Time complexity: $O(N^2 \cdot C)$
• C – Cost of shuffle and cost of distance function
• Space complexity: $O(1)$ (No in-memory storage needed)
• Broadcast complexity (per partition)
• Time complexity: $O(N \cdot k \cdot C)$
• C – Cost of shuffle and cost of distance function
• Space complexity: $O(N \cdot k)$
• Time improvement: $O(N / k)$
• Typically $N > 10^6$ and $k < 0.1$

Cut-off epsilon
• As the purpose of the distance matrix is density-based clustering
• We only need the lower range of distances (as shown previously)
• If we can obtain a gross cut-off value for epsilon
• Then all distances above it can be ignored
• Quick idea:
• Take a random sample of data
• Compute distances
• Obtain a histogram and find the first peak
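// Take a 1% random sample without replacement
val sample_df = feature_vec_df.sample(false, 0.01)
val distance_matrix = sample_df
  .withColumn("rowDist", getRowDist($"id", $"prob_dist_arr"))
// col (a column of Double distances) and numBins are assumed to be defined elsewhere
val histo = distance_matrix.select(col).rdd
  .map(row => row.getDouble(0)).histogram(numBins)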

Scala performance issues
• Scala is a very expressive functional language
• But its high-level constructs can come with a heavy performance cost
• In hot loops over the entire data, don’t use: {Iterators, for, foreach, map}
• Use: primitive while loops
• Performance can improve by multiple orders of magnitude for operations
over the entire data
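
The contrast can be sketched with a dot product written both ways. The functional version allocates an intermediate array of tuples and boxes values, while the while loop works on primitives in place; the actual speed-up depends on the JVM, the Scala version and the data size, so it should be measured rather than assumed.

```scala
object LoopStyles {
  /** Functional style: concise, but allocates intermediate collections. */
  def dotFunctional(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Primitive while loop: no intermediate allocation, no boxing. */
  def dotWhile(a: Array[Double], b: Array[Double]): Double = {
    val n = math.min(a.length, b.length)
    var i = 0
    var acc = 0.0
    while (i < n) {
      acc += a(i) * b(i)
      i += 1
    }
    acc
  }
}
```

Both functions return the same value; only the inner-loop cost differs, which matters when the function is called once per pair of feature vectors.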

Distributed DBSCAN using GraphX/GraphFrames
import org.graphframes.GraphFrame
// Create an initial graph based on raw data
val dist_graph = GraphFrame(vertex_df, edge_eps_df)
// Find the core points in the graph, which have at least numPoints neighbours
val neighbour_df = dist_graph.outDegrees.filter($"outDegree" >= numPoints)
val core_points_df = vertex_df.join(neighbour_df, Seq("id"))
// Find the core edges: edges that contain either two core points or
// one core point and one non-core point
// (dist_label is assumed to hold the name of the distance column)
val core_edges_src_df = edge_eps_df
  .join(core_points_df.select("id").withColumnRenamed("id", "src"), Seq("src"))
  .select("src", "dst", dist_label)
val core_edges_dst_df = edge_eps_df
  .join(core_points_df.select("id").withColumnRenamed("id", "dst"), Seq("dst"))
  .select("src", "dst", dist_label)
val core_edges_df = core_edges_src_df.union(core_edges_dst_df).dropDuplicates()
// Create the core graph
val core_graph = GraphFrame(core_points_df, core_edges_df)
// Create check point directory to be used by connected components algorithm
spark.sparkContext.setCheckpointDir("/tmp/checkPointDir")
// Obtain the clusters via connected components
val connectedComp = core_graph.connectedComponents.run()
val clusters_df = connectedComp.select("id", "component")

Summary
• Understanding the natural geometry of data helps in
capturing the true information content
• Analytical approach: Figure out the correct geometry
theoretically and choose appropriate distance metric
• If data can be mapped to labels, then use it as a reference
space for applying data geometry and clustering
• Manifolds are an important tool to understand the global
and local structure of data
• For clustering focus on capturing the local neighbourhood
• For BigData processing use efficient distance matrix
computation techniques

Topology Quiz:
Can you differentiate between the 2 images?
If the answer is “YES”: Then you are not a topologist 

This is how a topologist views it (Homeomorphism)

THANKS
E-mail: kuldeep.jiwani@gmail.com
LinkedIn: https://www.linkedin.com/in/kuldeep-jiwani-988605/

References
• A Few Useful Things to Know about Machine Learning (Pedro Domingos)
• Thinking Outside the Euclidean Box: Riemannian Geometry and
Inter-Temporal Decision-Making
• WikiBooks Topology: https://en.wikibooks.org/wiki/Topology
• Notes on topology:
https://logancollinsblog.com/2017/11/12/notes-on-topology/
• Torus geodesic:
https://commons.wikimedia.org/wiki/File:Insect_on_a_torus_tracing_out_a_non-trivial_geodesic.gif
• Coffee mug & torus homeomorphism:
https://commons.wikimedia.org/wiki/File:Mug_and_Torus_morph.gif
