Topological space creation &
Clustering at BigData scale
Kuldeep Jiwani
Agenda
•Motivation
•Data Geometry: Analytical
• Topological spaces
• Curved spaces
• Manifolds
•Data Geometry: Applied Machine Learning
• Manifolds: Global vs Local
• Reference spaces (Probabilistic spaces)
• Clustering technique
•BigData computations
• Apache Spark code and optimizations
BigData Mining & Clustering
• BigData mining is concerned with
• Discovery of interesting patterns
• Uncovering unknown knowledge present in vast data lakes
• Cluster Analysis is the process of discovering homogeneous
groups called clusters
• Given some measure of similarity between data objects, the
goal in most clustering algorithms is
• Maximizing the homogeneity within each cluster
• Maximizing the heterogeneity between different clusters
Geometry and data
• Curse of dimensionality: Our intuitions, which come from a
three-dimensional world (Euclidean), often do not apply in high-
dimensional ones*
• In high dimensions, most of the mass of a multivariate Gaussian
distribution is not near the mean
• But in an increasingly distant “shell” around it
• “Most of the volume of a high-dimensional orange is in the skin, not the
pulp”
• Blessings of non-uniformity: Distribution of natural data is non-
uniform and concentrates around low-dimensional structures*
• The shape (geometry) of the distribution can be exploited for
efficient learning
*Source: "A few useful things to know about Machine Learning" by Pedro Domingos at the University of Washington
Clustering and metric spaces
[Figure: the same two clusters plotted in Cartesian (x, y) and polar (r, θ) coordinates]
$r = \sqrt{x^2 + y^2}$
$\theta = \tan^{-1}(y / x)$
The two clusters are easily separable in (r, θ) space
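As a minimal sketch of the coordinate change above (the object and method names are illustrative, not from the slides), the mapping can be written in plain Scala. It uses `atan2` rather than a bare `tan⁻¹(y / x)` so that all four quadrants are handled, which is an assumption beyond what the slide states.

```scala
object PolarCoordinates {
  /** Maps a Cartesian point (x, y) to polar coordinates (r, theta). */
  def toPolar(x: Double, y: Double): (Double, Double) = {
    val r = math.hypot(x, y)     // r = sqrt(x^2 + y^2)
    val theta = math.atan2(y, x) // quadrant-aware form of tan^-1(y / x)
    (r, theta)
  }

  /** True when every point lies closer to the origin than `radius`. */
  def allInside(points: Seq[(Double, Double)], radius: Double): Boolean =
    points.forall { case (x, y) => toPolar(x, y)._1 < radius }
}
```

For two ring-shaped clusters centred on the origin, a single threshold on r is then enough to tell them apart, which is not possible with a straight line in (x, y).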
Data Geometry approaches
•Analytical
• Theoretical
•Empirical
•Applied Machine Learning ( BigData )
Data geometry: Analytical
Topological spaces
• Is the most general notion of a mathematical space
• Is defined as a set of points
• Along with a set of neighborhoods for each point
• Satisfying a set of axioms relating points & neighborhoods
• Arbitrary (finite or infinite) unions and finite intersections of open sets belong
to the topology
• It allows definition of concepts such as
• Connectedness
• Compactness
• Dimensionality
• Continuity
• Presence of holes
Metric spaces
• A metric space is a special type of topological space
• With the added structure of a distance function d over the set M
• For any x, y, z ∈ M, the following 4 properties should hold:
• $d(x, y) \ge 0$ (Non-negativity)
• $d(x, y) = 0$ if and only if $x = y$ (Identity of indiscernibles)
• $d(x, y) = d(y, x)$ (Symmetry)
• $d(x, z) \le d(x, y) + d(y, z)$ (Triangle inequality)
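The four properties can be checked empirically on a finite sample of points. The sketch below is illustrative only (the names `MetricCheck` and `looksLikeMetric` are hypothetical): passing the check on a sample does not prove that a function is a metric, but a failure does show that it is not.

```scala
object MetricCheck {
  type Point = Vector[Double]

  def euclidean(x: Point, y: Point): Double =
    math.sqrt(x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum)

  /** Checks the four metric properties on every pair and triple of a finite sample. */
  def looksLikeMetric(d: (Point, Point) => Double, pts: Seq[Point], tol: Double = 1e-9): Boolean = {
    val pairsOk = pts.forall { x =>
      pts.forall { y =>
        d(x, y) >= 0 &&                         // non-negativity
        ((d(x, y) <= tol) == (x == y)) &&       // identity of indiscernibles
        math.abs(d(x, y) - d(y, x)) <= tol      // symmetry
      }
    }
    val triangleOk = pts.forall { x =>
      pts.forall { y =>
        pts.forall { z => d(x, z) <= d(x, y) + d(y, z) + tol } // triangle inequality
      }
    }
    pairsOk && triangleOk
  }
}
```

Squared Euclidean distance is a handy counter-example: for the points 0, 1 and 2 on a line it gives 4 for the outer pair but only 1 + 1 via the middle point, so the triangle inequality fails.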
Euclidean space
• The Euclidean metric is defined by a norm on $\mathbb{R}^n$
• $d(x, y) = \lVert x - y \rVert = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$
• Euclidean space is how the world visually appears to us
• That’s why, whenever a metric space needs to be defined, we
either explicitly assume it to be Euclidean or implicitly model it that way
• It may work well in many scenarios
• But is the assumption of a Euclidean metric always valid?
• What if the underlying geometry is non-Euclidean?
• What if the Euclidean metric is distorting the actual geometry?
Curved spaces
• For points on the surface of the Earth, applying Euclidean
geometry distorts the shortest path (red lines in the figure)
• Geodesic distance: the length of the shortest path between two points
along the curved surface; on a sphere it is measured along great circles
• Great circles: circles on the sphere that have the same diameter as
the sphere itself
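To make the difference concrete, here is a small sketch comparing the great-circle distance with the straight-line (chord) distance through the Earth. The slides do not name a formula; the haversine formula used here is a standard choice, and the mean Earth radius of 6371 km is an assumed constant.

```scala
object GreatCircle {
  val EarthRadiusKm = 6371.0 // assumed mean Earth radius

  /** Great-circle (geodesic) distance in km between two latitude/longitude points given in degrees. */
  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val phi1 = math.toRadians(lat1)
    val phi2 = math.toRadians(lat2)
    val dPhi = math.toRadians(lat2 - lat1)
    val dLambda = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dPhi / 2), 2) +
      math.cos(phi1) * math.cos(phi2) * math.pow(math.sin(dLambda / 2), 2)
    // min(1.0, ...) guards against rounding slightly above 1 before asin
    2 * EarthRadiusKm * math.asin(math.min(1.0, math.sqrt(a)))
  }

  /** Euclidean distance through the Earth between the same two points. */
  def chordKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val centralAngle = haversineKm(lat1, lon1, lat2, lon2) / EarthRadiusKm
    2 * EarthRadiusKm * math.sin(centralAngle / 2)
  }
}
```

For two points on the equator a quarter of the way around the globe, the geodesic is a quarter of the circumference, while the chord is noticeably shorter; the gap grows as the points move further apart.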
A complex Geodesic
If an insect is placed on a surface and continually walks "forward",
by definition it will trace out a geodesic.
Geometries and Distances
• Whenever a metric is used to measure the distance between two
points
• It means that an assumption has been made about the geometry of the
surface
• As a metric is unique to a geometry
• The problem is that different metrics can be used to estimate
the distance, and hence different geometries can be imposed
• A useful measure for identifying the right geometry amongst
the infinitely many possible geometries is Gaussian
curvature
Gaussian Curvature of a surface
• The Gaussian curvature (K)
informs us how curved a
specific surface is with
respect to a flat surface
• The magnitude of (K) tells us
how much the surface is
bending
• Curvature defines distances
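[Figure: example surfaces with K < 0, K = 0 and K > 0]
Distance Hyperboloid-n (K = -1)
$d_h(p, q) = \cosh^{-1}\left( \sqrt{1 + \sum_{i=1}^{n} p_i^2} \, \sqrt{1 + \sum_{i=1}^{n} q_i^2} - \sum_{i=1}^{n} p_i q_i \right)$
Distance Euclidean-n
$d_e(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
Distance Spherical-n
$d_s(p, q) = r \cos^{-1}\left( \sum_{i=1}^{n} p_i q_i \right)$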
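The three distance formulas can be sketched side by side in plain Scala. This is an illustration under stated assumptions: the spherical distance treats p and q as unit vectors (directions) scaled by the radius r, the hyperboloid distance takes the square roots as in the standard hyperboloid model, and `acosh` is written out by hand because `scala.math` does not provide it.

```scala
object CurvedDistances {
  private def dot(p: Array[Double], q: Array[Double]): Double =
    p.zip(q).map { case (a, b) => a * b }.sum

  // Inverse hyperbolic cosine via the log identity; the clamp guards rounding below 1.
  private def acosh(x: Double): Double = {
    val c = math.max(1.0, x)
    math.log(c + math.sqrt(c * c - 1.0))
  }

  /** Flat space, K = 0. */
  def euclidean(p: Array[Double], q: Array[Double]): Double =
    math.sqrt(p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum)

  /** Sphere of radius r, K > 0; p and q are assumed to be unit vectors. */
  def spherical(p: Array[Double], q: Array[Double], r: Double = 1.0): Double =
    r * math.acos(math.max(-1.0, math.min(1.0, dot(p, q))))

  /** Hyperboloid model, K = -1. */
  def hyperboloid(p: Array[Double], q: Array[Double]): Double =
    acosh(math.sqrt(1 + dot(p, p)) * math.sqrt(1 + dot(q, q)) - dot(p, q))
}
```

The same pair of coordinate vectors yields three different distances, which is the point of the slide: choosing a metric is choosing a geometry.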
Manifold
• A manifold is a topological space that is locally Euclidean
• The simplest well-known example of a manifold is an atlas of the Earth
[Figure: the 3-dimensional Earth and its 2-dimensional manifold (map)]
Manifold
• Every point of a manifold has a neighborhood with a continuous,
invertible map, whose inverse is also continuous, onto a region of
Euclidean space (homeomorphism)
• These maps are called charts, and a
collection of charts covering the manifold forms an atlas
• In an atlas, two charts can
overlap on the manifold, so their
intersection is mapped into
Euclidean space by both charts
• Transition maps are composite
functions that convert between the two charts on the overlap
Manifold: Geometry & Topology
• Geometry and topology both study the properties of
manifolds
• Topology primarily studies those problems that are inherently
global in nature
• Geometry studies properties of manifolds that have an
interesting local structure
Riemannian geometry
• It deals with a broad range of
geometries whose metric
properties vary from point to point
• It studies Riemannian manifolds
(smooth manifolds) equipped with a Riemannian
metric
• Riemannian metric: an inner product
on the tangent space at each point
that varies smoothly from point to
point
• This was also used in the general
theory of relativity
Data geometry: Applied
Machine Learning
Data analysis & Clustering: Case – 1
K-Means
Data analysis & Clustering: Case – 2
K-Means
Euclidean metric is invariant to rotation, reflection and translation
Data analysis & Clustering: Case – 3
Euclidean metric still works well on a plane
K-Means
Data analysis & Clustering: Case – 4
K-Means
Although the overall geometry is curved,
in parts it can be easily approximated and separated by planes
Data analysis & Clustering: Case – 5
K-Means (3D)
As curvature increases beyond a point, K-Means doesn’t work well
Data analysis & Clustering: Case – 4 (Global Manifold)
K-Means (3D)
Manifold - Global
(MDS)
Data analysis & Clustering: Case – 5
K-Means (3D)
Manifold - Global
(MDS)
Data analysis & Clustering: Case – 5
Global Manifold (MDS)
K-Means (2D)
Data analysis & Clustering: Case – 4 (Local Manifold)
K-Means (3D)
Manifold - Local
(LLE)
Manifold: Global vs Local
[Figure panels: Global Manifold, Local Manifold]
Manifold construction: Global vs Local
[Figure panels: Global Manifold, Local Manifold, Clustering]
DBSCAN: Clustering on local neighborhood
• Core points (A)
• Core edges
• Non-core points (B, C)
• Un-clustered points (N)
• Input: Distance matrix
• Critical parameter: Epsilon (ε)
• Radius of the neighborhood, i.e. the length limit of a core edge
• The distance threshold within which two points count as neighbors
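The roles of core points, non-core (border) points and un-clustered points can be sketched with a small single-machine DBSCAN over a full distance matrix. This is a minimal illustration, not the distributed version shown later in the deck; the names `SimpleDbscan`, `eps` and `minPts` are assumptions, and each point counts itself as one of its own neighbours.

```scala
object SimpleDbscan {
  val Noise = -1

  /** Labels each point with a cluster id (0, 1, ...) or Noise, given a full distance matrix. */
  def cluster(dist: Array[Array[Double]], eps: Double, minPts: Int): Array[Int] = {
    val n = dist.length
    // Indices of all points within eps of point i (including i itself).
    val neighbours = Array.tabulate(n)(i => (0 until n).filter(j => dist(i)(j) <= eps))
    val isCore = neighbours.map(_.size >= minPts)
    val labels = Array.fill(n)(Noise)
    var nextId = 0
    for (start <- 0 until n if isCore(start) && labels(start) == Noise) {
      // Grow one cluster outwards from a core point that has no label yet.
      val queue = scala.collection.mutable.Queue(start)
      labels(start) = nextId
      while (queue.nonEmpty) {
        val p = queue.dequeue()
        if (isCore(p)) {
          // Only core points expand the cluster; border points just receive a label.
          for (q <- neighbours(p) if labels(q) == Noise) {
            labels(q) = nextId
            queue.enqueue(q)
          }
        }
      }
      nextId += 1
    }
    labels
  }
}
```

Because the only input is the distance matrix, the same routine works for any of the metrics discussed earlier, which is why the deck treats the matrix as the general interface to clustering.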
Reference Spaces
Non-linear metric space
Probability space
Bipartite probability graph
[Figure: bipartite graph linking Users (U1 to U7) to Movie categories (Action, Comedy, Drama, Thriller)]
User Groups
User  Age  Sex  Lang  Loc  Action  Comedy  Drama  Thriller
U1    26   M    EN    NY   0.1     0.2     0.3    0.4
U2    24   F    HI    CA   0.3     0.2     0.1    0.4
U3    36   M    EN    PA   0.1     0.3     0.2    0.4
U4    32   F    FR    CA   0.4     0.2     0.3    0.1
U5    28   M    HI    NY   0.0     0.3     0.3    0.4
U6    30   F    FR    CA   0.5     0.2     0.3    0.0
U7    33   M    EN    NY   0.0     0.7     0.3    0.0
Probability Spaces
Input data space: N-dimensional attribute tuples (A1, A2, A3, …, AN)
Probability distribution: P-dimensional probability vectors (p1, p2, p3, …, pP)
[Figure: rows of attribute tuples shown alongside rows of probability vectors]
Distance metrics over probability spaces
• Various measures are available for comparing
probability distributions:
• KL divergence
• Bhattacharyya distance
• Total variation distance
• Similarity measures like SimRank can be computed over the bipartite
graph
• True distance metric - Hellinger distance, for two discrete probability distributions
P = (p1, p2, …, pk) and Q = (q1, q2, …, qk):
$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2}$
• We can now obtain a distance matrix over probability
distributions
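A minimal plain-Scala sketch of the Hellinger distance and the resulting distance matrix follows. The object and method names are illustrative; inputs are assumed to be discrete distributions of equal length, such as the per-user category probabilities from the movie example.

```scala
object Hellinger {
  /** Hellinger distance between two discrete probability distributions of equal length. */
  def distance(p: Array[Double], q: Array[Double]): Double = {
    require(p.length == q.length, "distributions must have the same number of outcomes")
    val sumSq = p.zip(q).map { case (pi, qi) =>
      val d = math.sqrt(pi) - math.sqrt(qi)
      d * d
    }.sum
    math.sqrt(sumSq) / math.sqrt(2.0)
  }

  /** Full pairwise distance matrix over a collection of distributions. */
  def distanceMatrix(dists: Array[Array[Double]]): Array[Array[Double]] =
    dists.map(p => dists.map(q => distance(p, q)))
}
```

Calling `Hellinger.distanceMatrix` on rows such as `Array(0.1, 0.2, 0.3, 0.4)` and `Array(0.3, 0.2, 0.1, 0.4)` gives a symmetric matrix with zeros on the diagonal and values between 0 and 1, which can be fed directly to a distance-based clustering step.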
Problem statement so far
• For the purpose of clustering we need to capture local
neighborhood distances
• We need to compute a distance matrix
• Distance matrix is the most general form of capturing
neighborhoods
• Works for any distance metric over high-dimensional data
• Works for metrics over probabilistic spaces
• Key challenges:
• Building distance matrix for BigData
• Finding the epsilon (radius or min. distance) for clustering
Understanding the clustering-epsilon
Frequency distribution (Histogram, density plot)
Higher-resolution density plots
Method proposed:
Finding optimal clustering-epsilon
• Assumption: Distribution of natural high-dimensional data is
non-uniform and concentrates around low-dimensional
structures
• A significant proportion of data is organised in clusters
• Compute the density function (frequency distribution) of the
entire distance matrix
• The intra-cluster distances of points within a cluster should
lie in a narrow range
• If we model the density function as multi-modal Gaussian,
then the peak of first mode is the clustering-epsilon
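As a rough illustration of the idea, the sketch below bins a set of pairwise distances and returns the centre of the first bin that is a local maximum. This is a deliberately simplified stand-in: it uses a plain fixed-width histogram rather than the multi-modal Gaussian fit proposed in the slides, and the names `histogram`, `firstPeak` and `numBins` are assumptions.

```scala
object EpsilonFromHistogram {
  /** Returns (bin centres, counts) for numBins equal-width bins over [min, max]. */
  def histogram(values: Array[Double], numBins: Int): (Array[Double], Array[Int]) = {
    val lo = values.min
    val hi = values.max
    require(hi > lo && numBins > 0, "need a non-empty value range and at least one bin")
    val width = (hi - lo) / numBins
    val counts = new Array[Int](numBins)
    values.foreach { v =>
      // The maximum value would land in bin numBins, so clamp it into the last bin.
      val bin = math.min(numBins - 1, ((v - lo) / width).toInt)
      counts(bin) += 1
    }
    val centres = Array.tabulate(numBins)(i => lo + (i + 0.5) * width)
    (centres, counts)
  }

  /** Centre of the first bin whose count is a local maximum; falls back to the global maximum. */
  def firstPeak(centres: Array[Double], counts: Array[Int]): Double = {
    val idx = counts.indices.find { i =>
      val left = if (i == 0) 0 else counts(i - 1)
      val right = if (i == counts.length - 1) 0 else counts(i + 1)
      counts(i) > 0 && counts(i) >= left && counts(i) > right
    }
    centres(idx.getOrElse(counts.indexOf(counts.max)))
  }
}
```

On real data a raw histogram is noisy, so small spurious peaks can appear before the true first mode; that is the motivation for smoothing the density before picking the peak.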
Method proposed:
Finding optimal clustering-epsilon
• The problem comes down to finding the best-fitting curve
for the Gaussian kernel
• One way to solve it algorithmically:
Grid Search (band_width, grid_size) → rFFT → Silverman Transform → I-rFFT → Score (logLoss, stdDev) → Minima (band_width, grid_size)
BigData: Distance Matrix
computation
Distance Matrix creation
• Inputs: DataFrame / RDD of Feature vectors
• Pairwise Distance function over 2 feature vectors
• BigData assumptions:
• Feature vector could have dimensions from 10 – 1000
• More than 1 million feature vectors
• Computation complexity:
• More than $10^{12}$ ($10^6 \times 10^6$) distance computations
• High storage requirements for a matrix with more than $10^{12}$ entries
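To put the storage requirement in numbers, here is a back-of-the-envelope sketch. It assumes each distance is stored as an 8-byte double, which the slides do not specify; the helper names are illustrative.

```scala
object DistanceMatrixSize {
  val BytesPerDouble = 8L // assumed storage per distance value

  /** Bytes needed for a dense n x n matrix of doubles. */
  def denseBytes(n: Long): Long = n * n * BytesPerDouble

  /** Bytes if only the strict upper triangle is kept (valid for a symmetric distance). */
  def upperTriangleBytes(n: Long): Long = n * (n - 1) / 2 * BytesPerDouble
}
```

For one million feature vectors the dense matrix needs 8 × 10^12 bytes (about 8 TB), and even exploiting symmetry it is still roughly 4 TB, which is why the later optimizations avoid materialising all entries.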
Distance Matrix creation: Optimization – 1
Reducing shuffle operations
• An intuitive way of doing pairwise distance computation
// Pair every feature vector with every other one (full Cartesian product)
val feature_pairs_df = feature_vec_df.crossJoin(
  feature_vec_df.withColumnRenamed("featureVector", "featureVector_2"))
// cosineDist is a UDF over two feature vectors, defined elsewhere
val feature_dist_df = feature_pairs_df.withColumn("cosineDist",
  cosineDist($"featureVector", $"featureVector_2"))
• Drawbacks
• Huge shuffle cost of the cross join, as the total number of pairs would be more than $10^{12}$
• Assuming E executors, at a given point in time only ($10^{12}$ / E) distances
are computed
Distance Matrix creation: Optimization – 1
Reducing shuffle operations
• Optimal way to build a distance matrix
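• Instead of computing a pair at a time, compute an entire row at a time
• Convert data to array of feature vectors
• Compute dot product of each element with transpose of array
• Broadcast the transposed array to do a row computation entirely in-memory
// Collect (id, feature vector) pairs to the driver;
// Spark returns array columns as Seq, so convert to Array explicitly
val feature_array = feature_vec_df.rdd.map(row =>
  (row.getAs[Long]("id"), row.getAs[Seq[Double]]("prob_dist_arr").toArray)).collect
// Ship the whole array to every executor once
val broad_feature_array = spark.sparkContext.broadcast(feature_array)
// getRowDist is a UDF (defined elsewhere) that reads broad_feature_array.value
// and returns the distances from one vector to all the others
val distance_matrix = feature_vec_df.withColumn("rowDist",
  getRowDist($"id", $"prob_dist_arr"))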
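The slides do not show the body of `getRowDist`. A plain-Scala sketch of what such a row-wise function might compute is given below; the names `rowDist` and `cosineDist` and the choice of cosine distance are assumptions for illustration. In Spark this logic would be wrapped in a UDF and would read the broadcast array instead of taking it as a parameter.

```scala
object RowDistance {
  /** Cosine distance (1 - cosine similarity) between two dense vectors of equal length. */
  def cosineDist(a: Array[Double], b: Array[Double]): Double = {
    var dot = 0.0
    var na = 0.0
    var nb = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i)
      na += a(i) * a(i)
      nb += b(i) * b(i)
      i += 1
    }
    1.0 - dot / (math.sqrt(na) * math.sqrt(nb))
  }

  /** One row of the distance matrix: distances from `vec` to every (id, vector) in `all`. */
  def rowDist(vec: Array[Double],
              all: Array[(Long, Array[Double])],
              dist: (Array[Double], Array[Double]) => Double): Array[(Long, Double)] =
    all.map { case (otherId, other) => (otherId, dist(vec, other)) }
}
```

Each call touches only local memory, so no pairs need to be shuffled between executors; the cost moves to holding the full array on every executor.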
Distance Matrix creation: Optimization – 2
Reducing feature vector size
• Broadcasting and retaining 1000-dimensional arrays creates
memory overhead
• The majority of high-dimensional feature vectors are sparse
• Convert dense arrays to
org.apache.spark.mllib.linalg.SparseVector
import org.apache.spark.mllib.linalg.DenseVector
import org.apache.spark.sql.functions.udf
val arrayToSparse = udf((arr: Seq[Double]) => new DenseVector(arr.toArray).toSparse)
Complexity tradeoffs: Time vs Space
• Assuming N observations and k as sparsity in feature vectors
• Cross Join complexity (per partition)
• Time complexity: O ($N^2$ * C)
• C – Cost of shuffle and cost of distance function
• Space complexity: O (1) (No in-memory storage needed)
• Broadcast complexity (per partition)
• Time complexity: O (N * k * C)
• C – Cost of shuffle and cost of distance function
• Space complexity: O (N * k)
• Time improvement: O (N / k)
• Typically N > $10^6$ and k < 0.1
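The stated improvement follows from dividing the two time complexities, treating the cost C as the same in both cases as the slide does. Plugging in the boundary values N = 10^6 and k = 0.1 gives a feel for the scale; these are illustrative values, not measurements.

```latex
\frac{T_{\text{cross join}}}{T_{\text{broadcast}}}
  = \frac{N^2 \cdot C}{N \cdot k \cdot C}
  = \frac{N}{k},
\qquad
N = 10^6,\; k = 0.1 \;\Rightarrow\; \frac{N}{k} = 10^7
```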
Distance Matrix creation: Optimization – 3
Cut-off epsilon
• The purpose of the distance matrix is density-based clustering
• So we only need the lower range of distances (as shown previously)
• If we can obtain a gross cut-off value for epsilon
• Then all distances above it can be ignored
• Quick idea:
• Take a random sample of data
• Compute distances
• Obtain a histogram and find the first peak
// Work on a 1% random sample, drawn without replacement
val sample_df = feature_vec_df.sample(false, 0.01)
val distance_matrix = sample_df
  .withColumn("rowDist", getRowDist($"id", $"prob_dist_arr"))
// col (the distance column to summarise) and numBins are set elsewhere
val histo = distance_matrix.select(col).rdd
  .map(row => row.getDouble(0)).histogram(numBins)
Distance Matrix creation: Optimization – 4
Scala performance issues
• Scala is a very expressive functional language
• But that expressiveness comes with a heavy performance cost
• Don’t use: {Iterators, for, foreach, map}
• Use: Primitive while loop
• Performance improvement of multiple orders of magnitude for operations over
the entire data
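The contrast can be illustrated with two versions of a dot product. This is a sketch with illustrative names; it shows the two coding styles only and makes no claim about the size of the speed-up, which depends on the JVM, the data and the Scala version.

```scala
object DotProduct {
  /** Functional style: builds an intermediate array of pairs and another of products. */
  def dotFunctional(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Primitive while loop: a single pass with no intermediate collections. */
  def dotWhile(a: Array[Double], b: Array[Double]): Double = {
    var acc = 0.0
    var i = 0
    while (i < a.length) {
      acc += a(i) * b(i)
      i += 1
    }
    acc
  }
}
```

Both functions return the same value; the while-loop version avoids the temporary allocations, which matters when the function is called once per pair across the whole distance matrix.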
Distributed DBSCAN using GraphX/GraphFrames
import org.graphframes.GraphFrame

// Create an initial graph based on raw data
// (edge_eps_df is assumed to hold only the edges no longer than epsilon)
val dist_graph = GraphFrame(vertex_df, edge_eps_df)
// Find the core points in the graph, which have at least numPoints neighbours
val neighbour_df = dist_graph.outDegrees.filter($"outDegree" >= numPoints)
val core_points_df = vertex_df.join(neighbour_df, Seq("id"))
// Find the core edges: edges that join either two core points or one core point and one non-core point
// (dist_label is the name of the distance column, set elsewhere)
val core_edges_src_df = edge_eps_df
  .join(core_points_df.select("id").withColumnRenamed("id", "src"), Seq("src"))
  .select("src", "dst", dist_label)
val core_edges_dst_df = edge_eps_df
  .join(core_points_df.select("id").withColumnRenamed("id", "dst"), Seq("dst"))
  .select("src", "dst", dist_label)
val core_edges_df = core_edges_src_df.unionAll(core_edges_dst_df).dropDuplicates()
// Create the core graph
val core_graph = GraphFrame(core_points_df, core_edges_df)
// Create check point directory to be used by connected components algorithm
spark.sparkContext.setCheckpointDir("/tmp/checkPointDir")
// Obtain the clusters via connected components
val connectedComp = core_graph.connectedComponents.run()
val clusters_df = connectedComp.select("id", "component")
Summary
• Understanding the natural geometry of data helps in
capturing the true information content
• Analytical approach: Figure out the correct geometry
theoretically and choose appropriate distance metric
• If data can be mapped to labels, then use it as a reference
space for applying data geometry and clustering
• Manifolds are an important tool to understand the global
and local structure of data
• For clustering focus on capturing the local neighbourhood
• For BigData processing use efficient distance matrix
computation techniques
Topology Quiz:
Can you differentiate between the 2 images?
If the answer is “YES”: Then you are not a topologist
This is how a topologist views it (Homeomorphism)
THANKS
E-mail: kuldeep.jiwani@gmail.com
LinkedIn: https://www.linkedin.com/in/kuldeep-jiwani-988605/
References
• A Few Useful Things to Know about Machine Learning (Pedro Domingos)
• Thinking Outside the Euclidean Box: Riemannian Geometry and
Inter-Temporal Decision-Making
• WikiBooks Topology: https://en.wikibooks.org/wiki/Topology
• Notes on topology: https://logancollinsblog.com/2017/11/12/notes-on-topology/
• Torus geodesic: https://commons.wikimedia.org/wiki/File:Insect_on_a_torus_tracing_out_a_non-trivial_geodesic.gif
• Coffee mug & torus homeomorphism: https://commons.wikimedia.org/wiki/File:Mug_and_Torus_morph.gif

ODSC India 2018: Topological space creation & Clustering at BigData scale

  • 1. Topological space creation & Clustering at BigData scale Kuldeep Jiwani
  • 2. Agenda •Motivation •Data Geometry: Analytical • Topological spaces • Curved spaces • Manifolds •Data Geometry: Applied Machine Learning • Manifolds: Global vs Local • Reference spaces (Probabilistic spaces) • Clustering technique •BigData computations • Apache Spark code and optimizations
  • 3. BigData Mining & Clustering • BigData mining is concerned with • Discovery of interesting patterns • Uncovering unknown knowledge present in vast data lakes • Cluster Analysis is the process of discovering homogeneous groups called clusters • Given some measure of similarity between data objects, the goal in most clustering algorithms is • Maximizing the homogeneity within each cluster • Maximizing the heterogeneity between different clusters
  • 4. Geometry and data • Curse of dimensionality: Our intuitions, which come from a three-dimensional world (Euclidean), often do not apply in high- dimensional ones* • In high dimensions, most of the mass of a multivariate Gaussian distribution is not near the mean • But in an increasingly distant “shell” around it • “Most of the volume of a high-dimensional orange is in the skin, not the pulp” • Blessings of non-uniformity: Distribution of natural data is non- uniform and concentrates around low-dimensional structures* • The shape (geometry) of the distribution can be exploited for efficient learning *Source: "A few useful things to know about Machine Learning" by Pedro Domingos at the University of Washington
  • 5. Clustering and metric spaces y x r 𝜃 𝜃 r 𝑟 = (𝑥2 + 𝑦2) 𝜃 = tan−1 ( 𝑦 𝑥 ) The two clusters are easily separable in (r, 𝜃) space
  • 6. Data Geometry approaches •Analytical • Theoretical •Empirical •Applied Machine Learning ( BigData )
  • 8. Topological spaces • Is the most general notion of a mathematical space • Is defined as a set of points • Along with a set of neighborhoods for each point • Satisfying a set of axioms relating points & neighborhood • All finite and infinite union & intersections of sets belong to the topology • It allows definition of concepts such as • Connectedness • Compactness • Dimensionality • Continuity • Presence of holes
  • 9. Metric spaces • Metric space is a special type of topological space • With added restriction of a distance function d over metric space M • For any x, y, z ∈ 𝑀, the following 4 properties should hold: • 𝑑 𝑥, 𝑦 ≥ 0 Non-negativity • 𝑑 𝑥, 𝑦 = 0, 𝑖𝑓 𝑥 == 𝑦 Identity of indiscernibles • 𝑑 𝑥, 𝑦 = 𝑑(𝑦, 𝑥) Symmetry • 𝑑 𝑥, 𝑧 ≤ 𝑑 𝑥, 𝑦 + 𝑑(𝑦, 𝑧) Triangle inequality
  • 10. Euclidean space • Euclidean metric is define by a norm on Rn • 𝑑 𝑥, 𝑦 = 𝑥 − 𝑦 = i=1 n (xi − yi)2 • Euclidean space is how the world visually appears to us • That’s why whenever a metric space is needed to be defined we either explicitly assume it to be Euclidean or implicitly model it • It may work well in many scenarios • But is the assumption of Euclidean metric always valid? • What if the underlying geometry is non-Euclidean • What if the Euclidean metric is distorting the actual geometry
  • 11. Curved spaces • For points over the surface of Earth, if we apply Euclidean geometry, then it will distort the shortest path (red lines) • Geodesic distance: measures the shortest path between two points along the curved surface, measured over great circles • Great circles: having same diameter as that of the sphere
  • 12. A complex Geodesic If an insect is placed on a surface and continually walks "forward", by definition it will trace out a geodesic.
  • 13. Geometries and Distances • Whenever a metric is used to measure distance between two points • It means that an assumption has been made about the geometry of the surface • As a metric is unique to a geometry • The problem is that different metrics can be used to estimate the distance and hence, different geometries can be imposed • A useful measure for identifying the right geometry amongst infinite possible geometries between points is Gaussian Curvature
  • 14. Gaussian Curvature of a surface • The Gaussian curvature (K) informs us how curved a specific surface is with respect to a flat surface • The magnitude of (K) tells us how much the surface is bending • Curvature defines distances K < 0 K = 0 K > 0 Distance Hyperboloid-n (K = -1) dh(p, q) = cosh−1 ( 1 + i=1 n pi 2 1 + i=1 n qi 2 − i=1 n piqi) Distance Euclidean-n 𝑑 𝑒 p, q = i=1 n (pi − qi)2 Distance Spherical-n 𝑑 𝑠 𝑝, 𝑞 = 𝑟 cos−1 ( 𝑖=1 𝑛 𝑝𝑖 𝑞𝑖)
  • 15. Manifold • A manifold is a topological space that is locally Euclidean • Simplest, known example of a manifold is the Atlas of Earth 2 – Dimensional Manifold3 – Dimensional Earth
  • 16. Manifold • For every point in a local Euclidean space there exists a continuous function, along with an inverse mapping that maps each point uniquely in both spaces (Homeomorphism) • These are called charts and a combination of these creates an atlas • For an atlas, two charts can overlap on a manifold and their intersection can map to the same Euclidean space • Transition maps are composite functions which help in this mapping
  • 17. Manifold: Geometry & Topology • Geometry and topology both study the properties of manifolds • Topology primarily studies those problem that are inherently global in nature • Geometry studies properties of manifolds, which do have an interesting local structure
  • 18. Riemannian geometry • It deals with a broad range of geometries whose metric properties vary from point to point • It studies Riemann manifolds (smooth manifolds) over Reimann metric • Reimann metric: an inner product on the tangent space at each point that varies smoothly from point to point • This was also used in general theory of relativity
  • 20. Data analysis & Clustering: Case – 1 K-Means
  • 21. Data analysis & Clustering: Case – 2 K-Means Euclidean metric is invariant to rotation, reflection and translation
  • 22. Data analysis & Clustering: Case – 3 Euclidean metric still works well on a plane K-Means
  • 23. Data analysis & Clustering: Case – 4 K-Means Although the overall geometry is curved, but in parts it can be easily approximated and separated by planes
  • 24. Data analysis & Clustering: Case – 5 K-Means (3D) As curvature increases beyond a point K-Means doesn’t work well
  • 25. Data analysis & Clustering: Case – 4 (Global Manifold) K-Means (3D) Manifold - Global (MDS)
  • 26. Data analysis & Clustering: Case – 5 K-Means (3D) Manifold - Global (MDS)
  • 27. Data analysis & Clustering: Case – 5 Global Manifold (MDS) K-Means (2D)
  • 28. Data analysis & Clustering: Case – 4 (Local Manifold) K-Means (3D) Manifold - Local (LLE)
  • 29. Manifold: Global vs Local Global Manifold Local Manifold
  • 30. Manifold construction: Global vs Local Global Manifold Local Manifold Clustering
  • 31. DBSCAN: Clustering on local neighborhood • Core points (A) • Core edges • Non-core points (B, C) • Un-clustered points (N) • Input: Distance matrix • Critical parameter: Epsilon (ε) • Radius or length of core edge • The minimum distance ε
  • 32. Reference Spaces Non-linear metric space Probability space
  • 33. Bipartite probability graph U2 U3 U4 U1 U5 U6 Users Movie categories Action Comedy Drama Thriller U7 0.1 0.2 0.3 0.4 Action Comedy Drama Thriller 0.3 0.2 0.1 0.4 0.1 0.3 0.2 0.4 0.4 0.2 0.3 0.1 0.0 0.3 0.3 0.4 0.5 0.2 0.3 0.0 0.0 0.7 0.3 0.0 U1 U2 U3 U4 U5 U6 U7 26 M EN NY 24 F HI CA 36 M EN PA 32 F FR CA 28 M HI NY 30 F FR CA 33 M EN NY Age Sex Lang Loc User Groups
  • 34. Probability Spaces A1 A2 A3 A4 A5 … … … AN p1 p2 p3 … pP Input data space N – Dimensional attribute tuples Probability distribution P – Dimensional probability vectors A1 A2 A3 A4 A5 … … … AN A1 A2 A3 A4 A5 … … … AN p1 p2 p3 … pP p1 p2 p3 … pP
  • 35. Distance metrics over probability spaces • Various metrics available for measuring distances between probability distributions: • KL divergence • Bhattacharya distance • Total variation distance • Similarity measures like SimRank can be computed over bipartite graph • True distance metric - Hellinger distance: 1 2 𝑖=1 𝑘 ( 𝑝𝑖 − 𝑞𝑖)2 • For two discrete probability distributions: • P = (p1, p2, …, pk) and Q = (q1, q2, …, qk) • We can now obtain a distance matrix over probability distributions
  • 36. Problem statement so far • For the purpose of clustering we need to capture local neighborhood distances • We need to compute a distance matrix • Distance matrix is the most general form of capturing neighborhoods • Works for any distance metric over high-dimensional data • Works for metrics over probabilistic spaces • Key challenges: • Building distance matrix for BigData • Finding the epsilon (radius or min. distance) for clustering
  • 37. Understanding the clustering-epsilon Frequency distribution (Histogram, density plot)
  • 38. Understanding the clustering-epsilon Frequency distribution (Histogram, density plot) Higher resolution density plots
  • 39. Method proposed: Finding optimal clustering-epsilon • Assumption: Distribution of natural high-dimensional data is non-uniform and concentrates around low-dimensional structures • A significant proportion of data is organised in clusters • Compute the density function (frequency distribution) of the entire distance matrix • The intra-cluster distances of points within a cluster should lie in a narrow range • If we model the density function as multi-modal Gaussian, then the peak of first mode is the clustering-epsilon
  • 40. Method proposed: Finding optimal clustering-epsilon • The problem comes down to finding the most optimal curve for the Gaussian kernel • One of the ways to solve it algorithmically Grid Search (band_width, grid_size) rFFT Silverman Transform I-rFFT Score (logLoss, stdDev) Minima (band_width, grid_size)
  • 42. Distance Matrix creation • Inputs: DataFrame / RDD of Feature vectors • Pairwise Distance function over 2 feature vectors • BigData assumptions: • Feature vector could have dimensions from 10 – 1000 • More than 1 million feature vectors • Computation complexity: • More than 1012 (106 x 106) distance computation • High storage requirements for a matrix with more than 1012 entries
  • 43. Distance Matrix creation: Optimization – 1 Reducing shuffle operations • An intuitive way of doing pairwise distance computation:
      // Pair every feature vector with every other one
      val feature_pairs_df = feature_vec_df.crossJoin(
        feature_vec_df.withColumnRenamed("featureVector", "featureVector_2"))
      // Compute the distance for each pair
      val feature_dist_df = feature_pairs_df.withColumn("cosineDist",
        cosineDist($"featureVector", $"featureVector_2"))
    • Drawbacks • Huge shuffle cost of the cross join, as the total number of pairs would be more than 10^12 • Assuming E executors, each executor still has to compute about 10^12 / E distances
  • 44. Distance Matrix creation: Optimization – 1 Reducing shuffle operations • Optimal way to build a distance matrix • Instead of computing one pair at a time, compute an entire row at a time • Convert the data to an array of feature vectors • Compute the dot product of each element with the transpose of the array • Broadcast the transposed array to do a row computation entirely in-memory
      // Collect (id, feature vector) pairs to the driver; Spark returns array columns as Seq
      val feature_array = feature_vec_df.rdd.map(row =>
        (row.getAs[Long]("id"), row.getAs[Seq[Double]]("prob_dist_arr").toArray)).collect
      // Ship the full array to every executor once
      val broad_feature_array = spark.sparkContext.broadcast(feature_array)
      // getRowDist: UDF not shown on the slide; presumably reads broad_feature_array.value
      val distance_matrix = feature_vec_df.withColumn("rowDist",
        getRowDist($"id", $"prob_dist_arr"))
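The slide does not show the body of the row-distance UDF. The sketch below illustrates what a row-at-a-time computation could look like in plain Scala, with a local array standing in for the broadcast value. The names `RowDistSketch`, `cosineDist` and `rowDistances` are hypothetical, cosine distance is only one possible choice of metric, and all vectors are assumed to be non-zero and of equal length.

```scala
object RowDistSketch {
  /** Cosine distance (1 - cosine similarity) between two non-zero vectors. */
  def cosineDist(a: Array[Double], b: Array[Double]): Double = {
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    var i = 0
    while (i < a.length) {
      dot += a(i) * b(i)
      normA += a(i) * a(i)
      normB += b(i) * b(i)
      i += 1
    }
    1.0 - dot / (math.sqrt(normA) * math.sqrt(normB))
  }

  /** One row of the distance matrix: distances from `vec` to every vector in `all`.
    * In a Spark job, `all` would be the value of the broadcast array. */
  def rowDistances(vec: Array[Double], all: Array[(Long, Array[Double])]): Array[Double] = {
    val out = new Array[Double](all.length)
    var j = 0
    while (j < all.length) {
      out(j) = cosineDist(vec, all(j)._2)
      j += 1
    }
    out
  }
}
```

Because the whole array is available locally, each row is computed without any shuffle; the cost is that every executor must hold the full array in memory.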
  • 45. Distance Matrix creation: Optimization – 2 Reducing feature vector size • Broadcasting and retaining 1000-dimensional arrays creates memory overhead • The majority of high-dimensional feature vectors are sparse • Convert dense arrays to org.apache.spark.mllib.linalg.SparseVector
      import org.apache.spark.mllib.linalg.DenseVector
      import org.apache.spark.sql.functions.udf
      val arrayToSparse = udf((arr: Seq[Double]) => new DenseVector(arr.toArray).toSparse)
  • 46. Complexity tradeoffs: Time vs Space • Assuming N observations and k as sparsity in feature vectors • Cross Join complexity (per partition) • Time complexity: O(N^2 * C) • C – Cost of shuffle and cost of distance function • Space complexity: O(1) (No in-memory storage needed) • Broadcast complexity (per partition) • Time complexity: O(N * k * C) • C – Cost of shuffle and cost of distance function • Space complexity: O(N * k) • Time improvement: O(N / k) • Typically N > 10^6 and k < 0.1
  • 47. Distance Matrix creation: Optimization – 3 Cut-off epsilon • As the purpose of the distance matrix is density-based clustering, we only need the lower range of distances (as shown previously) • If we can obtain a gross cut-off value for epsilon, then all distances above it can be ignored • Quick idea: • Take a random sample of data • Compute distances • Obtain a histogram and find the first peak
      // 1% random sample without replacement
      val sample_df = feature_vec_df.sample(false, 0.01)
      val distance_matrix = sample_df
        .withColumn("rowDist", getRowDist($"id", $"prob_dist_arr"))
      // col and numBins are assumed to be defined elsewhere
      val histo = distance_matrix
        .select(col).rdd.map(row => row.getDouble(0)).histogram(numBins)
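Finding the first peak of such a histogram can be sketched as below. The input mirrors the shape returned by Spark's RDD `histogram(bucketCount)`, that is, an array of bucket edges and an array of counts with one fewer element. The names `HistogramPeakSketch` and `firstPeak` are hypothetical, and taking the first local maximum of the raw counts is an assumption made for illustration; a noisy histogram may need smoothing first.

```scala
object HistogramPeakSketch {
  /** Midpoint of the first bucket whose count is a local maximum.
    * `edges` has one more element than `counts`. */
  def firstPeak(edges: Array[Double], counts: Array[Long]): Option[Double] = {
    require(edges.length == counts.length + 1, "edges must have counts.length + 1 entries")
    var i = 0
    while (i < counts.length) {
      // a missing neighbour at either end of the histogram counts as lower
      val higherThanLeft = i == 0 || counts(i) > counts(i - 1)
      val notLowerThanRight = i == counts.length - 1 || counts(i) >= counts(i + 1)
      if (counts(i) > 0 && higherThanLeft && notLowerThanRight) {
        return Some((edges(i) + edges(i + 1)) / 2.0)
      }
      i += 1
    }
    None
  }
}
```

The returned value can then serve as the gross cut-off: distances above it are dropped before the full matrix is materialised.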
  • 48. Distance Matrix creation: Optimization – 4 Scala performance issues • Scala's functional constructs are expressive • But they come with a heavy performance cost in tight loops • Don't use: {Iterators, for, foreach, map} • Use: primitive while loops • Performance improvement of multiple orders of magnitude for operations over the entire data
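The contrast between the two styles can be illustrated with a dot product. Both functions below return the same value; the functional version allocates an intermediate array of boxed tuples, while the `while` version works on primitives only. The names `LoopSketch`, `dotFunctional` and `dotWhile` are hypothetical, and no timing figures are implied by this sketch.

```scala
object LoopSketch {
  /** Functional style: zip allocates tuples, map allocates a second array. */
  def dotFunctional(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  /** Primitive while loop: no intermediate allocation, no boxing. */
  def dotWhile(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    while (i < a.length) {
      sum += a(i) * b(i)
      i += 1
    }
    sum
  }
}
```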
  • 49. Distributed DBSCAN using GraphX/GraphFrames
      // Create an initial graph based on raw data
      val dist_graph = GraphFrame(vertex_df, edge_eps_df)
      // Find the core points in the graph, i.e. those with at least numPoints neighbours
      val neighbour_df = dist_graph.outDegrees.filter($"outDegree" >= numPoints)
      val core_points_df = vertex_df.join(neighbour_df, Seq("id"))
      // Find the core edges: edges that join two core points, or one core point and one non-core point
      val core_edges_src_df = edge_eps_df.join(
        core_points_df.select("id").withColumnRenamed("id", "src"), Seq("src"))
        .select("src", "dst", dist_label)
      val core_edges_dst_df = edge_eps_df.join(
        core_points_df.select("id").withColumnRenamed("id", "dst"), Seq("dst"))
        .select("src", "dst", dist_label)
      val core_edges_df = core_edges_src_df.unionAll(core_edges_dst_df).dropDuplicates()
      // Create the core graph
      val core_graph = GraphFrame(core_points_df, core_edges_df)
      // Create the checkpoint directory used by the connected components algorithm
      spark.sparkContext.setCheckpointDir("/tmp/checkPointDir")
      // Obtain the clusters via connected components
      val connectedComp = core_graph.connectedComponents.run()
      val clusters_df = connectedComp.select("id", "component")
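The same steps (out-degree filter for core points, connected components for clusters) can be followed on a small in-memory example. The sketch below is standard DBSCAN over a precomputed epsilon-neighbour edge list, not a port of the GraphFrames code: it uses a simple union-find for the components and attaches border points to a neighbouring core point's cluster. The names `DbscanSketch` and `cluster` are hypothetical, and the edge list is assumed to be symmetric and free of self-loops.

```scala
object DbscanSketch {
  /** Clusters points given the edges (src, dst) whose distance is within epsilon.
    * Returns point id -> cluster label (smallest core id in the cluster).
    * Noise points are left out of the result. */
  def cluster(ids: Seq[Long], edges: Seq[(Long, Long)], numPoints: Int): Map[Long, Long] = {
    // Step 1: core points have at least numPoints epsilon-neighbours
    val degree: Map[Long, Int] = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    val core: Set[Long] = ids.filter(id => degree.getOrElse(id, 0) >= numPoints).toSet

    // Step 2: connected components over core-to-core edges (union-find)
    val parent = scala.collection.mutable.Map[Long, Long]()
    core.foreach(c => parent(c) = c)
    def find(x: Long): Long = {
      var r = x
      while (parent(r) != r) r = parent(r)
      r
    }
    for ((s, d) <- edges if core(s) && core(d)) {
      val rs = find(s)
      val rd = find(d)
      // the smaller root id becomes the representative
      if (rs != rd) parent(math.max(rs, rd)) = math.min(rs, rd)
    }
    val coreLabels: Map[Long, Long] = core.map(c => c -> find(c)).toMap

    // Step 3: border points inherit the label of a neighbouring core point
    val borderLabels: Map[Long, Long] = edges.collect {
      case (s, d) if core(s) && !core(d) => d -> coreLabels(s)
    }.toMap

    coreLabels ++ borderLabels
  }
}
```

A border point adjacent to core points of two different clusters is assigned to whichever appears last in the edge list, which matches the usual order-dependence of DBSCAN for border points.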
  • 50. Summary • Understanding the natural geometry of data helps in capturing the true information content • Analytical approach: Figure out the correct geometry theoretically and choose an appropriate distance metric • If data can be mapped to labels, then use it as a reference space for applying data geometry and clustering • Manifolds are an important tool to understand the global and local structure of data • For clustering, focus on capturing the local neighbourhood • For BigData processing, use efficient distance matrix computation techniques
  • 51. Topology Quiz: Can you differentiate between the 2 images? If the answer is “YES”: Then you are not a topologist 
  • 52. This is how a topologist views it (Homeomorphism)
  • 54. References • A Few Useful Things to Know about Machine Learning • Thinking Outside the Euclidean Box: Riemannian Geometry and Inter-Temporal Decision-Making • WikiBooks Topology: https://en.wikibooks.org/wiki/Topology • Notes on topology: https://logancollinsblog.com/2017/11/12/notes-on-topology/ • Torus geodesic: https://commons.wikimedia.org/wiki/File:Insect_on_a_torus_tracing_out_a_non-trivial_geodesic.gif • Coffee mug & torus homeomorphism: https://commons.wikimedia.org/wiki/File:Mug_and_Torus_morph.gif