An introduction to similarity search and k-nn graphs

Distributed k-nearest neighbors graph algorithms
Thibault Debatty, Ir PhD
2019-12-03

k-nn graph
Each node has an edge to its k most similar nodes
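The definition above can be made concrete with a minimal brute-force sketch. The names `brute_force_knn_graph` and `closeness` are illustrative assumptions, not part of the talk; the sketch compares every pair of items, which is exactly the naive approach that does not scale.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def brute_force_knn_graph(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    k: int,
) -> Dict[int, List[Tuple[float, int]]]:
    """Map each item index to its k most similar items as (similarity, index)."""
    graph = {}
    for i, a in enumerate(items):
        scored = [(similarity(a, b), j) for j, b in enumerate(items) if j != i]
        # Highest similarity first; ties are broken by the lowest index.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        graph[i] = scored[:k]
    return graph


def closeness(a: float, b: float) -> float:
    """Hypothetical similarity for numbers: 1 when equal, tends to 0 when far apart."""
    return 1.0 / (1.0 + abs(a - b))


points = [0.0, 1.0, 1.5, 10.0]
graph = brute_force_knn_graph(points, closeness, k=2)
for node, edges in graph.items():
    print(node, [j for _, j in edges])
```

Note that the graph is directed: point 3 (value 10.0) links to points 2 and 1, but no point links back to it.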

Context
Common tasks of machine learning, data mining, Artificial Intelligence or Big Data:
● Similarity search
● Clustering
● Anomaly detection

Context: similarity search
“High Qua1ityMedications Discount
On All Reorders = Best Deal Ever!
Viagra50/100mg - $1.85 v8g6”
Similar to a known SPAM?

Context: clustering
Kobe Bryant traded to Clippers
No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp
Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj
Nurses make Great Incomes
Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3
Play here for summer fun
Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69
Obtain details on your cred1t online. Get started today
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071
Japanese food discount
Is your computer safe?
Luxury at a Discount!
Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5
Mutant fish sold at Connecticut market
High quality JBL speakers
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl
=> Identify SPAM campaigns

Context: clustering
To analyze 300 rogue websites:
● Cluster
● Analyze 1 representative of each group

Context: anomaly detection
Find an infected computer on a network

Context
● Similarity search
● Clustering and
● Anomaly detection
… are crucial for data processing!

Challenges
How hard can that be?

Challenges
Computer memory is similar to a book
● Accessible by address (page)
● You have to read before you know the content (e.g. coordinates of a point)

Challenges
Naive similarity search requires reading all pages

Challenges
How many pages?
Bible TOB:
● 2000 pages
● Extra thin paper
● 12cm
● 44 hours of reading

Challenges
Samsung Galaxy S9 (4GB), assuming 4KB/page:
● A stack of 63m (the Atomium is 102m)
● 2.6 years of reading

Challenges
Our server
● 1500GB
● 200,000 books
● A stack of 24km
● 1000 years of reading
Brussels – Louvain-la-Neuve = 26km
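The figures on these slides can be reproduced with a short calculation. This is a sketch under stated assumptions: one memory page is 4KB and corresponds to one book page, 1GB is taken as 2^30 bytes, reading is continuous, and a year has 365 days. The function name `as_books` is made up for illustration.

```python
PAGE_BYTES = 4 * 1024        # assumed: 4KB of memory per book page
BOOK_PAGES = 2000            # Bible TOB
BOOK_THICKNESS_M = 0.12      # 12cm
BOOK_READING_HOURS = 44


def as_books(memory_bytes: int) -> dict:
    """Express an amount of memory as pages, books, stack height and reading time."""
    pages = memory_bytes / PAGE_BYTES
    books = pages / BOOK_PAGES
    return {
        "pages": pages,
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "years": books * BOOK_READING_HOURS / (24 * 365),
    }


phone = as_books(4 * 1024**3)       # Samsung Galaxy S9, 4GB
server = as_books(1500 * 1024**3)   # server with 1500GB

print(f"phone:  {phone['stack_m']:.0f} m, {phone['years']:.1f} years")
print(f"server: {server['books']:.0f} books, "
      f"{server['stack_m'] / 1000:.1f} km, {server['years']:.0f} years")
```

With these assumptions the server comes out slightly under the rounded values on the slide (about 197,000 books, 23.6km and 990 years).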

Challenges
Even with modern hardware, naive
algorithms are not an option

Indexes
Divide space into “zones”
Example:
● North: pages 1, 2, 3 and 4
● South: pages 5, 6 and 7

Indexes
Similarity search with index
“query” is near zone “SOUTH”
=> read pages 5, 6 and 7
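A toy version of this lookup, assuming 2D points and a single horizontal boundary between the two zones. The boundary position and the names `ZONE_PAGES`, `zone_of` and `pages_to_read` are hypothetical; only the page numbers come from the slide.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# Page layout from the example: which pages hold the points of each zone.
ZONE_PAGES: Dict[str, List[int]] = {
    "NORTH": [1, 2, 3, 4],
    "SOUTH": [5, 6, 7],
}


def zone_of(point: Point, boundary_y: float = 0.0) -> str:
    """Assumed split: points on or above the boundary are NORTH, the rest SOUTH."""
    return "NORTH" if point[1] >= boundary_y else "SOUTH"


def pages_to_read(query: Point) -> List[int]:
    """Only the pages of the query's zone are read, not the whole book."""
    return ZONE_PAGES[zone_of(query)]


print(pages_to_read((3.0, -2.0)))
```

The sketch ignores queries that fall close to the boundary, where the true nearest neighbor may sit in the other zone.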

Indexes: limitations
Similarity search with index
Requires reading multiple zones:
1d: 2 zones
2d: 4 zones
3d: 8 zones
8d: 256 zones
“curse of dimensionality”
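The zone counts follow a simple pattern: if each axis is split in two and the query sits near the splitting point, every combination of "this side / other side" per axis must be read, which gives 2^d zones. A small sketch (the function name `zones_to_read` is an assumption) enumerates those combinations:

```python
from itertools import product


def zones_to_read(dimensions: int) -> int:
    """Count the zones around a query close to the splitting point on every axis."""
    # Each zone is one choice of side (0 or 1) per axis.
    return sum(1 for _ in product((0, 1), repeat=dimensions))


for d in (1, 2, 3, 8):
    print(f"{d}d: {zones_to_read(d)} zones")
```

The count doubles with every added dimension, so enumerating zones this way is already impractical for a few dozen dimensions.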

Indexes: limitations
Great for low-dimensional Euclidean datasets (time)
But what about
● Higher dimensions? TV commercials: 4125 dimensions
● Text?

k-nn graph
Can we use a k-nn graph for analyzing large datasets?

k-nn graph
Existing algorithms:
● Clustering
● Similarity search (but slow)

Outline
● Build from large text datasets
● Fast similarity search
● Add and remove points
● Applications:
– Text clustering
– Detection of compromised computers
● … using distributed processing!

Build from large text datasets

String similarity
But first… how to measure similarity between strings?
Lots of literature:
● Levenshtein
● Damerau
● Jaro-Winkler
● N-Gram
● Q-Gram
● Cosine
● Jaccard index
● …
But no clean implementation!
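Two of the measures listed above, sketched in plain Python to show how different they are in nature: Levenshtein is an edit distance (lower means more similar), while the Jaccard index compares sets, here sets of character n-grams (higher means more similar). These are textbook formulations, not the implementation referred to in the talk.

```python
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def ngrams(text: str, n: int = 2) -> Set[str]:
    """Set of all substrings of length n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Size of the n-gram intersection divided by the size of the union."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


print(levenshtein("kitten", "sitting"))
print(jaccard("night", "nacht"))
```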

Building from text datasets
● NN-Descent
– Build an approximate graph
– Compute O(n^1.14) similarities
● BUT: iterative!
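To show why NN-Descent is iterative, here is a simplified single-machine sketch of its core idea: start from random neighbors, then repeatedly compare the neighbors (and reverse neighbors) of each node with one another, keeping any improvement. It assumes a symmetric similarity and 1 <= k < n, and it leaves out the sampling and new/old bookkeeping of the published algorithm, so it does not reproduce the O(n^1.14) cost.

```python
import random
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def nn_descent(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    k: int,
    max_iterations: int = 10,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Approximate k-nn graph: node index -> neighbor indices, most similar first."""
    rng = random.Random(seed)
    n = len(items)
    # Start from k random neighbors per node: neighbors[i] maps j -> similarity.
    neighbors = []
    for i in range(n):
        sample = rng.sample([j for j in range(n) if j != i], k)
        neighbors.append({j: similarity(items[i], items[j]) for j in sample})

    def offer(i: int, j: int, s: float) -> int:
        """Replace the worst neighbor of i by j if j is more similar; 1 if updated."""
        edges = neighbors[i]
        if j in edges:
            return 0
        worst = min(edges, key=edges.get)
        if s <= edges[worst]:
            return 0
        del edges[worst]
        edges[j] = s
        return 1

    for _ in range(max_iterations):
        # Candidates around each node: its neighbors and its reverse neighbors.
        around = [set(edges) for edges in neighbors]
        for i, edges in enumerate(neighbors):
            for j in edges:
                around[j].add(i)
        updates = 0
        for candidates in around:
            ordered = sorted(candidates)
            for x, a in enumerate(ordered):
                for b in ordered[x + 1:]:
                    s = similarity(items[a], items[b])
                    updates += offer(a, b, s) + offer(b, a, s)
        if updates == 0:  # no edge changed: the graph has converged
            break
    return {i: sorted(e, key=lambda j: (-e[j], j)) for i, e in enumerate(neighbors)}


values = [float(v) for v in range(30)]
approx = nn_descent(values, lambda a, b: -abs(a - b), k=3)
print(approx[10])
```

Each pass depends on the graph produced by the previous one, which is what makes the algorithm awkward to run as a single distributed job.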

NNCTPH
● Hash using a modified hashing function (CTPH / ssdeep / spamsum)
● Build subgraphs in parallel
● Merge subgraphs
Single iteration!
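The three steps (hash, build subgraphs, merge) can be sketched on a single machine as follows. This is not the NNCTPH implementation: the modified CTPH hash is replaced by a deliberately crude stand-in that puts strings sharing a word in the same bucket, and the names `bucketed_knn_graph`, `top_k` and `word_overlap` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Sequence


def top_k(edges: Dict[int, float], k: int) -> Dict[int, float]:
    """Keep the k most similar edges, most similar first."""
    best = sorted(edges.items(), key=lambda e: (-e[1], e[0]))[:k]
    return dict(best)


def bucketed_knn_graph(
    items: Sequence[str],
    similarity: Callable[[str, str], float],
    bucket_keys: Callable[[str], Iterable[Hashable]],
    k: int,
) -> Dict[int, List[int]]:
    # Step 1: hash every item into one or more buckets.
    buckets = defaultdict(list)
    for index, item in enumerate(items):
        for key in bucket_keys(item):
            buckets[key].append(index)

    # Step 2: build one subgraph per bucket (independent, so parallelizable).
    subgraphs = []
    for members in buckets.values():
        subgraphs.append({
            i: top_k({j: similarity(items[i], items[j])
                      for j in members if j != i}, k)
            for i in members
        })

    # Step 3: merge the subgraphs, keeping the best k edges per node.
    merged: Dict[int, Dict[int, float]] = {i: {} for i in range(len(items))}
    for subgraph in subgraphs:
        for i, edges in subgraph.items():
            merged[i].update(edges)
    return {i: list(top_k(edges, k)) for i, edges in merged.items()}


def word_overlap(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard index on words."""
    words_a, words_b = set(a.split()), set(b.split())
    return len(words_a & words_b) / len(words_a | words_b)


subjects = ["cheap viagra", "viagra discount", "kobe bryant", "bryant traded"]
result = bucketed_knn_graph(subjects, word_overlap, lambda s: set(s.split()), k=2)
print(result)
```

Items that never share a bucket are never compared, which is where both the speed-up and the approximation come from.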

● Experimental evaluation:
– Apache Hadoop MapReduce
– SPAM dataset
– Jaro-Winkler string similarity (not a metric)

Fast similarity search
Add and remove points

Online building
● Given a distributed graph:
– Add nodes
– Remove nodes
– Search nearest neighbors of a query node
● Requires k-medoids partitioning of the graph
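For the search operation, a common way to use a k-nn graph is a greedy walk: start somewhere, move to the neighbor most similar to the query, and stop when no neighbor improves. The sketch below is a generic single-machine version under that assumption; it is not the distributed algorithm of the talk, and `greedy_search` is a hypothetical name.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def greedy_search(
    graph: Dict[int, List[int]],
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    query: T,
    start: int,
) -> Tuple[int, float]:
    """Walk the graph towards the query; return (node index, similarity)."""
    current = start
    best = similarity(query, items[current])
    while True:
        candidates = [(similarity(query, items[j]), j) for j in graph[current]]
        if not candidates:
            break
        score, node = max(candidates, key=lambda c: (c[0], -c[1]))
        if score <= best:  # local optimum: no neighbor is closer to the query
            break
        current, best = node, score
    return current, best


# Toy graph: six points on a line, each linked to its direct neighbors.
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
found, score = greedy_search(chain, values, lambda a, b: -abs(a - b), 4.2, start=0)
print(found, score)
```

Only the nodes along the walk are read, not the whole dataset. The walk can stop in a local optimum, so the answer is approximate and depends on the starting node.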

Partitioning
● k-medoids clustering
● CLARANS is slow to converge
● Two faster methods:
– Inspired by Simulated Annealing
– Heuristic
● Impact of partitioning when we perform distributed search
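As a reminder of what k-medoids computes, here is a basic alternating sketch: assign every item to its most similar medoid, then pick as new medoid the member most similar to the rest of its cluster. This is neither CLARANS nor one of the two faster methods mentioned above; it assumes distinct items and only needs a similarity function, not coordinates.

```python
import random
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def k_medoids(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    k: int,
    max_iterations: int = 20,
    seed: int = 0,
) -> Dict[int, List[int]]:
    """Return medoid index -> indices of the items assigned to that medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(items)), k)

    def assign(current: List[int]) -> Dict[int, List[int]]:
        clusters: Dict[int, List[int]] = {m: [] for m in current}
        for i in range(len(items)):
            closest = max(current, key=lambda m: similarity(items[i], items[m]))
            clusters[closest].append(i)
        return clusters

    for _ in range(max_iterations):
        clusters = assign(medoids)
        updated = []
        for medoid, members in clusters.items():
            if not members:  # keep the medoid of an empty cluster unchanged
                updated.append(medoid)
                continue
            # New medoid: the member with the highest total similarity to the others.
            updated.append(max(
                members,
                key=lambda c: sum(similarity(items[c], items[o]) for o in members),
            ))
        if set(updated) == set(medoids):
            break
        medoids = updated
    return assign(medoids)


values = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
partitions = k_medoids(values, lambda a, b: -abs(a - b), k=2)
print(partitions)
```

Because a medoid is always an actual item, the method works with similarities such as Jaro-Winkler where no "average" point can be computed.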

Applications

Text clustering
● Text dataset with Jaro-Winkler similarity (not a metric)
● Steps:
– Build (approximate) k-nn graph
– Prune
– Compute connected components
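The last two steps can be sketched on a small hand-made graph. The edge weights and the pruning threshold of 0.5 are invented for the example; the graph maps each node to a list of (similarity, neighbor) pairs, and edges are treated as undirected when computing components.

```python
from typing import Dict, List, Tuple

Graph = Dict[int, List[Tuple[float, int]]]


def prune(graph: Graph, threshold: float) -> Graph:
    """Drop every edge whose similarity is below the threshold."""
    return {
        node: [(s, j) for s, j in edges if s >= threshold]
        for node, edges in graph.items()
    }


def connected_components(graph: Graph) -> List[List[int]]:
    """Group nodes that remain linked, ignoring edge direction."""
    undirected = {node: set() for node in graph}
    for node, edges in graph.items():
        for _, j in edges:
            undirected[node].add(j)
            undirected[j].add(node)
    seen, components = set(), []
    for start in sorted(undirected):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first traversal from the start node
            node = stack.pop()
            component.append(node)
            for neighbor in undirected[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


# Hypothetical 2-nn graph over five messages.
knn = {
    0: [(0.95, 1), (0.10, 2)],
    1: [(0.95, 0), (0.15, 3)],
    2: [(0.90, 3), (0.10, 0)],
    3: [(0.90, 2), (0.15, 1)],
    4: [(0.05, 0), (0.04, 2)],
}
clusters = connected_components(prune(knn, threshold=0.5))
print(clusters)
```

Without pruning, the weak edges would join everything into one component; the threshold controls how aggressively clusters are separated.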

APT Detection
● Advanced => no signatures
● Persistent => limited activity
● Threats
● Need a command and control (C2) channel

APT Detection
Here:
APT relying on HTTP
=> proxy logs

APT Detection
How hard can that be?

APT Detection
Displaying a page requires multiple
HTTP requests
=> link each request to its parent
using the logs from the proxy

APT Detection
Edge weight is higher if:
● Requests are close in time
● Requests belong to the same domain
● Same sequence repeats
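The slides do not give the weighting formula, so the following is only one possible shape that respects the three rules above. The formula, the `Request` fields, the `same_domain_bonus` parameter and the way repetitions are counted are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    timestamp: float  # seconds
    domain: str


def edge_weight(
    parent: Request,
    child: Request,
    repeats: int = 1,
    same_domain_bonus: float = 1.0,
) -> float:
    """Hypothetical weight of the edge linking a request to a candidate parent."""
    delay = abs(child.timestamp - parent.timestamp)
    time_score = 1.0 / (1.0 + delay)  # closer in time -> higher
    domain_score = same_domain_bonus if parent.domain == child.domain else 0.0
    # Assumed: the score grows linearly with how often the sequence was seen.
    return repeats * (time_score + domain_score)


page = Request(0.0, "example.com")
image = Request(0.1, "example.com")
late = Request(30.0, "example.com")
tracker = Request(0.1, "ads.example.net")

print(edge_weight(page, image))
print(edge_weight(page, late))
print(edge_weight(page, tracker))
print(edge_weight(page, image, repeats=3))
```

Under such a scheme, a request with no plausible parent, for example a periodic beacon to an otherwise unvisited domain, only gets low-weight edges.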

APT Detection
After pruning the weighted graph,
the APT remains isolated!

APT Detection
● Batch: build graphs
● Interactive (web interface):
– Merge
– Prune
– Cluster
– Filter
● Approximate k-nn graph (time and memory)

APT Detection
● Experimental evaluation
– Proxy logs of real network
– Simulated APT traffic
– Rank suspicious domains
● Results
– High detection / false alarm ratio
– Without prior knowledge about APT

APT Detection
● False positives:
– Content Delivery Networks (CDN)
– Advertising domains
– JavaScript library delivery
– Websites with very few visits
=> same behavior as APT

Conclusion
A k-nn graph is an interesting tool to analyze large datasets, but
● Only if approximation is acceptable
● Other possibilities exist

Perspectives...
● Broaden to other graph-like structures:
– (Hierarchical) Small World Network graphs
– Asymmetrical graphs
● Broaden to other applications (clustering, nn search)
● Predict the magnitude of approximation

Questions...
Cyber Defence Lab
www.cylab.be
