Distributed k-nearest neighbors graph algorithms
Thibault Debatty, Ir PhD
2019-12-03
k-nn graph
Each node has an edge to its k most similar nodes
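The definition above can be made concrete with a brute-force construction. This is only a minimal sketch: the function names and the toy similarity (`shared_char_ratio`) are illustrative assumptions, not taken from the talk, and the quadratic cost is exactly what the rest of the deck tries to avoid.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def brute_force_knn_graph(
    items: Sequence[str],
    similarity: Callable[[str, str], float],
    k: int,
) -> Dict[int, List[Tuple[int, float]]]:
    """Return, for each item index, its k most similar other items."""
    graph: Dict[int, List[Tuple[int, float]]] = {}
    for i, a in enumerate(items):
        # Compare item i with every other item: n - 1 similarities per node.
        scored = [(j, similarity(a, b)) for j, b in enumerate(items) if j != i]
        # Highest similarity first; ties are broken by index for determinism.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        graph[i] = scored[:k]
    return graph


def shared_char_ratio(a: str, b: str) -> float:
    """Toy similarity (assumption): Jaccard index of the two character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


example_graph = brute_force_knn_graph(["abc", "abd", "xyz", "xyw"], shared_char_ratio, k=1)
```

With k = 1, each node in `example_graph` keeps a single edge to its most similar neighbour.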
Context
Common tasks in machine learning, data mining, artificial intelligence and Big Data:
● Similarity search
● Clustering
● Anomaly detection
Context: similarity search
“High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v8g6”
Is this message similar to a known SPAM?
Context: clustering
Kobe Bryant traded to Clippers
No.1 Ma1eEnhancement Supplement. Trusted by Millions. Buy Today! J9
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price 6xkp
Percocet 10/625 mg withoutPrescription 30 tabs - $225! [20100815-3] rjj
Nurses make Great Incomes
Order Now! HYDROCODONE BRAND Watson 540 10mg/mg, 60 Pills - $479, 90 Pills - $656, 120 Pills - $838 36fy
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 7z3
Play here for summer fun
Perfect Watches Clones Cheap from $150. Buy Rep1icaWatches: Swiss Rep1icaWatch xz
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 v69
Obtain details on your cred1t online. Get started today
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 071
Japanese food discount
Is your computer safe?
Luxury at a Discount!
Phentermin 37.5 mg as cheap as 120 pills $366.00 8eg5
Mutant fish sold at Connecticut market
High quality JBL speakers
Need The CheapestViagra? Here's the Right Place. OrderViagra For the Best Price xt
High Qua1ityMedications Discount On All Reorders = Best Deal Ever! Viagra50/100mg - $1.85 pl
=> Identify SPAM campaigns
Context: clustering
To analyze 300 rogue websites:
● Cluster them
● Analyze one representative of each group
Context: anomaly detection
Find an infected computer on a network
Context
● Similarity search
● Clustering
● Anomaly detection
… are all crucial for data processing!
Challenges
How hard can that be?
Challenges
Computer memory is similar to a book:
● It is accessible by address (page)
● You have to read a page before you know its content (e.g. the coordinates of a point)
Challenges
Naive similarity search requires reading all pages
Challenges
How many pages?
Bible TOB:
● 2000 pages
● Extra thin paper
● 12 cm thick
● 44 hours of reading
Challenges
Samsung Galaxy S9 (4 GB)
A stack of 63 m assuming 4 KB per page (Atomium = 102 m), 2.6 years of reading...
Challenges
Our server
● 1500 GB
● 200,000 books
● A stack of 24 km
● 1000 years of reading
Brussels – Louvain-la-Neuve = 26 km
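The figures on the last three slides can be reproduced with a few lines of arithmetic. The sketch below assumes 1 GB = 2^30 bytes and 24 × 365 hours per year; those two conventions are assumptions, chosen because they match the slide values, while the page size and the book dimensions come from the slides.

```python
PAGE_BYTES = 4 * 1024        # 4 KB per page, as stated on the slide
BOOK_PAGES = 2000            # Bible TOB
BOOK_THICKNESS_M = 0.12      # 12 cm
BOOK_READING_HOURS = 44
HOURS_PER_YEAR = 24 * 365    # assumption: reading non-stop


def memory_as_books(memory_bytes: int) -> dict:
    """Express an amount of memory as pages, books, stack height and reading time."""
    pages = memory_bytes / PAGE_BYTES
    books = pages / BOOK_PAGES
    return {
        "pages": pages,
        "books": books,
        "stack_m": books * BOOK_THICKNESS_M,
        "reading_years": books * BOOK_READING_HOURS / HOURS_PER_YEAR,
    }


phone = memory_as_books(4 * 2**30)       # Samsung Galaxy S9, 4 GB
server = memory_as_books(1500 * 2**30)   # the server, 1500 GB

print(f"phone : {phone['stack_m']:.0f} m, {phone['reading_years']:.1f} years")
print(f"server: {server['books']:.0f} books, {server['stack_m'] / 1000:.1f} km, "
      f"{server['reading_years']:.0f} years")
```

Under these assumptions the phone comes out at roughly 63 m and 2.6 years, and the server at just under 200,000 books, about 23.6 km and a little under 1000 years, which agrees with the rounded values on the slides.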
Challenges
Even with modern hardware, naive algorithms are not an option
Indexes
Divide space into “zones”
Example:
● North: pages 1, 2, 3 and 4
● South: pages 5, 6 and 7
Indexes
Similarity search with an index
The query is near zone “SOUTH”
=> read pages 5, 6 and 7
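A toy version of this zone index might look as follows. The coordinates, the rule in `zone_of` (sign of the y coordinate) and the function names are all hypothetical; only the North/South split and the page numbers come from the slides.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def zone_of(point: Point) -> str:
    """Hypothetical rule: the sign of the y coordinate decides the zone."""
    return "north" if point[1] >= 0 else "south"


def build_index(pages: Dict[int, Point]) -> Dict[str, List[int]]:
    """Map each zone to the pages whose point falls inside it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for page, point in pages.items():
        index[zone_of(point)].append(page)
    return dict(index)


def search(index: Dict[str, List[int]], pages: Dict[int, Point],
           query: Point) -> Tuple[Optional[int], List[int]]:
    """Read only the pages of the query's zone and return the nearest one."""
    pages_read = list(index.get(zone_of(query), []))
    nearest = min(pages_read, key=lambda p: math.dist(pages[p], query), default=None)
    return nearest, pages_read


# Made-up coordinates: pages 1-4 lie in the north, pages 5-7 in the south.
pages = {1: (0, 3), 2: (1, 2), 3: (-2, 1), 4: (3, 4), 5: (0, -1), 6: (2, -3), 7: (-1, -2)}
index = build_index(pages)
nearest, pages_read = search(index, pages, (1.5, -2.5))
```

Here the query falls in the south zone, so only pages 5, 6 and 7 are read. Note that this sketch ignores the case where the true nearest point lies just across the zone boundary.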
Indexes: limitations
Similarity search with an index may require reading multiple zones:
1d: 2 zones
2d: 4 zones
3d: 8 zones
8d: 256 zones
=> “curse of dimensionality”
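The zone counts above follow 2^d: if every axis is split once, a query close to all the splits touches one zone per combination of sides. A short sketch (the function name is an assumption) enumerates those combinations:

```python
from itertools import product
from typing import List, Tuple


def zones_around_corner(dimensions: int) -> List[Tuple[str, ...]]:
    """All zones touching a query that lies near the split of every axis."""
    # Each axis contributes two choices, hence 2 ** dimensions combinations.
    return list(product(("low", "high"), repeat=dimensions))


counts = {d: len(zones_around_corner(d)) for d in (1, 2, 3, 8)}
```

`counts` reproduces the 2, 4, 8 and 256 zones listed on the slide.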
Indexes: limitations
Great for low dimensional Euclidean datasets (time)
But what about
● Higher dimensions? TV commercials: 4125 dimensions
● Text?
k-nn graph
Can we use a k-nn graph to analyze large datasets?
k-nn graph
Existing algorithms:
● Clustering
● Similarity search (but slow)
Outline
● Build from large text datasets
● Fast similarity search
● Add and remove points
● Applications:
– Text clustering
– Detection of compromised computers
● … using distributed processing!
Build from large text datasets
String similarity
But first… how to measure similarity between strings?
Lots of literature:
● Levenshtein
● Damerau
● Jaro-Winkler
● N-Gram
● Q-Gram
● Cosine
● Jaccard index
● …
But no clean implementation!
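To illustrate two of the measures listed above, here is a minimal sketch of the Levenshtein edit distance and of the Jaccard index computed over character n-grams. These are textbook formulations written for this example, not the implementation the talk refers to.

```python
from typing import Set


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,            # delete ca
                current[j - 1] + 1,         # insert cb
                previous[j - 1] + cost,     # substitute (or match)
            ))
        previous = current
    return previous[-1]


def ngrams(s: str, n: int = 2) -> Set[str]:
    """Set of all substrings of length n."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def jaccard(a: str, b: str, n: int = 2) -> float:
    """Jaccard index of the two n-gram sets: |A ∩ B| / |A ∪ B|."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

For example, `levenshtein("kitten", "sitting")` is 3, and `jaccard("night", "nacht")` is 1/7 because the two strings share only the bigram "ht" out of seven distinct bigrams.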
Building from text datasets
● NN-Descent
Builds an approximate graph
Computes O(n^1.14) similarities
● BUT: iterative!
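The idea behind NN-Descent is that a neighbour of a neighbour is likely to be a neighbour too. The sketch below is a simplified, single-machine variant written for illustration (no sampling, no incremental flags); the function name and parameters are assumptions. It shows why the method is iterative: each pass depends on the graph produced by the previous one.

```python
import random
from typing import Callable, Dict, List, Sequence, TypeVar

T = TypeVar("T")


def nn_descent(items: Sequence[T], similarity: Callable[[T, T], float], k: int,
               max_iterations: int = 10, seed: int = 0) -> List[List[int]]:
    """Approximate k-nn graph: neighbour indices per node, most similar first."""
    rng = random.Random(seed)
    n = len(items)
    # Start from a random graph: neighbours[i] maps neighbour index -> similarity.
    neighbours: List[Dict[int, float]] = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        picks = rng.sample(others, min(k, len(others)))
        neighbours.append({j: similarity(items[i], items[j]) for j in picks})

    for _ in range(max_iterations):
        # "Friends" of a node are its neighbours and its reverse neighbours.
        friends = [set(nb) for nb in neighbours]
        for i, nb in enumerate(neighbours):
            for j in nb:
                friends[j].add(i)
        updates = 0
        for i in range(n):
            # Candidates are friends of friends not yet in the neighbour list.
            candidates = set()
            for j in friends[i]:
                candidates |= friends[j]
            candidates -= set(neighbours[i]) | {i}
            for c in candidates:
                s = similarity(items[i], items[c])
                worst = min(neighbours[i], key=neighbours[i].get)
                if s > neighbours[i][worst]:
                    del neighbours[i][worst]
                    neighbours[i][c] = s
                    updates += 1
        if updates == 0:   # converged: a full pass changed nothing
            break
    return [sorted(nb, key=lambda j: (-nb[j], j)) for nb in neighbours]
```

Because the result is approximate, some edges may differ from those of the exact graph, especially when the number of iterations is capped.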
Building from text datasets
NNCTPH
● Hash using a modified hashing function (CTPH / ssdeep / spamsum)
● Build subgraphs in parallel
● Merge subgraphs
Single iteration!
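The three steps above (hash, build subgraphs, merge) can be sketched as follows. This is not the NNCTPH implementation: the `bucket_keys` argument stands in for the modified CTPH hash, which is not reproduced here, and all function names are hypothetical. The point is only that buckets are independent, so subgraphs can be built in parallel and merged in a single pass.

```python
import heapq
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Sequence, Tuple


def build_subgraph(members: List[int], items: Sequence[str],
                   similarity: Callable[[str, str], float],
                   k: int) -> Dict[int, List[Tuple[float, int]]]:
    """Brute-force k-nn graph restricted to the items of one bucket."""
    sub = {}
    for i in members:
        scored = [(similarity(items[i], items[j]), j) for j in members if j != i]
        sub[i] = heapq.nlargest(k, scored)
    return sub


def merge_subgraphs(subgraphs: Iterable[Dict[int, List[Tuple[float, int]]]],
                    n: int, k: int) -> Dict[int, List[int]]:
    """Keep, for each node, the k best edges found in any subgraph."""
    merged: Dict[int, Dict[int, float]] = {i: {} for i in range(n)}
    for sub in subgraphs:
        for i, scored in sub.items():
            for s, j in scored:
                merged[i][j] = s
    return {i: [j for _, j in heapq.nlargest(k, ((s, j) for j, s in nb.items()))]
            for i, nb in merged.items()}


def bucketed_knn_graph(items: Sequence[str], similarity: Callable[[str, str], float],
                       bucket_keys: Callable[[str], Iterable[str]],
                       k: int) -> Dict[int, List[int]]:
    """Hash items into buckets, build one subgraph per bucket, then merge."""
    buckets: Dict[str, List[int]] = defaultdict(list)
    for idx, item in enumerate(items):
        for key in bucket_keys(item):   # stand-in for the CTPH-based hash
            buckets[key].append(idx)
    # Each bucket is independent: in a distributed setting, one task per bucket.
    subgraphs = [build_subgraph(m, items, similarity, k) for m in buckets.values()]
    return merge_subgraphs(subgraphs, len(items), k)
```

Items that never share a bucket are never compared, which is where both the speed-up and the approximation come from.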
Building from text datasets
● Experimental evaluation:
– Apache Hadoop MapReduce
– SPAM dataset
– Jaro-Winkler string similarity (not a metric)
Fast similarity search
Add and remove points
Online building
● Given a distributed graph:
– Add nodes
– Remove nodes
– Search the nearest neighbors of a query node
● Requires k-medoids partitioning of the graph
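One common way to search a k-nn graph is a greedy walk: start somewhere, move to the neighbour most similar to the query, and stop when no neighbour improves. The sketch below illustrates that general idea on a single machine; it is an assumption for illustration and not necessarily the search procedure used in the talk.

```python
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def greedy_graph_search(graph: Dict[int, List[int]], items: Sequence[T],
                        similarity: Callable[[T, T], float], query: T,
                        start: int, max_steps: int = 100) -> Tuple[int, float]:
    """Walk the graph towards the query; return the best node found and its similarity."""
    current = start
    best = similarity(query, items[current])
    for _ in range(max_steps):
        improved = False
        # Inspect the neighbours of the node we stood on at the start of this step.
        for j in graph[current]:
            s = similarity(query, items[j])
            if s > best:
                current, best, improved = j, s, True
        if not improved:    # local optimum: no neighbour is closer to the query
            break
    return current, best


# Toy example: six points on a line, each linked to its immediate neighbours.
points = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(points)] for i in range(len(points))}
found, score = greedy_graph_search(chain, points, lambda a, b: -abs(a - b), 4.2, start=0)
```

Only the nodes along the walk are compared with the query, instead of the whole dataset. The walk can stop in a local optimum, so in practice several starting points are often used.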
Partitioning
● k-medoids clustering
● CLARANS is slow to converge
● Two faster methods:
– Inspired by simulated annealing
– Heuristic
● Impact of partitioning on distributed search
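For readers unfamiliar with k-medoids, the sketch below shows the objective and a plain random-swap local search in the spirit of CLARANS. It is a baseline written for illustration only: it is neither CLARANS itself nor one of the two faster methods mentioned above, and every name and parameter is an assumption.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def assign(items: Sequence[T], medoids: List[int],
           similarity: Callable[[T, T], float]) -> List[int]:
    """For each item, the index of its most similar medoid (its partition)."""
    return [max(medoids, key=lambda m: similarity(x, items[m])) for x in items]


def total_similarity(items: Sequence[T], medoids: List[int],
                     similarity: Callable[[T, T], float]) -> float:
    """Objective to maximise: sum over items of similarity to the closest medoid."""
    return sum(max(similarity(x, items[m]) for m in medoids) for x in items)


def random_swap_search(items: Sequence[T], similarity: Callable[[T, T], float],
                       k: int, tries: int = 200, seed: int = 0) -> Tuple[List[int], float]:
    """Replace one medoid by a random item and keep the swap only if it helps."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(items)), k)
    best = total_similarity(items, medoids, similarity)
    for _ in range(tries):
        replacement = rng.randrange(len(items))
        if replacement in medoids:
            continue
        candidate = list(medoids)
        candidate[rng.randrange(k)] = replacement
        score = total_similarity(items, candidate, similarity)
        if score > best:
            medoids, best = candidate, score
    return medoids, best
```

Each evaluation of `total_similarity` costs n × k similarity computations, which gives a sense of why a slowly converging search is expensive on large datasets.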
Applications
Text clustering
● Text dataset with Jaro-Winkler similarity (not a metric)
● Steps:
– Build (approximate) k-nn graph
– Prune
– Compute connected components
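The last two steps can be sketched on a small in-memory graph. The pruning rule used here (drop edges below a similarity threshold) and the graph representation are assumptions made for the example; the talk does not specify them on this slide.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[int, List[Tuple[int, float]]]   # node -> [(neighbour, similarity)]


def prune(graph: Graph, threshold: float) -> Graph:
    """Drop every edge whose similarity is below the threshold."""
    return {i: [(j, s) for j, s in edges if s >= threshold] for i, edges in graph.items()}


def connected_components(graph: Graph) -> List[Set[int]]:
    """Clusters are the connected components, treating edges as undirected."""
    adjacency: Dict[int, Set[int]] = {i: set() for i in graph}
    for i, edges in graph.items():
        for j, _ in edges:
            adjacency[i].add(j)
            adjacency.setdefault(j, set()).add(i)
    seen: Set[int] = set()
    components: List[Set[int]] = []
    for start in adjacency:
        if start in seen:
            continue
        # Depth-first traversal from an unvisited node collects one component.
        stack, component = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


toy_graph: Graph = {
    0: [(1, 0.9)], 1: [(0, 0.9)],
    2: [(3, 0.8)], 3: [(2, 0.8)],
    4: [(0, 0.1)],               # weak edge, removed by pruning
}
clusters = connected_components(prune(toy_graph, threshold=0.5))
```

After pruning, node 4 loses its only edge and ends up in a cluster of its own, while the two strongly connected pairs form the other clusters.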
APT (Advanced Persistent Threat) Detection
● Advanced => no signatures
● Persistent => limited activity
● Threats
● Need a command and control (C2) channel
APT Detection
Here:
APT relying on HTTP
=> proxy logs
APT Detection
How hard can that be?
APT Detection
Displaying a page requires multiple HTTP requests
=> link each request to its parent using the logs from the proxy
APT Detection
The weight of an edge between two requests is higher if:
● Requests are close in time
● Requests belong to the same domain
● Same sequence repeats
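A weight with these three properties could be sketched as below. The formula, the `Request` fields and the default parameter values are purely illustrative assumptions; the slides do not give the actual weighting function.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    time: float     # seconds since the start of the log (hypothetical field)
    domain: str     # requested domain (hypothetical field)


def edge_weight(parent: Request, child: Request, repeat_count: int = 1,
                time_scale: float = 1.0, same_domain_bonus: float = 1.0) -> float:
    """Illustrative weight: grows with time proximity, shared domain and repetition."""
    delay = abs(child.time - parent.time)
    weight = 1.0 / (1.0 + delay / time_scale)    # close in time => larger
    if parent.domain == child.domain:
        weight += same_domain_bonus              # same domain => larger
    return weight * max(1, repeat_count)         # repeated sequence => larger


page = Request(time=0.0, domain="example.org")
image = Request(time=1.0, domain="example.org")
late = Request(time=10.0, domain="example.org")
other = Request(time=1.0, domain="cdn.example.net")
```

With these made-up requests, `edge_weight(page, image)` exceeds both `edge_weight(page, late)` and `edge_weight(page, other)`, and passing a larger `repeat_count` increases the weight further.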
APT Detection
After pruning the weighted graph, the APT remains isolated!
APT Detection
● Batch: build graphs
● Interactive (web interface):
– Merge
– Prune
– Cluster
– Filter
● Approximate k-nn graph (time and memory)
APT Detection
● Experimental evaluation
– Proxy logs of a real network
– Simulated APT traffic
– Rank suspicious domains
● Results
– High detection / false alarm ratio
– Without prior knowledge about the APT
APT Detection
● False positives:
– Content Delivery Networks (CDN)
– Advertising domains
– JavaScript library delivery
– Websites with very few visits
=> same behavior as an APT
Conclusion
The k-nn graph is an interesting tool for analyzing large datasets, but
● Only if approximation is acceptable
● Other possibilities exist
Perspectives...
● Broaden to other graph-like structures:
– (Hierarchical) Small World Network graphs
– Asymmetrical graphs
● Broaden to other applications (clustering, nn search)
● Predict the magnitude of approximation
Questions...
Cyber Defence Lab
www.cylab.be


An introduction to similarity search and k-nn graphs
