Efficient Similarity Search on Big Data with an Office Laptop
Sergii Shelpuk
Head of Data Science, V.I.Tech
The Problem
You have a database of 30M patients with all their medical records. Each patient is described by
250K binary features.
You need a system for finding the N most similar patients to a given one.
Jesus Christ, it’s Big Data, get Hadoop!
Jesus Christ, it’s Big Data, get Hadoop!
Pre-compute all:
- 450+ trillion pairs
- Stored as key-values, more than 1 PB for values only
Pre-compute none:
- Compare 30 million pairs by 250K features
- 37+ Tflops
- One Intel i7 would compute it in 10 minutes (pure computing time)
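The pair count and storage estimate can be checked with a few lines of arithmetic. This is only a sketch: the 4 bytes per stored similarity value is an assumption, not a figure from the slides. The raw count of feature comparisons per query comes out lower than the 37+ Tflops figure, which presumably counts more than one operation per feature; that factor is not stated in the slides.

```python
# Back-of-the-envelope check of the two brute-force options.
N_PATIENTS = 30_000_000
N_FEATURES = 250_000
BYTES_PER_VALUE = 4  # assumption: one 32-bit similarity score per pair

# Pre-compute all: every unordered pair of distinct patients.
n_pairs = N_PATIENTS * (N_PATIENTS - 1) // 2
storage_pb = n_pairs * BYTES_PER_VALUE / 1e15  # decimal petabytes

# Pre-compute none: one query is compared against every patient,
# feature by feature.
feature_comparisons = N_PATIENTS * N_FEATURES

print(f"pairs: {n_pairs:.3e}")
print(f"storage for values only: {storage_pb:.2f} PB")
print(f"feature comparisons per query: {feature_comparisons:.2e}")
```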
Can we do better?
Two main ideas:
- we don’t need the meaning of each feature, we only care about
similarity of the patients;
- we don’t want to compare very different patients, we want to
compare only the most similar ones.
Step 1: Reduce dimensionality
Decrease dimensionality of the data while preserving similarities
Locality-sensitive hashing and minhashing
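A minimal minhashing sketch, assuming each patient is represented as the set of indices of the features that are set to 1. The slides do not say which hash functions or signature length were used; the salted BLAKE2b hash and the 256-value signature here are illustrative choices, as are the names minhash_signature and estimated_jaccard.

```python
import hashlib

def minhash_signature(active_features, num_hashes=256):
    """Return one minimum hash value per hash function.

    active_features: set of integer indices of the features equal to 1.
    """
    signature = []
    for i in range(num_hashes):
        # A different salt turns BLAKE2b into a different hash function.
        salt = i.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f.to_bytes(8, "little"),
                                digest_size=8, salt=salt).digest(),
                "little")
            for f in active_features))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two hypothetical patients sharing 200 of 400 active features
# (true Jaccard similarity 0.5).
patient_a = set(range(0, 300))
patient_b = set(range(100, 400))
sig_a = minhash_signature(patient_a)
sig_b = minhash_signature(patient_b)
print(estimated_jaccard(sig_a, sig_b))
```

The signature length is fixed no matter how many of the 250K features are set, which is what makes the later comparison steps cheap.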
K-Means clustering
K-Means clustering groups similar patients in one group
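As an illustration of the grouping idea, here is a plain Lloyd's-algorithm K-Means in pure Python. It is a sketch on tiny 2-D vectors standing in for the reduced patient representations; the distance measure, initialization, and iteration count are assumptions, not details from the slides.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Start from k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its closest centroid.
        assignment = [nearest(p, centroids) for p in points]
        # Move every centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignment

# Two well-separated groups of hypothetical patients.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, assignment = kmeans(points, k=2)
print(assignment)
```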
Step 2: Group similar
Group similar patients and store groups as separate files
Store centroids of each cluster in a separate file, too
cluster1.bin
clusterN.bin
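One possible layout for a cluster file is a small header followed by fixed-size records. The slides only name the files (cluster1.bin ... clusterN.bin); the record format below, with a 64-bit patient id followed by a fixed-length signature, and the helper names write_cluster and read_cluster are assumptions made for illustration.

```python
import struct
import tempfile
from pathlib import Path

def write_cluster(path, records, sig_len):
    """Write (patient_id, signature) records; every signature has sig_len values."""
    record = struct.Struct(f"<Q{sig_len}Q")
    with open(path, "wb") as f:
        # Header: number of records and signature length.
        f.write(struct.pack("<II", len(records), sig_len))
        for patient_id, signature in records:
            f.write(record.pack(patient_id, *signature))

def read_cluster(path):
    """Read back the list of (patient_id, signature) records."""
    with open(path, "rb") as f:
        count, sig_len = struct.unpack("<II", f.read(8))
        record = struct.Struct(f"<Q{sig_len}Q")
        out = []
        for _ in range(count):
            values = record.unpack(f.read(record.size))
            out.append((values[0], list(values[1:])))
        return out

# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "cluster1.bin"
    write_cluster(path, [(7, [1, 2, 3]), (9, [4, 5, 6])], sig_len=3)
    restored = read_cluster(path)
print(restored)
```

Because each cluster lives in its own file, a query only needs to read the one file it is routed to.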
Approach
To find N similar patients:
1. Load a patient
2. Reduce dimensionality with minhashing
3. Load centroid file
4. Compare patient to every centroid
5. Load cluster file of the closest centroid
6. Compare patient with patients in the cluster
7. Show top N similar
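Steps 4 to 7 can be sketched as a single function. This assumes the query has already been reduced to a vector (steps 1 and 2) and the centroids are already loaded (step 3). The function find_similar, the load_cluster callback, and the squared Euclidean distance are hypothetical stand-ins, since the slides do not specify the comparison measure or the file access code.

```python
import heapq

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def find_similar(query_vec, centroids, load_cluster, n):
    """Return the n records of the closest cluster that are nearest to the query.

    load_cluster(i) returns the (patient_id, vector) records of cluster i,
    for example by reading that cluster's file from disk.
    """
    # Step 4: compare the query to every centroid.
    closest = min(range(len(centroids)),
                  key=lambda i: squared_distance(query_vec, centroids[i]))
    # Step 5: load only the cluster of the closest centroid.
    members = load_cluster(closest)
    # Steps 6-7: rank the members and keep the top n.
    return heapq.nsmallest(
        n, members, key=lambda rec: squared_distance(query_vec, rec[1]))

# In-memory stand-in for the cluster files.
clusters = {
    0: [(101, [0, 0]), (102, [1, 0])],
    1: [(201, [10, 10]), (202, [10, 12]), (203, [11, 10])],
}
centroids = [[0.5, 0.0], [10.33, 10.67]]
result = find_similar([10, 11], centroids, clusters.__getitem__, n=2)
print([patient_id for patient_id, _ in result])
```

Only one cluster is ever scanned, so a close patient that landed in a neighboring cluster would be missed; the search is approximate by design.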
Results
50,000 clusters, up to ~1,000 patients per cluster
~500 KB-1 MB per cluster file
~18 MB centroid file
To do similarity search you need:
~20 GB HDD
~20 MB RAM
Search takes ~100 milliseconds on a regular
office laptop
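The reported sizes are roughly consistent with each other, as a quick calculation shows. This sketch assumes decimal units (1 MB = 10^6 bytes) and treats the rounded slide figures as exact, so the derived numbers are only indicative.

```python
N_PATIENTS = 30_000_000
N_CLUSTERS = 50_000
DISK_BYTES = 20e9          # ~20 GB on disk
CENTROID_BYTES = 18e6      # ~18 MB centroid file
MAX_CLUSTER_BYTES = 1e6    # largest cluster file, ~1 MB

avg_cluster_size = N_PATIENTS / N_CLUSTERS          # patients per cluster
bytes_per_patient = DISK_BYTES / N_PATIENTS         # reduced record size
bytes_per_centroid = CENTROID_BYTES / N_CLUSTERS    # centroid record size
avg_cluster_file = DISK_BYTES / N_CLUSTERS          # average file size

# A query keeps the centroid file plus one cluster file in memory.
ram_mb = (CENTROID_BYTES + MAX_CLUSTER_BYTES) / 1e6

print(f"average patients per cluster: {avg_cluster_size:.0f}")
print(f"bytes per stored patient: {bytes_per_patient:.0f}")
print(f"bytes per centroid: {bytes_per_centroid:.0f}")
print(f"average cluster file: {avg_cluster_file / 1e3:.0f} KB")
print(f"RAM for one query: {ram_mb:.0f} MB")
```

Each patient shrinks from 250K binary features to a record of a few hundred bytes, and the memory needed per query is the centroid file plus a single cluster file.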
Thank you
