Approximate nearest neighbors & vector models
I’m Erik
• @fulhack
• Author of Annoy, Luigi
• Currently CTO of Better
• Previously 5 years at Spotify
What are nearest neighbors?
• Let’s say you have a bunch of points
Grab a bunch of points
5 nearest neighbors
20 nearest neighbors
100 nearest neighbors
…But what’s the point?
• vector models are everywhere
• lots of applications (language processing,
recommender systems, computer vision)
MNIST example
• 28x28 = 784-dimensional dataset
• Define distance in terms of pixels:
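The formula from this slide did not survive extraction. A minimal sketch, assuming the distance meant here is the squared Euclidean distance between flattened pixel arrays; the name `pixel_distance` and the toy 2x2 images are illustrative and not from the talk:

```python
def pixel_distance(a, b):
    """Squared Euclidean distance between two images given as flat pixel lists."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy 2x2 "images" (an MNIST digit would have 784 values instead of 4).
img_a = [0, 255, 255, 0]
img_b = [0, 250, 255, 10]
img_c = [255, 0, 0, 255]

print(pixel_distance(img_a, img_b))  # small: the images are nearly identical
print(pixel_distance(img_a, img_c))  # large: the images are inverted
```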
MNIST neighbors
…Much better approach
1. Start with high dimensional data
2. Run dimensionality reduction to 10-1000 dims
3. Do stuff in the low-dimensional space
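The three steps can be sketched end to end. The talk uses learned models for step 2; this sketch swaps in a Gaussian random projection purely because it fits in a few lines of standard-library Python. The sizes (784 in, 64 out) and all function names are assumptions made for illustration:

```python
import math
import random


def random_projection_matrix(in_dims, out_dims, seed=0):
    """Gaussian matrix scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dims)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dims)]
            for _ in range(out_dims)]


def project(matrix, vector):
    """Map a high-dimensional vector into the low-dimensional space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Step 1: high-dimensional data (784 dims, like flattened MNIST images).
rng = random.Random(1)
points = [[rng.random() for _ in range(784)] for _ in range(3)]

# Step 2: reduce to 64 dims (random projection stands in for a learned model).
matrix = random_projection_matrix(784, 64)
reduced = [project(matrix, p) for p in points]

# Step 3: work in the small space; distances are roughly preserved.
d_high = math.dist(points[0], points[1])
d_low = math.dist(reduced[0], reduced[1])
print(d_high, d_low)
```

A learned reduction (the network bottleneck or a factorization model, as in the rest of the deck) usually keeps far more of the useful structure than a random projection does.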
Deep learning for food
• Deep model trained on a GPU on 6M random pics
downloaded from Yelp
[Figure: network architecture. Feature-map sizes: 156x156x32, 154x154x32, 152x152x32, 76x76x64, 74x74x64, 72x72x64, 36x36x128, 34x34x128, 32x32x128, 16x16x256, 14x14x256, 12x12x256, 6x6x512, 4x4x512, 2x2x512, followed by layers of width 2048, 2048, 128 and 1244. Labels: 3x3 convolutions, 2x2 maxpool, fully connected with dropout, bottleneck layer.]
Distance in smaller space
1. Run image through the network
2. Use the 128-dimensional bottleneck layer as an
item vector
3. Use cosine distance in the reduced space
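Step 3 in code: a minimal sketch of cosine distance, assuming the item vectors are plain lists of floats. The 4-dimensional vectors below are made up and stand in for the 128-dimensional bottleneck activations:

```python
import math


def cosine_distance(a, b):
    """1 - cos(a, b): 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical item vectors for three pictures.
pizza_1 = [0.9, 0.1, 0.0, 0.2]
pizza_2 = [0.8, 0.2, 0.1, 0.2]
sushi = [0.0, 0.1, 0.9, 0.3]

print(cosine_distance(pizza_1, pizza_2))  # close to 0
print(cosine_distance(pizza_1, sushi))    # much larger
```

Because the vector lengths cancel out, only the direction of an item vector matters for this distance.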
Nearest food pics
Vector methods for text
• TF-IDF (old) – no dimensionality reduction
• Latent Semantic Analysis (1988)
• Probabilistic Latent Semantic Analysis (2000)
• Semantic Hashing (2007)
• word2vec (2013), RNN, LSTM, …
Represent documents and/or
words as f-dimensional vector
[Figure: words plotted against two axes, latent factor 1 and latent factor 2; points labeled banana, apple, boat.]
Vector methods for
collaborative filtering
• Supervised methods: See everything from the
Netflix Prize
• Unsupervised: Use NLP methods
CF vectors – examples
IPMF item item:
$$P(i \to j) = \frac{\exp(b_j^T b_i)}{Z_i} = \frac{\exp(b_j^T b_i)}{\sum_k \exp(b_k^T b_i)}$$
VECTORS:
$$p_{ui} = a_u^T b_i$$
$$\mathrm{sim}_{ij} = \cos(b_i, b_j) = \frac{b_i^T b_j}{|b_i| |b_j|}$$
$O(f)$
i                        j                        sim_{i,j}
2pac                     2pac                     1.0
2pac                     Notorious B.I.G.         0.91
2pac                     Dr. Dre                  0.87
2pac                     Florence + the Machine   0.26
Florence + the Machine   Lana Del Rey             0.81
IPMF item item MDS:
$$P(i \to j) = \frac{\exp(-|b_j - b_i|^2)}{Z_i} = \frac{\exp(-|b_j - b_i|^2)}{\sum_k \exp(-|b_k - b_i|^2)}$$
$$\mathrm{sim}_{ij} = -|b_j - b_i|^2$$
(u, i, count)
$\partial L$
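The two item-item models differ only in how a pair of item vectors is scored before normalizing. A minimal sketch that computes P(i → j) for both scores, assuming the sum over k runs over all items including i (the slide does not say) and using made-up 2-dimensional vectors:

```python
import math


def softmax(scores):
    """Normalize scores into probabilities; subtracting the max avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner-product score b_j^T b_i."""
    return sum(x * y for x, y in zip(a, b))


def neg_sq_dist(a, b):
    """Distance-based (MDS-style) score -|b_j - b_i|^2."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def transition_probs(i, item_vectors, score):
    """P(i -> j) for every j, i.e. exp(score(b_j, b_i)) / Z_i."""
    b_i = item_vectors[i]
    return softmax([score(b_j, b_i) for b_j in item_vectors])


# Made-up item vectors, for illustration only.
b = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

print(transition_probs(0, b, dot))          # inner-product variant
print(transition_probs(0, b, neg_sq_dist))  # distance-based variant
```

In both variants item 1, which sits next to item 0, gets more probability mass than the distant item 2.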
Geospatial indexing
• Ping the world: https://github.com/erikbern/ping
• k-NN regression using Annoy
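k-NN regression predicts a value at a location by averaging the values observed at its nearest neighbors. A minimal sketch, in which an exhaustive lookup stands in for the Annoy query and coordinates are treated as points on a flat plane; the data and the name `knn_regress` are made up for illustration:

```python
import heapq
import math


def knn_regress(query, points, values, k=3):
    """Predict a value at `query` as the mean over its k nearest points."""
    nearest = heapq.nsmallest(
        k, range(len(points)), key=lambda i: math.dist(query, points[i]))
    return sum(values[i] for i in nearest) / len(nearest)


# Hypothetical measurements, e.g. a ping time observed at each (x, y) location.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
values = [10.0, 12.0, 14.0, 100.0, 110.0]

print(knn_regress((0.2, 0.2), points, values, k=3))   # 12.0
print(knn_regress((10.5, 10), points, values, k=2))   # 105.0
```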
Nearest neighbors the
brute force way
• we can always do an exhaustive search to find the
nearest neighbors
• imagine MySQL doing a linear scan for every
query…
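For reference, the exhaustive search looks like this. It is a sketch on random data, with Euclidean distance chosen as an assumption; every query touches all N vectors, so the cost grows as O(N * f):

```python
import heapq
import math
import random


def brute_force_nn(query, vectors, n=5):
    """Indices of the n vectors closest to `query`, found by a full scan."""
    return heapq.nsmallest(
        n, range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(2000)]

neighbors = brute_force_nn(vectors[0], vectors, n=5)
print(neighbors)  # index 0 comes first, since its distance to itself is 0
```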
Using word2vec’s brute
force search
$ time echo -e "chinese river\nEXIT\n" | ./distance GoogleNews-vectors-negative300.bin
Qiantang_River		 0.597229	
Yangtse		 0.587990	
Yangtze_River		 0.576738	
lake		 0.567611	
rivers		 0.567264	
creek		 0.567135	
Mekong_river		 0.550916	
Xiangjiang_River		 0.550451	
Beas_river		 0.549198	
Minjiang_River		 0.548721	
real	2m34.346s	
user	1m36.235s	
sys	0m16.362s
Introducing Annoy
• https://github.com/spotify/annoy
• mmap-based ANN library
• Written in C++, with Python and R bindings
• 585 stars on GitHub
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 100000
Yangtse	 0.907756	
Yangtze_River	 0.920067	
rivers	 0.930308	
creek	 0.930447	
Mekong_river	 0.947718	
Huangpu_River	 0.951850	
Ganges	 0.959261	
Thu_Bon	 0.960545	
Yangtze	 0.966199	
Yangtze_river	 0.978978	
real	0m0.470s	
user	0m0.285s	
sys	0m0.162s
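The scores here are on a different scale from the brute-force run: word2vec's tool prints cosine similarity (higher is closer), while these numbers are distances (lower is closer). They are consistent with Annoy's angular metric, which is the Euclidean distance between normalized vectors, sqrt(2(1 - cos)). A quick check using two words that appear in both outputs:

```python
import math


def angular_from_cosine(cos_sim):
    """Euclidean distance between two unit vectors with the given cosine."""
    return math.sqrt(2.0 * (1.0 - cos_sim))


# (word, cosine similarity from the brute-force run, distance from the Annoy run)
pairs = [("Yangtse", 0.587990, 0.907756),
         ("Yangtze_River", 0.576738, 0.920067)]

for word, cos_sim, annoy_dist in pairs:
    print(word, angular_from_cosine(cos_sim), annoy_dist)
```

The converted values agree with the Annoy output up to rounding of the printed similarities.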
Using Annoy’s search
$ time echo -e "chinese river\nEXIT\n" | python nearest_neighbors.py ~/tmp/word2vec/GoogleNews-vectors-negative300.bin 1000000
Qiantang_River	 0.897519	
Yangtse	 0.907756	
Yangtze_River	 0.920067	
lake	 0.929934	
rivers	 0.930308	
creek	 0.930447	
Mekong_river	 0.947718	
Xiangjiang_River	 0.948208	
Beas_river	 0.949528	
Minjiang_River	 0.950031	
real	0m2.013s	
user	0m1.386s	
sys	0m0.614s
(performance)
1. Building an Annoy
index
Start with the point set
Split it in two halves
Split again
Again…
…more iterations later
Side note: making trees
small
• Split until K items in each leaf (K~100)
• Takes O(n/K) memory instead of O(n)
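The build step can be sketched as a short recursion. The slides only say "split it in two halves"; the rule used here, a hyperplane equidistant from two randomly sampled points, is an assumption, and the dict-based node layout is purely illustrative:

```python
import random


def build_tree(ids, vectors, rng, leaf_size=100):
    """Recursively split `ids` with random hyperplanes until leaves are small."""
    if len(ids) <= leaf_size:
        return {"leaf": ids}
    # Assumption: the hyperplane is equidistant from two sampled points.
    p, q = (vectors[i] for i in rng.sample(ids, 2))
    normal = [a - b for a, b in zip(p, q)]
    offset = -sum(n * (a + b) / 2.0 for n, a, b in zip(normal, p, q))
    left, right = [], []
    for i in ids:
        margin = sum(n * x for n, x in zip(normal, vectors[i])) + offset
        (left if margin < 0 else right).append(i)
    if not left or not right:
        # Degenerate split (e.g. duplicate points): stop splitting here.
        return {"leaf": ids}
    return {"normal": normal, "offset": offset,
            "left": build_tree(left, vectors, rng, leaf_size),
            "right": build_tree(right, vectors, rng, leaf_size)}


def leaves(node):
    """Collect the leaf id lists of a tree, left to right."""
    if "leaf" in node:
        return [node["leaf"]]
    return leaves(node["left"]) + leaves(node["right"])


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]
tree = build_tree(list(range(1000)), vectors, rng, leaf_size=100)
print([len(leaf) for leaf in leaves(tree)])
```

Stopping at K items per leaf means the tree needs on the order of n/K split nodes rather than one per point.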
Binary tree
2. Searching
Nearest neighbors
Searching the tree
Approximate nearest neighbor methods and vector models – NYC ML meetup
Problemo
• The point that’s the closest isn’t necessarily in the
same leaf of the binary tree
• Two points that are really close may end up on
different sides of a split
• Solution: go to both sides of a split if it’s close
Trick 1: Priority queue
• Traverse the tree using a priority queue
• sort by min(margin) for the path from the root
Trick 2: many trees
• Construct trees randomly many times
• Use the same priority queue to search all of them at
the same time
heap + forest = best
• Since we use a priority queue, we will dive down
the best splits with the biggest distance
• More trees always help!
• The only constraint is that more trees require more RAM
Annoy query structure
1. Use priority queue to search all trees until we’ve
found k items
2. Take union and remove duplicates (a lot)
3. Compute distance for remaining items
4. Return the nearest n items
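The four steps can be sketched in one function. This is an illustration of the idea, not Annoy's actual code: the tree layout, the two-point split rule, and the name `search_k` for the candidate budget (the "k" in step 1) are assumptions:

```python
import heapq
import itertools
import math
import random


def build_tree(ids, vectors, rng, leaf_size):
    """Random-hyperplane tree: leaves are id lists, inner nodes are tuples."""
    if len(ids) <= leaf_size:
        return ids
    p, q = (vectors[i] for i in rng.sample(ids, 2))
    norm = math.dist(p, q)
    if norm == 0.0:
        return ids  # duplicate points, cannot split
    # Unit normal, so that margins are true distances to the hyperplane.
    normal = [(a - b) / norm for a, b in zip(p, q)]
    offset = -sum(n * (a + b) / 2.0 for n, a, b in zip(normal, p, q))
    left, right = [], []
    for i in ids:
        side = sum(n * x for n, x in zip(normal, vectors[i])) + offset
        (left if side < 0 else right).append(i)
    if not left or not right:
        return ids
    return (normal, offset,
            build_tree(left, vectors, rng, leaf_size),
            build_tree(right, vectors, rng, leaf_size))


def query(forest, vectors, q, n, search_k):
    """Approximate n nearest neighbors of q, inspecting about search_k items."""
    counter = itertools.count()  # tie-breaker so nodes are never compared
    # 1. One shared priority queue over all trees, most promising node first.
    heap = [(-math.inf, next(counter), tree) for tree in forest]
    heapq.heapify(heap)
    candidates = []
    while heap and len(candidates) < search_k:
        neg_priority, _, node = heapq.heappop(heap)
        if isinstance(node, list):
            candidates.extend(node)
            continue
        normal, offset, left, right = node
        margin = sum(a * x for a, x in zip(normal, q)) + offset
        priority = -neg_priority
        # Priority of a child is the smallest margin on the path from the root.
        heapq.heappush(heap, (-min(priority, margin), next(counter), right))
        heapq.heappush(heap, (-min(priority, -margin), next(counter), left))
    # 2. Union of the visited leaves, duplicates removed.
    unique = set(candidates)
    # 3. and 4. Exact distances for the survivors, then keep the nearest n.
    return heapq.nsmallest(n, unique, key=lambda i: math.dist(q, vectors[i]))


rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(1000)]
forest = [build_tree(list(range(1000)), vectors, rng, leaf_size=20)
          for _ in range(10)]
result = query(forest, vectors, vectors[0], n=5, search_k=200)
print(result)
```

Raising `search_k` or the number of trees inspects more candidates, which trades speed for accuracy; with an unlimited budget the result equals the exhaustive search.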
Find candidates
Take union of all leaves
Compute distances
Return nearest neighbors
“Curse of dimensionality”
Are we screwed?
• It would be nice if the data had a much smaller
“intrinsic dimension”!
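One way to see the problem is a small experiment on uniformly random points, which have no low-dimensional structure at all: as the dimension grows, the nearest and the farthest point end up at almost the same distance. The function name `contrast` and the sample sizes are choices made for this sketch:

```python
import math
import random


def contrast(dims, n_points=200, seed=0):
    """Ratio of farthest to nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dims)]
    dists = [math.dist(query, [rng.random() for _ in range(dims)])
             for _ in range(n_points)]
    return max(dists) / min(dists)


for dims in (2, 10, 100, 1000):
    print(dims, round(contrast(dims), 2))
```

The ratio shrinks toward 1 as the dimension increases. Real vector models tend to behave better than this worst case when their points lie near a lower-dimensional structure.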
Improving the algorithm
[Plot: Queries/s against 1-NN accuracy; arrows mark the "more accurate" and "faster" directions.]
• https://github.com/erikbern/ann-benchmarks
ann-benchmarks
perf/accuracy tradeoffs
[Plot: Queries/s against 1-NN accuracy; annotations: "search more nodes", "more trees".]
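Both axes are easy to measure yourself. A minimal sketch, assuming "1-NN accuracy" means the fraction of queries whose top approximate result is the true nearest neighbor; the helper names and the toy result lists are hypothetical:

```python
import time


def one_nn_accuracy(approx_top1, exact_top1):
    """Fraction of queries where the approximate top hit is the true neighbor."""
    if len(approx_top1) != len(exact_top1):
        raise ValueError("need one approximate and one exact result per query")
    if not exact_top1:
        raise ValueError("need at least one query")
    hits = sum(a == e for a, e in zip(approx_top1, exact_top1))
    return hits / len(exact_top1)


def queries_per_second(search, queries):
    """Throughput of a search function over a list of queries."""
    start = time.perf_counter()
    for q in queries:
        search(q)
    elapsed = time.perf_counter() - start
    return len(queries) / max(elapsed, 1e-9)


# Hypothetical top-1 item ids for four queries.
exact = [17, 4, 256, 9]
approx = [17, 4, 31, 9]
print(one_nn_accuracy(approx, exact))  # 0.75
```

Each setting of an index (number of trees, nodes searched) gives one point on the plot: its accuracy against its throughput.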
Things that work
• Smarter plane splitting
• Priority queue heuristics
• Search more nodes than number of results
• Align nodes closer together
Things that don’t work
• Use lower-precision arithmetic
• Priority queue by other heuristics (number of trees)
• Precompute vector norms
Things for the future
• Use an optimization scheme for tree building
• Add more distance functions (e.g. edit distance)
• Use a proper KV store as a backend (e.g. LMDB) to
support incremental adds, out-of-core, arbitrary
keys: https://github.com/Houzz/annoy2
Thanks!
• https://github.com/spotify/annoy
• https://github.com/erikbern/ann-benchmarks
• https://github.com/erikbern/ann-presentation
• erikbern.com
• @fulhack
