The Case for
Learned Index Structures
Eric Fu
Agenda
• Introduction
• Background
• Range Index
• RM-Index
• Performance
• Point Index & Existence Index
Introduction
• Index: map key to position efficiently
• B-Tree
• Self-balancing search tree (multi-way, not binary)
• Store on disk
• Lookup in O(log n)
B-Tree vs. Models
• Task: Predict the offset of value given a key
• Input: key
• Output:
• B-Tree: [pos, pos + pagesize]
• Model: [pos - min_err, pos + max_err]
What is Machine Learning?
• Machine learning is a field of computer science that
gives computers the ability to learn without being explicitly
programmed.
• Statistics: collect data → build model → predict
Problems
• Regression
• Classification
• Clustering
Algorithms
• Linear Regression
• Decision Tree
• Neural Network
• Support Vector Machine (SVM)
• Bayes Classifier
• K-means
.......
Linear Regression
Decision Tree
Neuron
Activation Function
Neural Network
Deep Neural Network
GoogLeNet, 22 layers
Machine Learning
• A regular process
• Feature extraction
• Train model
• Test model
• Objective function
• minimum error (e.g. MSE)
Machine Learning
• The biggest challenge - overfitting
B-Tree vs. Models
• Task: Predict the offset of value given a key
• Input: key
• Output:
• B-Tree: [pos, pos + pagesize]
• Model: [pos - min_err, pos + max_err]
How to bound min_err, max_err? No test dataset!
Index as a Function
• A B-Tree and an ML model both fit this key-to-position curve, each with a different approach.
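The two slides above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes distinct, sorted numeric keys, uses a single least-squares line as the model, and the names `fit_linear` and `LearnedRangeIndex` are invented for this example. Because the index only has to serve the keys it stores, the error bounds are measured over all of them, with no held-out test set.

```python
import bisect

def fit_linear(xs, ys):
    """Least-squares line y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return slope, mean_y - slope * mean_x

class LearnedRangeIndex:
    """Hypothetical single-model index over distinct, sorted keys."""

    def __init__(self, sorted_keys):
        self.keys = sorted_keys
        self.slope, self.intercept = fit_linear(sorted_keys, range(len(sorted_keys)))
        # Measure the error on every stored key: true position minus prediction.
        diffs = [pos - self.predict(k) for pos, k in enumerate(sorted_keys)]
        self.max_err = max(diffs)    # true position is at most this far to the right
        self.min_err = -min(diffs)   # and at most this far to the left

    def predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        n = len(self.keys)
        pos = self.predict(key)
        # Search only inside [pos - min_err, pos + max_err], clamped to the array.
        lo = min(max(pos - self.min_err, 0), n)
        hi = max(lo, min(pos + self.max_err + 1, n))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None

demo_keys = [3, 8, 9, 15, 21, 22, 40, 41, 57, 90]
demo_index = LearnedRangeIndex(demo_keys)
print([demo_index.lookup(k) for k in demo_keys], demo_index.lookup(10))
```

Every stored key falls inside its search window by construction, so a lookup can miss only for keys that are not in the index.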
Recursive Model Index (RMI)
Root and middle nodes
• Pick a model for next stage
Leaf nodes
• Predict position
Recursive Model Index (RMI)
Solved last-mile dilemma!
• 100M → 1M → 10K → 100
• Not a tree
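A rough sketch of the idea with two stages, assuming distinct, sorted numeric keys and plain linear models at both stages (the paper also allows small neural networks). The class name `TwoStageRMI`, the `num_leaves` parameter, and the helper `fit_linear` are hypothetical. The root model only picks a leaf model; the leaf model predicts the position and carries its own error bounds.

```python
import bisect

def fit_linear(xs, ys):
    """Least-squares line y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return slope, mean_y - slope * mean_x

class TwoStageRMI:
    """Hypothetical two-stage recursive model index over distinct, sorted keys."""

    def __init__(self, sorted_keys, num_leaves=16):
        self.keys = sorted_keys
        self.num_leaves = num_leaves
        n = len(sorted_keys)
        # Stage 1: one model over all keys, used only to choose a leaf.
        self.root = fit_linear(sorted_keys, range(n))
        buckets = [[] for _ in range(num_leaves)]
        for pos, key in enumerate(sorted_keys):
            buckets[self._leaf_id(key)].append((key, pos))
        # Stage 2: one model per leaf, each with its own error bounds.
        self.leaves = []
        for bucket in buckets:
            if not bucket:
                self.leaves.append(None)
                continue
            ks, ps = zip(*bucket)
            slope, intercept = fit_linear(ks, ps)
            errs = [p - round(slope * k + intercept) for k, p in bucket]
            self.leaves.append((slope, intercept, min(errs), max(errs)))

    def _leaf_id(self, key):
        slope, intercept = self.root
        leaf = int((slope * key + intercept) * self.num_leaves / len(self.keys))
        return min(max(leaf, 0), self.num_leaves - 1)

    def lookup(self, key):
        leaf = self.leaves[self._leaf_id(key)]
        if leaf is None:
            return None
        slope, intercept, lo_err, hi_err = leaf
        pred = round(slope * key + intercept)
        n = len(self.keys)
        lo = min(max(pred + lo_err, 0), n)
        hi = max(lo, min(pred + hi_err + 1, n))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        return i if i < n and self.keys[i] == key else None

rmi_keys = [i * i + 3 * i for i in range(2000)]   # a deliberately non-linear key set
rmi = TwoStageRMI(rmi_keys)
print(rmi.lookup(rmi_keys[1234]), rmi.lookup(1))
```

The stages are not a tree in the B-Tree sense: the root does not partition the key space exactly, it only routes each key to the model that was trained on the keys routed the same way.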
Hybrid Index
Parameter: model structures
Test datasets
• weblogs: access timestamp -> log entry (~200M)
• maps: longitude -> locations (~200M)
• web-documents: documents (strings) -> document-id (~10M)
• lognormal:
Test models
• B-Tree with different page sizes
• very competitive performance
• RMI with 2-stage models using simple grid-search
• 0 to 2 hidden layers
• layer-width ranging from 4 to 32 nodes
• Total time = lookup time + search time
Conclusion
• Up to 3x faster than the B-Tree baselines
• An order of magnitude smaller in index size
• Data distribution dependent
Inserts and Updates
• Achilles heel of learned indexes because of the potentially high cost
for learning models
• Introduce additional space in sorted dataset, similar to a B-Tree
• Assume that the inserts follow roughly a similar pattern as the
learned CDF
• What happens if the distribution changes?
• Retrain model. Stage 2 -> Stage 1
• Delta-index
Point Index
• Hash collisions
• Collision handling (e.g. chaining with a linked list, or probing)
• Trade-off between time and space
• Learned Hash-map
• more uniform hash function
• more unique mapping (fewer collisions)
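One way to picture a learned hash function is to scale the model's position estimate to the number of slots. The sketch below is an illustration under loud assumptions: the keys are evenly spaced on purpose (the most favorable case for a linear model), `blake2b` stands in for a conventional randomizing hash and is not the baseline used in the talk, and all function names are invented here. On real data the gain depends on how well the model captures the key distribution.

```python
import hashlib

def fit_linear(xs, ys):
    """Least-squares line y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var if var else 0.0
    return slope, mean_y - slope * mean_x

def learned_slot(key, model, num_keys, num_slots):
    """Scale the estimated position of the key to the slot range."""
    slope, intercept = model
    pos = slope * key + intercept
    slot = int(pos * num_slots / num_keys)
    return min(max(slot, 0), num_slots - 1)

def baseline_slot(key, num_slots):
    """Stand-in for a conventional hash: mix the key bits, then take the modulus."""
    digest = hashlib.blake2b(key.to_bytes(8, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_slots

def count_collisions(slots):
    """Number of keys that land in a slot already taken by another key."""
    return len(slots) - len(set(slots))

num_keys = 10_000
hash_keys = [1000 + 7 * i for i in range(num_keys)]   # evenly spaced toy keys
num_slots = num_keys * 5 // 4                         # 125% of the data
hash_model = fit_linear(hash_keys, range(num_keys))

learned_collisions = count_collisions(
    [learned_slot(k, hash_model, num_keys, num_slots) for k in hash_keys])
baseline_collisions = count_collisions(
    [baseline_slot(k, num_slots) for k in hash_keys])
print(learned_collisions, baseline_collisions)
```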
Point Index
• Baseline Hash-map
• only uses 2 multiplications, 3 bitshifts and 3 XORs
• 2-stage RMI models
• 100k models on the 2nd stage
• without any hidden layers.
• available slots from 75% to 125% of the data
Existence Index
• most importantly Bloom-Filters
• Guarantee no false negative
• Potential false positive (FP)
• Target FPR = 0.1%: ~14 bits per key
• Target FPR = 0.01%: ~18 bits per key
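For reference, the standard sizing formula for an optimally configured Bloom filter is bits per key = -ln(p) / (ln 2)^2, where p is the target false positive rate. The helper name `bits_per_key` is made up for this sketch. The formula gives roughly 14.4 bits per key at 0.1% and roughly 19.2 at 0.01%, in the same range as the rounded figures on the slide.

```python
import math

def bits_per_key(fpr):
    """Bits per stored key for an optimally sized Bloom filter at the given FPR."""
    return -math.log(fpr) / math.log(2) ** 2

for fpr in (0.001, 0.0001):
    print(f"target FPR {fpr}: {bits_per_key(fpr):.1f} bits per key")
```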
Bloom-filters with learned hash-functions
• We denote the set of keys by K and the set of non-keys by U
• Dataset of non-keys U
• randomly generated keys
• based on logs of previous queries
• generated by another ML model
• A binary classification task
• NN with Sigmoid activation function
Bloom-filters with learned hash-functions
• non-zero FPR and FNR
• as the FPR goes down, the FNR will go up
• How to preserve the FNR = 0 constraint?
• an overflow Bloom filter for the keys the model misses
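The combination can be sketched as follows. Everything here is illustrative: `toy_score` is a hypothetical stand-in for the trained classifier, the threshold and filter sizes are arbitrary, and the small `BloomFilter` class is a simple double-hashing filter written for this example. Keys that the model scores below the threshold go into the overflow filter, so no stored key is ever reported as absent.

```python
import hashlib

class BloomFilter:
    """Small Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def toy_score(url):
    """Hypothetical classifier output in [0, 1]; a real model would be trained."""
    return 0.9 if "login" in url or "verify" in url else 0.1

class LearnedBloomFilter:
    def __init__(self, keys, score, threshold, overflow_bits=1024, num_hashes=4):
        self.score = score
        self.threshold = threshold
        self.overflow = BloomFilter(overflow_bits, num_hashes)
        for key in keys:
            if score(key) < threshold:      # the model would miss this key
                self.overflow.add(key)

    def __contains__(self, item):
        # Accept if the model says "key"; otherwise fall back to the overflow filter.
        return self.score(item) >= self.threshold or item in self.overflow

blacklist = [
    "http://bank-login.example/secure",
    "http://verify-account.example/",
    "http://free-prizes.example/win",       # the toy model scores this one low
]
learned_filter = LearnedBloomFilter(blacklist, toy_score, threshold=0.5)
print(all(url in learned_filter for url in blacklist))
```

False positives can still come from either component: the model may score a non-key above the threshold, and the overflow filter has its own false positive rate.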
Bloom-filters with learned hash-functions
• Task: blacklisted phishing URLs (~1.7M)
• a character-level RNN (GRU)
• Bloom filter
• desired 1% FPR requires 2.04MB space
• Our approach
• GRU model: 0.0259MB
• With spill-over Bloom filter: 1.07MB (save ~47%)
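The figures on this slide can be sanity-checked with the standard Bloom filter sizing formula. The key count is the approximate 1.7M from the slide, the 1.07MB total is taken from the slide as-is, and `bloom_size_mb` is a helper named for this sketch.

```python
import math

def bloom_size_mb(num_keys, fpr):
    """Size in megabytes of an optimally sized Bloom filter."""
    bits = num_keys * -math.log(fpr) / math.log(2) ** 2
    return bits / 8 / 1e6

plain_mb = bloom_size_mb(1.7e6, 0.01)   # plain Bloom filter at 1% FPR
combined_mb = 1.07                      # model plus spill-over filter, from the slide
saving = 1 - combined_mb / plain_mb
print(f"plain: {plain_mb:.2f} MB, saving: {saving:.1%}")
```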
Thank You
Q&A
Furthermore...
• Machine Learning by Zhou Zhi-hua
• TensorFlow and deep learning without a PhD
