Nikhil Simha, Airbnb
Andrew Hoh, Airbnb
Zipline – Airbnb’s ML Data Management Framework
#ML3SAIS
Feature Engineering
Idea → Discover Data → Training Set → Training & Evaluation
Airbnb Zipline
Discovering data
• Where is it?
• Is it good?
• Will it continue to be good?
• Has it already been done before?
Training set
• What processing engine to use?
• Backfilling
• Point-in-time correct windows
Production
• Training Set Generation
• Model Training
• Push model to prediction service
• Online Features
• Model monitoring
• Data quality monitoring
What is Zipline?
• Part of Bighead – Airbnb’s end-to-end ML platform
• Training Sets
• Feature Store
• Data quality
Concepts
• Data Source
• Feature-set
• Training-set
• Client
Feature-Set
• Source queries
– Hive tables
– Event streams
– Mutation streams
• Primary Keys
• Time Stamp
Feature-Set – Sources
• Multiple sources
– Backfills
– Migrations
Example/Operations
Data Quality
Training set
• Driver query
– primary keys
– timestamps
– Point-in-time correct
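A point-in-time correct join can be sketched as follows. This is a minimal illustration, not Zipline's implementation: the function name `point_in_time_count`, the tuple layouts, and the choice of a count aggregation are all assumptions made for the example. For each driver row (primary key, timestamp) it counts only the events for that key that happened strictly before the driver timestamp and inside the window, so no information from the future leaks into the training row.

```python
from bisect import bisect_left
from collections import defaultdict


def point_in_time_count(driver_rows, events, window_seconds):
    """Count events per driver row without looking into the future.

    driver_rows: list of (key, ts) pairs from the (hypothetical) driver query.
    events: list of (key, event_ts) pairs from a feature source.
    Returns a list of (key, ts, count) where count covers events with
    ts - window_seconds <= event_ts < ts.
    """
    # Group event timestamps by primary key and sort them once.
    by_key = defaultdict(list)
    for key, event_ts in events:
        by_key[key].append(event_ts)
    for timestamps in by_key.values():
        timestamps.sort()

    result = []
    for key, ts in driver_rows:
        timestamps = by_key.get(key, [])
        # Index of the first event at or after ts: everything before it is in the past.
        hi = bisect_left(timestamps, ts)
        # Index of the first event at or after the window start.
        lo = bisect_left(timestamps, ts - window_seconds)
        result.append((key, ts, hi - lo))
    return result


if __name__ == "__main__":
    events = [("u1", 10), ("u1", 20), ("u1", 30), ("u2", 15)]
    driver = [("u1", 25), ("u1", 30), ("u2", 10)]
    print(point_in_time_count(driver, events, window_seconds=15))
```

The same key can appear in the driver rows at several timestamps, and each row gets the feature value as it would have been observed at that moment.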
Training set – example
Backfills
[Diagram labels: Old training set, Time, Features, Label offsets]
Online scoring
• Backfills
• Low latency
• DB Mutations
– Batch correction
Lambda
• Batch – Spark
• Streaming – Flink
• GDPR
[Diagram: User Conf; Batch (*Daily); Streaming (*Continuous); KV Store; Zipline Client; User App]
Why Flink?
• Across batch and streaming
• Fixed length sliding windows
• Raw Events at the tail and head
Aggregations
7 day sliding window
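One possible reading of "raw events at the tail and head" is that the middle of a fixed-length window is served from pre-aggregated whole days, while the two partial days at either end are computed from raw events. The sketch below follows that reading; the function name `sliding_window_sum`, the day-sized buckets, and the sum aggregation are assumptions for illustration, not details confirmed by the slides.

```python
DAY = 86_400  # seconds per day


def sliding_window_sum(events, query_ts, window_days=7):
    """Sum values with query_ts - window_days*DAY <= ts < query_ts.

    events: list of (ts, value) pairs with integer timestamps in seconds.
    The window is split into a tail (partial oldest day, raw events),
    a middle (whole days, per-day partial sums), and a head (partial
    newest day, raw events).
    """
    window_start = query_ts - window_days * DAY
    first_full_day = -(-window_start // DAY)  # ceiling division
    end_full_days = query_ts // DAY           # exclusive day index

    # Batch side: one partial sum per calendar day.
    daily = {}
    for ts, value in events:
        day = ts // DAY
        daily[day] = daily.get(day, 0) + value

    middle = sum(daily.get(d, 0) for d in range(first_full_day, end_full_days))
    # Raw events in the partial day at the old end of the window.
    tail = sum(v for ts, v in events if window_start <= ts < first_full_day * DAY)
    # Raw events in the partial day at the new end of the window.
    head = sum(v for ts, v in events if end_full_days * DAY <= ts < query_ts)
    return tail + middle + head


if __name__ == "__main__":
    hourly = [(t, 1) for t in range(0, 10 * DAY, 3600)]
    print(sliding_window_sum(hourly, query_ts=9 * DAY + 5000))
```

Because the window length stays fixed while the query time moves, only the tail and head need raw-event precision; the whole days in between can be reused across queries.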
Availability
Processing time
Event time
Client
• Aggregation
• Logic to handle availability
• Java
– The API is (List of feature names) => Map of feature values, as objects.
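The shape of that API can be sketched in Python as an analogue of the Java signature. Everything here is hypothetical: the class name `FeatureClient`, the `get_features` method, the extra `key` argument, the two in-memory dictionaries standing in for the KV store, and the rule of adding the batch and streaming parts are assumptions for illustration, not the real client. A feature that has not arrived on either side is returned as `None`, which is one simple way to handle availability.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Store = Dict[Tuple[Hashable, str], float]


class FeatureClient:
    """Hypothetical client: merges batch and streaming partial aggregates."""

    def __init__(self, batch_store: Store, streaming_store: Store) -> None:
        self.batch_store = batch_store          # written by the daily batch job
        self.streaming_store = streaming_store  # written continuously by streaming

    def get_features(self, key: Hashable, feature_names: List[str]) -> Dict[str, Optional[float]]:
        result: Dict[str, Optional[float]] = {}
        for name in feature_names:
            batch = self.batch_store.get((key, name))
            stream = self.streaming_store.get((key, name))
            if batch is None and stream is None:
                # Neither side has produced this feature yet.
                result[name] = None
            else:
                # Treat a missing side as an empty (zero) partial aggregate.
                result[name] = (batch or 0) + (stream or 0)
        return result


if __name__ == "__main__":
    client = FeatureClient({("u1", "views_7d"): 40}, {("u1", "views_7d"): 2})
    print(client.get_features("u1", ["views_7d", "bookings_7d"]))
```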
Mutations
• Organizationally easy
• Consistent
• Not flexible
• More complex
– Inverse
Mutations - Aggregation
• Monoids
– Associative: (a + b) + c = a + (b + c)
– Distribute aggregation
– Sum, Avg, Count, etc.
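• Group
– Invertible: every value has an inverse that removes it from the aggregate
– Sum, Count and Avg are invertible; Min, Max, Median, Ntile are not
– Removing a value from those is not possible without memory
– Compromise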
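The monoid and group properties can be illustrated with an average kept as a (total, count) pair. This is a sketch under stated assumptions: the class name `Avg` and its methods are invented for the example. Merging is associative, so partial aggregates can be computed on different machines and combined in any grouping, and each value has an inverse, so a contribution can be removed again without replaying the history.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Optional


@dataclass(frozen=True)
class Avg:
    """Average stored as a running total and count (hypothetical helper)."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "Avg") -> "Avg":
        # Associative: (a + b) + c == a + (b + c).
        return Avg(self.total + other.total, self.count + other.count)

    def inverse(self) -> "Avg":
        # Merging with the inverse removes this contribution.
        return Avg(-self.total, -self.count)

    def value(self) -> Optional[float]:
        return self.total / self.count if self.count else None


def aggregate(values: Iterable[float]) -> Avg:
    # Avg() is the identity element of the merge operation.
    return reduce(Avg.merge, (Avg(v, 1) for v in values), Avg())


if __name__ == "__main__":
    a, b, c = aggregate([1, 2]), aggregate([3]), aggregate([4, 5, 6])
    print(a.merge(b).merge(c).value())
    print(a.merge(b).merge(b.inverse()) == a)
```

A maximum, by contrast, cannot be undone from the aggregate alone: once the largest value is removed, the next largest is only known if the other values were kept.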
Mutations - Aggregation
• Deletes
= Inverse(before)
• Updates
= Inverse(before) + after
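The two rules above can be sketched for a sum-and-count aggregate. The function name `apply_mutation` and the convention of passing `None` for a missing before or after image are assumptions for illustration; a real mutation stream would carry full rows rather than single numbers.

```python
def apply_mutation(state, before=None, after=None):
    """Apply one database mutation to a (total, count) aggregate.

    insert: before is None, after is the new value
    delete: after is None, before is the old value  -> Inverse(before)
    update: both are given                          -> Inverse(before) + after
    """
    total, count = state
    if before is not None:
        # Inverse(before): subtract the old row's contribution.
        total -= before
        count -= 1
    if after is not None:
        # Add the new row's contribution.
        total += after
        count += 1
    return (total, count)


if __name__ == "__main__":
    state = (0, 0)
    state = apply_mutation(state, after=10)             # insert 10
    state = apply_mutation(state, after=20)             # insert 20
    state = apply_mutation(state, before=10, after=15)  # update 10 to 15
    state = apply_mutation(state, before=20)            # delete 20
    print(state)
```

This only works because sum and count are invertible; for aggregations without an inverse, the before image alone is not enough to correct the aggregate.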
Overview
• FeatureSet Conf
• Feature Stream
• Batch Table
• Stats Table
• Data Quality UI
• Client
• Offline Training Set
When?
• Plan to open source in Q3 2018.
• Will be part of Bighead
– Talk tomorrow.
Questions?

Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha and Andrew Hoh