Distributed Time Travel for Feature Generation
DB Tsai
March 24, 2016 at SF Big Analytics Meetup
Who am I?
• I am a Senior Research Engineer at Netflix
• I am an Apache Spark Committer
Turn on Netflix, and the absolute best
content for you would automatically
start playing
Data Driven
• Try an idea offline using historical data to see if it would have made better recommendations
• If it did, deploy a live A/B test to see if it performs well in production
Quickly try ideas on historical data and
transition to online A/B test
Feature Engineering
Why build a Time Machine?
Label Leaks
• Recommendation is an analytic learning process that learns from the past to predict the future.
• P(play_t | features_t') where t' < t
• Features must be generated very carefully; otherwise they can contain the labels, so that offline metrics look very good but online evaluation does not perform as expected.
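The constraint t' < t can be enforced mechanically by filtering events before any feature is computed. The following is a minimal sketch of that idea, not code from the talk; the `Event` record, its fields, and the `play_count_feature` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """Hypothetical logged event; the field names are illustrative only."""
    member_id: str
    video_id: str
    timestamp: int  # seconds since epoch


def events_before(events: List[Event], label_time: int) -> List[Event]:
    """Keep only events strictly earlier than the label time (t' < t)."""
    return [e for e in events if e.timestamp < label_time]


def play_count_feature(events: List[Event], label_time: int) -> int:
    """Number of plays that were visible before the label time."""
    return len(events_before(events, label_time))


history = [
    Event("m1", "v1", 100),
    Event("m1", "v2", 200),
    Event("m1", "v3", 300),  # the play to be predicted at t = 300
]
leak_free_count = play_count_feature(history, label_time=300)
```

With a non-strict comparison (`<=`), the play at t = 300 would be counted in its own feature, which is exactly the kind of leak the slide warns about.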
The Past
• Generate features based on event data logged in Hive
  • Need to reimplement features for online A/B test
  • Data discrepancies between offline and online sources
• Log features online where the model will be used
  • Need to deploy each idea into production
• Feature generation calls online services and filters data past a certain time
  • Works only when a service records a log of historical events
  • Additional load on online services
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
Time Travel using Snapshots
• Snapshot online services and use the snapshot data offline to generate features
• Share facts and features between experiments without calling live systems
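One way to picture snapshot-based time travel is as a lookup keyed by time: given a destination time, use the most recent snapshot taken at or before it. The sketch below is an assumption-laden illustration, not the DeLorean implementation; `SnapshotStore`, `put`, and `as_of` are hypothetical names, and an in-memory dictionary stands in for the stored snapshots.

```python
import bisect
from typing import Dict, List, Optional


class SnapshotStore:
    """Hypothetical in-memory stand-in for snapshots kept in storage."""

    def __init__(self) -> None:
        self._times: List[int] = []       # sorted snapshot timestamps
        self._data: Dict[int, dict] = {}  # timestamp -> snapshot payload

    def put(self, taken_at: int, payload: dict) -> None:
        """Record a snapshot taken at the given time."""
        if taken_at not in self._data:
            bisect.insort(self._times, taken_at)
        self._data[taken_at] = payload

    def as_of(self, destination_time: int) -> Optional[dict]:
        """Return the latest snapshot taken at or before destination_time."""
        idx = bisect.bisect_right(self._times, destination_time)
        if idx == 0:
            return None  # no snapshot exists that early
        return self._data[self._times[idx - 1]]


store = SnapshotStore()
store.put(10, {"viewing_history": ["v1"]})
store.put(20, {"viewing_history": ["v1", "v2"]})
```

Here `store.as_of(15)` returns the snapshot taken at time 10, so features built from it cannot see the play of `v2` that was recorded later.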
How to build a Time Machine
• Context Selection
• Data Snapshots
• APIs for Time Travel
Context Selection
• Runs once a day
• Stratified sampling over data in Hive selects the contexts
• Contexts are tagged with metadata
• The resulting context set is stored in S3
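Stratified sampling draws from each group separately so that small groups are still represented in the selected contexts. A minimal sketch follows, assuming the stratum is a single attribute such as country; the talk does not say which attributes Netflix stratifies on, and `stratified_sample` is a hypothetical helper.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, TypeVar

T = TypeVar("T")


def stratified_sample(
    rows: List[T],
    stratum_of: Callable[[T], Hashable],
    fraction: float,
    seed: int = 0,
) -> List[T]:
    """Sample about `fraction` of the rows in each stratum, at least one each."""
    rng = random.Random(seed)
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for row in rows:
        groups[stratum_of(row)].append(row)

    sample: List[T] = []
    for members in groups.values():
        k = min(len(members), max(1, round(len(members) * fraction)))
        sample.extend(rng.sample(members, k))
    return sample


# Illustrative contexts: 90 profiles in one country, 10 in another.
contexts = (
    [{"profile": f"us{i}", "country": "US"} for i in range(90)]
    + [{"profile": f"br{i}", "country": "BR"} for i in range(10)]
)
picked = stratified_sample(contexts, lambda c: c["country"], fraction=0.1)
```

A uniform 10% sample of the same 100 contexts could easily contain no contexts from the smaller country; sampling per stratum guarantees at least one.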
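Data Snapshots
• Runs once a day, starting from the context set in S3
• Snapshots data for each context from the online services through Prana (Netflix libraries): Viewing History Service, MyList Service, Ratings Service
• The snapshots are stored in S3
• Formats labelled in the diagram: Thrift, Parquet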
APIs for Time Travel
Data Architecture (diagram combining the three stages)
• Context Selection: runs once a day; stratified sampling over data in Hive; contexts tagged with metadata; context set stored in S3
• Data Snapshots: runs once a day; snapshots the Viewing History, MyList, and Ratings services through Prana (Netflix libraries); labelled Thrift in the diagram; snapshots stored in S3
• Batch APIs: return an RDD of snapshot objects
Generating Features via Time Travel
Great Scott!
• DeLorean: A time-traveling vehicle
  • uses data snapshots to travel in time
  • scales with Apache Spark
  • prototypes new ideas with Zeppelin
  • requires minimal code changes from experimentation to A/B test to production
https://en.wikipedia.org/wiki/Emmett_Brown
There’s the DeLorean!
Running Time Travel Experiment
Select the destination time
Bring it up to 88 miles per hour!
Running Time Travel Experiment (workflow diagram)
• Offline experiment
  • Design experiment
  • Collect label dataset (selected contexts)
  • DeLorean: offline feature generation
  • Distributed model training: parallel training of individual models using different executors
  • Compute validation metrics
  • Model testing: choose best model
• Bad metrics: design a new experiment to test out different ideas
• Good metrics: move to the online system for online A/B testing
DeLorean Input Data
• Contexts: The setting for evaluating a set of items (e.g. tuples of member profiles, country, time, device, etc.)
• Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities).
• Labels: For supervised learning, this will be the label (target) for each item.
Feature Encoders
• Compute features for each item in a given context
• Each type of raw data element has its own data key
• Data map is a map from data keys to data objects in a given context
• Data map is consumed by feature encoder to compute features
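The relationship between data keys, the data map, and feature encoders can be sketched in a few lines. This is an illustration under assumptions, not the DeLorean API: the `FeatureEncoder` class, `encode_items`, and the key names `MY_LIST` and `VIEWING_HISTORY` are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Sequence

# A data map holds, for one context, the data object stored under each data key.
DataMap = Dict[str, Any]


class FeatureEncoder:
    """Hypothetical encoder interface; declares the data keys it consumes."""

    def __init__(
        self,
        name: str,
        required_keys: Sequence[str],
        fn: Callable[[DataMap, str], float],
    ) -> None:
        self.name = name
        self.required_keys = list(required_keys)
        self._fn = fn

    def encode(self, data_map: DataMap, item: str) -> float:
        # Fail early if the data map lacks a key this encoder depends on.
        missing = [k for k in self.required_keys if k not in data_map]
        if missing:
            raise KeyError(f"{self.name} is missing data keys: {missing}")
        return self._fn(data_map, item)


def encode_items(
    encoders: List[FeatureEncoder], data_map: DataMap, items: List[str]
) -> Dict[str, Dict[str, float]]:
    """Compute every encoder's feature for each item in one context."""
    return {
        item: {enc.name: enc.encode(data_map, item) for enc in encoders}
        for item in items
    }


# Illustrative encoder: 1.0 if the item is already in the member's list.
in_my_list = FeatureEncoder(
    "in_my_list",
    required_keys=["MY_LIST"],
    fn=lambda data, item: 1.0 if item in data["MY_LIST"] else 0.0,
)
data_map = {"MY_LIST": {"v1", "v3"}, "VIEWING_HISTORY": ["v2"]}
features = encode_items([in_my_list], data_map, ["v1", "v2"])
```

Declaring the required keys on the encoder means a generation job can work out which data elements to load before it runs any encoder.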
Two Types of Data Elements
• Context-dependent data elements
  • Viewing History
  • MyList
  • ...
• Context-independent data elements
  • Video Metadata
  • Genre Metadata
  • ...
Video Country of Origin Matching Fraction
(Worked example shown as a figure: for each context and its items, a context-dependent data element, Viewing History, is combined with a context-independent data element, Video Metadata, to produce one feature value per item. Feature values shown in the figure: 0.5, 0.5, 0.5 and 1.0, 0.0, 1.0.)
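The slide does not spell out the formula, so the sketch below assumes one plausible reading of the feature name: the fraction of videos in the member's viewing history whose country of origin matches that of the item being scored. The function name, argument names, and sample data are hypothetical, and the output is not meant to reproduce the values in the figure.

```python
from typing import Dict, List


def country_match_fraction(
    viewing_history: List[str],    # context-dependent: videos this member played
    video_country: Dict[str, str],  # context-independent: video -> country of origin
    item: str,
) -> float:
    """Assumed definition: share of history videos from the item's country."""
    if not viewing_history:
        return 0.0
    target = video_country[item]
    matches = sum(1 for v in viewing_history if video_country.get(v) == target)
    return matches / len(viewing_history)


video_country = {"v1": "US", "v2": "FR", "a": "US", "b": "JP"}
viewing_history = ["v1", "v2"]

fraction_a = country_match_fraction(viewing_history, video_country, "a")
fraction_b = country_match_fraction(viewing_history, video_country, "b")
```

One of the two history videos shares a country with item `a` and none does with item `b`, which is why the viewing history has to be looked up per context while the video metadata can be shared across all contexts.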
Feature Generation (diagram)
• The feature model (JSON) names the feature encoders to use
• The feature encoders give the required feature keys and data keys
• Data elements supply the data for those keys from the S3 snapshot as POJOs, collected in the data map
• The feature encoders turn the data map into features
• Features are combined with the label data and passed to model training
• Represented in Spark’s DataFrames
• In nested structure to avoid data shuffling in ranking process
• Stored with Parquet format in S3
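The benefit of the nested layout is that everything needed to rank one context's items already sits in a single record, so no grouping across records is required. The plain-Python sketch below only illustrates the shape of such a record; the field names are assumptions and the real data lives in Spark DataFrames.

```python
from typing import Callable, Dict, List

# One record per context; its items, labels, and features are nested inside.
nested_rows: List[dict] = [
    {
        "context": {"profile": "p1", "country": "US"},
        "items": [
            {"video": "v1", "label": 0, "features": {"match_fraction": 0.2}},
            {"video": "v2", "label": 1, "features": {"match_fraction": 0.9}},
            {"video": "v3", "label": 0, "features": {"match_fraction": 0.5}},
        ],
    },
]


def rank_context(
    row: dict, score: Callable[[Dict[str, float]], float]
) -> List[str]:
    """Rank one context's items using only data held in that record."""
    ordered = sorted(
        row["items"], key=lambda it: score(it["features"]), reverse=True
    )
    return [it["video"] for it in ordered]


ranking = rank_context(nested_rows[0], lambda f: f["match_fraction"])
```

With a flat layout of one row per (context, item) pair, the same ranking would first need all rows of a context brought together, which in a distributed engine means a shuffle.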
Features (figure): one row per context, holding that context's items with their labels and features.
Going Online
• Offline experiment: S3 snapshot → DeLorean offline feature generation → model training / validation / testing
• Online system: Viewing History, MyList, and Ratings services → online feature generation → online ranking / scoring service
• Models are deployed from the offline experiment to the online system
• Offline and online feature generation use shared feature encoders
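Sharing encoders means the feature logic is written once and only the source of the data map changes between offline and online. A minimal sketch under assumed names (`history_size`, `offline_data_map`, `online_data_map`, and the `VIEWING_HISTORY` key are hypothetical, and a plain function stands in for the live service):

```python
from typing import Callable, Dict, List

DataMap = Dict[str, List[str]]


def history_size(data_map: DataMap) -> float:
    """Shared encoder: the only feature logic, used offline and online."""
    return float(len(data_map["VIEWING_HISTORY"]))


def offline_data_map(snapshot: Dict[str, List[str]], profile: str) -> DataMap:
    """Offline: read the viewing history from a stored snapshot."""
    return {"VIEWING_HISTORY": snapshot[profile]}


def online_data_map(
    fetch_history: Callable[[str], List[str]], profile: str
) -> DataMap:
    """Online: call the live service, represented here by a plain function."""
    return {"VIEWING_HISTORY": fetch_history(profile)}


def live_service(profile: str) -> List[str]:
    """Stand-in for a call to an online viewing history service."""
    return ["v1", "v2"]


snapshot = {"p1": ["v1", "v2"]}

offline_value = history_size(offline_data_map(snapshot, "p1"))
online_value = history_size(online_data_map(live_service, "p1"))
```

Because both paths end in the same `history_size` call, a feature validated offline needs no reimplementation before it goes into an A/B test.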
Conclusion
Spark helped us significantly reduce
the time from an idea to an AB Test
Future work
Event Driven Data Snapshots
Time Travel to the Future!!
We’re hiring! (come talk to us)
https://jobs.netflix.com/
Tech Blog: http://bit.ly/sparktimetravel
