RecSysOps:
Best Practices for Operating a Large-Scale
Recommender System
Ehsan (Mohammad) Saberian,
Justin Basilico
RecSys 2021
2021-09-27
@ehsan_saberian, @JustinBasilico, @NetflixResearch
Large-scale RecSys is a complex
operation
RecSys environment is dynamic
It changes every second!
New members
New items
New member interests
New ML models
Library updates
New code
How to ensure that our
RecSys is working
correctly?
RecSysOps:
Lessons we learned while
operating a large RecSys
Benefits:
Reduce firefighting time
Focus on innovation
Build trust with our stakeholders
RecSysOps Components
Detection
Prediction
Diagnosis
Resolution
Detection
Detect issues quickly
The most challenging part
There are endless potential issues
Some of them we don’t know yet!
Detection lesson 1:
Implement all the known best practices
Unit tests, integration tests
MLOps: data and metrics checks
CI/CD, regular retraining
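A data check of the kind listed above could be sketched as follows. This is a minimal illustration, not the authors' tooling: the field names (`item_id`, `language`) and the null-rate threshold are hypothetical.

```python
from typing import Any


def check_batch(rows: list[dict[str, Any]], required: list[str],
                max_null_rate: float = 0.01) -> list[str]:
    """Return one readable problem per required field that is missing too often."""
    if not rows:
        return ["batch is empty"]
    problems = []
    for field in required:
        nulls = sum(1 for row in rows if row.get(field) is None)
        rate = nulls / len(rows)
        if rate > max_null_rate:
            problems.append(
                f"{field}: null rate {rate:.2f} exceeds {max_null_rate:.2f}")
    return problems


# Hypothetical input batch: one of four items has no language set.
rows = [
    {"item_id": 1, "language": "en"},
    {"item_id": 2, "language": None},
    {"item_id": 3, "language": "fr"},
    {"item_id": 4, "language": "de"},
]
problems = check_batch(rows, required=["item_id", "language"], max_null_rate=0.1)
print(problems)
```

A check like this can run before every retraining job, so a bad batch is caught before it reaches the model.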
Detection lesson 2:
Monitor end-to-end your own way
Don’t rely only on partner teams’ audits
What does correct data
look like from your perspective?
Detection lesson 3:
Understand your stakeholders’ concerns
Stakeholders: members and items
Detection lesson 3.1:
Every time a member plays something
that is ranked low by the model, it is a
potential issue
Monitor and analyze them
Get inspiration for future innovations
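One way to monitor such plays is sketched below. The `Play` record, its fields, and the rank threshold of 100 are assumptions for illustration; the talk does not say how "ranked low" is defined.

```python
from dataclasses import dataclass


@dataclass
class Play:
    member_id: str
    item_id: str
    rank: int  # 1-based position the model gave the item when it was played


def low_rank_plays(plays: list[Play], rank_threshold: int = 100) -> list[Play]:
    """Plays of items the model ranked worse than rank_threshold."""
    return [p for p in plays if p.rank > rank_threshold]


def low_rank_rate(plays: list[Play], rank_threshold: int = 100) -> float:
    """Share of all plays that the model ranked low; 0.0 for an empty log."""
    if not plays:
        return 0.0
    return len(low_rank_plays(plays, rank_threshold)) / len(plays)


# Hypothetical play log.
plays = [
    Play("m1", "a", 3),
    Play("m1", "b", 450),
    Play("m2", "c", 12),
    Play("m3", "b", 900),
]
flagged = low_rank_plays(plays)
rate = low_rank_rate(plays)
print([p.item_id for p in flagged], rate)
```

Tracking the rate over time gives an alerting signal, while the flagged plays themselves are the cases to analyze.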
Detection lesson 3.2:
Engage with teams responsible for items
and understand their concerns
Is an item cold-started properly?
Is production bias hurting an item?
Build tools to detect their concerns and
integrate them into your system
Prediction
Can you predict
issues before they
happen?
Netflix case:
Is it possible to predict whether an item is
going to cold-start properly 7 days before
its launch date?
Yes, we can train a model to predict the
production model’s behaviour on the day of
launch
Flag any item with an unexpected prediction
and investigate
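The flagging step might look like the sketch below. It assumes the launch-day predictions already exist, produced by some separate model that is not shown. The rule for "unexpected" (a score below half the median of comparable past launches) is a hypothetical choice, since the talk does not specify one.

```python
from statistics import median


def flag_unexpected(predicted: dict[str, float],
                    reference_scores: list[float],
                    min_ratio: float = 0.5) -> list[str]:
    """Flag upcoming items whose predicted launch-day score is far below
    the median score of comparable past launches."""
    baseline = median(reference_scores)
    return sorted(item for item, score in predicted.items()
                  if score < min_ratio * baseline)


# Hypothetical numbers: launch-day scores of comparable past items,
# and predicted launch-day scores for three upcoming items.
reference = [0.40, 0.50, 0.45, 0.55, 0.50]
predicted = {"title_a": 0.48, "title_b": 0.10, "title_c": 0.30}
to_investigate = flag_unexpected(predicted, reference)
print(to_investigate)
```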
Prediction Lesson:
Try to predict issues before they happen
Diagnosis
Step 1:
Reproduce the issue in isolation
This needs sufficiently detailed logging
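A minimal sketch of logging that supports reproduction: record the exact inputs, the model version, and the score, then replay the record offline. The `MODELS` registry, the weighted-sum scorer, and the feature names are stand-ins invented for this example.

```python
import json
import math

# Hypothetical model registry: version name -> feature weights.
MODELS = {"v1": {"recency": 0.5, "popularity": 1.5}}


def score(features: dict[str, float], model_version: str) -> float:
    """Stand-in model: a weighted sum; absent features count as 0."""
    weights = MODELS[model_version]
    return sum(weights[name] * features.get(name, 0.0)
               for name in sorted(weights))


def log_line(request_id: str, features: dict[str, float],
             model_version: str) -> str:
    """Serialize everything needed to re-run this request later."""
    record = {
        "request_id": request_id,
        "model_version": model_version,
        "features": features,
        "score": score(features, model_version),
    }
    return json.dumps(record, sort_keys=True)


def replay(line: str) -> bool:
    """Re-score a logged request and confirm it matches the logged score."""
    record = json.loads(line)
    rescored = score(record["features"], record["model_version"])
    return math.isclose(rescored, record["score"])


line = log_line("req-1", {"recency": 2.0, "popularity": 1.0}, "v1")
reproduced = replay(line)
print(line, reproduced)
```

If `replay` returns `False`, the serving path and the offline path disagree, which is itself a useful finding.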
Step 2:
Input data issue or model issue?
Step 2.1:
Input data issue?
Are input values right?
Trick: use similar items or members to
estimate the range of typical values
Example: the language of an item was set
incorrectly
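The trick above can be sketched as a range check against similar items. How similar items are found is out of scope here. The runtime and language values, the margin, and the function names are hypothetical.

```python
def typical_range(values: list[float], margin: float = 0.5) -> tuple[float, float]:
    """Min/max of the similar items' values, widened by a relative margin."""
    lo, hi = min(values), max(values)
    pad = margin * (hi - lo)
    return lo - pad, hi + pad


def is_suspicious(value: float, similar_values: list[float],
                  margin: float = 0.5) -> bool:
    """True if a numeric input falls outside the range seen on similar items."""
    lo, hi = typical_range(similar_values, margin)
    return not (lo <= value <= hi)


def is_unseen_category(value: str, similar_values: list[str]) -> bool:
    """True if a categorical input never occurs among similar items."""
    return value not in set(similar_values)


# Hypothetical feature values taken from items similar to the one under study.
similar_runtimes = [92.0, 101.0, 88.0, 110.0]  # minutes
similar_languages = ["es", "es", "es", "pt"]

print(is_suspicious(95.0, similar_runtimes))         # within the typical range
print(is_suspicious(9500.0, similar_runtimes))       # far outside it
print(is_unseen_category("ja", similar_languages))   # worth a manual look
```

A flagged value is not proof of an error, only a pointer to where a human should look first.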
Step 2.2:
Model issue?
Need to inspect/interpret the model
There are many tools: SHAP, LIME, ...
Example: missing values were handled
incorrectly
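As a dependency-free illustration of model inspection, the sketch below uses single-feature ablation. This is a much simpler technique than SHAP or LIME and is not a substitute for them. The toy model, the feature names, and the `-999.0` missing-value sentinel are all invented to mimic the missing-value example.

```python
from typing import Callable

Features = dict[str, float]


def ablation_effects(model: Callable[[Features], float],
                     features: Features,
                     baseline: Features) -> dict[str, float]:
    """For each feature, how much the score changes when that feature alone
    is replaced by its baseline (typical) value."""
    full = model(features)
    effects = {}
    for name in features:
        perturbed = dict(features)
        perturbed[name] = baseline[name]
        effects[name] = full - model(perturbed)
    return effects


def toy_model(f: Features) -> float:
    # Hypothetical scorer with a bug: it treats the missing-value
    # sentinel -999.0 as if it were a real number.
    return 2.0 * f["popularity"] + 1.0 * f["days_since_release"]


item = {"popularity": 3.0, "days_since_release": -999.0}   # release date missing
typical = {"popularity": 1.0, "days_since_release": 10.0}

effects = ablation_effects(toy_model, item, typical)
culprit = max(effects, key=lambda name: abs(effects[name]))
print(effects, culprit)
```

Here one feature dominates the score by orders of magnitude, which points straight at the mishandled sentinel.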
Diagnosis lessons:
Set up logging to reproduce issue
Develop tools to check validity of inputs
Develop tools to inspect models
Resolution
It’s like software engineering
Hotfixes and long-term
solutions
Hotfixes in ML are challenging!
Models are highly optimized, so a hotfix
modification will lead to suboptimal results
Resolution lesson 1:
Have a collection of hotfixes ready
Understand their costs and trade-offs
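One hotfix that can be kept ready is a post-ranking override layer, sketched below. The `Hotfix` structure and its block and pin semantics are assumptions for illustration, not a description of the authors' system.

```python
from dataclasses import dataclass, field


@dataclass
class Hotfix:
    blocked: set[str] = field(default_factory=set)   # items to drop entirely
    pinned: list[str] = field(default_factory=list)  # items to force to the top


def apply_hotfix(ranking: list[str], hotfix: Hotfix) -> list[str]:
    """Apply overrides to the model's ranking without touching the model."""
    pinned = [item for item in hotfix.pinned if item not in hotfix.blocked]
    rest = [item for item in ranking
            if item not in hotfix.blocked and item not in pinned]
    return pinned + rest


ranking = ["a", "b", "c", "d"]  # hypothetical model output
fixed = apply_hotfix(ranking, Hotfix(blocked={"b"}, pinned=["d"]))
print(fixed)  # ['d', 'a', 'c']
```

The trade-off is visible in the code: every override replaces the model's optimized ordering with a manual one, so each hotfix should be small, tracked, and removed once a long-term fix ships.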
Resolution lesson 2:
With every issue, make RecSysOps better
Final lesson:
Make RecSysOps frictionless
Run checks on a regular basis
If human judgment is needed,
make all required information ready
Be able to deploy hotfixes
with a couple of clicks
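Running checks regularly, and handing a human everything needed to judge the result, could be sketched as a small check runner. The check names and messages are hypothetical, and scheduling (cron, a workflow engine) is left out.

```python
from typing import Callable

Check = Callable[[], list[str]]  # a check returns a list of problems found


def run_checks(checks: dict[str, Check]) -> dict[str, list[str]]:
    """Run every check and collect problems per check name.
    A crashing check is reported rather than stopping the whole run."""
    report = {}
    for name, check in checks.items():
        try:
            problems = check()
        except Exception as exc:
            problems = [f"check crashed: {exc!r}"]
        if problems:
            report[name] = problems
    return report


def broken_check() -> list[str]:
    raise RuntimeError("upstream table missing")


# Hypothetical checks; real ones would query logs, features, or model output.
checks = {
    "catalog_not_empty": lambda: [],
    "language_set": lambda: ["item 42 has no language"],
    "upstream_table": broken_check,
}
report = run_checks(checks)
for name, problems in report.items():
    print(name, problems)
```

Only failing checks appear in the report, so the person on call sees the problems and their context without digging.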
Questions?
Ehsan Saberian
@ehsan_saberian
Justin Basilico
@JustinBasilico
@NetflixResearch
