RecSysOps: Best Practices for Operating a Large-Scale Recommender System

Ehsan (Mohammad) Saberian,
Justin Basilico
RecSys 2021
2021-09-27
@ehsan_saberian, @JustinBasilico, @NetflixResearch

Large-scale RecSys is a complex operation

RecSys environment is dynamic
It changes every second!

New members
New items
New member interests
New ML models
Library updates
New code

How do we ensure that our
RecSys is working
correctly?

RecSysOps:
Lessons we learned while
operating a large RecSys

Benefits:
Reduce firefighting time
Focus on innovation
Build trust with our stakeholders

RecSysOps Components
Detection
Prediction
Diagnosis
Resolution

Detect issues quickly
The most challenging part
There are endless potential issues
Some of them we don’t know yet!

Detection lesson 1:
Implement all the known best practices
Unit tests, integration tests
MLOps: data/metrics checks
CI/CD, regular retraining

Detection lesson 2:
Monitor end-to-end your own way
Don’t rely only on partner teams’ audits
What does correct data
look like from your perspective?
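
One way to encode "what correct data looks like from your perspective" is a small set of explicit expectations checked against every batch you consume. The sketch below is illustrative only: the `Expectation` class, the field names, and the thresholds are assumptions, not something described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Expectation:
    """Hypothetical per-field rule describing what 'correct' means to us."""
    name: str
    min_value: float
    max_value: float
    max_missing_rate: float


def check_batch(rows, expectations):
    """Return human-readable violations for a batch of feature rows (dicts)."""
    if not rows:
        return ["batch is empty"]
    violations = []
    for exp in expectations:
        values = [row.get(exp.name) for row in rows]
        # Share of rows where the field is absent or None.
        missing_rate = sum(v is None for v in values) / len(rows)
        if missing_rate > exp.max_missing_rate:
            violations.append(
                f"{exp.name}: missing rate {missing_rate:.2f} "
                f"exceeds {exp.max_missing_rate:.2f}"
            )
        # Present values that fall outside the range we consider plausible.
        out_of_range = [
            v for v in values
            if v is not None and not (exp.min_value <= v <= exp.max_value)
        ]
        if out_of_range:
            violations.append(
                f"{exp.name}: {len(out_of_range)} value(s) outside "
                f"[{exp.min_value}, {exp.max_value}]"
            )
    return violations
```

Running such a check on your own side of the pipeline keeps the definition of "correct" under your control, independent of upstream audits.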

Detection lesson 3:
Understand your stakeholders’ concerns
Stakeholders: members and items

Detection lesson 3.1:
Every time a member plays something
that is ranked low by the model, it is a
potential issue
Monitor and analyze them
Get inspiration for future innovations
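
A minimal sketch of this monitoring idea, assuming each play event is logged together with the ranking the model produced for that member. The event keys (`member_id`, `item_id`, `ranking`) and the rank threshold are hypothetical choices for illustration.

```python
def find_low_ranked_plays(play_events, rank_threshold=50):
    """Return plays whose item sat below `rank_threshold` in the model ranking.

    Each event is a dict with hypothetical keys: member_id, item_id, and
    ranking (the ordered list of item ids the model produced).
    """
    flagged = []
    for event in play_events:
        try:
            # 1-based position of the played item in the model's ranking.
            rank = event["ranking"].index(event["item_id"]) + 1
        except ValueError:
            rank = None  # the played item was not ranked at all
        if rank is None or rank > rank_threshold:
            flagged.append(
                {
                    "member_id": event["member_id"],
                    "item_id": event["item_id"],
                    "rank": rank,
                }
            )
    return flagged
```

The flagged list is a starting point for analysis, not a verdict: some low-ranked plays are expected, and the interesting cases are the ones that recur.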

Engage with teams responsible for items
and understand their concerns

Is an item cold-started properly?
Is production bias hurting an item?

Build tools to detect their concerns and
integrate them into your system

Can you predict
issues before they
happen?

Netflix case:
Is it possible to predict if an item is going
to cold-start properly 7 days before its
launch date?

Yes, we can train a model to predict the
production model’s behaviour on the day of
launch
Flag any item with an unexpected prediction
and investigate
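
The flagging step could look roughly like the sketch below. It assumes the predicted launch-day scores already exist (produced by some proxy model that is not shown here) and that "unexpected" means falling outside a percentile band of scores seen for past launches. Both the band and the function name are assumptions for illustration, not the method used in the talk.

```python
from statistics import quantiles


def flag_unexpected(predicted_scores, past_launch_scores):
    """Return item ids whose predicted launch-day score looks unusual.

    predicted_scores: dict of item id -> predicted launch-day score
    past_launch_scores: scores observed for earlier launches (at least two)
    """
    # quantiles(..., n=20) returns 19 cut points; the first and last
    # approximate the 5th and 95th percentiles.
    cuts = quantiles(past_launch_scores, n=20)
    low, high = cuts[0], cuts[-1]
    return sorted(
        item_id
        for item_id, score in predicted_scores.items()
        if score < low or score > high
    )
```

Flagged items go to a human for investigation. The band is deliberately crude and only decides where to look first.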

Prediction Lesson:
Try to predict issues before they happen

Step 1:
Reproduce issue in isolation
Need sufficient advanced logging

Step 2:
Input data issue or model issue?

Step 2.1:
Input data issue?
Are input values right?
Trick: use similar items or members to
estimate the range of typical values
Example: the language of an item was set up
incorrectly
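
For numeric inputs, the "similar items" trick can be sketched with a robust range built from the median and the median absolute deviation (MAD). The choice of MAD and the multiplier `k` are assumptions made for this example. How the similar items are selected is left out.

```python
from statistics import median


def typical_range(similar_values, k=3.0):
    """Estimate a plausible range from the values of similar items or members."""
    med = median(similar_values)
    # Median absolute deviation: a spread estimate that tolerates outliers.
    mad = median(abs(v - med) for v in similar_values)
    return med - k * mad, med + k * mad


def is_suspicious(value, similar_values, k=3.0):
    """True when `value` falls outside the range typical for similar items."""
    low, high = typical_range(similar_values, k)
    return not (low <= value <= high)
```

Note that when all similar values are identical the MAD is zero, so any different value is reported as suspicious.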

Step 2.2:
Model issue?
Need to inspect/interpret the model
There are many tools: SHAP, LIME, ...
Example: missing values were handled
incorrectly
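
As a rough illustration of model inspection, the sketch below uses plain one-feature-at-a-time perturbation. This is a much simpler technique than SHAP or LIME and is not a substitute for them. The scoring function, feature names, and baseline values are all hypothetical.

```python
def perturbation_attribution(score_fn, features, baseline):
    """Estimate each feature's effect by swapping it for a baseline value.

    Returns a dict of feature name -> (original score - perturbed score).
    """
    base_score = score_fn(features)
    contributions = {}
    for name in features:
        perturbed = dict(features)
        perturbed[name] = baseline[name]
        contributions[name] = base_score - score_fn(perturbed)
    return contributions


def toy_score(features):
    """Hypothetical model that silently treats a missing popularity as 0."""
    popularity = features["popularity"] or 0.0
    return 2.0 * popularity + 0.5 * features["language_match"]


item = {"popularity": None, "language_match": 1.0}
typical = {"popularity": 0.5, "language_match": 0.0}
report = perturbation_attribution(toy_score, item, typical)
```

In this toy case the missing `popularity` value shows up as a large negative contribution relative to a typical item, which is the kind of signal that points to a missing-value handling problem.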

Diagnosis lessons:
Set up logging to reproduce issue
Develop tools to check validity of inputs
Develop tools to inspect models
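
A minimal sketch of logging that supports reproduction, assuming one JSON record per ranking request that stores the exact features and scores seen in production. The record layout and function names are illustrative assumptions.

```python
import json


def log_request(member_id, features, scores, model_version):
    """Serialize everything needed to replay one ranking request.

    features: dict of item id -> dict of feature values used for scoring
    scores: dict of item id -> score the production model returned
    """
    record = {
        "member_id": member_id,
        "model_version": model_version,
        "features": features,
        "scores": scores,
    }
    return json.dumps(record, sort_keys=True)


def replay(record_json, score_fn, tolerance=1e-9):
    """Re-score the logged features in isolation and report mismatches."""
    record = json.loads(record_json)
    mismatches = {}
    for item_id, item_features in record["features"].items():
        logged = record["scores"][item_id]
        recomputed = score_fn(item_features)
        if abs(recomputed - logged) > tolerance:
            mismatches[item_id] = (logged, recomputed)
    return mismatches
```

An empty result from `replay` means the issue reproduces with the logged inputs, so attention can move to whether those inputs or the model itself are at fault.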

It’s like software engineering
Hotfixes and long-term
solutions

Hotfixes in ML are challenging!
Models are highly optimized, and a hotfix
modification will lead to suboptimality

Resolution lesson 1:
Have a collection of hotfixes ready
Understand their costs and trade-offs
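
One possible shape for such a collection is a set of small post-ranking functions that can be composed on demand. The two hotfixes below (excluding items and pinning an item to a position) are hypothetical examples chosen for illustration. The talk does not list specific hotfixes.

```python
def exclude_items(blocked_ids):
    """Hotfix: drop specific items. Cost: the final list gets shorter."""
    blocked = set(blocked_ids)

    def apply(ranking):
        return [item for item in ranking if item not in blocked]

    return apply


def pin_item(item_id, position):
    """Hotfix: force an item to a position. Cost: overrides the model order."""

    def apply(ranking):
        rest = [item for item in ranking if item != item_id]
        rest.insert(position, item_id)
        return rest

    return apply


def apply_hotfixes(ranking, hotfixes):
    """Apply hotfixes in order to the ranking produced by the model."""
    for fix in hotfixes:
        ranking = fix(ranking)
    return ranking
```

Keeping each hotfix as a separate, documented function makes its trade-off explicit and makes it easy to remove once the long-term fix lands.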

Resolution lesson 2:
With every issue, make RecSysOps better

Final lesson:
Make RecSysOps frictionless

Run checks on a regular basis
If human judgment is needed,
make all required information ready
Be able to deploy hotfixes
with a couple of clicks

Questions?
Ehsan Saberian
@ehsan_saberian
Justin Basilico
@JustinBasilico
@NetflixResearch
