Optimal Learning 
for Fun and Profit with MOE 
Scott Clark 
SF Machine Learning 
10/15/14 
Joint work with: Eric Liu, Peter Frazier, Norases Vesdapunt, Deniz Oktay, Jialei Wang 
sclark@yelp.com @DrScottClark
Outline of Talk 
● Optimal Learning 
○ What is it? 
○ Why do we care? 
● Multi-armed bandits 
○ Definition and motivation 
○ Examples 
● Bayesian global optimization 
○ Optimal experiment design 
○ Uses to extend traditional A/B testing 
○ Examples 
● MOE: Metric Optimization Engine 
○ Examples and Features
What is optimal learning? 
Optimal learning addresses the challenge of how to collect information as efficiently as possible, primarily for settings where collecting information is time consuming and expensive. 
Prof. Warren Powell - optimallearning.princeton.edu 
What is the most efficient way to collect information? 
Prof. Peter Frazier - people.orie.cornell.edu/pfrazier 
How do we make the most money, as fast as possible? 
Me - @DrScottClark
Part I: 
Multi-Armed Bandits
What are multi-armed bandits? 
THE SETUP 
● Imagine you are in front of K slot machines. 
● Each one is set to "free play" (but you can still win $$$) 
● Each has a possibly different, unknown payout rate 
● You have a fixed amount of time to maximize payout 
GO!
What are multi-armed bandits? 
THE SETUP 
(math version)
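The formula on this slide did not survive extraction. One standard way to write the setup, offered here as an assumption rather than the slide's exact notation (the symbols p_i, a_t, X_t and T are chosen for illustration), is:

```latex
\begin{aligned}
&\text{Arms } i = 1, \dots, K, \text{ each with unknown payout rate } p_i \in [0, 1] \\
&\text{Pull } a_t \in \{1, \dots, K\} \text{ at time } t = 1, \dots, T \\
&\text{Reward } X_t \sim \mathrm{Bernoulli}(p_{a_t}) \\
&\text{Goal: choose } a_1, \dots, a_T \text{ to maximize } \mathbb{E}\Big[\sum_{t=1}^{T} X_t\Big]
\end{aligned}
```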
Real World Bandits 
Why do we care? 
● Maps well onto Click Through Rate (CTR) 
○ Each arm is an ad or search result 
○ Each click is a success 
○ Want to maximize clicks 
● Can be used in experiments (A/B testing) 
○ Want to find the best solutions, fast 
○ Want to limit how often bad solutions are used
Tradeoffs 
Exploration vs. Exploitation 
Gaining more knowledge about the system 
vs. 
Getting largest payout with current knowledge
Naive Example 
Epsilon First Policy 
● Sample sequentially εT < T times 
○ only explore 
● Pick the best and sample for t = εT+1, ..., T 
○ only exploit
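The epsilon-first policy can be sketched as a small simulation. This is a minimal sketch, assuming Bernoulli (win/lose) arms and round-robin exploration; the function name `epsilon_first` and its parameters are hypothetical and not taken from MOE.

```python
import random


def epsilon_first(payout_rates, total_pulls, epsilon, seed=0):
    """Simulate the epsilon-first policy on Bernoulli arms.

    payout_rates: true (hidden) win probability of each arm.
    Returns (total_wins, pulls_per_arm, wins_per_arm).
    """
    rng = random.Random(seed)
    k = len(payout_rates)
    pulls = [0] * k
    wins = [0] * k
    # The first epsilon * T pulls only explore.
    explore_pulls = round(epsilon * total_pulls)

    best_arm = None
    for t in range(total_pulls):
        if t < explore_pulls:
            arm = t % k  # sample the arms sequentially (round robin)
        else:
            if best_arm is None:
                # Commit once to the arm with the best observed ratio.
                ratios = [wins[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
                best_arm = max(range(k), key=lambda i: ratios[i])
            arm = best_arm  # only exploit
        pulls[arm] += 1
        if rng.random() < payout_rates[arm]:
            wins[arm] += 1
    return sum(wins), pulls, wins
```

Calling `epsilon_first([0.5, 0.8, 0.2], total_pulls=1000, epsilon=0.009)` explores for 9 pulls, three per arm, and then commits to whichever arm looked best on those 9 pulls.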
Example (K = 3, t = 0) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                0        0        0 
WINS:                 0        0        0 
RATIO:                -        -        - 
Example (K = 3, t = 1) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                1        0        0 
WINS:                 1        0        0 
RATIO:                1        -        - 
Example (K = 3, t = 2) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                1        1        0 
WINS:                 1        1        0 
RATIO:                1        1        - 
Example (K = 3, t = 3) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                1        1        1 
WINS:                 1        1        0 
RATIO:                1        1        0 
Example (K = 3, t = 4) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                2        1        1 
WINS:                 1        1        0 
RATIO:                0.5      1        0 
Example (K = 3, t = 5) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                2        2        1 
WINS:                 1        2        0 
RATIO:                0.5      1        0 
Example (K = 3, t = 6) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                2        2        2 
WINS:                 1        2        0 
RATIO:                0.5      1        0 
Example (K = 3, t = 7) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                3        2        2 
WINS:                 2        2        0 
RATIO:                0.66     1        0 
Example (K = 3, t = 8) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                3        3        2 
WINS:                 2        3        0 
RATIO:                0.66     1        0 
Example (K = 3, t = 9) 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                3        3        3 
WINS:                 2        3        1 
RATIO:                0.66     1        0.33 
Example (K = 3, t > 9) 
Exploit! 
Profit! 
Right?
What if our observed ratio is a poor approx? 
Unknown payout rate:  p = 0.5  p = 0.8  p = 0.2 
Observed Information 
PULLS:                3        3        3 
WINS:                 2        3        1 
RATIO:                0.66     1        0.33 
What if our observed ratio is a poor approx? 
Unknown payout rate:  p = 0.9  p = 0.5  p = 0.5 
Observed Information 
PULLS:                3        3        3 
WINS:                 2        3        1 
RATIO:                0.66     1        0.33 
Fixed exploration fails 
Regret is unbounded! 
Amount of exploration needs to depend on data 
We need better policies!
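To make "regret" concrete, here is one common definition, written as an assumption about what the slide means (the symbols R(T), p*, a_t, and the gap sum are chosen for illustration). Regret compares the policy against always pulling the best arm; for epsilon-first with round-robin exploration, the exploration phase alone already costs an amount proportional to T:

```latex
\begin{aligned}
R(T) &= T\,p^{*} - \mathbb{E}\Big[\sum_{t=1}^{T} p_{a_t}\Big], \qquad p^{*} = \max_{i} p_i \\
R(T) &\ge \frac{\varepsilon T}{K} \sum_{i=1}^{K} \big(p^{*} - p_i\big)
\quad \text{(epsilon-first, } \varepsilon T \text{ a multiple of } K\text{)}
\end{aligned}
```

With the payout rates p = 0.5, 0.8, 0.2 and 9 exploration pulls, the exploration phase costs (9/3)(0.3 + 0 + 0.6) = 2.7 expected wins. If the exploration length is instead a fixed number that does not grow with T, there is a constant, nonzero chance of committing to a suboptimal arm, and the expected loss from that mistake grows linearly with the remaining pulls.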
What should we do? 
Many different policies 
● Weighted random choice (another naive approach) 
● Epsilon-greedy 
○ Best arm so far with P=1-ε, random otherwise 
● Epsilon-decreasing* 
○ Best arm so far with P=1-(ε * exp(-rt)), random otherwise 
● UCB-exp* 
● UCB-tuned* 
● BLA* 
● SoftMax* 
● etc, etc, etc (60+ years of research) 
*Regret bounded as t->infinity
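The two epsilon policies above differ only in how the exploration probability is computed, which a short simulation can show. This is a minimal sketch assuming Bernoulli arms; `choose_arm`, `run_policy`, and `decay_rate` (the r in exp(-rt)) are hypothetical names, not MOE's API.

```python
import math
import random


def choose_arm(pulls, wins, explore_prob, rng):
    """Pick a random arm with probability explore_prob, else the best observed arm."""
    k = len(pulls)
    if rng.random() < explore_prob:
        return rng.randrange(k)
    ratios = [wins[i] / pulls[i] if pulls[i] else 0.0 for i in range(k)]
    return max(range(k), key=lambda i: ratios[i])


def run_policy(payout_rates, total_pulls, epsilon, decay_rate=0.0, seed=0):
    """decay_rate = 0 gives epsilon-greedy; decay_rate > 0 gives epsilon-decreasing."""
    rng = random.Random(seed)
    k = len(payout_rates)
    pulls, wins = [0] * k, [0] * k
    for t in range(total_pulls):
        # Exploration probability: epsilon * exp(-r * t).
        explore_prob = epsilon * math.exp(-decay_rate * t)
        arm = choose_arm(pulls, wins, explore_prob, rng)
        pulls[arm] += 1
        if rng.random() < payout_rates[arm]:
            wins[arm] += 1
    return pulls, wins
```

Setting `epsilon=0.0` turns this into pure exploitation, which can lock onto the first arm forever; that is the failure mode the exploration term exists to prevent.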
Bandits in the Wild 
What if... 
● Hardware constraints limit real-time knowledge? (batching) 
● Payoff noisy? Non-binary? Changes in time? (dynamic content) 
● Parallel sampling? (many concurrent users) 
● Arms expire? (events, news stories, etc) 
● You have knowledge of the user? (logged in, contextual history) 
● The number of arms increases? Continuous? (parameter search) 
Every problem is different. 
This is an active area of research.
Part II: 
Global Optimization
THE GOAL 
● Optimize some objective function 
○ CTR, revenue, delivery time, or some combination thereof 
● given some parameters 
○ config values, cutoffs, ML parameters 
● CTR = f(parameters) 
○ Find best parameters 
● We want to sample the underlying function as few times as possible 
(more mathy version)
Metric Optimization Engine 
A global, black box method for parameter optimization 
History of how past parameters have performed -> MOE -> New, optimal parameters
What does MOE do? 
● MOE optimizes a metric (like CTR) given some parameters as inputs (like scoring weights) 
● Given the past performance of different parameters MOE suggests new, optimal parameters to test 
Results of A/B tests run so far -> MOE -> New, optimal values to A/B test
Example Experiment 
Biz details distance in ad 
● Setting a different distance cutoff for each category to show "X miles away" text in biz_details ad 
● For each category we define a maximum distance 
Parameters + Obj Func 
distance_cutoffs = { 
    'shopping': 20.0, 
    'food': 14.0, 
    'auto': 15.0, 
    …} 
objective_function = { 
    'value': 0.012, 
    'std': 0.00013 
} 
MOE 
New Parameters 
distance_cutoffs = { 
    'shopping': 22.1, 
    'food': 7.3, 
    'auto': 12.6, 
    …} 
Run A/B Test
Why do we need MOE? 
● Parameter optimization is hard 
○ Finding the perfect set of parameters takes a long time 
○ Hope it is well behaved and try to move in the right direction 
○ Not possible as number of parameters increases 
● Intractable to find best set of parameters in all situations 
○ Thousands of combinations of program type, flow, category 
○ Finding the best parameters manually is impossible 
● Heuristics quickly break down in the real world 
○ Dependent parameters (changes to one change all others) 
○ Many parameters at once (location, category, map, place, ...) 
○ Non-linear (complexity and chaos break assumptions) 
MOE solves all of these problems in an optimal way
How does it work? 
MOE 
1. Build Gaussian Process (GP) with points sampled so far 
2. Optimize covariance hyperparameters of GP 
3. Find point(s) of highest Expected Improvement within parameter domain 
4. Return optimal next best point(s) to sample
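The loop above can be sketched end to end in plain Python. This is an illustration under stated assumptions, not MOE's implementation: it uses a one-dimensional input, a zero-mean GP with a squared exponential kernel, a fixed length scale (so step 2 is skipped), maximization of the objective, and a finite list of candidate points instead of a continuous optimizer. All function names are hypothetical.

```python
import math


def sq_exp_kernel(x1, x2, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance between two scalar inputs."""
    return signal_var * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def gp_posterior(xs, ys, x_new, length_scale=1.0, noise_var=1e-6):
    """Step 1: posterior mean and variance of a zero-mean GP at x_new."""
    n = len(xs)
    k_mat = [[sq_exp_kernel(xs[i], xs[j], length_scale)
              + (noise_var if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    k_star = [sq_exp_kernel(x, x_new, length_scale) for x in xs]
    alpha = solve(k_mat, ys)      # (K + noise)^-1 y
    beta = solve(k_mat, k_star)   # (K + noise)^-1 k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = sq_exp_kernel(x_new, x_new, length_scale) - sum(
        k * b for k, b in zip(k_star, beta))
    return mean, max(var, 0.0)  # clip tiny negative values from rounding


def expected_improvement(mean, var, best_so_far):
    """Step 3: E[max(f - best_so_far, 0)] when f ~ N(mean, var)."""
    std = math.sqrt(var)
    if std == 0.0:
        return max(mean - best_so_far, 0.0)
    z = (mean - best_so_far) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best_so_far) * cdf + std * pdf


def suggest_next_point(xs, ys, candidates, length_scale=1.0):
    """Step 4: return the candidate with the highest expected improvement."""
    best = max(ys)

    def ei(x):
        mean, var = gp_posterior(xs, ys, x, length_scale)
        return expected_improvement(mean, var, best)

    return max(candidates, key=ei)
```

For example, `suggest_next_point([0.0, 1.0, 2.0], [0.1, 0.5, 0.2], [i / 10 for i in range(31)])` returns the grid point where the model sees the best trade-off between a high predicted value and high uncertainty.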
Gaussian Processes 
Rasmussen and Williams GPML 
gaussianprocess.org 
Gaussian Processes 
Prior: 
Posterior: 
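The prior and posterior formulas on this slide were lost in extraction. The standard forms for a zero-mean GP with observation noise, as given in the Rasmussen and Williams book, are shown below; the notation (X and y for the sampled points and their values, x_* for a new point, K for the covariance matrix of X, k_* for the covariances between X and x_*, sigma_n^2 for the noise variance) is assumed here and may differ from the slide's.

```latex
\begin{aligned}
\text{Prior:}\quad & f \sim \mathcal{GP}\big(0,\; k(x, x')\big) \\
\text{Posterior:}\quad & f(x_*) \mid X, y \;\sim\; \mathcal{N}\big(\mu(x_*),\; \sigma^2(x_*)\big) \\
& \mu(x_*) = k_*^{\top} \big(K + \sigma_n^2 I\big)^{-1} y \\
& \sigma^2(x_*) = k(x_*, x_*) - k_*^{\top} \big(K + \sigma_n^2 I\big)^{-1} k_*
\end{aligned}
```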
Optimizing Covariance Hyperparameters 
Finding the GP model that fits best 
● All of these GPs are created with the same initial data 
○ with different hyperparameters (length scales) 
● Need to find the model that is most likely given the data 
○ Maximum likelihood, cross validation, priors, etc 
Rasmussen and Williams Gaussian Processes for Machine Learning
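For the maximum likelihood option, "most likely given the data" is usually measured with the log marginal likelihood from the Rasmussen and Williams book. The notation is an assumption: theta stands for the covariance hyperparameters (such as length scales), K_theta for the covariance matrix of the n sampled points including noise, and y for the observed values.

```latex
\log p(y \mid X, \theta)
= -\tfrac{1}{2}\, y^{\top} K_\theta^{-1} y
\;-\; \tfrac{1}{2} \log \lvert K_\theta \rvert
\;-\; \tfrac{n}{2} \log 2\pi
```

The first term rewards fitting the data, the second penalizes overly flexible models, and the hyperparameters are chosen to maximize the sum.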
Optimizing Covariance Hyperparameters 
Rasmussen and Williams Gaussian Processes for Machine Learning
Find point(s) of highest expected improvement 
We want to find the point(s) that are expected to beat the best point seen so far, by the most. 
[Jones, Schonlau, Welch 1998] 
[Clark, Frazier 2012]
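For a single point, expected improvement has a well-known closed form. The version below is written for maximization, which is an assumption made to match "beat the best point seen so far"; mu(x) and sigma(x) denote the GP posterior mean and standard deviation at x, f* the best value observed so far, and Phi and phi the standard normal CDF and PDF.

```latex
\begin{aligned}
\mathrm{EI}(x) &= \mathbb{E}\big[\max\big(f(x) - f^{*},\, 0\big)\big] \\
&= \big(\mu(x) - f^{*}\big)\,\Phi(z) + \sigma(x)\,\phi(z),
\qquad z = \frac{\mu(x) - f^{*}}{\sigma(x)}, \quad \sigma(x) > 0
\end{aligned}
```

When several points are suggested at once, the maximum is taken over all of them inside the expectation, and the result generally has to be estimated numerically rather than with this closed form.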
Tying it all Together #1: A/B Testing 
● Optimally assign traffic fractions for experiments (Multi-Armed Bandits) 
● Optimally suggest new cohorts to be run (Bayesian Global Optimization) 
Diagram labels: 
○ Users; App 
○ Experiment Framework: (users -> cohorts), (cohorts -> % traffic, params) 
○ Metric System (batch): Logs, Metrics, Results; daily/hourly batch; time consuming and expensive 
○ MOE: Multi-Armed Bandits, Bayesian Global Opt 
○ Data passed to MOE: cohorts -> params, params -> objective function 
○ Data returned by MOE: optimal cohort % traffic, optimal new params
Tying it all Together #2: Expensive Batch Systems 
● Optimally suggest new hyperparameters for the framework to minimize loss (Bayesian Global Optimization) 
Diagram labels: 
○ Big Data 
○ Machine Learning Framework: complex regression, deep learning system, etc; time consuming and expensive 
○ Metrics (framework output): Error, Loss, Likelihood, etc 
○ MOE: Bayesian Global Opt 
○ Data passed to MOE: Hyperparameters 
○ Data returned by MOE: optimal hyperparameters
Tying it all Together #3: Physical Experiments 
● Optimally allocate trial sizes (MAB) 
● Optimally suggest parameters for new patients (Bayesian Global Optimization) 
Diagram labels: 
○ Drug Trial: drug creation, FDA approval, expert admin; time consuming and expensive 
○ Evaluation of Results: Requires Expert; asynchronous results; time consuming and expensive 
○ Parameters: dosage, frequency, composition 
○ MOE: Multi-Armed Bandits, Bayesian Global Opt; conditions on outstanding experiments 
○ Data returned by MOE: optimal parameters; optimal experiments for new patients
What is MOE doing right now? 
MOE is now live in production 
● MOE is informing active experiments 
● MOE is successfully optimizing towards all given metrics 
● MOE treats the underlying system it is optimizing as a black box, allowing it to be easily extended to any system
MOE is Open Source! 
github.com/Yelp/MOE
MOE is Fully Documented 
yelp.github.io/MOE
MOE has Examples 
yelp.github.io/MOE/examples.html
● Multi-Armed Bandits 
○ Many policies implemented and more on the way 
● Global Optimization 
○ Bayesian Global Optimization via Expected Improvement on GPs
MOE is Easy to Install 
● yelp.github.io/MOE/install.html#install-in-docker 
● registry.hub.docker.com/u/yelpmoe/latest 
A MOE server is now running at http://localhost:6543
Questions? 
sclark@yelp.com 
@DrScottClark 
github.com/Yelp/MOE
References 
Gaussian Processes for Machine Learning 
Carl Edward Rasmussen and Christopher K. I. Williams. 2006. 
Massachusetts Institute of Technology. 55 Hayward St., Cambridge, MA 02142. 
http://www.gaussianprocess.org/gpml/ (free electronic copy) 
Parallel Machine Learning Algorithms In Bioinformatics and Global Optimization (PhD Dissertation) 
Part II, EPI: Expected Parallel Improvement 
Scott Clark. 2012. 
Cornell University, Center for Applied Mathematics. Ithaca, NY. 
https://github.com/sc932/Thesis 
Differentiation of the Cholesky Algorithm 
S. P. Smith. 1995. 
Journal of Computational and Graphical Statistics. Volume 4. Number 2. p134-147 
A Multi-points Criterion for Deterministic Parallel Global Optimization based on Gaussian Processes. 
David Ginsbourger, Rodolphe Le Riche, and Laurent Carraro. 2008. 
Département 3MI. Ecole Nationale Supérieure des Mines. 158 cours Fauriel, Saint-Etienne, France. 
{ginsbourger, leriche, carraro}@emse.fr 
Efficient Global Optimization of Expensive Black-Box Functions 
D. R. Jones, M. Schonlau, and W. J. Welch. 1998. 
Journal of Global Optimization, 13, 455-492.
Use Cases 
● Optimizing a system's click-through or conversion rate (CTR). 
○ MOE is useful when evaluating CTR requires running an A/B test on real user traffic, and getting statistically significant results requires running this test for a substantial amount of time (hours, days, or even weeks). Examples include setting distance thresholds, ad unit properties, or internal configuration values. 
○ http://engineeringblog.yelp.com/2014/10/using-moe-the-metric-optimization-engine-to-optimize-an-ab-testing-experiment-framework.html 
● Optimizing tunable parameters of a machine-learning prediction method. 
○ MOE can be used when calculating the prediction error for one choice of the parameters takes a long time, which might happen because the prediction method is complex and takes a long time to train, or because the data used to evaluate the error is huge. Examples include deep learning methods or hyperparameters of features in logistic regression.
More Use Cases 
● Optimizing the design of an engineering system. 
○ MOE helps when evaluating a design requires running a complex physics-based numerical simulation on a supercomputer. Examples include designing and modeling airplanes, the traffic network of a city, a combustion engine, or a hospital. 
● Optimizing the parameters of a real-world experiment. 
○ MOE can help guide design when every experiment needs to be physically created in a lab or very few experiments can be run in parallel. Examples include chemistry, biology, or physics experiments or a drug trial. 
● Any time sampling a tunable, unknown function is time consuming or expensive.
