Introductory Online Controlled Experiments
Bowen Li,
Staff Data Scientist @Vpon
2016/04/08
Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 1 / 35
Outline
1 Introduction
2 Online Experiment Procedure
3 Experiment Designs
4 Experiment Analytics
5 Further on Online Experiments
6 Discussions
Introduction
Applications
Validate segments for advertising
Enhance algorithms: search, ads, personalization, recommendation
Change apps, UI, content management system
Among many others
Motivations
Scientifically verify the hypothesis:
If a specific change is introduced, will it improve key metrics?
Establish a causal relationship:
Unlike data mining techniques, which find only correlation patterns
Why Online Experiment?
Intuition for assessing idea value is not reliable
Most ideas fail to improve key metrics:
Google: Only about 10% of experiments led to business changes
Netflix: Considers 90% of what they try to be wrong
Even small gains are aggregated across millions of users & events
Getting trustworthy results is hard
Shared pitfalls and puzzling results:
Kohavi et al (2010, 2012); Kohavi & Longbotham (2010)
Experiment Basics
Factor: Controlled variable thought to influence metric
Test factor: Its effect is of interest
Non-test factors: Their effects are of no interest
A/B test: Single factor with two levels
A vs. B
Control vs. Treatment
Existing vs. New
A/B/n test: 1 factor with more than two levels
Multivariable test: More than one factor
Variant: E.g. an A/B test has 2 experimental variants
Randomization unit: Chosen based on an independence assumption
Online Experiment Procedure (1/2)
Online Experiment Procedure (2/2)
A/B test procedure
1 Define
Overall Evaluation Criterion (OEC): Make decision
Metrics of interest: Find insights
2 Sample size calculation
3 Random assignment to Treatment & Control
4 Log data collection
5 Online monitor
6 Experiment analytics
7 Decision making based on OEC
Online Experiment Procedure: Details (1/5)
Step 1.1: Define OEC for decision making
Single metric: Incorporate tradeoff between metrics
Frequently, experiment will improve one metric but hurt another
Must be decided in advance
Otherwise the familywise Type-I error is inflated (later)
Guideline:
Bad OEC: Short-term profit, but not long-term
Good OEC: Drivers of lifetime value
E.g. sessions per user, repeated visits, conversion rates, etc
Step 1.2: Define metrics for finding insights
Compute many metrics
Must control False Discovery Rate (later)
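One common way to control the False Discovery Rate across many insight metrics is the Benjamini-Hochberg procedure. The slides do not name a specific method, so the sketch below is an assumption, and the p-values are made-up illustration data.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per metric: True if flagged as a discovery.

    Assumed method: Benjamini-Hochberg step-up procedure at FDR level q.
    """
    m = len(p_values)
    # Indices of the metrics sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    flagged = [False] * m
    for idx in order[:cutoff_rank]:
        flagged[idx] = True
    return flagged


# Hypothetical p-values for six insight metrics.
p_values = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
flags = benjamini_hochberg(p_values, q=0.05)
print(flags)
```

Testing each metric at 5% without such a correction would flag four of the six metrics here; the corrected procedure flags only the first two.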
Online Experiment Procedure: Details (2/5)
Step 3: Calculate sample size
Sample size based on 50%/50% of Treatment/Control
For maximum testing power
Determines how long the experiment runs
See later for details
Online Experiment Procedure: Details (3/5)
Step 4: Randomly assign users to Treatment or Control
George Box: "Block what you can control and randomize what you cannot"
Blocking: (later)
If some non-test factors can be controlled
Randomization: (later)
If these non-test factors cannot be controlled
In a consistent manner:
Same experience across a user’s repeated visits
Step 5: Collect log data
Collect logs for online monitor & experiment analytics
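A minimal sketch of consistent random assignment, assuming hash-based bucketing: hashing the user ID together with an experiment ID gives each user the same variant on every visit without storing any state. The function name `assign_variant`, the ID format, and the choice of SHA-256 are illustrative assumptions, not part of the original deck.

```python
import hashlib


def assign_variant(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same variant of a given experiment.
first_visit = assign_variant("user-42", "checkout-test")
second_visit = assign_variant("user-42", "checkout-test")
print(first_visit == second_visit)
```

Including the experiment ID in the hash key keeps assignments in different experiments from being tied to each other by construction.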
Online Experiment Procedure: Details (4/5)
Step 6: Online monitor
Treatment ramp-up
Initiate Treatment with a 0.1%/99.9% Treatment/Control split
Ramp up from 0.1% to 0.5%, 2.5%, 10%, 50%
At each step (run for hours), analyze data to catch egregious problems
These can be detected quickly on small samples
Sample ratio mismatch (SRM) graph:
Monitor (1) # users, (2) OEC/metrics, etc., in each variant over time
Interactions between overlapping experiments (later)
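Alongside the SRM graph, a sample ratio mismatch can be flagged numerically. The sketch below assumes a chi-square goodness-of-fit test with one degree of freedom on the user counts; the slides only mention monitoring a graph, so the test itself and the function name `srm_p_value` are assumptions.

```python
import math


def srm_p_value(n_control: int, n_treatment: int, expected_treatment_share: float = 0.5) -> float:
    """P-value of a chi-square test that the observed split matches the design."""
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for a 50%/50% design.
print(srm_p_value(5000, 5050))  # small imbalance, plausible by chance
print(srm_p_value(5000, 5300))  # larger imbalance, worth investigating
```

A very small p-value suggests a problem in assignment or logging rather than a real Treatment effect, so results from such a run should not be trusted until the cause is found.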
Online Experiment Procedure: Details (5/5)
Step 7: Experiment analytics
Compare Treatment’s & Control’s OEC distributions
Hypothesis testing for experiment effect
Estimation for experiment effect
See later for details
Experiment unit and analysis unit may differ; for example
Experiment unit: User
Analysis unit: User-Session
Apply Bootstrapping Technique, among others (later)
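When users are randomized but the metric is computed per session, sessions from the same user are not independent. One way to respect that, sketched below under the assumption of a percentile bootstrap, is to resample whole users rather than individual sessions. The data layout (`sessions_by_user` mapping a user ID to a list of 0/1 session outcomes) and the synthetic data are hypothetical.

```python
import random


def bootstrap_ci(sessions_by_user, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for a session-level mean, resampling at the user level."""
    rng = random.Random(seed)
    users = list(sessions_by_user.values())
    estimates = []
    for _ in range(n_boot):
        # Draw users with replacement; each draw carries all of that user's sessions.
        sample = rng.choices(users, k=len(users))
        total = sum(sum(sessions) for sessions in sample)
        count = sum(len(sessions) for sessions in sample)
        estimates.append(total / count)
    estimates.sort()
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


# Synthetic data for one variant: 300 users, 1 to 5 sessions each, about 10% conversion.
data_rng = random.Random(1)
sessions_by_user = {
    f"user{i}": [1 if data_rng.random() < 0.1 else 0 for _ in range(data_rng.randint(1, 5))]
    for i in range(300)
}
lo, hi = bootstrap_ci(sessions_by_user)
print(lo, hi)
```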
Statistics for Experiments (1/2)
Hypotheses for testing
Null hypothesis: H0
There is no difference between Treatment and Control
Any observed differences are due to random fluctuations
Alternative hypothesis: H1
Treatment is different from or better than Control
Testing null hypothesis: $H_0: O_B = O_A$
$O_X$: OEC for Treatment ($X = B$) and Control ($X = A$)
$\hat{O}_X$: Estimated OEC
Statistics for Experiments (2/2)
Hypothesis testing basics
Type-I error: Pr(reject H0 | H0 is true) = α
Probability of rejecting H0 when H0 is true (common: 5%)
Type-II error: Pr(not reject H0 | H0 is false) = β
Probability of not rejecting H0 when H0 is false
Confidence level: Pr(not reject H0 | H0 is true) = 1 − α
Probability of not rejecting H0 when H0 is true (common: 95%)
Power: Pr(reject H0 | H0 is false) = 1 − β
Probability of rejecting H0 when H0 is false (common: 80-95%)
Decision \ Condition | H0 is true | H0 is false (H1)
Reject H0 | Type-I error (α) | Power (1 − β)
Do not reject H0 | Confidence level (1 − α) | Type-II error (β)
Sample Size Calculation
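Hypothesis testing:
$H_0: O_B = O_A$, with desired confidence level $1 - \alpha$
$H_1: O_B - O_A = \Delta$, with desired power $1 - \beta$
Minimum sample size (per variant, with common standard deviation $\sigma$):
$$0 + z_{1-\alpha/2}\,\sigma\sqrt{\frac{2}{n}} = \Delta - z_{1-\beta}\,\sigma\sqrt{\frac{2}{n}} \;\Rightarrow\; n = \frac{\sigma^2}{\Delta^2}\,2\,(z_{1-\alpha/2} + z_{1-\beta})^2$$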
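The sample size formula can be evaluated directly. The sketch below assumes a two-sided test, equal-sized variants, and a known common standard deviation; the function name and the example inputs (σ = 1, Δ = 0.1) are illustrative.

```python
import math
from statistics import NormalDist


def min_sample_size_per_variant(sigma: float, delta: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 * sigma^2 / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)  # power = 1 - beta
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Detect a shift of 0.1 standard deviations with 95% confidence and 80% power.
n = min_sample_size_per_variant(sigma=1.0, delta=0.1)
print(n)  # 1570 users per variant
```

With α = 5% and 80% power the factor 2(z + z)² is about 15.7, which is where the familiar rule of thumb n ≈ 16σ²/Δ² comes from.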
Experiment Effect Testing & Estimation (1/2)
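Absolute effect: $O_B - O_A$
95% confidence interval (CI) for $O_B - O_A$:
$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$
$\hat{\sigma}_d$: Estimated standard deviation of $\hat{O}_B - \hat{O}_A$
See Appendix for derivations
Hypothesis testing for $O_B - O_A$: Based on the CI (reject $H_0$ if the CI excludes 0)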
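A minimal sketch of the absolute-effect interval, assuming the OEC is a per-user mean and the two variants are independent samples, so that the standard deviation of the difference is estimated as sqrt(s_A²/n_A + s_B²/n_B). That estimator and the toy data are assumptions; the slides leave the estimate of the standard deviation unspecified.

```python
import math
from statistics import mean, variance


def absolute_effect_ci(control, treatment, z: float = 1.96):
    """CI for mean(treatment) - mean(control) under a normal approximation."""
    diff = mean(treatment) - mean(control)
    sigma_d = math.sqrt(variance(control) / len(control)
                        + variance(treatment) / len(treatment))
    return diff - z * sigma_d, diff + z * sigma_d


# Toy per-user conversions: 50% in Control, 75% in Treatment.
control = [0, 1] * 50
treatment = [0, 1, 1, 1] * 25
lower, upper = absolute_effect_ci(control, treatment)
print(lower, upper)
```

Because this interval lies entirely above 0, the null hypothesis of no difference would be rejected at the 5% level for this toy data.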
Experiment Effect Testing & Estimation (2/2)
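Percent effect: $\frac{O_B - O_A}{O_A} \cdot 100\%$
95% confidence interval (CI):
$$\left(\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} + 1\right) \cdot \frac{1 \pm 1.96\sqrt{CV_A^2 + CV_B^2 - 1.96^2\,CV_A^2\,CV_B^2}}{1 - 1.96^2\,CV_A^2} - 1$$
$CV_B = \hat{\sigma}_B / \hat{O}_B$: Estimated coefficient of variation (CV); $CV_A$ is defined analogously
$\hat{\sigma}_B$: Estimated standard deviation of $\hat{O}_B$
See Appendix for derivations
Hypothesis testing: Based on the CI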
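The percent-effect interval can be computed from the two estimated OECs and their standard errors. The sketch below follows the Fieller-type formula on this slide; the function name and the example numbers (10% vs. 11% conversion, standard error 0.3 percentage points each) are made up for illustration.

```python
import math


def percent_effect_ci(o_a: float, se_a: float, o_b: float, se_b: float, z: float = 1.96):
    """Fieller-type CI for (O_B - O_A) / O_A, returned in percent."""
    cv_a = se_a / o_a
    cv_b = se_b / o_b
    denom = 1 - z ** 2 * cv_a ** 2
    if denom <= 0:
        # Control estimate is too noisy for the interval to be bounded.
        raise ValueError("1 - z^2 * CV_A^2 must be positive")
    root = z * math.sqrt(cv_a ** 2 + cv_b ** 2 - z ** 2 * cv_a ** 2 * cv_b ** 2)
    ratio = o_b / o_a
    lower = ratio * (1 - root) / denom - 1
    upper = ratio * (1 + root) / denom - 1
    return 100 * lower, 100 * upper


lo_pct, hi_pct = percent_effect_ci(o_a=0.10, se_a=0.003, o_b=0.11, se_b=0.003)
print(lo_pct, hi_pct)  # roughly 1.6% to 19.1% around a 10% point estimate
```

The interval is not symmetric around the 10% point estimate, which is expected for a ratio of two noisy quantities.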
Further Experiment Analytics
Reduce variance to increase power
Increase sample size: Lengthens the experiment
Adjust analysis units by features: May shorten the experiment
Pre-experiment user metrics
User demographics: gender, age, location
User behavior analytics: device, App
Among many others
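One possible way to use a pre-experiment user metric for variance reduction is a linear covariate adjustment (often called CUPED in the A/B testing literature). The slides do not name a method, so this choice, the function name, and the synthetic data are all assumptions.

```python
import random
from statistics import mean, variance


def covariate_adjust(metric, pre_metric):
    """Return metric - theta * (pre_metric - mean(pre_metric)).

    theta = Cov(pre_metric, metric) / Var(pre_metric) minimizes the variance
    of the adjusted values while leaving their mean unchanged.
    """
    x_bar, y_bar = mean(pre_metric), mean(metric)
    cov = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov / variance(pre_metric)
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]


# Synthetic users whose in-experiment metric tracks their pre-experiment metric.
rng = random.Random(0)
pre = [rng.gauss(10, 2) for _ in range(1000)]
post = [x + rng.gauss(0, 1) for x in pre]
adjusted = covariate_adjust(post, pre)
print(variance(post), variance(adjusted))
```

In an actual experiment, theta would be estimated once on the pooled Treatment and Control data and the same adjustment applied to both variants before comparing them.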
Validation of Experiments
A/A test (null test):
To test experimental & randomization setups
Assign users to variant groups, but expose them to the same experience
If the system is working properly, H0 should be retained in most runs
and rejected only about 5% of the time (at α = 5%)
Other application: Software migration
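The 5% figure can be checked by simulation. The sketch below runs many synthetic A/A tests in which both groups are drawn from the same normal distribution and counts how often a two-sided test at the 95% level rejects; the distribution, group size, and number of runs are arbitrary assumptions.

```python
import math
import random
from statistics import mean, variance


def aa_rejection_rate(n_tests=1000, n_per_group=100, z=1.96, seed=0):
    """Share of simulated A/A tests in which H0 is (falsely) rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Both groups receive the same experience: identical distributions.
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        se = math.sqrt(variance(a) / n_per_group + variance(b) / n_per_group)
        if abs(mean(b) - mean(a)) > z * se:
            rejections += 1
    return rejections / n_tests


rate = aa_rejection_rate()
print(rate)  # expected to be close to 0.05
```

A rejection rate far from 5% on real A/A traffic would point to a problem in randomization, logging, or the variance estimate.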
Limitations of Experiments (1/2)
Quantitative metrics, but no explanations:
Possible to know which is better and by how much, but not why
Long-term effects:
Online tests are typically run for short periods, e.g. a few days/weeks
Find good OEC metrics predicting long-term effects
Run experiments longer: Hard in practice due to survivorship bias in online cohorts:
Many cookies churn over time, especially in anonymous settings
Primacy effect & newness effect:
Run the experiment longer or compute OEC only for new users
Primacy effect:
Experienced users may be less efficient until they get used to Treatment
Newness effect:
When Treatment is introduced, some users click everywhere
Limitations of Experiments (2/2)
Feature must be implemented:
In early stages, use paper prototyping for quick feedback/refinements
Consistency:
Need a consistent experience for users
Overlapping experiments:
Previous experience: Strong interactions are rare in practice(?)
Initially, avoid tests that could interact
Perform pairwise tests: Flag interactions automatically
Launch event:
All users need to see it, so we cannot run an experiment
Other Practical Concerns (1/2)
Triggering
Example: Change to checkout page, which only 10% of users reach
Analyze only users who were exposed to the variants (checkout pages)
Reduce variance of treatment effect estimates
Automatic optimization
Run experiments to optimize areas amenable to automated search
Once an organization has a clear OEC
Multi-Armed Bandit Algorithm / Hoeffding Races (later)
Other Practical Concerns (2/2)
Robots removal
Their activity can severely bias results
Invoke Treatment assignment via JavaScript (client-side), not server-side
Exclude robots that reject cookies with unidentified requests
Exclude robots that do not delete cookies and have many actions
Robots removal approach:
List of known robots
Heuristics (Kohavi & Parekh, 2003)
Discussions
Online experiments are extremely important for building data products for various applications
For fast iteration, we will build an online experiment platform with
Random assignment to Treatment or Control
Online monitor for ramp-up, SRM, and interactions
Experiment analytics with data query, ETL and statistical inference
Next: Segment Validation SOP as the 1st application
References
Box et al. (2005). Statistics for Experimenters: Design, Innovation, and Discovery
Kohavi & Longbotham (Encyclopedia of MLDM, 2015). Online controlled experiments and A/B tests
Kohavi et al. (DMKD, 2009). Controlled experiments on the web: survey and practical guide
van Belle (2002). Statistical Rules of Thumb
Willan & Briggs (2006). Statistical Analysis of Cost-Effectiveness Data
Thank you for listening!
Appendix: Derivations of CI for Absolute Effect
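Under $H_0: O_B = O_A$,
$E(\hat{O}_B - \hat{O}_A) = 0$
$\mathrm{Var}(\hat{O}_B - \hat{O}_A)$ can be estimated by $\hat{\sigma}_d^2$
As sample size is large, by Central Limit Theorem
$$\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d} \xrightarrow{d} N(0, 1)$$
Thus
$$\Pr\left(\left|\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d}\right| \le 1.96\right) = 95\%$$
CI for absolute effect:
$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$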
Appendix: Derivations of CI for Percent Effect (1/2)
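Fieller (1954):
Define $R = O_B / O_A$
Obtain CI for $R$ based on $\hat{O}_B - R\,\hat{O}_A$
Apply Central Limit Theorem
$$\hat{O}_B - R\,\hat{O}_A \xrightarrow{d} N\left(0, \mathrm{Var}[\hat{O}_B - R\,\hat{O}_A]\right)$$
$\mathrm{Var}[\hat{O}_B - R\,\hat{O}_A] = \sigma_B^2 + R^2\sigma_A^2$ (since $\mathrm{Cov}(\hat{O}_B, \hat{O}_A) = 0$)
Thus, with $\sigma_A, \sigma_B$ replaced by their estimates $\hat{\sigma}_A, \hat{\sigma}_B$,
$$\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}} \xrightarrow{d} N(0, 1)$$
$$\Pr\left(\left|\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right| \le 1.96\right) = 95\%$$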
Appendix: Proof of CI for Percent Effect (2/2)
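CI for $R$: By solving the quadratic equation in $R$
$$\left(\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right)^2 = 1.96^2$$
$$R = \frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{CV_A^2 + CV_B^2 - 1.96^2\,CV_A^2\,CV_B^2}}{1 - 1.96^2\,CV_A^2}$$
where $CV_X = \hat{\sigma}_X / \hat{O}_X$ for $X = A, B$
Note: $\frac{O_B - O_A}{O_A} = \frac{O_B}{O_A} - 1$
CI for percent effect:
$$\frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{CV_A^2 + CV_B^2 - 1.96^2\,CV_A^2\,CV_B^2}}{1 - 1.96^2\,CV_A^2} - 1$$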
Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 35 / 35

More Related Content

PPTX
Bad Testing Metrics—and What To Do About Them
PPTX
Fundamental of Quality Data - Anthony Ndungu
PPTX
Test Fest and the Tale of Too Many Post-its
PDF
Test Fest and the Tale of Too Many Post-its
PPTX
Test Fest: Catching up on Your Usability Testing Backlog
PPTX
High volume test automation in practice
PPTX
Too good to be true? How validate your data
PDF
Mr201401 consideration for indicators of malware likeness based on static fil...
Bad Testing Metrics—and What To Do About Them
Fundamental of Quality Data - Anthony Ndungu
Test Fest and the Tale of Too Many Post-its
Test Fest and the Tale of Too Many Post-its
Test Fest: Catching up on Your Usability Testing Backlog
High volume test automation in practice
Too good to be true? How validate your data
Mr201401 consideration for indicators of malware likeness based on static fil...

Viewers also liked (12)

PPTX
Chpt8 how to do an experiment
PPTX
eTail East 2014 - Developing an Email Testing Roadmap
PPTX
Hybridapp 161209030125
PPTX
Androidaop 170105090257
PDF
如何快速將Ui設計流程套入新專案
PDF
Building a Testing Roadmap by Hazjier Pourkhalkhali - Optimizely Experience L...
PDF
Ab Testing
PDF
20160315 網路星期二:數據會說話 - 從NPO的網站分析談起
PDF
Growth Hacking
PPTX
eMetrics London - The AB Testing Hype Cycle
PPTX
網路廣告基礎入門
PPT
How to Write a Weekly Report
Chpt8 how to do an experiment
eTail East 2014 - Developing an Email Testing Roadmap
Hybridapp 161209030125
Androidaop 170105090257
如何快速將Ui設計流程套入新專案
Building a Testing Roadmap by Hazjier Pourkhalkhali - Optimizely Experience L...
Ab Testing
20160315 網路星期二:數據會說話 - 從NPO的網站分析談起
Growth Hacking
eMetrics London - The AB Testing Hype Cycle
網路廣告基礎入門
How to Write a Weekly Report
Ad

Similar to Introductory Online Controlled Experiments (20)

PDF
Faster and cheaper, smart ab experiments - public ver.
PPTX
A/B Testing at Scale
PPTX
Design of Experiments
PPTX
Machine Learning Powered A/B Testing
PDF
Microsoft guide controlled experiments
PPTX
10 Guidelines for A/B Testing
PDF
Experimental Design and Analysis for Psychology 1st Edition Herve Abdi
PDF
Talks@Coursera - A/B Testing @ Internet Scale
PPTX
Experimental research
PPT
design of experiments 25 desigfnExDesign.ppt
PPTX
UPDATED-MODULE 3-RESEARCH DESIGN.pptxxxx
PDF
Why do a designed experiment
PPTX
Planning of experiment in industrial research
PDF
OPTIMIZATION TECHNIQUES IN PHARMACEUTICAL SCIENCES
PPTX
Common Shortcomings in SE Experiments (ICSE'14 Doctoral Symposium Keynote)
PPT
ABE057-Design-of-Experiments.ppt
PDF
Consistent Transformation of Ratio Metrics for Efficient Online Controlled Ex...
PPTX
Cadds8.pptx
PPTX
Behavior Based Approach to Experiment Design
Faster and cheaper, smart ab experiments - public ver.
A/B Testing at Scale
Design of Experiments
Machine Learning Powered A/B Testing
Microsoft guide controlled experiments
10 Guidelines for A/B Testing
Experimental Design and Analysis for Psychology 1st Edition Herve Abdi
Talks@Coursera - A/B Testing @ Internet Scale
Experimental research
design of experiments 25 desigfnExDesign.ppt
UPDATED-MODULE 3-RESEARCH DESIGN.pptxxxx
Why do a designed experiment
Planning of experiment in industrial research
OPTIMIZATION TECHNIQUES IN PHARMACEUTICAL SCIENCES
Common Shortcomings in SE Experiments (ICSE'14 Doctoral Symposium Keynote)
ABE057-Design-of-Experiments.ppt
Consistent Transformation of Ratio Metrics for Efficient Online Controlled Ex...
Cadds8.pptx
Behavior Based Approach to Experiment Design
Ad

Recently uploaded (20)

PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PDF
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
PPTX
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
PDF
.pdf is not working space design for the following data for the following dat...
PPT
Predictive modeling basics in data cleaning process
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
PDF
Clinical guidelines as a resource for EBP(1).pdf
PPTX
Leprosy and NLEP programme community medicine
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPTX
IB Computer Science - Internal Assessment.pptx
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PDF
annual-report-2024-2025 original latest.
PPT
Quality review (1)_presentation of this 21
PDF
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
PDF
Galatica Smart Energy Infrastructure Startup Pitch Deck
PDF
Transcultural that can help you someday.
PDF
Business Analytics and business intelligence.pdf
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Data_Analytics_and_PowerBI_Presentation.pptx
PDF
Optimise Shopper Experiences with a Strong Data Estate.pdf
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
22.Patil - Early prediction of Alzheimer’s disease using convolutional neural...
MODULE 8 - DISASTER risk PREPAREDNESS.pptx
.pdf is not working space design for the following data for the following dat...
Predictive modeling basics in data cleaning process
Capcut Pro Crack For PC Latest Version {Fully Unlocked 2025}
Clinical guidelines as a resource for EBP(1).pdf
Leprosy and NLEP programme community medicine
Introduction-to-Cloud-ComputingFinal.pptx
IB Computer Science - Internal Assessment.pptx
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
annual-report-2024-2025 original latest.
Quality review (1)_presentation of this 21
BF and FI - Blockchain, fintech and Financial Innovation Lesson 2.pdf
Galatica Smart Energy Infrastructure Startup Pitch Deck
Transcultural that can help you someday.
Business Analytics and business intelligence.pdf
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Data_Analytics_and_PowerBI_Presentation.pptx
Optimise Shopper Experiences with a Strong Data Estate.pdf

Introductory Online Controlled Experiments

  • 1. Introductory Online Controlled Experiments Bowen Li, Staff Data Scientist @Vpon 2016/04/08 Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 1 / 35
  • 2. Outline 1 Introduction 2 Online Experiment Procedure 3 Experiment Designs 4 Experiment Analytics 5 Further on Online Experiments 6 Discussions Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 2 / 35
  • 3. Outline 1 Introduction 2 Online Experiment Procedure 3 Experiment Designs 4 Experiment Analytics 5 Further on Online Experiments 6 Discussions Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 3 / 35
  • 4. Introduction Applications Validate segments for advertising Enhance algorithms: search, ads, personalization, recommendation Change apps, UI, content management system Among many others,... Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 4 / 35
  • 5. Introduction Applications Validate segments for advertising Enhance algorithms: search, ads, personalization, recommendation Change apps, UI, content management system Among many others,... Motivations Verify scientifically the hypothesis: If a specific change is introduced, will it improve key metrics? Establish causal relationship: Unlike data mining techniques for finding correlation patterns Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 4 / 35
  • 6. Why Online Experiment? Intuition for assessing idea value is not reliable Most ideas fail to improve key metrics: Google: Only about 10% of experiments led to business changes Netflix: 90% of what they try to be wrong Even small gains are aggregated across millions of users & events Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 5 / 35
  • 7. Why Online Experiment? Intuition for assessing idea value is not reliable Most ideas fail to improve key metrics: Google: Only about 10% of experiments led to business changes Netflix: 90% of what they try to be wrong Even small gains are aggregated across millions of users & events Getting trustworthy results is hard Shared pitfalls and puzzling results: Kohavi et al (2010, 2012); Kohavi & Longbotham (2010) Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 5 / 35
  • 8. Experiment Basics Factor: Controlled variable thought to influence metric Test factor: Its effect is of interest Non-test factors: Its effects is of no interest A/B test: Single factor with two levels A vs. B Control vs. Treatment Existing vs. New A/B/n test: 1 factor with more than two levels Multivariable test: More than 1 factors Variant: E.g. A/B test has 2 experimental variants Randomization unit: Based on independent assumption Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 6 / 35
  • 9. Outline 1 Introduction 2 Online Experiment Procedure 3 Experiment Designs 4 Experiment Analytics 5 Further on Online Experiments 6 Discussions Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 7 / 35
  • 10. Online Experiment Procedure (1/2) Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 8 / 35
  • 11. Online Experiment Procedure (2/2) A/B test procedure 1 Define Overall Evaluation Criterion (OEC): Make decision Metrics of interest: Find insights 2 Sample size calculation 3 Random assignment to Treatment & Control 4 Log data collection 5 Online monitor 6 Experiment analytics 7 Decision making based on OEC Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 9 / 35
  • 12. Online Experiment Procedure: Details (1/5) Step 1.1: Define OEC for decision making Single metric: Incorporate tradeoff between metrics Frequently, experiment will improve one metric but hurt another Must be decided in advance Otherwise induce Familywise Type-I Error (later) Guideline: Bad OEC: Short-term profit, but not long-term Good OEC: Drivers of lifetime value E.g. sessions per user, repeated visits, conversion rates, etc Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 10 / 35
  • 13. Online Experiment Procedure: Details (1/5) Step 1.1: Define OEC for decision making Single metric: Incorporate tradeoff between metrics Frequently, experiment will improve one metric but hurt another Must be decided in advance Otherwise induce Familywise Type-I Error (later) Guideline: Bad OEC: Short-term profit, but not long-term Good OEC: Drivers of lifetime value E.g. sessions per user, repeated visits, conversion rates, etc Step 1.2: Define metrics for finding insights Compute many metrics Must control False Discovery Rate (later) Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 10 / 35
  • 14. Online Experiment Procedure: Details (2/5) Step 3: Calculate sample size Sample size based on 50%/50% of Treatment/Control For maximum testing power How long experiement runs See later for details Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 11 / 35
  • 15. Online Experiment Procedure: Details (3/5) Step 4: Assign randomly user to Treatment or Control George Box: "Block what you can control and randomize what you cannot" Blocking: (later) If can control some non-test factors Randomization: (later) If cannot control these non-test factors In consistent manner: Same experience in user’s repeated visits Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 12 / 35
  • 16. Online Experiment Procedure: Details (3/5) Step 4: Assign randomly user to Treatment or Control George Box: "Block what you can control and randomize what you cannot" Blocking: (later) If can control some non-test factors Randomization: (later) If cannot control these non-test factors In consistent manner: Same experience in user’s repeated visits Step 5: Collect log data Collect logs for online monitor & experiment analytics Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 12 / 35
  • 17. Online Experiment Procedure: Details (4/5) Step 6: Online monitor Treatment ramp-up Intiate Treatment with 0.1%/99.9% split Ramp up from 0.1% to 0.5%, 2.5%, 10%, 50% At each step (for hours), analyze data to prevent egregious problems Could be detected quickly on small samples Sample ratio mismatch (SRM) graph: Monitor (1) # users, (2) OEC/metrics, etc, in each variant, over time Interactions between overlapping experiments (later) Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 13 / 35
  • 18. Online Experiment Procedure: Details (5/5) Step 7: Experiment analytics Compare Treatment’s & Control’s OEC distributions Hypothesis testing for experiment effect Estimation for experiment effect See later for details Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 14 / 35
  • 19. Online Experiment Procedure: Details (5/5) Step 7: Experiment analytics Compare Treatment’s & Control’s OEC distributions Hypothesis testing for experiment effect Estimation for experiment effect See later for details May be defined with different units; for example Experiment unit: User Analysis unit: User-Session Apply Bootstrapping Technique, among others (later) Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 14 / 35
  • 20. Outline 1 Introduction 2 Online Experiment Procedure 3 Experiment Designs 4 Experiment Analytics 5 Further on Online Experiments 6 Discussions Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 15 / 35
  • 21. Statistics for Experiments (1/2) Hypotheses for testing Null hypothesis: H0 Treatment and Control are of no difference Any observed differences are due to random fluctuations Alternative hypothesis: H1 Treatment is different from or better than Control Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 16 / 35
  • 22. Statistics for Experiments (1/2) Hypotheses for testing Null hypothesis: H0 Treatment and Control are of no difference Any observed differences are due to random fluctuations Alternative hypothesis: H1 Treatment is different from or better than Control Testing null hypothesis: H0 : OB = OA OX : OEC for Treatment & Control for X = B & A respectively OX : Estimated OEC Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 16 / 35
  • 23. Statistics for Experiments (2/2) Hypothesis testing basics Type-I error: Pr(H1|H0) = α Probability of rejecting H0 when H0 is true (common: 5%) Type-II error: Pr(H0|H1) = β Probability of not rejecting H0 when H0 is false Confidence level: Pr(H0|H0) = 1 − α Probability of not rejecting H0 when H0 is true (common: 95%) Power: Pr(H1|H1) = 1 − β Probability of rejecting H0 when H0 is false (common: 80-95%) Decision/Condition H0 is true (H0) H0 is false (H1) Reject H0 (H1) Type-I error Power Not reject H0 (H0) Confidence level Type-II error Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 17 / 35
  • 24. Sample Size Calculation Hypothesis testing: H0 : OB = OA, with desired confidence level: 1 − α H1 : OB − OA = , with desired power: 1 − β Minimum sample size: 0 + z1−α/2σ 2 n = − z1−βσ 2 n ⇒ n = σ2 ∆2 2(z1−α/2 + z1−β)2 Bowen Li, Staff Data Scientist @Vpon Introductory Online Controlled Experiments 2016/04/08 18 / 35
  • 26. Experiment Effect Testing & Estimation (1/2)
    Absolute effect: $O_B - O_A$
    95% confidence interval (CI) for $O_B - O_A$: $\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$
      $\hat{\sigma}_d$: estimated standard deviation of $\hat{O}_B - \hat{O}_A$
      See Appendix for derivations
    Hypothesis testing for $O_B - O_A$: based on the CI (reject H0 if the CI excludes 0)
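As a rough illustration of the absolute-effect interval, the sketch below computes the 95% CI from two lists of per-user OEC values. The slide does not say how the standard deviation of the difference is estimated; this sketch assumes independent groups and uses the usual sum of the two squared standard errors. The function name and sample data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def absolute_effect_ci(control, treatment, z=1.96):
    """CI for O_B - O_A as (difference of means) +/- z * sigma_d."""
    diff = mean(treatment) - mean(control)
    # Assumption: independent groups, so Var(diff) = s_B^2/n_B + s_A^2/n_A
    sigma_d = sqrt(variance(treatment) / len(treatment)
                   + variance(control) / len(control))
    return diff - z * sigma_d, diff + z * sigma_d


# Hypothetical per-user OEC values
control = [1, 2, 3, 4, 5]
treatment = [2, 3, 4, 5, 6]
low, high = absolute_effect_ci(control, treatment)
significant = not (low <= 0 <= high)  # reject H0 only if the CI excludes 0
print(low, high, significant)         # roughly -0.96, 2.96, False
```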
  • 27. Experiment Effect Testing & Estimation (2/2)
    Percent effect: $\frac{O_B - O_A}{O_A} \cdot 100\%$
    95% confidence interval (CI):
      $\left(\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} + 1\right) \cdot \frac{1 \pm 1.96\sqrt{CV_A^2 + CV_B^2 - 1.96^2\, CV_A^2\, CV_B^2}}{1 - 1.96^2\, CV_A^2} - 1$
      $CV_B = \hat{\sigma}_B / \hat{O}_B$: estimated coefficient of variation (CV); $CV_A$ is defined in the same way
      $\hat{\sigma}_B$: estimated standard deviation of $\hat{O}_B$
      See Appendix for derivations
    Hypothesis testing: based on the CI
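The percent-effect interval is easier to check numerically than by eye. Below is a minimal sketch of the Fieller-style formula from this slide; the function name, argument names, and example numbers are assumptions for illustration. The inputs `se_a` and `se_b` are the estimated standard deviations of the two OEC estimates (that is, their standard errors).

```python
from math import sqrt


def percent_effect_ci(o_a, o_b, se_a, se_b, z=1.96):
    """Fieller-style CI for (O_B - O_A) / O_A, returned as fractions."""
    cv_a = se_a / o_a  # coefficient of variation of the Control estimate
    cv_b = se_b / o_b  # coefficient of variation of the Treatment estimate
    denom = 1 - z ** 2 * cv_a ** 2
    if denom <= 0:
        # The Control estimate is too noisy relative to its mean;
        # the interval is unbounded in that case.
        raise ValueError("CV_A too large for a bounded interval")
    half = z * sqrt(cv_a ** 2 + cv_b ** 2 - z ** 2 * cv_a ** 2 * cv_b ** 2)
    ratio = o_b / o_a  # equals percent effect + 1
    return ratio * (1 - half) / denom - 1, ratio * (1 + half) / denom - 1


# Hypothetical estimates: Control mean 10, Treatment mean 11, SE 0.2 each
low, high = percent_effect_ci(o_a=10.0, o_b=11.0, se_a=0.2, se_b=0.2)
print(f"{low:.1%} to {high:.1%}")  # an interval around the observed +10%
```

The interval is not symmetric around the point estimate, which is expected for a ratio of two noisy quantities.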
  • 28. Further Experiment Analytics
    Reduce variance to increase power:
      Increase sample size: lengthens the experiment
      Adjust analysis units by features: may shorten the experiment
        Pre-experiment user metrics
        User demographics: gender, age, location
        User behavior analytics: device, app
        Among many others
  • 30. Validation of Experiments
    A/A test (null test): tests the experimental and randomization setup
      Assign users to variant groups, but expose them to the same experience
      If the system is working properly, H0 should be rejected only about 5% of the time (at the 95% confidence level)
    Other application: software migration
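The 5% figure for A/A tests can be illustrated with a small simulation. This is only a sketch under simple assumptions: users are drawn from one standard normal distribution (so H0 holds by construction), each test uses the absolute-effect CI with z = 1.96, and the function name and parameters are hypothetical.

```python
import random
from math import sqrt


def _mean_and_var(xs):
    """Sample mean and unbiased sample variance of a list of floats."""
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


def aa_rejection_rate(n_tests=1000, n_users=200, z=1.96, seed=0):
    """Fraction of simulated A/A tests in which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_tests):
        # Both groups get the same experience: same underlying distribution
        a = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        mean_a, var_a = _mean_and_var(a)
        mean_b, var_b = _mean_and_var(b)
        sigma_d = sqrt(var_a / n_users + var_b / n_users)
        if abs(mean_b - mean_a) > z * sigma_d:
            rejections += 1
    return rejections / n_tests


rate = aa_rejection_rate()
print(rate)  # should be close to 0.05 if assignment and analysis are sound
```

In a real platform, a rejection rate far from 5% across many A/A tests would point to a problem in randomization, logging, or the variance estimate.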
  • 31. Limitations of Experiments (1/2)
    Quantitative metrics, but no explanations: we can learn which variant is better and by how much, but not why
    Long-term effects: online tests typically run for short periods, e.g. a few days or weeks
      Find good OEC metrics that predict long-term effects
      Run experiments longer: hard in practice because of survivorship bias in online cohorts, since many cookies churn, especially in anonymous settings
    Primacy effect & newness effect: run the experiment longer, or compute the OEC only for new users
      Primacy effect: experienced users may be less efficient until they get used to the Treatment
      Newness effect: when the Treatment is introduced, some users click everywhere
  • 32. Limitations of Experiments (2/2)
    Feature must be implemented: in early stages, use paper prototyping for quick feedback and refinement
    Consistency: users need a consistent experience
    Overlapping experiments:
      Previous experience: strong interactions are rare in practice(?)
      Initially, avoid tests that could interact
      Perform pairwise tests: flag interactions automatically
    Launch event: all users need to see it, so we cannot run an experiment
  • 34. Other Practical Concerns (1/2)
    Triggering
      Example: a change to the checkout page, which only 10% of users reach
      Analyze only users who were exposed to the variants (checkout pages)
      This reduces the variance of treatment effect estimates
    Automatic optimization
      Run experiments to optimize areas amenable to automated search, once an organization has a clear OEC
      Multi-Armed Bandit Algorithm / Hoeffding Races (later)
  • 35. Other Practical Concerns (2/2)
    Robots removal
      Their activity can severely bias results
      Perform Treatment assignment in client-side JavaScript, not server-side
      Exclude robots that reject cookies with unidentified requests
      Exclude robots that do not delete cookies and have many actions
    Robots removal approach:
      List of known robots
      Heuristics (Kohavi & Parekh, 2003)
  • 37. Discussions
    Online experiments are extremely important for building data products across applications
    For fast iteration, we will build an online experimentation platform with:
      Random assignment to Treatment or Control
      Online monitoring for ramp-up, SRM, and interactions
      Experiment analytics with data query, ETL, and statistical inference
    Next: Segment Validation SOP as the first application
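One of the monitors listed above, SRM (sample ratio mismatch), can be sketched as a goodness-of-fit check on the observed group sizes. The deck does not specify a method; the version below assumes a chi-square test with one degree of freedom, for which the p-value has the closed form erfc(sqrt(x/2)). The function name, the counts, and the 0.001 alert threshold are hypothetical.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_treatment_share=0.5):
    """P-value of a chi-square (1 df) test that the observed split matches the design."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_treatment - exp_treatment) ** 2 / exp_treatment
            + (n_control - exp_control) ** 2 / exp_control)
    # For 1 degree of freedom: P(chi-square > x) = erfc(sqrt(x / 2))
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for an experiment designed as a 50/50 split
p = srm_p_value(n_control=5000, n_treatment=5300)
if p < 0.001:  # assumed alert threshold
    print("Possible sample ratio mismatch; check assignment and logging")
else:
    print(f"No strong evidence of SRM (p = {p:.4f})")
```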
  • 38. References
    Box et al. (2005). Statistics for experimenters: design, innovation, and discovery
    Kohavi & Longbotham (Encyclopedia of MLDM, 2015). Online controlled experiments and A/B testing
    Kohavi et al. (DMKD, 2009). Controlled experiments on the web: survey and practical guide
    van Belle (2002). Statistical rules of thumb
    Willan & Briggs (2006). Statistical analysis of cost-effectiveness data
  • 39. Thank you for listening!
  • 40. Appendix: Derivations of CI for Absolute Effect
    Under $H_0: O_B = O_A$, $E(\hat{O}_B - \hat{O}_A) = 0$
    $\mathrm{Var}(\hat{O}_B - \hat{O}_A)$ can be estimated by $\hat{\sigma}_d^2$
    As the sample size is large, by the Central Limit Theorem: $\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d} \xrightarrow{d} N(0, 1)$
    Thus $\Pr\left(\left|\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d}\right| \le 1.96\right) = 95\%$
    CI for absolute effect: $\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$
  • 41. Appendix: Derivations of CI for Percent Effect (1/2)
    Fieller (1954): define $R = O_B / O_A$ and obtain a CI for $R$ based on $\hat{O}_B - R\,\hat{O}_A$
    Apply the Central Limit Theorem: $\hat{O}_B - R\,\hat{O}_A \xrightarrow{d} N\left(0, \mathrm{Var}[\hat{O}_B - R\,\hat{O}_A]\right)$
    $\mathrm{Var}[\hat{O}_B - R\,\hat{O}_A] = \sigma_B^2 + R^2\sigma_A^2$ (since $\mathrm{Cov}(\hat{O}_B, \hat{O}_A) = 0$)
    Thus $\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\sigma_B^2 + R^2\sigma_A^2}} \xrightarrow{d} N(0, 1)$
    $\Pr\left(\left|\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\sigma_B^2 + R^2\sigma_A^2}}\right| \le 1.96\right) = 95\%$
  • 42. Appendix: Proof of CI for Percent Effect (2/2)
    CI for $R$: solve the quadratic equation in $R$
      $\left(\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\sigma_B^2 + R^2\sigma_A^2}}\right)^2 = 1.96^2$
      $R = \frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{CV_A^2 + CV_B^2 - 1.96^2\, CV_A^2\, CV_B^2}}{1 - 1.96^2\, CV_A^2}$
    Note: $\frac{O_B - O_A}{O_A} = \frac{O_B}{O_A} - 1$
    CI for percent effect: $\frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{CV_A^2 + CV_B^2 - 1.96^2\, CV_A^2\, CV_B^2}}{1 - 1.96^2\, CV_A^2} - 1$