Introductory Online Controlled Experiments

Bowen Li,
Staff Data Scientist @Vpon
2016/04/08

Outline
1 Introduction
2 Online Experiment Procedure
3 Experiment Designs
4 Experiment Analytics
5 Further on Online Experiments
6 Discussions

Introduction
Applications
Validate segments for advertising
Enhance algorithms: search, ads, personalization, recommendation
Change apps, UI, content management system
Among many others,...
Motivations
Verify scientifically the hypothesis:
If a specific change is introduced, will it improve key metrics?
Establish causal relationship:
Unlike data mining techniques for finding correlation patterns

Why Online Experiment?
Intuition for assessing idea value is not reliable
Most ideas fail to improve key metrics:
Google: Only about 10% of experiments led to business changes
Netflix: Considers 90% of what they try to be wrong
Even small gains are aggregated across millions of users & events
Getting trustworthy results is hard
Shared pitfalls and puzzling results:
Kohavi et al (2010, 2012); Kohavi & Longbotham (2010)

Experiment Basics
Factor: Controlled variable thought to influence metric
Test factor: Its effect is of interest
Non-test factors: Their effects are of no interest
A/B test: Single factor with two levels
A vs. B
Control vs. Treatment
Existing vs. New
A/B/n test: 1 factor with more than two levels
Multivariable test: More than one factor
Variant: E.g. an A/B test has 2 experimental variants
Randomization unit: Chosen based on the independence assumption

Online Experiment Procedure (1/2)

Online Experiment Procedure (2/2)
A/B test procedure
1 Define
Overall Evaluation Criterion (OEC): Make decision
Metrics of interest: Find insights
2 Sample size calculation
3 Random assignment to Treatment & Control
4 Log data collection
5 Online monitor
6 Experiment analytics
7 Decision making based on OEC

Online Experiment Procedure: Details (1/5)
Step 1.1: Define OEC for decision making
Single metric: Incorporate tradeoff between metrics
Frequently, an experiment will improve one metric but hurt another
Must be decided in advance
Otherwise the familywise Type-I error is inflated (later)
Guideline:
Bad OEC: Short-term profit that ignores long-term value
Good OEC: Drivers of lifetime value
E.g. sessions per user, repeat visits, conversion rates, etc.
Step 1.2: Define metrics for finding insights
Compute many metrics
Must control False Discovery Rate (later)

Step 2: Calculate sample size
Sample size based on a 50%/50% Treatment/Control split
For maximum statistical power
Determines how long the experiment runs
See later for details

Step 3: Randomly assign users to Treatment or Control
George Box: "Block what you can control and randomize what you cannot"
Blocking: (later)
If can control some non-test factors
Randomization: (later)
If cannot control these non-test factors
In a consistent manner:
Same experience across a user’s repeated visits
Step 4: Collect log data
Collect logs for online monitor & experiment analytics
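
One common way to get random yet consistent assignment is to hash the user ID together with an experiment name, so a returning user always lands in the same variant. The sketch below assumes that approach; the slides do not prescribe one. The function name `assign_variant`, the experiment label, and the `treatment_share` parameter are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'treatment' or 'control' (illustrative sketch)."""
    # Hash the experiment name with the user ID so that different experiments
    # split the same user population independently (an assumed design choice).
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # Interpret the first 8 hex digits as a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "treatment" if bucket < treatment_share else "control"

# The same user gets the same variant on every visit.
first_visit = assign_variant("user-42", "segment-validation")
second_visit = assign_variant("user-42", "segment-validation")
```

Raising `treatment_share` during ramp-up keeps users who were already in Treatment there, because their bucket value does not change.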

Step 5: Online monitor
Treatment ramp-up
Initiate Treatment with a 0.1%/99.9% split
Ramp up from 0.1% to 0.5%, 2.5%, 10%, 50%
At each step (lasting hours), analyze data to catch egregious problems
These can be detected quickly on small samples
Sample ratio mismatch (SRM) graph:
Monitor (1) # users, (2) OEC/metrics, etc., in each variant, over time
Interactions between overlapping experiments (later)
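
Besides plotting user counts over time, a sample ratio mismatch can be flagged with a chi-square goodness-of-fit test on the number of users per variant. This is a minimal sketch under that assumption; the slides mention only the SRM graph. `srm_p_value` is a hypothetical name, and any alerting threshold is a judgment call.

```python
import math

def srm_p_value(n_control: int, n_treatment: int, expected_treatment_share: float = 0.5) -> float:
    """P-value of a 1-degree-of-freedom chi-square test of the observed split.

    expected_treatment_share must lie strictly between 0 and 1.
    """
    total = n_control + n_treatment
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((n_treatment - expected_treatment) ** 2 / expected_treatment
            + (n_control - expected_control) ** 2 / expected_control)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# A very small p-value suggests the assignment or logging is broken.
suspicious = srm_p_value(5000, 5300) < 0.01
```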

Step 6: Experiment analytics
Compare Treatment’s & Control’s OEC distributions
Hypothesis testing for experiment effect
Estimation of experiment effect
Experiment and analysis units may differ; for example
Experiment unit: User
Analysis unit: User-Session
Apply Bootstrapping Technique, among others (later)
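
When users are randomized but the metric is computed per session, sessions of the same user are not independent. One way to handle this is to bootstrap whole users. The sketch below assumes a clicks-per-session OEC and a percentile interval; both choices, and the name `bootstrap_ci_diff`, are illustrative and not taken from the slides.

```python
import random

def bootstrap_ci_diff(control, treatment, n_boot=2000, seed=0):
    """95% percentile bootstrap CI for the difference in clicks per session.

    control / treatment: one (clicks, sessions) pair per user.
    """
    rng = random.Random(seed)

    def clicks_per_session(users):
        return sum(c for c, _ in users) / sum(s for _, s in users)

    diffs = []
    for _ in range(n_boot):
        # Resample users (the experiment unit) with replacement, so all
        # sessions of a sampled user stay together.
        c = rng.choices(control, k=len(control))
        t = rng.choices(treatment, k=len(treatment))
        diffs.append(clicks_per_session(t) - clicks_per_session(c))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]
```

If the resulting interval excludes 0, the difference is significant at roughly the 5% level.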

Statistics for Experiments (1/2)
Hypotheses for testing
Null hypothesis: H0
Treatment and Control are of no difference
Any observed differences are due to random fluctuations
Alternative hypothesis: H1
Treatment is different from or better than Control
Testing null hypothesis: $H_0 : O_B = O_A$
$O_X$: OEC for Treatment and Control, for X = B and A respectively
$\hat{O}_X$: Estimated OEC

Hypothesis testing basics
Type-I error: Pr(H1|H0) = α
Probability of rejecting H0 when H0 is true (common: 5%)
Type-II error: Pr(H0|H1) = β
Probability of not rejecting H0 when H0 is false
Confidence level: Pr(H0|H0) = 1 − α
Probability of not rejecting H0 when H0 is true (common: 95%)
Power: Pr(H1|H1) = 1 − β
Probability of rejecting H0 when H0 is false (common: 80-95%)
Decision / Condition   | H0 is true (H0)   | H0 is false (H1)
Reject H0 (H1)         | Type-I error      | Power
Not reject H0 (H0)     | Confidence level  | Type-II error

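Sample Size Calculation
Hypothesis testing:
$H_0 : O_B = O_A$, with desired confidence level: $1 - \alpha$
$H_1 : O_B - O_A = \Delta$, with desired power: $1 - \beta$
Minimum sample size (per variant):
$$0 + z_{1-\alpha/2}\,\sigma\sqrt{\frac{2}{n}} = \Delta - z_{1-\beta}\,\sigma\sqrt{\frac{2}{n}} \;\Rightarrow\; n = \frac{\sigma^2}{\Delta^2}\, 2\,(z_{1-\alpha/2} + z_{1-\beta})^2$$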
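
The minimum sample size n = 2σ²(z₁₋α/₂ + z₁₋β)²/Δ² can be evaluated directly. The sketch below does so with the standard library; the function name and default values are illustrative assumptions. σ is taken to be the per-unit standard deviation of the OEC, and Δ the smallest absolute difference worth detecting.

```python
import math
from statistics import NormalDist

def sample_size_per_variant(sigma: float, delta: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in each of Treatment and Control (50%/50% split)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # z_{1 - alpha/2}
    z_beta = NormalDist().inv_cdf(power)           # z_{1 - beta}
    return math.ceil(2 * (sigma / delta) ** 2 * (z_alpha + z_beta) ** 2)

n = sample_size_per_variant(sigma=1.0, delta=0.1)
```

With σ = 1, Δ = 0.1, α = 5% and 80% power this gives 1,570 users per variant, close to the 16σ²/Δ² rule of thumb. Halving Δ roughly quadruples the requirement.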

Experiment Effect Testing & Estimation (1/2)
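Absolute effect: $\hat{O}_B - \hat{O}_A$ (difference of the estimated OECs)
95% confidence interval (CI) for $O_B - O_A$:
$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$
$\hat{\sigma}_d$: Estimated standard deviation of $\hat{O}_B - \hat{O}_A$
See Appendix for derivations
Hypothesis testing for $O_B - O_A$: Based on the CI (reject H0 if the CI excludes 0)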
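
As a minimal sketch of the 95% CI for the absolute effect, the code below assumes the OEC is a per-user mean and that Treatment and Control are independent samples, so the variance of the difference is the sum of the two variances of the means. The function name is hypothetical.

```python
import math
from statistics import mean, variance

def absolute_effect_ci(control, treatment, z: float = 1.96):
    """CI for mean(treatment) - mean(control) using the normal approximation."""
    diff = mean(treatment) - mean(control)
    # Estimated standard deviation of the difference of the two means.
    sigma_d = math.sqrt(variance(control) / len(control)
                        + variance(treatment) / len(treatment))
    return diff - z * sigma_d, diff + z * sigma_d

# Example: conversion indicators (0/1) per user.
low, high = absolute_effect_ci([0, 1] * 50, [0, 1, 1, 1] * 25)
significant = low > 0 or high < 0  # CI excludes 0, so H0 is rejected
```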

Experiment Effect Testing & Estimation (2/2)
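Percent effect: $\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} \cdot 100\%$
95% confidence interval (CI):
$$\left(\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} + 1\right) \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$
$\widehat{CV}_B = \hat{\sigma}_B / \hat{O}_B$: Estimated coefficient of variation (CV)
$\hat{\sigma}_B$: Estimated standard deviation of $\hat{O}_B$
See Appendix for derivations
Hypothesis testing: Based on the CI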
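
The percent-effect interval (Fieller's method) is (O_B/O_A) · [1 ± 1.96·sqrt(CV_A² + CV_B² − 1.96²·CV_A²·CV_B²)] / (1 − 1.96²·CV_A²) − 1, with CV_X = σ_X/O_X. The sketch below evaluates it. It assumes the inputs are the estimated OECs and the estimated standard deviations of those estimates (standard errors), and that the two estimates are uncorrelated. The function name is hypothetical.

```python
import math

def percent_effect_ci(o_a: float, o_b: float, sigma_a: float, sigma_b: float,
                      z: float = 1.96):
    """Fieller-type CI for (O_B - O_A) / O_A, returned in percent."""
    cv_a = sigma_a / o_a  # estimated coefficient of variation of Control
    cv_b = sigma_b / o_b  # estimated coefficient of variation of Treatment
    denom = 1 - z ** 2 * cv_a ** 2
    if denom <= 0:
        # Control estimate is too noisy for a bounded interval.
        raise ValueError("CV of Control is too large for a finite interval")
    half = z * math.sqrt(cv_a ** 2 + cv_b ** 2 - z ** 2 * cv_a ** 2 * cv_b ** 2)
    ratio = o_b / o_a
    lower = ratio * (1 - half) / denom - 1
    upper = ratio * (1 + half) / denom - 1
    return 100 * lower, 100 * upper

pct_low, pct_high = percent_effect_ci(o_a=10.0, o_b=11.0, sigma_a=0.1, sigma_b=0.1)
```

The interval is not symmetric around the point estimate of 10%, which is expected for a ratio.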

Further Experiment Analytics
Reducing variance to increase power
Increase sample size: Will increase experiment length
Adjust for features of the analysis units: May shorten experiment length
Pre-experiment user metrics
User demographics: gender, age, location
User behavior analytics: device, App
Among many others
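
One way to adjust for a pre-experiment user metric is linear regression adjustment: subtract the part of the metric explained by the pre-experiment value. The slides do not name a specific method, so this is an assumed technique (the idea behind CUPED-style adjustment), and the function name is hypothetical. The coefficient should be computed on Treatment and Control pooled together, so the same adjustment applies to both.

```python
from statistics import mean, variance

def adjust_with_pre_metric(y, x):
    """Return y adjusted by a pre-experiment covariate x (same users, same order)."""
    mx, my = mean(x), mean(y)
    # Sample covariance between the covariate and the metric.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(y) - 1)
    theta = cov / variance(x)
    # Centering x keeps the mean of y unchanged; only the variance shrinks.
    return [b - theta * (a - mx) for a, b in zip(x, y)]

pre = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]          # e.g. sessions before the experiment
during = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]     # sessions during the experiment
adjusted = adjust_with_pre_metric(during, pre)
```

The stronger the correlation between the pre-experiment metric and the OEC, the larger the variance reduction.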

Validation of Experiments
A/A test (null test):
To test the experimental & randomization setups
Assign users to variant groups, but expose them to the same experience
If the system is working properly, H0 should be rejected only about 5% of the time (at the 95% confidence level)
Other application: Software migration
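
The 5% expectation for A/A tests can be checked by simulation. This sketch draws both groups from the same distribution many times and counts how often the 95% CI for the difference excludes 0. The normal data, group sizes, and function name are illustrative assumptions; a real A/A test would replay the platform's own assignment and logs.

```python
import math
import random
from statistics import mean, variance

def aa_rejection_rate(n_tests: int = 500, n_users: int = 100,
                      seed: int = 0, z: float = 1.96) -> float:
    """Share of simulated A/A tests in which H0 is (wrongly) rejected."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(n_tests):
        # Both groups get the same experience: identical distributions.
        a = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_users)]
        sigma_d = math.sqrt(variance(a) / n_users + variance(b) / n_users)
        if abs(mean(b) - mean(a)) > z * sigma_d:
            rejected += 1
    return rejected / n_tests

rate = aa_rejection_rate()
```

A rate far from 5% points to a problem in randomization, logging, or the variance estimate.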

Limitations of Experiments (1/2)
Quantitative metrics, but no explanations:
Possible to know which is better and by how much, but not why
Long-term effects:
Online tests are typically run for short periods, e.g. a few days/weeks
Find good OEC metrics that predict long-term effects
Run experiments longer: Hard in practice due to survivorship bias in online cohorts, as many cookies churn, especially in anonymous settings
Primacy effect & newness effect:
Run the experiment longer or compute the OEC only for new users
Primacy effect:
Experienced users may be less efficient until they get used to Treatment
Newness effect:
When Treatment is introduced, some users click everywhere

Limitations of Experiments (2/2)
Feature must be implemented:
In early stages, use paper prototyping for quick feedback/refinements
Consistency:
Need a consistent experience for users
Overlapping experiments:
Past experience: Strong interactions are rare in practice(?)
Initially, avoid tests that could interact
Perform pairwise tests: Flag interactions automatically
Launch event:
All users need to see it, so an experiment cannot be run

Other Practical Concerns (1/2)
Triggering
Example: Change to the checkout page, which only 10% of users reach
Analyze only users who were exposed to the variants (checkout pages)
Reduces the variance of treatment effect estimates
Automatic optimization
Run experiments to optimize areas amenable to automated search
Once an organization has a clear OEC
Multi-Armed Bandit Algorithm / Hoeffding Races (later)

Robots removal
Their activity can severely bias results
Call Treatment assignment by JavaScript (client-side), not server-side
Exclude robots that reject cookies with unidentified requests
Exclude robots that do not delete cookies and have many actions
Robots removal approach:
List of known robots
Heuristics (Kohavi & Parekh, 2003)

Discussions
Online experiments are extremely important for building data products across various applications
For fast iteration, we will build an online experimentation platform with
Random assignment to Treatment or Control
Online monitor for ramp-up, SRM, and interactions
Experiment analytics with data query, ETL and statistical inference
Next: Segment Validation SOP as the 1st application

References
Box et al. (2005). Statistics for Experimenters: Design, Innovation, and Discovery
Kohavi & Longbotham (Encyclopedia of MLDM, 2015). Online controlled experiments and A/B tests
Kohavi et al. (DMKD, 2009). Controlled experiments on the web: survey and practical guide
van Belle (2002). Statistical Rules of Thumb
Willan & Briggs (2006). Statistical Analysis of Cost-effectiveness Data

Thank you for listening!

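Appendix: Derivations of CI for Absolute Effect
Under $H_0 : O_B = O_A$,
$E(\hat{O}_B - \hat{O}_A) = 0$
$Var(\hat{O}_B - \hat{O}_A)$ can be estimated by $\hat{\sigma}_d^2$
As the sample size is large, by the Central Limit Theorem
$$\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d} \xrightarrow{d} N(0, 1)$$
Thus
$$\Pr\left(\left|\frac{\hat{O}_B - \hat{O}_A}{\hat{\sigma}_d}\right| \le 1.96\right) = 95\%$$
CI for absolute effect:
$$\hat{O}_B - \hat{O}_A \pm 1.96\,\hat{\sigma}_d$$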

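Appendix: Derivations of CI for Percent Effect (1/2)
Fieller (1954):
Define $R = \frac{O_B}{O_A}$
Obtain CI for $R$ based on $\hat{O}_B - R\,\hat{O}_A$
Apply Central Limit Theorem
$$\hat{O}_B - R\,\hat{O}_A \xrightarrow{d} N\left(0,\ Var[\hat{O}_B - R\,\hat{O}_A]\right)$$
$Var[\hat{O}_B - R\,\hat{O}_A] = \hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2$ (since $Cov(\hat{O}_B, \hat{O}_A) = 0$)
Thus
$$\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}} \xrightarrow{d} N(0, 1)$$
$$\Pr\left(\left|\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right| \le 1.96\right) = 95\%$$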

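Appendix: Proof of CI for Percent Effect (2/2)
CI for $R$: By solving the quadratic equation in $R$
$$\left(\frac{\hat{O}_B - R\,\hat{O}_A}{\sqrt{\hat{\sigma}_B^2 + R^2\hat{\sigma}_A^2}}\right)^2 = 1.96^2$$
$$R = \frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2}$$
with $\widehat{CV}_X = \hat{\sigma}_X / \hat{O}_X$
Note: $\frac{\hat{O}_B - \hat{O}_A}{\hat{O}_A} = \frac{\hat{O}_B}{\hat{O}_A} - 1$
CI for percent effect:
$$\frac{\hat{O}_B}{\hat{O}_A} \cdot \frac{1 \pm 1.96\sqrt{\widehat{CV}_A^2 + \widehat{CV}_B^2 - 1.96^2\,\widehat{CV}_A^2\,\widehat{CV}_B^2}}{1 - 1.96^2\,\widehat{CV}_A^2} - 1$$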
