A/B Testing Data-Driven Algorithms in the Cloud
cloudacademy.com
7/25/2016
About us
Roberto Turrin, Sr. Data Scientist (PhD), @robytur
Luca Baroffio, Data Scientist (PhD), @lucabaroffio
Agenda
Data-driven algorithms
Evaluation
A/B testing
Challenges in A/B testing data-driven algorithms
A/B testing in the cloud
Data-driven A/B testing
Conclusions
Q&A
Data-driven algorithms
Decision problems that can be modeled from data
Data-driven problems - I
Image recognition
Document classification
Speech-to-text
Spam/fraud detection
Stock price prediction
Content personalization
Market basket
Search suggestion
Playlist generation
Document clustering
User segmentation
Targeted advertising
Data-driven problems - II
Image recognition
Document classification
Speech-to-text
Spam/fraud detection
Stock price prediction
Content personalization
Market basket
Search suggestion
Playlist generation
Document clustering
User segmentation
Targeted advertising
[Diagram: the problems mapped to four families with small examples: classification ("?"), regression ("170 cm"), clustering ("group A / group B"), rule extraction ("A, B -> C"); classification and regression are Supervised, clustering and rule extraction are Unsupervised]
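To make the supervised/unsupervised distinction concrete, here is a minimal sketch (an addition to the deck, assuming scikit-learn; the height data and group labels are illustrative): a classifier learns from labeled examples, while a clustering algorithm has to discover the groups on its own.

```python
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

heights = [[150], [160], [175], [185]]        # heights in cm (toy data)
# Supervised: labels are known up front; predict the group for 170 cm.
clf = KNeighborsClassifier(n_neighbors=1).fit(heights, ["A", "A", "B", "B"])
print("supervised prediction for 170 cm:", clf.predict([[170]]))
# Unsupervised: no labels; the algorithm discovers the groups itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(heights)
print("unsupervised cluster labels:", km.labels_)
```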
Data-driven algorithm pipeline
[Pipeline diagram: data set -> feature extraction (batch) -> features -> training (batch) -> ML models -> prediction (real-time, fed by real-time data) -> information]
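As a hedged illustration of this pipeline (not the talk's actual stack; the spam example and model choice are assumptions), here is a minimal scikit-learn sketch of batch feature extraction and training followed by real-time prediction:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# --- Feature extraction + training: batch, run offline on the data set ---
docs = ["cheap pills now", "meeting at noon", "win money fast", "lunch tomorrow"]
labels = [1, 0, 1, 0]                                # 1 = spam, 0 = not spam
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(docs)            # data set -> features
model = LogisticRegression().fit(features, labels)   # features -> ML model

# --- Prediction: real-time, applied to incoming data ---
incoming = ["win cheap money"]
print("spam?", model.predict(vectorizer.transform(incoming)))
```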
Evaluation: offline vs online
Offline evaluation - I
[Pipeline diagram, as above, with the prediction stage fed from the stored data set instead of real-time data]
Offline experiments are run on a snapshot of the collected data set.
Offline evaluation - II
PROS:
Quick
Large number of solutions
No impact on business
Applicable in most scenarios
CONS:
They use past data
Risk of promoting imitation
Does not consider the impact of the algorithm on the user context
Not suitable for "unpredictable" data (e.g., stock prices)
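A minimal sketch of such an offline experiment, assuming scikit-learn and a toy data set: freeze a snapshot, hold part of it out, and score the trained model on the held-out part.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# The snapshot: train on one part of the past data, hold out the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("offline accuracy:", accuracy_score(y_test, model.predict(X_test)))
```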
Online evaluation
[Pipeline diagram, as above, with the prediction stage fed by real-time data]
Online experiments use live user feedback.
Online: human-subject experiments - I
Controlled experiment: A or B?
Human-subject experiments work in a controlled environment.
Online: human-subject experiments - II
PROS:
Feedback of real users, affected by the actual context
Multiple KPIs can be measured
CONS:
Must implement a controlled environment (back-end + front-end)
The environment is simulated
Recruiting non-biased users
Does not scale: limited number of users
Few solutions can be tested
Users must be motivated
Medium running time
Online: live A/B testing - I
[Diagram: live traffic split between variant A and variant B]
Live testing works in production.
Online: live A/B testing - II
PROS:
Captures the real, full impact of the data-driven solution
CONS:
Very few solutions can be tested
Long running time
Large traffic required
May affect business
Some KPIs are hard to measure
A/B testing: under the hood
Statistical hypothesis testing:
1. Formulate a hypothesis
2. Set up a testing campaign
3. Make use of statistics to evaluate the hypothesis (a sketch follows this list)
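A minimal sketch of step 3 (the counts are made up, and statsmodels is one common choice rather than the talk's tooling): a two-proportion z-test on the two variants' click-through rates.

```python
from statsmodels.stats.proportion import proportions_ztest

clicks = [200, 260]          # clicks on variant A, variant B
views = [10_000, 10_000]     # impressions of each variant
stat, p_value = proportions_ztest(clicks, views)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the variants differ.")
else:
    print("Fail to reject the null hypothesis.")
```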
A/B testing: real-world similarities
Clinical trials
Product comparison
Quality assurance
Decision making
A/B testing: UI examples
[Mock-ups: an "ADD TO CART" button in two colors; "Register" vs "Register (it's FREE!)"; a "Start free trial" button placed before vs after the page copy (lorem ipsum filler)]
A: "Control"    B: "Variation"
A/B testing: ingredients
Hypothesis formulation
• Everything starts with an idea
Define metrics:
• How to measure if something is "successful"?
Run a test, collect data and compute metrics
Compare the two alternatives
A/B testing: 1) hypothesis formulation
"A red button is clicked more often than a blue button"
Statistics lingo:
Null hypothesis: there is no difference between the red and the blue buttons
GOAL: reject the null hypothesis
If the null hypothesis is true:
• we fail to reject the null hypothesis
A/B testing: 2) define a metric
Choose a measure that reflects your goals.
Examples:
Click-through rate (CTR)
Open rate, click rate
Conversion rate (# subs / # visitors)
Customer satisfaction
Returning rate
A/B testing: 3) run a test
It may affect your business!
1. Create the two alternatives
2. Assign a subset of users to each alternative
3. Collect data and compute the metrics (see the sketch below)
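A hedged sketch of steps 2 and 3 with simulated users; in a real test the clicks come from live logging, and the underlying 20%/25% CTRs are of course unknown.

```python
import random

random.seed(1)
clicks = {"A": 0, "B": 0}
views = {"A": 0, "B": 0}
for user in range(1_000):
    variant = random.choice(["A", "B"])        # step 2: random split
    views[variant] += 1
    ctr = 0.20 if variant == "A" else 0.25     # simulated user behavior
    clicks[variant] += random.random() < ctr   # step 3: collect the data
for v in ("A", "B"):
    print(f"{v}: {views[v]} views, CTR = {clicks[v] / views[v]:.1%}")
```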
A/B testing: 4) compare the two alternatives
[Mock-up: the two "ADD TO CART" buttons, A and B]
A: 1 view, 0 clicks -> 0% CTR
B: 1 view, 1 click -> 100% CTR
100% > 0%, so the red button is better, right?
Not so fast…
A/B testing: confidence
What is the variability of our measure?
How confident are we in the outcome of the test?
Model the measure with a statistical distribution, e.g., a Gaussian distribution.
E.g., the average click-through rate for the blue button is 20% ± 7%.
[Plot: a Gaussian over the CTR with the confidence interval marked]
A/B testing: confidence interval
A confidence interval is a range defined so that there is a given probability that the value of your measure falls within that range.
The confidence interval depends on the confidence level: the higher the confidence level, the larger the confidence interval.
E.g., the average click-through rate for the blue button is 20% ± 7% at a 90% confidence level.
[Plot: a Gaussian over the CTR with the confidence interval marked]
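A minimal sketch of computing such an interval, assuming statsmodels and illustrative counts (the numbers are not the slide's):

```python
from statsmodels.stats.proportion import proportion_confint

clicks, views = 200, 1_000   # blue button: 20% observed CTR
low, high = proportion_confint(clicks, views, alpha=0.10, method="normal")
print(f"CTR = {clicks / views:.0%}, 90% CI = [{low:.1%}, {high:.1%}]")
```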
A/B testing: comparing distributions
[Plot: p(CTR) for the two "ADD TO CART" buttons: control centered at 20%, variation at 40%]
A/B testing: comparing distributions
[Plot as above, with the control's interval marked]
20% ± 7%, 90% confidence level
A/B testing: comparing distributions
[Plot as above, with a wider interval marked]
20% ± 10%, 95% confidence level
A/B testing: rejecting the null hypothesis
[Plot as above]
20% ± 10%, 95% confidence level
The average CTR for the variation falls outside the CI -> null hypothesis rejected!
A/B testing: errors
Null hypothesis: there is no difference between the red and the blue buttons.
Null hypothesis TRUE, accepted -> True Negative: the buttons are the same, and we acknowledge it
Null hypothesis TRUE, rejected -> Type I error: the buttons are the same, but we say the red one is better
Null hypothesis FALSE, accepted -> Type II error: the red button is better, but we say they are the same
Null hypothesis FALSE, rejected -> True Positive: the red button is better, and we acknowledge it
A/B testing: comparing distributions
[Plot as above: 20% ± 7%, 90% confidence level]
α: type-I error rate
A/B testing: comparing distributions
[Plot as above: 20% ± 7%, 90% confidence level]
β: type-II error rate
A/B testing: comparing distributions
[Plot as above: 20% ± 7%, 90% confidence level]
power = 1 - β
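Putting α, β, and power together, here is a hedged sketch of the sample size needed per variant, assuming statsmodels and a made-up effect (lifting CTR from 20% to 25%):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.20, 0.25)   # baseline vs hoped-for CTR
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05,
                                 power=0.8, alternative="two-sided")
print(f"~{n:.0f} users needed per variant")
```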
A/B testing: tips and common mistakes
DO NOT run the two variations under different conditions
DO NOT stop the test too early (see the simulation after this list)
Pay attention to external factors
DO NOT run a blind test without a hypothesis
DO NOT stop after the first failures
Choose the right metric
Consider the impact on your business
Randomly split the population
Keep the assignment consistent
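Why stopping too early matters: a hedged simulation of an A/A test (the two variants are identical) in which we peek after every batch of users and stop at the first "significant" result. The false-positive rate ends up far above the nominal 5%.

```python
import random

random.seed(0)
false_positives = 0
runs = 200
for _ in range(runs):                      # 200 simulated A/A experiments
    a = b = n = 0
    for _ in range(50):                    # peek after each of 50 batches
        for _ in range(100):               # 100 users per variant per batch
            a += random.random() < 0.20    # both variants share a 20% CTR
            b += random.random() < 0.20
        n += 100
        p = (a + b) / (2 * n)              # pooled CTR
        se = (2 * p * (1 - p) / n) ** 0.5  # std. error of the CTR difference
        if se > 0 and abs(a / n - b / n) / se > 1.96:
            false_positives += 1           # declared "significant" by chance
            break
print(f"false-positive rate with peeking: {false_positives / runs:.0%}")  # >> 5%
```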
A/B testing data-driven algorithms - I
[Diagram: users (Tom, Mike, Christena) split between two complete pipelines, A and B, each with its own feature extraction, training, and prediction; the variants drive "People like you" recommendations, targeted ads, and recommended users]
A/B testing data-driven algorithms - II
CTR is not always the right metric:
Search engine: ideally no click at all
Tweet suggestions: what users click is not necessarily what they want
E-commerce recommendations: users click to find products alternative to the one proposed
Find long-term metrics:
Retention/churn
Returning users
Time spent
Upgrading users
A/B testing data-driven algorithms - III
Multiple goals are addressed:
Relevance
Transparency
Diversity
Novelty
Coverage
Robustness
Consider all the steps of the pipeline.
Do not vary the UI and the data-driven algorithm simultaneously.
A/B testing in the cloud - I
Cloud computing makes A/B testing simpler:
1. Create multiple environments/modules with different features
2. Split traffic
• e.g., Google App Engine's traffic-splitting feature
The same can be done with the serverless paradigm (sketch below).
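Beyond managed features like App Engine's traffic splitting, the same idea fits in a few lines inside a serverless handler. A hedged sketch (the hashing scheme and function name are assumptions, not any provider's API) that deterministically routes each user to a variant:

```python
import hashlib

def variant_for(user_id: str, b_share: float = 0.5) -> str:
    """Deterministically map a user to a variant; the same user
    always lands in the same bucket across requests."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform in [0, 1]
    return "B" if bucket < b_share else "A"

# Example: route a few users; re-running gives the same assignments.
for uid in ("alice", "bob", "carol"):
    print(uid, "->", variant_for(uid, b_share=0.3))
```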
A/B testing in the cloud - II
If unsure, use a third-party service.
A/B testing as a service:
• AWS A/B testing service
• Google Analytics A/B testing feature
• Optimizely, VWO
A/B testing libraries:
• Sixpack, Planout, Clutch.io, Alephbet
Build your own
Data-driven algorithms to support A/B testing: multi-armed bandit - I
[Diagram: A/B testing keeps a fixed traffic split between the variants over time; a multi-armed bandit (itself a data-driven pipeline of feature extraction, training, and prediction) reallocates traffic across many variants (A-G) over time as evidence accumulates]
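A minimal epsilon-greedy sketch of the idea (the 20%/25% CTRs are simulated and unknown to the algorithm): most traffic goes to the arm that currently looks best, while a small fraction ε keeps exploring.

```python
import random

random.seed(2)
true_ctr = {"A": 0.20, "B": 0.25}   # unknown to the algorithm
clicks = {"A": 0, "B": 0}
views = {"A": 0, "B": 0}
epsilon = 0.1                       # exploration rate

for _ in range(10_000):
    if random.random() < epsilon or 0 in views.values():
        arm = random.choice(list(true_ctr))                    # explore
    else:
        arm = max(views, key=lambda a: clicks[a] / views[a])   # exploit
    views[arm] += 1
    clicks[arm] += random.random() < true_ctr[arm]

for arm in true_ctr:
    print(f"{arm}: {views[arm]} pulls, observed CTR {clicks[arm] / views[arm]:.1%}")
```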
Data-driven algorithms to support A/B testing: multi-armed bandit - II
PROS:
Increased average KPI
CONS:
Longer time to reach statistical significance
Harder implementation
Harder to maintain consistency
Main takeaways
Evaluate data-driven solutions both offline and online
Define the correct KPIs
Prefer long-term metrics to short-term conversions
Do not forget that A/B testing is a statistical test; rely on cloud services if you are not "confident"
Exploitation/exploration approaches can be an alternative to A/B testing
Conversion rate is not the only metric
Thank you for attending :)
cloudacademy.com
Q & A
