A/B testing AI - Global Artificial Intelligence Conference 2019

www.globalbigdataconference.com
Twitter : @bigdataconf
#GAIC

A/B Testing AI
Pavel Dmitriev
VP of Data Science, Outreach

OUTLINE
• Intro to A/B Testing
• Examples of Real Experiments
• Why A/B Test AI Systems?
• Pitfalls and Lessons

THE LIFE OF A GREAT IDEA – TRUE BING STORY
Control: existing display. Treatment: new idea called Long Ad Titles.

THE LIFE OF A GREAT IDEA
• It was one of hundreds of ideas on the table, and it seemed… meh…
• It stayed in the backlog in Feb, March, April, May, June,…
• Many features were ranked above it; it was clear the idea was not going to make it any time soon
• An engineer thought it was trivial to implement, so he implemented it and started an A/B test
• Immediately an alert fired: revenue was abnormally high (which usually indicates a bug)
• But in this case there was no bug. The idea increased Bing’s revenue by 12% (over $100M/year), without hurting user experience metrics!

WE ARE BAD AT ASSESSING THE VALUE OF IDEAS
• The best revenue-generating idea in Bing history was badly rated and delayed for months!
• The flip side is also true: a study from Bing showed that only ~1/3 of ideas developed were actually good for users and business, ~1/3 were neutral, and ~1/3 were bad
• Only in Software Engineering?
• In Sales, contradicting “best practices” are abundant. For example, best day to contact the prospect is …
• In Medicine, correctly evaluating an idea, e.g. a new drug, is a matter of life and death. The FDA and EMA do not trust expert opinions and mandate the use of Randomized Controlled Trials
We can’t trust our gut! To make the right choices we need data from real users!

A/B TESTS IN ONE SLIDE
• Other names: Controlled Experiments, Randomized Clinical Trials (RCTs)
• Can have more than two variants: A/B/C/etc. tests are common
• Must run statistical tests to confirm differences are not due to chance
A/B Tests are the best scientific way to prove causality!
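
One common way to run the statistical test mentioned above, when the metric is a rate such as clickthrough or reply rate, is a two-proportion z-test. The sketch below is an illustration only: the function name and the counts are hypothetical and do not come from any experiment in this talk.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants share one rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value for a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the observed difference is unlikely to be due to chance alone; the cutoff (often 0.05) should be fixed before the test starts.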

REAL EXAMPLES
• Three experiments
• Each had enough users for statistical validity
• For each experiment I’ll tell you the success metric
• Your job is to guess the result
• Please stand up
• You’ll choose between three options by raising your left hand, raising your right hand, or leaving both hands down
• If you get it wrong, please sit down
• Since there are 3 choices for each question, random guessing implies that 100%/3^3 ≈ 4% of you will get all three questions right. Let’s see how much better than random you can do.

EXAMPLE 1: OUTREACH EMAIL (STEP 9 DAY 7)
• Success metric: Reply Rate
Left version:
Hey {{first_name}},
In short, we're a sales automation platform that makes your reps life
a lot easier. Our average companies (based on 1100+ companies)
have tripled their reply rates on cold outbound emails and boosted
rep productivity by 2x.
We take what your best reps are doing and automate that across
your entire team so your weaker reps can work at the highest
possible same level. We also solve the issue of follow up falling
through the cracks and reps not going deep enough.
When can I get a few minutes on your calendar to discuss?
{{sender.first_name}}
Right version:
{{first_name}},
I'm sure in your role you get a ton of sales-driven emails, probably most of which are spam you have
no interest in. My goal is to provide enough value to warrant a 15 minute call with you.
What we do is put your sales process into a structured series of touch points which takes care of your
follow-up process for you. This ramps up reps activities and ensures that every lead is thoroughly
worked, never gets lost and receives the 5 to 12 touches where 80% of sales happen.
Second, we do all the administrative work for in your CRM (Salesforce). This frees up your reps time,
logs their activities, and gives you 100% accurate reporting.
Finally, we open up the "Black Box" of sales and show you in real time how each rep is performing,
what activities they're doing, and what is and isn't working. This provides a solid foundation to
accurately forecast results, improve your outreach and train your team.
Over 1100 companies (like CenturyLink, Adobe, and Marketo) use us and their average rep saves 2 hrs
a day, and 2X's their productivity.
If you see value here can we set up a time next Tuesday or Wednesday to discuss?
• Left: shorter, more “salesy”
• Right: longer, more “socially mindful”
• Raise your left hand if you think the Left version wins (stat-sig)
• Raise your right hand if you think the Right version wins (stat-sig)
• Don’t raise your hand if they are about the same (no stat-sig difference)

EXAMPLE 1: OUTREACH EMAIL (STEP 9 DAY 7)
• The left template has a 70% higher reply rate…
• However, most of its replies are negative or unsubscribe requests. The right template has a higher positive reply rate!
• If you did not raise your hand, sit down…
• If you raised your right hand, sit down…

EXAMPLE 2: SERP TRUNCATION
• SERP is a Search Engine Result Page
(shown on the right)
• Success Metric: Clickthrough Rate on first SERP
(ignore issues with click/back, page 2, etc.)
• Version A: show 10 algorithmic results
• Version B: show 8 algorithmic results by removing the
last two results (shown on the right)
• All else the same: task pane, ads, related searches
• Why truncate?
• Slightly faster page load time
• Fewer choices may make it easier for users to
choose what to click on
• Raise your left hand if you think version A wins (10 results)
• Raise your right hand if you think version B wins (8 results)
• Don’t raise your hand if they are about the same

EXAMPLE 2: SERP TRUNCATION
• If you raised your left hand, sit down…
• If you raised your right hand, sit down…
• With over 3M users in each variant, we could not
detect a stat-sig delta. Users simply shifted the clicks
from the last two algorithmic results to other
elements of the page.
• Rule of Thumb: Shifting clicks is easy. Reducing
abandonment is hard.
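
To get a feel for what "could not detect a stat-sig delta" means at this scale, one can estimate the smallest difference a test of a given size is likely to detect. The sketch below uses the standard normal-approximation formula for two equal-sized groups. The 3M users per variant comes from the slide; the baseline rate of 0.5, the 5% significance level, and the 80% power are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, users_per_variant, alpha=0.05, power=0.8):
    """Approximate smallest absolute rate difference a two-sided test can detect."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference between two equal-sized groups.
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / users_per_variant)
    return (z_alpha + z_power) * se


# baseline_rate=0.5 is a hypothetical clickthrough rate, not a figure from the talk.
mde = minimum_detectable_effect(baseline_rate=0.5, users_per_variant=3_000_000)
print(f"absolute MDE: {mde:.5f}, relative MDE: {mde / 0.5:.2%}")
```

Under these assumptions, a test of this size would be expected to pick up even a small shift, so a non-significant result is reasonable evidence that any true difference is tiny.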

EXAMPLE 3: WINDOWS SEARCH BOX
• The search box in the lower left corner of the screen on Windows machines
• Success metrics: more searches (and thus more Bing revenue)
• Raise your left hand if you think the Left version wins
• Raise your right hand if you think the Right version wins
• Don’t raise your hand if they are about the same

EXAMPLE 3: WINDOWS SEARCH BOX
• If you did not raise your hand, sit down…
• If you raised your left hand, sit down…
• The four variants we actually tested in order of performance are:
1. Type here to search (winner)
2. What can I help you find?
3. Ask me anything (Control - the design that shipped with Windows 10)
4. Search the web and Windows (worst)
Stop guessing – get the data!

A/B TESTING AI
• ML algorithms are complex, some adapt their behavior on the fly. The all-up
impact on users is not known until tested in the real world. In this sense, AI is
not unlike medicine.
• In medicine, the FDA does not trust the experts who developed the drug. It requires randomized clinical trials, because the all-up impact of the drug is not known until it is tested in the real world.
Why do we apply less rigor in developing AI than we do in medicine?
• It’s a moral obligation of AI developers to their users to rigorously evaluate ML
models using best available methods.

SEVEN REASONS TO A/B TEST YOUR ML MODELS
1. Train/Test sets get old quickly
2. Train/Test set misses an entire class of examples
3. Labels produced by human annotators may be inaccurate
4. Not all errors are equal
5. UI matters
6. Model is part of a bigger system
7. Model implementation has a bug
[Diagram: a loop between “ML Model” and “A/B Testing”, with arrows labeled “Measure end-user impact” and “Learnings and new ideas”]

AI AT OUTREACH – PERSONALIZED CUSTOMER EXPERIENCE AT SCALE
• Outreach is a Sales Engagement Platform – the tool in which sales reps spend their day
• Integrates with CRM, sales solutions, content solutions
• Allows sales managers to encode their playbooks
• Orchestrates and automates execution of plays
[Diagram labels: Personalized Action+Content Recommendations; Sales Reps; Situation Awareness; Action & Content Effectiveness; Complete Activity Data: communication (emails, messages, call & meeting transcripts), prospect’s behavior, sales rep’s behavior, sales team behavior, industry-wide success patterns]

FIVE PITFALLS AND LESSONS FOR A/B TESTING AI
1. AI vs no-AI tests are tricky
2. Ensure equal learning opportunity
3. Zero in on the target scenario
4. Beware of side effects
5. Measure the full ML pipeline

LESSON #1: AI VS NO-AI TESTS ARE TRICKY
• Introducing ML usually requires substantial changes to system architecture
• Example: introducing an ML model that recommends the best content for a sales rep to respond to a prospect’s email adds extra backend processing and slows down other functionality
• Many factors other than ML model accuracy can impact the outcome of the test
Solutions:
• A/B/C test:
A. No AI. System without the ML model.
B. Hidden AI. The ML model runs in the backend, but its results aren’t used and no changes are made to the user experience.
C. Full AI. The ML model runs in the backend and impacts the user experience.
• A vs B measures the performance impact; B vs C measures the impact on user experience
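
A minimal sketch of how the A/B/C design might be wired up, assuming hash-based user bucketing. The variant names, the experiment key `content_reco_abc`, and the `model` callable are hypothetical placeholders, not Outreach's actual implementation.

```python
import hashlib

VARIANTS = ("A_no_ai", "B_hidden_ai", "C_full_ai")


def assign_variant(user_id: str, experiment: str = "content_reco_abc") -> str:
    """Deterministically map a user to one of three equally sized variants."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def handle_request(user_id: str, model) -> dict:
    """Run the model in B and C, but let only C change what the user sees."""
    variant = assign_variant(user_id)
    model_ran = variant in ("B_hidden_ai", "C_full_ai")
    prediction = model(user_id) if model_ran else None
    shown = prediction if variant == "C_full_ai" else None
    return {"variant": variant, "model_ran": model_ran, "shown": shown}


# Hypothetical stand-in for the real recommendation model.
print(handle_request("user-42", lambda uid: "template-1"))
```

Because B pays the full backend cost of the model without exposing its output, any A vs B difference can be attributed to performance overhead rather than to recommendation quality.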

LESSON #2: ENSURE EQUAL LEARNING OPPORTUNITY
• For models that learn and adapt on the fly, quality depends on the amount of data they see
• Example: retraining the rep content recommendation model daily based on user feedback
• The variant that has a larger fraction of users gets more data for model training
Solutions:
• Ensure treatment and control are exposed to the same fraction of users
• If the test is not 50/50, ensure the control and “default” populations do not share data
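
The two solutions above can be sketched as a pair of checks: one confirming that treatment and control get the same share of users, and one restricting each variant's training data to feedback from its own users. The 10% / 10% / 80% split, the group names, and the log fields are hypothetical.

```python
from collections import Counter


def exposure_shares(assignments):
    """Fraction of users in each group (treatment, control, default)."""
    counts = Counter(assignments.values())
    total = len(assignments)
    return {group: n / total for group, n in counts.items()}


def training_rows(feedback_log, assignments, group):
    """Keep only feedback from users in `group`, so groups never share data."""
    return [row for row in feedback_log if assignments.get(row["user_id"]) == group]


# Hypothetical 10% treatment / 10% control / 80% default split.
assignments = {
    f"u{i}": "treatment" if i % 10 == 0 else "control" if i % 10 == 1 else "default"
    for i in range(1000)
}
feedback_log = [{"user_id": f"u{i}", "clicked": i % 3 == 0} for i in range(1000)]

shares = exposure_shares(assignments)
control_rows = training_rows(feedback_log, assignments, "control")
treatment_rows = training_rows(feedback_log, assignments, "treatment")
print(shares, len(control_rows), len(treatment_rows))
```

If the control model were retrained on the "default" population as well, it would see several times more data than the treatment model, and the comparison would no longer isolate the change being tested.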

LESSON #3: ZERO IN ON THE TARGET SCENARIO
• Often an ML model targets only a specific narrow user scenario
• Example: a rep content recommendation model may target only a specific type of
prospect objection, e.g. cost objection
• Only a subset of deals faces cost objection, so when we analyze the results “all
up”, the signal may be lost in the noise
Solutions:
• Triggering – restricting experiment analysis to only the affected population
• Requires counterfactual logging – the cost objection scenarios need to be
marked in the same way in both treatment and control variants.
• This may require running the model in control too, and logging the situations where
it fired, even though model results are not used: A/B/C test design from lesson #1.
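
A small sketch of triggered analysis, assuming every logged event carries a `triggered` flag that is set the same way in both variants (the counterfactual log). The event fields and all counts are made up to show how an effect confined to 10% of events is diluted in the all-up numbers.

```python
def conversion_rates(events, triggered_only=False):
    """Conversion rate per variant, optionally restricted to triggered events."""
    totals = {}
    for event in events:
        if triggered_only and not event["triggered"]:
            continue
        converted, seen = totals.get(event["variant"], (0, 0))
        totals[event["variant"]] = (converted + int(event["converted"]), seen + 1)
    return {variant: converted / seen for variant, (converted, seen) in totals.items()}


def make_events(variant, triggered_conversions):
    """Hypothetical log: 10 triggered events and 90 untouched ones per variant."""
    events = [
        {"variant": variant, "triggered": True, "converted": i < triggered_conversions}
        for i in range(10)
    ]
    events += [
        {"variant": variant, "triggered": False, "converted": i < 9}
        for i in range(90)
    ]
    return events


events = make_events("control", 2) + make_events("treatment", 4)
all_up = conversion_rates(events)
triggered = conversion_rates(events, triggered_only=True)
print("all-up:", all_up)
print("triggered:", triggered)
```

In this toy log the triggered rates differ by 20 percentage points while the all-up rates differ by only 2, which is why the flag has to be logged identically in control and treatment.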

LESSON #4: BEWARE OF SIDE EFFECTS
• While an improvement to an ML model may target a specific user scenario, it may inadvertently negatively (or positively) impact other scenarios
• Example: an improvement to the rep content recommendation model may have targeted the cost objection scenario, but since it’s a single model used for all scenarios, it may have degraded accuracy for the “not the right time” scenario
• How to detect such unintended side effects?
Solutions:
• While triggering is key to understanding the impact in the targeted scenario, an all-up analysis should also be performed to detect side effects
• Use a large, comprehensive set of metrics, including metrics you do not expect to impact
• Use segments, such as objection scenario, persona, industry
• Use stricter significance thresholds when analyzing segments to avoid the multiple-testing pitfall
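
One simple way to tighten the threshold when looking at many segments is a Bonferroni correction, which divides the significance level by the number of comparisons. The talk does not name a specific method, so this is just one option; the segment names and p-values below are hypothetical.

```python
def bonferroni_flags(segment_p_values, alpha=0.05):
    """Flag segments whose p-value clears the Bonferroni-adjusted threshold."""
    threshold = alpha / len(segment_p_values)
    return {segment: p < threshold for segment, p in segment_p_values.items()}


# Hypothetical per-segment p-values, for illustration only.
segment_p_values = {
    "cost objection": 0.004,
    "not the right time": 0.03,
    "persona: VP Sales": 0.20,
    "industry: software": 0.60,
}
flags = bonferroni_flags(segment_p_values)
print(flags)
```

With four segments the cutoff drops from 0.05 to 0.0125, so a segment with p = 0.03 that would look significant on its own is no longer flagged.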

LESSON #5: MEASURE THE FULL ML PIPELINE
• Often ML output is the result of multiple ML models strung together
• Example: the rep content recommendation model may consist of
• A named entity recognition model to detect whether a known entity, such as a specific competitor product, is mentioned in the email
• A scenario detection model, which determines the type of objection the sales rep is facing
• A recommendation model which, given a competitor name, recommends an email template to respond with
• To maximize learning, we need to understand how each part impacts the result
Solutions:
• Log the result and confidence level of each part
• Introduce segments based on the confidence and prediction class of each part
• Note: the assessment of confidence needs to remain the same in both variants
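
A sketch of per-stage logging and confidence-based segmentation for a pipeline like the one above. The stage names, the bucket cutoffs (0.5 and 0.8), and the example predictions are assumptions for illustration; the key point is that the same bucketing function is applied in both variants.

```python
from dataclasses import dataclass


@dataclass
class StageResult:
    stage: str         # e.g. "ner", "scenario", "recommendation"
    prediction: str    # predicted class or entity
    confidence: float  # model confidence in [0, 1]


def confidence_bucket(confidence, cutoffs=(0.5, 0.8)):
    """Map a confidence score to a coarse bucket; use the same cutoffs in every variant."""
    low, high = cutoffs
    if confidence < low:
        return "low"
    if confidence < high:
        return "medium"
    return "high"


def segment_keys(stage_results):
    """Build one (prediction class, confidence bucket) segment key per pipeline stage."""
    return {r.stage: (r.prediction, confidence_bucket(r.confidence)) for r in stage_results}


# Hypothetical log entry for one email passing through the pipeline.
logged = [
    StageResult("ner", "competitor_product", 0.92),
    StageResult("scenario", "cost_objection", 0.65),
    StageResult("recommendation", "template_17", 0.41),
]
print(segment_keys(logged))
```

Experiment results can then be broken down by these keys, for example to see whether the treatment only helps when the scenario detector is highly confident.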

TAKEAWAYS
• We are bad at assessing the value of our ideas. Don’t trust experts – get the
data!
• A/B testing is the best scientific way to measure causal impact of your work on
users and business
• A/B testing isn’t just for UI. A/B testing your ML models will provide deeper
insights and may save you from embarrassment
• Five things to keep in mind when A/B testing AI:
1. AI vs no-AI tests are tricky
2. Ensure equal learning opportunity
3. Zero in on the target scenario
4. Beware of side effects
5. Measure the full ML pipeline

Thank You! | Questions?
Pavel Dmitriev
VP of Data Science, Outreach
https://www.linkedin.com/in/paveldmitriev/
We are hiring Data Scientists and ML Engineers!
Ping me on LinkedIn if interested
