www.globalbigdataconference.com
Twitter : @bigdataconf
#GAIC
A/B Testing AI
Pavel Dmitriev
VP of Data Science, Outreach
OUTLINE
• Intro to A/B Testing
• Examples of Real Experiments
• Why A/B Test AI Systems?
• Pitfalls and Lessons
© 2019 Snowflake Computing Inc. All Rights Reserved
THE LIFE OF A GREAT IDEA – TRUE BING STORY
Control – existing display
Treatment – new idea called Long Ad Titles
THE LIFE OF A GREAT IDEA
• It was one of hundreds of ideas on the table, and it seemed… meh…
• Stayed in the backlog in Feb, March, April, May, June, …
• Many features were ranked above it; it was clear the idea was not going to make it any time soon
• An engineer thought it was trivial to implement. He implemented it and started an A/B test
• Immediately an alert fired: revenue was abnormally high (which usually indicates a bug)
• But in this case there was no bug. The idea increased Bing’s revenue by 12% (over
$100M/year), without hurting user experience metrics!
WE ARE BAD AT ASSESSING THE VALUE OF IDEAS
• The best revenue generating idea in Bing history was badly rated and delayed for months!
• The flip side is also true: a study from Bing showed that only ~1/3 of ideas developed were
actually good for users and business, ~1/3 were neutral, and ~1/3 were bad
• Only in Software Engineering?
• In Sales, contradictory “best practices” abound. For example, best day to contact
the prospect is …
• In Medicine, correctly evaluating an idea, e.g. a new drug, is a matter of life and death.
The FDA and EMA do not trust expert opinions and mandate the use of Randomized
Controlled Trials
We can’t trust our gut! To make the right choices we need data from real users!
A/B TESTS IN ONE SLIDE
• Other names: Controlled Experiments, Randomized Controlled Trials (RCTs)
• Can have more than two variants: A/B/C/etc. tests are common
• Must run statistical tests to confirm differences are not due to chance
A/B Tests are the best scientific way to prove causality!
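To make the "must run statistical tests" bullet concrete, here is a minimal sketch of a two-sided two-proportion z-test on conversion counts, using only the Python standard library. The function name and the example counts are illustrative assumptions, not numbers or methods from the talk; real experimentation platforms often use more elaborate tests.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates.

    Returns (z, p_value). Uses the pooled normal approximation,
    which is reasonable when both variants have many users.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value of a standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 1000/20000 conversions in A vs 1100/20000 in B.
z, p = two_proportion_z_test(conv_a=1000, n_a=20000, conv_b=1100, n_b=20000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value says the observed difference would be unlikely if the two variants truly performed the same; it does not by itself say the difference is large enough to matter.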
REAL EXAMPLES
• Three experiments
• Each had enough users for statistical validity
• For each experiment I’ll tell you the success metric
• Your job is to guess the result
• Please stand up
• You’ll choose between three options by raising your left hand, raising your right hand,
or leaving both hands down
• If you get it wrong, please sit down
• Since there are 3 choices for each question, random guessing implies 100%/3^3 ≈ 4%
of you will get all three questions right. Let’s see how much better than random you can do.
EXAMPLE 1: OUTREACH EMAIL (STEP 9 DAY 7)
• Success metric: Reply Rate
Left version:
Hey {{first_name}},
In short, we're a sales automation platform that makes your reps life
a lot easier. Our average companies (based on 1100+ companies)
have tripled their reply rates on cold outbound emails and boosted
rep productivity by 2x.
We take what your best reps are doing and automate that across
your entire team so your weaker reps can work at the highest
possible same level. We also solve the issue of follow up falling
through the cracks and reps not going deep enough.
When can I get a few minutes on your calendar to discuss?
{{sender.first_name}}
Right version:
{{first_name}},
I'm sure in your role you get a ton of sales-driven emails, probably most of which are spam you have
no interest in. My goal is to provide enough value to warrant a 15 minute call with you.
What we do is put your sales process into a structured series of touch points which takes care of your
follow-up process for you. This ramps up reps activities and ensures that every lead is thoroughly
worked, never gets lost and receives the 5 to 12 touches where 80% of sales happen.
Second, we do all the administrative work for in your CRM (Salesforce). This frees up your reps time,
logs their activities, and gives you 100% accurate reporting.
Finally, we open up the "Black Box" of sales and show you in real time how each rep is performing,
what activities they're doing, and what is and isn't working. This provides a solid foundation to
accurately forecast results, improve your outreach and train your team.
Over 1100 companies (like CenturyLink, Adobe, and Marketo) use us and their average rep saves 2 hrs
a day, and 2X's their productivity.
If you see value here can we set up a time next Tuesday or Wednesday to discuss?
{{sender.first_name}}
• Left: shorter, more “salesy”
• Right: longer, more “socially mindful”
• Raise your left hand if you think the Left version wins (stat-sig)
• Raise your right hand if you think the Right version wins (stat-sig)
• Don’t raise your hand if you think they are about the same (no stat-sig difference)
EXAMPLE 1: OUTREACH EMAIL (STEP 9 DAY 7)
• The left template has a 70% higher reply rate…
• However, most replies are negative or unsubscribe requests. The right template has
a higher positive reply rate!
• If you did not raise your hand, sit down…
• If you raised your right hand, sit down…
EXAMPLE 2: SERP TRUNCATION
• SERP is a Search Engine Result Page
(shown on the right)
• Success Metric: Clickthrough Rate on first SERP
(ignore issues with click/back, page 2, etc.)
• Version A: show 10 algorithmic results
• Version B: show 8 algorithmic results by removing the
last two results (shown on the right)
• All else the same: task pane, ads, related searches
• Why truncate?
• Slightly faster page load time
• Fewer choices may make it easier for users to
choose what to click on
• Raise your left hand if you think version A wins (10 results)
• Raise your right hand if you think version B wins (8 results)
• Don’t raise your hand if you think they are about the same
EXAMPLE 2: SERP TRUNCATION
• If you raised your left hand, sit down…
• If you raised your right hand, sit down…
• With over 3M users in each variant, we could not
detect a stat-sig delta. Users simply shifted the clicks
from the last two algorithmic results to other
elements of the page.
• Rule of Thumb: Shifting clicks is easy. Reducing
abandonment is hard.
EXAMPLE 3: WINDOWS SEARCH BOX
• The search box in the lower left corner of the screen on Windows machines
• Success metrics: more searches (and thus more Bing revenue)
• Raise your left hand if you think the Left version wins
• Raise your right hand if you think the Right version wins
• Don’t raise your hand if you think they are about the same
EXAMPLE 3: WINDOWS SEARCH BOX
• If you did not raise your hand, sit down…
• If you raised your left hand, sit down…
• The four variants we actually tested, in order of performance, are:
1. Type here to search (winner)
2. What can I help you find?
3. Ask me anything (Control - the design that shipped with Windows 10)
4. Search the web and Windows (worst)
Stop guessing – get the data!
A/B TESTING AI
• ML algorithms are complex, and some adapt their behavior on the fly. The all-up
impact on users is not known until tested in the real world. In this sense, AI is
not unlike medicine.
• In medicine, the FDA does not trust the experts who developed the drug. It requires
randomized clinical trials because the all-up impact of the drug is not known
until tested in the real world.
Why do we apply less rigor in developing AI than we do in medicine?
• It’s a moral obligation of AI developers to their users to rigorously evaluate ML
models using best available methods.
SEVEN REASONS TO A/B TEST YOUR ML MODELS
1. Train/Test sets get old quickly
2. Train/Test set misses an entire class of examples
3. Labels produced by human annotators may be inaccurate
4. Not all errors are equal
5. UI matters
6. Model is part of a bigger system
7. Model implementation has a bug
[Diagram: a loop between ML Model and A/B Testing. A/B testing measures end-user impact; learnings and new ideas feed back into the ML model.]
AI AT OUTREACH – PERSONALIZED CUSTOMER EXPERIENCE AT SCALE
• Outreach is a Sales Engagement Platform – the tool in which sales reps spend their day
• Integrates with CRM, sales solutions, content solutions
• Allows sales managers to encode their playbooks
• Orchestrates and automates execution of plays
[Diagram labels: Complete Activity Data (communication: emails, messages, call & meeting transcripts; prospect’s behavior; sales rep’s behavior; sales team behavior; industry-wide success patterns); Situation Awareness; Action & Content Effectiveness; Personalized Action+Content Recommendations; Sales Reps]
FIVE PITFALLS AND LESSONS FOR A/B TESTING AI
1. AI vs no-AI tests are tricky
2. Ensure equal learning opportunity
3. Zero in on the target scenario
4. Beware of side effects
5. Measure the full ML pipeline
LESSON #1: AI VS NO-AI TESTS ARE TRICKY
• Introducing ML usually requires substantial changes to system architecture
• Example: introducing an ML model that recommends the best content for a sales rep to
respond to a prospect’s email adds extra backend processing and slows down other
functionality
• Many factors other than ML model accuracy can impact the outcome of the
test
Solutions:
• A/B/C test:
A. No AI. System without ML model.
B. Hidden AI. ML model runs in the backend, but its results aren’t used: no changes are made
to user experience.
C. Full AI. ML model runs in the backend and impacts user experience
• A vs B measures perf impact, B vs C measures impact on user experience
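The A/B/C design above can be sketched as follows. This is a minimal illustration, assuming deterministic hash-based user assignment; `assign_variant`, `handle_request`, the experiment name, and the `run_model` callback are hypothetical names, not part of the talk or of any Outreach API.

```python
import hashlib

VARIANTS = ("A_no_ai", "B_hidden_ai", "C_full_ai")


def assign_variant(user_id: str, experiment: str = "content_rec_abc") -> str:
    """Deterministically map a user to one of three equally sized variants."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def handle_request(user_id: str, run_model, default_content: str) -> dict:
    """Serve one request according to the user's variant.

    A: the model never runs.
    B: the model runs (so its backend cost is paid) but its output is hidden.
    C: the model runs and its output is shown to the user.
    """
    variant = assign_variant(user_id)
    model_ran = variant != "A_no_ai"
    prediction = run_model(user_id) if model_ran else None
    shown = prediction if variant == "C_full_ai" else default_content
    return {
        "variant": variant,
        "model_ran": model_ran,
        "prediction": prediction,
        "shown": shown,
    }
```

Comparing A with B isolates the cost of running the model, since users see the same content in both; comparing B with C isolates the effect of showing the recommendations, since the model runs in both.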
LESSON #2: ENSURE EQUAL LEARNING OPPORTUNITY
• For models that learn and adapt on the fly, their quality depends on the
amount of data they see
• Example: retraining the rep content recommendation model daily based on user
feedback
• The variant that has a larger fraction of users gets more data for model training
Solutions:
• Ensure treatment and control are exposed to the same fraction of users
• If the test is not 50/50, ensure the control and “default” populations do not
share data
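One way to picture the second solution: when only part of the traffic is in the experiment, each model should retrain only on feedback from its own bucket, so that control does not quietly benefit from the much larger "default" population. The sketch below assumes a 10% treatment / 10% control / 80% default split and hash-based bucketing; the percentages, the salt, and the function names are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def bucket(user_id, treatment_pct=10, control_pct=10, salt="equal_learning_demo"):
    """Assign a user to 'treatment', 'control', or 'default' (not in the test)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    slot = int(digest, 16) % 100
    if slot < treatment_pct:
        return "treatment"
    if slot < treatment_pct + control_pct:
        return "control"
    return "default"


def split_training_data(events):
    """Group (user_id, feedback) events by bucket.

    Each variant's model is then retrained only on its own group, so
    treatment and control learn from similar amounts of data, and control
    never sees feedback collected from the 'default' population.
    """
    per_bucket = defaultdict(list)
    for user_id, feedback in events:
        per_bucket[bucket(user_id)].append((user_id, feedback))
    return per_bucket


# Hypothetical feedback events, one per user.
events = [(f"user-{i}", 1) for i in range(10000)]
data = split_training_data(events)
print({name: len(rows) for name, rows in data.items()})
```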
LESSON #3: ZERO IN ON THE TARGET SCENARIO
• Often an ML model targets only a specific narrow user scenario
• Example: a rep content recommendation model may target only a specific type of
prospect objection, e.g. cost objection
• Only a subset of deals face a cost objection, so when we analyze the results “all
up”, the signal may be lost in the noise
Solutions:
• Triggering – restricting experiment analysis to only the affected population
• Requires counterfactual logging – the cost objection scenarios need to be
marked in the same way in both treatment and control variants.
• This may require running the model in control too, and logging the situations where
it fired, even though model results are not used: A/B/C test design from lesson #1.
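A small sketch of why triggering helps, on made-up data. It assumes each logged deal carries a `triggered` flag that is recorded the same way in both variants (the counterfactual logging described above); the record layout, the helper names, and all the numbers are hypothetical.

```python
def conversion_rate(records, variant, triggered_only=False):
    """Share of converted deals in one variant, optionally only triggered ones."""
    rows = [
        r for r in records
        if r["variant"] == variant and (r["triggered"] or not triggered_only)
    ]
    return sum(r["converted"] for r in rows) / len(rows)


def treatment_effect(records, triggered_only=False):
    """Absolute difference in conversion rate, treatment minus control."""
    return (conversion_rate(records, "treatment", triggered_only)
            - conversion_rate(records, "control", triggered_only))


def make_records(variant, triggered_wins, n_triggered=100, n_other=900, other_wins=180):
    """Build synthetic deal records for one variant (made-up numbers)."""
    records = []
    for i in range(n_triggered):
        # 'triggered' marks deals where the cost-objection model fired,
        # logged in control as well even though its output is not used there.
        records.append({"variant": variant, "triggered": True,
                        "converted": int(i < triggered_wins)})
    for i in range(n_other):
        records.append({"variant": variant, "triggered": False,
                        "converted": int(i < other_wins)})
    return records


records = (make_records("control", triggered_wins=20)
           + make_records("treatment", triggered_wins=30))
all_up = treatment_effect(records)
triggered = treatment_effect(records, triggered_only=True)
print(f"all-up delta: {all_up:.3f}, triggered delta: {triggered:.3f}")
```

In this synthetic data the model only changes outcomes for the 10% of deals where it fired, so a ten-point lift in the triggered population shows up as a one-point lift all-up, which is much harder to separate from noise.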
LESSON #4: BEWARE OF SIDE EFFECTS
• While an improvement to an ML model may target a specific user scenario, it
may inadvertently negatively (or positively) impact other scenarios
• Example: an improvement to the rep content recommendation model may have
targeted the cost objection scenario, but since it’s a single model used for all scenarios it
may have degraded accuracy for the “not the right time” scenario
• How to detect such unintended side effects?
Solutions:
• While triggering is key to understanding the impact in the targeted scenario,
an all-up analysis should be performed to detect side effects
• Use a large comprehensive set of metrics, including even the metrics you do
not expect to impact
• Use segments, such as objection scenario, persona, industry
• Use stricter significance thresholds when analyzing segments to avoid the multiple testing pitfall
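One common way to tighten the threshold when looking at many segments is the Bonferroni correction, which divides the significance level by the number of comparisons. The talk does not say which correction was used, so treat this as one reasonable option; the segment names and p-values below are made up.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag segments whose p-value passes a Bonferroni-corrected threshold.

    p_values maps segment name -> p-value. With m segments, each one is
    tested at alpha / m instead of alpha.
    """
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {segment: p < threshold for segment, p in p_values.items()}


# Hypothetical per-segment p-values from one experiment.
segment_p_values = {
    "scenario: cost objection": 0.001,
    "scenario: not the right time": 0.03,
    "persona: VP of Sales": 0.20,
    "industry: software": 0.04,
}
flags = bonferroni_significant(segment_p_values)
for segment, significant in flags.items():
    print(segment, "significant" if significant else "not significant")
```

With four segments the per-segment threshold becomes 0.0125, so the 0.03 and 0.04 results that would pass an uncorrected 0.05 cutoff are no longer flagged.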
LESSON #5: MEASURE THE FULL ML PIPELINE
• Often ML output is the result of multiple ML models strung together
• Example: rep content recommendation model may consist of
• Named entity recognition model to detect if a known entity, such as a specific
competitor product, is talked about in the email
• Scenario detection model which determines the type of objection the sales rep
is facing
• Recommendation model which, given a competitor name, recommends an email
template to respond with
• To maximize learning, we need to understand how each part impacts the result
Solutions:
• Log for each part its result and confidence level
• Introduce segments based on confidence and prediction class of each part
• Note: assessment of confidence needs to remain the same in both variants
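The logging idea above might look like the sketch below: one flat record per recommendation, with each stage's prediction, confidence, and a derived segment label. The stage names, the bucket cutoffs, and the record layout are assumptions made for illustration; the important property is that the same `confidence_bucket` function is used in every variant.

```python
from dataclasses import dataclass

# Cutoffs are illustrative; what matters is that they are identical in all variants.
CONFIDENCE_BUCKETS = ((0.8, "high"), (0.5, "medium"))


def confidence_bucket(confidence: float) -> str:
    """Map a confidence score to a coarse bucket used for segmenting results."""
    for lower_bound, name in CONFIDENCE_BUCKETS:
        if confidence >= lower_bound:
            return name
    return "low"


@dataclass
class StageResult:
    stage: str         # e.g. "ner", "scenario", "recommendation"
    prediction: str    # predicted class or item for this stage
    confidence: float  # model confidence in [0, 1]


def build_log_record(variant: str, stages: list) -> dict:
    """Flatten the per-stage results of one pipeline run into a log record."""
    record = {"variant": variant}
    for s in stages:
        record[f"{s.stage}.prediction"] = s.prediction
        record[f"{s.stage}.confidence"] = s.confidence
        record[f"{s.stage}.segment"] = f"{s.prediction}/{confidence_bucket(s.confidence)}"
    return record


# Hypothetical pipeline run.
stages = [
    StageResult("ner", "CompetitorX", 0.92),
    StageResult("scenario", "cost_objection", 0.61),
    StageResult("recommendation", "template_17", 0.30),
]
print(build_log_record("treatment", stages))
```

Metrics can then be broken down by, for example, `scenario.segment` to see whether a change helped mostly where scenario detection was confident or where it was not.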
TAKEAWAYS
• We are bad at assessing the value of our ideas. Don’t trust experts – get the
data!
• A/B testing is the best scientific way to measure causal impact of your work on
users and business
• A/B testing isn’t just for UI. A/B testing your ML models will provide deeper
insights and may save you from embarrassment
• Five things to keep in mind when A/B testing AI:
1. AI vs no-AI tests are tricky
2. Ensure equal learning opportunity
3. Zero in on the target scenario
4. Beware of side effects
5. Measure the full ML pipeline
Thank You! | Questions?
Pavel Dmitriev
VP of Data Science, Outreach
https://www.linkedin.com/in/paveldmitriev/
We are hiring Data Scientists and ML Engineers!
Ping me on LinkedIn if interested


A/B testing AI - Global Artificial Intelligence Conference 2019

  • 2. A/B Testing AI Pavel Dmitriev VP of Data Science, Outreach
  • 3. Intro to A/B Testing Examples of Real Experiments Why A/B Test AI Systems? Pitfalls and Lessons OUTLINE
  • 4. © 2019 Snowflake Computing Inc. All Rights Reserved THE LIFE OF AGREAT IDEA– TRUE BING STORY 4 4 Control – Existing Display Treatment – new idea called Long Ad Titles
  • 5. © 2019 Snowflake Computing Inc. All Rights Reserved THELIFEOFAGREATIDEA 5 • It was one of hundreds of ideas on the table, and it seemed… • Stayed in the backlog in • Many features were above it, it was clear the idea was not going to make it any time soon • The engineer thought it was trivial to implement. He implemented it and started an A/B test • Immediately an alert fired: the Revenue was abnormally high (usually indicates a bug) • But in this case there was no bug. The idea increased Bing’s revenue by 12% (over $100M/year), without hurting user experience metrics! Feb, March, April, May, June,… Meh…
  • 6. © 2019 Snowflake Computing Inc. All Rights Reserved WEAREBADATASSESSINGTHEVALUEOFIDEAS 6 • The best revenue generating idea in Bing history was badly rated and delayed for months! • The flip side is also true: a study from Bing showed that only ~1/3 of ideas developed were actually good for users and business, ~1/3 were neutral, and ~1/3 were bad • Only in Software Engineering? • In Sales, contradicting “best practices” are abundant. For example, best day to contact the prospect is … • In Medicine, correctly evaluating an idea, e.g. a new drug, is a matter of life and death. FDA and EMA do not trust expert opinions and mandates the use of Randomized Controlled Trials We can’t trust our gut! To make the right choices we need data from real users!
  • 7. © 2019 Snowflake Computing Inc. All Rights Reserved 7
  • 8. © 2019 Snowflake Computing Inc. All Rights Reserved A/BTESTSINONESLIDE 9 • Other names: Controlled Experiments, Randomized Clinical Trials (RCTs) • Can have more than two variants: A/B/C/etc. tests are common • Must run statistical tests to confirm differences are not due to chance A/B Tests are the best scientific way to prove causality!
  • 9. © 2019 Snowflake Computing Inc. All Rights Reserved REALEXAMPLES 10 • Three experiments • Each had enough users for statistical validity • For each experiment I’ll tell you the success metric • Your job is to guess the result • Please stand up • You’ll chose between three options by raising you left hand, right hand, or leave both hand down • If you get it wrong, please sit down • Since there are 3 choices for each question, random guessing implies 100%/3^3 =~ 4% will get all three questions right. Let’s see how much better than random you can do.
  • 10. © 2019 Snowflake Computing Inc. All Rights Reserved EXAMPLE1:OUTREACHEMAIL(STEP9DAY7) 11 • Success metric: Reply Rate Hey {{first_name}}, In short, we're a sales automation platform that makes your reps life a lot easier. Our average companies (based on 1100+ companies) have tripled their reply rates on cold outbound emails and boosted rep productivity by 2x. We take what your best reps are doing and automate that across your entire team so your weaker reps can work at the highest possible same level. We also solve the issue of follow up falling through the cracks and reps not going deep enough. When can I get a few minutes on your calendar to discuss? {{sender.first_name}} {{first_name}}, I'm sure in your role you get a ton of sales-driven emails, probably most of which are spam you have no interest in. My goal is to provide enough value to warrant a 15 minute call with you. What we do is put your sales process into a structured series of touch points which takes care of your follow-up process for you. This ramps up reps activities and ensures that every lead is thoroughly worked, never gets lost and receives the 5 to 12 touches where 80% of sales happen. Second, we do all the administrative work for in your CRM (Salesforce). This frees up your reps time, logs their activities, and gives you 100% accurate reporting. Finally, we open up the "Black Box" of sales and show you in real time how each rep is performing, what activities they're doing, and what is and isn't working. This provides a solid foundation to accurately forecast results, improve your outreach and train your team. Over 1100 companies (like CenturyLink, Adobe, and Marketo) use us and their average rep saves 2 hrs a day, and 2X's their productivity. If you see value here can we set up a time next Tuesday or Wednesday to discuss? 
{{sender.first_name}} • Left: shorter, more “salesy” • Right: longer, more “socially mindful” • Raise your left hand if you think the Left version wins (stat-sig) • Raise your right hand if you think the Right version wins (stat-sig) • Don’t raise your hand if they are the about the same (no stat-sig difference)
  • 11. © 2019 Snowflake Computing Inc. All Rights Reserved EXAMPLE1:OUTREACHEMAIL(STEP9DAY7) 12 Hey {{first_name}}, In short, we're a sales automation platform that makes your reps life a lot easier. Our average companies (based on 1100+ companies) have tripled their reply rates on cold outbound emails and boosted rep productivity by 2x. We take what your best reps are doing and automate that across your entire team so your weaker reps can work at the highest possible same level. We also solve the issue of follow up falling through the cracks and reps not going deep enough. When can I get a few minutes on your calendar to discuss? {{sender.first_name}} {{first_name}}, I'm sure in your role you get a ton of sales-driven emails, probably most of which are spam you have no interest in. My goal is to provide enough value to warrant a 15 minute call with you. What we do is put your sales process into a structured series of touch points which takes care of your follow-up process for you. This ramps up reps activities and ensures that every lead is thoroughly worked, never gets lost and receives the 5 to 12 touches where 80% of sales happen. Second, we do all the administrative work for in your CRM (Salesforce). This frees up your reps time, logs their activities, and gives you 100% accurate reporting. Finally, we open up the "Black Box" of sales and show you in real time how each rep is performing, what activities they're doing, and what is and isn't working. This provides a solid foundation to accurately forecast results, improve your outreach and train your team. Over 1100 companies (like CenturyLink, Adobe, and Marketo) use us and their average rep saves 2 hrs a day, and 2X's their productivity. If you see value here can we set up a time next Tuesday or Wednesday to discuss? {{sender.first_name}} • Left template has 70% higher reply rate… • However, most replies are negative or unsubscribe requests. The right template has higher positive reply rate! 
• If you did not raise your hand, sit down… • If you raised your right hand, sit down…
  • 12. © 2019 Snowflake Computing Inc. All Rights Reserved EXAMPLE2:SERPTRUNCATION 13 • SERP is a Search Engine Result Page (shown on the right) • Success Metric: Clickthrough Rate on first SERP (ignore issues with click/back, page 2, etc.) • Version A: show 10 algorithmic results • Version B: show 8 algorithmic results by removing the last two results (shown on the right) • All else the same: task pane, ads, related searches • Why truncate? • Slightly faster page load time • Fewer choices may make it easier for users to choose what to click on • Raise your left hand if you think version A wins (10 results) • Raise your right hand if you think version B wins (8 results) • Don’t raise your hand if they are the about the same
  • 13. © 2019 Snowflake Computing Inc. All Rights Reserved EXAMPLE2:SERPTRUNCATION 14 • If you raised your left hand, sit down… • If you raised your right hand, sit down… • With over 3M users in each variant, we could not detect a stat-sig delta. Users simply shifted the clicks from the last two algorithmic results to other elements of the page. • Rule of Thumb: Shifting clicks is easy. Reducing abandonment is hard.
  • 14. © 2019 Snowflake Computing Inc. All Rights Reserved EXAMPLE3:WINDOWSSEARCHBOX 15 • The search box in the lower left corner of the screen on Windows machines • Success metrics: more searches (and thus more Bing revenue) • Raise your left hand if you think the Left version wins • Raise your right hand if you think the Right version wins • Don’t raise your hand if they are the about the same
  • 15. © 2019 Snowflake Computing Inc. All Rights Reserved EXAMPLE3:WINDOWSSEARCHBOX 16 • If you did not raise your hand, sit down… • If you raised your left hand, sit down… • The four variants we actually tested in order of performance are: Type here to search (winner) What can I help you find? Ask me anything (Control - the design that shipped with Windows 10) Search the web and Windows (worst) Stop guessing – get the data!
  • 16. © 2019 Snowflake Computing Inc. All Rights Reserved A/BTESTINGAI 17 • ML algorithms are complex, some adapt their behavior on the fly. The all-up impact on users is not known until tested in the real world. In this sense, AI is not unlike medicine. • In medicine, FDA does not trust experts who developed the drug - it requires randomized clinical trials - because the all-up impact of the drug is not know until tested in real world. Why do we apply less rigor in developing AI than we do in medicine? • It’s a moral obligation of AI developers to their users to rigorously evaluate ML models using best available methods.
  • 17. © 2019 Snowflake Computing Inc. All Rights Reserved SEVENREASONSTOA/BTESTYOURMLMODELS 18 1. Train/Test sets get old quickly 2. Train/Test set misses entire class of examples 3. Labels produced by human annotators may be inaccurate 4. All errors are not equal 5. UI matters 6. Model is part of a bigger system 7. Model implementation has a bug ML Model A/B Testing Learnings and new ideas Measure end-user impact
  • 18. © 2019 Snowflake Computing Inc. All Rights Reserved AIATOUTREACH–PERSONALIZEDCUSTOMEREXPERIENCEATSCALE 19 • Outreach is a Sales Engagement Platform – the tool in which sales reps spend their day • Integrates with CRM, sales solutions, content solutions • Allows sales managers encodes their playbooks • Orchestrates and automates execution of plays Personalized Action+Content Recommendations Sales Reps Situation Awareness Action & Content Effectiveness Complete Activity Data Communication (emails, messages, call & meeting transcripts), prospect’s behavior, sales rep’s behavior, sales team behavior, industry-wide success patterns
  • 19. FIVE PITFALLS AND LESSONS FOR A/B TESTING AI 1. AI vs no-AI tests are tricky 2. Ensure equal learning opportunity 3. Zero in on the target scenario 4. Beware of side effects 5. Measure the full ML pipeline
  • 20. LESSON #1: AI VS NO-AI TESTS ARE TRICKY • Introducing ML usually requires substantial changes to the system architecture • Example: introducing an ML model that recommends the best content for a sales rep to use when responding to a prospect's email adds extra backend processing and slows down other functionality • Many factors other than ML model accuracy can affect the outcome of the test Solutions: • Run an A/B/C test: A. No AI. The system without the ML model. B. Hidden AI. The ML model runs in the backend, but its results aren't used and no changes are made to the user experience. C. Full AI. The ML model runs in the backend and affects the user experience • A vs B measures the performance impact; B vs C measures the impact on user experience
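The A/B/C design above can be sketched roughly as follows. This is a minimal illustration, not the speaker's implementation: the variant names, the experiment key `content-rec-abc`, and the hash-based assignment are all assumptions made for the example.

```python
import hashlib

# Hypothetical variant names for the A/B/C design described in the slide.
VARIANTS = ("A_no_ai", "B_hidden_ai", "C_full_ai")


def assign_variant(user_id: str, experiment: str = "content-rec-abc") -> str:
    """Deterministically map a user to one of three equally sized variants."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]


def runs_model(variant: str) -> bool:
    """B and C both pay the backend cost of running the model."""
    return variant in ("B_hidden_ai", "C_full_ai")


def shows_recommendations(variant: str) -> bool:
    """Only C lets the model output change the user experience."""
    return variant == "C_full_ai"


def mean(values):
    return sum(values) / len(values)


def compare(metric_by_variant: dict) -> dict:
    """Split the overall effect into a performance part and a UX part."""
    a, b, c = (mean(metric_by_variant[v]) for v in VARIANTS)
    return {
        "perf_impact_B_minus_A": b - a,  # cost of merely running the model
        "ux_impact_C_minus_B": c - b,    # effect of showing its output
    }
```

Hashing the user ID together with an experiment key keeps each user in the same variant across sessions. A real analysis would add a significance test on top of the raw differences.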
  • 21. LESSON #2: ENSURE EQUAL LEARNING OPPORTUNITY • For models that learn and adapt on the fly, quality depends on the amount of data they see • Example: retraining the rep content recommendation model daily based on user feedback • The variant that has a larger fraction of users gets more data for model training Solutions: • Ensure treatment and control are exposed to the same fraction of users • If the test is not 50/50, ensure the control and "default" populations do not share data
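One way to picture the two solutions above is a traffic split in which treatment and control get equal shares, plus a feedback store partitioned by population so the control model never trains on "default" traffic. The split values, population names, and `FeedbackStore` class are hypothetical, chosen only for this sketch.

```python
from collections import defaultdict

# Hypothetical 10/10/80 split: the test is not 50/50, so "default" exists.
SPLIT = {"treatment": 0.10, "control": 0.10, "default": 0.80}


def has_equal_exposure(split, a="treatment", b="control", tol=1e-9):
    """True when both test variants see the same fraction of users."""
    return abs(split[a] - split[b]) <= tol


class FeedbackStore:
    """Keeps user feedback partitioned by population.

    Each variant's model retrains only on feedback from its own
    population, so control cannot borrow data from "default".
    """

    def __init__(self):
        self._events = defaultdict(list)

    def log(self, population: str, event: dict) -> None:
        self._events[population].append(event)

    def training_data(self, population: str) -> list:
        # Return a copy so callers cannot mutate the stored events.
        return list(self._events[population])
```

With this layout, the control model's daily retraining job would call `training_data("control")` only, even though the "default" population runs the same model version.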
  • 22. LESSON #3: ZERO IN ON THE TARGET SCENARIO • Often an ML model targets only a specific, narrow user scenario • Example: a rep content recommendation model may target only a specific type of prospect objection, e.g. a cost objection • Only a subset of deals face a cost objection, so when we analyze the results "all up", the signal may be lost in the noise Solutions: • Triggering: restrict the experiment analysis to only the affected population • This requires counterfactual logging: the cost objection scenarios need to be marked in the same way in both treatment and control variants • This may require running the model in control too, and logging the situations where it fired even though the model results are not used, as in the A/B/C test design from Lesson #1
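The effect of triggering can be illustrated with a small sketch. It assumes each logged record carries a `variant`, a `converted` outcome (0 or 1), and a `triggered` flag written identically in both variants (the counterfactual log); these field names are invented for the example.

```python
def conversion_rate(records, variant, triggered_only):
    """Conversion rate for one variant, optionally limited to triggered records."""
    rows = [
        r for r in records
        if r["variant"] == variant and (r["triggered"] or not triggered_only)
    ]
    if not rows:
        raise ValueError(f"no records for variant {variant!r}")
    return sum(r["converted"] for r in rows) / len(rows)


def lift(records, triggered_only):
    """Treatment minus control, all-up or on the triggered population only."""
    return (
        conversion_rate(records, "treatment", triggered_only)
        - conversion_rate(records, "control", triggered_only)
    )
```

Because untriggered records behave the same in both variants, including them dilutes the measured difference. Calling `lift(records, triggered_only=True)` removes that dilution, and it is only valid if the `triggered` flag is computed the same way in treatment and control.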
  • 23. LESSON #4: BEWARE OF SIDE EFFECTS • While an improvement to an ML model may target a specific user scenario, it may inadvertently affect other scenarios, negatively or positively • Example: an improvement to the rep content recommendation model may have targeted the cost objection scenario, but since a single model is used for all scenarios, it may have degraded accuracy for the "not the right time" scenario • How can such unintended side effects be detected? Solutions: • While triggering is key to understanding the impact in the targeted scenario, an all-up analysis should also be performed to detect side effects • Use a large, comprehensive set of metrics, including metrics you do not expect to affect • Use segments, such as objection scenario, persona, and industry • Use stricter significance thresholds when analyzing segments to avoid the multiple-testing pitfall
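The slide does not say which correction to use for segment analysis. As one common option, the sketch below applies a Bonferroni correction, which divides the significance level by the number of segments tested. The segment names and p-values in the usage note are made up.

```python
def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag segments whose p-value clears the Bonferroni-corrected threshold.

    Bonferroni is an assumption here, not the speaker's stated method:
    with m segments, each one is tested at alpha / m instead of alpha.
    """
    if not p_values:
        return {}
    threshold = alpha / len(p_values)
    return {segment: p <= threshold for segment, p in p_values.items()}
```

For example, with four segments the per-segment threshold becomes 0.05 / 4 = 0.0125, so a segment with p = 0.03 that would look significant on its own is no longer flagged.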
  • 24. LESSON #5: MEASURE THE FULL ML PIPELINE • Often ML output is the result of multiple ML models strung together • Example: the rep content recommendation model may consist of: • A named entity recognition model that detects whether a known entity, such as a specific competitor product, is mentioned in the email • A scenario detection model that determines the type of objection the sales rep is facing • A recommendation model that, given a competitor name, recommends an email template to respond with • To maximize learning, we need to understand how each part affects the result Solutions: • For each part, log its result and confidence level • Introduce segments based on the confidence and prediction class of each part • Note: the assessment of confidence needs to remain the same in both variants
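Per-stage logging and confidence-based segments might look like the following sketch. The stage names, the 0.5 and 0.8 cutoffs, and the record layout are illustrative assumptions, not details from the talk.

```python
from dataclasses import dataclass, asdict

# Hypothetical cutoffs; they must be identical in treatment and control
# so that confidence segments are comparable across variants.
CONFIDENCE_CUTOFFS = (0.5, 0.8)


@dataclass(frozen=True)
class StageResult:
    stage: str         # e.g. "ner", "scenario", "recommendation"
    prediction: str    # the stage's output class or label
    confidence: float  # the stage's confidence in [0, 1]


def confidence_bucket(confidence: float, cutoffs=CONFIDENCE_CUTOFFS) -> str:
    """Map a raw confidence score to a coarse segment label."""
    low, high = cutoffs
    if confidence < low:
        return "low"
    if confidence < high:
        return "medium"
    return "high"


def log_record(variant: str, stage_results) -> dict:
    """Build one experiment log row with per-stage results and segments."""
    return {
        "variant": variant,
        "stages": [asdict(r) for r in stage_results],
        "segments": {
            r.stage: (r.prediction, confidence_bucket(r.confidence))
            for r in stage_results
        },
    }
```

Analysis can then slice any metric by, say, the scenario stage's `(prediction, bucket)` pair to see whether a regression comes from low-confidence detections or from the recommendation step itself.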
  • 25. TAKEAWAYS • We are bad at assessing the value of our ideas. Don't trust experts: get the data! • A/B testing is the best scientific way to measure the causal impact of your work on users and the business • A/B testing isn't just for UI. A/B testing your ML models will provide deeper insights and may save you from embarrassment • Five things to keep in mind when A/B testing AI: 1. AI vs no-AI tests are tricky 2. Ensure equal learning opportunity 3. Zero in on the target scenario 4. Beware of side effects 5. Measure the full ML pipeline
  • 26. Thank You! | Questions? Pavel Dmitriev VP of Data Science, Outreach https://www.linkedin.com/in/paveldmitriev/ We are hiring Data Scientists and ML Engineers! Ping me on LinkedIn if interested

Editor's Notes