ML Experimentation at Sift
Alex Paino
atpaino@siftscience.com
Follow along at: http://go.siftscience.com/ml-experimentation
1
Agenda
Background
Motivation
Running experiments correctly
Comparing experiments correctly
Building tools to ensure correctness
2
About Sift Science
- Abuse prevention platform powered by machine learning
- Learns in real-time
- Several abuse prevention products and counting:
3
Payment Fraud · Content Abuse · Promo Abuse · Account Abuse
About Sift Science
4
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
5
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes
Evaluation as a loss function for your stack
6
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes
Evaluation as a loss function for your stack
3. Getting this right is a subtle and tricky problem
7
How do we run experiments?
8
Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
9
[Timeline diagram: Created account → Updated credit card info → Updated settings → Purchased item → Chargeback, with the chargeback arriving up to 90 days later]
Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
- Need to simulate the online case as closely as possible
10
[Timeline diagram (repeated from the previous slide): account creation through chargeback, up to 90 days later]
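The slide's implication for offline replays can be made concrete with a label-maturity cutoff: only evaluate transactions whose 90-day chargeback window has already closed, so recent fraud that has not been charged back yet does not get counted as legitimate. A minimal sketch, assuming events are plain dicts with the field names shown (an illustration, not Sift's data model):

```python
from datetime import datetime, timedelta

# Chargebacks can arrive up to ~90 days after a transaction, so an offline
# replay should only evaluate transactions whose feedback window has closed;
# otherwise recent fraud that simply has not been charged back yet looks
# legitimate. Event dicts and field names here are illustrative assumptions.
LABEL_MATURITY = timedelta(days=90)

def mature_transactions(events, as_of):
    """Keep only transactions old enough for their labels to be trustworthy."""
    return [
        e for e in events
        if e["event_type"] == "transaction"
        and e["timestamp"] <= as_of - LABEL_MATURITY
    ]

if __name__ == "__main__":
    events = [
        {"user_id": "u1", "event_type": "transaction",
         "timestamp": datetime(2016, 11, 1), "label": "chargeback"},
        {"user_id": "u2", "event_type": "transaction",
         "timestamp": datetime(2017, 2, 20), "label": None},  # window still open
    ]
    print(mature_transactions(events, as_of=datetime(2017, 3, 1)))  # only u1 survives
```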
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
11
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
12
[Diagram: train and test sets split so they are disjoint along both the time axis and the user axis]
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
- Watch for class skew - ours is over 50:1 → need to downsample
13
[Diagram (repeated): time- and user-disjoint train/test split]
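A minimal sketch of a split that respects both lessons: train and test are disjoint in time and in users, and the heavily skewed negative class is downsampled on the training side only. The sample format, helper name, and 2% keep rate are assumptions for illustration:

```python
import random

def time_and_user_disjoint_split(samples, cutoff, neg_keep_rate=0.02, seed=0):
    """Train on samples before `cutoff`; test on samples after it, restricted to
    users with no activity in the training window, so the model gets no credit
    for simply recognizing accounts already known to be bad. Training negatives
    are randomly downsampled to tame heavy class skew (ours is over 50:1).
    `samples` are assumed to be dicts with user_id, timestamp, and label
    (1 = fraud, 0 = not fraud); the 2% keep rate is illustrative."""
    rng = random.Random(seed)
    train = [s for s in samples if s["timestamp"] < cutoff]
    train_users = {s["user_id"] for s in train}
    test = [s for s in samples
            if s["timestamp"] >= cutoff and s["user_id"] not in train_users]
    # Downsample only the dominant (negative) class, and only on the train side,
    # so test metrics still reflect the true class balance.
    train = [s for s in train if s["label"] == 1 or rng.random() < neg_keep_rate]
    return train, test
```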
Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
14
[Timeline diagram: Created account → Updated credit card info → Login from IP Address A → Login from IP Address B → Transaction → Login from IP Address B; IP Address B only later appears in the Tor Exit Node DB as a known Tor exit node]
Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
- Can’t leak ground truth into feature vectors
15
[Timeline diagram (repeated from the previous slide): logins from IP Address B precede that IP's appearance in the Tor Exit Node DB]
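One way to enforce both rules in a replay is to make every external lookup point-in-time, so the experiment only sees what the data source contained at the moment of the event being scored. The sketch below is a generic illustration of that idea (the class and method names are invented for this example, not Sift's knowledge-base API); the same snapshotting discipline applies to label-derived aggregates such as per-email fraud rates, so ground truth from the future never leaks into a feature vector:

```python
import bisect
from datetime import datetime

class VersionedSet:
    """Point-in-time membership lookups over an external data source
    (e.g. a list of known Tor exit nodes): a replayed 2016 event only sees
    what the source contained in 2016, not what it contains today."""

    def __init__(self):
        self._first_seen = {}  # value -> sorted timestamps at which it appeared

    def record(self, value, timestamp):
        bisect.insort(self._first_seen.setdefault(value, []), timestamp)

    def contains(self, value, as_of):
        times = self._first_seen.get(value, [])
        return bool(times) and times[0] <= as_of  # member only if already known by `as_of`

if __name__ == "__main__":
    tor_exit_nodes = VersionedSet()
    tor_exit_nodes.record("203.0.113.7", datetime(2016, 6, 1))
    # A login from this IP in March 2016 predates the IP's appearance in the
    # Tor list, so the replayed feature must not treat it as a Tor exit node.
    print(tor_exit_nodes.contains("203.0.113.7", as_of=datetime(2016, 3, 1)))   # False
    print(tor_exit_nodes.contains("203.0.113.7", as_of=datetime(2016, 12, 1)))  # True
```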
Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
16
Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
- Need to evaluate the scores our customers actually use to make decisions (see the sketch below)
17
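In an offline replay this lesson boils down to filtering the scored events to the one event per customer that actually gates a decision before computing any metric. The event names and per-customer mapping below are hypothetical placeholders:

```python
# Hypothetical mapping from customer to the single event type that gates a
# decision for them (e.g. the score at checkout for payment fraud customers).
DECISION_EVENTS = {"acme": "$create_order", "globex": "$create_account"}

def scores_at_decision_points(scored_events, decision_events=DECISION_EVENTS):
    """Keep only the scores a customer actually acts on; a great score on a
    login that happens days after the chargeback is of no value to anyone.
    `scored_events` are assumed to be dicts with customer_id, event_type,
    score, and label fields."""
    return [
        e for e in scored_events
        if e["event_type"] == decision_events.get(e["customer_id"])
    ]
```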
Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
18
Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
- Reusing the same code paths
19
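A minimal sketch of the replay loop that preserves parity with online learning: score each event first, then update the model with its label, in chronological order, ideally through the same code paths the live system uses. The model interface shown is a placeholder, not a specific library's API:

```python
def prequential_replay(model, stream):
    """Score first, then learn: the same order the online system sees events.

    `model` is any object exposing predict_proba_one(features) and
    learn_one(features, label); these method names are placeholders for
    illustration, not a specific library's API. `stream` yields
    (features, label) pairs in chronological order."""
    scores, labels = [], []
    for features, label in stream:
        scores.append(model.predict_proba_one(features))  # prediction made before the label exists
        labels.append(label)
        model.learn_one(features, label)                  # then update, as the live system would
    return scores, labels
```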
How do we compare experiments?
20
Comparing Experiments Correctly - Background
21
[Diagram: several global models plus a customer-specific model feed into the Sift Score]
Comparing Experiments Correctly - Background
22
[Diagram: the same stack replicated per abuse type, with Payment Abuse, Account Abuse, Promotion Abuse, and Content Abuse models (global plus customer-specific) each producing their own abuse score]
Comparing Experiments Correctly - Background
23
Thousands of configurations to evaluate!
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type)
combinations to evaluate
24
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type)
combinations to evaluate
Each with different features, models, class
skew, and noise levels
25
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type)
combinations to evaluate
Each with different features, models, class
skew, and noise levels
→ Need some way to consolidate these
evaluations
26
Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
27
[Diagram: Customer 1 (perfect separation) + Customer 2 (perfect separation) = Combined (imperfect separation)]
Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
- Weighted averages are tricky
28
[Diagram (repeated): two perfectly separated customers combine into an imperfect pooled result]
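A tiny worked example of the pitfall in this slide, using scikit-learn only to compute AUC: each customer's scores are perfectly separated on their own scale, yet pooling the samples produces an imperfect combined ranking. The numbers are invented for illustration:

```python
from sklearn.metrics import roc_auc_score

# Two customers, each perfectly separated on their own score scale:
# customer 1's scores live in a low band, customer 2's in a high band.
c1_scores, c1_labels = [0.10, 0.15, 0.20, 0.25], [0, 0, 1, 1]
c2_scores, c2_labels = [0.60, 0.65, 0.70, 0.75], [0, 0, 1, 1]

print(roc_auc_score(c1_labels, c1_scores))  # 1.0
print(roc_auc_score(c2_labels, c2_scores))  # 1.0

# Pooling the samples mixes the two score distributions: customer 2's
# negatives (0.60, 0.65) now outrank customer 1's positives (0.20, 0.25),
# so the combined AUC drops to 0.75 even though every per-customer
# ranking was perfect.
print(roc_auc_score(c1_labels + c2_labels, c1_scores + c2_scores))  # 0.75
```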
Comparing Experiments Correctly - Lessons
Lesson: require statistical significance everywhere
- Examine significant differences in per-customer summary stats
29
Comparing Experiments Correctly - Lessons
Lesson: require statistical significance everywhere
- Examine significant differences in per-customer summary stats
- Use confidence intervals where possible, e.g. for AUC ROC
30
http://www.med.mcgill.ca/epidemiology/hanley/software/hanley_mcneil_radiology_82.pdf
http://www.cs.nyu.edu/~mohri/pub/area.pdf
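A minimal sketch of the significance machinery the two linked papers support: the Hanley and McNeil (1982) formula gives a standard error (and hence a confidence interval) for each per-customer AUC ROC, a z-test on the AUC delta flags customers that changed significantly, and the summary simply counts customers significantly improved versus significantly hurt. Treating the baseline and experiment AUCs as independent is a simplification, and the input layout is assumed for illustration:

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC ROC estimate, from Hanley & McNeil (1982),
    using only the AUC value and the positive/negative sample counts."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(max(var, 0.0))

def auc_delta_z(auc_baseline, auc_experiment, n_pos, n_neg):
    """z-score for the change in AUC between baseline and experiment on one
    customer. Treating the two AUCs as independent is a simplification: they
    are computed on the same test set, so this is only an approximation."""
    se = math.hypot(hanley_mcneil_se(auc_baseline, n_pos, n_neg),
                    hanley_mcneil_se(auc_experiment, n_pos, n_neg))
    if se == 0.0:
        return 0.0  # degenerate case, e.g. both AUCs exactly 1.0
    return (auc_experiment - auc_baseline) / se

def summarize(per_customer, z_threshold=1.96):
    """Count customers significantly improved vs. significantly hurt.
    `per_customer` maps customer -> (auc_baseline, auc_experiment, n_pos, n_neg)."""
    improved = hurt = 0
    for auc_a, auc_b, n_pos, n_neg in per_customer.values():
        z = auc_delta_z(auc_a, auc_b, n_pos, n_neg)
        if z > z_threshold:
            improved += 1
        elif z < -z_threshold:
            hurt += 1
    return improved, hurt

if __name__ == "__main__":
    per_customer = {
        "acme":   (0.90, 0.93, 400, 20000),  # baseline AUC, experiment AUC, #pos, #neg
        "globex": (0.85, 0.84, 150,  9000),
    }
    print(summarize(per_customer))  # (1, 0): acme significantly improved, globex inconclusive
```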
How do we ensure correctness?
31
Building tools to ensure correctness
32
Building tools to ensure correctness
- Big productivity win
33
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
34
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
35
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
36
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
37
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
38
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
39
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
40
[Screenshot panel: ROC curve]
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
41
[Screenshot panels: ROC curve and score distribution]
Building tools to ensure correctness - Examples
Example: Jupyter notebooks
for deep-dives
42
Key Takeaways
43
Key Takeaways
1. Need to carefully design experiments to remove biases
44
Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
45
Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
3. The right tools can help ensure all of your analyses are correct while
improving productivity
46
Questions?
47


Editor's Notes

  • #2: ...today I’ll be talking to you about how we conduct machine learning experiments here at Sift.
  • #3: I’ll start with the necessary background on Sift, and then touch on why this is such an important topic before diving into our experiences with this topic, where I’ll cover how we run experiments correctly, how we compare experiments correctly, and how we have built tools that ensure all experiments have this correctness baked in.
  • #4: First, a little about Sift. Sift uses machine learning to prevent various forms of abuse on the internet for our customers. To do this, our customers send us three types of data: page view data sent via our Javascript snippet, event data for important events such as the creation of an order or account through our events API, and feedback through our labels API or our web Console. (this console is what our customers’ analysts use to investigate potential cases of abuse) Especially relevant to this discussion is the fact that we now offer 4 distinct abuse prevention products as of our launch last Tuesday, and that we do this for thousands of customers.
  • #6-#8: Here is the motivation for the talk, starting with the basics. We must conduct experiments to improve a machine learning system, and we need our evaluation system to indicate that experiments which help the system are good and those which hurt it are bad. You can think of your evaluation framework as a sort of meta loss function for your entire ML stack: you want the changes to your system that the evaluation framework allows through to be minimizing error over time. However, conducting these experiments without introducing bias is often very tricky, and getting it wrong can lead to wasted effort and, in the worst case, to optimizing a system away from its ideal operating point. For example, ignoring class skew and using precision/recall of the dominant class leads to the always-positive classifier. In short: you must run experiments, those experiments must be correct, and they are easy to get wrong, which is why you should think about this.
  • #9: Ok, so we’ve said it’s important to get evaluation right. The first step along that path is running correct, representative experiments. Here’s how we do this at Sift.
  • #10-#11: When I say “correct”, what I mean is that these evaluations are not biased. Unlike a problem like ad targeting, we don’t instantly receive feedback about our predictions; it often takes weeks or months. Because of this we have to run experiments offline over historical data. The problem is then: how do we run offline experiments that best simulate the live case? That is, how do we best measure, through an offline experiment, the value that our system is providing online? This is a very hard problem; for example, just take a look at how much work goes into backtesting systems for trading.
  • #12-#14: The first thing you have to get right here is how you divide your data into train and test sets. If you want to simulate the live case correctly, you can’t just pick random splits; that could allow your training set to include information from “the future”, which is especially bad for us because a large source of value for our models is their ability to connect new accounts to accounts previously marked as fraudulent. We additionally need to segment the users belonging to the train and test sets so that we don’t give ourselves credit for simply surfacing users we already know to be bad. Beyond properly segmenting users, you also need to pay attention to class skew; this is especially true in a problem like payment fraud detection, where our customers commonly see fraud in under 2% of transactions.
  • #15-#16: Our knowledge base versions external data so that our evals cannot use information from “the future”. Ground-truth leaking can happen, for example, when computing fraud-rate features out of sparse information such as email addresses. One example that hurt us was a social-data integration where we had queried for social data primarily for fraudulent accounts.
  • #17-#18: This train/test split isn’t enough to run correct experiments; we still need to figure out how to analyze the scores given to the test side. We provide risk scores after any event for a user (e.g. login, logout, account creation, account updated, item added to cart), so we don’t want to use all of them, as this heavily weights active users. Most customers only care about the score after a certain event; for most payment fraud customers, the score we give a user when they try to check out is all that matters. Thus, in our offline experiments we only give ourselves credit for producing an accurate score at that point in time; giving a high score to a transaction that will result in a chargeback hours or days after the transaction was completed is of no value to the customer, and shouldn’t affect our evaluation of accuracy. The trick here is knowing which event(s) or scenarios a customer cares about. To date we have hardcoded this set for each of our abuse prevention products, but we hope that with the launch of our new Workflows product we will be able to get more fine-grained information about how each customer is using us.
  • #19-#20: The final point on running experiments correctly goes back to accurately simulating the online case. In the online case, various parts of our modeling stack are learned online; thus, to accurately simulate our online accuracy, we must simulate online learning. We actually weren’t doing this for a long time, which was underestimating our accuracy. We’ve also found it useful in general to reuse the same code paths online and offline, which removes a potential source of difficult bugs and biases in the system.
  • #21: Now that we can execute correct experiments, how do we make sense of their results relative to the current state of the system?
  • #22: To understand why this is especially challenging for us at Sift, we need a little more background on our modeling setup. In its most basic form, a Sift Score is a combination of several different global models (for example, random forest and logistic regression models) along with one or more customer-specific models. However, with the recent launch of our 2 new abuse prevention products...
  • #23: ...we now have 4 of this same setup for each customer, each consisting of distinct models. So we’re up to 4 different scores, with over 10 different models, to evaluate for each customer...
  • #24: ...of which we have several thousand.
  • #25-#27: As you can see, this is a huge number of distinct evaluations to consider, and we commonly experiment with changes, such as feature engineering, that can affect all of them. This is made even more complicated by the diverse nature of our customer base: each customer brings their own unique data, with their own class skew and level of noise in their evaluations. To make sense of this, we had to come up with some means of summarizing these diverse results.
  • #28-#29: But first, here are some things we have tried or considered and found to be flawed in one way or another. One lesson we learned is that we cannot rely on an evaluation that simply merges all samples across customers; this is because each customer’s score distribution can be shifted or scaled in its own way due to differences in integration, class skew, etc., as you can see in this image. Relatedly, when comparing two experiments, we need our summary metrics not to be tied to a single threshold, as each customer will use their own thresholds depending on their fraud prior, appetite for risk, etc. Another thing we have learned is that it is difficult to correctly weight an average over some summary metric, such as AUC ROC, across all (customer, use case) pairs. One approach we determined to be flawed pretty early on weighted each customer’s results by their overall volume; this led to our evals being heavily biased towards improving things for a very small number of super-large customers. This situation has improved over time as we’ve accumulated more customers, but is still problematic.
  • #30-#31: Here are a few techniques that have worked well for us. The most helpful thing we’ve done is to require statistical significance in all of our comparisons across experiments. This helps cut through the noise of having several thousand evaluations to look at by only surfacing changes that are meaningfully different, and it gives rise to a simple summarization technique: counting the number of customers significantly improved and comparing it to the count of those made significantly worse. Sometimes, however, an accuracy-improving change may not conclusively improve the accuracy for a single customer due to small sample sizes. For these cases, we have designed a separate top-level summary statistic that takes advantage of the thousands of semi-correlated trials (i.e. from our thousands of customers) and aims to give us the probability that the expected increase in some summary statistic (e.g. AUC ROC) is non-zero. We can do this by calculating the z-score for the delta in AUC ROC for each customer and running a one-sided t-test over the resulting sample set. Note that this approach could apply to any summary statistic that can yield a confidence interval.
  • #32: Ok, so we’ve figured out how to run and analyze experiments correctly in theory, but how do we ensure that this always happens in practice?
  • #33-#37: The best answer we’ve found is to design the right tools, tools that bake in correctness and make it as difficult as possible for someone to incorrectly analyze an experiment. Doing this leads to big productivity wins for a machine learning or data science team, and also makes it easier for other engineers to safely conduct experiments; this could be useful, for example, when an infrastructure engineer wants to test an idea that may allow the model size to be doubled without a negative performance impact. In both cases, you don’t want the engineers evaluating experiments to have to rethink all of the hard problems we’ve discussed today. So how do we do this at Sift? We’ve found that we need two classes of tools, the first being one that allows for quick, high-level analysis of an experiment...
  • #38-#42: ...an example of which is our experiment evaluation page. However, for more complicated experimental analysis, we’ve also found it necessary to support tools that allow us to drill more deeply into an experiment...
  • #43: ...for this use case, we’ve found iPython notebooks to be a perfect fit. One example where we found these tools useful was when we were investigating pulling in some new external data source at the request of a specific customer. When we ran an experiment with the new data, it didn’t help in aggregate -- no significant changes. But our intuition said it would help some, so we dug deeper through iPython to find some users who would be affected by this new data, and sure enough, were able to find a change.
  • #44: That does it for the topics I want to cover.
  • #45-#47: I hope you’ll take away from this talk that running experiments correctly is very important.