ML Experimentation at Sift
Alex Paino
atpaino@siftscience.com
Follow along at: http://go.siftscience.com/ml-experimentation
1
Agenda
Background
Motivation
Running experiments correctly
Comparing experiments correctly
Building tools to ensure correctness
2
About Sift Science
- Abuse prevention platform powered by machine learning
- Learns in real-time
- Several abuse prevention products and counting:
3
Payment Fraud · Content Abuse · Promo Abuse · Account Abuse
About Sift Science
4
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
5
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes
Evaluation as a loss function for your stack
6
Motivation - Why is this important?
1. Experiments must happen to improve an ML system
2. Evaluation needs to correctly identify positive changes
Evaluation as a loss function for your stack
3. Getting this right is a subtle and tricky problem
7
How do we run experiments?
8
Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
9
[Timeline diagram: Created account → Updated credit card info → Updated settings → Purchased item → Chargeback, with the chargeback arriving up to 90 days later]
Running experiments correctly - Background
- Large delay in feedback for Sift - up to 90 days
- → offline experiments over historical data
- Need to simulate the online case as closely as possible
10
[Timeline diagram (repeated from the previous slide): account creation through chargeback, up to 90 days later]
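The slide's implication for offline replays can be made concrete with a label-maturity cutoff: only evaluate transactions whose 90-day chargeback window has already closed, so recent fraud that has not been charged back yet does not get counted as legitimate. A minimal sketch, assuming events are plain dicts with the field names shown (an illustration, not Sift's data model):

```python
from datetime import datetime, timedelta

# Chargebacks can arrive up to ~90 days after a transaction, so an offline
# replay should only evaluate transactions whose feedback window has closed;
# otherwise recent fraud that simply has not been charged back yet looks
# legitimate. Event dicts and field names here are illustrative assumptions.
LABEL_MATURITY = timedelta(days=90)

def mature_transactions(events, as_of):
    """Keep only transactions old enough for their labels to be trustworthy."""
    return [
        e for e in events
        if e["event_type"] == "transaction"
        and e["timestamp"] <= as_of - LABEL_MATURITY
    ]

if __name__ == "__main__":
    events = [
        {"user_id": "u1", "event_type": "transaction",
         "timestamp": datetime(2016, 11, 1), "label": "chargeback"},
        {"user_id": "u2", "event_type": "transaction",
         "timestamp": datetime(2017, 2, 20), "label": None},  # window still open
    ]
    print(mature_transactions(events, as_of=datetime(2017, 3, 1)))  # only u1 survives
```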
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
11
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
12
[Diagram: train and test sets split so they are disjoint along both the time axis and the user axis]
Running experiments correctly - Lessons
Lesson: train & test set creation
- Can’t pick random splits
- Disjoint in time and set of users
- Watch for class skew - ours is over 50:1 → need to downsample
13
[Diagram (repeated): time- and user-disjoint train/test split]
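A minimal sketch of a split that respects both lessons: train and test are disjoint in time and in users, and the heavily skewed negative class is downsampled on the training side only. The sample format, helper name, and 2% keep rate are assumptions for illustration:

```python
import random

def time_and_user_disjoint_split(samples, cutoff, neg_keep_rate=0.02, seed=0):
    """Train on samples before `cutoff`; test on samples after it, restricted to
    users with no activity in the training window, so the model gets no credit
    for simply recognizing accounts already known to be bad. Training negatives
    are randomly downsampled to tame heavy class skew (ours is over 50:1).
    `samples` are assumed to be dicts with user_id, timestamp, and label
    (1 = fraud, 0 = not fraud); the 2% keep rate is illustrative."""
    rng = random.Random(seed)
    train = [s for s in samples if s["timestamp"] < cutoff]
    train_users = {s["user_id"] for s in train}
    test = [s for s in samples
            if s["timestamp"] >= cutoff and s["user_id"] not in train_users]
    # Downsample only the dominant (negative) class, and only on the train side,
    # so test metrics still reflect the true class balance.
    train = [s for s in train if s["label"] == 1 or rng.random() < neg_keep_rate]
    return train, test
```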
Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
14
[Timeline diagram: Created account → Updated credit card info → Login from IP Address A → Login from IP Address B → Transaction → Login from IP Address B; IP Address B only later appears in the Tor Exit Node DB as a known Tor exit node]
Running experiments correctly - Lessons
Lesson: preventing cheating
- External data sources need to be versioned
- Can’t leak ground truth into feature vectors
15
[Timeline diagram (repeated from the previous slide): logins from IP Address B precede that IP's appearance in the Tor Exit Node DB]
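One way to enforce both rules in a replay is to make every external lookup point-in-time, so the experiment only sees what the data source contained at the moment of the event being scored. The sketch below is a generic illustration of that idea (the class and method names are invented for this example, not Sift's knowledge-base API); the same snapshotting discipline applies to label-derived aggregates such as per-email fraud rates, so ground truth from the future never leaks into a feature vector:

```python
import bisect
from datetime import datetime

class VersionedSet:
    """Point-in-time membership lookups over an external data source
    (e.g. a list of known Tor exit nodes): a replayed 2016 event only sees
    what the source contained in 2016, not what it contains today."""

    def __init__(self):
        self._first_seen = {}  # value -> sorted timestamps at which it appeared

    def record(self, value, timestamp):
        bisect.insort(self._first_seen.setdefault(value, []), timestamp)

    def contains(self, value, as_of):
        times = self._first_seen.get(value, [])
        return bool(times) and times[0] <= as_of  # member only if already known by `as_of`

if __name__ == "__main__":
    tor_exit_nodes = VersionedSet()
    tor_exit_nodes.record("203.0.113.7", datetime(2016, 6, 1))
    # A login from this IP in March 2016 predates the IP's appearance in the
    # Tor list, so the replayed feature must not treat it as a Tor exit node.
    print(tor_exit_nodes.contains("203.0.113.7", as_of=datetime(2016, 3, 1)))   # False
    print(tor_exit_nodes.contains("203.0.113.7", as_of=datetime(2016, 12, 1)))  # True
```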
Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
16
Running experiments correctly - Lessons
Lesson: considering scores at key decision points
- Scores given for any event (e.g. user login)
- Need to evaluate the scores our customers actually use to make decisions (see the sketch below)
17
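In an offline replay this lesson boils down to filtering the scored events to the one event per customer that actually gates a decision before computing any metric. The event names and per-customer mapping below are hypothetical placeholders:

```python
# Hypothetical mapping from customer to the single event type that gates a
# decision for them (e.g. the score at checkout for payment fraud customers).
DECISION_EVENTS = {"acme": "$create_order", "globex": "$create_account"}

def scores_at_decision_points(scored_events, decision_events=DECISION_EVENTS):
    """Keep only the scores a customer actually acts on; a great score on a
    login that happens days after the chargeback is of no value to anyone.
    `scored_events` are assumed to be dicts with customer_id, event_type,
    score, and label fields."""
    return [
        e for e in scored_events
        if e["event_type"] == decision_events.get(e["customer_id"])
    ]
```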
Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
18
Running experiments correctly - Lessons
Lesson: parity with the online system
- Our system does online learning → so should the offline experiments
- Reusing the same code paths
19
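A minimal sketch of the replay loop that preserves parity with online learning: score each event first, then update the model with its label, in chronological order, ideally through the same code paths the live system uses. The model interface shown is a placeholder, not a specific library's API:

```python
def prequential_replay(model, stream):
    """Score first, then learn: the same order the online system sees events.

    `model` is any object exposing predict_proba_one(features) and
    learn_one(features, label); these method names are placeholders for
    illustration, not a specific library's API. `stream` yields
    (features, label) pairs in chronological order."""
    scores, labels = [], []
    for features, label in stream:
        scores.append(model.predict_proba_one(features))  # prediction made before the label exists
        labels.append(label)
        model.learn_one(features, label)                  # then update, as the live system would
    return scores, labels
```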
How do we compare experiments?
20
Comparing Experiments Correctly - Background
21
[Diagram: several global models plus a customer-specific model feed into the Sift Score]
Comparing Experiments Correctly - Background
22
[Diagram: the same stack replicated per abuse type, with Payment Abuse, Account Abuse, Promotion Abuse, and Content Abuse models (global plus customer-specific) each producing their own abuse score]
Comparing Experiments Correctly - Background
23
Thousands of configurations to evaluate!
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type)
combinations to evaluate
24
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type)
combinations to evaluate
Each with different features, models, class
skew, and noise levels
25
Comparing Experiments Correctly - Background
Thousands of (customer, abuse type)
combinations to evaluate
Each with different features, models, class
skew, and noise levels
→ Need some way to consolidate these
evaluations
26
Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
27
[Diagram: Customer 1 (perfect separation) + Customer 2 (perfect separation) = Combined (imperfect separation)]
Comparing Experiments Correctly - Lessons
Lesson: pitfalls with consolidating results
- Can’t throw all samples together → different score distributions
- Weighted averages are tricky
28
[Diagram (repeated): two perfectly separated customers combine into an imperfect pooled result]
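A tiny worked example of the pitfall in this slide, using scikit-learn only to compute AUC: each customer's scores are perfectly separated on their own scale, yet pooling the samples produces an imperfect combined ranking. The numbers are invented for illustration:

```python
from sklearn.metrics import roc_auc_score

# Two customers, each perfectly separated on their own score scale:
# customer 1's scores live in a low band, customer 2's in a high band.
c1_scores, c1_labels = [0.10, 0.15, 0.20, 0.25], [0, 0, 1, 1]
c2_scores, c2_labels = [0.60, 0.65, 0.70, 0.75], [0, 0, 1, 1]

print(roc_auc_score(c1_labels, c1_scores))  # 1.0
print(roc_auc_score(c2_labels, c2_scores))  # 1.0

# Pooling the samples mixes the two score distributions: customer 2's
# negatives (0.60, 0.65) now outrank customer 1's positives (0.20, 0.25),
# so the combined AUC drops to 0.75 even though every per-customer
# ranking was perfect.
print(roc_auc_score(c1_labels + c2_labels, c1_scores + c2_scores))  # 0.75
```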
Comparing Experiments Correctly - Lessons
Lesson: require statistical significance everywhere
- Examine significant differences in per-customer summary stats
29
Comparing Experiments Correctly - Lessons
Lesson: require statistical significance everywhere
- Examine significant differences in per-customer summary stats
- Use confidence intervals where possible, e.g. for AUC ROC
30
http://www.med.mcgill.ca/epidemiology/hanley/software/hanley_mcneil_radiology_82.pdf
http://www.cs.nyu.edu/~mohri/pub/area.pdf
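A minimal sketch of the significance machinery the two linked papers support: the Hanley and McNeil (1982) formula gives a standard error (and hence a confidence interval) for each per-customer AUC ROC, a z-test on the AUC delta flags customers that changed significantly, and the summary simply counts customers significantly improved versus significantly hurt. Treating the baseline and experiment AUCs as independent is a simplification, and the input layout is assumed for illustration:

```python
import math

def hanley_mcneil_se(auc, n_pos, n_neg):
    """Standard error of an AUC ROC estimate, from Hanley & McNeil (1982),
    using only the AUC value and the positive/negative sample counts."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(max(var, 0.0))

def auc_delta_z(auc_baseline, auc_experiment, n_pos, n_neg):
    """z-score for the change in AUC between baseline and experiment on one
    customer. Treating the two AUCs as independent is a simplification: they
    are computed on the same test set, so this is only an approximation."""
    se = math.hypot(hanley_mcneil_se(auc_baseline, n_pos, n_neg),
                    hanley_mcneil_se(auc_experiment, n_pos, n_neg))
    if se == 0.0:
        return 0.0  # degenerate case, e.g. both AUCs exactly 1.0
    return (auc_experiment - auc_baseline) / se

def summarize(per_customer, z_threshold=1.96):
    """Count customers significantly improved vs. significantly hurt.
    `per_customer` maps customer -> (auc_baseline, auc_experiment, n_pos, n_neg)."""
    improved = hurt = 0
    for auc_a, auc_b, n_pos, n_neg in per_customer.values():
        z = auc_delta_z(auc_a, auc_b, n_pos, n_neg)
        if z > z_threshold:
            improved += 1
        elif z < -z_threshold:
            hurt += 1
    return improved, hurt

if __name__ == "__main__":
    per_customer = {
        "acme":   (0.90, 0.93, 400, 20000),  # baseline AUC, experiment AUC, #pos, #neg
        "globex": (0.85, 0.84, 150,  9000),
    }
    print(summarize(per_customer))  # (1, 0): acme significantly improved, globex inconclusive
```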
How do we ensure correctness?
31
Building tools to ensure correctness
32
Building tools to ensure correctness
- Big productivity win
33
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
34
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
35
Building tools to ensure correctness
- Big productivity win
- Allows non-data scientists to conduct experiments safely
- Saves the team from drawing incorrect conclusions
36
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
37
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
38
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
39
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
40
[Screenshot panel: ROC curve]
Building tools to ensure correctness - Examples
Example: Sift’s experiment evaluation page for high-level analysis
41
[Screenshot panels: ROC curve and score distribution]
Building tools to ensure correctness - Examples
Example: Jupyter notebooks
for deep-dives
42
Key Takeaways
43
Key Takeaways
1. Need to carefully design experiments to remove biases
44
Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
45
Key Takeaways
1. Need to carefully design experiments to remove biases
2. Require statistical significance when comparing results to filter out noise
3. The right tools can help ensure all of your analyses are correct while
improving productivity
46
Questions?
47


Editor's Notes

  • #2: ...today I’ll be talking to you about how we conduct machine learning experiments here at Sift.
  • #3: I’ll start with the necessary background on Sift, and then touch on why this is such an important topic before diving into our experiences with this topic, where I’ll cover how we run experiments correctly, how we compare experiments correctly, and how we have built tools that ensure all experiments have this correctness baked in.
  • #4: First, a little about Sift. Sift uses machine learning to prevent various forms of abuse on the internet for our customers. To do this, our customers send us three types of data: page view data sent via our Javascript snippet, event data for important events such as the creation of an order or account through our events API, and feedback through our labels API or our web Console. (this console is what our customers’ analysts use to investigate potential cases of abuse) Especially relevant to this discussion is the fact that we now offer 4 distinct abuse prevention products as of our launch last Tuesday, and that we do this for thousands of customers.
  • #6-#8: Here is the motivation for the talk, starting with the basics. We must conduct experiments to improve a machine learning system, and we need our evaluation system to indicate that experiments which help the system are good and those which hurt it are bad. You can think of your evaluation framework as a sort of meta loss function for your entire ML stack: you want the changes to your system that the evaluation framework allows through to be minimizing error over time. However, conducting these experiments without introducing bias is often very tricky, and getting it wrong can lead to wasted effort and, in the worst case, to optimizing a system away from its ideal operating point. For example, ignoring class skew and using precision/recall of the dominant class leads to the always-positive classifier. In short: you must run experiments, those experiments must be correct, and they are easy to get wrong, which is why you should think about this.
  • #9: Ok, so we’ve said it’s important to get evaluation right. The first step along that path is running correct, representative experiments. Here’s how we do this at Sift.
  • #10-#11: When I say “correct”, what I mean is that these evaluations are not biased. Unlike a problem like ad targeting, we don’t instantly receive feedback about our predictions; it often takes weeks or months. Because of this we have to run experiments offline over historical data. The problem is then: how do we run offline experiments that best simulate the live case? That is, how do we best measure, through an offline experiment, the value that our system is providing online? This is a very hard problem; for example, just take a look at how much work goes into backtesting systems for trading.
  • #12-#14: The first thing you have to get right here is how you divide your data into train and test sets. If you want to simulate the live case correctly, you can’t just pick random splits; that could allow your training set to include information from “the future”, which is especially bad for us because a large source of value for our models is their ability to connect new accounts to accounts previously marked as fraudulent. We additionally need to segment the users belonging to the train and test sets so that we don’t give ourselves credit for simply surfacing users we already know to be bad. Beyond properly segmenting users, you also need to pay attention to class skew; this is especially true in a problem like payment fraud detection, where our customers commonly see fraud in under 2% of transactions.
  • #15-#16: Our knowledge base versions external data so that our evals cannot use information from “the future”. Ground-truth leaking can happen, for example, when computing fraud-rate features out of sparse information such as email addresses. One example that hurt us was a social-data integration where we had queried for social data primarily for fraudulent accounts.
  • #17-#18: This train/test split isn’t enough to run correct experiments; we still need to figure out how to analyze the scores given to the test side. We provide risk scores after any event for a user (e.g. login, logout, account creation, account updated, item added to cart), so we don’t want to use all of them, as this heavily weights active users. Most customers only care about the score after a certain event; for most payment fraud customers, the score we give a user when they try to check out is all that matters. Thus, in our offline experiments we only give ourselves credit for producing an accurate score at that point in time; giving a high score to a transaction that will result in a chargeback hours or days after the transaction was completed is of no value to the customer, and shouldn’t affect our evaluation of accuracy. The trick here is knowing which event(s) or scenarios a customer cares about. To date we have hardcoded this set for each of our abuse prevention products, but we hope that with the launch of our new Workflows product we will be able to get more fine-grained information about how each customer is using us.
  • #19-#20: The final point on running experiments correctly goes back to accurately simulating the online case. In the online case, various parts of our modeling stack are learned online; thus, to accurately simulate our online accuracy, we must simulate online learning. We actually weren’t doing this for a long time, which was underestimating our accuracy. We’ve also found it useful in general to reuse the same code paths online and offline, which removes a potential source of difficult bugs and biases in the system.
  • #21: Now that we can execute correct experiments, how do we make sense of their results relative to the current state of the system?
  • #22: To understand why this is especially challenging for us at Sift, we need a little more background on our modeling setup. In its most basic form, a Sift Score is a combination of several different global models (for example, random forest and logistic regression models) along with one or more customer-specific models. However, with the recent launch of our 2 new abuse prevention products...
  • #23: ...we now have 4 of this same setup for each customer, each consisting of distinct models. So we’re up to 4 different scores, with over 10 different models, to evaluate for each customer...
  • #24: ...of which we have several thousand.
  • #25-#27: As you can see, this is a huge number of distinct evaluations to consider, and we commonly experiment with changes, such as feature engineering, that can affect all of them. This is made even more complicated by the diverse nature of our customer base: each customer brings their own unique data, with their own class skew and level of noise in their evaluations. To make sense of this, we had to come up with some means of summarizing these diverse results.
  • #28-#29: But first, here are some things we have tried or considered and found to be flawed in one way or another. One lesson we learned is that we cannot rely on an evaluation that simply merges all samples across customers; this is because each customer’s score distribution can be shifted or scaled in its own way due to differences in integration, class skew, etc., as you can see in this image. Relatedly, when comparing two experiments, we need our summary metrics not to be tied to a single threshold, as each customer will use their own thresholds depending on their fraud prior, appetite for risk, etc. Another thing we have learned is that it is difficult to correctly weight an average over some summary metric, such as AUC ROC, across all (customer, use case) pairs. One approach we determined to be flawed pretty early on weighted each customer’s results by their overall volume; this led to our evals being heavily biased towards improving things for a very small number of super-large customers. This situation has improved over time as we’ve accumulated more customers, but is still problematic.
  • #30-#31: Here are a few techniques that have worked well for us. The most helpful thing we’ve done is to require statistical significance in all of our comparisons across experiments. This helps cut through the noise of having several thousand evaluations to look at by only surfacing changes that are meaningfully different, and it gives rise to a simple summarization technique: counting the number of customers significantly improved and comparing it to the count of those made significantly worse. Sometimes, however, an accuracy-improving change may not conclusively improve the accuracy for a single customer due to small sample sizes. For these cases, we have designed a separate top-level summary statistic that takes advantage of the thousands of semi-correlated trials (i.e. from our thousands of customers) and aims to give us the probability that the expected increase in some summary statistic (e.g. AUC ROC) is non-zero. We can do this by calculating the z-score for the delta in AUC ROC for each customer and running a one-sided t-test over the resulting sample set. Note that this approach could apply to any summary statistic that can yield a confidence interval.
  • #32: Ok, so we’ve figured out how to run and analyze experiments correctly in theory, but how do we ensure that this always happens in practice?
  • #33-#37: The best answer we’ve found is to design the right tools, tools that bake in correctness and make it as difficult as possible for someone to incorrectly analyze an experiment. Doing this leads to big productivity wins for a machine learning or data science team, and also makes it easier for other engineers to safely conduct experiments; this could be useful, for example, when an infrastructure engineer wants to test an idea that may allow the model size to be doubled without a negative performance impact. In both cases, you don’t want the engineers evaluating experiments to have to rethink all of the hard problems we’ve discussed today. So how do we do this at Sift? We’ve found that we need two classes of tools, the first being one that allows for quick, high-level analysis of an experiment...
  • #38-#42: ...an example of which is our experiment evaluation page. However, for more complicated experimental analysis, we’ve also found it necessary to support tools that allow us to drill more deeply into an experiment...
  • #43: ...for this use case, we’ve found iPython notebooks to be a perfect fit. One example where we found these tools useful was when we were investigating pulling in some new external data source at the request of a specific customer. When we ran an experiment with the new data, it didn’t help in aggregate -- no significant changes. But our intuition said it would help some, so we dug deeper through iPython to find some users who would be affected by this new data, and sure enough, were able to find a change.
  • #44: That does it for the topics I want to cover.
  • #45-#47: I hope you’ll take away from this talk that running experiments correctly is very important.