Python as part of a production machine learning stack

Michael Manapat
@mlmanapat
Stripe
  
	
  
Outline

- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
  
	
  
Stripe is a technology company focusing on making payments easy

- Short application
  
	
  
Tokenization

[Diagram: Customer browser <-> Stripe (Stripe.js, Token); Merchant server <-> Stripe (API call, Result)]
  
API Call

import stripe

stripe.Charge.create(
    amount=400,
    currency="usd",
    card="tok_103xnl2gR5VxTSB",
    email="customer@example.com"
)

Fraud / business violations

- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud

- No machine learning a year ago
  
Fraud / business violations

- Terms of service violations

E-cigarettes, drugs, weapons, etc.

How do we find these automatically?
  
Merchant sign up flow

Application submission -> Website scraped -> Text scored -> Application reviewed
  
Merchant sign up flow

Application submission -> Website scraped -> Text scored -> Application reviewed

Machine learning pipeline and service
  
Building a classifier: e-cigarettes

import pandas

data = pandas.read_pickle('ecigs')
data.head()

   text                                                 violator
0  " please verify your age i am 21 years or older ...  True
1  coming soon toggle me drag me with your mouse ...    False
2  drink moscow mules cart 0 log in or create an ...    False
3  vapors electronic cigarette buy now insuper st...    True
4  t-shirts shorts hawaii about us silver coll...       False

[5 rows x 2 columns]
  
Features for text classification

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()

features = cv.fit_transform(data['text'])

Sparse matrix of word counts from input text (omitting feature selection)
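To make the "sparse matrix of word counts" concrete, here is a minimal pure-Python sketch of the idea. It is not scikit-learn's implementation: CountVectorizer also handles tokenization and lowercasing and returns a real sparse matrix. The names `fit_vocabulary` and `transform` and the toy documents are hypothetical, chosen only for illustration.

```python
from collections import Counter


def fit_vocabulary(texts):
    """Map each distinct word to a column index, in sorted order."""
    words = sorted({word for text in texts for word in text.split()})
    return {word: index for index, word in enumerate(words)}


def transform(texts, vocabulary):
    """Return one sparse row per text as {column_index: count}."""
    rows = []
    for text in texts:
        counts = Counter(word for word in text.split() if word in vocabulary)
        rows.append({vocabulary[word]: n for word, n in counts.items()})
    return rows


# Hypothetical toy documents, not the data from the talk.
docs = ["buy vapors buy now", "t-shirts shorts hawaii"]
vocab = fit_vocabulary(docs)
rows = transform(docs, vocab)
```

Each row stores only the columns whose count is non-zero, which is why the representation stays small even when the vocabulary is large.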
  
Features for text classification

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    features, data['violator'],
    test_size=0.2)

- Avoid leakage
- Other cross-validation methods
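The slide does not say which kind of leakage it has in mind. One common reading is that anything fitted to the data, the vocabulary included, should see training rows only; fitting on every row before splitting lets held-out text shape the feature space. The sketch below illustrates that ordering in plain Python. The helper `train_test_split_indices` and the toy corpus are hypothetical and are not scikit-learn's `train_test_split`.

```python
import random


def train_test_split_indices(n_rows, test_size=0.2, seed=0):
    """Shuffle row indices and split them into train and test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


# Hypothetical toy corpus of (text, violator) pairs.
corpus = [
    ("vapors electronic cigarette buy now", True),
    ("t-shirts shorts hawaii about us", False),
    ("please verify your age", True),
    ("drink moscow mules cart", False),
    ("coming soon toggle me", False),
]

# Split first...
train_idx, test_idx = train_test_split_indices(len(corpus), test_size=0.2)

# ...then build the vocabulary from training rows only, so words that
# appear solely in held-out rows cannot influence the features.
train_vocab = sorted({w for i in train_idx for w in corpus[i][0].split()})
```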
  
Training

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)

Serializer reads from

model.intercept_
model.coef_
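A logistic regression over word counts is fully described by its intercept and coefficients, so a serializer only needs those values plus the word-to-column mapping. The sketch below is one possible way to do this, assuming JSON as the format; the talk does not describe the actual serializer. The vocabulary, weights, and the `score` function are hypothetical stand-ins for `model.intercept_` and `model.coef_`.

```python
import json
import math

# Hypothetical parameters standing in for model.intercept_ and model.coef_.
vocabulary = {"cigarette": 0, "vapors": 1, "shorts": 2}
intercept = -1.0
coef = [2.0, 1.5, -0.5]

# Serialize everything the scorer needs into one JSON document.
serialized = json.dumps(
    {"vocabulary": vocabulary, "intercept": intercept, "coef": coef}
)


def score(text, serialized_model):
    """Return P(violator) from word counts and the serialized weights."""
    params = json.loads(serialized_model)
    z = params["intercept"]
    for word in text.split():
        column = params["vocabulary"].get(word)
        if column is not None:
            z += params["coef"][column]  # each occurrence adds its weight
    return 1.0 / (1.0 + math.exp(-z))
```

Because scoring reduces to a weighted sum and a sigmoid, the scoring side does not need scikit-learn at all once the parameters are exported.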

	
  
Validation

import matplotlib.pyplot
from sklearn.metrics import roc_curve

probs = model.predict_proba(X_test)

fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])

matplotlib.pyplot.plot(fpr, tpr)
  
ROC: Receiver operating characteristic
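For readers who want to see what an ROC curve computes, here is a small pure-Python sketch: sort rows by score, lower the threshold one step at a time, and record the false positive rate and true positive rate after each step. It is an illustration under simplifying assumptions (both classes must be present), not scikit-learn's `roc_curve`. The labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists, sweeping the threshold from high to low."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    true_pos = false_pos = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        # Emit a point only after the last row of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(false_pos / negatives)
            tpr.append(true_pos / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2.0
        for i in range(1, len(fpr))
    )


# Hypothetical labels and scores.
labels = [True, False, True, False]
scores = [0.9, 0.8, 0.7, 0.1]
fpr, tpr = roc_points(labels, scores)
area = auc(fpr, tpr)
```

The area equals the fraction of (violator, non-violator) pairs in which the violator receives the higher score, so 0.5 corresponds to random ranking and 1.0 to perfect ranking.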
  




	
  
Pipeline

- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
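The "sanitize text (strip HTML)" step can be sketched with the standard library alone. This is a minimal illustration, not the code used in the talk: the class `TextExtractor` and the function `sanitize` are hypothetical names, and lowercasing and whitespace collapsing are assumptions based on the look of the sample rows shown earlier.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def sanitize(html):
    """Strip tags, lowercase, and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).lower().split())


page = "<html><script>var x = 1;</script><h1>Buy  Now</h1><p>Vapors</p></html>"
clean = sanitize(page)
```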
  
Luigi

import luigi

class GetSnapshots(luigi.Task):
    def run(self):
        ...

class GenFeatures(luigi.Task):
    def requires(self):
        return GetSnapshots()

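The key idea in the snippet above is that each task declares what it depends on and the framework works out the execution order. The toy resolver below illustrates that idea without importing Luigi. It is a stand-in, not Luigi's scheduler: real Luigi decides whether a task is complete by checking its output targets, while this sketch simply remembers which task classes have already run. The `Task` base class, the `build` function, and the `log` list are hypothetical.

```python
class Task:
    """Toy stand-in for luigi.Task: declare dependencies, then do work."""

    def requires(self):
        return []

    def run(self, log):
        raise NotImplementedError


class GetSnapshots(Task):
    def run(self, log):
        log.append("GetSnapshots")


class GenFeatures(Task):
    def requires(self):
        return [GetSnapshots()]

    def run(self, log):
        log.append("GenFeatures")


def build(task, log, done=None):
    """Run a task's dependencies first, then the task, each class once."""
    done = set() if done is None else done
    name = type(task).__name__
    if name in done:
        return
    for dependency in task.requires():
        build(dependency, log, done)
    task.run(log)
    done.add(name)


log = []
build(GenFeatures(), log)
```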
Luigi runs tasks on Hadoop cluster

Scoring as a service

Application submission -> Website scraped -> Text scored -> Application reviewed

Thrift RPC to Scoring Service
  
Scoring as a service

struct ScoringRequest {
    1: string text
    2: optional string model_name
}

struct ScoringResponse {
    1: double score  // Experiments?
    2: double request_duration
}

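To show how a handler might use these two structs, here is a plain-Python sketch that mirrors them as dataclasses. It does not use Thrift or generated code. The `MODELS` registry, the `handle` function, and the toy scoring functions are hypothetical, and treating `request_duration` as seconds is an assumption; the slide does not state the unit.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None


@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # seconds; the unit is an assumption


# Hypothetical registry of scoring functions keyed by model name.
MODELS: Dict[str, Callable[[str], float]] = {
    "default": lambda text: 0.5,
    "keyword": lambda text: 1.0 if "cigarette" in text else 0.0,
}


def handle(request: ScoringRequest) -> ScoringResponse:
    """Score the text with the requested model, or the default one."""
    started = time.perf_counter()
    model = MODELS[request.model_name or "default"]
    score = model(request.text)
    return ScoringResponse(score, time.perf_counter() - started)


response = handle(ScoringRequest("electronic cigarette", "keyword"))
```

Making `model_name` optional lets callers omit it and get the default model, while a caller running an experiment can name an alternative.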
Why a service?

- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation
- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
  
Summary

- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Thanks!
@mlmanapat
  
	
  

Python as part of a production machine learning stack by Michael Manapat PyData SV 2014
