Data Science @ PMI
Tools of The Trade
Best Practices to Start, Develop and Ship a Data Science Product
Manuel Valverde
Tokyo WebHack, 17th January 2019
‱ PhD @ Granada U., Spain: Physics modelling and MC simulations for SuperKamiokande
‱ PostDoc @ Osaka U., Osaka: Nuclear structure calculations (think Gaussian processes)
‱ Data Scientist @ Rakuten, Tokyo: Search relevancy for e-commerce
‱ Data Scientist @ PMI, Tokyo: Fraud prevention
About Me
About Philip Morris International
‱ Founded in 1847
‱ No. 108 in the 2018 Fortune 500
‱ 80,000 employees, 180+ markets, 150M consumers
‱ 6 of the world's top 15 international brands
Shifting from combustible cigarettes to smoke-free, reduced risk products (RRP)
https://www.pmi.com/smoke-free-products
‱ We are part of PMI's Enterprise Analytics and Data (EAD) group
‱ 40+ Data Scientists across 4 hubs
‱ Offices in Amsterdam (NL), Kraków (PL), Lausanne (CH) and Tokyo (JP)
‱ Profiles
‱ Education: 30% PhD, 70% MSc/BSc
‱ Data Science Experience: 7.4 yrs on average
‱ Experience in PMI: 88% under 2yrs
‱ Expertise in Machine Learning, Big Data Engineering, Insights Communication
‱ SCRUM certified (Professional Scrum Developer)
Data Science @ PMI
[Map of lab locations: 2 Labs LA, 2 Labs North America, Add 1 Lab EU, 2 Labs EE, Add 2 Labs Asia]
A ( Data x Science x Communication ) = Insight
Data is only one part of the equation. We bring the scientific method. It materializes in the analytical code we write and is as valuable as the data itself.
B We are business driven
Whatever we do contributes to the business. We are diligent about making an impact.
C We invest in people
We invest in the ability to ask questions. That can’t be achieved with tools alone: tools generate answers, but questions are posed by people.
D We self-organize
We choose coordination & cultivation over command & control. We believe this approach allows the best solutions to emerge.
E We iterate and improve
We embrace lean development, learn from mistakes and do it together with the business.
F We co-create
A data insights ecosystem requires collaboration among all parties. We want to be active contributors.
Data Science Principles @ PMI
PMI’s Data Ocean
Why are we here?
Because a data scientist is not just someone who knows more statistics than a programmer
Data Science is Software.
The product of a Data Science effort (a Model or a Report) is essentially a small but critical part of a large, sophisticated piece of business software. Data Products must therefore be designed to play well with systems up- and downstream.
Remember that the system can work without a model, but
a model is pretty much worthless without the system.
Writing code for implementing machine learning
algorithms is getting easier every year.
Building a scikit-learn Pipeline that implements a Random Forest model with grid search takes less than twenty lines of code today. AutoML is around the corner.
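To make the “fewer than twenty lines” point concrete, here is a minimal sketch; the dataset, parameter grid and scoring metric are placeholders, not PMI’s actual setup:
```python
# Minimal sketch of a scikit-learn Pipeline with a Random Forest and grid search.
# Dataset, parameter grid and scoring choice are illustrative only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([("scale", StandardScaler()),
                 ("rf", RandomForestClassifier(random_state=42))])
grid = GridSearchCV(pipe,
                    param_grid={"rf__n_estimators": [100, 300],
                                "rf__max_depth": [None, 10]},
                    cv=5, scoring="roc_auc")
grid.fit(X_train, y_train)
print(grid.best_params_, grid.score(X_test, y_test))
```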
We need to acknowledge and understand two things:
 The code, or even the model, is not our end-goal.
 We're in the business of building intelligent applications, or data products.
Why are we here?
Because a data scientist is not just someone who knows more programming than a statistician
‱ Obtain: connect to DBs, download flat files
‱ Scrub: outliers/missing data, aggregations
‱ Explore: statistical analysis, feature engineering
‱ Model: learning algorithms, parameter optimization
‱ iNdustrialize: reports, APIs
[Diagram: Exploratory vs. Production phases, leading to a Smart Application]
An OSEMN Data Science Process
Explore, Model, Iterate. Create a Data Product.
We define a data product as a system that
 takes raw data as input, đŸ“Č
 applies a machine-learned model to it, đŸ€–
 produces data as output to another system đŸ’»
Additionally, a data product must
 be dynamic and maintainable,
allowing periodic updates 🏃
 be responsive, performant and scalable 👹👹👩👩
What is a Data Product? đŸ€”
In a nutshell, it’s a software product with an ML Engine.
Examples
Amazon’s Product Recommendation Engine
LinkedIn's "People You May Know"
Autonomous Vehicles
[Diagrams: the Classic Data Science Workflow vs. the Data Product Development Workflow]
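To make the definition above concrete, here is a minimal, hypothetical sketch of the core loop of a batch data product – raw data in, a persisted model applied, output written for a downstream system. Paths, feature columns and the model file are illustrative:
```python
# Sketch of a batch data product core: raw data in, model applied, output handed
# to a downstream system. File paths, columns and the model are hypothetical.
from pathlib import Path

import joblib
import pandas as pd

FEATURE_COLUMNS = ["amount", "device_age", "orders_last_30d"]  # hypothetical features

def score_batch(raw_path: Path, model_path: Path, out_path: Path) -> None:
    raw = pd.read_csv(raw_path)                                   # take raw data as input
    model = joblib.load(model_path)                               # apply a machine-learned model
    raw["score"] = model.predict_proba(raw[FEATURE_COLUMNS])[:, 1]
    raw.to_csv(out_path, index=False)                             # produce data for another system

if __name__ == "__main__":
    score_batch(Path("data/raw/transactions.csv"),
                Path("models/fraud_model.pkl"),
                Path("data/processed/transaction_scores.csv"))
```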
Challenges in Data Product Development đŸ€”
“Team programming isn’t a divide and conquer problem.
It is a divide, conquer, and integrate problem.”
1. The Process
Infrastructure Setup > Code > Build > Test >
Package > Release > Monitor
2. The Team
Cross-functional group of businesspeople, data
scientists, engineers and developers.
3. The Challenge
As an example, suppose we have two groups:
 Team A consists of data engineers and data scientists and works on the Prediction Engine.
 Team B consists of software engineers and front-end developers and works on the UI.
The goal is that every piece of the product integrates well into the larger codebase.
Continuous Integration (CI), Delivery (CD) and Deployment
Development practices for overcoming integration challenges and moving faster to delivery
The CI/CD Cycle
 Continuous Integration requires multiple developers to
integrate code into a shared repository frequently.
Requested merges are automatically tested and
reviewed.
 Enabled by git-flow, code standards and
automated testing
 Continuous Delivery makes sure that the code that we
integrate is always in a deploy-ready state.
 Enabled by agile (iterative) methods,
testing and build automation
 Continuous Deployment is the actual act of pushing
updates out to the user – think of your iPhone apps or
Desktop browser that prompt for updates to be installed
periodically.
The Role of Data Scientists
Learn best practices to contribute effectively to data products
Write code that is
 Readable,
so others can understand and add to it
 Testable
so others can verify it does what it advertises
and integrate it into their work
 Reusable
so it may be included in other projects
 Reproducible
uses libraries/packages that are available on
production environments
 Usable
don’t write code in SAS or R,
most engineers don’t speak those languages.
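As a small, hypothetical illustration of these properties, a function like the one below is easy to read, verify and reuse; the name and feature logic are made up for the example:
```python
# Hypothetical example of readable, testable, reusable code: a small, typed,
# documented function with no hidden state, so others can verify and import it.
import pandas as pd

def add_rolling_mean(df: pd.DataFrame, column: str, window: int = 7) -> pd.DataFrame:
    """Return a copy of df with a rolling-mean feature for `column` added."""
    if column not in df.columns:
        raise KeyError(f"column '{column}' not found")   # fail loudly, not silently
    out = df.copy()                                      # never mutate the caller's data
    out[f"{column}_rolling_{window}"] = out[column].rolling(window, min_periods=1).mean()
    return out
```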
Joel’s Tests
 Do you use source control?
 Can you make a build in one step?
 Do you make daily builds?
 Do you have a bug database?
 Do you fix bugs before writing new code?
Data Science Best Practices @ PMI
Python ‱ Style Guides ‱ Notebooks to Modules ‱ Testing ‱ Code Reviews ‱ Docker ‱ Virtual Environments ‱ Version Control ‱ Project Templates
Agile Data Science Workflow
Our building blocks
Ocean Components
To create a workflow that is 

Our Vision
‱ Flexible
Adapts to the specific needs of every use-case; accommodates changing requirements
‱ Inspectable
Transparency at all times; artifacts can be audited at any time
‱ Reproducible
Out-of-the-box dependency management; no more ‘But-it-works-on-my-machine’ or ‘Please-industrialize-this-model’
‱ Easy to use
Frictionless development experience; freedom to experiment
Some things we always need to be mindful of.
Our Principles
 Sensitive Data must never leave the Ocean
 Restricted Open-Source libraries must be avoided
 Every use-case must be industrialization-ready
[Architecture diagram: the DS Prod Lab with on-demand infrastructure, version control, reproducible containers, automation and data read/write, libraries scanned by BlackDuck, all feeding the Data Product]
System Architecture
The dots, connected.
We organize our workflow in 3 phases – Start, Develop and Ship
3 Steps to a Data Product
Start
‱ Get Infrastructure: DS Prod Lab, Docker Container, Python Environments
‱ Get Data: Flat Files, Database Connections
‱ Get Code: Project repo, Cookiecutter template
Develop
‱ Start Docker container
‱ Check out a Branch
‱ For each task in OSEMN, write exploratory code in NBs
‱ Standard Code Styles
‱ Documentation, Tests
‱ Maintain dependencies
‱ Refactor into Modules
‱ Push
‱ Review, Merge
Ship
‱ Package Python code, publish to PyPi on Artifactory
‱ Persist models
‱ Build an API to industrialize the model (a minimal sketch follows below)
‱ Provide endpoints for app-health checks
‱ Set up Jenkins pipeline for continuous integration
‱ Plan for the next iteration
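As an illustration of the Ship phase, here is a minimal serving sketch assuming Flask and a model persisted with joblib; the endpoint names, paths and feature handling are hypothetical, not PMI's actual service:
```python
# Minimal sketch of a model-serving API for the Ship phase. Flask, the model
# path, endpoint names and feature columns are all assumptions for illustration.
from pathlib import Path

import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load(Path("models") / "fraud_model.pkl")        # persisted in the Develop phase
FEATURE_COLUMNS = ["amount", "device_age", "orders_last_30d"]  # hypothetical features

@app.route("/health")
def health():
    # Lightweight endpoint for app-health checks used by the deployment pipeline
    return jsonify(status="ok")

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON list of records; returns one score per record
    records = pd.DataFrame(request.get_json())
    scores = model.predict_proba(records[FEATURE_COLUMNS])[:, 1]
    return jsonify(scores=scores.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```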
For Reproducibility
Docker Containers
Docker for Containerized Data Science
All your dependencies in one place.
Code guaranteed to run anywhere.
A container is a lightweight, stand-alone package of software that includes everything needed to run it: code, runtime, system tools, system libraries and settings.
Containerized software will always run the same, regardless of the
environment.
Benefits for Data Scientists
 Freedom, install all your favorite tools and libraries
 Ease of installation, set up your toolbox once and it will always work
 Reproducibility and Portability,
your development environment can be reproduced anywhere
 Isolation, your Py2 setup doesn’t mess up your Py3 setup, installing
a new library doesn’t mess up system Python
 Speed, get up and running in minutes with images optimized for
specific applications like time-series analysis or deep-learning.
For organization and predictability
Project Templates
CookieCutter
Everything has a place and a purpose
The idea is borrowed from popular web-frameworks like Rails and Django
where each developer uses the same template when starting a new project.
This makes it easier for everyone on the team to figure out where they
would find or put the various moving parts.
We will use a standard project skeleton tailored for data science projects so that every scientist knows where to put their code, notebooks, data, models, figures and references.
Benefits of a standardized directory structure:
 allows people to collaborate more easily
 empowers reproducible analysis
 enforces a "data as immutable" design philosophy
Cookiecutters help us generate this folder structure automatically.
CookieCutter
The standard folder structure enforces a design philosophy for faster delivery
Treat Data as Immutable
Raw data should be stored in data/raw/ and never modified by hand. The code you write should ingest data from data/raw/ and write cleaned or processed data to data/processed/.
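A minimal sketch of that convention (file and column names are hypothetical): code reads from data/raw/ and writes only derived files to data/processed/:
```python
# src/data/make_dataset.py – sketch of the "data as immutable" convention.
# Raw files are read, never overwritten; derived data goes to data/processed/.
from pathlib import Path

import pandas as pd

RAW = Path("data/raw")
PROCESSED = Path("data/processed")

def make_dataset() -> None:
    orders = pd.read_csv(RAW / "orders.csv")          # raw input, never edited by hand
    cleaned = orders.dropna(subset=["customer_id"])   # scrubbing happens in code
    PROCESSED.mkdir(parents=True, exist_ok=True)
    cleaned.to_csv(PROCESSED / "orders_clean.csv", index=False)

if __name__ == "__main__":
    make_dataset()
```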
Reproducibility
Everyone on the team should be able to reproduce your analysis with
 the code in src/
 the data in data/raw/
 the dependencies in Dockerfile, requirements file
Notebooks for Exploration, Scripts for Production Code
Jupyter is great for exploratory analysis, but quite challenging for version
control (they're stored as json files.) Once your code works well, move it
from notebooks/ to src/ and package the functions and classes into
modules.
For being deploy-ready
Moving code from
Notebooks to Source Code
Notebooks for Exploration. Files for Production.
The case against Notebooks
 The main cause of unmaintainable code and bad structure in Data Science is the mixing of exploratory "throw away" code with production code. Notebooks end up being used to write code that ultimately gets deployed to production.
 This is not what notebooks were invented for; they are essentially browser-based shells and presentation tools with charts and code blocks.
 Notebooks lack refactoring and code-structuring tools and are notoriously hard to manage under version control.
Motivation for Organizing Code
 Extract text and plots from notebooks into Markdown Reports for a business audience
 Notebooks with minimal code and clear narrative can be used as Technical Reports
 Move the core functionality into Python modules to speed up subsequent exploration
In the exploratory phase,
the code base is expanded through data analysis, feature
engineering and modelling.
In the refactoring phase,
the most useful results and tools from the exploratory phase are
translated into modules and packages.
The Production Codebase grows across sprints.
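As an illustration of the refactoring phase, a transformation that proved useful in a notebook moves into a module under src/ so later notebooks can simply import it; the module path, function and features are hypothetical:
```python
# src/features/build_features.py – hypothetical module refactored out of a
# notebook. The notebook keeps the narrative; the reusable logic lives here.
import pandas as pd

def add_transaction_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-customer aggregates used during modelling."""
    out = df.copy()
    grouped = out.groupby("customer_id")["amount"]
    out["amount_mean"] = grouped.transform("mean")
    out["amount_std"] = grouped.transform("std")
    return out

# Back in a notebook, exploration becomes a one-liner:
#   from src.features.build_features import add_transaction_features
#   df = add_transaction_features(df)
```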
For integration and deployment
Automated Testing
 If your code is not performing as expected, will you
know?
 If your data are corrupted, do you notice?
 If you re-run your analysis on different data,
are the methods you used still valid?
Automated Testing
“Why do most developers fear to make continuous changes to their code? They are afraid they’ll break it!
Why are they afraid they’ll break it? Because they don’t have tests!”
Two Types of Tests useful for DS
 Unit Testing to make sure individual pieces of code work
 Integration Testing to make sure your code works with everyone else's
Challenge with writing Tests for Data Science
For most software, the output is deterministic – a function for averaging numbers can be unit tested with a simple check that the result is accurate. You can then check your changes in, and integration tests can run against the new build with a fabricated set of results to ensure that everything works as expected.
But not so with Data Science work – the output is probabilistic.
You can't always put in a 2 and 4 and expect a 3 to come out.
Automated Testing for Data Science
 First, implement a Unit Test framework within your code; use pytest or nose
 In some cases, you can set a deterministic value like number of rows or the
expected data type from a function, and write a test for it.
 But if you can't - pick the performance metric (p-value, F1-score, or AUC, etc.)
and check if it lies within an acceptable range.
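A hedged sketch of both kinds of checks using pytest; the imported modules, helper functions and the AUC threshold are hypothetical:
```python
# tests/test_model.py – hypothetical pytest examples combining deterministic
# checks (row count, dtype) with a metric-range check for the probabilistic part.
import pandas as pd
from sklearn.metrics import roc_auc_score

from src.features.build_features import add_transaction_features  # hypothetical module
from src.models.train import load_validation_data, train_model    # hypothetical helpers

def test_feature_builder_keeps_rows_and_types():
    # Deterministic checks: row count and expected data type
    df = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})
    out = add_transaction_features(df)
    assert len(out) == len(df)
    assert out["amount_mean"].dtype == "float64"

def test_model_auc_within_acceptable_range():
    # Probabilistic output: check the metric lies in an acceptable range
    model = train_model()
    X_val, y_val = load_validation_data()
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    assert 0.75 <= auc <= 1.0   # threshold is illustrative
```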
Test-Driven Development (TDD)
“First the developer writes an (initially failing) automated test case that defines a desired improvement or new function, then produces the minimum amount of code to pass that test.” So, before actually writing any code, you should write your tests.
All tests should go into the tests/ subdirectory of the specific package. Write tests in three steps:
 Get/make the input data
 Manually construct the result you expect
 Compare the actual result to the expected correct result
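A minimal sketch of those three steps in TDD style, written before the function exists; the scrub module and its fill_missing_amounts function are hypothetical, and the test fails until they are implemented:
```python
# tests/test_scrub.py – written first, TDD style; it fails until the
# (hypothetical) src.data.scrub.fill_missing_amounts function exists.
import pandas as pd
from pandas.testing import assert_frame_equal

from src.data.scrub import fill_missing_amounts  # does not exist yet

def test_fill_missing_amounts_uses_zero():
    # 1. Get/make the input data
    raw = pd.DataFrame({"amount": [10.0, None, 5.0]})
    # 2. Manually construct the result you expect
    expected = pd.DataFrame({"amount": [10.0, 0.0, 5.0]})
    # 3. Compare the actual result to the expected correct result
    assert_frame_equal(fill_missing_amounts(raw), expected)
```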
In Conclusion
 Engineering smart systems around a machine-learned
core is difficult
 It requires teams of exceptionally talented individuals to
work together.
 What makes data scientists special is their ability to work
with both business leaders and technology experts.
 We must acknowledge that we are a part of something
much bigger and learn to play well with each other and
with all parties involved.
Our hope is that these systems, principles and best practices will help you take the first steps in that direction.
Questions?
