A DATA SCIENTIST JOURNEY TO
INDUSTRIALIZATION OF MACHINE
LEARNING MODELS
DataXDay 2018
17th May 2018
@DataXDay
DATA SCIENCE
FOUNDATIONS FOR DATA SCIENCE AT AIR FRANCE
Adoption of Operations
Research for crew
scheduling
Extension to other
business domains:
Revenue Management,
Cargo, Ground
services, …
Adoption of
Hadoop
Focus on Machine
Learning
Ops Research is
now 120 engineers
in Paris and
Amsterdam
Adoption of data science within AFKL IT
was favored by existing Operations Research practice
DATA SCIENCE
MACHINE LEARNING, SPONSORED BY ORGANIZATION
Organization, through Customer Data Management, is one of the key sponsors of
industrialized data science within AFKL
Customer
Data
Management
Customer data
strategy
Customer
knowledge
Personalization
Coordinates IT efforts
DATA SCIENCE
STARTING POINT FOR DATA SCIENCE PROJECT IS A POC LOGIC
DWH
Historical
Data
Business
Intelligence
LOCAL
Data
Sample
Proof of
Concept
DATA SCIENCE
WHAT IS AN « INDUSTRIALIZED » ENGINE?
Jupyter notebook, R → Executable package
On my own → Integrated within AFKL IT live ecosystem
Manual launch or crontab → Automated calibration and prediction
I guess my code is flawless → Unit tested
Theoretical performance → Live feedback on performance
LOCAL
Data
Sample
Proof of
Concept
LIVE
Data feed
DATA SCIENCE
FROM LOCAL STUDIES… TO A ROBUST LIVE DATA PRODUCT
DWH
Historical
Data
Business
Intelligence
EXPLORATION
Historical
Data
Proof of
Concept
MODELS
Repository
Predictions
DATA API
DATA SCIENTISTS X DATA ENGINEERS
Fellowship
DATA SCIENTISTS X DATA ENGINEERS
IT TAKES TWO TO BRING DATA PRODUCTS LIVE (AT LEAST)
PoC
Start of
industrialization
Help!
How to ingest and
expose data?
Live
Product
V1
Translates
business ideas into
data science
Stats,
ML, AI
Data Scientist
Dev,
Big data,
project
architecture
Data Engineer
DATA SCIENTISTS X DATA ENGINEERS
KEEP THE FRONTIER LOOSE
Data scientist and data engineer
are roles, not persons
Awareness of data scientist role on
live environments is key
LIVE
Data feed
DATA SCIENTISTS X DATA ENGINEERS
A LIVE ECOSYSTEM
DWH
Historical
Data
Business
Intelligence
EXPLORATION
Historical
Data
Proof of
Concept
MODELS
Repository
Predictions
DATA API
PACKAGING DATA SCIENCE
Spark and PEX
PACKAGING DATA SCIENCE
WHAT DO YOU EXPECT?
Setup: features engineering + algorithm → model
Train: model + training data → trained model
Predict: trained model + prediction data → predictions
We expect two main functionalities: training and predicting
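The setup / train / predict split above can be sketched as a small interface. This is a minimal illustration, not the engine shown in the talk: the names `Engine`, `TrainedModel`, and the toy threshold algorithm are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]


@dataclass
class TrainedModel:
    """Hypothetical trained artefact: a single decision threshold."""
    threshold: float


class Engine:
    """Hypothetical engine exposing the two expected entry points."""

    def __init__(self, feature: str) -> None:
        # Setup: pick the feature to use and (implicitly) the algorithm.
        self.feature = feature

    def train(self, training_data: Sequence[Tuple[Row, int]]) -> TrainedModel:
        # Toy algorithm: threshold halfway between the two class means.
        pos = [row[self.feature] for row, label in training_data if label == 1]
        neg = [row[self.feature] for row, label in training_data if label == 0]
        threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        return TrainedModel(threshold=threshold)

    def predict(self, model: TrainedModel, prediction_data: Sequence[Row]) -> List[int]:
        # Predict: apply the trained model to new rows.
        return [int(row[self.feature] > model.threshold) for row in prediction_data]
```

Keeping `train` and `predict` as separate calls that exchange an explicit model object is what lets calibration and prediction be scheduled independently.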
PACKAGING DATA SCIENCE
STANDARDIZATION WITH THE PIPELINE PATTERN
Pipeline: Feature Engineering (SQLTransformer, VectorAssembler) + Model training (LogisticRegression.fit(dataset))
Pipeline Model: LogisticRegressionModel.transform(dataset), Dataset → Dataset + Predictions
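To make the pipeline pattern concrete without requiring a Spark cluster, here is a pure-Python sketch of the same fit / transform contract. It is not Spark's API: the classes below only mimic the idea that fitting a pipeline yields a pipeline model, and the column names `booked` and `capacity` are invented for the example.

```python
from typing import Dict, List

Row = Dict[str, float]
Dataset = List[Row]


class AddRatio:
    """Feature-engineering stage (plays the role of SQLTransformer / VectorAssembler)."""

    def transform(self, dataset: Dataset) -> Dataset:
        return [{**row, "ratio": row["booked"] / row["capacity"]} for row in dataset]


class ThresholdModel:
    """Fitted model: appends a prediction column to the dataset."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def transform(self, dataset: Dataset) -> Dataset:
        return [{**row, "prediction": float(row["ratio"] > self.threshold)} for row in dataset]


class ThresholdEstimator:
    """Unfitted algorithm: fit() returns a ThresholdModel."""

    def fit(self, dataset: Dataset) -> ThresholdModel:
        mean = sum(row["ratio"] for row in dataset) / len(dataset)
        return ThresholdModel(mean)


class PipelineModel:
    """Result of Pipeline.fit(): a chain of transformers only."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def transform(self, dataset: Dataset) -> Dataset:
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset


class Pipeline:
    """Chain of stages; estimators are fitted in order on the running dataset."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, dataset: Dataset) -> PipelineModel:
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(dataset)  # estimator becomes a transformer
            dataset = stage.transform(dataset)
            fitted.append(stage)
        return PipelineModel(fitted)
```

Because feature engineering travels inside the fitted pipeline model, the exact same transformations are replayed at prediction time.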
PACKAGING DATA SCIENCE
PEX, JUST LIKE UBERJAR
A PEX bundles the project package with its company packages and external packages into a single file.
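A PEX needs an entry point to run. The sketch below shows one possible shape for it: a command line with `train` and `predict` subcommands. The module layout, argument names, and the placeholder `describe` function are assumptions for illustration, not the packaging used at AFKL.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Command line with the two entry points an engine exposes."""
    parser = argparse.ArgumentParser(prog="engine")
    commands = parser.add_subparsers(dest="command", required=True)

    train = commands.add_parser("train", help="calibrate a model")
    train.add_argument("--training-data", required=True)
    train.add_argument("--model-out", required=True)

    predict = commands.add_parser("predict", help="score new data")
    predict.add_argument("--model", required=True)
    predict.add_argument("--input", required=True)
    return parser


def describe(args: argparse.Namespace) -> str:
    """Placeholder for the real work: only reports what would be run."""
    if args.command == "train":
        return f"train on {args.training_data}, write model to {args.model_out}"
    return f"predict on {args.input} with model {args.model}"


def main(argv: Optional[List[str]] = None) -> int:
    """Entry point that a PEX could be configured to call."""
    args = build_parser().parse_args(argv)
    print(describe(args))
    return 0
```

Assuming this code lived in a hypothetical module `engine.cli`, a build command along the lines of `pex . -e engine.cli:main -o engine.pex` should produce the single file, which other teams could then run as `./engine.pex train ...` without installing any dependency themselves.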
LIVE
Data feed
PACKAGING DATA SCIENCE
A LIVE ECOSYSTEM
DWH
Historical
Data
Business
Intelligence
EXPLORATION
Historical
Data
Proof of
Concept
MODELS
Repository
Predictions
DATA API
LIVE
Data feed
PACKAGING DATA SCIENCE
A LIVE ECOSYSTEM… BUT TRAINING DATA AND LIVE DATA ARE DIFFERENT
DWH
Historical
Data
Business
Intelligence
EXPLORATION
Historical
Data
Proof of
Concept
MODELS
Repository
Predictions
DATA API
FROM DWH TO DATALAKE
A detour
FROM DWH TO DATALAKE
TRAINING DATA MUST BE THE SAME AS PRODUCTION
• The data warehouse holds the full historical data
• The production platform processes only what live apps need from the raw data
• Data processing is not identical on the two sides
• The production platform has to build its own full historical data
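One way to keep training data and production data identical is to route both through a single feature function. The sketch below illustrates that idea only; the event fields (`booking_day`, `departure_day`, `haul`) and function names are invented for the example.

```python
from typing import Dict, Iterable, List

RawEvent = Dict[str, str]
Features = Dict[str, float]


def extract_features(event: RawEvent) -> Features:
    """Single source of truth for feature processing, offline and live."""
    return {
        "lead_days": float(event["departure_day"]) - float(event["booking_day"]),
        "is_long_haul": 1.0 if event["haul"] == "long" else 0.0,
    }


def build_training_set(history: Iterable[RawEvent]) -> List[Features]:
    """Offline path: replay the stored raw events through the shared function."""
    return [extract_features(event) for event in history]


def serve(event: RawEvent) -> Features:
    """Live path: the same function, applied to one incoming event."""
    return extract_features(event)
```

This only works if the platform keeps the raw events, which is why the production side has to accumulate its own history rather than borrow the data warehouse's processed tables.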
LIVE
Data feed
FROM DWH TO DATALAKE
FROM A HISTORICAL/LIVE SYSTEM
DWH
Historical
Data
Business
Intelligence
EXPLORATION
Historical
Data
Proof of
Concept
MODELS
Repository
Predictions
DATA API
LIVE
FROM DWH TO DATALAKE
TO A FULL LIVE SYSTEM
EXPLORATION
Historical
Data
Proof of
Concept
Predictions
Data feed
Historical
Data
DATA API
MODELS
Repository
CONTINUOUS IMPROVEMENT
Growing up
CONTINUOUS IMPROVEMENT
FROM BUD TO FLOWER
• Ease of deploying a new model
• Ease of extracting a new feature
• Ease of accessing new data
• Stay innovative
• Time To Market
CONTINUOUS IMPROVEMENT
CRAFTSMANSHIP FROM DATA SCIENTIST SIDE
CONTINUOUS IMPROVEMENT
CONTINUOUS INTEGRATION
Goal
Make sure no code modification
breaks anything
What to do?
Regularly fetch sources, build project
and run tests
Needs
Tools to automate all tedious
and repetitive tasks
Because we are lazy
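The "run tests" step relies on having unit tests in the first place. Here is a minimal sketch using Python's standard `unittest` module; the feature step `clip_delay` and its cap of 180 minutes are hypothetical, chosen only to have something to test.

```python
import unittest


def clip_delay(minutes: float, cap: float = 180.0) -> float:
    """Hypothetical feature step: negative delays become 0, long ones are capped."""
    return max(0.0, min(minutes, cap))


class ClipDelayTest(unittest.TestCase):
    def test_negative_delay_becomes_zero(self) -> None:
        self.assertEqual(clip_delay(-5.0), 0.0)

    def test_long_delay_is_capped(self) -> None:
        self.assertEqual(clip_delay(500.0), 180.0)

    def test_normal_delay_is_unchanged(self) -> None:
        self.assertEqual(clip_delay(42.0), 42.0)


def run_suite() -> bool:
    """Run the tests programmatically, as a CI job would, and report success."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ClipDelayTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

A CI server would typically run such a suite on every fetch of the sources and refuse to build the package if `run_suite()` reports a failure.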
CONTINUOUS IMPROVEMENT
DATA SCIENTIST - SOFTWARE FACTORY
Development: Exploration
CI: Build PEX
CD: Expose PEX for other IT teams
CONTINUOUS IMPROVEMENT
TRACK MODEL VERSIONING
• Calibration metadata
  • Dataset used
  • Timestamp + code version
• Keep track of the link between models and predictions
  • Model used
  • Unique ID of prediction
  • Input dataset
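The tracking items listed above could be captured in two small records, one per calibration and one per prediction. This is a sketch under assumptions: the record names, fields, and the way the model ID is derived are illustrative choices, not the scheme used in the talk.

```python
import hashlib
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class CalibrationRecord:
    """Metadata stored with each trained model."""
    model_id: str
    dataset_ref: str    # which dataset was used
    code_version: str   # e.g. a commit hash
    trained_at: str     # ISO 8601 timestamp (UTC)


@dataclass(frozen=True)
class PredictionRecord:
    """Link between one prediction, its model, and its input."""
    prediction_id: str
    model_id: str
    input_ref: str
    value: float


def new_calibration(dataset_ref: str, code_version: str) -> CalibrationRecord:
    trained_at = datetime.now(timezone.utc).isoformat()
    # Assumed convention: the model ID is a short hash of the other metadata.
    key = f"{dataset_ref}|{code_version}|{trained_at}".encode()
    model_id = hashlib.sha256(key).hexdigest()[:12]
    return CalibrationRecord(model_id, dataset_ref, code_version, trained_at)


def new_prediction(calibration: CalibrationRecord, input_ref: str, value: float) -> PredictionRecord:
    # Each prediction gets its own unique ID and remembers the model that made it.
    return PredictionRecord(str(uuid.uuid4()), calibration.model_id, input_ref, value)
```

With both records stored, any prediction can be traced back to the model, code version, and dataset that produced it.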
LIVE
CONTINUOUS IMPROVEMENT
FEEDBACK LOOP
EXPLORATION
Historical
Data
Proof of
Concept
MODELS
Repository
Predictions
Data feed
Historical
Data
DATA API
feedback
Metrics
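A feedback loop of this kind boils down to joining stored predictions with observed outcomes and computing a metric. The sketch below assumes predictions and feedback are both keyed by a prediction ID and uses plain accuracy; both choices are illustrative, not taken from the talk.

```python
from typing import Dict


def live_accuracy(predictions: Dict[str, int], feedback: Dict[str, int]) -> float:
    """Share of predictions confirmed by feedback, joined on prediction ID."""
    # Only predictions that already received feedback can be evaluated.
    matched = [pid for pid in predictions if pid in feedback]
    if not matched:
        raise ValueError("no feedback received yet")
    hits = sum(predictions[pid] == feedback[pid] for pid in matched)
    return hits / len(matched)
```

Computed on a schedule, such a metric replaces the theoretical performance measured during the proof of concept with live feedback on performance.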
NEXT STEP
Improve and share best practices
NEXT STEP
TOO MANY JOURNEYS
• How to maintain momentum after a few teams have started the adventure?
• Every team experienced a different journey
• But every team finds different paths
DataXDay - A data scientist journey to industrialization of machine learning
