Deeplearning in production
the Data Engineer part
1
2
whoami
Scauglog
Data Engineer, Xebia
init project
3
Team Astro
Product Owner, Scrum Master, Data Scientists, Data Engineers, Machine Learning Engineers
4
Choose your model
5
And the winner is
XGBoost
6
7
And the winner is
8
What is Deep Learning?
9
What is a Deep Learning model?
10
Choose your Framework
▼ Distributed Prediction
▼ Can create complex networks
▼ Documentation
▼ Community
11
And the winner is
12
Wait
13
And the winner is
14
How to Deep Learn
15
Train Workflow
Input Data → train → model
16
Predict Workflow
Input Data + model → predict → Output Data
17
18
Preprocessing
▼ Scaling (normalisation, min-max, …)
▼ Replace nulls
▼ Lagging (shift past values into the features; see the sketch below)
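As a rough illustration of the three steps above, here is a minimal sketch; the row types, the scaler and the lag handling are hypothetical placeholders, not the project's actual code. The key point is that the scaler fitted during training must be persisted and reused at prediction time.
// Hypothetical types for the sketch
case class RawRow(value: Option[Double])          // input row
case class PreprocessRow(features: Array[Float])  // preprocessed row
case class MinMaxScaler(min: Double, max: Double) {
  def transform(x: Double): Double = (x - min) / (max - min)
}

def preprocess(rows: Seq[RawRow], lag: Int): (Seq[PreprocessRow], MinMaxScaler) = {
  // replace nulls (None) with a default value
  val filled = rows.map(_.value.getOrElse(0.0))
  // fit the scaler on the training data; save it alongside the model for prediction
  val scaler = MinMaxScaler(filled.min, filled.max)
  val scaled = filled.map(scaler.transform)
  // lagging: each output row carries the `lag` previous values as its features
  val lagged = scaled.sliding(lag).map(w => PreprocessRow(w.map(_.toFloat).toArray)).toSeq
  (lagged, scaler)
}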
19
Train Workflow
Input Data → preprocess (fits the scaler) → train → model
20
Predict Workflow
Input Data → preprocess (reuses the scaler) → predict (model) → Output Data
21
Predict at scale
22
Scaling Prediction: naïve approach
// Imports shared by the prediction snippets below
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.rdd.RDD
import org.deeplearning4j.util.ModelSerializer
import org.nd4j.linalg.factory.Nd4j

def predict(input: RDD[PreprocessRow], modelPath: String): RDD[SinglePredictionRow] = {
  input.map { row =>
    // load the model from HDFS again for every single row
    val hdfs = FileSystem.get(new Configuration())
    val source = new Path(modelPath)
    val model = ModelSerializer.restoreComputationGraph(hdfs.open(source), true)
    // make a prediction for this row only
    val prediction = model.output(row.features)(0).getColumn(0).toFloatVector
    // return the prediction
    SinglePredictionRow.fromPreprocessRow(row, prediction(0))
  }
}
23
Scaling Prediction: faster
def predict(input: RDD[PreprocessRow], modelPath: String): RDD[SinglePredictionRow] = {
  // lazy vals: the model is loaded on first access inside each task,
  // i.e. once per task instead of once per row
  lazy val hdfs = FileSystem.get(new Configuration())
  lazy val source = new Path(modelPath)
  lazy val model = ModelSerializer.restoreComputationGraph(hdfs.open(source), true)
  input.map { row =>
    // still one forward pass per row
    val prediction = model.output(row.features)(0).getColumn(0).toFloatVector
    // return the prediction
    SinglePredictionRow.fromPreprocessRow(row, prediction(0))
  }
}
24
Scaling Prediction: fastest
def predict(input: RDD[PreprocessRow], modelPath: String): RDD[SinglePredictionRow] = {
  // load the model lazily, once per task
  lazy val hdfs = FileSystem.get(new Configuration())
  lazy val source = new Path(modelPath)
  lazy val model = ModelSerializer.restoreComputationGraph(hdfs.open(source), true)
  input.mapPartitions { partition =>
    val partitionSeq = partition.toSeq
    if (partitionSeq.isEmpty) {
      val emptySeq: Seq[SinglePredictionRow] = Seq()
      emptySeq.toIterator
    } else {
      // batch the whole partition into a single NDArray and run one forward pass
      val features = partitionSeq.map(_.features).reduce((x, y) => Nd4j.concat(0, x, y))
      val predictions = model.output(features)(0).getColumn(0).toFloatVector
      partitionSeq.zip(predictions).map { case (row, prediction) =>
        SinglePredictionRow.fromPreprocessRow(row, prediction)
      }.toIterator
    }
  }
}
25
OOM
▼ ND4J arrays are C++ off-heap objects
▼ Cache your dataframe or check the stage details to estimate memory usage
▼ Set spark.yarn.executor.memoryOverhead
▼ Use an ND4J workspace to properly manage memory deallocation
▼ Repartition your dataframe before prediction to ensure equally sized partitions
▼ Set spark.sql.shuffle.partitions (see the configuration sketch below)
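A minimal sketch of those settings; the values and the input path are placeholders to adapt to your cluster, not the project's actual configuration.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dl4j-prediction")
  // extra room outside the JVM heap for ND4J's off-heap allocations (in MB)
  .config("spark.yarn.executor.memoryOverhead", "4096")
  // number of partitions after shuffles, hence the batch size handled by each task
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()

// repartition before prediction so that every task gets a comparable number of rows
val preprocessed = spark.read.parquet("/path/to/preprocessed")  // hypothetical path
val balanced = preprocessed.repartition(200)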
26
OOM
def predict(input: RDD[PreprocessRow], modelPath: String, numFeatures: Int, timeSteps: Int): RDD[SinglePredictionRow] = {
  // load model ...
  // ND4J workspace: off-heap memory is reused and released deterministically per scope
  lazy val basicConfig: WorkspaceConfiguration = WorkspaceConfiguration.builder().initialSize(0)
    .policyLearning(LearningPolicy.NONE).policyAllocation(AllocationPolicy.STRICT).build()
  lazy val workspace = Nd4j.getWorkspaceManager.getAndActivateWorkspace(basicConfig, "myWorkspace")
  input.mapPartitions { partition =>
    val partitionSeq = partition.toSeq
    if (partitionSeq.isEmpty) {
      val emptySeq: Seq[SinglePredictionRow] = Seq()
      emptySeq.toIterator
    } else {
      workspace.notifyScopeEntered()
      // build one (batch, features, timeSteps) tensor for the whole partition
      val features = Nd4j.create(partitionSeq.flatMap(_.features).toArray, Array(partitionSeq.size, numFeatures, timeSteps))
      val predictions = model.output(false, workspace, features)(0).toFloatVector
      workspace.notifyScopeLeft()
      partitionSeq.zip(predictions).map { case (row, prediction) =>
        SinglePredictionRow.fromPreprocessRow(row, prediction)
      }.toIterator
    }
  }
}
27
Compile
▼ Maven
▼ -Djavacpp.platform=linux-x86_64 (bundle only the Linux x86_64 native binaries)
▼ Exclude unused modules (see the pom sketch below)
▽ deeplearning4j-datasets
▽ deeplearning4j-datavec-iterators
▽ deeplearning4j-ui-components
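A hedged sketch of the Maven side, assuming these modules come in transitively through deeplearning4j-core; adapt the exclusions to your actual dependency tree. The platform property is passed on the command line, e.g. mvn package -Djavacpp.platform=linux-x86_64.
<!-- Hypothetical pom.xml fragment, not the project's actual build file -->
<dependency>
  <groupId>org.deeplearning4j</groupId>
  <artifactId>deeplearning4j-core</artifactId>
  <version>${dl4j.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-datasets</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-datavec-iterators</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.deeplearning4j</groupId>
      <artifactId>deeplearning4j-ui-components</artifactId>
    </exclusion>
  </exclusions>
</dependency>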
28
Training at scale
29
Retrain again and again and again...
▼ Model performance declines over time
▼ Hyperparameter tuning
▼ A Deep Learning model rarely comes alone (clustering)
30
Predict Workflow
Input Data → preprocess (scaler) → Preprocessed Data → predict (model) → Output Data
31
Train Workflow
Preprocessed Data → train → model
32
Training at scale: AWS EC2
33
Training at scale: AWS EC2
▼ Create a VPC
▼ Create a Subnet associated with the VPC
▼ Create an IGW associated with the VPC
▼ Create a route table associated with the IGW
▼ Create a Security Group associated with the VPC
▽ Authorize SSH only from my IP
▼ Create a key pair
▼ Create an EC2 instance with an EBS volume
34
Training at scale: AWS EC2
▼ Add the SSH keys of team members
▼ Install CUDA, cuDNN, NCCL and configure them
▼ Deploy the training jar to the EC2 instance
▼ Deploy the training pipeline to the EC2 instance
▼ Deploy the preprocessed data to the EC2 instance
▼ Deploy an auto-shutdown script
35
Training at scale: AWS EC2
▼ Ansible
▼ Transfer the preprocessed data to S3
▼ Store the model in S3
▼ Check CPU vs GPU training time
▼ Keep track of training config and performance
▼ Share knowledge with the Data Scientists
▼ Put your data on the EBS volume if it fits
36
Training with DL4J: Lessons learned
▼ Beware of tensor order
▼ Prefetch data in memory (InMemoryDatasetIterator)
▼ Add listeners to monitor your training compute performance
▼ Use the UI (see the sketch below)
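A minimal sketch of the listeners and the training UI, assuming a ComputationGraph named model and the deeplearning4j-ui dependency on the classpath; the reporting frequencies are arbitrary.
import org.deeplearning4j.nn.graph.ComputationGraph
import org.deeplearning4j.optimize.listeners.{PerformanceListener, ScoreIterationListener}
import org.deeplearning4j.ui.api.UIServer
import org.deeplearning4j.ui.stats.StatsListener
import org.deeplearning4j.ui.storage.InMemoryStatsStorage

def monitor(model: ComputationGraph): Unit = {
  val statsStorage = new InMemoryStatsStorage()
  UIServer.getInstance().attach(statsStorage)  // training UI served on port 9000 by default
  model.setListeners(
    new ScoreIterationListener(10),            // log the score every 10 iterations
    new PerformanceListener(10),               // log samples/sec and batches/sec
    new StatsListener(statsStorage)            // feed the training UI
  )
}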
37
Keras to DL4J
def execute(config: Config): Unit = {
  // import the Keras .h5 model (architecture + weights) without enforcing the training config
  val kerasModel = KerasModelImport.importKerasModelAndWeights(config.kerasModelPath, false)
  ModelSaver.writeModel(kerasModel, config.outputModelPath)
}
▼ Data Scientists love Keras
▼ Keras is easier to import in a notebook
▼ Training with Keras is faster
▼ Keras is compatible with cloud training (SageMaker, Cloud ML)
38
Train Workflow
Preprocessed Data → train → model.h5 → Keras to DL4J → model.zip
39
Monitoring
40
Monitoring: mlflow
41
Monitoring: mlflow
42
▼ Ensure your training machine can reach the mlflow server
▼ Keep track of your experiments
▽ Training parameters
▽ Performance
▼ Compare results (a logging sketch follows below)
▼ (model repository, standardized model packaging, easy deployment)
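One way to log from a Scala training job is the MLflow Java client (org.mlflow:mlflow-client); a minimal, hypothetical sketch with placeholder server, experiment, parameters and metric values.
import org.mlflow.tracking.MlflowClient
import org.mlflow.api.proto.Service.RunStatus

// the training machine must be able to reach this tracking server
val client = new MlflowClient("http://mlflow.example.com:5000")
val run = client.createRun("0")                         // "0" = default experiment
client.logParam(run.getRunId, "learning_rate", "0.001") // training parameters
client.logParam(run.getRunId, "epochs", "50")
client.logMetric(run.getRunId, "val_rmse", 0.42)        // performance metric
client.setTerminated(run.getRunId, RunStatus.FINISHED)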
Monitoring: Zeppelin
43
Monitoring: Zeppelin
44
▼ Already in HDP
▼ Authentication
▼ Scheduling
▼ Report View
▼ Auto shutdown
▼ Can mix sources (Scala, JDBC, C*, …)
▼ API to automate deployment
45
Thank you for your attention
Any questions ?
▼ Data Science to Production: https://youtu.be/Gr2SS0xv0xE
▼ DL4J: https://youtu.be/QfnCcPcZogI
▼ Zeppelin: https://youtu.be/w78gZW6BQJI
▼ cloudML: https://youtu.be/oDpBRdjwNik
46