Docker Datascience Pipeline
Running data science models on Docker at ING
DataWorks Summit Berlin, 18-04-2018
Let me introduce myself
2
• Big Data Engineer at ING
• DB2, Oracle, AIX, Linux, Hadoop, Hive, Sqoop, Ansible
• Making it all work on the Exploration Environment
• @chiefware
Why I think you are here
3
You are running data science models
Interested in Docker
Depending on different tools
Asking yourself how to get to production
What is Docker
4
Docker is a tool designed to make it easier to create, deploy, and
run applications by using containers. Containers allow a
developer to package up an application with all of the parts it
needs, such as libraries and other dependencies, and ship it all
out as one package. By doing so, thanks to the container, the
developer can rest assured that the application will run on any
other Linux machine regardless of any customized settings that
machine might have that could differ from the machine used for
writing and testing the code.
Agile way of working in Squads, Chapters and Tribes
5
big data exploration
6
Server: Details about process
Gitlab runner: automation server
Citrix Server: access to the cluster via RDP, browser, PuTTY
Dev node: node for data scientists with tools and xrdp
Hadoop: datanode cluster
Artifactory: pip repo
Gitlab: data science projects
Pipeline
7
Stages: Prototyping, Development, Test, Acceptance, Production
Marked along the way: wait time (3×) and failure (3×)
Docker on big data exploration
8
Gitlab runner: automation server
Citrix Server: access to the cluster via RDP, browser, PuTTY
Dev node: node for data scientists with tools and xrdp
Hadoop: datanode cluster
Openshift: container orchestration
POD/JOB: Docker containers
Artifactory: Docker base images
Gitlab: Dockerfiles and data science projects
Netezza
Hive
Rapid experimentation
Live model
Deployment
Code versioning
Build Docker image(s) through Gitlab CI/Jenkins
Put serialized model in shared directory
Save Docker image(s) to Docker Registry
Move serialized model to HDFS
Data versioning
On each commit
Train and validate model
Setup environment
Docker Registry
Tivoli scheduler triggers
Monithor
HDFS
Add link to file to Monithor
Add model metrics to Monithor
Git repository with Hive queries
Feature engineering
Data exploration
Cookiecutter
Workspace management
Data service
Retraining
Revalidation
Monitoring
Fallback
DS code library
Structured way of keeping track of dev models
Continuous Integration (testing)
Early CD facilitation
Construct pipeline (Pachyderm or APApeline)
Create virtual environment per project
Checkpointing
Phases: Development, Continuous Deployment, Live
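Three of the steps named on this slide (train and validate the model, put the serialized model in a shared directory, record model metrics) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy "model" is just a mean, `pickle` stands in for whatever serialization the projects really use, and a `metrics.json` file stands in for Monithor, whose interface is not described in the slides.

```python
import json
import pickle
import tempfile
from pathlib import Path


def fit(ys):
    """Toy stand-in for real training: the 'model' is just the training mean."""
    return {"mean": sum(ys) / len(ys)}


def predict(model, n):
    """Predict the stored mean for n rows."""
    return [model["mean"]] * n


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def train_validate_publish(train, valid, shared_dir):
    """Train, validate, then write the pickled model and its metrics."""
    model = fit(train)
    metrics = {"mae": mean_absolute_error(valid, predict(model, len(valid)))}

    shared = Path(shared_dir)  # hypothetical shared directory
    shared.mkdir(parents=True, exist_ok=True)

    model_path = shared / "model.pkl"
    model_path.write_bytes(pickle.dumps(model))
    # Assumption: a JSON file replaces the real metrics store (Monithor).
    (shared / "metrics.json").write_text(json.dumps(metrics))
    return model_path, metrics


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path, metrics = train_validate_publish([1.0, 2.0, 3.0], [2.0, 4.0], tmp)
        print(path.name, metrics)
```

In a CI job that runs on each commit, a function like `train_validate_publish` would be the entry point, and a later step would move the resulting file to HDFS.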
Dockerfile
10
Gitlab runner
11
To register run:
gitlab-runner register
gitlab-runner list
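`gitlab-runner register` asks its questions interactively by default. For an automated setup the same registration can be driven without prompts. The Python sketch below only builds and prints the command unless told otherwise. The URL, token, image, and description are placeholders and not taken from the talk, and the Docker executor is an assumption. The flag names are those of GitLab Runner releases from around the time of the talk; newer releases may expect a different token flag.

```python
import shlex
import subprocess


def register_command(url, token, image="python:3.6", description="ds-runner"):
    """Build the argv for a non-interactive `gitlab-runner register` call."""
    return [
        "gitlab-runner", "register",
        "--non-interactive",
        "--url", url,                    # GitLab instance URL (placeholder)
        "--registration-token", token,   # token from the CI settings page
        "--executor", "docker",          # assumption: jobs run in containers
        "--docker-image", image,         # default job image (hypothetical)
        "--description", description,
    ]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(shlex.quote(a) for a in argv))
    if not dry_run:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(register_command("https://gitlab.example.com/", "REPLACE_ME"))
    run(["gitlab-runner", "list"])  # afterwards, confirm the runner is listed
```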
Gitlab runner
12
.gitlab-ci.yml
Spark
13
• Only submit-and-forget works in Docker
• spark-submit --deploy-mode cluster --master yarn
• kinit with your keytab file for Kerberos
• Use pip in a virtual env
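The first three points can be combined into a small launcher that a container runs as its entry point. This is a sketch, not the setup shown in the talk: the keytab path, principal, and application file are hypothetical, and setting `spark.yarn.submit.waitAppCompletion=false` is one possible reading of "submit and forget" (the client exits right after submission instead of waiting for the application to finish).

```python
import shlex


def kinit_command(keytab, principal):
    """argv for obtaining a Kerberos ticket from a keytab (no password prompt)."""
    return ["kinit", "-kt", keytab, principal]


def spark_submit_command(app, app_args=()):
    """argv for a fire-and-forget submit: the driver runs on the YARN cluster."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # Return right after submission instead of polling the application.
        "--conf", "spark.yarn.submit.waitAppCompletion=false",
        app, *app_args,
    ]


if __name__ == "__main__":
    commands = [
        # Hypothetical keytab location and principal.
        kinit_command("/secrets/user.keytab", "user@EXAMPLE.COM"),
        # Hypothetical application and argument.
        spark_submit_command("score_model.py", ["--date", "2018-04-18"]),
    ]
    for argv in commands:
        print(" ".join(shlex.quote(a) for a in argv))
```

The sketch only prints the commands; passing each list to `subprocess.run(argv, check=True)` would execute them in order.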
Openshift
14
PODS and JOBS
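A Job runs one or more pods until they complete, which fits a batch step such as training or scoring a model. As a rough sketch, the manifest for such a Job can be generated from Python and printed as JSON. The job name, image, and command are hypothetical, and `backoffLimit: 0` (no retries) is an assumption rather than something stated in the slides.

```python
import json


def job_manifest(name, image, command):
    """Return a batch/v1 Job manifest that runs one container to completion."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": name, "image": image, "command": list(command)}
                    ],
                    # Jobs accept only Never or OnFailure as restart policy.
                    "restartPolicy": "Never",
                }
            },
            "backoffLimit": 0,  # assumption: do not retry a failed run
        },
    }


if __name__ == "__main__":
    manifest = job_manifest(
        "train-model",                            # hypothetical job name
        "registry.example.com/ds/train:latest",   # hypothetical image
        ["python", "train.py"],
    )
    print(json.dumps(manifest, indent=2))
```

Saved to a file such as `job.json`, the output can be submitted with `oc create -f job.json`.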
Demo Time
15
Considerations
16
Jenkins instead of gitlab-runner
Add GPU nodes to OpenShift
How to use the scheduler tool
How to handle Kerberos files
17
Questions?

Editor's Notes

  • #7: Current state of affairs / area
  • #8: Current state of affairs / area
  • #9: Current state of affairs / area