Docker Datascience Pipeline
Running data science models on Docker at ING
DataWorks Summit San Jose, 19-06-2018
Let me introduce myself
2
• Lennard Cornelis
• Big Data Engineer at ING
• DB2, Oracle, AIX, Linux, Hadoop, Hive, Sqoop, Ansible
• Making it all work on the Exploration Environment
• @chiefware
Why I think you are here
3
• You are running data science models
• You depend on different tools
• You are interested in Docker
• You are asking yourself how to get to production
What is Docker
4
Docker is a tool designed to make it easier to create, deploy, and
run applications by using containers. Containers allow a
developer to package up an application with all of the parts it
needs, such as libraries and other dependencies, and ship it all
out as one package. By doing so, thanks to the container, the
developer can rest assured that the application will run on any
other Linux machine regardless of any customized settings that
machine might have that could differ from the machine used for
writing and testing the code.
Agile way of working in Squads, Chapters and Tribes
5
Big data exploration
6
• Citrix Server: access to the cluster (rdp, browser, putty)
• Dev node: node for data scientists with tools and xrdp
• Hadoop: datanode cluster
• Artifactory: pip repo
• Gitlab: data science projects
• Gitlab runner
• Automation Server
• Details about processServer
Pipeline
7
• Stages: Prototyping, Development, Test, Acceptance, Production
• Wait time and failures occur between stages (each is marked three times on the slide)
Docker on big data exploration
8
• Citrix Server: access to the cluster (rdp, browser, putty)
• Dev node: node for data scientists with tools and xrdp
• Hadoop: datanode cluster
• Openshift: container orchestration
• POD/JOB: Docker containers
• Artifactory: Docker base images
• Gitlab: Dockerfiles and data science projects
• Gitlab runner
• Automation Server
Phases: Development, Continuous Deployment, Live
9
• Netezza
• Hive
• Rapid experimentation
• Live model
• Deployment
• Code versioning
• Build Docker image(s) through Gitlab CI/Jenkins
• Put serialized model in shared directory
• Save Docker image(s) to Docker Registry
• Move serialized model to HDFS
• Data versioning
• On each commit
• Train and validate model
• Setup environment
• Docker Registry
• Tivoli scheduler triggers
• Monithor
• HDFS
• Add link to file to Monithor
• Add model metrics to Monithor
• Git repository with Hive queries
• Feature engineering
• Data exploration
• Cookiecutter
• Workspace management
• Data service
• Retraining
• Revalidation
• Monitoring
• Fallback
• DS code library
• Structured way of keeping track of dev models
• Continuous Integration (testing)
• Early CD facilitation
• Construct pipeline (Pachyderm or APApeline)
• Create virtual environment per project
• Checkpointing
Dockerfile
10
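A minimal sketch of what a Dockerfile for a Python model could look like, written out from a shell script so it is easy to try. The base image `python:3.6-slim`, the files `requirements.txt` and `train.py`, and the tag `my-model` are illustrative assumptions and not taken from the talk.

```shell
#!/bin/sh
# Sketch only: base image, file names and image tag are illustrative assumptions.
set -eu

workdir="$(mktemp -d)"   # throwaway build context

# Write a small Dockerfile that packages a Python model with its dependencies.
cat > "$workdir/Dockerfile" <<'EOF'
FROM python:3.6-slim
WORKDIR /app
# Install dependencies first so this layer is cached between code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY train.py .
CMD ["python", "train.py"]
EOF

echo "Dockerfile written to $workdir/Dockerfile"

# To build and run it (needs Docker, plus requirements.txt and train.py
# next to the Dockerfile):
#   docker build -t my-model:latest "$workdir"
#   docker run --rm my-model:latest
```

Copying `requirements.txt` before the code means a code-only change reuses the cached dependency layer, which keeps rebuilds on each commit short.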
Gitlab runner
11
To register a runner, run:
gitlab-runner register
gitlab-runner list
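The registration can also be scripted instead of answered interactively. The sketch below assumes a Docker executor; the URL, token, default image and description are placeholders, and it only prints the commands unless `DRY_RUN=0` is set. Note that `--registration-token` is the flag from the registration-token workflow; newer GitLab versions use a runner authentication token instead.

```shell
#!/bin/sh
# Sketch only: the URL, token, image and description are placeholders.
set -eu

# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

GITLAB_URL="${GITLAB_URL:-https://gitlab.example.com/}"
RUNNER_TOKEN="${RUNNER_TOKEN:-REPLACE_ME}"

# Register a runner that executes each CI job in a Docker container.
run gitlab-runner register \
  --non-interactive \
  --url "$GITLAB_URL" \
  --registration-token "$RUNNER_TOKEN" \
  --executor docker \
  --docker-image python:3.6-slim \
  --description datascience-docker-runner

# Confirm that the runner is now configured.
run gitlab-runner list
```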
Gitlab runner
12
.gitlab-ci.yml
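As a rough illustration of such a file, the script below writes a two-stage `.gitlab-ci.yml` that builds and pushes an image on each commit and then runs tests inside it. This is a sketch under assumptions: it relies on GitLab's predefined `CI_REGISTRY*` and `CI_COMMIT_SHA` variables (the talk's setup with Artifactory would need its own registry address and credentials), on a runner that can call `docker`, and on a hypothetical `tests/` directory with pytest installed in the image.

```shell
#!/bin/sh
# Sketch only: job names, the tests/ directory and the registry are assumptions.
set -eu

repo_dir="$(mktemp -d)"   # stand-in for the project checkout

# The heredoc is quoted so the $CI_* variables are left for GitLab to expand.
cat > "$repo_dir/.gitlab-ci.yml" <<'EOF'
stages:
  - build
  - test

build_image:
  stage: build
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"

test_model:
  stage: test
  image: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  script:
    - python -m pytest tests/
EOF

echo "Wrote $repo_dir/.gitlab-ci.yml"
```

Tagging the image with the commit SHA ties every image in the registry to the exact code version that produced it.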
Spark
13
• Only submit-and-forget works from Docker (in cluster deploy mode the driver runs on the cluster, not in the container)
• spark-submit --deploy-mode cluster --master yarn
• kinit with your keytab file for Kerberos
• Create a virtual env with conda and zip it
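The Spark points above can be sketched as one script run inside the container. The principal, keytab path, package list and `job.py` are hypothetical placeholders, and the script only prints the commands unless `DRY_RUN=0` is set. Shipping the zipped environment with `--archives` and pointing `PYSPARK_PYTHON` at it is one common way to do this, not necessarily the one used in the talk.

```shell
#!/bin/sh
# Sketch only: principal, keytab path, packages and job.py are placeholders.
set -eu

# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

PRINCIPAL="${PRINCIPAL:-svc_user@EXAMPLE.COM}"
KEYTAB="${KEYTAB:-/etc/security/keytabs/svc_user.keytab}"

# 1. Get a Kerberos ticket from the keytab (no password prompt).
run kinit -kt "$KEYTAB" "$PRINCIPAL"

# 2. Create a conda environment for the project and zip it.
run conda create -y -p ./environment python=3.6 numpy pandas
run sh -c 'cd environment && zip -qr ../environment.zip .'

# 3. Submit in cluster mode and return without waiting for the job to finish.
run spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal "$PRINCIPAL" \
  --keytab "$KEYTAB" \
  --archives "environment.zip#environment" \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python \
  --conf spark.yarn.submit.waitAppCompletion=false \
  job.py
```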
Openshift
14
PODS and JOBS
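A model training run maps naturally onto a Job, which starts a pod, runs the container to completion and does not restart it. The sketch below writes a minimal Job manifest and shows the `oc` commands to start and follow it. The job name, image address and command are hypothetical, and the `oc` commands are only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch only: job name, image address and command are placeholders.
set -eu

# Print commands by default; set DRY_RUN=0 to really execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

manifest="$(mktemp)"

# A Job that runs the model container once and never restarts it.
cat > "$manifest" <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/datascience/my-model:latest
          command: ["python", "train.py"]
EOF

run oc create -f "$manifest"                      # start the job
run oc get pods --selector=job-name=train-model   # find the pod it created
run oc logs -f job/train-model                    # follow the training output
```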
Demo Time
15
Considerations
16
• Jenkins instead of gitlab-runner
• Add GPU nodes to OpenShift
• How to use the scheduler tool
• How to handle Kerberos files
17
Questions?

Editor's Notes

  • #5: Docker is a container technology for Linux that allows a developer to package up an application with all of the parts it needs
  • #7: State of affairs / area
  • #8: State of affairs / area
  • #9: State of affairs / area
  • #15: OpenShift is an open source container application platform by Red Hat based on top of Docker containers and the Kubernetes container cluster manager for enterprise app development and deployment.