Delivering Agile Data Science on OpenShift
Audrey Reznik
Data Scientist
May 9th, 2019
John Archer
Principal Energy Solution Architect
How to create Instant Business Value
Red Hat Summit May 2019 – Delivering agile data science solutions with OpenShift
MEET THE SPEAKERS
John Archer
Principal Energy Solution Architect
Red Hat since 2015
BEA Systems, BSI Consulting, DocuQuest, Andrews & Kurth, SilverStream, Petris and Oracle
Upstream Data Management, DoD, APIs, eCommerce, IoT, data science and blockchain
SPE, SEG, PPDM, HJUG, HDUG, HAL-PC, Energistics
Audrey Reznik
Data Scientist
Upstream Research Center
ExxonMobil since 2007
Chevron, Akamai, Entriq, Digital Medical Registrar, Spider Technologies, Ziff Davis
DATA SCIENCE TEAM PRESSURES
EXPLOSIVE GROWTH
in data analytics teams and analytic tools
MULTIPLE TEAMS COMPETING
for use of the same storage and computing resources
CONGESTION
in busy analytic clusters causing frustration and missed SLAs
EMERGING DATAOPS
Data Scientist Developers vs Full Stack Developer agility and enablement gaps
What can you envision and share?
NEED: SHARE CODE (PRODUCT) WITH USERS
Jupyter Notebooks: a technology we could use to combine Python code, a GUI, and documentation for sharing with customers.
The start of an interactive Data Science environment.
Red Hat OpenShift PoC at ExxonMobil: could this new technology help us create a Reproducible & Interactive Data Science environment?
Prize: this would let the team quickly obtain customer feedback and readily apply Agile methodology, and therefore deliver MVPs quickly.
Drawback: how does one avoid the setup/configuration issues and reliably deploy the notebook?
PC Setup
• pip install required Anaconda libraries
• Jupyter Notebook, Python 3.x (load onto PC, or set up server)
• Local admin access
• Access to latest source code
• OS? SQL Server
LOCAL PC VS OPENSHIFT PROJECT CONTAINERS
OpenShift project container (Container v2.0)
• Jupyter Notebook, Python 3.x (image)
• Libraries: NumPy, Pandas, Matplotlib, IPyWidgets, SciPy, Lmfit, Seaborn, Plotly
• SQLite
• GIT: image project, code project
• OpenShift: URL to PoC code
Local PC Setup
• pip install required Anaconda libraries
• Jupyter Notebook, Python 3.x (load onto PC, or set up server)
• Local admin access
• Access to latest source code
• OS? SQL Server
Reproducible Data Science environment that users interact with via Chrome.
Hardware freedom & easier reproduction!
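A reproducible image is only useful if the notebook can rely on every listed library being present. The sketch below shows one way a notebook could verify that at start-up. The module names are the usual import names for the libraries on this slide and are an assumption, not something the deck specifies; `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed import names for the libraries listed on the slide; adjust to your image.
REQUIRED_MODULES = [
    "numpy", "pandas", "matplotlib", "ipywidgets",
    "scipy", "lmfit", "seaborn", "plotly", "sqlite3",
]


def missing_modules(modules):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing from this environment:", ", ".join(missing))
    else:
        print("All required libraries are available.")
```

Running this as the first notebook cell gives users an immediate, readable message instead of an ImportError halfway through an analysis.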
For a Data Scientist, the ability to rapidly deploy code and quickly obtain feedback from a user is extremely valuable and Agile! OpenShift facilitates these capabilities!
REPRODUCIBLE & INTERACTIVE SCIENTIFIC ENVIRONMENT
Agile
1. Understand the Problem
2. Suggest Solutions, Deliver POC
3. Refine the Problem
How to Deploy?
• GIT: image project, code project
• Nexus: image, Python (PyPI), security
• OpenShift: URL to PoC code
“Interactive” feedback! As a user I want to provide frequent feedback!
DEPLOY SOURCE CODE WITH SOURCE TO IMAGE (S2I)
• Reusable Data Science applications: data location
• Reusable Data Science images: can they be re-consumed or modified for particular use cases?
• E.g. we have a base Python image that has been modified to provide TensorFlow and scikit-learn for Data Science projects.
• Reusable data access containers: SQL Server, Oracle, PI, SAP HANA.
1. Developer code: Git Repository
2. BUILD APP (OpenShift): Source-to-Image (S2I), Builder Image
3. BUILD IMAGE (OpenShift): Image Registry
4. DEPLOY (OpenShift): deploy Application Container
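The S2I flow above is typically started with `oc new-app`, using the `builder-image~git-url` form to pair a builder image with a source repository. The sketch below only assembles that command line so it can be reviewed or logged before anyone runs it. The builder tag, repository URL and application name are placeholders, and `s2i_new_app_command` is a hypothetical helper rather than part of any OpenShift client library.

```python
import shlex


def s2i_new_app_command(builder_image, git_url, app_name):
    """Build the argument list for `oc new-app <builder>~<repo> --name=<app>`."""
    return ["oc", "new-app", f"{builder_image}~{git_url}", f"--name={app_name}"]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmd = s2i_new_app_command(
        "python:3.6",
        "https://example.com/your-team/notebook-app.git",
        "notebook-app",
    )
    # Print a shell-safe rendering; nothing is executed here.
    print(shlex.join(cmd))
```

Keeping the command as a list makes it straightforward to hand to `subprocess.run` later without shell quoting problems.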
MATURING THE CI/CD PIPELINE
We are seeing an emerging notion of Data ScienceOps workflows. Production CI/CD on OpenShift (OS) is currently in progress.
Challenges we are experiencing include:
1. On-prem databases in different countries
2. Development/deployment in Jupyter notebooks
Pipeline stages:
1. GIT
2. Jenkins: build, test, package
3. Jenkins: archive artifacts in Nexus
4. OS build image, deploy to TEST
5. OS build image, deploy to PROD
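The pipeline on this slide is an ordered sequence in which a later stage should not run if an earlier one fails. As a rough illustration of that fail-fast ordering (not the Jenkins configuration used in the talk), the sketch below runs named stages in order and reports where it stopped. The stage names mirror the slide; the stage functions are hypothetical stand-ins that always succeed.

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Returns (completed_stage_names, failed_stage_name_or_None).
    """
    completed = []
    for name, step in stages:
        if not step():
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    # Hypothetical stand-ins; real stages would call Git, Jenkins, Nexus and OpenShift.
    stages = [
        ("build", lambda: True),
        ("test", lambda: True),
        ("package", lambda: True),
        ("archive artifacts in Nexus", lambda: True),
        ("deploy to TEST", lambda: True),
        ("deploy to PROD", lambda: True),
    ]
    done, failed = run_pipeline(stages)
    print("completed:", done)
    print("failed at:", failed)
```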
MACHINE LEARNING ON OPENSHIFT
Figure 1. liquid estimates. Marco De Mattia
Unique performance computing requirements for Artificial Intelligence, Machine Learning, Neural Networks and GPUs
Multiple Data Science images:
• TensorFlow
• PyTorch
• Scikit-learn
Testing GPU (NVIDIA V100) cluster (OCP). Additional service to internal HPC.
Next Steps: examine RAPIDS.AI – execute end-to-end data science pipelines in the GPU…
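When the same image may land on GPU and non-GPU nodes, it helps to check from inside the container whether a GPU is actually visible before starting a training run. The sketch below is one minimal way to do that, assuming the NVIDIA driver tooling (`nvidia-smi`) is exposed to the container; `list_gpus` is a hypothetical helper, and an empty result only means no GPU was detected this way.

```python
import shutil
import subprocess


def list_gpus():
    """Return the lines printed by `nvidia-smi -L`, or [] if no GPU is visible."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return []  # NVIDIA tooling is not on PATH in this container
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for gpu in gpus:
        print(" ", gpu)
```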
OPENSHIFT GPU PROOF OF CONCEPT (POC)
GPU POC: read & analyze petrophysical data. Use ML algorithms to generate analysis/models on the GPU cluster. Vetted models can be pushed to Azure for deployment.
Workflow components: Data Scientist, User, URL to ML App, ML Algorithms (GIT Repo), Containers, GPUDB, L4 Network, on-prem Database(s)
Figure 2. GPU POC workflow, Audrey Reznik
READY FOR ANY CLOUD – PRIVATE AND PUBLIC
DATA GRAVITY DRIVES THE LOCATION
• OpenShift on premises and in the public cloud (Azure) for Container as a Service (CaaS)
1. CaaS security enabled through AD groups created on premises and DevOps practices
2. Self-service for accessing Data Science packages with network, routing and DNS services
3. Storage can be self-service with PVC or extended with Ceph and OCP storage options
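Self-service storage with a PVC means a team requests capacity by submitting a PersistentVolumeClaim rather than asking an administrator to provision a disk. The sketch below builds such a claim as a plain dictionary and prints it as JSON, which Kubernetes and OpenShift accept alongside YAML. The claim name and size are placeholders, and `pvc_manifest` is a hypothetical helper.

```python
import json


def pvc_manifest(name, size, access_mode="ReadWriteOnce"):
    """Return a PersistentVolumeClaim manifest requesting `size` of storage."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # Placeholder name and size for illustration only.
    print(json.dumps(pvc_manifest("notebook-data", "10Gi"), indent=2))
```

The printed JSON can be saved to a file and submitted with `oc apply -f`; which storage class backs the claim depends on how the cluster is configured.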
Where does your application live? How do you access it? Is my application secure?
Resulting in enabled Data Science Teams:
• Perform more experiments
• Spend less time on plumbing
• Focus on delivering value to ExxonMobil
EXXONMOBIL DATA SCIENCE OS TIMELINE
• Dec 2017: Started with Data Virtualization for Calgary Optimization
• Feb 2018: Containerized JBoss Data Virtualization on OpenShift on premise
• Mar 2018: Spoke with Data Science teams (Python, MATLAB, Julia and R users)
• April 2018: Introduced Graham Dumpleton’s JupyterHub container image
• July 2018: Built “Base” Data Science image (Python 3.x, AI libraries)
• Nov 2018: Built test OCP 3.10 cluster for NVIDIA V100 testing for TensorFlow and Keras
• Dec 2018: Delivered Data Science Workshop on OpenShift to eight different data science teams
• Feb 2019: Data Science developers deliver faster and collaborate globally within 2 months
• Mar 2019: Successfully deployed ODH supporting multiple notebook kernels and GPU
MOVING FORWARD: EXXONMOBIL DATA SCIENCE CAPABILITY TODAY
As a Data Scientist, all I care about is that, using OpenShift, I can now deploy a common Jupyter Notebook / Anaconda image (with all required libraries) in a matter of seconds.
This frees me (and other Data Scientists) to perform data science and not worry about architecture and delivery mechanisms. Now that is Democratizing Data Science!
Selected OpenShift on premises and public cloud for Container as a Service (CaaS)
• OpenShift supports:
• One Click Notebooks and JupyterHub/Lab templates
• Self-service for accessing data & data science packages
• Nexus Repository to allow for Python, Java, R, PHP, .Net package managers
• Docker public repository security built-in process – protects against rooted containers and new CVE attacks
• NVIDIA GPU support allows for sharing these resources across multiple teams
Jupyter Notebook & select conda libraries image being used for Kearl Mining Optimization Studies
DATA SCIENTIST DEVELOPERS NEEDS
All Developers need
● Choice of architectures
● Choice of programming languages
● Choice of databases and persistence
● Choice of application services
● Choice of development tools
● Choice of build and deploy workflows
Data Science Additional Needs
● Access to GPUs and varied storage
● Access to Curated Data
● Automated ScienceOps pipelines
● Collaboration with the Business
● Access to specific data science languages and toolsets
They don’t want to have to worry about the infrastructure.
YOUR DIFFERENTIATION DEPENDS ON YOUR ABILITY TO DELIVER INTELLIGENT APPS FASTER
CONTAINERS, KUBERNETES, DEVOPS & DATAOPS ARE KEY INGREDIENTS
Innovation Culture
Cloud-native Applications
AI & Machine Learning
Internet of Things
Virtual GPU
OPENDATAHUB.IO ARCHITECTURE
CONTAINER STORAGE (CEPH)
CONTAINER HOST (RHEL/RHCOS)
Laptop, Datacenter, OpenStack, AWS, Microsoft Azure, Google Cloud
CONTAINER ORCHESTRATION AND MANAGEMENT (OPENSHIFT)
S3 API Object Store, BLOCK, FILE
GPU, FPGA
APPLICATION LIFE CYCLE MANAGEMENT (OPENSHIFT)
DEVOPS WORKFLOW (CODE & DATA)
API GATEWAY (3SCALE), SERVICE MESH (ISTIO)
SERVERLESS
PRIVATE MICRO SERVICES (CONTAINERIZED CUSTOM APPS)
CONTAINER APPS
PRE-DEFINED AI LIBRARY (BOTS | ANOMALY | CLASSIFICATION | SENTIMENT | …)
AI TOOLCHAIN & WORKFLOW (JUPYTER, SUPERSET, …)
COMMON SERVICES
SERVICE CATALOG & SELF SERVICE UI/CLI
IDENTITY / POLICY (ACCESS, PLACEMENT) / LINEAGE (CODE AND DATA)
MANAGEMENT CONSOLE / INSIGHTS / AIOPS (PROMETHEUS | ELASTIC | …)
FEDERATION
LEGEND: RH Core Platform, OpenShift ALM, Red Hat Middleware, Community & ISV Ecosystem, Technology Roadmap, Customer Content
PYTHON / FLASK, JAVA, JAVASCRIPT, ...
STREAMING (KAFKA - streamzi)
MSG BUS (AMQ)
ANALYTICS (SPARK)
ML (TENSORFLOW | …)
MEMORY CACHE (JDG)
DECISION (BxMS)
HDFS | REDIS | SQL | NoSQL | GRAPHDB | TIMESERIES | ELASTIC | ...
MODERN DATA ANALYTICS PIPELINE
DATA GENERATION
INGEST
DATA SCIENCE
MACHINE LEARNING
STREAM PROCESSING
TRANSFORM, MERGE, JOIN
DATA ANALYTICS
• IoT Telemetry
• G&G - Well Logs
• Transactions
• Production
• NiFi
• Kafka
• MQTT
• Presto
• Impala
• SparkSQL
• Notebooks
• TensorFlow
• PyTorch
• Keras
• scikit-learn
• AutoML*
• Kafka
• MQTT
• WebSockets
• Hadoop
• Spark
• Pandas
• Apache Arrow
• Spark
• Hadoop
CONNECTING THE EDGE TO DATA SCIENTISTS
Architecture Tenets
• Highly scalable, flexible, elastic, microservice-based architecture
• Fully portable: on premises to any public cloud vendor
• Leverages the power and agility of open source software without lock-in
Multitenant: CPU and GPU powered workloads
Data Scientist, Data Managers, Citizen Data Scientist
Cognitive AI: Vision, Speech, Face, Audio, Video, Text
Data: Models, Curation, Prep, Quality, Publishing, Security
AI/ML/Data Science Pods
Packages: Python, R, Jupyter.org, Tensorflow, Keras, Pandas, Bokeh, Dash, Prometheus, Grafana, SciPy, NumPy, SymPy, Julia, Spark, PySpark, Theano, Scikit, FaceDetect
AppDev & App Services and Persistence Pods
App Dev: Node.js, .Net Core, Java, Python, PHP, Ruby, Rails, Javascript, Perl
App Services: Integration, BPM, Rules, Messaging, API, IoT, Microservices, Istio
Persistence: MongoDB, MariaDB, mySQL, Postgres, Couchbase, Redis, MS-SQL, Oracle
SSO and Authentication: OIDC, SAML, OAuth, JWT, Kerberos
DevOps
REST, ODBC, JDBC, WS
Use Cases: Predictive Maintenance, Autonomous Operations, Supply Chain Improvements, Downstream Reliability
IoT “Things”
REST, MQTT, WSS, Kafka
OnPremise, Public Cloud
OSS DATA SCIENCE PROJECTS
● JupyterHub on OpenShift
○ Jupyter notebook, JupyterHub, JupyterLab, OpenShift Templates
● Kubeflow
○ Kube project for Tensorflow, Seldon, JupyterHub/Lab, PyTorch, MPI Operator
● Opendatahub.io
○ Ceph, Spark, JupyterHub/Lab, Tensorflow
○ Simplified Multiple Kernels support
○ GPU Support
○ Resource management and instance culling
● radanalytics.io
○ OpenShift Spark
○ Oshinko - Apache Spark Cluster
○ Spark Operator
HOW CAN I GET STARTED?
● Join OpenShift Commons - ML SIG - https://commons.openshift.org/
● OpenShift Self Service Education - https://learn.openshift.com
● Install Minishift - https://docs.okd.io/latest/minishift/getting-started/installing.html
○ MacOS - brew cask install minishift
○ Manual - https://github.com/minishift/minishift/releases
● Install Jupyter and JupyterHub OpenShift templates
○ https://github.com/jupyter-on-openshift/jupyterhub-quickstart
● Review the OpenDataHub.io project
Delivering Agile Data Science solutions with OpenShift … and providing Business Value!
Delivering Agile Data Science on Openshift  - Red Hat Summit 2019
