Distributed Data Science, DevOps, and Docker

Twitter: @BDaaSmeetup
Hashtag: #BDaaS

Our Sponsor
Big-Data-as-a-Service.
On-Prem, Cloud, or Hybrid. It’s BDaaS.

BDaaS Meetup
•  Welcome and Introductions
•  Presentation by Nanda Vijaydev
–  Distributed Data Science, DevOps, and Docker
•  Q&A and Discussion

Nanda Vijaydev
•  Data scientist and director of solutions at BlueData
•  Prior to BlueData, was a solutions architect at Silicon Valley Data Science
•  More than 10 years of experience in data management and data science
•  Has worked with dozens of organizations to deploy Hadoop, Spark, & data science environments using Docker containers

Distributed Data Science, DevOps, and Docker
BDaaS Meetup
June 9, 2017

Nanda Vijaydev

@NandaVijaydev

Outline
•  Evolution of Data Science Operations
•  Distributed Data Science on Docker
•  Challenges and Key Requirements
•  Demo
•  Key Takeaways
•  Q & A

A pretty picture (ideal workflow)
•  Understand Business Problem
•  Acquire/Collect
•  Analyze/Model
•  Reflect/Evaluate
•  Deploy/Disseminate
A not so pretty picture (workflow in reality)
Data Science Tasks and Roles
•  Data Engineer / Data Scientist
•  Core Data Scientist
•  Statisticians / Data Scientists
•  Data Analyst
•  Data Engineer / Data Scientist

Preferred Language of Analytics Pros
Which do you prefer to use: SAS, R, or Python?
Source: “SAS, R, or Python Survey 2016: Which Tool Do Analytics Pros Prefer?”, Burtch Works, July 2016

Evolution of Data Science Operations
Traditional Data Science & Analytics
•  Sampling
•  Modeling & Tuning
•  Reports (e.g. credit card offer)
Distributed Data Science & Analytics
•  Distributed Systems
•  Acquire Data
•  Model
•  Tune
•  Deploy

What Often Happens …
Faulty Assumptions
•  IT team thinks they understand requirements and use cases
•  Assumes infrastructure & systems will work for most use cases
•  Assumes all data scientists will use similar toolsets
•  Build the infrastructure first, then onboard the data scientists …

A More Realistic Journey
Onboarding data scientists alongside continuous infrastructure provisioning
•  Established use cases: Base-R, SQL, Python, Java
•  Need to analyze higher data volumes: SparkR, PySpark, spark-sql, H2O, Zeppelin
•  Additional modules for Python users: Numpy, Scipy, NLTK, JupyterHub, with Spark
•  R user base is adopting more Big Data: R-Studio, Shiny Server with Spark + H2O
Use cases, requirements, & tools will continue to evolve over time

DevOps for Data Science Operations
Source: Rob Nendorf, Allstate, “DevOps for Data Science”

What Data Science Teams Need:
•  Access to data with full fidelity
•  Ability to quickly iterate & validate findings
•  Access to necessary tools and models
•  Ability to scale environments on-demand
•  Ability to share models and code
•  Ability to deploy and integrate the solution

Data Science – Usage Scenarios
1.  End-to-end analysis on local laptops / workstations using RStudio, Jupyter
2.  Preprocess on Hadoop/Spark, download and analyze locally using RStudio/Jupyter
3.  Preprocess and analyze on Hadoop/Spark

Scaling Options for Data Science

Single node laptops / workstations:
•  Using single node instances with more resources
•  Projects like ff, bigmemory for R

Distributed processing:
•  SparkR/sparklyr with RStudio and Spark cluster
•  Jupyter/Zeppelin notebook with PySpark and Spark cluster
•  Hadoop clusters
•  Sandbox that can be scaled on demand

Accessing Aggregate Data from HDFS
•  Preprocessed or partitioned data can be stored in HDFS
•  Can be accessed directly from RStudio/Jupyter using RHadoop client for aggregation/modeling

Distributed Data Science on Docker

R Environment with RStudio Server

Spark with sparklyr from RStudio
•  Install local Spark if not already available
•  Connect to Spark cluster
•  Set appropriate Spark configurations for optimal performance
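
Spark properties are plain key/value pairs regardless of the client language, so the configuration step can be sketched in Python. This is a minimal sketch, not the presenter's method: the function name, the 1 GB per-worker reserve, and the "2 partitions per core" rule are illustrative assumptions. Only the property names are standard Spark settings.

```python
def spark_conf_for_cluster(workers, cores_per_worker, mem_gb_per_worker, reserve_gb=1):
    """Return Spark properties sized to a cluster.

    Hypothetical helper: the sizing heuristic is an assumption for
    illustration, not a tuning recommendation from the talk.
    """
    if workers < 1 or cores_per_worker < 1:
        raise ValueError("need at least one worker and one core per worker")
    # Leave some memory on each worker for the OS and other daemons (assumed value).
    usable_gb = mem_gb_per_worker - reserve_gb
    if usable_gb < 1:
        raise ValueError("not enough memory per worker after the reserve")
    return {
        "spark.executor.instances": str(workers),
        "spark.executor.cores": str(cores_per_worker),
        "spark.executor.memory": f"{usable_gb}g",
        # Assumed rule of thumb: two shuffle partitions per available core.
        "spark.sql.shuffle.partitions": str(workers * cores_per_worker * 2),
    }


if __name__ == "__main__":
    for key, value in spark_conf_for_cluster(4, 2, 8).items():
        print(f"{key}={value}")
```

The resulting pairs can then be handed to whichever Spark client is in use, for example through PySpark's `SparkConf.setAll`.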

Python Environment with Jupyter

PySpark with Anaconda from Jupyter
•  Users work in their familiar notebooks
•  BlueData provisions multi-node Spark clusters

Environments: Scale Up vs. Scale Out
Layers: UI / Notebooks, Frameworks, Compute, Data
Scale Up
•  R Packages, Python, SQL
•  Local Compute (Laptops)
Scale Out
•  RStudio + R + Spark + sparklyr
•  Jupyter + Python + Spark
•  Zeppelin + R + Python + Spark
•  Spark (SQL, Scala, Python, Java, MLlib) + H2O
•  Spark (SQL, SparkR, Scala, Java, MLlib) + H2O
•  Clusters of Spark nodes (some with R)

Scalable Data Science: Challenges
•  How do you keep up with the constant evolution of new versions and tools?
–  The data science ecosystem is evolving very quickly (e.g. rapid pace of new Spark versions)
–  Related tools (e.g. RStudio, Jupyter, Zeppelin) have to keep pace to support new features
–  New versions of Spark and other tools require different versions of libraries and packages
•  One monolithic cluster won’t cut it … how do you support the variations?
–  Different use cases & users need different options, versions, packages
–  Workloads change … adding new packages or scaling clusters up and down is cumbersome
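
One way to picture "supporting the variations" is a small catalog of environments, each pinning a Spark version, a notebook, and a package set, from which a request is matched. The sketch below assumes this design for illustration. The `Environment` class, the catalog entries' names, and their package lists are hypothetical, and it does not describe how any particular platform implements this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """One pinned combination of tools (hypothetical structure)."""
    name: str
    spark_version: str
    notebook: str
    packages: frozenset


# Illustrative catalog; names and package sets are assumptions.
CATALOG = [
    Environment("spark2-jupyter", "2.0", "Jupyter",
                frozenset({"numpy", "scipy", "nltk"})),
    Environment("spark16-zeppelin", "1.6.1", "Zeppelin",
                frozenset({"numpy"})),
]


def find_environment(spark_version, notebook, packages=()):
    """Return the first catalog entry satisfying the request, or None."""
    needed = frozenset(packages)
    for env in CATALOG:
        if (env.spark_version == spark_version
                and env.notebook == notebook
                and needed <= env.packages):  # all requested packages present
            return env
    return None


if __name__ == "__main__":
    print(find_environment("2.0", "Jupyter", ["nltk"]))
    print(find_environment("1.6.1", "Zeppelin", ["nltk"]))
```

A request that no entry satisfies returns `None`, which is the point where a new image or package set would have to be added.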

Scalable Data Science: Challenges
•  How do you make it easy for your data scientists to get what they need?
–  Data scientists are comfortable with their desktop tools, not distributed computing
–  They need on-demand environments with instant access to their preferred tools and data
•  How do you manage user access for IDEs / notebooks and data sources?
–  Given the different layers of the stack, this can be complex and challenging for enterprises
•  And more … repeatability, elasticity, scalability, security, performance ...

Platform for Scalable Data Science
BlueData EPIC™ Software Platform
•  Users: Data Scientists, Developers, Data Engineers, Data Analysts
•  BI/Analytics Tools, Bring-Your-Own
•  ElasticPlane™ - Self-service, multi-tenant clusters
•  IOBoost™ - Extreme performance and scalability
•  DataTap™ - In-place access to data on-prem or in the cloud
•  Compute: On-Premises, Public Cloud (EC2)
•  Storage: NFS, HDFS, S3

Multi-Tenant Environments
Distributed clusters with Jupyter & Zeppelin notebooks
Links to available services and notebooks

Pre-Integrated Docker-Based Images
Docker-based images of your choice: same for on-prem, AWS, or any cloud

On-Demand Spark + R Environments
Just a few mouse clicks to a fully configured cluster (e.g. with Spark + RStudio Server)

Scalable Data Science with R & Python
Deploy on-demand Spark clusters with
RStudio (sparklyr), Zeppelin, or Jupyter

Spark (via Zeppelin Notebook)
Turnkey Spark clusters on Docker,
with Zeppelin, Jupyter, and SparkR
pre-integrated

Scale to Production (Compute + Data)
Compute: Add worker nodes
Data: Point analytics to storage (HDFS, S3, NFS)
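
"Pointing analytics to storage" usually comes down to changing the URI an analysis reads from while the code stays the same. The sketch below shows that idea under stated assumptions: the `storage_uri` helper and all host, bucket, and mount names are hypothetical, `s3a://` assumes the Hadoop S3A connector, and NFS is assumed to be mounted on the local file system.

```python
def storage_uri(kind, location, path):
    """Build a URI for a dataset on HDFS, S3, or a mounted NFS share.

    Hypothetical helper for illustration only.
    kind:     "hdfs", "s3", or "nfs"
    location: namenode host:port, bucket name, or local mount point
    path:     dataset path within that location
    """
    path = path.lstrip("/")
    if kind == "hdfs":
        return f"hdfs://{location}/{path}"
    if kind == "s3":
        # Assumes the Hadoop S3A connector is available on the cluster.
        return f"s3a://{location}/{path}"
    if kind == "nfs":
        # Assumes the NFS export is mounted at the same path on every node.
        return f"file:///{location.strip('/')}/{path}"
    raise ValueError(f"unsupported storage kind: {kind}")


if __name__ == "__main__":
    print(storage_uri("hdfs", "namenode:8020", "/data/events"))
    print(storage_uri("s3", "my-bucket", "data/events"))
    print(storage_uri("nfs", "/mnt/shared", "data/events"))
```

With a helper like this, moving a notebook from a sample on NFS to the full data set on HDFS or S3 changes one argument and leaves the analysis code untouched.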

Distributed Data Science Operations
•  Users: Data Scientists, Data Analysts, Data Engineers
•  Environments to launch: Spark 2.0 + Jupyter Notebook; Spark 1.6.1 + Zeppelin Notebook; JupyterHub; RStudio; TensorFlow; Hadoop (Hive, M/R); Datameer; Bring Your Own
•  Shared Data, Code, and Results
•  Users & Security
•  Orchestration & Mgmt
•  On-premises or cloud
Comprehensive management of secure, scalable, & reproducible data science environments

Distributed Data Science: Takeaways
•  Operationalizing distributed data science is hard work
–  Unique requirements for access to data, models, tools, etc.
•  Need to bring a DevOps approach to data science operations
–  Support for fast, iterative prototyping and reproducibility
–  Requires ultimate flexibility as tools evolve and new options emerge
•  Leverage a turnkey purpose-built platform (e.g. BlueData EPIC)
–  Bring DevOps agility to distributed data science, powered by Docker
–  Provide ability to share code, models, & data with secure multi-tenancy
–  Enable on-demand environments with a choice of data science tools

Thank You
TRY BLUEDATA EPIC ON AWS
For more information:
www.bluedata.com
sales@bluedata.com
www.bluedata.com/aws

Wrap-Up
•  We’ll share the slides and video recording
•  Suggestions for future meetups?
•  Next meeting TBD – we’ll keep you posted

Thank you for attending!
Thanks to our sponsor:
www.bluedata.com
Hashtag: #BDaaS
