Twitter: @BDaaSmeetup
Hashtag: #BDaaS
Our Sponsor
Big-Data-as-a-Service.
On-Prem, Cloud, or Hybrid. It's BDaaS.
BDaaS Meetup
•  Welcome and Introductions
•  Presentation by Nanda Vijaydev: Distributed Data Science, DevOps, and Docker
•  Q&A and Discussion
Nanda Vijaydev
•  Data scientist and director of solutions at BlueData
•  Prior to BlueData, was a solutions architect at Silicon Valley Data Science
•  More than 10 years of experience in data management and data science
•  Has worked with dozens of organizations to deploy Hadoop, Spark, & data science environments using Docker containers
Distributed Data Science, DevOps, and Docker
BDaaS Meetup
June 9, 2017
Nanda Vijaydev
@NandaVijaydev
Outline
•  Evolution of Data Science Operations
•  Distributed Data Science on Docker
•  Challenges and Key Requirements
•  Demo
•  Key Takeaways
•  Q & A
[Figure: data science workflow. Stages: Understand Business Problem, Acquire/Collect, Analyze/Model, Reflect/Evaluate, Deploy/Disseminate. Captions: "A pretty picture (ideal workflow)" and "A not so pretty picture (workflow in reality)".]
Data Science Tasks and Roles
[Figure: roles mapped to data science tasks. Labels: Data Engineer / Data Scientist; Core Data Scientist; Statisticians / Data Scientists; Data Analyst; Data Engineer / Data Scientist.]
Preferred Language of Analytics Pros
Which do you prefer to use: SAS, R, or Python?
Source: "SAS, R, or Python Survey 2016: Which Tool Do Analytics Pros Prefer?", Burtch Works, July 2016
Evolution of Data Science Operations
Traditional Data Science & Analytics: Sampling, Modeling & Tuning, Reports (e.g. credit card offer)
Distributed Data Science & Analytics: Distributed Systems, Acquire Data, Model, Tune, Deploy
What Often Happens …
Faulty Assumptions
•  IT team thinks they understand requirements and use cases
•  Assumes infrastructure & systems will work for most use cases
•  Assumes all data scientists will use similar toolsets
•  Build the infrastructure first, then onboard the data scientists …
A More Realistic Journey
Onboarding data scientists alongside continuous infrastructure provisioning:
•  Established use cases: Base-R, SQL, Python, Java
•  Need to analyze higher data volumes: SparkR, PySpark, spark-sql, H2O, Zeppelin
•  Additional modules for Python users: Numpy, Scipy, NLTK, JupyterHub, with Spark
•  R user base is adopting more Big Data: RStudio, Shiny Server with Spark + H2O
Use cases, requirements, & tools will continue to evolve over time
DevOps for Data Science Operations
Source: Rob Nendorf, Allstate, "DevOps for Data Science"
What Data Science Teams Need:
•  Access to data with full fidelity
•  Ability to quickly iterate & validate findings
•  Access to necessary tools and models
•  Ability to scale environments on-demand
•  Ability to share models and code
•  Ability to deploy and integrate the solution
Data Science – Usage Scenarios
1.  End-to-end analysis on local laptops / workstations using RStudio, Jupyter
2.  Preprocess on Hadoop/Spark, download and analyze locally using RStudio/Jupyter
3.  Preprocess and analyze on Hadoop/Spark
Scaling Options for Data Science
Single node laptops / workstations:
•  Using single node instances with more resources
•  Projects like ff, bigmemory for R

Distributed processing:
•  SparkR/sparklyr with RStudio and Spark cluster
•  Jupyter/Zeppelin notebook with PySpark and Spark cluster
•  Hadoop clusters
•  Sandbox that can be scaled on demand
Accessing Aggregate Data from HDFS
•  Preprocessed or partitioned data can be stored in HDFS
•  Can be accessed directly from RStudio/Jupyter using RHadoop client for aggregation/modeling
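One way to picture direct access to preprocessed HDFS data from a notebook is through the WebHDFS REST interface. The slide names the RHadoop client; the sketch below swaps in WebHDFS purely for illustration. The NameNode host, port 9870, the `/data/sales/2017` path, and the `analyst` user are assumed values, not details from the talk.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, port: int, path: str, op: str, user: str = "") -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    params = {"op": op}
    if user:
        # user.name is the WebHDFS query parameter used with simple authentication
        params["user.name"] = user
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(params)}"


# Hypothetical NameNode address and partition directory.
list_url = webhdfs_url("namenode.example.com", 9870, "/data/sales/2017", "LISTSTATUS")
open_url = webhdfs_url(
    "namenode.example.com", 9870, "/data/sales/2017/part-00000.csv", "OPEN", user="analyst"
)
print(list_url)
print(open_url)
```

A notebook could fetch `open_url` with any HTTP client and load the result into a data frame for aggregation or modeling.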
Distributed Data Science on Docker
R Environment with RStudio Server
Spark with sparklyr from RStudio
•  Install local Spark if not already available
•  Connect to Spark cluster
•  Set appropriate Spark configurations for optimal performance
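The slide describes these steps for sparklyr in R. A rough Python analogue of the "connect and configure" part might look like the sketch below. The master URL, the executor sizes, and the helper names `spark_settings` and `connect` are illustrative assumptions, and `connect` only works where the pyspark package is installed.

```python
def spark_settings(master: str, executors: int, cores: int, memory_gb: int) -> dict:
    """Collect standard Spark properties for a session."""
    if executors < 1 or cores < 1 or memory_gb < 1:
        raise ValueError("executors, cores, and memory_gb must be positive")
    return {
        "spark.master": master,
        "spark.executor.instances": str(executors),
        "spark.executor.cores": str(cores),
        "spark.executor.memory": f"{memory_gb}g",
    }


def connect(settings: dict, app_name: str = "notebook-session"):
    """Open a Spark session with the given settings (requires pyspark)."""
    # Imported lazily so spark_settings() can be used without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Hypothetical standalone cluster address; suitable sizes depend on the workload.
settings = spark_settings(
    "spark://spark-master.example.com:7077", executors=4, cores=2, memory_gb=8
)
print(settings)
```

Keeping the settings in a plain dictionary makes it easy to review or version them separately from the notebook that opens the session.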
Python Environment with Jupyter
PySpark with Anaconda from Jupyter
•  Users work in their familiar notebooks
•  BlueData provisions multi-node Spark clusters
Environments: Scale Up vs. Scale Out
[Figure: stack layers (UI / Notebooks, Frameworks, Compute, Data) compared for the two approaches.
Scale Up: R Packages, Python, SQL on Local Compute (Laptops).
Scale Out: RStudio + R + Spark + sparklyr; Jupyter + Python + Spark; Zeppelin + R + Python + Spark. Frameworks: Spark (SQL, Scala, Python, Java, MLlib) + H2O; Spark (SQL, SparkR, Scala, Java, MLlib) + H2O. Compute is drawn as a grid of Spark nodes, some also running R.]
Scalable Data Science: Challenges
•  How do you keep up with the constant evolution of new versions and tools?
–  The data science ecosystem is evolving very quickly (e.g. rapid pace of new Spark versions)
–  Related tools (e.g. RStudio, Jupyter, Zeppelin) have to keep pace to support new features
–  New versions of Spark and other tools require different versions of libraries and packages
•  One monolithic cluster won't cut it … how do you support the variations?
–  Different use cases & users need different options, versions, packages
–  Workloads change … adding new packages or scaling clusters up and down is cumbersome
Scalable Data Science: Challenges
•  How do you make it easy for your data scientists to get what they need?
–  Data scientists are comfortable with their desktop tools, not distributed computing
–  They need on-demand environments with instant access to their preferred tools and data
•  How do you manage user access for IDEs / notebooks and data sources?
–  Given the different layers of the stack, this can be complex and challenging for enterprises
•  And more … repeatability, elasticity, scalability, security, performance ...
Platform for Scalable Data Science
BlueData EPIC™ Software Platform
•  IOBoost™ - Extreme performance and scalability
•  ElasticPlane™ - Self-service, multi-tenant clusters
•  DataTap™ - In-place access to data on-prem or in the cloud
Users and tools: Data Scientists, Developers, Data Engineers, Data Analysts, BI/Analytics Tools, Bring-Your-Own
Compute: On-Premises, Public Cloud (EC2)
Storage: On-Premises (NFS, HDFS), Public Cloud (S3)
Multi-Tenant Environments
Distributed clusters with Jupyter & Zeppelin notebooks
Links to available services and notebooks
Pre-Integrated Docker-Based Images
DOCKER-BASED IMAGES OF YOUR CHOICE: SAME FOR ON-PREM, AWS, OR ANY CLOUD
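To make "same image everywhere" concrete, the sketch below assembles a `docker run` command from a small description of an environment. It is only an illustration of the idea, not how BlueData EPIC launches containers. The image name `example/spark-jupyter:2.1`, the `SPARK_MASTER` variable, and the helper `docker_run_args` are hypothetical.

```python
import shlex


def docker_run_args(image: str, name: str, ports=None, env=None, volumes=None) -> list:
    """Return the argument list for launching one container from an image."""
    args = ["docker", "run", "--detach", "--name", name]
    for host_port, container_port in (ports or {}).items():
        args += ["--publish", f"{host_port}:{container_port}"]
    for key, value in (env or {}).items():
        args += ["--env", f"{key}={value}"]
    for host_dir, container_dir in (volumes or {}).items():
        args += ["--volume", f"{host_dir}:{container_dir}"]
    args.append(image)  # the image reference goes last, after all options
    return args


# Hypothetical image and settings; only the environment values would differ
# between an on-prem host and a cloud instance.
args = docker_run_args(
    "example/spark-jupyter:2.1",
    "ds-notebook",
    ports={8888: 8888},
    env={"SPARK_MASTER": "spark://spark-master:7077"},
)
print(shlex.join(args))
```

Because the image reference is the only software input, the same description can be reused on a laptop, an on-prem host, or a cloud instance.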
On-Demand Spark + R Environments
Just a few mouse clicks to a fully configured cluster (e.g. with Spark + RStudio Server)
Scalable Data Science with R & Python
Deploy on-demand Spark clusters with RStudio (sparklyr), Zeppelin, or Jupyter
Spark (via Zeppelin Notebook)
Turnkey Spark clusters on Docker, with Zeppelin, Jupyter, and SparkR pre-integrated
Scale to Production (Compute + Data)
Compute: Add worker nodes
Data: Point analytics to storage (HDFS, S3, NFS)
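"Pointing analytics to storage" often comes down to changing the URI a job reads from while the analysis code stays the same. The sketch below assumes the common Hadoop-style schemes (`hdfs://`, `s3a://`) and treats NFS as a path mounted on every node; the host, bucket, and mount point are made-up examples, and `storage_uri` is a hypothetical helper.

```python
def storage_uri(backend: str, location: str, path: str) -> str:
    """Build a dataset URI for one of the storage types named on the slide."""
    path = path.lstrip("/")
    if backend == "hdfs":
        # location is the NameNode address, e.g. host:port
        return f"hdfs://{location}/{path}"
    if backend == "s3":
        # location is the bucket name; s3a is the Hadoop S3 connector scheme
        return f"s3a://{location}/{path}"
    if backend == "nfs":
        # location is a mount point assumed to exist on every worker node
        return f"file:///{location.strip('/')}/{path}"
    raise ValueError(f"unknown storage backend: {backend}")


# Hypothetical locations for the same logical dataset.
for backend, location in [
    ("hdfs", "namenode.example.com:8020"),
    ("s3", "analytics-bucket"),
    ("nfs", "/mnt/shared"),
]:
    print(storage_uri(backend, location, "events/2017"))
```

Keeping the storage location in one place like this lets a model built against a small sample be rerun against production data without editing the analysis itself.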
Distributed Data Science Operations
[Figure: Data Scientists, Data Analysts, and Data Engineers launch environments from a catalog: Spark 2.0 + Jupyter Notebook; Spark 1.6.1 + Zeppelin Notebook; JupyterHub; RStudio; TensorFlow; Hadoop (Hive, M/R); Datameer; Bring Your Own. Shared layers: Shared Data, Code, and Results; Users & Security; Orchestration & Mgmt. Deployment: On-Premises, Cloud.]
Comprehensive management of secure, scalable, & reproducible data science environments
DEMO
Distributed Data Science: Takeaways
•  Operationalizing distributed data science is hard work
–  Unique requirements for access to data, models, tools, etc.
•  Need to bring a DevOps approach to data science operations
–  Support for fast, iterative prototyping and reproducibility
–  Requires ultimate flexibility as tools evolve and new options emerge
•  Leverage a turnkey purpose-built platform (e.g. BlueData EPIC)
–  Bring DevOps agility to distributed data science, powered by Docker
–  Provide ability to share code, models, & data with secure multi-tenancy
–  Enable on-demand environments with a choice of data science tools
Thank You
TRY BLUEDATA EPIC ON AWS
For more information:
www.bluedata.com
sales@bluedata.com
www.bluedata.com/aws
Q&A
www.bluedata.com
Wrap-Up
•  We'll share the slides and video recording
•  Suggestions for future meetups?
•  Next meeting TBD; we'll keep you posted
Thank you for attending!
Thanks to our sponsor: www.bluedata.com


Editor's Notes

  • #5: Nanda Vijaydev, Director of Solutions, BlueData. Nanda has more than 10 years of experience in data management and data science. At BlueData, Nanda works with Hadoop, Spark, and related technologies to build software solutions for Big Data analytics use cases. She has worked on multiple data science and Big Data projects for large enterprises in the healthcare, media, telecommunications, and other industries. Prior to BlueData, she was a principal solutions architect at Silicon Valley Data Science and director of solutions engineering at Karmasphere. She has an in-depth understanding of data management tools including data integration, ETL, data warehousing, reporting, Hadoop, and Spark.
  • #10: The role of data has changed. It was once an asset critical to monitoring and managing business operations (think of the world of BI and reports); it is now a source of competitive advantage and a strategic enterprise asset that will be the source of new products and services for all organizations. With this change, combined with Big Data technologies and the ability to use more types of data to refine analytics, the traditional waterfall model has quickly evolved into an iterative, closed-loop cycle with an emphasis on continuous improvement. The faster organizations can bring data and teams together and iterate, the more likely they are to gain competitive advantage and create disproportionate value. So how does one bring this analytics agility and velocity? If you want to train a statistical model on very large amounts of data, you'll need three things:
a storage platform capable of holding all of the training data,
a computational platform capable of efficiently performing the heavy-duty mathematical computations required, and
a statistical computing language with algorithms that can take advantage of the storage and computation power.
Traditional: Waterfall/Slow(er); Small(ish) Data; Single Server; Static Results.
Distributed: Iterative/Ongoing; Big/Fast Data; Multiple Servers; Results are 'Big'.
  • #11: Assumes there is an inflection point and the need for previous infrastructure goes away
  • #12: User requirements are on-going and infrastructure need is continuous
  • #13: Data science is not a one-time process. Build and evaluate in a sandbox, then evaluate and deploy at scale. Minimize recoding of the models; it's not sustainable for continuous evaluation. Run environments should mimic build environments at a larger scale.
  • #20: RStudio cluster provisioned on Spark 2.1 in BlueData
  • #22: RStudio cluster provisioned on Spark 2.1 in BlueData
  • #26: Our vision for BlueData EPIC is to provide a single software platform for Big-Data-as-a-Service that supports both on-prem and cloud deployments for Big Data.
  • #33: Rapid provisioning of data science environments at scale (cloud or on-prem). Sharing of data, code, and results (this is a big deal!). Easily customize environments. Reproducibility (environments, results).
  • #35: Leverage a turnkey DevOps platform for Data Science (e.g. BlueData EPIC)
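Note #13 says run environments should mimic build environments at a larger scale. One hedged way to express that idea is to describe an environment once and change only its size when moving from sandbox to production. The class `EnvironmentSpec`, the image name, and the package list below are hypothetical and do not reflect BlueData EPIC's actual data model.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EnvironmentSpec:
    """Software and size of one data science environment."""
    image: str
    spark_version: str
    packages: tuple
    workers: int


def scale_out(spec: EnvironmentSpec, workers: int) -> EnvironmentSpec:
    """Return a copy of the spec with more workers and identical software."""
    if workers < spec.workers:
        raise ValueError("scale_out expects at least the current worker count")
    return replace(spec, workers=workers)


def same_software(a: EnvironmentSpec, b: EnvironmentSpec) -> bool:
    """True when two environments differ only in size."""
    return (a.image, a.spark_version, a.packages) == (b.image, b.spark_version, b.packages)


# Hypothetical sandbox definition; production reuses it with more workers.
sandbox = EnvironmentSpec("example/spark-rstudio:2.1", "2.1", ("sparklyr", "h2o"), workers=1)
production = scale_out(sandbox, workers=20)
print(production)
print(same_software(sandbox, production))
```

Because the spec is immutable, the production environment cannot silently drift from the one the model was built in; any software change requires a new spec.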