BASEL BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Big Data - in the cloud or rather on-premises?
Guido Schmutz
Kassel, 21.9.2017
@gschmutz guidoschmutz.wordpress.com
Guido Schmutz
Working at Trivadis for more than 20 years
Oracle ACE Director for Fusion Middleware and SOA
Consultant, Trainer, Software Architect for Java, Oracle, SOA and
Big Data / Fast Data
Head of Trivadis Architecture Board
Technology Manager @ Trivadis
More than 30 years of software development experience
Contact: guido.schmutz@trivadis.com
Blog: http://guidoschmutz.wordpress.com
Slideshare: http://www.slideshare.net/gschmutz
Twitter: gschmutz
Agenda
1. Cloud Primer
2. Big Data and IoT Architecture
3. Big Data in the Cloud
4. Various Models for Big Data in Cloud
5. Big Data On-Premises
6. Hybrid Big Data Solutions
Cloud Primer
Instance
• the thing running in the cloud provider’s infrastructure
• can be a VM but does not have to be
Instance Type
• the size of the instance (Combination of CPU, Memory, Disk Storage => Cost)
• Azure: Instance sizes
Instance Control
• lifecycle of an instance
• Instances can be stopped or terminated (deleted)
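The difference between stopping and terminating an instance can be sketched as a small state machine. This is a simplified model, not any provider's API: the state names and the `transition` helper are assumptions made for illustration, and real providers expose additional intermediate states.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    TERMINATED = "terminated"


# Assumed lifecycle: a stopped instance can be started again,
# a terminated (deleted) instance cannot.
ALLOWED = {
    State.RUNNING: {State.STOPPED, State.TERMINATED},
    State.STOPPED: {State.RUNNING, State.TERMINATED},
    State.TERMINATED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the lifecycle does not allow the move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


if __name__ == "__main__":
    state = transition(State.RUNNING, State.STOPPED)
    state = transition(state, State.RUNNING)
    state = transition(state, State.TERMINATED)
    print(state.value)
```

In this model, termination is final, so any data that must outlive the instance has to sit on storage that is independent of it.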
Images
• the template used for provisioning an
instance
Serverless
• Run code “without” servers => only
specify functions (Java, C#, Python,
Node.js)
• Pay only for the compute time you
consume
• easy scale-out
• management and capacity-planning
decisions are made by the provider
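As a minimal sketch of what "only specify functions" means, the following Python function has the general shape of a function-as-a-service handler. The name `handler`, the layout of the `event` dictionary, and the `context` argument are assumptions for illustration; each provider defines its own signature and event schema.

```python
import json


def handler(event, context=None):
    """Hypothetical serverless entry point, called by the platform once per event."""
    # Assumed event layout: {"readings": [numbers]}. Not a provider-defined schema.
    readings = event.get("readings", [])
    average = sum(readings) / len(readings) if readings else None
    body = {"count": len(readings), "average": average}
    return {"statusCode": 200, "body": json.dumps(body)}


if __name__ == "__main__":
    # Local invocation; on a real platform the provider supplies event and context.
    print(handler({"readings": [20.5, 21.0, 22.1]}))
```

The function holds no server state between calls, which is what allows the provider to scale it out and to bill only for the time each invocation runs.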
Regions and Availability Zones
• represent the geographic distribution of
the cloud provider
• Regions are the geographic areas
where a service is offered
• Availability Zones (AZ) add high
availability within a Region
• communication within AZs of the same
region costs less than across regions
Cloud Primer – Specific Instances
On-Demand Instance
• flexible, on-demand usage
• billing increment dependent on provider
Temporary Instance
• can disappear at any time (bid price)
• are charged significantly less
• well suited for Hadoop workloads (if storage
and compute are separated)
• AWS: spot instances
Reserved Instance
• reserved capacity in advance
• reduced pricing (up to 75% less than on-demand)
Dedicated Instance
• pay for instances
• run on hardware dedicated to you
• Amazon decides placement
Dedicated Host
• pay for entire physical server
• full flexibility of placement of instances (VM)
• solves issues with existing server-bound licenses
Bare Metal
• bare hardware resources, no virtualization by
cloud provider
• full flexibility / full control
• almost no automation provided
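The trade-off between on-demand and reserved instances listed above can be made concrete with a small calculation. This is a sketch under simplifying assumptions: the hourly rate is invented, the 75% discount is the upper bound quoted above, and a reserved instance is assumed to be billed for every hour of the month whether it is used or not. Real pricing models differ by provider.

```python
HOURS_IN_MONTH = 720      # a 30-day month
ON_DEMAND_RATE = 0.40     # hypothetical price per instance-hour


def instance_cost(hourly_rate, billed_hours, discount=0.0):
    """Cost of the billed hours at the given rate, less a fractional discount."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hourly_rate * billed_hours * (1.0 - discount)


def on_demand_cost(utilization):
    """On-demand: pay only for the fraction of the month the instance runs."""
    return instance_cost(ON_DEMAND_RATE, HOURS_IN_MONTH * utilization)


def reserved_cost(discount=0.75):
    """Reserved (assumption): pay for the whole month at a discounted rate."""
    return instance_cost(ON_DEMAND_RATE, HOURS_IN_MONTH, discount)


def break_even_utilization(discount):
    """Utilization above which the reserved instance becomes cheaper."""
    return 1.0 - discount


if __name__ == "__main__":
    for utilization in (0.10, 0.25, 0.50, 1.00):
        print(utilization, round(on_demand_cost(utilization), 2), round(reserved_cost(), 2))
```

Under these assumptions a 75% discount pays off once the instance runs more than a quarter of the time, which is why reserved capacity suits long-running clusters and on-demand or temporary instances suit short-lived ones.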
Cloud Primer - Storage
Block Storage
• most common type offered by a cloud
provider
• disk-like storage
• comes with each instance when provisioned
• accessed as filesystem mounts => volumes,
disks
• persistent volumes survive beyond the lifetime
of the instance that spawned them
• ephemeral volumes are limited to life of
instance to which they are attached
• AWS: EBS
• Azure: VHDs & Azure File Storage
• Oracle: Block Storage
Object Storage
• each chunk of data is treated as its own
entity independent of any instance
• content of each object is opaque to the
provider
• API or URL is used to access data (no
mount)
• well suited for Big Data
• hot and cold storage options
• AWS: S3 & Glacier
• Azure: Azure Blob Storage
• Oracle: Object Storage & Archive Storage
Cloud Primer – Usage Patterns
Short Lived (Transient)
👍 Minimal maintenance, high efficiency
👎 spin up time, higher resource demand
👎 data transfer to permanent storage
Self-Service
👍 efficiency of on-demand creation
👎 need to maintain tooling
Cloud-Only
👍 data transfer stays within the cloud, minimal
on-premises costs, integration with provider
👎 higher cloud expenditure
Long Lived (Long Running)
👍 less time waiting for clusters to start/stop
👍 lower resource demand
👎 wasted idle time (if any)
👎 maintenance burden, growing size over time
Managed
👍 easy alignment with budget constraints
👎 waiting time for usage, admin effort
Hybrid
👍 lower cloud expenditure, local resources
available
👎 complex workflows, data transfer costs
Big Data & IoT Architecture
Big Data & IoT Reference Architecture
[Figure: Big Data & IoT reference architecture diagram. Labels: Bulk Source (DB, File; Extract, CDC), Event Source (Location, Weather, IoT Data, Mobile Apps), Edge Node, Edge Cluster, Rule Engine, Event Hub, Serverless Processing, Storage (Raw, Refined, Results), Core Processing, Stream Processing, Batch Analytics, Stream Analytics, Parallel Processing, Reference / Models, State / Results, and the consumers BI Tools, Enterprise Data Warehouse, Search / Explore, Enterprise Apps, connected via SQL, Search, Service, Stream, Import and Export.]
Big Data & IoT Reference Architecture
[Figure: the reference architecture diagram divided into deployment zones labelled Edge, Cloud / On-Premises, and Internet / Cloud / On-Premises.]
1) Bulk Source – Bulk Processing
[Figure: reference architecture diagram for this scenario.]
2) Bulk Source - Edge & Bulk Processing
[Figure: reference architecture diagram for this scenario.]
3) Event Source – Stream & Bulk Processing
[Figure: reference architecture diagram for this scenario.]
4) Event Source – Edge & Stream & Bulk Processing
[Figure: reference architecture diagram for this scenario.]
5) Stream Ingestion – Edge & Stream Processing
[Figure: reference architecture diagram for this scenario.]
Big Data & IoT Reference Architecture
[Figure: reference architecture diagram.]
Big Data in the Cloud
Big Data in the Cloud – two usage patterns
Short Lived Cluster (Transient)
data is repurposed, and used for a
specific use case in a specific workload
Cluster spun up only when needed
Flexibility
• spin up arbitrary number of nodes quickly
• Expand quickly from very small to very large
Simplicity
• use as is, solve problem and move on
Long Lived Cluster (Long Running)
data is acquired and augmented
continuously
cluster is in permanent use for mixed
workloads
Performance
• Raw compute performance across wide range
of workloads
• time of availability
BDaaS – Possible Cost Optimizations
Autoscaling
• scale up when a query comes in
• scale down when jobs finish
• match utilization with job demand
• benchmark: auto-scaling saves 33% in
compute costs compared to a statically
sized cluster
Excess capacity
• Hadoop is fault tolerant, can take
advantage of unreliable instances
such as temporary instances
• benchmark: if 50% is done on spot
nodes, save 80% compared to normal
nodes
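To illustrate how matching utilization with job demand reduces compute cost, the sketch below compares node-hours for a static cluster sized for peak load with a cluster that is resized every hour. The demand series, the per-node capacity, and the function name `nodes_needed` are invented for illustration; the resulting saving is a property of this toy data, not a reproduction of the benchmark figures quoted above.

```python
import math

# Hypothetical number of pending tasks observed in each hour of a working window.
HOURLY_DEMAND = [0, 0, 40, 120, 200, 80, 10, 0]
TASKS_PER_NODE = 20   # assumed capacity of one worker node per hour


def nodes_needed(pending_tasks, tasks_per_node, min_nodes=1):
    """Smallest cluster size that covers the demand, never below min_nodes."""
    return max(min_nodes, math.ceil(pending_tasks / tasks_per_node))


# Autoscaled cluster: resize every hour to match demand.
autoscaled = [nodes_needed(d, TASKS_PER_NODE) for d in HOURLY_DEMAND]

# Static cluster: sized once for the peak hour and kept running throughout.
static_size = max(autoscaled)
static_node_hours = static_size * len(HOURLY_DEMAND)
autoscaled_node_hours = sum(autoscaled)
saving = 1 - autoscaled_node_hours / static_node_hours

if __name__ == "__main__":
    print("cluster size per hour:", autoscaled)
    print("static node-hours:", static_node_hours)
    print("autoscaled node-hours:", autoscaled_node_hours)
    print(f"saving: {saving:.1%}")
```

The more uneven the demand, the larger the gap between the two totals; a workload that keeps the cluster busy around the clock gains little from autoscaling.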
Common workload distribution with Big Data applications
Data Locality vs. Compute/Storage Separation
[Figure: two cluster layouts side by side. Data Local Compute: a Master Node and Workers #1 to #3, each with its own Disk and Processing, connected by a Network. Separate Compute and Storage: a Master Node and Compute nodes #1 to #3 (Processing only), connected over the Network to a shared Storage tier made up of Disks.]
Separation of compute
and storage – the
fundamental difference
• store data in Object
Storage instead of DFS
• bring up Compute nodes
only for data processing
• multiple workloads on
separate clusters can
access same data
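The three points above can be sketched with an in-memory stand-in for an object store and two short-lived jobs that read the same data. The `ObjectStore` class, its `put`/`get`/`list` methods, the object keys, and `run_transient_job` are all hypothetical names for illustration and do not correspond to any provider's API.

```python
class ObjectStore:
    """In-memory stand-in for an object storage service (hypothetical)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))


def run_transient_job(store, prefix, process):
    """Simulate a compute cluster that exists only for the duration of one job."""
    results = [process(store.get(key)) for key in store.list(prefix)]
    # The compute side ends here; the data stays in the store.
    return results


store = ObjectStore()
store.put("raw/2017-09-20.csv", b"a,1\nb,2\n")
store.put("raw/2017-09-21.csv", b"c,3\n")

# Two independent workloads, each on its own short-lived "cluster",
# reading the same objects without copying them into a cluster-local DFS.
line_counts = run_transient_job(store, "raw/", lambda data: len(data.splitlines()))
byte_sizes = run_transient_job(store, "raw/", len)

if __name__ == "__main__":
    print(line_counts, byte_sizes)
```

Because neither job owns the data, either one can be scaled, restarted, or removed without affecting the other, which is the property the data-local layout lacks.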
A new way to Manage Big Data
Big Data: Traditional Assumptions
• Bare-metal
• Data Locality
• HDFS on local disks
Big Data: A New Approach
• Containers and VMs
• Compute and storage separation
• Shared storage
Benefits and Value
• Big-Data-as-a-Service
• Agility and cost savings
• Faster time-to-insights
5 ½ ways to get Big Data in the Cloud
1. “Bring your own Hadoop” (MapR, Cloudera, Hortonworks) on Bare Metal
2. “Bring your own Hadoop” (MapR, Cloudera, Hortonworks) on VM
3. Hadoop PaaS from Cloud Provider’s Marketplace
4. Dedicated (Long-Running) Big-Data-as-a-Service
5. Elastic (Transient) Big-Data-as-a-Service (storage and compute
separated)
6. “Cloud on Premises” (Cloud Stack from Vendors on Premises)
Various Models for Big Data in Cloud
1. Bare Metal Cloud (Bring Your Own Hadoop - BYOH)
2. IaaS with any Hadoop Distribution (Bring Your Own Hadoop)
3. PaaS with Hadoop (from Marketplace)
4. Dedicated (Long-Running) BDaaS
5. Elastic (Transient) BDaaS
6. BDaaS + Analytics SaaS
1) Bare Metal Cloud (BYOH)
Compute (Bare Metal)
• Amazon: n.a. (Dedicated Host is close, but runs VMs)
• Azure: n.a. (Dedicated Host is close, but runs VMs)
• Oracle: Oracle Compute
Storage (Bare Metal)
• Amazon: n.a.
• Azure: n.a.
• Oracle: Oracle Block Volume & Object Storage, Data Transfer Service
Big Data (Custom)
• Bring Your Own Hadoop (BYOH)
Analytics (Custom)
• Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Custom (Image-, Speech-Recognition, Bots, …)
2) IaaS (Bring Your Own Hadoop)
Compute (Bare Metal)
• Amazon: Amazon EC2 & EC2
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage (Bare Metal)
• Amazon: S3, EBS, Glacier, Snowball, Snowball Edge, Snowmobile
• Azure: Storage (Blob), Data Lake Store, Import/Export
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data (Custom)
• Amazon, Azure, Oracle: Bring Your Own Hadoop (BYOH)
Analytics (Custom)
• Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Amazon, Azure, Oracle: Custom (Image-, Speech-Recognition, Bots, …)
3) PaaS (Hadoop from Marketplace)
Compute (Bare Metal)
• Amazon: Amazon EC2
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage (Bare Metal)
• Amazon: S3, EBS, Glacier, Snowball, Snowball Edge, Snowmobile
• Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data (Custom)
• Amazon: Hadoop (Hortonworks, MapR)
• Azure: Hadoop (Cloudera, Hortonworks, MapR)
• Oracle: n.a.
Analytics (Custom)
• Amazon, Azure: Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Amazon, Azure: Custom (Image-, Speech-Recognition, Bots, …)
4) Dedicated BDaaS
Compute (Bare Metal)
• Amazon: Amazon EC2
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage (Bare Metal)
• Amazon: S3, EBS, Glacier
• Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data (Custom)
• Amazon: Amazon EMR
• Azure: Azure HDInsight (Hortonworks)
• Oracle: Big Data CS (Cloudera)
Analytics (Custom)
• Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Amazon, Azure, Oracle: Image-, Speech-Recognition, Bots, …
5) Elastic BDaaS
Compute (Bare Metal)
• Amazon: Amazon EC2
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage (Bare Metal)
• Amazon: S3, EBS, Glacier
• Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data (Custom)
• Amazon: Amazon EMR
• Azure: Azure HDInsight (Hortonworks)
• Oracle: Big Data CS Compute Edition (Hortonworks)
Analytics (Custom)
• Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
Intelligence (Custom)
• Amazon, Azure, Oracle: Image-, Speech-Recognition, Bots, …
6) BDaaS + Analytics SaaS
Compute (Bare Metal)
• Amazon: Amazon EC2 & EC2 Dedicated Hosts
• Azure: Azure VM
• Oracle: General Purpose Compute & Dedicated Compute
Storage (Bare Metal)
• Amazon: S3, EBS, Glacier
• Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store
• Oracle: Oracle Object & Archive Storage, Data Transfer Service
Big Data (Custom)
• Amazon: Amazon EMR
• Azure: Azure HDInsight (Hortonworks)
• Oracle: Big Data CS Compute Edition / Big Data CS
Analytics (Custom)
• Amazon: Machine Learning, Polly, …
• Azure: Machine Learning, Data Lake Analytics, …
• Oracle: Big Data Discovery CS, Analytics Cloud, Data Spatial & Graph
Intelligence (Custom)
• Amazon: Alexa, Lex, Polly
• Azure: Cortana, Speech API, Computer Vision API, Video API, ...
• Oracle: n.a.
Oracle Cloud
[Figure: reference architecture diagram with Oracle Cloud services mapped onto its components: IoT CS, Event Hub CS, Stream Analytics, Big Data CS, Big Data CS – Compute, NoSQL CS, Big Data Discovery CS, Big Data Preparation CS, Big Data SQL, Data Spatial & Graph, Object Storage, Archive Storage, Block Storage, Data Transfer Service, GoldenGate, Oracle Data Integrator CS, Integration CS, Messaging CS, Process CS, Mobile CS, Container CS, Application Container CS, Visual Builder, BI CS, Data Visualization CS, Analytics CS.]
Amazon AWS
[Figure: reference architecture diagram with AWS services mapped onto its components: Elastic MapReduce (EMR), Kinesis Streams, Kinesis Firehose, Kinesis Analytics, AWS IoT Platform, Greengrass, Rules Engine, Lambda, S3, Glacier, EBS, EFS, DynamoDB, Redshift, Athena, Elasticsearch, ElastiCache, Glue, Data Pipeline, QuickSight, EC2, Auto Scaling, EC2 Container Service, EC2 Container Registry, Direct Connect, Snowball, Snowball Edge, Snowmobile, ML, TensorFlow, Polly, Lex, Rekognition, Alexa, Mobile Hub, Mobile SDK, Pinpoint, API Gateway, SQS, SNS, Email.]
Microsoft Azure
[Figure: reference architecture diagram with Microsoft Azure services mapped onto its components: HDInsight, Event Hub, Event Grid, Stream Analytics, IoT Suite, IoT Hub, IoT Edge, Functions, Storage Blob, Storage Block, Data Lake Store, Data Lake Analytics, Table Storage, Cosmos DB, Redis Cache, SQL Data Warehouse, Time Series Insights, Import/Export, Machine Learning, Speech API, Vision API, Cortana, Bot Service, Service Bus, Notification Hub, API Management, BizTalk Services, Power BI, Container Service, Container Registry, Container Instances.]
Big Data On-Premises
On-Premises – Oracle Cloud Machine
[Figure: reference architecture diagram with Oracle Cloud Machine services mapped onto its components: IoT CS, Event Hub CS, Stream Analytics, Big Data CS, Big Data CS – Compute, NoSQL CS, Big Data Discovery CS, Big Data SQL, Data Spatial & Graph, Object Storage, Archive Storage, Block Storage, Data Transfer Service, Integration CS, Messaging CS, Process CS, Mobile CS, Container CS, Application Container CS, BI CS.]
On-Premises – Open Source
[Figure: reference architecture diagram for an open-source on-premises stack.]
Hybrid Big Data Solutions
[Figure: reference architecture diagram split into the zones On-Prem/Edge/Internet, Cloud, and On-Prem.]
[Figure: a second split of the reference architecture across the zones On-Prem/Edge/Internet, Cloud, and On-Prem.]
[Figure: a third split of the reference architecture across the zones On-Prem/Edge/Internet, Cloud, and On-Prem.]
[Figure: a split of the reference architecture into the two zones On-Prem/Edge and Cloud.]
Guido Schmutz
Technology Manager
guido.schmutz@trivadis.com
@gschmutz guidoschmutz.wordpress.com

More Related Content

PDF
What is Cloud Computing | Cloud Computing Tutorial | AWS Tutorial | AWS Train...
PDF
Infrastructure Monitoring Maturity: Modeling Technology, Process, & Culture
PPTX
Azure Data Factory for Azure Data Week
PPTX
DBaaS in the Real World: Risks, Rewards & Tradeoffs
PDF
ML Playbook
PDF
Business Data Lake Best Practices
PPTX
ワタシハ Azure Functions チョットデキル
PDF
Data Governance Program Powerpoint Presentation Slides
What is Cloud Computing | Cloud Computing Tutorial | AWS Tutorial | AWS Train...
Infrastructure Monitoring Maturity: Modeling Technology, Process, & Culture
Azure Data Factory for Azure Data Week
DBaaS in the Real World: Risks, Rewards & Tradeoffs
ML Playbook
Business Data Lake Best Practices
ワタシハ Azure Functions チョットデキル
Data Governance Program Powerpoint Presentation Slides

What's hot (20)

PPTX
AZ-900T01 Microsoft Azure Fundamentals-01.pptx
PPTX
IT전략계획- 03.IT 도입계획
PPTX
The roadmap for sql server 2019
PDF
Building a Data Lake on AWS
PPTX
Microsoft Teams Governance and Security Best Practices - Joel Oleson
PPTX
Azure Stack Fundamentals
PPTX
Introduction to PolyBase
PPTX
Microsoft Information Protection demystified Albert Hoitingh
PDF
IT표준화-아키텍처,프로세스-2015.09.30
PPTX
Modernize & Automate Analytics Data Pipelines
PDF
DAS Slides: Data Modeling Case Study — Business Data Modeling at Kiewit
PDF
Microsoft Azure Fundamentals
PDF
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
PDF
Enterprise Architecture vs. Data Architecture
PPTX
Tips & tricks to drive effective Master Data Management & ERP harmonization
PDF
Cloud migration strategies
PPTX
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
PDF
Introduction to Cloud | Cloud Computing Tutorial for Beginners | Cloud Certif...
PDF
Azure fundamentals-170910113238
PPTX
『VMware Cloud on AWS』×『Veeam』移行/データ保護の最適解はこれだ!
AZ-900T01 Microsoft Azure Fundamentals-01.pptx
IT전략계획- 03.IT 도입계획
The roadmap for sql server 2019
Building a Data Lake on AWS
Microsoft Teams Governance and Security Best Practices - Joel Oleson
Azure Stack Fundamentals
Introduction to PolyBase
Microsoft Information Protection demystified Albert Hoitingh
IT표준화-아키텍처,프로세스-2015.09.30
Modernize & Automate Analytics Data Pipelines
DAS Slides: Data Modeling Case Study — Business Data Modeling at Kiewit
Microsoft Azure Fundamentals
The Value of the Modern Data Architecture with Apache Hadoop and Teradata
Enterprise Architecture vs. Data Architecture
Tips & tricks to drive effective Master Data Management & ERP harmonization
Cloud migration strategies
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
Introduction to Cloud | Cloud Computing Tutorial for Beginners | Cloud Certif...
Azure fundamentals-170910113238
『VMware Cloud on AWS』×『Veeam』移行/データ保護の最適解はこれだ!
Ad

Viewers also liked (12)

PDF
Cisco Connect Toronto 2017 - Cloud and On Premises Collaboration Security Exp...
PDF
Internet of Things (IoT) - in the cloud or rather on-premises?
PPTX
GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SC
PDF
OOW16 - Deploying Oracle E-Business Suite for On-Premises Cloud and Oracle Cl...
PPTX
GIS Into to Cloud Microsoft Azure
PPTX
Spatial Cloud Computing And Gis Web Version, Urisa October 2012
PPTX
GIS and the Cloud
PPTX
Cloud GIS Software – GEOCIRRUS
KEY
Cloud GIS - GIS in the Rockies 2011
PDF
How to Build Modern Data Architectures Both On Premises and in the Cloud
PPT
David Overton: GIS in the cloud
PPTX
cloud computing ppt
Cisco Connect Toronto 2017 - Cloud and On Premises Collaboration Security Exp...
Internet of Things (IoT) - in the cloud or rather on-premises?
GIS & Cloud Computing - GAASC 2010 Fall Summit - Florence, SC
OOW16 - Deploying Oracle E-Business Suite for On-Premises Cloud and Oracle Cl...
GIS Into to Cloud Microsoft Azure
Spatial Cloud Computing And Gis Web Version, Urisa October 2012
GIS and the Cloud
Cloud GIS Software – GEOCIRRUS
Cloud GIS - GIS in the Rockies 2011
How to Build Modern Data Architectures Both On Premises and in the Cloud
David Overton: GIS in the cloud
cloud computing ppt
Ad

Similar to Big Data - in the cloud or rather on-premises? (20)

PDF
Fundamentals Big Data and AI Architecture
PPTX
Windowsazureplatform Overviewlatest
PPTX
Windows Azure Platform - Jonathan Wong
PDF
Data Ingestion in Big Data and IoT platforms
PDF
Solving enterprise challenges through scale out storage & big compute final
PPTX
Pieter de Bruin (Microsoft) - Welke technologie gebruiken bij implementatie M...
PPTX
Understanding the Windows Azure Platform - Dec 2010
PDF
Streaming Visualization
PPT
Building Cloud-Native Applications with Microsoft Windows Azure
PPTX
Windows Azure Platform + PHP - Jonathan Wong
PDF
Big Data Analytics from Azure Cloud to Power BI Mobile
PDF
Introduction to Stream Processing
PPTX
Architecting Solutions Leveraging The Cloud
PPTX
Microsoft Partner Roadshow - To the Cloud
PPTX
India Webinar
PPTX
Azure data platform overview
PPTX
Azure Overview Csco
PDF
Introduction to Stream Processing
PPTX
Benefits of the Azure cloud
PPTX
Sky High With Azure
Fundamentals Big Data and AI Architecture
Windowsazureplatform Overviewlatest
Windows Azure Platform - Jonathan Wong
Data Ingestion in Big Data and IoT platforms
Solving enterprise challenges through scale out storage & big compute final
Pieter de Bruin (Microsoft) - Welke technologie gebruiken bij implementatie M...
Understanding the Windows Azure Platform - Dec 2010
Streaming Visualization
Building Cloud-Native Applications with Microsoft Windows Azure
Windows Azure Platform + PHP - Jonathan Wong
Big Data Analytics from Azure Cloud to Power BI Mobile
Introduction to Stream Processing
Architecting Solutions Leveraging The Cloud
Microsoft Partner Roadshow - To the Cloud
India Webinar
Azure data platform overview
Azure Overview Csco
Introduction to Stream Processing
Benefits of the Azure cloud
Sky High With Azure

More from Guido Schmutz (20)

PDF
30 Minutes to the Analytics Platform with Infrastructure as Code
PDF
Event Broker (Kafka) in a Modern Data Architecture
PDF
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
PDF
ksqlDB - Stream Processing simplified!
PDF
Kafka as your Data Lake - is it Feasible?
PDF
Event Hub (i.e. Kafka) in Modern Data Architecture
PDF
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
PDF
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
PDF
Building Event Driven (Micro)services with Apache Kafka
PDF
Location Analytics - Real-Time Geofencing using Apache Kafka
PDF
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
PDF
What is Apache Kafka? Why is it so popular? Should I use it?
PDF
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
PDF
Location Analytics Real-Time Geofencing using Kafka
PDF
Streaming Visualisation
PDF
Kafka as an event store - is it good enough?
PDF
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
PDF
Location Analytics - Real-Time Geofencing using Kafka
PDF
Streaming Visualization
PDF
Streaming Visualization
30 Minutes to the Analytics Platform with Infrastructure as Code
Event Broker (Kafka) in a Modern Data Architecture
Big Data, Data Lake, Fast Data - Dataserialiation-Formats
ksqlDB - Stream Processing simplified!
Kafka as your Data Lake - is it Feasible?
Event Hub (i.e. Kafka) in Modern Data Architecture
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Event Hub (i.e. Kafka) in Modern Data (Analytics) Architecture
Building Event Driven (Micro)services with Apache Kafka
Location Analytics - Real-Time Geofencing using Apache Kafka
Solutions for bi-directional integration between Oracle RDBMS and Apache Kafka
What is Apache Kafka? Why is it so popular? Should I use it?
Solutions for bi-directional integration between Oracle RDBMS & Apache Kafka
Location Analytics Real-Time Geofencing using Kafka
Streaming Visualisation
Kafka as an event store - is it good enough?
Solutions for bi-directional Integration between Oracle RDMBS & Apache Kafka
Location Analytics - Real-Time Geofencing using Kafka
Streaming Visualization
Streaming Visualization


Big Data - in the cloud or rather on-premises?

  • 1. Big Data - in the Cloud or Rather On-Premises? (original German title: Big Data - in der Cloud oder doch lieber On-Premises?) Guido Schmutz, Kassel, 21.9.2017. @gschmutz, guidoschmutz@wordpress.com
  • 2. Guido Schmutz
      • Working at Trivadis for more than 20 years
      • Oracle ACE Director for Fusion Middleware and SOA
      • Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
      • Head of Trivadis Architecture Board
      • Technology Manager @ Trivadis
      • More than 30 years of software development experience
      • Contact: guido.schmutz@trivadis.com
      • Blog: http://guidoschmutz.wordpress.com
      • Slideshare: http://www.slideshare.net/gschmutz
      • Twitter: gschmutz
  • 3. Agenda
      1. Cloud Primer
      2. Big Data and IoT Architecture
      3. Big Data in the Cloud
      4. Various Models for Big Data in Cloud
      5. Big Data On-Premises
      6. Hybrid Big Data Solutions
  • 5. Cloud Primer
    Instance
      • the thing running in the cloud provider’s infrastructure
      • can be a VM but does not have to be
    Instance Type
      • the size of the instance (combination of CPU, memory, disk storage => cost)
      • Azure: Instance sizes
    Instance Control
      • lifecycle of an instance
      • instances can be stopped or terminated (deleted)
  • 6. Cloud Primer
    Images
      • the template used for provisioning an instance
    Serverless
      • run code “without” servers => only specify functions (Java, C#, Python, Node.js)
      • pay only for the compute time you consume
      • easy scale-out
      • management and capacity planning decisions are made by the provider
    Regions and Availability Zones
      • represent the geographic distribution of the cloud provider
      • Regions are the geographic areas where a service is offered
      • Availability Zones (AZ) add high availability within a Region
      • communication within an AZ in the same region costs less than across regions
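The serverless bullets on this slide can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `handler(event, context)` signature follows a convention used by several function services, the event layout is invented, and `billed_cost` assumes a hypothetical pricing model based on memory size times execution time. The slide itself only states that you pay for the compute time you consume.

```python
def handler(event, context=None):
    """Hypothetical function entry point: only the function is specified,
    the provider decides where and on how many servers it runs."""
    readings = event.get("readings", [])
    if not readings:
        return {"count": 0, "average": None}
    return {"count": len(readings), "average": sum(readings) / len(readings)}


def billed_cost(duration_ms, memory_gb, price_per_gb_second):
    """Cost of one invocation under an assumed pay-per-use model:
    memory size * execution time * unit price (all values hypothetical)."""
    return memory_gb * (duration_ms / 1000.0) * price_per_gb_second


if __name__ == "__main__":
    print(handler({"readings": [21.5, 22.0, 22.5]}))
    print(billed_cost(duration_ms=200, memory_gb=0.5, price_per_gb_second=0.00002))
```

No cost accrues while the function is idle, which is the main contrast with an instance that is billed for as long as it runs.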
  • 7. Cloud Primer – Specific Instances
    On-Demand Instance
      • flexible, on-demand usage
      • billing increment depends on the provider
    Temporary Instance
      • can disappear at any time (bid price)
      • is charged significantly less
      • well suited for Hadoop workloads (if storage and compute are separated)
      • AWS: spot instances
    Reserved Instance
      • capacity reserved in advance
      • reduced pricing (up to 75% compared to on-demand)
    Dedicated Instance
      • pay for instances
      • run on hardware dedicated to you
      • Amazon decides placement
    Dedicated Host
      • pay for an entire physical server
      • full flexibility in placing instances (VMs)
      • solves existing server-bound license issues
    Bare Metal
      • bare hardware resources, no virtualization by the cloud provider
      • full flexibility / full control
      • almost no automation provided
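A rough way to compare on-demand and reserved pricing is to ask at what utilization a reservation pays off. The sketch below assumes a simplified model that is not taken from the slides: a reserved instance is billed for every hour of the term at a discounted rate, while an on-demand instance is billed only for the hours used. The hourly rate is a made-up number; the 75% discount is the upper bound quoted on the slide.

```python
HOURS_PER_YEAR = 8760


def on_demand_cost(rate, hours_used):
    """On-demand: pay the full hourly rate, but only for hours actually used."""
    return rate * hours_used


def reserved_cost(rate, hours_in_term, discount):
    """Reserved (simplified assumption): pay the discounted rate for the whole term."""
    return rate * (1.0 - discount) * hours_in_term


def break_even_utilization(discount):
    """Fraction of the term an instance must run before reserving is cheaper."""
    return 1.0 - discount


if __name__ == "__main__":
    rate = 1.0       # hypothetical price per instance hour
    discount = 0.75  # upper bound from the slide
    print(break_even_utilization(discount))
    print(on_demand_cost(rate, hours_used=2000))
    print(reserved_cost(rate, HOURS_PER_YEAR, discount))
```

Under these assumptions a long-running cluster that is busy most of the time favors reserved capacity, while a rarely used one favors on-demand or temporary instances.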
  • 8. Cloud Primer - Storage
    Block Storage
      • most common type offered by a cloud provider
      • disk-like storage
      • comes with each instance when provisioned
      • accessed as filesystem mounts => volumes, disks
      • persistent volumes survive beyond the lifetime of the instance that spawned them
      • ephemeral volumes are limited to the life of the instance to which they are attached
      • AWS: EBS
      • Azure: VHDs & Azure File Storage
      • Oracle: Block Storage
    Object Storage
      • each chunk of data is treated as its own entity, independent of any instance
      • content of each object is opaque to the provider
      • API or URL is used to access data (no mount)
      • well suited for Big Data
      • hot and cold storage options
      • AWS: S3 & Glacier
      • Azure: Azure Blob Storage
      • Oracle: Object Storage & Archive Storage
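The difference between persistent and ephemeral volumes can be shown with a small toy model. The `Volume` and `Instance` classes below are hypothetical and exist only for this illustration; they do not correspond to any provider API.

```python
class Volume:
    """Toy block volume holding key/value data (hypothetical model)."""

    def __init__(self, name, persistent):
        self.name = name
        self.persistent = persistent
        self.data = {}


class Instance:
    """Toy instance with attached volumes."""

    def __init__(self, volumes):
        self.volumes = list(volumes)

    def terminate(self):
        """Terminate the instance: ephemeral volumes lose their data,
        persistent volumes are detached and returned for later reuse."""
        survivors = []
        for volume in self.volumes:
            if volume.persistent:
                survivors.append(volume)
            else:
                volume.data.clear()
        self.volumes = []
        return survivors


if __name__ == "__main__":
    scratch = Volume("scratch", persistent=False)
    results = Volume("results", persistent=True)
    scratch.data["tmp"] = "intermediate shuffle data"
    results.data["report"] = "aggregated output"
    survivors = Instance([scratch, results]).terminate()
    print([v.name for v in survivors], scratch.data)
```

This is why transient clusters need to move their results to persistent or object storage before shutting down.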
  • 9. Cloud Primer – Usage Patterns
    Short Lived (Transient)
      👍 minimal maintenance, high efficiency
      👎 spin-up time, higher resource demand
      👎 data transfer to permanent storage
    Long Lived (Long Running)
      👍 less time waiting for clusters to start/stop
      👍 lower resource demand
      👎 wasted idle time (if any)
      👎 maintenance burden, growing size over time
    Self-Service
      👍 efficiency of on-demand creation
      👎 need to maintain tooling
    Managed
      👍 easy alignment with budget constraints
      👎 waiting time for usage, admin effort
    Cloud-Only
      👍 data transfer stays within the cloud, minimal on-premises costs, integration with provider
      👎 higher cloud expenditure
    Hybrid
      👍 lower cloud expenditure, local resources available
      👎 complex workflows, data transfer costs
  • 10. Big Data & IoT Architecture
  • 11. Big Data & IoT Reference Architecture [Diagram. Labels: Bulk Source, Event Source, Location, DB, Extract, File, Weather, CDC, IoT Data, Mobile Apps, Edge Node, Rule Engine, Event Hub, Edge Cluster, Storage, Raw, Refined, Results, Core Processing, Batch Analytics, Stream Analytics, Parallel Processing, Stream Processing, Reference / Models, Serverless Processing, Stream, State / Results, SQL / Stream, Search, SQL / Export, Service / Stream / Export, Import, BI Tools, Enterprise Data Warehouse, Search / Explore, Enterprise Apps]
  • 12. Big Data & IoT Reference Architecture [Same diagram, with zone labels: Cloud / On-Premises, Edge, Internet / Cloud / On-Premises]
  • 13. 1) Bulk Source – Bulk Processing [Same diagram]
  • 14. 2) Bulk Source - Edge & Bulk Processing [Same diagram]
  • 15. 3) Event Source – Stream & Bulk Processing [Same diagram]
  • 16. 4) Event Source – Edge & Stream & Bulk Processing [Same diagram]
  • 17. 5) Stream Ingestion – Edge & Stream Processing [Same diagram]
  • 18. Big Data & IoT Reference Architecture [Same diagram]
  • 19. Big Data in the Cloud
  • 20. Big Data in the Cloud – two usage patterns
    Short Lived Cluster (Transient)
      • data is repurposed and used for a specific use case in a specific workload
      • cluster is spun up only when needed
      • Flexibility: spin up an arbitrary number of nodes quickly; expand quickly from very small to very large
      • Simplicity: use as is, solve the problem and move on
    Long Lived Cluster (Long Running)
      • data is acquired and augmented continuously
      • cluster is in permanent use for mixed workloads
      • Performance: raw compute performance across a wide range of workloads; time of availability
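The trade-off between the two patterns can be sketched as a back-of-the-envelope cost comparison. The model below is an assumption for illustration only: a transient cluster is billed for each job's run time plus a fixed spin-up time, and a long-lived cluster is billed for the whole period regardless of load. Node counts, rates and job durations are invented.

```python
def transient_cost(job_hours, spin_up_hours, nodes, rate):
    """Transient cluster: pay per job for run time plus spin-up overhead."""
    return sum((hours + spin_up_hours) * nodes * rate for hours in job_hours)


def long_lived_cost(period_hours, nodes, rate):
    """Long-lived cluster: pay for the whole period, including idle time."""
    return period_hours * nodes * rate


if __name__ == "__main__":
    jobs = [2.0, 3.0]  # two jobs per day, durations in hours (hypothetical)
    print(transient_cost(jobs, spin_up_hours=0.5, nodes=10, rate=1.0))
    print(long_lived_cost(period_hours=24, nodes=10, rate=1.0))
```

The more hours per day a cluster is actually busy, the smaller the advantage of the transient pattern becomes, because spin-up overhead is paid on every start.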
  • 21. BDaaS – Possible Cost Optimizations
    Autoscaling
      • scale up when a query comes in
      • scale down when jobs finish
      • match utilization with job demand
      • benchmark: autoscaling saves 33% in compute costs compared to a statically sized cluster
    Excess capacity
      • Hadoop is fault tolerant and can take advantage of unreliable instances such as temporary instances
      • benchmark: if 50% is done on spot nodes, save 80% compared to normal nodes
    [Chart: common workload distribution with Big Data applications]
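Where the autoscaling saving comes from can be sketched with a simple model: a static cluster has to be sized for peak demand and is paid for every hour, while an autoscaled cluster is assumed to follow demand exactly. The demand profile below is invented and is not the workload behind the benchmark figure on the slide; real autoscaling also reacts with some delay, which this sketch ignores.

```python
def static_node_hours(demand):
    """Static cluster: sized for the peak and paid for every hour."""
    return max(demand) * len(demand)


def autoscaled_node_hours(demand):
    """Autoscaled cluster (idealized): node count matches demand each hour."""
    return sum(demand)


def autoscaling_savings(demand):
    """Fraction of node hours saved by autoscaling versus static sizing."""
    return 1.0 - autoscaled_node_hours(demand) / static_node_hours(demand)


if __name__ == "__main__":
    hourly_demand = [2, 2, 4, 8, 8, 4]  # required nodes per hour (hypothetical)
    print(static_node_hours(hourly_demand))
    print(autoscaled_node_hours(hourly_demand))
    print(round(autoscaling_savings(hourly_demand), 3))
```

The saving depends entirely on how spiky the workload is: a flat demand profile yields no saving at all.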
  • 22. Data Locality vs. Compute/Storage Separation
    [Diagram, Data Local Compute: Master Node; Worker #1 to #3, each with Disk and Processing; Network]
    [Diagram, Separate Compute and Storage: Storage with three Disks; Compute #1 to #3, each with Processing; Master Node; Network]
    Separation of compute and storage – the fundamental difference
      • store data in Object Storage instead of DFS
      • bring up Compute nodes only for data processing
      • multiple workloads on separate clusters can access the same data
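The separation of compute and storage can be illustrated with a toy model in which two independent, short-lived clusters read the same data from a shared object store. `ObjectStore` and `TransientCluster` are hypothetical classes written for this sketch; a real setup would use the provider's object storage API instead of an in-memory dictionary.

```python
class ObjectStore:
    """Toy object store: data addressed by key, independent of any cluster."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def keys(self, prefix=""):
        return sorted(k for k in self._objects if k.startswith(prefix))


class TransientCluster:
    """Toy compute cluster that holds no data of its own."""

    def __init__(self, store):
        self.store = store

    def run(self, prefix, func):
        """Apply func to every object under prefix and return the results."""
        return [func(self.store.get(key)) for key in self.store.keys(prefix)]


if __name__ == "__main__":
    store = ObjectStore()
    store.put("raw/sensor-a", [1, 2])
    store.put("raw/sensor-b", [3])
    # Two separate workloads, each on its own cluster, share the same data.
    print(TransientCluster(store).run("raw/", sum))
    print(TransientCluster(store).run("raw/", len))
```

Because the clusters keep no state, either one can be shut down after its job without affecting the data or the other workload.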
  • 23. A new way to Manage Big Data
    Big Data – Traditional Assumptions
      • bare metal
      • data locality
      • HDFS on local disks
    Big Data – A New Approach
      • containers and VMs
      • compute and storage separation
      • shared storage
    Benefits and Value
      • Big-Data-as-a-Service
      • agility and cost savings
      • faster time-to-insights
  • 24. 5 ½ ways to get Big Data in the Cloud
      1. “Bring your own Hadoop” (MapR, Cloudera, Hortonworks) on Bare Metal
      2. “Bring your own Hadoop” (MapR, Cloudera, Hortonworks) on VM
      3. Hadoop PaaS from Cloud Provider’s Marketplace
      4. Dedicated (Long-Running) Big-Data-as-a-Service
      5. Elastic (Transient) Big-Data-as-a-Service (storage and compute separated)
      6. “Cloud on Premises” (Cloud Stack from Vendors on Premises)
  • 25. Various Models for Big Data in Cloud
  • 26. Various Models for Big Data in Cloud
      1. Bare Metal Cloud (Bring Your Own Hadoop - BYOH)
      2. IaaS with any Hadoop Distribution (Bring Your Own Hadoop)
      3. PaaS with Hadoop (from Marketplace)
      4. Dedicated (Long-Running) BDaaS
      5. Elastic (Transient) BDaaS
      6. BDaaS + Analytics SaaS
  • 27. 1) Bare Metal Cloud (BYOH)
      Compute (Bare Metal): Amazon: n.a. (Dedicated Host is close, but runs VMs); Azure: n.a. (Dedicated Host is close, but runs VMs); Oracle: Oracle Compute
      Big Data (Custom): Bring Your Own Hadoop (BYOH)
      Analytics (Custom): Custom (SQL, Machine Learning, ..)
      Storage (Bare Metal): Amazon: n.a.; Azure: n.a.; Oracle: Oracle Block Volume & Object Storage, Data Transfer Service
      Intelligence (Custom): Custom (Image-, Speech-Recognition, Bots, …)
  • 28. 2) IaaS (Bring Your Own Hadoop)
      Compute: Amazon: Amazon EC2 & EC2; Azure: Azure VM; Oracle: General Purpose Compute & Dedicated Compute
      Big Data: Amazon, Azure, Oracle: Bring Your Own Hadoop (BYOH)
      Analytics: Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
      Storage: Amazon: S3, EBS, Glacier, Snowball, Snowball Edge, Snowmobile; Azure: Storage (Blob), Data Lake Store, Import/Export; Oracle: Oracle Object & Archive Storage, Data Transfer Service
      Intelligence: Amazon, Azure, Oracle: Custom (Image-, Speech-Recognition, Bots, …)
  • 29. 3) PaaS (Hadoop from Marketplace)
      Compute: Amazon: Amazon EC2; Azure: Azure VM; Oracle: General Purpose Compute & Dedicated Compute
      Big Data: Amazon: Hadoop (Hortonworks, MapR); Azure: Hadoop (Cloudera, Hortonworks, MapR); Oracle: n.a.
      Analytics: Amazon, Azure: Custom (SQL, Machine Learning, ..)
      Storage: Amazon: S3, EBS, Glacier, Snowball, Snowball Edge, Snowmobile; Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store; Oracle: Oracle Object & Archive Storage, Data Transfer Service
      Intelligence: Amazon, Azure: Custom (Image-, Speech-Recognition, Bots, …)
  • 30. 4) Dedicated BDaaS
      Compute: Amazon: Amazon EC2; Azure: Azure VM; Oracle: General Purpose Compute & Dedicated Compute
      Big Data: Amazon: Amazon EMR; Azure: Azure HDInsight (Hortonworks); Oracle: Big Data CS (Cloudera)
      Analytics: Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
      Storage: Amazon: S3, EBS, Glacier; Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store; Oracle: Oracle Object & Archive Storage, Data Transfer Service
      Intelligence: Amazon, Azure, Oracle: Image-, Speech-Recognition, Bots, …
  • 31. 5) Elastic BDaaS
      Compute: Amazon: Amazon EC2; Azure: Azure VM; Oracle: General Purpose Compute & Dedicated Compute
      Big Data: Amazon: Amazon EMR; Azure: Azure HDInsight (Hortonworks); Oracle: Big Data CS Compute Edition (Hortonworks)
      Analytics: Amazon, Azure, Oracle: Custom (SQL, Machine Learning, ..)
      Storage: Amazon: S3, EBS, Glacier; Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store; Oracle: Oracle Object & Archive Storage, Data Transfer Service
      Intelligence: Amazon, Azure, Oracle: Image-, Speech-Recognition, Bots, …
  • 32. 6) BDaaS + Analytics SaaS
      Compute: Amazon: Amazon EC2 & EC2 Dedicated Hosts; Azure: Azure VM; Oracle: General Purpose Compute & Dedicated Compute
      Big Data: Amazon: Amazon EMR; Azure: Azure HDInsight (Hortonworks); Oracle: Big Data CS Compute Edition / Big Data CS
      Analytics: Amazon: Machine Learning, Polly, …; Azure: Machine Learning, Data Lake Analytics, …; Oracle: Big Data Discovery CS, Analytics Cloud, Data Spatial & Graph
      Storage: Amazon: S3, EBS, Glacier; Azure: Azure Storage (Blob, Block, Disk, File), Azure Data Lake Store; Oracle: Oracle Object & Archive Storage, Data Transfer Service
      Intelligence: Amazon: Alexa, Lex, Polly; Azure: Cortana, Speech API, Computer Vision API, Video API, ...; Oracle: n.a.
  • 33. Oracle Cloud [Reference architecture diagram from slide 11, mapped to services: IoT CS, Event Hub CS, Stream Analytics, Big Data CS, Big Data CS – Compute, NoSQL CS, Big Data Discovery CS, Object Storage, Archive Storage, Block Storage, Data Transfer Service, Data Spatial & Graph, BigData SQL, Integration CS, Messaging CS, BI CS, Process CS, Mobile CS, Container CS, Application Container CS, GoldenGate, Visual Builder, Big Data Preparation CS, Data Visualization CS, Oracle Data Integrator CS, Analytics CS]
  • 34. Amazon AWS [Reference architecture diagram from slide 11, mapped to services: Elastic MapReduce (EMR), Polly, ML, Lex, Rekognition, Kinesis Analytics, Kinesis Streams, Kinesis Firehose, Snowmobile, Snowball, Snowball Edge, AWS IoT Platform, Greengrass, Rules Engine, Lambda, Direct Connect, S3, Glacier, DynamoDB, EC2 Auto Scaling, EBS, EFS, Alexa, Athena, Redshift, EC2 Container Service, EC2 Container Registry, Mobile Hub, Mobile SDK, SQS, SNS, Email, Pinpoint, API Gateway, Elasticsearch, ElastiCache, TensorFlow, Glue, Data Pipeline, QuickSight]
  • 35. Microsoft Azure [Reference architecture diagram from slide 11, mapped to services: HDInsight, Storage Blob, Storage Block, Machine Learning, Data Lake Store, Data Lake Analytics, Event Hub, Stream Analytics, IoT Suite, IoT Hub, IoT Edge, Cosmos DB, Import/Export, Speech API, Vision API, Cortana, Bot Service, Service Bus, Notification Hub, API Management, Power BI, BizTalk Services, SQL Data Warehouse, Table Storage, Redis Cache, Functions, Container Service, Container Registry, Container Instances, Time Series Insights, Event Grid]
  • 37. On-Premises – Oracle Cloud Machine [Reference architecture diagram from slide 11, mapped to services: IoT CS, Event Hub CS, Stream Analytics, Big Data CS, Big Data CS – Compute, NoSQL CS, Big Data Discovery CS, Object Storage, Archive Storage, Block Storage, Data Transfer Service, Data Spatial & Graph, BigData SQL, Integration CS, Messaging CS, BI CS, Process CS, Mobile CS, Container CS, Application Container CS]
  • 38. On Premises – Open Source [Reference architecture diagram from slide 11]
  • 39. Hybrid Big Data Solutions
  • 40. Hybrid Big Data Solutions [Reference architecture diagram from slide 11, with zone labels: Cloud, On-Prem, On-Prem/Edge/Internet]
  • 41. Hybrid Big Data Solutions [Same diagram, with zone labels: Cloud, On-Prem, On-Prem/Edge/Internet]
  • 42. Hybrid Big Data Solutions [Same diagram, with zone labels: Cloud, On-Prem/Edge/Internet, On-Prem]
  • 43. Hybrid Big Data Solutions [Same diagram, with zone labels: Cloud, On-Prem/Edge]