Building a cloud based managed BigData platform
Hemanth Yamijala
Lead Consultant - ThoughtWorks
yhemanth@thoughtworks.com
@yhemanth
Pillars in a BigData Solution
(Diagram: BigData resting on three pillars: Infrastructure, Data, Process)
A Managed Platform
(Diagram: BigData on top of a Services layer, which spans the Infrastructure, Data and Process pillars)
Reuse infrastructure
• Consolidate cluster resources
• Saves capacity cost
• Saves operational cost
• Enforce common access control and security measures
Reuse Data
• Democratize data assets
• Assist self service, discoverability
• Conventions based organization of data
• Enforce access policies
Reuse process
• Build common processing frameworks or libraries
• Ingest and Extract can be centralized services
• Frameworks can be developed for ETL processes, workflows, etc.
• Save time in building analytical solutions
Other Reasons
• Develop and leverage skill set of people
• Separating concerns of running
applications vs running infrastructure

• Evaluate and adopt new developments in
the space
Flavors of managed BigData platforms
• Physical data centers
• Private or Public clouds
  • Infrastructure Providers:
    • Amazon Web Services, Google Compute Engine, Microsoft Azure, IBM, OpenStack, Rackspace
  • Platform Providers:
    • Qubole, Xurmo
    • In-House: Netflix, ...
Architectural Layers
(Layered diagram, top to bottom)
• Enterprise User Data / Workloads
• User Data / Workloads
• Enterprise Managed BigData Services (E.g. Netflix Genie)
• Managed BigData Services (E.g. EMR, Savanna, Redshift)
• Cloud Storage (E.g. S3, Swift)
• Virtualized Compute (E.g. EC2, Nova)
Components in a managed platform
(Tiered diagram, top to bottom)
• Presentation: Command Line Tools, API, Analytics Workbench
• Data analytics: Data Catalog, Query, Aggregates, ETL
• Platform: Ingest, Extract, FileSystem, Workflow, Provisioning, Scheduler, Job Management, Access Control, Eventing
• Infrastructure: Redshift and S3 (Data), EMR (Compute), IAM (Identity), SNS
Elastic MapReduce - 101
• Provision a Hadoop cluster of given size, using given type of instances
• Support for most of the ecosystem: Hive, Pig, HBase, etc.
• Can scale up and down nodes for a cluster on demand
• User submits ‘jobflows’: a sequence of Hadoop jobs
• Integrates with S3 as permanent store of data
• Integrates with other Amazon services
• Cost = standard EC2 instance cost + EMR extra + standard S3 operations, etc.
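The cost formula above can be turned into a rough calculator. This is only a sketch: the function name, the parameters, and the rates in the usage line are illustrative placeholders rather than real AWS prices, and rounding up to whole hours per node is an assumption based on the hourly minimum charge mentioned later in the deck.

```python
import math


def estimate_cluster_cost(nodes, runtime_minutes, ec2_hourly_rate,
                          emr_hourly_extra, s3_ops_cost=0.0):
    """Rough cost of one cluster run: EC2 + EMR extra + S3 operations.

    Assumption: every node is billed per started hour, with a minimum
    of one hour.
    """
    billed_hours = max(1, math.ceil(runtime_minutes / 60))
    compute = nodes * billed_hours * (ec2_hourly_rate + emr_hourly_extra)
    return compute + s3_ops_cost


# Placeholder rates for illustration only.
print(estimate_cluster_cost(nodes=10, runtime_minutes=20,
                            ec2_hourly_rate=0.25, emr_hourly_extra=0.05))
```

Under these assumptions a 20 minute job and a 60 minute job cost the same, which is why short-lived clusters can be expensive per unit of work.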
Reasons for having Enterprise Tier on EMR

• Improve usability by providing better
abstractions, necessary automation

• Improve cost utilization by reusing
infrastructure

• Improve performance by providing system
level optimizations
Improving Usability
• EMR API expects some repetitive setup steps as part of job submission. E.g. Hive setup for all Hive jobs
• Provide a service API with a simpler interface that automates the setup.
Improving Usability
EMR job submission (the slide cuts off inside the third step):

{"steps": [
  {
    "stepActionOnFailure": "CONTINUE",
    "stepName": "Setup Hive",
    "stepArgs": [
      "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
      "--base-path",
      "s3://us-east-1.elasticmapreduce/libs/hive/",
      "--install-hive",
      "--hive-versions",
      "latest"
    ],
    "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
  },
  {
    "stepActionOnFailure": "CONTINUE",
    "stepName": "Install Hive Site Configuration",
    "stepArgs": [
      "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
      "--base-path",
      "s3://us-east-1.elasticmapreduce/libs/hive/",
      "--install-hive-site",
      "--hive-site=s3://com.x.y.z/security/configs/hive-site.xml",
      "--hive-versions",
      "latest"
    ],
    "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
  },
  {
    "stepActionOnFailure": "TERMINATE_JOB_FLOW",
    "stepName": "Run Hive Script",
    "stepArgs": [
      "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
      "--base-path",
      "s3://us-east-1.elasticmapreduce/libs/hive/",

Equivalent request to the simplified service API:

{"jobs": [
  {
    "name": "hive-query",
    "type": "hive",
    "args": [
      "-hiveconf",
      "hive.cli.print.header=true"
    ],
    "script": "select * from table;"
  }
]}
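One way a service API could automate the repetitive setup is to expand the short job request into the longer step list on the server side. The sketch below assumes the field names and S3 paths shown on the slide; the function names are hypothetical, and because the slide truncates the arguments of the "Run Hive Script" step, the run step is built by a caller-supplied function rather than guessed.

```python
HIVE_SCRIPT = "s3://us-east-1.elasticmapreduce/libs/hive/hive-script"
HIVE_BASE = "s3://us-east-1.elasticmapreduce/libs/hive/"
SCRIPT_RUNNER = ("s3://us-east-1.elasticmapreduce/libs/"
                 "script-runner/script-runner.jar")


def hive_setup_steps(hive_site=None):
    """Build the Hive setup steps that every Hive job otherwise repeats."""
    steps = [{
        "stepActionOnFailure": "CONTINUE",
        "stepName": "Setup Hive",
        "stepArgs": [HIVE_SCRIPT, "--base-path", HIVE_BASE,
                     "--install-hive", "--hive-versions", "latest"],
        "stepJar": SCRIPT_RUNNER,
    }]
    if hive_site:
        steps.append({
            "stepActionOnFailure": "CONTINUE",
            "stepName": "Install Hive Site Configuration",
            "stepArgs": [HIVE_SCRIPT, "--base-path", HIVE_BASE,
                         "--install-hive-site",
                         f"--hive-site={hive_site}",
                         "--hive-versions", "latest"],
            "stepJar": SCRIPT_RUNNER,
        })
    return steps


def expand_request(request, build_run_step, hive_site=None):
    """Turn a simplified {"jobs": [...]} request into a {"steps": [...]} list.

    build_run_step is a hypothetical callback that maps one job to its
    run step; its exact arguments are not shown in full on the slide.
    """
    steps = []
    if any(job["type"] == "hive" for job in request["jobs"]):
        steps.extend(hive_setup_steps(hive_site))
    steps.extend(build_run_step(job) for job in request["jobs"])
    return {"steps": steps}
```

The user only ever writes the short request; the setup steps are added once per jobflow, however many Hive jobs it contains.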
Improving Usability
• EMR expects users to know the cluster sizes when launching jobs
• Users will either not know how to launch clusters, or will launch incorrectly sized ones.
• Separate cluster management from job management.
  • Have the system (or administrators) launch clusters on behalf of users
  • Have the system submit jobs to appropriate clusters
  • Scale them according to the needs of the jobs, automatically or administratively
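Having the system choose a cluster for each job could look roughly like the sketch below. The `Cluster` type, its fields, and the "smallest idle cluster that fits" policy are assumptions made for illustration; the deck only says that jobs are submitted to appropriate clusters.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    cluster_id: str
    nodes: int
    busy: bool = False


def pick_cluster(clusters, required_nodes):
    """Return the smallest idle cluster with enough nodes, or None.

    None signals that the system (not the user) should launch a new
    cluster sized for the job.
    """
    candidates = [c for c in clusters
                  if not c.busy and c.nodes >= required_nodes]
    return min(candidates, key=lambda c: c.nodes, default=None)


pool = [Cluster("small", 4), Cluster("large", 16), Cluster("mid", 8, busy=True)]
chosen = pick_cluster(pool, required_nodes=6)
print(chosen.cluster_id if chosen else "launch a new cluster")
```

Picking the smallest fitting cluster keeps larger clusters free for larger jobs; other policies (for example, preferring the cluster closest to the end of its billed hour) would fit the same interface.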
Improve cost utilization

• Different cluster types in EMR: ephemeral (default) and static
• Ephemeral clusters can be a huge cost drain (note: the minimum charge is for an hour)
• Static clusters can also waste money (if unused)
• Go with a hybrid model
  • Launch clusters on demand, but maximize utilization for the cost paid: keep them alive for at least an hour
  • Reuse them for other jobs transparently
  • Shut down if not used anymore
• Saved $3000 in a month with this strategy
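The hybrid model can be reduced to a small shutdown rule: an idle cluster has already been paid for until the end of its current hour, so it is kept available for reuse until shortly before that boundary. The sketch below is one possible reading of the strategy, not the implementation behind the deck; the function names and the five minute margin are assumptions.

```python
def minutes_into_billed_hour(uptime_minutes):
    """Minutes elapsed in the cluster's current (already paid) hour."""
    return uptime_minutes % 60


def should_shut_down(uptime_minutes, is_idle, margin_minutes=5):
    """Shut down only idle clusters that are about to start a new paid hour."""
    if not is_idle:
        return False
    return minutes_into_billed_hour(uptime_minutes) >= 60 - margin_minutes


# An idle cluster 20 minutes into its hour is kept for reuse;
# at 57 minutes it is shut down before the next hour is charged.
print(should_shut_down(20, is_idle=True), should_shut_down(57, is_idle=True))
```

A cluster that picks up another job during its paid hour simply stops being idle, so the same check is evaluated again in the following hour.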
Job Management System Design
Components: Job Management Service, Job Executor, Resource Estimator, Cluster Manager
• Cluster Manager: manages provisioning, monitoring and terminating clusters. Matches job requests to suitable clusters based on policy.
• Clusters: a pool of clusters brought up either on demand or predetermined, based on resource requirements, longevity, etc.
• Job Executor: knows how to convert a user jobflow to an EMR jobflow, and how to submit jobflows to clusters identified by the Cluster Manager.
• Resource Estimator: monitors running jobs on clusters using CloudWatch (or a similar system), and determines whether to add nodes to or remove nodes from a cluster.
• Job Management Service: front-end service API for users to submit their jobs.
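The Resource Estimator's scaling decision might be sketched as follows. The metric used here (a count of pending tasks) and the per-node capacity are hypothetical inputs standing in for whatever CloudWatch or a similar system reports; the deck does not specify them, and the bounds are illustrative defaults.

```python
import math


def target_node_count(pending_tasks, tasks_per_node, min_nodes, max_nodes):
    """Nodes needed for the pending work, clamped to the allowed range."""
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def scaling_delta(current_nodes, pending_tasks, tasks_per_node,
                  min_nodes=2, max_nodes=20):
    """Positive result: add that many nodes. Negative: remove that many."""
    target = target_node_count(pending_tasks, tasks_per_node,
                               min_nodes, max_nodes)
    return target - current_nodes


# 80 pending tasks at 10 tasks per node on a 4 node cluster.
print(scaling_delta(current_nodes=4, pending_tasks=80, tasks_per_node=10))
```

Clamping to a minimum and maximum keeps a noisy metric from shrinking a cluster to nothing or growing it without bound.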
Thank you!
http://www.thoughtworks.com/insights/bigdata-analytics
