Building a cloud based managed BigData platform for the enterprise

Hemanth Yamijala
Lead Consultant - ThoughtWorks
yhemanth@thoughtworks.com
@yhemanth

Pillars in a BigData Solution
• Infrastructure
• Data
• Process

A Managed Platform
[Diagram: the BigData pillars (Infrastructure, Data, Process) with a Services layer on top]

Reuse infrastructure
• Consolidate cluster resources
• Saves capacity cost
• Saves operational cost
• Enforce common access control and security measures

Reuse Data
• Democratize data assets
• Assist self service, discoverability
• Conventions based organization of data
• Enforce access policies

Reuse process
• Build common processing frameworks or libraries
• Ingest and Extract can be centralized services
• Frameworks can be developed for ETL processes, workflows, etc.
• Save time in building analytical solutions

Other Reasons
• Develop and leverage the skill set of people
• Separate the concerns of running applications vs running infrastructure
• Evaluate and adopt new developments in the space

Flavors of managed BigData platforms
• Physical data centers
• Private or Public clouds
  • Infrastructure Providers: Amazon Web Services, Google Compute Engine, Microsoft Azure, IBM, OpenStack, Rackspace
  • Platform Providers:
    • Qubole, Xurmo
    • In-House: Netflix, ...

Architectural Layers
• Enterprise User Data / Workloads
• User Data / Workloads
• Enterprise Managed BigData Services (E.g. Netflix Genie)
• Managed BigData Services (E.g. EMR, Savanna, Redshift)
• Cloud Storage (E.g. S3, Swift)
• Virtualized Compute (E.g. EC2, Nova)

Components in a managed platform
• Presentation: Command Line Tools, API, Analytics Workbench
• Data analytics: Data Catalog, Query, Aggregates, ETL
• Platform: Ingest, FileSystem, Workflow, Provisioning, Scheduler, Job Management, Extract, Access Control, Eventing
• Infrastructure: Redshift and S3 (Data), EMR (Compute), IAM (Identity), SNS

Elastic MapReduce - 101
• Provision a Hadoop cluster of given size, using given type of instances
• Support for most of the ecosystem: Hive, Pig, HBase, etc.
• Can scale up and down nodes for a cluster on demand
• User submits 'jobflows' - a sequence of Hadoop jobs
• Integrates with S3 as permanent store of data
• Integrates with other Amazon services
• Cost = Std. EC2 instance cost + extra + Std. S3 ops etc.

Reasons for having Enterprise Tier on EMR

• Improve usability by providing better abstractions, necessary automation
• Improve cost utilization by reusing infrastructure
• Improve performance by providing system level optimizations

Improving Usability
• EMR API expects some repetitive setup steps as part of job submission. E.g. Hive setup for all Hive jobs
• Provide a service API with a simpler interface that automates the setup.

Improving Usability
EMR job flow request (excerpt; the slide cuts off partway through the third step):

{"steps": [
  {
    "stepActionOnFailure": "CONTINUE",
    "stepName": "Setup Hive",
    "stepArgs": [
      "s3://us-east-1.elasticmapreduce/libs/hive/hive-script",
      "--base-path",
      "s3://us-east-1.elasticmapreduce/libs/hive/",
      "--install-hive",
      "--hive-versions",
      "latest"
    ],
    "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
  },
  {
    "stepActionOnFailure": "CONTINUE",
    "stepName": "Install Hive Site Configuration",
    "stepArgs": [
      "--base-path",
      "--install-hive-site",
      "--hive-site=s3://com.x.y.z/security/configs/hive-site.xml",
      "--hive-versions",
      "latest"
    ],
    "stepJar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
  },
  {
    "stepActionOnFailure": "TERMINATE_JOB_FLOW",
    "stepName": "Run Hive Script",
    "stepArgs": [
      "--base-path",

The same job through the simplified service API:

{"jobs": [
  {
    "name": "hive-query",
    "type": "hive",
    "args": [
      "-hiveconf",
      "hive.cli.print.header=true"
    ],
    "script": "select * from table;"
  }
]}
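A service that accepts the simplified request could expand it into the longer step list on the user's behalf. The following is a minimal sketch of that idea, not the actual implementation from the talk. The function names are hypothetical, only the "Setup Hive" step is copied from the slide, and the arguments of the final run step are a placeholder because the slide truncates the real ones.

```python
import json

# Paths copied from the slide's example request.
SCRIPT_RUNNER_JAR = "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar"
HIVE_SCRIPT = "s3://us-east-1.elasticmapreduce/libs/hive/hive-script"
HIVE_BASE_PATH = "s3://us-east-1.elasticmapreduce/libs/hive/"


def hive_setup_steps():
    """Return the boilerplate setup step that every Hive job repeats."""
    return [
        {
            "stepActionOnFailure": "CONTINUE",
            "stepName": "Setup Hive",
            "stepArgs": [HIVE_SCRIPT, "--base-path", HIVE_BASE_PATH,
                         "--install-hive", "--hive-versions", "latest"],
            "stepJar": SCRIPT_RUNNER_JAR,
        }
    ]


def expand_job(job):
    """Turn one simplified job into a full step list (hypothetical helper)."""
    if job["type"] != "hive":
        raise ValueError("this sketch only handles hive jobs")
    run_step = {
        "stepActionOnFailure": "TERMINATE_JOB_FLOW",
        "stepName": "Run " + job["name"],
        # Placeholder arguments: the user's args and script are passed through
        # unchanged; the real EMR arguments are not shown on the slide.
        "stepArgs": list(job["args"]) + ["-e", job["script"]],
        "stepJar": SCRIPT_RUNNER_JAR,
    }
    return {"steps": hive_setup_steps() + [run_step]}


request = {"jobs": [{"name": "hive-query", "type": "hive",
                     "args": ["-hiveconf", "hive.cli.print.header=true"],
                     "script": "select * from table;"}]}
expanded = [expand_job(job) for job in request["jobs"]]
print(json.dumps(expanded[0], indent=2))
```

The user only ever writes the short request; the setup steps live in one place and can be changed without touching any job definition.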

Improving Usability
• Separate cluster management from job management.
• EMR expects users to know the cluster sizes when launching jobs
• Users will either not know how to launch clusters, or will launch incorrectly sized ones.
• Have the system (or administrators) launch clusters on behalf of users
• Have the system submit jobs to appropriate clusters
• Scale them according to the needs of the jobs automatically or administratively

Improve cost utilization
• Different cluster types in EMR: ephemeral (default) and static
• Ephemeral clusters can be a huge cost drain. Note: the minimum charge is for one hour
• Static clusters can also waste money (if unused)
• Go with a Hybrid model
  • Launch clusters on demand, but maximize utilization for the cost paid: keep them alive for at least an hour
  • Reuse them for other jobs transparently
  • Shut down if not used anymore
• Saved $3000 in a month with this strategy
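The arithmetic behind the hybrid model can be illustrated with a small sketch. It assumes whole-hour billing with a one-hour minimum per cluster and ignores cluster start-up time; the job runtimes and the hourly rate are made-up numbers, not figures from the talk.

```python
import math


def billed_hours(runtime_minutes):
    """Hours charged for one cluster, assuming whole-hour billing."""
    return max(1, math.ceil(runtime_minutes / 60))


def ephemeral_cost(job_minutes, hourly_rate):
    """Each job launches its own cluster and pays at least one hour."""
    return sum(billed_hours(m) for m in job_minutes) * hourly_rate


def shared_cost(job_minutes, hourly_rate):
    """Jobs run back to back on one reused cluster."""
    return billed_hours(sum(job_minutes)) * hourly_rate


jobs = [10, 15, 5, 20]   # hypothetical job runtimes in minutes
rate = 2.0               # hypothetical cluster cost per hour
print(ephemeral_cost(jobs, rate))  # four clusters, one billed hour each
print(shared_cost(jobs, rate))     # one cluster, one billed hour
```

With short jobs, most of each ephemeral cluster's billed hour goes unused, which is the waste that reusing a live cluster avoids.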

Job Management System Design
Components: Job Management Service, Job Executor, Resource Estimator, Cluster Manager

Cluster Manager: Manages provisioning, monitoring and terminating clusters. Matches job requests to suitable clusters based on policy.
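One possible matching policy is sketched below, purely as an illustration. The `Cluster` and `JobRequest` types, the `min_nodes` requirement and the per-cluster job limit are all assumptions made for this example; the talk does not describe the actual policy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    cluster_id: str
    nodes: int
    running_jobs: int


@dataclass
class JobRequest:
    name: str
    min_nodes: int


def pick_cluster(job: JobRequest, pool: List[Cluster],
                 max_jobs_per_cluster: int = 2) -> Optional[Cluster]:
    """Return the smallest cluster that fits the job.

    None means no cluster qualifies, so a new one should be provisioned.
    """
    candidates = [c for c in pool
                  if c.nodes >= job.min_nodes
                  and c.running_jobs < max_jobs_per_cluster]
    if not candidates:
        return None
    # Prefer the smallest adequate cluster, then the least loaded one.
    return min(candidates, key=lambda c: (c.nodes, c.running_jobs))


pool = [Cluster("small", 4, 0), Cluster("large", 20, 1), Cluster("busy", 8, 2)]
chosen = pick_cluster(JobRequest("hive-query", 6), pool)
print(chosen.cluster_id if chosen else "provision a new cluster")
```

Keeping the policy in one function makes it easy to swap in other rules, such as matching on longevity or on the team that owns the cluster.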

Cluster pool: Clusters are brought up either on demand or predetermined, based on resource requirements, longevity, etc.

Job Executor: Knows how to convert a user jobflow to an EMR jobflow, and how to submit jobflows to the clusters identified by the Cluster Manager.

Resource Estimator: Monitors running jobs on clusters using CloudWatch (or a similar system), and determines whether to add nodes to or remove nodes from a cluster.
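A scaling decision of this kind could look roughly like the sketch below. It assumes the monitoring system reports a count of pending tasks and that each node can run a fixed number of tasks; both the metric and the limits are hypothetical, and fetching the metric from CloudWatch is left out.

```python
def target_nodes(pending_tasks, tasks_per_node, min_nodes=2, max_nodes=20):
    """Suggest a cluster size from the backlog, clamped to the allowed range."""
    needed = -(-pending_tasks // tasks_per_node)  # ceiling division
    return max(min_nodes, min(max_nodes, needed))


def resize_decision(current_nodes, pending_tasks, tasks_per_node=4):
    """Return an (action, node_count) pair for the cluster."""
    target = target_nodes(pending_tasks, tasks_per_node)
    delta = target - current_nodes
    if delta > 0:
        return ("add", delta)
    if delta < 0:
        return ("remove", -delta)
    return ("keep", 0)


print(resize_decision(current_nodes=5, pending_tasks=40))
print(resize_decision(current_nodes=5, pending_tasks=8))
```

The minimum and maximum bounds keep the estimator from shrinking a cluster to nothing or growing it without limit when the metric spikes.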

Job Management Service: Front-end service API for users to submit their jobs.

Thank you!
http://www.thoughtworks.com/insights/bigdata-analytics