1© Cloudera, Inc. All rights reserved.
Customer Best Practices: Optimizing Cloudera on AWS
Josh Hammer | Partner Solutions Architect, AWS
Hervé Bertacchi | Director of Big Data Solutions Architecture, Celgene
Alex Moundalexis | Cloudera Software Engineer | @technmsg
Agenda
• The AWS Difference (Josh Hammer) - 15 min
• Celgene Customer Story (Hervé Bertacchi) - 10 min
• Optimizing Cloudera on AWS (Alex Moundalexis) - 25 min
• Q&A time - 10 min
Our Speakers
Josh Hammer is a Partner Solutions Architect with
AWS. Before joining the Partner Solutions
Architect team, he was a Security Architect with
AWS Professional Services helping large
enterprises adopt AWS. Prior to joining AWS,
Josh was a Security Architect at Oracle.
Alex Moundalexis is a software engineer at
Cloudera responsible for partner platforms.
Formerly on Cloudera's professional services team,
Alex spent several years installing and configuring
Hadoop clusters across the United States and
abroad.
Hervé Bertacchi, Director of Big Data Solutions Architecture, is a
Senior Information Technology leader with more than two decades of
experience delivering strategy, long-range planning, solutions, and
enabling innovation.
Together with his team, Hervé received Cloudera’s 2017 Data Impact
Award for establishing Celgene’s platform for Big Data Analytics that
enabled the company to unlock new insights and drastically improve the
productivity of its data scientists.
Previously, Hervé led the Information Management and Commercial
Insight function at AstraZeneca, where he established a Big Data
capability and modernized the legacy MDM platform. Earlier in his
career, he held various architecture leadership roles and was also an
entrepreneur.
Hervé holds a Master of Science in Computer Science from Stevens
Institute of Technology and has completed advanced studies in
Mathematics and Physics at Lycée Masséna.
Ways to Deploy Cloudera Enterprise on AWS
[Diagram: deployment options (On Premise; AWS Cloud, IaaS; AWS Cloud, PaaS) for the Analytic DBMS, Operational DBMS, Data Engineering, and Data Science workloads, with Cloudera Director and Altus Platform Services; the legend distinguishes current from planned.]
The AWS Difference
What sets AWS apart?
• Experience: building and managing cloud since 2006; 1M+ customers
• Service Breadth & Depth; pace of innovation: 90+ services to support any cloud workload; rapid customer driven releases
• Global Footprint: 16 regions, 42 availability zones, 76 edge locations
• Ecosystem: thousands of partners; 3,800+ Marketplace products
• Security: fine-grained control
• Artificial Intelligence: fully integrated in AWS
• Enterprise leader: Gartner Magic Quadrant recognizing AWS
Security
Security and Compliance from the Ground Up, Giving Customers Fine-grained Control
Fine-grained security control: a customer might restrict a user to creating a database only in a
specific Region, from 3-5 pm Monday to Friday, only in a virtual private cloud, and only on an
M4 instance with a maximum number of IOPS. This can be done only in AWS.
Shared responsibility model: AWS manages the “security of the cloud”, and customers operate
the “security in the cloud”. Customers retain control of what security they choose.
AWS has more than 50 compliance certifications and accreditations.
“We worked closely with the Amazon team to develop a security model, which we
believe enables us to operate more securely in the public cloud than we can even in
our data centers.” Capital One
The AWS Platform
[Diagram: the AWS platform by category: Account Support, Marketplace, Mobile, Enterprise Applications, Game Development, Mgmt. Tools, Analytics, Artificial Intelligence, IoT, Hybrid, Migration, Infrastructure, Compute, Storage, Database, Application Services, Security, Dev Tools, Networking.]
AWS’ History of Innovation
Customer driven services and features
[Chart: new services and features launched per year: 2009: 48; 2011: 82; 2013: 280; 2015: 722; 2016: 1,017.]
* As of 1 July 2016
AWS has been continually expanding its services to support virtually
any cloud workload, and it now has more than 90 services that range
from compute, storage, networking, database, analytics, application
services, deployment, management, developer, mobile, Internet of
Things (IoT), Artificial Intelligence (AI), security, hybrid and enterprise
applications.
In 2016, we launched 1,017 new services and features. As of April 1st,
we have launched 236 new features and services in 2017.
Experience with Operational Reliability
More than a decade building the most reliable, secure, scalable, and cost-effective
infrastructure.
Availability Zones exist on isolated fault lines, flood plains, networks, and electrical grids to
substantially reduce the chance of simultaneous failure.
AWS has reduced prices 61 times since AWS launched in 2006 [as of May 2017].
Millions of active customers use AWS
Experience with Hybrid
AWS provides the broadest set of hybrid capabilities of any cloud provider (networking, data,
access, management and application) without requiring any new on-premises hardware
purchases.
The partnership with VMware allows customers to seamlessly run existing VMware
workloads on AWS with the skills and toolsets they already have.
Customers continue to use key enterprise solutions from Microsoft, Oracle, SAP…, which are also
offered as fully managed services.
Customers continue to use their on-premises investments while getting the full
benefits of the cloud: flexibility, cost-effectiveness, reliability, scalability and security
“We’re running a hybrid architecture, and we still host some of our applications in our
own data center, but we are moving more and more applications to the cloud…We’re
finding that we have the flexibility to support both environments because of the
tooling, APIs, and management features AWS has on top of its cloud. And, with the
flexibility we get using AWS, we can adapt over time to our company’s changing
needs.”
CSRA (US Gov.)
Global Footprint
190 countries
2,300 government agencies
7,000 educational institutions
22,000 nonprofit organizations
16 regions (2 more announced)
42 availability zones
73 edge locations
AWS regions consist of multiple Availability Zones (“AZs”) that are isolated from one another's failures and connected by
low-latency networks, making AWS the only provider that supports HA natively
Region & Number of Availability Zones
AWS GovCloud (2)
US West: Oregon (3), Northern California (3)
US East: Northern Virginia (5), Ohio (3)
Canada: Central (2)
South America: São Paulo (3)
Europe: Ireland (3), Frankfurt (2), London (2)
Asia Pacific: Singapore (2), Sydney (3), Tokyo (3), Seoul (2), Mumbai (2)
China: Beijing (2)
New Regions (coming soon): Paris, Ningxia
Production applications that are highly available
The AWS Cloud infrastructure:
• Availability Zones (42) consist of one or more discrete data
centers, each with redundant power, networking, and
connectivity, housed in separate facilities.
• A Region (16) is a physical location in the world where we
have multiple Availability Zones.
Availability Zones (AZs) provide the resiliency of
performing real-time data replication and the reliability of
multiple physical locations
[Diagram: Regions, each containing Availability Zones 1, 2, … N. Low latency ensures real-time data replication; distance ensures high availability.]
Artificial Intelligence fully integrated in AWS
For Application Developers and for Data Scientists & Researchers:
• Amazon Rekognition: Image Recognition & Analysis
• Amazon Polly: Text-to-Speech
• Amazon Lex: Natural Language Understanding (NLU) & Automatic Speech Recognition (ASR)
• Amazon Machine Learning: Managed Machine Learning
• AWS Deep Learning AMI: use and scale deep learning frameworks quickly and easily
An Expansive Ecosystem
Thousands of the world’s largest
technology and consulting companies
55 Premier Consulting Partners
17 Enterprise-focused competencies
3,800+ products
Customers run over 370 M hours of
software per month
Products fully integrated with AWS platform and easy to fully test
AWS Positioned as a Leader in the Gartner Magic Quadrant for Cloud
Infrastructure as a Service, Worldwide*
AWS is positioned highest in execution
and furthest in vision within the Leaders
Quadrant
*Gartner, Magic Quadrant for Cloud Infrastructure as a Service, Worldwide, Leong, Lydia, Petri, Gregor, Gill, Bob, Dorosh, Mike, August 3, 2016
This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document.
The Gartner document is available upon request from AWS: http://www.gartner.com/doc/reprints?id=1-2G2O5FC&ct=150519&st=sb
Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only
those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization
and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any
warranties of merchantability or fitness for a particular purpose.
Marc Benioff, Salesforce.com CEO: “There is no
public cloud infrastructure provider that is more
sophisticated or has more robust enterprise
capabilities for supporting the needs of our
growing global customer base.”
Jim Fowler, GE CIO: "AWS is our trusted partner
that is going to run our company for the next
140 years.”
Airbnb: “Amazon Web Services listens to
customers’ needs. If the feature does not yet
exist, it probably will in a matter of months.”
The Celgene Story
Celgene is focused on the discovery,
development, and commercialization of
innovative therapies for patients with
cancer, immune-inflammatory, and other
unmet medical needs.
• We are amidst a revolution in the
rapidly evolving healthcare ecosystem
to change the way we define and treat
human diseases
• We transformed our approach to
harnessing data and using the derived
insights to drive decision-making
• We made an investment in Big Data
Analytics
WHAT WE ACHIEVED
Cloudera on AWS:
• Helps us bring innovative therapies
to patients faster and rapidly adapt to
evolving business needs
• Reduced the time taken for
patient data analysis by 99% and
achieved a 70% savings in operating
costs
• Improved data scientists' and analysts'
productivity across the company and
enabled them to answer business
questions that could not be answered
before
LESSONS LEARNED
• Security is not optional. Carefully
plan and test your security
architecture
• Full automation across several
technology ecosystems is not
trivial. Think end-to-end and pick
the right automation tool
• Ease of access and trust in the
data is essential. Technology is
only an enabler
Best practices for running your
Cloudera Enterprise cluster on AWS
AWS Components
Quick AWS Component Review
EC2 - servers
S3 - object storage
EBS - block storage
RDS - we still need RDBMS
VPC - everything needs network
AWS Service Limits
Default limits on most AWS services (most can be increased)
• EC2 - ten (10) m4.4xlarge instances per region
• EBS - 20 TB of Throughput Optimized HDD (st1) per region
• S3 - 100 buckets per account
• VPC - 5 VPC per region
Defaults might limit your cluster, so plan ahead!
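As a rough illustration of planning ahead, the sketch below compares a planned cluster against the default limits quoted above. The limit values are taken from the slide and may differ per account and over time; the resource labels, the `planned` figures, and the function name are hypothetical.

```python
# Default limits as quoted on the slide (assumption: unchanged for your account).
DEFAULT_LIMITS = {
    "m4.4xlarge instances per region": 10,
    "st1 EBS capacity per region (TB)": 20,
    "S3 buckets per account": 100,
    "VPCs per region": 5,
}


def find_limit_overruns(planned, limits=DEFAULT_LIMITS):
    """Return {resource: (planned, limit)} for every planned value above its limit."""
    return {
        name: (amount, limits[name])
        for name, amount in planned.items()
        if name in limits and amount > limits[name]
    }


# Hypothetical plan: 12 workers, each with 2 x 1 TB st1 volumes.
planned = {
    "m4.4xlarge instances per region": 12,
    "st1 EBS capacity per region (TB)": 12 * 2 * 1,
    "S3 buckets per account": 3,
    "VPCs per region": 1,
}

for name, (amount, limit) in find_limit_overruns(planned).items():
    print(f"Request an increase for {name}: need {amount}, default is {limit}")
```

Running a check like this before provisioning shows which limit increases to request in advance.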
Deployment Topology
Two types of deployments from a network perspective:
• Public Subnet - direct access to Internet & AWS services
• Private Subnet - instances must go through VPC endpoints to
reach AWS services & NAT instances for Internet
If you require high-bandwidth access to data sources on the Internet or
AWS services in another region, deploy to a public subnet.
Public doesn’t mean open to the world. Limit access using Security
Groups.
Deployment Topology (Public Subnet)
Deployment Topology (Private Subnet)
A Word About Edge Nodes
Edge nodes:
• Have direct access to cluster
• Users use these nodes via client applications
• Web applications, BI tools, or just Hadoop command-line client
Avoid direct user access to the cluster!
Deployment Topology with Edge Nodes
Roles and Instance Types
So you want a cluster, eh?
Deployment Model
Two types of deployments:
• Persistent - long-lived, always on, no spin-up time
• Temporary - short-lived, can be started/stopped to save money
Impacts job setup time, instance types, storage method, and cost.
Roles
Three classes of cluster nodes:
• Masters - the brains behind all the cluster services
• Workers - the processing power and storage
• Edge nodes - provide client access to cluster services
Instance type recommendations differ depending on the class and
also on the deployment model.
Instance Types
Ephemeral
• have locally attached storage, HDD or SSD
• on instance termination, data is irrecoverable
EBS-only
• no local storage, must mount EBS volumes
• on instance termination, data is safe in EBS
Networking, Connectivity, and Security
• Use an HVM AMI in VPC with correct network drivers
• Use VPC endpoints for AWS services
• Private data center connectivity via VPN or Direct Connect
• Use Placement Groups, a logical grouping of EC2 instances
• Security groups are analogous to firewalls, for example:
    permit traffic from sg-clusterNNN to sg-clusterNNN
    permit traffic from office-net to edge-node tcp 22
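The two example rules can be modeled in a few lines to show how allow-only evaluation works. This is a minimal sketch, not the AWS API: the group names, the office CIDR block (`203.0.113.0/24`, a documentation range), and the `Rule` and `is_allowed` helpers are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """One inbound allow rule; security groups have no deny rules."""
    dest_group: str             # group the rule is attached to
    source: str                 # a CIDR block or another group's name
    port: Optional[int] = None  # None means all ports


# Hypothetical names and office CIDR, mirroring the two rules on the slide.
RULES = [
    Rule(dest_group="sg-cluster", source="sg-cluster"),            # intra-cluster traffic
    Rule(dest_group="sg-edge", source="203.0.113.0/24", port=22),  # office-net to edge node, SSH
]


def is_allowed(rules, dest_group, port, source_ip=None, source_group=None):
    """Return True if any rule on dest_group permits the traffic (default deny)."""
    for rule in rules:
        if rule.dest_group != dest_group:
            continue
        if rule.port is not None and rule.port != port:
            continue
        if rule.source == source_group:
            return True
        if source_ip is not None and "/" in rule.source:
            if ipaddress.ip_address(source_ip) in ipaddress.ip_network(rule.source):
                return True
    return False


print(is_allowed(RULES, "sg-edge", 22, source_ip="203.0.113.10"))     # office SSH to edge node
print(is_allowed(RULES, "sg-cluster", 8020, source_ip="203.0.113.10"))  # office directly to cluster
```

Anything not matched by a rule is denied, which is why the office network reaches only the edge node and never the cluster itself.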
Storage Configuration
Three types of storage:
• Instance Storage
• Elastic Block Storage (EBS)
• Object Storage (S3)
Storage Configuration - Instance Storage
• Attached to EC2 instances, like physical disks on physical server
• Lifetime of storage == Lifetime of EC2 instance
• Each EC2 instance type has a different amount of storage
    c1.xlarge has 4 x 420 GB
    d2.8xlarge has 24 x 2 TB
• For HDFS data directories, we like HDD instance storage
• Backup planning: account for multi-instance shutdowns and multi-VM AWS
events
Storage Configuration - EBS
• Persistent block level storage volumes
• Can be encrypted at rest w/ negligible impact to latency/throughput
• OS: General Purpose (gp2)
• DFS: Throughput Optimized HDD (st1), Cold HDD (sc1)
• Baseline & burst performance increase with size of provisioned
volume
e.g. 500 GB st1 baseline throughput 20 MB/s; 1000 GB → 40 MB/s
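The st1 figures above imply a linear relationship of 40 MB/s per 1000 GB provisioned. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the per-volume throughput ceiling that AWS also applies is not modeled here.

```python
# From the slide: 1000 GB of st1 gives 40 MB/s baseline (500 GB gives 20 MB/s).
ST1_BASELINE_MBPS_PER_1000_GB = 40.0


def st1_baseline_mbps(size_gb):
    """Linear estimate of baseline throughput (MB/s) for one st1 volume."""
    return ST1_BASELINE_MBPS_PER_1000_GB * size_gb / 1000.0


def st1_size_for(target_mbps):
    """Volume size (GB) at which the estimated baseline reaches target_mbps."""
    return target_mbps * 1000.0 / ST1_BASELINE_MBPS_PER_1000_GB


print(st1_baseline_mbps(500))   # 20.0
print(st1_baseline_mbps(1000))  # 40.0
print(st1_size_for(40))         # 1000.0
```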
Storage Configuration - EBS Recommendations
Instance selection
• Use EBS-optimized instances OR instances with 10Gb+ network
• Minimum dedicated EBS bandwidth of 1000 Mb/s (125 MB/s)
Volume selection
• Baseline performance, 40 MB/s or better (1000 GB st1, 3200 GB
sc1)
• Do not exceed instance’s dedicated EBS bandwidth!
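To make the last recommendation concrete, the sketch below converts an instance's dedicated EBS bandwidth from Mb/s to MB/s and checks how many volumes fit within it. It considers baseline throughput only; the helper names and the example volume counts are hypothetical.

```python
def mbit_to_mbyte(mbit_per_s):
    """Convert megabits per second to megabytes per second."""
    return mbit_per_s / 8.0


def max_volumes(instance_ebs_mbit, per_volume_mbyte):
    """Number of volumes whose combined baseline fits the instance's EBS bandwidth."""
    return int(mbit_to_mbyte(instance_ebs_mbit) // per_volume_mbyte)


def exceeds_bandwidth(instance_ebs_mbit, volume_mbytes):
    """True if the volumes' combined baseline exceeds the instance's EBS bandwidth."""
    return sum(volume_mbytes) > mbit_to_mbyte(instance_ebs_mbit)


# The minimum from the slide: 1000 Mb/s of dedicated EBS bandwidth is 125 MB/s.
print(mbit_to_mbyte(1000))                    # 125.0
# With 40 MB/s volumes (1000 GB st1), three fit under 125 MB/s; a fourth does not.
print(max_volumes(1000, 40))                  # 3
print(exceeds_bandwidth(1000, [40, 40, 40, 40]))  # True
```

Volumes attached beyond this point still add capacity, but the instance cannot drive all of them at their baseline rate at once.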
Storage Configuration - S3
• Great for cold backup: durable, available, inexpensive
• For hot backup, use a second HDFS cluster
• Hive and Spark can also use S3 directly
• Standard data operations can read from & write to S3 buckets
Storage Configuration
Three types of storage, plus the root device:
• Instance Storage
• Elastic Block Storage (EBS)
• Object Storage (S3)
• Root Device
Storage Configuration - Root Device
• For operating system and logs
• Use EBS gp2 volumes as root devices
• At least 500 GB for OS, CDH software, and logs
• Do not use instance storage for the root device!
Capacity Planning
• AWS makes expansion easy; advance planning makes things
easier
• Consider workloads: how much storage vs compute? balanced?
• Consider data replication (3x), growth, retention
• Low storage density: r3.8xlarge or c4.8xlarge provide less storage
but more compute and memory
• High storage density: d2.8xlarge offers 48 TB per instance with a
good amount of compute and memory
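A back-of-the-envelope sizing calculation ties these considerations together. This is a sketch under stated assumptions: the 3x replication and 48 TB per d2.8xlarge come from the slide, while the 25% headroom for temporary and non-HDFS data, the growth figures, and the function name are hypothetical.

```python
import math


def workers_needed(data_tb, replication=3, annual_growth=0.0, years=0,
                   headroom=0.25, storage_per_node_tb=48.0):
    """Estimate worker count from logical data size, replication, and growth.

    headroom is the fraction of each node's disk kept free (an assumption,
    not a figure from the slides).
    """
    projected_tb = data_tb * (1 + annual_growth) ** years
    raw_tb = projected_tb * replication
    usable_per_node_tb = storage_per_node_tb * (1 - headroom)
    return math.ceil(raw_tb / usable_per_node_tb)


# 100 TB of data today on d2.8xlarge workers.
print(workers_needed(100))                               # 9
# The same data growing 50% per year, sized for two years out.
print(workers_needed(100, annual_growth=0.5, years=2))   # 19
```

The same function can be rerun with a smaller `storage_per_node_tb` to compare a compute-heavy instance type against a storage-dense one.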
Cloudera Enterprise Hardware
Requirements Guide
tiny.cloudera.com/hw-reqs
Provisioning Instances
• Cloudera Director automates most things
• Manual via EC2 command-line API tool or AWS management
console
• Don’t forget your databases (either RDS or self-managed)
• Cloudera Altus
Provisioning Instances
No matter which route you take...
• root device: 500 GB+ gp2 EBS volume
• master metadata: ephemeral, or gp2 EBS volumes (recommended)
• DFS data: ephemeral, or st1/sc1 EBS volumes (recommended)
• use tags to indicate the role instances/volumes will play
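One way to apply the tagging advice is to build the tags once and attach them to both instances and volumes at launch. The sketch below only constructs the data, in the list-of-dicts shape that the EC2 `TagSpecifications` launch parameter takes; the tag keys (`cluster`, `role`), the cluster name, and the helper name are hypothetical choices.

```python
VALID_ROLES = {"master", "worker", "edge"}  # the three node classes from the Roles slide


def tag_specifications(cluster, role):
    """Build tag specifications marking instances and their volumes with a role."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    tags = [
        {"Key": "cluster", "Value": cluster},
        {"Key": "role", "Value": role},
    ]
    # One entry per resource type, so volumes carry the same tags as instances.
    return [
        {"ResourceType": resource_type, "Tags": list(tags)}
        for resource_type in ("instance", "volume")
    ]


specs = tag_specifications("analytics-prod", "worker")
for spec in specs:
    print(spec["ResourceType"], spec["Tags"])
```

Consistent role tags make it straightforward to filter instances and volumes later, for example when resizing only the worker tier.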
Cloudera Enterprise Reference Architecture
for AWS Deployments
tiny.cloudera.com/aws-ra
Thank you
Questions?

More Related Content

PPTX
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
PPTX
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
PPTX
Big data journey to the cloud rohit pujari 5.30.18
PPTX
Big data journey to the cloud maz chaudhri 5.30.18
PPTX
Consolidate your data marts for fast, flexible analytics 5.24.18
PPTX
Driving Better Products with Customer Intelligence

PPTX
Making Self-Service BI a Reality in the Enterprise
PPTX
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
Multidisziplinäre Analyseanwendungen auf einer gemeinsamen Datenplattform ers...
Big data journey to the cloud rohit pujari 5.30.18
Big data journey to the cloud maz chaudhri 5.30.18
Consolidate your data marts for fast, flexible analytics 5.24.18
Driving Better Products with Customer Intelligence

Making Self-Service BI a Reality in the Enterprise
Part 1: Cloudera’s Analytic Database: BI & SQL Analytics in a Hybrid Cloud World

What's hot (20)

PPTX
How to Lower TCO and Avoid Cloud Lock-in

PPTX
PaaS or Fail: Rule the Cloud with Altus
PPTX
Self-service Big Data Analytics on Microsoft Azure
PPTX
Get started with Cloudera's cyber solution
PPTX
Leveraging the cloud for analytics and machine learning 1.29.19
PPTX
Cloudera - The Modern Platform for Analytics
PPTX
Transforming Insurance Analytics with Big Data and Automated Machine Learning

PPTX
Big Data Fundamentals
PPTX
Part 1: Introducing the Cloudera Data Science Workbench
PPTX
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
PPTX
High-Performance Analytics in the Cloud with Apache Impala
PPTX
Kudu Forrester Webinar
PPTX
Analyzing Hadoop Data Using Sparklyr

PPTX
What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18
PPTX
Big data journey to the cloud 5.30.18 asher bartch
PPTX
The Five Markers on Your Big Data Journey
PPTX
Evolution from Apache Hadoop to the Enterprise Data Hub by Cloudera - ArabNet...
PDF
Big Data in the Cloud? Yes, you can do it in OpenStack
PPTX
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
PPTX
Turning Data into Business Value with a Modern Data Platform
How to Lower TCO and Avoid Cloud Lock-in

PaaS or Fail: Rule the Cloud with Altus
Self-service Big Data Analytics on Microsoft Azure
Get started with Cloudera's cyber solution
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera - The Modern Platform for Analytics
Transforming Insurance Analytics with Big Data and Automated Machine Learning

Big Data Fundamentals
Part 1: Introducing the Cloudera Data Science Workbench
Standing Up an Effective Enterprise Data Hub -- Technology and Beyond
High-Performance Analytics in the Cloud with Apache Impala
Kudu Forrester Webinar
Analyzing Hadoop Data Using Sparklyr

What’s New in Cloudera Enterprise 6.0: The Inside Scoop 6.14.18
Big data journey to the cloud 5.30.18 asher bartch
The Five Markers on Your Big Data Journey
Evolution from Apache Hadoop to the Enterprise Data Hub by Cloudera - ArabNet...
Big Data in the Cloud? Yes, you can do it in OpenStack
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Turning Data into Business Value with a Modern Data Platform
Ad

Similar to Customer Best Practices: Optimizing Cloudera on AWS (20)

PPTX
AWS The Enterprise Cloud 2015
PDF
Intro to Amazon Web Services (AWS) and Gen AI
PPTX
AWS solution Architect Associate study material
PPTX
AWS 101 - An Introduction to the Amazon Cloud
PPTX
Five Tips for Running Cloudera on AWS
PPTX
Amazon Webservices Introduction And Core Modules
PDF
ARCHITECTING CLOUD SOLUTIONS: LEVERAGING AWS FOR SCALABLE AND SECURE SYSTEMS
PPTX
AWS Initiate Day Mexico City | Sesión Plenaria
PDF
Introduction to the AWS Cloud from Digital Tuesday Meetup
PDF
01 aw some day_main track_aws basics
PPTX
Pitt Immersion Day- Module 1
PPTX
Aws 101 garage+
PDF
AWS 101: Introduction to AWS
PDF
AWS CloudSchool Introduction - December 2014
PDF
An Introduction to AWS
PDF
Aws cloud best_practices
PDF
Amazon Web Services - The New Normal
PDF
Cloud 101: Your Gateway to Computing Freedom With AWS
PDF
Aws in enterprise applications
PPTX
AWSome Day Roadshow 2017
AWS The Enterprise Cloud 2015
Intro to Amazon Web Services (AWS) and Gen AI
AWS solution Architect Associate study material
AWS 101 - An Introduction to the Amazon Cloud
Five Tips for Running Cloudera on AWS
Amazon Webservices Introduction And Core Modules
ARCHITECTING CLOUD SOLUTIONS: LEVERAGING AWS FOR SCALABLE AND SECURE SYSTEMS
AWS Initiate Day Mexico City | Sesión Plenaria
Introduction to the AWS Cloud from Digital Tuesday Meetup
01 aw some day_main track_aws basics
Pitt Immersion Day- Module 1
Aws 101 garage+
AWS 101: Introduction to AWS
AWS CloudSchool Introduction - December 2014
An Introduction to AWS
Aws cloud best_practices
Amazon Web Services - The New Normal
Cloud 101: Your Gateway to Computing Freedom With AWS
Aws in enterprise applications
AWSome Day Roadshow 2017
Ad

More from Cloudera, Inc. (20)

PPTX
Partner Briefing_January 25 (FINAL).pptx
PPTX
Cloudera Data Impact Awards 2021 - Finalists
PPTX
2020 Cloudera Data Impact Awards Finalists
PPTX
Edc event vienna presentation 1 oct 2019
PPTX
Machine Learning with Limited Labeled Data 4/3/19
PPTX
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
PPTX
Introducing Cloudera DataFlow (CDF) 2.13.19
PPTX
Introducing Cloudera Data Science Workbench for HDP 2.12.19
PPTX
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
PPTX
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
PPTX
Leveraging the Cloud for Big Data Analytics 12.11.18
PPTX
Modern Data Warehouse Fundamentals Part 3
PPTX
Modern Data Warehouse Fundamentals Part 2
PPTX
Modern Data Warehouse Fundamentals Part 1
PPTX
Extending Cloudera SDX beyond the Platform
PPTX
Federated Learning: ML with Privacy on the Edge 11.15.18
PPTX
Analyst Webinar: Doing a 180 on Customer 360
PPTX
Build a modern platform for anti-money laundering 9.19.18
PPTX
Introducing the data science sandbox as a service 8.30.18
PPTX
Cloudera SDX
Partner Briefing_January 25 (FINAL).pptx
Cloudera Data Impact Awards 2021 - Finalists
2020 Cloudera Data Impact Awards Finalists
Edc event vienna presentation 1 oct 2019
Machine Learning with Limited Labeled Data 4/3/19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Leveraging the Cloud for Big Data Analytics 12.11.18
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 1
Extending Cloudera SDX beyond the Platform
Federated Learning: ML with Privacy on the Edge 11.15.18
Analyst Webinar: Doing a 180 on Customer 360
Build a modern platform for anti-money laundering 9.19.18
Introducing the data science sandbox as a service 8.30.18
Cloudera SDX

Recently uploaded (20)

PDF
Review of recent advances in non-invasive hemoglobin estimation
PPTX
A Presentation on Artificial Intelligence
PDF
cuic standard and advanced reporting.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
DOCX
The AUB Centre for AI in Media Proposal.docx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
NewMind AI Monthly Chronicles - July 2025
PPT
Teaching material agriculture food technology
PDF
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
PDF
KodekX | Application Modernization Development
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Approach and Philosophy of On baking technology
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Big Data Technologies - Introduction.pptx
Review of recent advances in non-invasive hemoglobin estimation
A Presentation on Artificial Intelligence
cuic standard and advanced reporting.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
The AUB Centre for AI in Media Proposal.docx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Building Integrated photovoltaic BIPV_UPV.pdf
Peak of Data & AI Encore- AI for Metadata and Smarter Workflows
Diabetes mellitus diagnosis method based random forest with bat algorithm
NewMind AI Monthly Chronicles - July 2025
Teaching material agriculture food technology
How UI/UX Design Impacts User Retention in Mobile Apps.pdf
KodekX | Application Modernization Development
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
Dropbox Q2 2025 Financial Results & Investor Presentation
Approach and Philosophy of On baking technology
20250228 LYD VKU AI Blended-Learning.pptx
Advanced methodologies resolving dimensionality complications for autism neur...
Chapter 3 Spatial Domain Image Processing.pdf
Big Data Technologies - Introduction.pptx

Customer Best Practices: Optimizing Cloudera on AWS

  • 1. 1© Cloudera, Inc. All rights reserved. Customer Best Practices: Optimizing Cloudera on AWS Josh Hammer | Partner Solutions Architect, AWS Hervé Bertacchi | Director of Big Data Solutions Architecture, Celgene Alex Moundalexis | Cloudera Software Engineer | @technmsg
  • 2. 2© Cloudera, Inc. All rights reserved. Agenda • The AWS Difference (Josh Hammer) - 15 min • Celgene Customer Story (Hervé Bertacchi) - 10 min • Optimizing Cloudera on AWS (Alex Moundalexis) - 25 min • Q&A time - 10 min
  • 3. 3© Cloudera, Inc. All rights reserved. Our Speakers Josh Hammer is Partner Solution Architect with AWS. Before joining the Partner Solution Architect team, he was a Security Architect with AWS Professional Services helping large enterprises adopt AWS. Prior to joining AWS, Josh was a Security Architect at Oracle. Alex Moundalexis is a software engineer at Cloudera responsible for partner platforms. Formerly on Cloudera's professional services team, Alex spent several years installing and configuring Hadoop clusters across the United States and abroad.
  • 4. 4© Cloudera, Inc. All rights reserved. Hervé Bertacchi, Director of Big Data Solutions Architecture is a Senior Information Technology leader with more than two decades of experience delivering strategy, long-range planning, solutions, and enabling innovation. Together with his team, Hervé received Cloudera’s 2017 Data Impact Award for establishing Celgene’s platform for Big Data Analytics that enabled the company to unlock new insights and drastically improve the productivity of its data scientists. Previously, Hervé led the Information Management and Commercial Insight function at AstraZeneca, where he established a Big Data capability and modernized the legacy MDM platform. Earlier in his career, he held various architecture leadership roles, as well as, being an entrepreneur. Hervé holds a Master of Science in Computer Science from Stevens Institute of Technology and has completed advanced studies in Mathematics and Physics at Lycée Masséna. Our Speakers
  • 5. 5© Cloudera, Inc. All rights reserved. Ways to Deploy Cloudera Enterprise on AWS On Premise AWS Cloud, IaaS AWS Cloud, PaaS Analytic DBMS Operational DBMS Data Engineering Altus Platform Services = current = planned Data Science Cloudera Director
  • 6. 6© Cloudera, Inc. All rights reserved. The AWS Difference
  • 7. 7© Cloudera, Inc. All rights reserved. What sets AWS apart? • Experience (1M+ customers): building and managing cloud since 2006 • Service breadth & depth, pace of innovation: 90+ services to support any cloud workload; rapid customer-driven releases • Global footprint: 16 regions, 42 availability zones, 76 edge locations • Ecosystem: thousands of partners; 3,800+ Marketplace products • Security: fine-grained control • Artificial Intelligence: fully integrated in AWS • Enterprise leader: Gartner Magic Quadrant recognizing AWS
  • 8. 8© Cloudera, Inc. All rights reserved. Security: Security and Compliance from the Ground Up, Giving Customers Fine-grained Control • Fine-grained control: a customer might restrict a user to creating a database only in a specific Region, only from 3-5 pm Monday to Friday, only in a virtual private cloud, and only on an M4 instance with a maximum number of IOPS. This can be done only in AWS. • Shared responsibility model: AWS manages the “security of the cloud”, and customers operate the “security in the cloud”. Customers retain control of what security they choose. • AWS has more than 50 compliance certifications and accreditations. “We worked closely with the Amazon team to develop a security model, which we believe enables us to operate more securely in the public cloud than we can even in our data centers.” Capital One
  • 9. 9© Cloudera, Inc. All rights reserved. The AWS Platform • Account Support: Support; Managed Services; Professional Services; Partner Ecosystem; Training & Certification; Solution Architects; Account Management; Security & Pricing Reports; Technical Acct. Management • Marketplace: Business Applications; DevOps Tools; Business Intelligence; Security; Networking; Database & Storage; SaaS Subscriptions; Operating Systems • Mobile: Build, Test, Monitor Apps; Push Notifications; Build, Deploy, Manage APIs; Device Testing; Identity • Enterprise Applications: Document Sharing; Email & Calendaring; Hosted Desktops; Application Streaming; Backup • Game Development: 3D Game Engine; Multi-player Backends • Mgmt. Tools: Monitoring; Auditing; Service Catalog; Server Management; Configuration Tracking; Optimization; Resource Templates; Automation • Analytics: Query Large Data Sets; Elasticsearch; Business Analytics; Hadoop/Spark; Real-time Data Streaming; Orchestration Workflows; Managed Search; Managed ETL • Artificial Intelligence: Voice & Text Chatbots; Machine Learning; Text-to-Speech; Image Analysis • IoT: Rules Engine; Local Compute and Sync; Device Shadows; Device Gateway; Registry • Hybrid: Devices & Edge Systems; Data Integration; Integrated Networking; Resource Management; VMware on AWS; Identity Federation • Migration: Application Discovery; Application Migration; Database Migration; Server Migration; Data Migration • Infrastructure: Regions; Availability Zones; Points of Presence • Compute: Containers; Event-driven Computing; Virtual Machines; Simple Servers; Auto Scaling; Batch; Web Applications • Storage: Object Storage; Archive; Block Storage; Managed File Storage; Exabyte-scale Data Transport • Database: MariaDB; Data Warehousing; NoSQL; Aurora; MySQL; Oracle; SQL Server; PostgreSQL • Application Services: Transcoding; Step Functions; Messaging • Security: Certificate Management; Web App. Firewall; Identity & Access; Key Storage & Management; DDoS Protection; Application Analysis; Active Directory • Dev Tools: Private Git Repositories; Continuous Delivery; Build, Test, and Debug; Deployment • Networking: Isolated Resources; Dedicated Connections; Load Balancing; Scalable DNS; Global CDN
  • 10. 10© Cloudera, Inc. All rights reserved. AWS’ History of Innovation • Customer-driven services and features launched per year (chart): 2009: 48; 2011: 82; 2013: 280; 2015: 722; 2016: 1,017 • AWS has been continually expanding its services to support virtually any cloud workload, and it now has more than 90 services that range from compute, storage, networking, database, analytics, application services, deployment, management, developer, mobile, Internet of Things (IoT), Artificial Intelligence (AI), security, hybrid and enterprise applications. In 2016, we launched 1,017 new services and features. As of April 1st, we have launched 236 new features and services in 2017. * As of 1 July 2016
  • 11. 11© Cloudera, Inc. All rights reserved. Experience with Operational Reliability More than a decade building the most reliable, secure, scalable, and cost-effective infrastructure. Availability Zones exist on isolated fault lines, flood plains, networks, and electrical grids to substantially reduce the chance of simultaneous failure. AWS has reduced prices 61 times since AWS launched in 2006 [as of May 2017]. Millions of active customers use AWS
  • 12. 12© Cloudera, Inc. All rights reserved. Experience with Hybrid • AWS provides the broadest set of hybrid capabilities of any cloud provider (networking, data, access, management and application) without requiring any new on-premises hardware purchases. • The partnership with VMware allows customers to seamlessly run existing VMware workloads on AWS with the skills and toolsets they already have. • Customers continue to use key enterprise solutions from Microsoft, Oracle, SAP… which are also offered as fully managed services. • Customers continue to use their on-premises investments while getting the full benefits of the cloud: flexibility, cost-effectiveness, reliability, scalability and security. “We’re running a hybrid architecture, and we still host some of our applications in our own data center, but we are moving more and more applications to the cloud…We’re finding that we have the flexibility to support both environments because of the tooling, APIs, and management features AWS has on top of its cloud. And, with the flexibility we get using AWS, we can adapt over time to our company’s changing needs.” CSRA (US Gov.)
  • 13. 13© Cloudera, Inc. All rights reserved. Global Footprint • 190 countries • 2,300 government agencies • 7,000 educational institutions • 22,000 nonprofit organizations • 16 regions (2 more announced), 42 availability zones, 73 edge locations • AWS Regions consist of multiple Availability Zones (“AZs”) that are isolated against failures and connected by low-latency networks, making AWS the only provider that natively supports HA • Regions and number of Availability Zones: AWS GovCloud (2); US West: Oregon (3), Northern California (3); US East: Northern Virginia (5), Ohio (3); Canada: Central (2); South America: São Paulo (3); Europe: Ireland (3), Frankfurt (2), London (2); Asia Pacific: Singapore (2), Sydney (3), Tokyo (3), Seoul (2), Mumbai (2); China: Beijing (2) • New Regions (coming soon): Paris, Ningxia
  • 14. 14© Cloudera, Inc. All rights reserved. Production applications that are highly available The AWS Cloud infrastructure: • Availability Zones (42) consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities. • A Region (16) is a physical location in the world where we have multiple Availability Zones. [Diagram: Regions and their Availability Zones] Availability Zones (AZs) provide the resiliency of performing real-time data replication and the reliability of multiple physical locations: 1. Low latency ensures real-time data replication 2. Distance ensures high availability
  • 15. 15© Cloudera, Inc. All rights reserved. Artificial Intelligence fully integrated in AWS • Application Developers: Amazon Rekognition (Image Recognition & Analysis); Amazon Machine Learning (Managed Machine Learning); Amazon Polly (Text-to-Speech); Amazon Lex (Natural Language Understanding (NLU) & Automatic Speech Recognition (ASR)) • Data Scientists & Researchers: AWS Deep Learning AMI (use and scale deep learning frameworks quickly and easily)
  • 16. 16© Cloudera, Inc. All rights reserved. An Expansive Ecosystem Thousands of the world’s largest technology and consulting companies 55 Premier Consulting Partners 17 Enterprise-focused competencies 3,800+ products Customers run over 370 M hours of software per month Products fully integrated with AWS platform and easy to fully test
  • 17. 17© Cloudera, Inc. All rights reserved. AWS Positioned as a Leader in the Gartner Magic Quadrant for Cloud Infrastructure as a Service, Worldwide* AWS is positioned highest in execution and furthest in vision within the Leaders Quadrant *Gartner, Magic Quadrant for Cloud Infrastructure as a Service, Worldwide, Leong, Lydia, Petri, Gregor, Gill, Bob, Dorosh, Mike, August 3, 2016 This graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from AWS : http://www.gartner.com/doc/reprints?id=1-2G2O5FC&ct=150519&st=sb Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose. Marc Benioff, Salesforce.com CEO: “There is no public cloud infrastructure provider that is more sophisticated or has more robust enterprise capabilities for supporting the needs of our growing global customer base.” Jim Fowler, GE CIO: “AWS is our trusted partner that is going to run our company for the next 140 years.” Airbnb: “Amazon Web Services listens to customers’ needs. If the feature does not yet exist, it probably will in a matter of months.”
  • 18. 18© Cloudera, Inc. All rights reserved. The Celgene Story
  • 19. 19© Cloudera, Inc. All rights reserved. Celgene is focused on the discovery, development, and commercialization of innovative therapies for patients with cancer, immune-inflammatory, and other unmet medical needs. • We are amidst a revolution in the rapidly evolving healthcare ecosystem to change the way we define and treat human diseases • We transformed our approach to harnessing data and using the derived insights to drive decision-making • We made an investment in Big Data Analytics
  • 20. 20© Cloudera, Inc. All rights reserved. WHAT WE ACHIEVED Cloudera on AWS: • Helps us bring innovative therapies to patients faster and adapt rapidly to evolving business needs • Reduced the time taken for patient data analysis by 99% and achieved 70% savings in operating costs • Improved the productivity of data scientists and analysts across the company and enabled them to answer business questions that could not be answered before
  • 21. 21© Cloudera, Inc. All rights reserved. LESSONS LEARNED • Security is not optional. Carefully plan and test your security architecture • Full automation across several technology ecosystems is not trivial. Think end-to-end and pick the right automation tool • Ease of access and trust in the data are essential. Technology is only an enabler
  • 22. 22© Cloudera, Inc. All rights reserved. Best practices for running your Cloudera Enterprise cluster on AWS
  • 23. 23© Cloudera, Inc. All rights reserved. AWS Components
  • 24. 24© Cloudera, Inc. All rights reserved. Quick AWS Component Review • EC2 - servers • S3 - object storage • EBS - block storage • RDS - we still need an RDBMS • VPC - everything needs a network
  • 25. 25© Cloudera, Inc. All rights reserved. AWS Service Limits Default limits apply to most AWS services (most can be increased) • EC2 - ten (10) m4.4xlarge instances per region • EBS - 20 TB of Throughput Optimized HDD (st1) per region • S3 - 100 buckets per account • VPC - 5 VPCs per region Defaults might limit your cluster, so plan ahead!
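The "plan ahead" advice on the service-limits slide can be turned into a quick pre-flight check. The sketch below is illustrative only: the limit values are the defaults quoted on the slide (an account's actual limits may differ), and the key names and the example plan are hypothetical.

```python
# Default limits as quoted on the slide; actual account limits may differ.
DEFAULT_LIMITS = {
    "m4.4xlarge_instances_per_region": 10,
    "st1_tb_per_region": 20,
    "s3_buckets_per_account": 100,
    "vpcs_per_region": 5,
}


def limits_exceeded(plan, limits=DEFAULT_LIMITS):
    """Return {resource: (planned, limit)} for every planned value above its limit."""
    return {
        name: (planned, limits[name])
        for name, planned in plan.items()
        if name in limits and planned > limits[name]
    }


# Hypothetical plan: 12 worker nodes, each with 4 x 1 TB st1 volumes.
plan = {
    "m4.4xlarge_instances_per_region": 12,
    "st1_tb_per_region": 12 * 4,
    "s3_buckets_per_account": 3,
    "vpcs_per_region": 1,
}

for resource, (planned, limit) in limits_exceeded(plan).items():
    print(f"{resource}: planned {planned} exceeds default limit {limit}")
```

Anything the check reports is a candidate for a limit-increase request before provisioning starts.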
  • 26. 26© Cloudera, Inc. All rights reserved. Deployment Topology
  • 27. 27© Cloudera, Inc. All rights reserved. Deployment Topology Two types of deployments from a network perspective: • Public Subnet - direct access to the Internet & AWS services • Private Subnet - instances must go through VPC endpoints to reach AWS services & NAT instances to reach the Internet If you require high-bandwidth access to data sources on the Internet or to AWS services in another region, deploy to a public subnet. Public doesn’t mean open to the world. Limit access using Security Groups.
  • 28. 28© Cloudera, Inc. All rights reserved. Deployment Topology (Public Subnet)
  • 29. 29© Cloudera, Inc. All rights reserved. Deployment Topology (Private Subnet)
  • 30. 30© Cloudera, Inc. All rights reserved. A Word About Edge Nodes Edge nodes: • Have direct access to the cluster • Users use these nodes via client applications • Web applications, BI tools, or just the Hadoop command-line client Avoid direct user access to the cluster!
  • 31. 31© Cloudera, Inc. All rights reserved. Deployment Topology with Edge Nodes
  • 32. 32© Cloudera, Inc. All rights reserved. Deployment Topology with Edge Nodes
  • 33. 33© Cloudera, Inc. All rights reserved. Roles and Instance Types
  • 34. 34© Cloudera, Inc. All rights reserved. So you want a cluster, eh?
  • 35. 35© Cloudera, Inc. All rights reserved. Deployment Model Two types of deployments: • Persistent - long-lived, always on, no spin-up time • Temporary - short-lived, can be started/stopped to save money Impacts job setup time, instance types, storage method, and cost.
  • 36. 36© Cloudera, Inc. All rights reserved. Roles Three classes of cluster nodes: • Masters - the brains behind all the cluster services • Workers - the processing power and storage • Edge nodes - provide client access to cluster services Instance type recommendations differ depending on the class and on the deployment model.
  • 37. 37© Cloudera, Inc. All rights reserved. Instance Types • Ephemeral: have locally attached storage, HDD or SSD; on instance termination, data is irrecoverable • EBS-only: no local storage, must mount EBS volumes; on instance termination, data is safe in EBS
  • 38. 38© Cloudera, Inc. All rights reserved. Networking, Connectivity, and Security
  • 39. 39© Cloudera, Inc. All rights reserved. Networking, Connectivity, and Security • Use an HVM AMI in VPC with correct network drivers • Use VPC endpoints for AWS services • Private data center connectivity via VPN or Direct Connect • Use Placement Groups, a logical grouping of EC2 instances • Security groups are analogous to firewalls, for example: permit traffic from sg-clusterNNN to sg-clusterNNN; permit traffic from office-net to edge-node tcp 22
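The two security-group rules on the slide above can be written down as data. The sketch below builds plain dictionaries in the shape that boto3's `authorize_security_group_ingress` accepts for its `IpPermissions` argument; it does not call AWS. The group ID and the office CIDR are placeholders, not values from the talk.

```python
def cluster_internal_rule(cluster_sg_id):
    """Permit all traffic between members of the cluster's own security group."""
    return {
        "IpProtocol": "-1",  # "-1" selects all protocols and ports
        "UserIdGroupPairs": [{"GroupId": cluster_sg_id}],
    }


def ssh_from_office_rule(office_cidr):
    """Permit SSH (tcp 22) to edge nodes from the office network only."""
    return {
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": office_cidr}],
    }


# Placeholder identifiers (assumptions for illustration).
cluster_rules = [cluster_internal_rule("sg-clusterNNN")]
edge_rules = [ssh_from_office_rule("203.0.113.0/24")]

print(cluster_rules)
print(edge_rules)
```

Keeping the cluster rule self-referential means new nodes only need to join the group to talk to their peers, while the edge rule stays the single narrow entry point for users.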
  • 40. 40© Cloudera, Inc. All rights reserved. Storage Configuration
  • 41. 41© Cloudera, Inc. All rights reserved. Storage Configuration Three types of storage: • Instance Storage • Elastic Block Storage (EBS) • Object Storage (S3)
  • 42. 42© Cloudera, Inc. All rights reserved. Storage Configuration - Instance Storage • Attached to EC2 instances, like physical disks on a physical server • Lifetime of storage == Lifetime of EC2 instance • Each EC2 instance type has a different amount of storage, e.g. c1.xlarge has 4 x 420 GB; d2.8xlarge has 24 x 2 TB • For HDFS data directories, we like HDD instance storage • Backup planning -- multi-instance shutdowns, multi-VM AWS events
  • 43. 43© Cloudera, Inc. All rights reserved. Storage Configuration - EBS • Persistent block level storage volumes • Can be encrypted at rest w/ negligible impact to latency/throughput • OS: General Purpose (gp2) • DFS: Throughput Optimized HDD (st1), Cold HDD (sc1) • Baseline & burst performance increase with size of provisioned volume e.g. 500 GB st1 baseline throughput 20 MB/s; 1000 GB → 40 MB/s
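The st1 example on the EBS slide (500 GB gives 20 MB/s, 1000 GB gives 40 MB/s) implies baseline throughput that grows linearly with volume size. A minimal sketch of that arithmetic follows; the rate is derived from the slide's two data points, and the function deliberately ignores any per-volume maximum, which is an assumption of this illustration.

```python
# Derived from the slide: 1000 GB of st1 -> 40 MB/s baseline.
ST1_MB_S_PER_1000_GB = 40


def st1_baseline_mb_s(size_gb):
    """Baseline throughput of an st1 volume, assuming purely linear scaling."""
    return size_gb * ST1_MB_S_PER_1000_GB / 1000


for size_gb in (500, 1000, 2000):
    print(f"{size_gb} GB st1 -> {st1_baseline_mb_s(size_gb):.0f} MB/s baseline")
```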
  • 44. 44© Cloudera, Inc. All rights reserved. Storage Configuration - EBS Recommendations Instance selection • Use EBS-optimized instances OR instances with 10Gb+ network • Minimum dedicated EBS bandwidth of 1000 Mb/s (125 MB/s) Volume selection • Baseline performance, 40 MB/s or better (1000 GB st1, 3200 GB sc1) • Do not exceed instance’s dedicated EBS bandwidth!
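The last recommendation on the slide above, not exceeding the instance's dedicated EBS bandwidth, comes down to a unit conversion and a sum. The sketch below assumes each volume is described only by its baseline throughput in MB/s; the volume counts in the example are hypothetical.

```python
def mbit_to_mbyte(mbit_per_s):
    """Convert megabits per second to megabytes per second (8 bits per byte)."""
    return mbit_per_s / 8


def fits_ebs_bandwidth(volume_baselines_mb_s, instance_ebs_mbit_s):
    """Return (fits, total_mb_s, budget_mb_s) for a set of attached volumes."""
    total = sum(volume_baselines_mb_s)
    budget = mbit_to_mbyte(instance_ebs_mbit_s)
    return total <= budget, total, budget


# Slide minimum: 1000 Mb/s dedicated EBS bandwidth, i.e. 125 MB/s.
print(fits_ebs_bandwidth([40, 40, 40], 1000))      # three 40 MB/s volumes
print(fits_ebs_bandwidth([40, 40, 40, 40], 1000))  # a fourth volume oversubscribes
```

With the slide's minimum of 1000 Mb/s, three volumes at the 40 MB/s baseline fit, while a fourth would push the aggregate baseline past what the instance can deliver.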
  • 45. 45© Cloudera, Inc. All rights reserved. Storage Configuration - S3 • Great for cold backup: durable, available, inexpensive • For hot backup, use a second HDFS cluster • Hive and Spark can also use S3 directly • Standard data operations can read from & write to S3 buckets
  • 46. 46© Cloudera, Inc. All rights reserved. Storage Configuration Three types of storage: • Instance Storage • Elastic Block Storage (EBS) • Object Storage (S3) • Root Device
  • 47. 47© Cloudera, Inc. All rights reserved. Storage Configuration - Root Device • For operating system and logs • Use EBS gp2 volumes as root devices • At least 500 GB for OS, CDH software, and logs • Do not use instance storage for the root device!
  • 48. 48© Cloudera, Inc. All rights reserved. Capacity Planning
  • 49. 49© Cloudera, Inc. All rights reserved. Capacity Planning • AWS makes expansion easy; advance planning makes things easier • Consider workloads: how much storage vs compute? balanced? • Consider data replication (3x), growth, retention • Low storage density: r3.8xlarge or c4.8xlarge provide less storage but more compute and memory • High storage density: d2.8xlarge offers 48 TB per instance with a good amount of compute and memory
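The capacity-planning considerations above (3x replication, growth, storage per instance) combine into a simple node-count estimate. The sketch below uses the slide's 48 TB per d2.8xlarge and 3x replication; the `growth` and `headroom` parameters and their default values are assumptions added for illustration, not figures from the talk.

```python
import math


def workers_needed(dataset_tb, replication=3, growth=0.0, headroom=0.25,
                   raw_tb_per_node=48):
    """Estimate the worker count needed to hold a dataset in HDFS.

    dataset_tb      -- logical data size today, before replication
    replication     -- HDFS replication factor (3x per the slide)
    growth          -- expected fractional growth over the planning period (assumed)
    headroom        -- fraction of raw disk kept free for temp/shuffle data (assumed)
    raw_tb_per_node -- raw storage per worker (48 TB for d2.8xlarge per the slide)
    """
    raw_needed_tb = dataset_tb * (1 + growth) * replication / (1 - headroom)
    return math.ceil(raw_needed_tb / raw_tb_per_node)


# 100 TB today, 50% growth expected: 100 * 1.5 * 3 / 0.75 = 600 TB raw.
print(workers_needed(100, growth=0.5))
```

The same function can be rerun with a smaller `raw_tb_per_node` to see how quickly low-storage-density instance types inflate the node count for storage-bound workloads.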
  • 50. 50© Cloudera, Inc. All rights reserved. Cloudera Enterprise Hardware Requirements Guide tiny.cloudera.com/hw-reqs
  • 51. 51© Cloudera, Inc. All rights reserved. Provisioning Instances
  • 52. 52© Cloudera, Inc. All rights reserved. Provisioning Instances • Cloudera Director automates most things • Manual via EC2 command-line API tool or AWS management console • Don’t forget your databases (either RDS or self-managed) • Cloudera Altus
  • 53. 53© Cloudera, Inc. All rights reserved. Provisioning Instances No matter which route you take... • root device: 500 GB+ gp2 EBS volume • master metadata: ephemeral or recommended gp2 EBS volumes • DFS data: ephemeral or recommended st1/sc1 EBS volumes • use tags to indicate the role instances/volumes will play
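For the manual provisioning route, the root-device and tagging advice above can be captured as launch parameters. The sketch below only builds a dictionary; its `BlockDeviceMappings` and `TagSpecifications` keys are meant to mirror keyword arguments of boto3's EC2 `run_instances`, but nothing here calls AWS. The device name, tag keys, and cluster name are hypothetical.

```python
def launch_spec(cluster_name, role, root_device="/dev/sda1"):
    """Build launch parameters: 500 GB gp2 root volume plus role tags."""
    tags = [
        {"Key": "cluster", "Value": cluster_name},  # hypothetical tag key
        {"Key": "role", "Value": role},             # e.g. master, worker, edge
    ]
    return {
        "BlockDeviceMappings": [
            {
                "DeviceName": root_device,  # assumed; depends on the AMI
                "Ebs": {"VolumeSize": 500, "VolumeType": "gp2"},
            }
        ],
        # Apply the same tags to the instance and to its volumes.
        "TagSpecifications": [
            {"ResourceType": resource, "Tags": tags}
            for resource in ("instance", "volume")
        ],
    }


spec = launch_spec("analytics-prod", "worker")
print(spec)
```

Tagging volumes as well as instances makes it possible to tell later which EBS volumes held DFS data and which held master metadata.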
  • 54. 54© Cloudera, Inc. All rights reserved. Cloudera Enterprise Reference Architecture for AWS Deployments tiny.cloudera.com/aws-ra
  • 55. 55© Cloudera, Inc. All rights reserved. Thank you Questions?