1 © Hortonworks Inc. 2011–2018. All rights reserved
Running Enterprise Workloads in the Cloud
Richard Doktorics
Peter Darvasi
Who are we?
⬢ Peter Darvasi
- Partner Engineer at Hortonworks
- @pdarvasi
⬢ Richard Doktorics
- Software Engineer
- @doktoric
Agenda
⬢ What is Cloudbreak?
⬢ Enterprise checklist for big data in the cloud
⬢ Cloudbreak in da house
⬢ Questions
What is Cloudbreak?
Cloudbreak is a tool for provisioning Hadoop
clusters on any cloud infrastructure
Simplified Cluster Provisioning - prescriptive setup,
simple automation
Deploy on Public or Private
Clouds
Dynamically configure and manage
clusters on public or private clouds
(Amazon Web Services, Microsoft
Azure, Google Cloud Platform and
OpenStack)
Automated Scaling
Seamlessly manage elasticity
requirements as cluster workloads
change (Ambari Metrics / Prometheus)
Secured Cluster Access
Supports configuration defining
network boundaries and configuring
security groups
Highly Extensible
Recipes to run custom commands
Custom images
⬢ Cloudbreak Deployer (CBD)
– Written in Go and Bash
– Compiled into single binary
⬢ Micro-service architecture
– Each service runs in a Docker
container
– Each container is replaceable
with custom ones
– Services are handled with
docker-compose
Single node deployment
Enterprise checklist for big data in the cloud
✓ Control and Automation
✓ Cloudy Services
✓ Security
✓ Enterprise-Grade Support
Checklist for enterprises in the cloud
✓ Control and Automation
✓ Cloudy Services
✓ Security
✓ Enterprise-Grade Support
Checklist for enterprises in the cloud
✓ Simple UX
✓ Powerful CLI
✓ Autoscaling
Simplified UX
Create Credential Experience
Built-In Blueprints
Basic and Advanced Cluster Creation Experiences
BASIC ADVANCED
New Network and Security Group Choices
⬢ Network
– Create new Network and new
Subnet
– Choose existing Network and
existing Subnet
⬢ Security Groups
– Create new SGs
• Choose default SGs
(minimal set of ports)
• Create customized
– Choose existing SGs
Powerful CLI
Cloudbreak CLI: Designed for DevOps
“Show cli command” for every request
Auto-scaling
Auto-Scaling
⬢ Alerts: Create metric or time-based alerts for cluster scaling
⬢ Policies: Scaling policies adjust cluster size based on activity and workload
alerts
⬢ General Configurations: Boundaries and cooldown period
Auto-Scaling Time-Based Alert
Fire at 10:15 am every day
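A minimal sketch of what a daily time-based alert computes, assuming the alert fires once per day at a fixed local time. The function name `next_daily_fire` is hypothetical and not part of Cloudbreak. In Quartz-style cron syntax this schedule is commonly written `0 15 10 * * ?`; whether the Cloudbreak UI expects exactly that form is an assumption.

```python
from datetime import datetime, time, timedelta


def next_daily_fire(now: datetime, fire_at: time = time(10, 15)) -> datetime:
    """Return the next moment, strictly after `now`, at which a daily alert fires.

    Hypothetical helper for illustration only; it ignores time zones and DST.
    """
    candidate = datetime.combine(now.date(), fire_at)
    if candidate <= now:
        # Today's slot has already passed (or is exactly now), so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate
```

Calling `next_daily_fire(datetime(2018, 4, 18, 9, 0))` yields 10:15 on the same day; calling it at or after 10:15 yields 10:15 on the following day.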
Auto-Scaling Metric-Based Alert
Fire after NodeManagers are in
CRITICAL state for 10 minutes
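The "in CRITICAL state for 10 minutes" condition can be read as "the current unbroken run of CRITICAL samples is at least 10 minutes old". The sketch below illustrates that reading; the sample format (timestamp, state string) and the name `alert_fires` are assumptions, not the Ambari or Cloudbreak API.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def alert_fires(
    samples: List[Tuple[datetime, str]],
    now: datetime,
    state: str = "CRITICAL",
    period: timedelta = timedelta(minutes=10),
) -> bool:
    """True if the metric has been continuously in `state` for at least `period`.

    `samples` is a hypothetical list of (timestamp, observed_state) pairs.
    """
    run_start = None
    for ts, observed in sorted(samples):
        if ts > now:
            break  # ignore samples from the future
        if observed == state:
            if run_start is None:
                run_start = ts  # a new unbroken run begins here
        else:
            run_start = None  # any other state resets the run
    return run_start is not None and now - run_start >= period
```

A single non-CRITICAL sample resets the clock, which is one possible design choice; a real alerting system may tolerate short gaps.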
Auto-Scaling Policies
⬢ Define the Scale Adjustment (Node Count/Percentage/Exact size)
⬢ Select the HostGroup (to Scale)
⬢ Select Alert (which when fired, executes the Policy)
Auto-Scaling General Configurations
⬢ Cooldown Period (between scaling actions)
⬢ Minimum and Maximum Cluster size (boundaries)
Cluster size boundaries
Time interval between two autoscale events
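How a scaling policy, the cooldown period and the size boundaries might interact can be sketched as follows. The adjustment-type labels (`NODE_COUNT`, `PERCENTAGE`, `EXACT`), the rounding rule and all names are assumptions chosen for illustration; they mirror the three adjustment kinds named on the policy slide but are not Cloudbreak's actual implementation.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ClusterLimits:
    min_size: int          # lower cluster-size boundary
    max_size: int          # upper cluster-size boundary
    cooldown: timedelta    # minimum time between two autoscale events


def target_size(current: int, adjustment_type: str, value: int) -> int:
    """Translate a scale adjustment into a desired node count (before clamping)."""
    if adjustment_type == "NODE_COUNT":
        return current + value
    if adjustment_type == "PERCENTAGE":
        # Rounding up is an assumption; a real system may round differently.
        return current + math.ceil(current * value / 100)
    if adjustment_type == "EXACT":
        return value
    raise ValueError(f"unknown adjustment type: {adjustment_type}")


def apply_policy(
    current: int,
    adjustment_type: str,
    value: int,
    limits: ClusterLimits,
    last_scaling: Optional[datetime],
    now: datetime,
) -> int:
    """Return the new cluster size, honouring cooldown and size boundaries."""
    if last_scaling is not None and now - last_scaling < limits.cooldown:
        return current  # still cooling down: ignore the alert
    desired = target_size(current, adjustment_type, value)
    return max(limits.min_size, min(limits.max_size, desired))
```

With boundaries of 3 to 10 nodes, a "+5 nodes" policy on an 8-node cluster stops at 10, and the same alert arriving inside the cooldown window leaves the cluster at 8.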
✓ Control and Automation
✓ Cloudy Services
✓ Security
✓ Enterprise-Grade Support
Checklist for enterprises in the cloud
✓ Cloud Resources
✓ Hortonworks DataFlow
✓ Custom Images
Cloud Resources:
RDBMS + LDAP
Cloud Resources: RDBMS and LDAP/AD = Dynamic Blueprints
⬢ Background:
– Cluster configuration often includes an external database (for Hive, Ranger, etc.) and LDAP/AD configs
– It’s a challenge to know the different Blueprint configuration choices per service across the stack
⬢ Dynamic Blueprints:
– Ability to manage External Sources (e.g. RDBMS and LDAP/AD) outside of your Blueprint
– Cloudbreak will inject the configurations into your Blueprint
– Simplifies reuse of external cloud resources
– Simplifies your Blueprints -> don’t have to know all the configurations for each component
Dynamic Blueprints: RDBMS/LDAP
⬢ Built-In Components:
– Atlas, Ranger, Hadoop, Hive LLAP, Hive, Ambari, Oozie, Druid, Superset
JDBC/LDAP properties in Blueprint for the Component?
– Yes: Use Blueprint as-is, no Component configuration property injection
– No: Inject Component configuration properties
Perform property variable replacement
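The decision above can be sketched in a few lines. This is an illustration under stated assumptions, not Cloudbreak's code: the blueprint is modelled as a dict with a `configurations` list, `COMPONENT_TEMPLATES` and `inject` are hypothetical names, and variable replacement is applied only to the injected properties (the slide does not say whether it also runs on the "Yes" branch).

```python
import copy
from string import Template
from typing import Dict, Tuple

# Hypothetical per-component templates: config type plus properties with $variables.
COMPONENT_TEMPLATES: Dict[str, Tuple[str, Dict[str, str]]] = {
    "hive": (
        "hive-site",
        {
            "javax.jdo.option.ConnectionURL": "$url",
            "javax.jdo.option.ConnectionUserName": "$user",
        },
    ),
}


def has_config(blueprint: dict, config_type: str, keys) -> bool:
    """True if the blueprint already sets any of `keys` under `config_type`."""
    for entry in blueprint.get("configurations", []):
        props = entry.get(config_type, {})
        props = props.get("properties", props)  # tolerate both nesting styles
        if any(key in props for key in keys):
            return True
    return False


def inject(blueprint: dict, component: str, source: Dict[str, str]) -> dict:
    """Return the blueprint as-is, or a copy with the external source injected."""
    config_type, template = COMPONENT_TEMPLATES[component]
    if has_config(blueprint, config_type, template):
        return blueprint  # "Yes" branch: the user's own settings win
    # "No" branch: inject the properties, replacing $variables with real values.
    rendered = {k: Template(v).substitute(source) for k, v in template.items()}
    result = copy.deepcopy(blueprint)
    result.setdefault("configurations", []).append(
        {config_type: {"properties": rendered}}
    )
    return result
```

Returning a copy keeps the registered blueprint reusable across clusters that point at different external databases.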
At-Motion Workloads:
Hortonworks DataFlow
Hortonworks DataFlow in Cloudbreak
⬢ Default blueprint: “Flow Management: Apache NiFi”
HDF 3.1: NiFi, Ambari, Ambari Metrics, ZooKeeper
HDF - cluster creation
HDF - cluster creation
Custom Images
Background: Cloudbreak
1. Cloudbreak creates VM instances using a default base image.
2. Cloudbreak installs Ambari on a VM instance.
3. Cloudbreak instructs Ambari to install an HDP Cluster on other VM instances.
[Diagram: Cloudbreak provisioning an HDP Cluster of six HDP Node VMs (RHEL 7)]
Background: Cloudbreak Default Images
⬢ By default, Cloudbreak uses default base public images when creating VM instances.
Cloud | Standard Image Operating System
AWS | Amazon Linux 2017
Azure | CentOS 7.x
Google Cloud Platform | CentOS 7.x
OpenStack | CentOS 7.x
Support for Custom Images provides a way for Cloudbreak
users to leverage their own custom image (not the default
image) when creating VM instances.
Making a Custom Image: Overview
1. Create the Custom Image
2. Register the Custom Image in Cloudbreak
3. Use the Custom Image when Creating a Cluster
Creating the Image: Code Repository
⬢ Instructions, Packer scripts and Salt states in public GitHub repository
– https://github.com/hortonworks/cloudbreak-images
⬢ An understanding of Packer and Salt is useful
– Packer creates infrastructure
– Packer runs Salt provisioner
⬢ Customer should clone the repository and build on it
Creating the Image: Example Scenarios
SCENARIO: For AWS: I don’t want Amazon Linux and instead want RHEL 7
APPROACH:
1. Set up repository and AWS environment
2. Use the repository tools to build a RHEL 7 image
make build-aws-rhel7
SCENARIO: I don’t want OpenJDK and instead want Oracle JDK
APPROACH:
1. Set up repository and environment
2. Turn on Oracle optional state
3. Use the repository tools to build an image
SCENARIO: For AWS: I don’t want Amazon Linux and instead want MY RHEL 7 (** This is an advanced scenario **)
APPROACH:
1. Set up repository and AWS environment
2. Change the source base image
3. Use the repository tools to build a RHEL 7 image
make build-aws-rhel7
Use the Custom Image: Create Cluster (UI)
⬢ Create Cluster > General Configuration > Advanced
– Choose image catalog
– Adjust the Ambari + HDP repos (if you want)
– Choose image you registered
Pre-Warmed Images
Image type | Pros | Cons
Prewarmed (OS + pre-installed Ambari and HDP) | Cluster installs are faster; no internet connection is needed | Cannot change the Ambari or HDP versions; cannot use local repositories
Base (OS only) | Can change the Ambari or HDP versions, or use local repositories | Cluster installs take longer
✓ Control and Automation
✓ Cloudy Services
✓ Security
✓ Enterprise-Grade Support
Checklist for enterprises in the cloud
✓ Kerberos support
✓ LDAP integration
✓ Proxy configuration
Cluster Security:
Kerberos
What is Kerberos?
⬢ Strongly authenticating and establishing a user’s identity is the basis for secure access in
Hadoop. Users need to be able to reliably “identify” themselves and then have that
identity propagated throughout the Hadoop cluster.
⬢ Kerberos is the de facto system for authenticating access to distributed services.
Cloudbreak: Support for Enabling Kerberos
Goal
Provide a way for Cloudbreak users to create clusters that
are Kerberos enabled
Approach
Ambari exposes a lot of Kerberos options
Leverage Ambari Kerberos options and avoid re-creating
Ambari Kerberos experience
Pragmatic, prescriptive options on top
Cloudbreak: Enable Kerberos Security
⬢ Create Cluster > Security > Advanced
⬢ [ ] Enable Kerberos Security
Options: Use Existing KDC or Use Test KDC
Use Test KDC
- Not for production use. For testing and evaluation purposes only.
- Installs and configures an MIT KDC on the master node.
- Configures the cluster to leverage that KDC.
Use Existing KDC (Basic)
- Provide basic information about your existing KDC.
- Ambari Kerberos descriptors are generated automatically.
Use Existing KDC (Advanced)
- Provide basic information about your existing KDC.
- Provide your own Ambari Kerberos descriptors.
Cloudbreak + LDAP/AD
Cloudbreak User AuthN
⬢ Goal: Configure Cloudbreak to authenticate users against an external LDAP/AD
– CloudFoundry UAA (User Account and Authentication Server) is the foundation
https://github.com/cloudfoundry/uaa
⬢ Two parts:
1. Configure Cloudbreak to talk to external LDAP/AD
2. Configure which group(s) can access Cloudbreak
Step 1: Configure Cloudbreak to talk to LDAP/AD
⬢ On the Cloudbreak host, create:
/var/lib/cloudbreak-deployment/uaa-changes.yml
⬢ Define LDAP profile for users and groups
Step 2: Configure which group(s) can access Cloudbreak
⬢ Configure which group(s) are authorized to access Cloudbreak:
cbd util execute-ldap-mapping [group]
cbd util delete-ldap-mapping [group]
⬢ To authorize users in the “Analysts” group to access Cloudbreak:
cbd util execute-ldap-mapping cn=Analysts,ou=Groups,dc=hortonworks,dc=local
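When several groups need access, the mapping commands can be generated rather than typed. The sketch below only builds the argument list for the two `cbd util` subcommands shown above; the helper names and the default base DN (taken from the Analysts example) are assumptions, and group names containing commas or other DN special characters are not escaped.

```python
from typing import List


def group_dn(group: str, base_dn: str = "ou=Groups,dc=hortonworks,dc=local") -> str:
    """Build a group DN; the default base DN is copied from the slide's example."""
    return f"cn={group},{base_dn}"


def ldap_mapping_command(group: str, remove: bool = False) -> List[str]:
    """Return the argv for granting (or revoking) a group's access to Cloudbreak."""
    action = "delete-ldap-mapping" if remove else "execute-ldap-mapping"
    return ["cbd", "util", action, group_dn(group)]
```

The returned list could be handed to `subprocess.run(...)` on the Cloudbreak host, run from the deployment directory.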
Proxy configuration
Limited Outbound Internet Access
⬢ Handle enterprise scenarios where:
– Outbound internet access is limited (or restricted), and/or
– A proxy must be used to obtain internet access
Outbound destinations (optionally via an HTTP/HTTPS proxy):
Cloudbreak
• Docker Hub
• Cloudbreak dependencies
• Default Image Catalog
Cloudbreak and Cluster Hosts
• Cloud Provider APIs
• HDP or HDF platform repositories
Internet Access via Proxy
Cloudbreak Proxy Setup: How does Cloudbreak communicate through a proxy to get to the internet (and to the cluster hosts)?
Clusters Proxy Setup: How do the Cluster Hosts communicate through a proxy to get to the internet?
Cloudbreak: Proxy Setup
⬢ Set up Docker Environment to use Proxy
– Modify the Docker service to set HTTP_PROXY and HTTPS_PROXY (and NO_PROXY)
https://docs.docker.com/config/daemon/systemd/#httphttps-proxy
⬢ Set up Cloudbreak to use Proxy in Profile
HTTP_PROXY_HOST=your-proxy-host
HTTPS_PROXY_HOST=your-proxy-host
PROXY_PORT=your-proxy-port
PROXY_USER=your-proxy-user
PROXY_PASSWORD=your-proxy-password
⬢ Advanced Profile option “HTTPS_PROXYFORCLUSTERCONNECTION=true|false”
– Defaults to “false”
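A small sketch of how the Profile entries could be generated consistently across environments. Only the variable names come from the slide; the function `render_proxy_profile` is hypothetical, and the plain `KEY=value` form follows the slide (depending on how the Profile is sourced, an `export` prefix may be required, which is an assumption to verify).

```python
import shlex
from typing import Optional


def render_proxy_profile(
    host: str,
    port: int,
    user: Optional[str] = None,
    password: Optional[str] = None,
    proxy_for_cluster: bool = False,
) -> str:
    """Render proxy settings as shell-safe KEY=value lines for the Profile file."""
    settings = {
        "HTTP_PROXY_HOST": host,
        "HTTPS_PROXY_HOST": host,
        "PROXY_PORT": str(port),
    }
    # Credentials are only needed when the proxy requires authentication.
    if user is not None:
        settings["PROXY_USER"] = user
    if password is not None:
        settings["PROXY_PASSWORD"] = password
    # "false" matches the documented default (scenario #1: proxy for internet only).
    settings["HTTPS_PROXYFORCLUSTERCONNECTION"] = "true" if proxy_for_cluster else "false"
    return "\n".join(f"{key}={shlex.quote(value)}" for key, value in settings.items())
```

Quoting with `shlex.quote` keeps passwords containing spaces or shell metacharacters intact when the file is read by a shell.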
Cloudbreak: Advanced Proxy Scenarios
SCENARIO #1: Proxy for internet, not clusters SCENARIO #2: Proxy for internet and clusters
Clusters: Register Proxy Configuration
⬢ External Sources > Proxy Configurations
(optional) if proxy requires authentication
Clusters: Configure Proxy for Cluster Hosts
⬢ Create Cluster > Advanced > External Sources > Configure Proxy
• Configures yum with “proxy” settings
• Configures Ambari Server with “httpProxy” settings
✓ Control and Automation
✓ Cloudy Services
✓ Security
✓ Enterprise-Grade Support
Checklist for enterprises in the cloud
✓ SmartSense integration
✓ Flex support
Cloudbreak in da house
⬢ Have an internally hosted Cloudbreak service for…
– our CI/CD pipeline
– testing and prototyping HDP and HDF services
– self-service clusters for QE/SE/PS teams
Main use cases
⬢ Run Cloudbreak in HA (High Availability) mode
– Ability to recover flows in case of node failure
– Avoid master-slave design / leader election problems
⬢ Scale Cloudbreak as we desire
– Distribute each cluster related flow
– Cannot run 2 flows for the same cluster at the same time (e.g., 2 upscale flows)
– Flow cancellation must be handled
⬢ Scale the Web UI
– Had to introduce a Redis cluster for the session store
⬢ Scale every other service as well
⬢ Find a tool that makes it easy to deploy these services to multiple nodes
Our technical goals
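The "one flow per cluster" constraint can be illustrated with a simple guard. This is a single-process sketch only: `FlowRegistry` is a hypothetical name, and in an HA deployment with several Cloudbreak nodes the same check would have to live in a store shared by all nodes (for example the database), which is an assumption about the design rather than a description of it.

```python
import threading
from typing import Dict, Optional


class FlowRegistry:
    """Tracks which cluster currently has a running flow (in-process only)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._running: Dict[str, str] = {}  # cluster id -> flow name

    def try_start(self, cluster_id: str, flow: str) -> bool:
        """Register `flow` for the cluster; refuse if another flow is running."""
        with self._lock:
            if cluster_id in self._running:
                return False
            self._running[cluster_id] = flow
            return True

    def finish(self, cluster_id: str) -> None:
        """Mark the cluster's flow as finished or cancelled."""
        with self._lock:
            self._running.pop(cluster_id, None)

    def running(self, cluster_id: str) -> Optional[str]:
        """Return the name of the running flow, if any."""
        with self._lock:
            return self._running.get(cluster_id)
```

Flows for different clusters proceed independently, so work can be distributed per cluster while a second upscale on the same cluster is rejected until the first finishes or is cancelled.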
⬢ Not because it’s fancy...
⬢ Evaluated Kubernetes, Swarm, Mesos, Rancher
⬢ Open source / Active community with hands-on experience
⬢ Many cloud providers already support it
⬢ Lots of tooling behind it / API / CLI / Helm / Ansible / Salt
⬢ Integration with most of the cloud providers
– Provision Load Balancer (GCP, AWS, Azure)
– Use object stores to share data (Ceph, S3, GCP bucket, Azure Storage Account)
– Dynamic volume provisioning / Persistent disk (EBS, Azure Blob)
Why Kubernetes?
Thank you

More Related Content

PDF
What is new in Apache Hive 3.0?
PDF
What s new in spark 2.3 and spark 2.4
PPTX
Apache Hive 2.0: SQL, Speed, Scale
PPTX
Double Your Hadoop Hardware Performance with SmartSense
PDF
Meet HBase 2.0 and Phoenix-5.0
PPTX
Running a container cloud on YARN
PPTX
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
PPTX
Hadoop Operations - Past, Present, and Future
What is new in Apache Hive 3.0?
What s new in spark 2.3 and spark 2.4
Apache Hive 2.0: SQL, Speed, Scale
Double Your Hadoop Hardware Performance with SmartSense
Meet HBase 2.0 and Phoenix-5.0
Running a container cloud on YARN
Apache Ambari - HDP Cluster Upgrades Operational Deep Dive and Troubleshooting
Hadoop Operations - Past, Present, and Future

What's hot (20)

PPTX
Connecting the Drops with Apache NiFi & Apache MiNiFi
PPTX
Running Enterprise Workloads in the Cloud
PPTX
Apache Hadoop YARN: state of the union
PPTX
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
PPTX
Ozone- Object store for Apache Hadoop
PPTX
An Overview on Optimization in Apache Hive: Past, Present Future
PPTX
Mission to NARs with Apache NiFi
PDF
An Apache Hive Based Data Warehouse
PPTX
Hive ACID Apache BigData 2016
PDF
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
PPTX
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
PPTX
Streamline Hadoop DevOps with Apache Ambari
PPTX
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
PPTX
An Apache Hive Based Data Warehouse
PPTX
Row/Column- Level Security in SQL for Apache Spark
PPTX
The Future of Apache Ambari
PPTX
Enabling ABAC with Accumulo and Ranger integration
PDF
Present and future of unified, portable and efficient data processing with Ap...
PDF
Ozone and HDFS’s evolution
PDF
Meet HBase 2.0 and Phoenix 5.0
Connecting the Drops with Apache NiFi & Apache MiNiFi
Running Enterprise Workloads in the Cloud
Apache Hadoop YARN: state of the union
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Ozone- Object store for Apache Hadoop
An Overview on Optimization in Apache Hive: Past, Present Future
Mission to NARs with Apache NiFi
An Apache Hive Based Data Warehouse
Hive ACID Apache BigData 2016
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for ...
Streamline Hadoop DevOps with Apache Ambari
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
An Apache Hive Based Data Warehouse
Row/Column- Level Security in SQL for Apache Spark
The Future of Apache Ambari
Enabling ABAC with Accumulo and Ranger integration
Present and future of unified, portable and efficient data processing with Ap...
Ozone and HDFS’s evolution
Meet HBase 2.0 and Phoenix 5.0
Ad

Similar to Running Enterprise Workloads in the Cloud (20)

PDF
Hadoop Operations - Past, Present, and Future
PDF
Hadoop Operations – Past, Present, and Future
PDF
Data in the Cloud Crash Course
PDF
Data in the Cloud Crash Course
PPTX
Running Cloudbreak on Kubernetes
PPTX
Running Cloudbreak on Kubernetes
PDF
Getting the Most Out of Your Data in the Cloud with Cloudbreak
PPTX
Micro services vs hadoop
PPTX
Cloudbreak - Technical Deep Dive
PPTX
Fortifying Multi-Cluster Hybrid Cloud Data Lakes using Apache Knox
PPTX
Hortonworks Data Cloud for AWS
PDF
Hadoop Everywhere & Cloudbreak
PDF
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
PPTX
Saving the elephant—now, not later
PDF
Sql on everything with drill
PPTX
Managing enterprise users in Hadoop ecosystem
PPTX
Built-In Security for the Cloud
PPTX
The Unbearable Lightness of Ephemeral Processing
PPTX
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
PDF
Hp
Hadoop Operations - Past, Present, and Future
Hadoop Operations – Past, Present, and Future
Data in the Cloud Crash Course
Data in the Cloud Crash Course
Running Cloudbreak on Kubernetes
Running Cloudbreak on Kubernetes
Getting the Most Out of Your Data in the Cloud with Cloudbreak
Micro services vs hadoop
Cloudbreak - Technical Deep Dive
Fortifying Multi-Cluster Hybrid Cloud Data Lakes using Apache Knox
Hortonworks Data Cloud for AWS
Hadoop Everywhere & Cloudbreak
Hortonworks Technical Workshop: HDP everywhere - cloud considerations using...
Saving the elephant—now, not later
Sql on everything with drill
Managing enterprise users in Hadoop ecosystem
Built-In Security for the Cloud
The Unbearable Lightness of Ephemeral Processing
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...
Hp
Ad

More from DataWorks Summit (20)

PPTX
Data Science Crash Course
PPTX
Floating on a RAFT: HBase Durability with Apache Ratis
PPTX
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
PDF
HBase Tales From the Trenches - Short stories about most common HBase operati...
PPTX
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
PPTX
Managing the Dewey Decimal System
PPTX
Practical NoSQL: Accumulo's dirlist Example
PPTX
HBase Global Indexing to support large-scale data ingestion at Uber
PPTX
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
PPTX
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
PPTX
Supporting Apache HBase : Troubleshooting and Supportability Improvements
PPTX
Security Framework for Multitenant Architecture
PDF
Presto: Optimizing Performance of SQL-on-Anything Engine
PPTX
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
PPTX
Extending Twitter's Data Platform to Google Cloud
PPTX
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
PPTX
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
PPTX
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
PDF
Computer Vision: Coming to a Store Near You
PPTX
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Data Science Crash Course
Floating on a RAFT: HBase Durability with Apache Ratis
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
HBase Tales From the Trenches - Short stories about most common HBase operati...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Managing the Dewey Decimal System
Practical NoSQL: Accumulo's dirlist Example
HBase Global Indexing to support large-scale data ingestion at Uber
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Security Framework for Multitenant Architecture
Presto: Optimizing Performance of SQL-on-Anything Engine
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Extending Twitter's Data Platform to Google Cloud
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Computer Vision: Coming to a Store Near You
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Recently uploaded (20)

PPTX
1. Introduction to Computer Programming.pptx
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Spectroscopy.pptx food analysis technology
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
MIND Revenue Release Quarter 2 2025 Press Release
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PDF
Unlocking AI with Model Context Protocol (MCP)
PPTX
Tartificialntelligence_presentation.pptx
PDF
Getting Started with Data Integration: FME Form 101
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PDF
A comparative analysis of optical character recognition models for extracting...
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
Empathic Computing: Creating Shared Understanding
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
cuic standard and advanced reporting.pdf
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
Electronic commerce courselecture one. Pdf
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
1. Introduction to Computer Programming.pptx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Spectroscopy.pptx food analysis technology
Building Integrated photovoltaic BIPV_UPV.pdf
MIND Revenue Release Quarter 2 2025 Press Release
Mobile App Security Testing_ A Comprehensive Guide.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Unlocking AI with Model Context Protocol (MCP)
Tartificialntelligence_presentation.pptx
Getting Started with Data Integration: FME Form 101
Accuracy of neural networks in brain wave diagnosis of schizophrenia
A comparative analysis of optical character recognition models for extracting...
20250228 LYD VKU AI Blended-Learning.pptx
Empathic Computing: Creating Shared Understanding
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
cuic standard and advanced reporting.pdf
Spectral efficient network and resource selection model in 5G networks
Electronic commerce courselecture one. Pdf
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...

Running Enterprise Workloads in the Cloud

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved Running Enterprise Workloads in the Cloud Richard Doktorics Peter Darvasi
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved Who we are? ⬢ Peter Darvasi - Partner Engineer at Hortonworks - @pdarvasi ⬢ Richard Doktorics - Software Engineer - @doktoric
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved Agenda ⬢ What is Cloudbreak? ⬢ Enterprise checklist for big data in the cloud ⬢ Cloudbreak in da house ⬢ Questions
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved What is Cloudbreak?
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved Cloudbreak is a tool for provisioning Hadoop clusters on any cloud infrastructure Simplified Cluster Provisioning - prescriptive setup, simple automation
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved Deploy on Public or Private Clouds Dynamically configure and manage clusters on public or private clouds (Amazon Web Services, Microsoft Azure, Google Cloud Platform and OpenStack) Automated Scaling Seamlessly manage elasticity requirements as cluster workloads change (Ambari Metrics / Prometheus) Secured Cluster Access Supports configuration defining network boundaries and configuring security groups Highly Extensible Recipes to run custom commands Custom images
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved ⬢ Cloudbreak Deployer (CBD) – Written in Go and Bash – Compiled into single binary ⬢ Micro-service architecture – Each service runs in a Docker container – Each container is replaceable with custom ones – Services are handled with docker-compose Single node deployment
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved Enterprise checklist for big data in cloud
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved ✓ Control and Automation ✓ Cloudy Services ✓ Security ✓ Enterprise-Grade Support Checklist for enterprises in the cloud
  • 10. 1 0 © Hortonworks Inc. 2011–2018. All rights reserved ✓ Control and Automation ✓ Cloudy Services ✓ Security ✓ Enterprise-Grade Support Checklist for enterprises in the cloud ✓ Simple UX ✓ Powerful CLI ✓ Autoscaling
  • 11. 1 1 © Hortonworks Inc. 2011–2018. All rights reserved Simplified UX
  • 12. 1 2 © Hortonworks Inc. 2011–2018. All rights reserved Create Credential Experience
  • 13. 1 3 © Hortonworks Inc. 2011–2018. All rights reserved Built-In Blueprints
  • 14. 1 4 © Hortonworks Inc. 2011–2018. All rights reserved Basic and Advanced Cluster Creation Experiences BASIC ADVANCED
  • 15. 1 5 © Hortonworks Inc. 2011–2018. All rights reserved New Network and Security Group Choices ⬢ Network – Create new Network and new Subnet – Choose existing Network and existing Subnet ⬢ Security Groups – Create new SGs • Choose default SGs (minimal set of ports) • Create customized – Choose existing SGs
  • 16. 1 6 © Hortonworks Inc. 2011–2018. All rights reserved Powerful CLI
  • 17. 1 7 © Hortonworks Inc. 2011–2018. All rights reserved Cloudbreak CLI: Designed for DevOps
  • 18. 1 8 © Hortonworks Inc. 2011–2018. All rights reserved “Show cli command” for every request
  • 19. 1 9 © Hortonworks Inc. 2011–2018. All rights reserved Auto-scaling
  • 20. 2 0 © Hortonworks Inc. 2011–2018. All rights reserved Auto-Scaling ⬢ Alerts: Create metric or time-based alerts for cluster scaling ⬢ Policies: Scaling policies adjust cluster size based on activity and workload alerts ⬢ General Configurations: Boundaries and cooldown period
  • 21. 2 1 © Hortonworks Inc. 2011–2018. All rights reserved Auto-Scaling Time-Based Alert Fire at 10:15 am everyday
  • 22. 2 2 © Hortonworks Inc. 2011–2018. All rights reserved Auto-Scaling Metric-Based Alert Fire after NodeManagers are in CRITICAL state for 10 minutes
  • 23. 2 3 © Hortonworks Inc. 2011–2018. All rights reserved Auto-Scaling Policies ⬢ Define the Scale Adjustment (Node Count/Percentage/Exact size) ⬢ Select the HostGroup (to Scale) ⬢ Select Alert (which when fired, executes the Policy)
  • 24. 2 4 © Hortonworks Inc. 2011–2018. All rights reserved Auto-Scaling General Configurations ⬢ Cooldown Period (between scaling actions) ⬢ Minimum and Maximum Cluster size (boundaries) Cluster size boundaries Time Interval between two Autoscale events
  • 25. 2 5 © Hortonworks Inc. 2011–2018. All rights reserved ✓ Control and Automation ✓ Cloudy Services ✓ Security ✓ Enterprise-Grade Support Checklist for enterprises in the cloud ✓ Cloud Resources ✓ Hortonworks DataFlow ✓ Custom Images
  • 26. 2 6 © Hortonworks Inc. 2011–2018. All rights reserved Cloud Resources: RDBMS + LDAP
  • 27. 2 7 © Hortonworks Inc. 2011–2018. All rights reserved Cloud Resources: RDBMS and LDAP/AD = Dynamic Blueprints ⬢ Background: – Cluster configuration often includes external database (for Hive, Ranger, etc) and LDAP/AD configs – It’s a challenge to know the different Blueprint configuration choices per service across the stack ⬢ Dynamic Blueprints: – Ability to manage External Sources (e.g. RDBMS and LDAP/AD) outside of your Blueprint – Cloudbreak will inject the configurations into your Blueprint – Simplifies reuse of external cloud resources – Simplifies your Blueprints -> don’t have to know all the configurations for each component
  • 28. 2 8 © Hortonworks Inc. 2011–2018. All rights reserved Dynamic Blueprints: RDBMS/LDAP ⬢ Built-In Components: – Atlas, Ranger, Hadoop, Hive LLAP, Hive, Ambari, Oozie, Druid, SuperSet JDBC/LDAP properties in Blueprint for the Component? Yes Use Blueprint as-is, no Component configuration property injection No Inject Component configuration properties Perform property variable replacement S E
  • 29. 2 9 © Hortonworks Inc. 2011–2018. All rights reserved At-Motion Workloads: Hortonworks DataFlow
  • 30. 3 0 © Hortonworks Inc. 2011–2018. All rights reserved Hortonworks DataFlow in CloudBreak ⬢ Default blueprint: “Flow Management: Apache NiFi” HDF 3.1: NiFi, Ambari, Ambari Metrics, ZooKeeper
  • 31. 3 1 © Hortonworks Inc. 2011–2018. All rights reserved HDF - cluster creation
  • 32. 3 2 © Hortonworks Inc. 2011–2018. All rights reserved HDF - cluster creation
  • 33. 3 3 © Hortonworks Inc. 2011–2018. All rights reserved Custom Images
  • 34. 3 4 © Hortonworks Inc. 2011–2018. All rights reserved Background: Cloudbreak 1. Cloudbreak creates VM instances using a default base images. 2. Cloudbreak installs Ambari on a VM instance. 3. Cloudbreak instructs Ambari to install an HDP Cluster on other VM instances. Cloudbreak RHEL 7 HDP Node VM HDP Node VM HDP Node VM HDP Node VM HDP Node VM HDP Node VM HDP Cluster
  • 35. 3 5 © Hortonworks Inc. 2011–2018. All rights reserved Background: Cloudbreak Default Images ⬢ By default, Cloudbreak uses default base public images when creating VM instances. Cloud Standard Image Operating System AWS Amazon Linux 2017 Azure CentOS 7.x Google Cloud Platform CentOS 7.x OpenStack CentOS 7.x Support for Custom Images provides a way for Cloudbreak users to leverage their own custom image (not the default image) when creating VM instances.
  • 36. 3 6 © Hortonworks Inc. 2011–2018. All rights reserved Making a Custom Image: Overview Create the Custom Image Register the Custom Image in Cloudbreak Use the Custom Image when Creating a Cluster 1 2 3
  • 37. Creating the Image: Code Repository ⬢ Instructions, Packer scripts and Salt states are in a public GitHub repository – https://github.com/hortonworks/cloudbreak-images ⬢ An understanding of Packer and Salt is useful – Packer creates the infrastructure – Packer runs the Salt provisioner ⬢ Customers should clone the repository and build on it
  • 38. Creating the Image: Example Scenarios ⬢ Scenario: for AWS, I don’t want Amazon Linux and instead want RHEL 7 – Approach: 1. Set up the repository and AWS environment 2. Use the repository tools to build a RHEL 7 image: make build-aws-rhel7 ⬢ Scenario: I don’t want OpenJDK and instead want Oracle JDK – Approach: 1. Set up the repository and environment 2. Turn on the optional Oracle state 3. Use the repository tools to build an image ⬢ Scenario (advanced): for AWS, I don’t want Amazon Linux and instead want MY RHEL 7 – Approach: 1. Set up the repository and AWS environment 2. Change the source base image 3. Use the repository tools to build a RHEL 7 image: make build-aws-rhel7
  • 39. Use the Custom Image: Create Cluster (UI) ⬢ Create Cluster > General Configuration > Advanced – Choose the image catalog – Choose the image you registered – Adjust the Ambari + HDP repos (if you want)
  • 40. Pre-Warmed Images ⬢ Prewarmed images (OS + pre-installed Ambari and HDP) – Pros: cluster installs are faster; no internet connection is needed – Cons: cannot change the Ambari or HDP versions; cannot use local repositories ⬢ Base images (OS only) – Pros: can change the Ambari or HDP versions, or use local repositories – Cons: cluster installs take longer
  • 41. Checklist for enterprises in the cloud ✓ Control and Automation ✓ Cloudy Services ✓ Security: – Kerberos support – LDAP integration – Proxy configuration ✓ Enterprise-Grade Support
  • 42. Cluster Security: Kerberos
  • 43. What is Kerberos? ⬢ Strongly authenticating and establishing a user’s identity is the basis for secure access in Hadoop. Users need to be able to reliably “identify” themselves and then have that identity propagated throughout the Hadoop cluster. ⬢ Kerberos is the de facto system for authenticating access to distributed services.
  • 44. Cloudbreak: Support for Enabling Kerberos ⬢ Goal: provide a way for Cloudbreak users to create Kerberos-enabled clusters ⬢ Approach: – Ambari exposes a lot of Kerberos options – Leverage the Ambari Kerberos options and avoid re-creating the Ambari Kerberos experience – Add pragmatic, prescriptive options on top
  • 45. Cloudbreak: Enable Kerberos Security ⬢ Create Cluster > Security > Advanced ⬢ [ ] Enable Kerberos Security
  • 46. Options: Use Existing KDC or Use Test KDC ⬢ Use Test KDC – Not for production use. For testing and evaluation purposes only. – Installs and configures an MIT KDC on the master node. – Configures the cluster to leverage that KDC. ⬢ Use Existing KDC (Basic) – Provide basic information about your existing KDC. – Ambari Kerberos descriptors are generated automatically. ⬢ Use Existing KDC (Advanced) – Provide basic information about your existing KDC. – Provide your own Ambari Kerberos descriptors.
  • 47. Cloudbreak + LDAP/AD
  • 48. Cloudbreak User AuthN ⬢ Goal: configure Cloudbreak to authenticate users against an external LDAP/AD – CloudFoundry UAA (User Account and Authentication Server) is the foundation: https://github.com/cloudfoundry/uaa ⬢ Two parts: 1. Configure Cloudbreak to talk to the external LDAP/AD 2. Configure which group(s) can access Cloudbreak
  • 49. Step 1: Configure Cloudbreak to talk to LDAP/AD ⬢ On the Cloudbreak host, create: /var/lib/cloudbreak-deployment/uaa-changes.yml ⬢ Define the LDAP profile for users and groups [Diagram: Cloudbreak connected to LDAP/AD]
  • 50. Step 2: Configure which group(s) can access Cloudbreak ⬢ Configure which group(s) are authorized to access Cloudbreak: – cbd util execute-ldap-mapping [group] – cbd util delete-ldap-mapping [group] ⬢ To authorize users in the “Analysts” group to access Cloudbreak: – cbd util execute-ldap-mapping cn=Analysts,ou=Groups,dc=hortonworks,dc=local
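When several groups need to be mapped, the two commands on this slide can be generated rather than typed by hand. The sketch below only builds the command strings shown on the slide; the helper name `ldap_mapping_command` is hypothetical, and it does not run `cbd` itself. Quoting is handled with the standard library so that a group DN containing spaces stays a single argument.

```python
import shlex


def ldap_mapping_command(group_dn, delete=False):
    """Build the cbd command line that adds (or removes) an LDAP group mapping.

    Only the two subcommands shown on the slide are used.
    """
    action = "delete-ldap-mapping" if delete else "execute-ldap-mapping"
    # shlex.join quotes any argument that the shell would otherwise split.
    return shlex.join(["cbd", "util", action, group_dn])


print(ldap_mapping_command("cn=Analysts,ou=Groups,dc=hortonworks,dc=local"))
print(ldap_mapping_command("cn=Data Analysts,ou=Groups,dc=hortonworks,dc=local", delete=True))
```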
  • 51. Proxy configuration
  • 52. Limited Outbound Internet Access ⬢ Handle enterprise scenarios where: – Outbound internet access is limited (or restricted), and/or – A proxy must be used to obtain internet access ⬢ Outbound destinations, reached through an optional HTTP/S proxy: – Cloudbreak: Docker Hub, Cloudbreak dependencies, default image catalog – Cloudbreak and cluster hosts: cloud provider APIs, HDP or HDF platform repositories
  • 53. Internet Access via Proxy ⬢ Cloudbreak proxy setup: how does Cloudbreak communicate through a proxy to get to the internet (and to the cluster hosts)? ⬢ Clusters proxy setup: how do the cluster hosts communicate through a proxy to get to the internet?
  • 54. Cloudbreak: Proxy Setup ⬢ Set up the Docker environment to use the proxy – Modify the Docker service to set HTTP_PROXY and HTTPS_PROXY (and NO_PROXY): https://docs.docker.com/config/daemon/systemd/#httphttps-proxy ⬢ Set up Cloudbreak to use the proxy in the Profile: HTTP_PROXY_HOST=your-proxy-host HTTPS_PROXY_HOST=your-proxy-host PROXY_PORT=your-proxy-port PROXY_USER=your-proxy-user PROXY_PASSWORD=your-proxy-password ⬢ Advanced Profile option “HTTPS_PROXYFORCLUSTERCONNECTION=true|false” – Defaults to “false”
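As a rough illustration of the Profile settings on this slide, the sketch below renders them from a few inputs. The function name `render_proxy_profile` and its parameters are hypothetical; the variable names are the ones listed on the slide, and the lines are emitted in the same KEY=value form the slide uses. It assumes, as a simplification, that the user and password are only written when the proxy needs authentication.

```python
def render_proxy_profile(host, port, user=None, password=None, proxy_for_cluster=False):
    """Return Profile lines for the proxy settings named on the slide."""
    lines = [
        f"HTTP_PROXY_HOST={host}",
        f"HTTPS_PROXY_HOST={host}",
        f"PROXY_PORT={port}",
    ]
    # Assumption: credentials are only needed for an authenticating proxy.
    if user is not None:
        lines.append(f"PROXY_USER={user}")
        lines.append(f"PROXY_PASSWORD={password}")
    # The slide states that this advanced option defaults to "false".
    flag = "true" if proxy_for_cluster else "false"
    lines.append(f"HTTPS_PROXYFORCLUSTERCONNECTION={flag}")
    return "\n".join(lines)


print(render_proxy_profile("your-proxy-host", 3128, user="your-proxy-user",
                           password="your-proxy-password"))
```

The port 3128 in the usage line is only a placeholder value for the example.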
  • 55. Cloudbreak: Advanced Proxy Scenarios ⬢ Scenario #1: proxy for internet, not clusters ⬢ Scenario #2: proxy for internet and clusters
  • 56. Clusters: Register Proxy Configuration ⬢ External Sources > Proxy Configurations – (optional) if proxy requires authentication
  • 57. Clusters: Configure Proxy for Cluster Hosts ⬢ Create Cluster > Advanced > External Sources > Configure Proxy – Configures yum with “proxy” settings – Configures Ambari Server with “httpProxy” settings
  • 58. Checklist for enterprises in the cloud ✓ Control and Automation ✓ Cloudy Services ✓ Security ✓ Enterprise-Grade Support: – SmartSense integration – Flex support
  • 59. Cloudbreak in da house
  • 60. Main use cases ⬢ Have an internally hosted Cloudbreak service for: – our CI/CD pipeline – testing and prototyping HDP and HDF services – self-service clusters for QE/SE/PS teams
  • 61. Our technical goals ⬢ Run Cloudbreak in HA (High Availability) mode – Ability to recover flows in case of node failure – Avoid master-slave design / leader election problems ⬢ Scale Cloudbreak as we desire – Distribute each cluster-related flow – Two flows cannot run for the same cluster at the same time (e.g. two upscale flows) – Flow cancellation must be handled ⬢ Scale the Web UI – Had to introduce a Redis cluster for the session store ⬢ Scale every other service as well ⬢ Find a tool that makes it easy to deploy these services to multiple nodes
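The two flow constraints above (at most one flow per cluster, and cancellable flows) can be illustrated with a small in-process registry. This is only a sketch of the idea, not how Cloudbreak implements it: the class and method names are hypothetical, and a real HA deployment spanning several nodes would need shared state rather than an in-memory dictionary.

```python
import threading


class FlowRegistry:
    """Tracks at most one running flow per cluster and supports cancellation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._active = {}  # cluster_id -> (flow_name, cancel_event)

    def start(self, cluster_id, flow_name):
        """Register a flow; return its cancel event, or None if the cluster is busy."""
        with self._lock:
            if cluster_id in self._active:
                return None
            cancel_event = threading.Event()
            self._active[cluster_id] = (flow_name, cancel_event)
            return cancel_event

    def cancel(self, cluster_id):
        """Signal the running flow to stop; return False if nothing is running."""
        with self._lock:
            entry = self._active.get(cluster_id)
            if entry is None:
                return False
            entry[1].set()
            return True

    def finish(self, cluster_id):
        """Release the cluster so that a new flow can be started."""
        with self._lock:
            self._active.pop(cluster_id, None)


registry = FlowRegistry()
first = registry.start("cluster-1", "upscale")
second = registry.start("cluster-1", "upscale")  # rejected: a flow is already running
print(first is not None, second is None)
```

A worker running the flow would check `cancel_event.is_set()` between steps and call `finish` when it stops, whether it completed or was cancelled.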
  • 62. Why Kubernetes? ⬢ Not because it’s fancy ⬢ Evaluated Kubernetes, Swarm, Mesos, Rancher ⬢ Open source / active community with hands-on experience ⬢ Many cloud providers already support it ⬢ Lots of tooling behind it: API / CLI / Helm / Ansible / Salt ⬢ Integration with most of the cloud providers – Provision a load balancer (GCP, AWS, Azure) – Use object stores to share data (Ceph, S3, GCP bucket, Azure Storage Account) – Dynamic volume provisioning / persistent disk (EBS, Azure Blob)
  • 63. Thank you