Janos Matyas / CTO / SequenceIQ Inc.
OVERVIEW
 GOAL / MOTIVATION
 TECHNOLOGY STACK
 PROBLEM RESOLUTION / HOW IT WORKS
 RESULTS / ACHIEVEMENTS
GOAL / MOTIVATION
 Ease Hadoop provisioning – everywhere
 Automate and unify the process
 Arbitrary cluster size
 Same process through a cluster lifecycle (Dev, QA, UAT, Prod)
 (Auto) scaling Hadoop
 QoS
OUR APPROACH
 Use Docker
 Build cloud-specific ‘Dockerized’ images
 Provision the cluster
 Use Ambari
DOCKER
 Lightweight, portable
 Build once, run anywhere
 VM – without the overhead of a VM
 Isolated containers
 Automated and scripted
DOCKER – CONTAINERS vs. VMs
 Containers are isolated, but share the OS and, where appropriate, bins/libraries
APACHE AMBARI – ARCHITECTURE
 Easy Hadoop cluster provisioning
 Management and monitoring
 Key features – blueprints
 REST API
APACHE AMBARI – CREATE CLUSTER
 Define a blueprint (POST /api/v1/blueprints)
 Create cluster (POST /api/v1/clusters/mycluster)
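The two calls above can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: the server address, blueprint name, host group name and FQDN are all hypothetical, authentication is left out, and the requests are only built, never sent. Ambari's blueprint documentation registers a blueprint under its name (appended to the blueprints path) and expects an `X-Requested-By` header; check both against your Ambari version.

```python
import json
import urllib.request

AMBARI = "http://ambari.example.com:8080"  # hypothetical server address


def ambari_post(path, payload):
    """Build (but do not send) a POST request for the Ambari REST API."""
    return urllib.request.Request(
        AMBARI + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Requested-By": "ambari"},
        method="POST",
    )


# Blueprint = stack definition + component layout (host groups).
# All names and versions below are illustrative assumptions.
blueprint = {
    "Blueprints": {
        "blueprint_name": "single-node",
        "stack_name": "HDP",
        "stack_version": "2.1",
    },
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
        }
    ],
}

# The cluster request maps each host group of the blueprint to concrete hosts.
cluster = {
    "blueprint": "single-node",
    "host_groups": [{"name": "master", "hosts": [{"fqdn": "node0.example.com"}]}],
}

requests_to_send = [
    ambari_post("/api/v1/blueprints/single-node", blueprint),
    ambari_post("/api/v1/clusters/mycluster", cluster),
]

if __name__ == "__main__":
    for req in requests_to_send:
        print(req.get_method(), req.full_url)
```

Sending each request with `urllib.request.urlopen` (plus credentials) would register the blueprint first and then create the cluster from it; the order matters because the cluster request refers to the blueprint by name.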
HADOOP PROVISIONING ISSUES
 Each cloud provider has a proprietary API
 Create images for each provider
 Network configuration
 Service discovery
 Resize, failover, member join support
OUR APPROACH – DETAILS
 Build your Docker image
 Install or pre-install Hadoop services with Ambari
 Install Serf and dnsmasq
 Build your cloud image
 Use Ansible to create an image
 Provision the cluster
BUILD DOCKER IMAGES
 Create the Dockerfile
 Have Docker.io build the image
 Optionally pre-install services
 Use Ambari
 Push image to Docker.io
 Licensing questions
BUILD CLOUD IMAGES
 Use a Docker ready base image
 Use Ansible to provision the image template
 Pull the Docker images
 Apply custom infrastructure
 Use cloud provider specific playbooks
 AWS EC2
 Azure
ANSIBLE
 Configuration as data
 Simplest way to automate IT
 Secure and agentless
 Goal oriented
 One playbook – multiple modules
 We use it to “burn” cloud images/templates
PROVISIONING – ISSUES
 FQDN
   /etc/hosts is read-only in Docker
   Everybody needs to know everybody
 DNS
   Single point of failure
   Dynamic cluster – nodes joining, leaving, failing
 Routing
   Cloud – ability to route between containers on different hosts
   Collision-free private IP range for the Docker bridge
PROVISIONING – SOLUTION
 FQDN
   Use the -h and --dns Docker params
 DNS
   dnsmasq runs in each Docker container
   Serf member-xxx events trigger dnsmasq reconfiguration
 Routing
   Docker bridge configuration – follows a convention
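A small Python sketch of the FQDN and routing parts of this solution. The `-h` and `--dns` flags are real `docker run` options; everything else is an assumption made for illustration: the image name, the domain, pointing `--dns` at the container's own dnsmasq on 127.0.0.1, and the rule that host number i takes the i-th /24 of a base range for its Docker bridge (the slides only say that the bridge configuration follows a convention).

```python
import ipaddress


def bridge_subnet(host_index, base="172.18.0.0/16"):
    """Pick a collision-free /24 for a host's Docker bridge.

    Hypothetical convention: host i gets the i-th /24 inside `base`.
    """
    subnets = list(ipaddress.ip_network(base).subnets(new_prefix=24))
    return subnets[host_index]


def docker_run_args(image, fqdn, dns_ip="127.0.0.1"):
    """Build a `docker run` command that sets the hostname and the DNS server."""
    return ["docker", "run", "-d", "-h", fqdn, "--dns", dns_ip, image]


if __name__ == "__main__":
    print(bridge_subnet(3))
    print(" ".join(docker_run_args("example/ambari", "node3.cluster.local")))
```

Because every host derives its bridge range from its own index, no coordination between hosts is needed to keep the ranges from colliding.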
SERF
 Gossip based membership
 Service discovery
 Decentralized
 Lightweight, fault tolerant
 Highly available
 DevOps friendly
 Keep an eye on Consul, Open vSwitch, pipework
SERF – DECENTRALIZED SERVICE DISCOVERY
 Gossip instead of heartbeat
 LAN, WAN profiles
 Provides membership information
 Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user
 Query
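To make the event handler idea concrete, here is a hedged Python sketch of the bookkeeping such a handler might do. Serf runs a handler with the event type in the `SERF_EVENT` environment variable and, for membership events, writes one member per line to the handler's stdin (name, address, role, tags). The function names and the choice to track a simple name-to-address map are assumptions for illustration, not the presenters' implementation.

```python
def parse_members(lines):
    """Parse member lines as Serf passes them to a handler on stdin.

    Each line holds whitespace-separated fields: name, address, role, tags.
    Only the first two fields are used here.
    """
    members = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            members[fields[0]] = fields[1]
    return members


def apply_event(event, lines, hosts):
    """Return a new name -> address map after applying one membership event."""
    updated = dict(hosts)
    members = parse_members(lines)
    if event in ("member-join", "member-update"):
        updated.update(members)
    elif event in ("member-leave", "member-failed", "member-reap"):
        for name in members:
            updated.pop(name, None)
    # Other events (user, query) leave the membership map untouched.
    return updated


if __name__ == "__main__":
    hosts = {}
    hosts = apply_event("member-join", ["node1\t172.18.1.2\t\trole=agent"], hosts)
    hosts = apply_event("member-join", ["node2\t172.18.2.2\t\trole=agent"], hosts)
    hosts = apply_event("member-failed", ["node2\t172.18.2.2\t\trole=agent"], hosts)
    print(hosts)
```

A real handler would call `apply_event(os.environ["SERF_EVENT"], sys.stdin, current_hosts)` and then persist the result somewhere the local DNS server can read it.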
SERF – GOSSIPING
SERF – MEMBERSHIP, EVENT HANDLERS
DNSMASQ
 Network infrastructure for small networks
 Lightweight DNS, DHCP server
 Comes with most Linux distributions
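One way to feed cluster membership into dnsmasq is an extra hosts file: dnsmasq can load additional hosts files through its `addn-hosts` option and re-reads them when it receives SIGHUP. The sketch below only renders the file content; the domain name and the sample addresses are hypothetical, and writing the file and signalling dnsmasq are left out.

```python
def render_hosts(members, domain="cluster.local"):
    """Render name -> address pairs as hosts-file lines.

    Each line has the form: address, fully qualified name, short name.
    """
    lines = []
    for name, address in sorted(members.items()):
        lines.append(f"{address} {name}.{domain} {name}\n")
    return "".join(lines)


if __name__ == "__main__":
    members = {"node2": "172.18.2.2", "node1": "172.18.1.2"}
    print(render_hosts(members), end="")
```

Sorting the entries keeps the file stable between runs, so the file only changes when membership actually changes.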
AWS EC2 – HADOOP CLUSTER
 Use EC2 REST API to provision instances (from Dockerized image)
 Start Docker containers
 One Ambari server
 N-1 Ambari agents connecting to server
 Connect ambari-shell to
 Define blueprint
 Provision the cluster
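The "one server, N-1 agents" layout can be expressed as a tiny planning function. This is an illustrative sketch only: choosing the first instance as the Ambari server is an assumption, and the role names are made up for the example.

```python
def plan_roles(node_count):
    """Assign one Ambari server and N-1 Ambari agents to N instances.

    Assumption: the instance at index 0 hosts the Ambari server.
    """
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    return ["ambari-server"] + ["ambari-agent"] * (node_count - 1)


if __name__ == "__main__":
    for index, role in enumerate(plan_roles(4)):
        print(index, role)
```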
AWS EC2 – NETWORK SECURITY
 Create a VPC
 Configure subnets
 Routing tables
 Security gateway
 Set ACL
 Configure VPN
AWS EC2 - CLOUDFORMATION
 Setting up a VPC manually is too complicated
 Use CloudFormation
 Manage the stack together
 Template-based
 Environments under version control
 Customizable at runtime
 No extra charge
"Parameters" : {
  "VpcId" : {
    "Type" : "String",
    "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
  },
  "SubnetId" : {
    "Type" : "String",
    "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
  },
  "SecondaryIPAddressCount" : {
    "Type" : "Number",
    "Default" : "1",
    "MinValue" : "1",
    "MaxValue" : "5",
    "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
    "ConstraintDescription": "must be a number from 1 to 5."
  },
  "SSHLocation" : {
    "Description" : "The IP address range that can be used to SSH to the EC2 instances",
    "Type": "String",
    "MinLength": "9",
    "MaxLength": "18",
    "Default": "0.0.0.0/0",
    "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
    "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x."
  }
},
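The parameter constraints in this template can be checked locally before a stack is launched. The Python sketch below mirrors the `SSHLocation` length limits and pattern and the `SecondaryIPAddressCount` range; it assumes the pattern must match the whole value, and the function names are made up for the example. Note that the pattern only checks the shape of the CIDR string, so a value such as 999.999.999.999/99 would still pass.

```python
import re

# Same shape as the template's AllowedPattern for SSHLocation.
SSH_LOCATION = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})")


def valid_ssh_location(value):
    """Check MinLength 9, MaxLength 18 and the CIDR-shaped pattern."""
    return 9 <= len(value) <= 18 and SSH_LOCATION.fullmatch(value) is not None


def valid_secondary_ip_count(value):
    """Check MinValue 1 and MaxValue 5."""
    return 1 <= value <= 5


if __name__ == "__main__":
    for candidate in ("0.0.0.0/0", "10.0.0.0/16", "10.0.0.0"):
        print(candidate, valid_ssh_location(candidate))
```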
CLOUDBREAK
Cloudbreak is a powerful left surf break over a coral reef, a mile southwest of the island of Tavarua, Fiji.
Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts the provisioning and eases the management and monitoring of on-demand clusters.
Provisioning Hadoop has never been easier
CLOUDBREAK
 Benefits
 Elastic
 Scalable
 Blueprints
 Flexible
 Main REST resources
 /template – specify a cluster infrastructure
 /stack – creates a cloud infrastructure built from a template
 /blueprint – describes a Hadoop cluster
 /cluster – creates a Hadoop cluster
RESULTS AND ACHIEVEMENTS
 Hadoop as a Service API
 Available for EC2 and Azure cloud
 OpenStack and bare metal are coming soon
 Open source under Apache 2 licence
 Same goals as Apache Ambari Launchpad project
 What's next?
HADOOP SERVICES - AS A SERVICE
 Leverage YARN
 Slider (Hoya) providers
 HBase, Accumulo
 SequenceIQ providers - Flume, Tomcat
 YARN-1964
 QoS for YARN – heuristic scheduler
 Platform as a Service API
BANZAI PIPELINE
Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore.
Banzai Pipeline is a RESTful application development platform for building on-demand data and job pipelines running on Hadoop YARN.
Banzai Pipeline is a big data API for the REST
THANK YOU
 Get the code: https://github.com/sequenceiq
 Read about: http://blog.sequenceiq.com
 Facebook: http://facebook.com/sequenceiq
 Twitter: http://twitter.com/sequenceiq
 LinkedIn: http://linkedin.com/sequenceiq
 Contact: janos.matyas@sequenceiq.com
FEEL FREE TO CONTRIBUTE

Docker based Hadoop provisioning - Hadoop Summit 2014


Editor's Notes

  • #2: Thanks for coming – today I will talk about Docker based Hadoop provisioning. Quick introduction of who we are – a young startup from Budapest, Hungary. Janos Matyas – CTO, open source contributor, Hadoop YARN evangelist.
  • #4: Why did we start this at all, when there are so many options? We repeated the same steps over and over, and scripted them. Still, we felt that something was missing. See bullet points.
  • #5: Been through many different approaches. Bare metal, cloud VM, so on – ended up using Docker. Tested many provisioning frameworks – Ambari is the one.
  • #6: Quick question – how many of you have used Docker before? Docker is a container based virtualization framework. Unlike traditional virtualization, Docker is fast, lightweight and easy to use. Docker allows you to create containers holding all the dependencies for an application. Each container is kept isolated from any other, and nothing gets shared.
  • #7: I can run 5-6 containers – less overhead than 1 virtualbox. No SOCKS proxy, etc.
  • #8: The ‘provisioning’ framework. No need to enter details, there were pretty good sessions about Ambari. Blueprints 1.5.1 tech preview, 1.6 fully supported. Blueprint = stack definition + component layout. REST API – we have created, open sourced Ambari client + shell (come and join the Ambari Meetup today at 3:30)
  • #10: Now, the issues. You have to do it again and again – for each cloud provider. Create the image – but how do you know what the requirements are, and do you build an image each and every time? Network – this is a big issue. EC2 has its API, Azure has its own. OpenStack has a network-as-a-service component – Neutron. SDN – software-defined networking! Everything is dynamic – how do you do service discovery? Extra features – a fully dynamic Hadoop cluster.
  • #11: Will expand on these shortly. Sounds too easy – let's get into details.
  • #12: A Docker image is described by a Dockerfile – like a Vagrantfile for VirtualBox, for example. You want a trusted build – use Docker.io. Faster provisioning – a 100+ node Hadoop cluster in less than 5 minutes? Come and join the Ambari meetup. Licensing – Ganglia or Nagios (BSD and GPL). Hortonworks Hadoop – Apache 2. Bigtop is coming…
  • #13: Amazon Linux – Red Hat based – has recently become Docker ready. The OpenStack Nova hypervisor supports Docker. Apply the network and other infrastructure-related settings. Remember the licensing – use our Ansible script to build your cloud image, or modify it.
  • #14: IT automation war – Ansible vs. Chef, Puppet. Ansible configurations are simple data descriptions of your infrastructure (both human-readable and machine-parsable). Needs only SSH.
  • #15: Dev environment: use the default Docker bridge (easy). Everything talks to everything else. DNS – heavy management overhead.
  • #16: -h for the hostname, --dns to specify the DNS service to use. Convention: AMI launch index.
  • #17: Serf is a decentralized solution for cluster membership, failure detection and orchestration. Serf vs. ZooKeeper, etcd, doozerd: the latter three have server nodes that require a quorum of nodes to operate – strong consistency. Serf – eventual consistency. The most important thing is that it is gossip based – will expand shortly. Decentralized – all nodes are equal.
  • #18: Events – fire and forget. Queries – wait for an answer, with limited response collection. Custom event handlers. Tags – e.g. Ambari server, host groups, etc.
  • #20: Load increases – how does the cluster know that there is a new member?
  • #21: Runs in each Docker container – updated by Serf events.
  • #22: Amazon supports Docker natively. Start N nodes. Pass our user-data script at startup. Start the containers – they will know about each other using Serf. Use the shell, the REST API or the Ambari UI.
  • #23: You need security – it is strongly recommended to use your own VPC instead of the default VPC. Use different availability zones for maximum uptime.
  • #24: Anyone who has set up a VPC knows – it can be scripted. It is harder to decommission / change / delete than to add components. Use CloudFormation.
  • #25: This is a very easy but still error-prone process – though it helps a lot. We built an API on top, and automated the whole process. We are not a service provider – this is an API.
  • #26: Elastic – arbitrary number of nodes. Scalable – follow your workload change. Blueprints – supports different cluster blueprints. Flexible – use your favorite cloud, bring your own Hadoop – one common API.
  • #27: One API – any size, anywhere. Why we needed Cloudbreak – this is not the end of the story.
  • #28: We wanted to have a Platform as a Service API. We are YARN evangelists – wanted to run everything on YARN. Community driven. Heuristic scheduler.
  • #29: A fully dynamic big data pipeline. Build your pipeline, run it dynamically / on demand. All pre-coded, zero coding, only configuration. Data pipeline – run services on demand, short or long term. Start when needed, stop when idle. Apply ETL on demand. Job pipeline – all major ML libraries are supported (Mahout, MLlib), and 44 other MR jobs (correlations, joins, summarizations, filtering, sort, sharding, shuffle). Streaming pipeline – Spark based. Custom SDK – abstracts the complexity behind MR and Spark.
  • #30: Subscribe to the Beta test. Contribute. We have contributed to several Apache and other open source projects. It is Babylon at SequenceIQ: Java and Scala are the default, Groovy is used very often, then Go – Docker + Serf – we had to learn Go to fix things. Ansible for IT. We strongly suggest using Docker – we use it everywhere: CI/CD, cloud. For a demo come and join the Ambari meetup. Thanks for coming. Q&A. Join me afterwards or follow us through one of the social media channels listed.