Operational Best Practices in the Cloud



        October 27, 2011
     Watch the video of this webinar



Your Panel Today
Presenting
• Rafael H. Saavedra, VP Engineering, RightScale
• Josep Blanquer, Sr. Systems Architect, RightScale

Q&A
• David Manriquez, Account Manager, RightScale




Please use the “Questions” window to ask questions any time!


                                                 Cloud Management Platform



Agenda
•   RightScale architecture
•   The release cycle
•   Monitoring, alerts and escalations
•   When servers fail
•   Our best practices

Today’s material will discuss how we run RightScale in the cloud.
From this, we distill best practices that are relevant for all.





RightScale architecture



The scale of RightScale

• > 3M servers launched by RightScale

• RightScale continuously monitors > 100K servers

• Every day at RightScale:
   •   2,000 array resize actions are executed
   •   35,000 alert escalations are triggered
   •   20,000 escalation emails are sent to users
   •   9.0TB of monitoring data is exchanged with our servers
   •   1.6TB of logging data is sent to our servers







Architecture of a cloud-based SaaS app
• RightScale is a SaaS application that runs completely in the cloud
   • Databases
   • Core web app and API
   • Services such as monitoring, logging, and MultiCloud Marketplace







A quick primer on ServerTemplates
Configuring servers through bundling images:
• Custom MySQL 5.0.24 (CentOS 5.2)
• Custom MySQL 5.0.24 (CentOS 5.4)
• MySQL 5.0.36 (CentOS 5.4)
• MySQL 5.0.36 (Ubuntu 8.10)
• MySQL 5.0.36 (Ubuntu 8.10) 64bit
• Frontend Apache 1.3 (Ubuntu 8.10)
• Frontend Apache 2.0 (Ubuntu 9.10) - patched
• CMS v1.0 (CentOS 5.4)
• CMS v1.1 (CentOS 5.4)
• My ASP appserver (windows 2008)
• My ASP.net (windows 2008) - security update 1
• My ASP.net (windows 2008) - security update 8
• SharePoint v4 (windows 2003) - 32bit
• SharePoint v4 (windows 2003) - 64bit
• SharePoint v4.5 (windows 2003) - 64bit
• …

Configuring servers with ServerTemplates:
• Boot sequence: a set of configuration directives that will install and
  configure software on top of the base image
   • Install monitoring
   • Install MySQL Server
   • Configure MySQL
   • Restore last backup
   • Set up DNS and IPs
• Base image (MultiCloudImage): very few and basic
   • CentOS 5.2, CentOS 5.4
   • Ubuntu 8.10, Ubuntu 9.10
   • Win 2003, Win 2007



We use the same ServerTemplates our customers do
• RightScale uses 15-20 different ServerTemplates in production
   • We don’t build images; we use pre-built MultiCloud Images with RightLink
   • We make heavy use of RightScale-provided toolboxes (EBS, DNS, LB)
• Off-the-shelf: 1 template (MySQL)
• Customized: App servers and load balancers
   • Written with RightScripts in Ruby, Bash, etc.
   • Mostly Rails apps to run our core services: front-end, API, Marketplace, etc.
• From MultiCloud Image: Messaging and databases
   • RabbitMQ, Cassandra







Deployments group RightScale services







Best practices: Architecture
• ServerTemplates can be used off the shelf or customized
   • Don’t bundle images
   • Make heavy use of MCIs instead of hardcoding base RightImages



• Deployments let you stage servers in the cloud
   • The use of inputs guarantees consistency across all servers
   • Easily test or fail over
   • Macros/API automation can quickly stand up entire deployments
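
The point about inputs can be illustrated with a small sketch. It assumes that values set once at the deployment level are inherited by every server unless a server explicitly overrides them. The `Deployment` and `Server` classes and the input names are hypothetical and are not the RightScale API.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """A server whose own inputs may override deployment-level inputs."""
    name: str
    overrides: dict = field(default_factory=dict)


@dataclass
class Deployment:
    """A group of servers sharing one set of deployment-level inputs."""
    name: str
    inputs: dict
    servers: list = field(default_factory=list)

    def resolve_inputs(self, server: Server) -> dict:
        # Deployment-level values apply to every server; a server-level
        # override wins only where one is explicitly set.
        return {**self.inputs, **server.overrides}


# Hypothetical staging deployment with two app servers.
staging = Deployment(
    name="staging",
    inputs={"DB_HOST": "db.staging.example.com", "APP_ENV": "staging"},
    servers=[Server("app1"), Server("app2", overrides={"LOG_LEVEL": "debug"})],
)

resolved = {s.name: staging.resolve_inputs(s) for s in staging.servers}
```

Because `DB_HOST` is defined once on the deployment, both servers resolve the same value, which is the consistency the slide refers to.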







   The release cycle



Challenges of the release cycle


  • Limited resources and lead time for procuring and
    provisioning equipment
  • Maintaining multiple environments from development
    through production
  • Maintaining consistency for reusability and QA
  • Distributed teams and team members






A typical release cycle flow







Our development environment
• We keep a number of different deployments
   • Each development team has its own mini-environment
   • A larger integrated staging environment
   • One production environment


• Accounts keep things organized and secure
   • We keep separate accounts for staging and production
   • One team of sys admins manages all environments







RightScale release cycle
• One set of scripts and ServerTemplates is used everywhere
   • Gate accounts for security, development vs. production, etc.
   • Less test variance between Production and Staging
   • Only difference is size of environment


• Easy to bring up development environment on demand using
  deployments and macros
   • Get it up and running on demand in less than an hour
   • Cloud is pay-by-the-hour, so it is cheap to run temporary environments







Best practices: Release cycle
• Don’t be afraid to run many environments
   •   Dynamically clone, launch and tear down environments for quick tests
   •   Configure a fixed set of environments for development, integration, staging
   •   Use different accounts to segregate users and configurations
   •   Sys admins are expensive. Cloud servers are cheap.
• Reuse ServerTemplates to keep environments consistent
   • Make use of versioning and freeze software repositories
   • Share or Publish them through the MultiCloud Marketplace
   • Create all-in-one ServerTemplates from the same RightScripts and recipes
• Avoid upgrading existing servers, fail forward instead
   • Keep old servers running so you can roll back, or do a post-mortem later on
   • For databases: Launch additional slaves. Freeze replication at upgrade point.
     Take snapshots!




Release night steps
 1) Servers with current code
 2) Servers with new code
 3) Add second slave
 4) Cut access to site
 5) Stop all access to databases
 6) Stop replication
 7) Take snapshot at cutoff
 8) Update schema
 9) Reconnect all servers
10) Open access to site

Components in the diagram: Front Ends; Main App servers (current code and new code); Databases (DB Master, DB Slave, DB Slave)
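
The ten numbered steps on this slide can be sketched as an ordered runbook that stops at the first failure. This is only an illustration: the `run_release` helper is hypothetical, the actions are placeholders that always succeed, and real steps would call your own tooling.

```python
from typing import Callable, List, Tuple

# The ten steps from the slide, in numbered order.
STEP_NAMES = [
    "Servers with current code",
    "Servers with new code",
    "Add second slave",
    "Cut access to site",
    "Stop all access to databases",
    "Stop replication",
    "Take snapshot at cutoff",
    "Update schema",
    "Reconnect all servers",
    "Open access to site",
]


def run_release(steps: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run steps in order; stop at the first failure and return completed names."""
    completed = []
    for number, (name, action) in enumerate(steps, start=1):
        print(f"{number}) {name}")
        if not action():
            print(f"Step {number} failed; stopping")
            break
        completed.append(name)
    return completed


# Placeholder actions that always succeed (assumption for the sketch).
steps = [(name, lambda: True) for name in STEP_NAMES]
done = run_release(steps)
```

Stopping before the schema update leaves the snapshot and the frozen slave available, which matches the advice to keep old servers around for rollback.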



Monitoring, alerts and escalations



Monitoring and alerts: Diagnose & optimize
• Off-the-shelf monitoring
   • OS: CPU, Disk, Memory, Network, Processes, System
   • App: Apache, IIS, MySQL, Nginx, SQL Server
   • Plus many more collectd plug-ins!


• Custom monitoring

• Cluster monitoring

• Alerts & escalations






Monitoring, alerts & escalations
• We monitor as much relevant data as possible and display it
  in insightful ways to quickly detect patterns and abnormalities
• We proactively eliminate the conditions that raise critical alerts
   • A "no broken windows" policy: no critical alert can remain unresolved.


     Graphs: API network activity; dashboard network activity







Off-the-shelf: MySQL collectd plugin







Off-the-shelf: MySQL reads graphs
• Read-random-next represents a table scan
• Read-next represents an index scan
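
A rough sketch of how these two series can be read together, assuming the graphs are driven by MySQL's cumulative `Handler_read_rnd_next` and `Handler_read_next` status counters. The sample values are made up, and the helper names are hypothetical.

```python
def read_rates(before: dict, after: dict, seconds: float) -> dict:
    """Convert two samples of cumulative handler counters into per-second rates."""
    rates = {}
    for key in ("Handler_read_rnd_next", "Handler_read_next"):
        rates[key] = (after[key] - before[key]) / seconds
    return rates


def table_scan_share(rates: dict) -> float:
    """Fraction of these row reads that came from table scans."""
    total = rates["Handler_read_rnd_next"] + rates["Handler_read_next"]
    return rates["Handler_read_rnd_next"] / total if total else 0.0


# Made-up counter samples taken 10 seconds apart.
before = {"Handler_read_rnd_next": 1_000, "Handler_read_next": 5_000}
after = {"Handler_read_rnd_next": 4_000, "Handler_read_next": 6_000}

rates = read_rates(before, after, seconds=10)
share = table_scan_share(rates)
```

A share that climbs over time suggests more queries are scanning tables instead of using an index, which is the pattern the graphs make visible.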







Custom: Whatever you want with collectd
• Any statistic you can think of can easily be added as a monitor.
• All of these are graph-able and alert-able in our dashboard!
• Many can be written in less than an hour.
   • As easy as printing a line of formatted numbers every few seconds
• support.rightscale.com is an authority on collectd

• How we do it:
   • We use Ruby to write our custom monitors
   • Cassandra: jcollectd with JMX to pull out monitoring data from JavaBeans
   • Passenger: Ruby script that parses data from Passenger command line interface
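
The "printing a line of formatted numbers" idea can be sketched as a script for collectd's exec plugin, which reads `PUTVAL` lines from the script's standard output. The slides mention Ruby; Python is used here only for illustration. The plugin name `myapp`, the metric `queue_depth`, and `read_queue_depth` are hypothetical.

```python
import os
import sys
import time


def putval_line(host: str, plugin: str, type_instance: str,
                interval: int, value: float, timestamp: int) -> str:
    """Format one gauge value in collectd's plain-text PUTVAL syntax."""
    identifier = f"{host}/{plugin}/gauge-{type_instance}"
    return f'PUTVAL "{identifier}" interval={interval} {timestamp}:{value}'


def read_queue_depth() -> float:
    """Hypothetical metric source; replace with a real measurement."""
    return 42.0


def main(iterations: int) -> None:
    # collectd's exec plugin sets these variables for the scripts it runs.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = int(float(os.environ.get("COLLECTD_INTERVAL", "10")))
    for _ in range(iterations):
        line = putval_line(host, "myapp", "queue_depth", interval,
                           read_queue_depth(), int(time.time()))
        print(line)
        sys.stdout.flush()  # do not let the value sit in an output buffer
        time.sleep(interval)
```

A script run by collectd would end with a call to `main` with a large iteration count; the call is left out here so the module can be imported without blocking.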







Custom: Cassandra monitors







Cluster: Monitor hundreds of servers
 • We leverage a monitoring data warehouse to develop heat maps & stacked graphs







Automated actions using alerts from monitors
• Create an alert for any monitor, even your custom ones
   • RightScale example: Cassandra pending reads signals overloading


• Break alerts into critical and warning
   • Critical: Wake me up! Page me!
   • Warning: Send email to team.


• Trigger many actions: email, run script, scale, relaunch, reboot,…
   • Customize to your monitor, situation, and IT processes
   • RightScale example: Run a RightScript if swap is too high
   • Integrate with 3rd party services like PagerDuty
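
The split between critical and warning alerts, each tied to its own action, can be sketched as below. The metric names, thresholds, and action functions are hypothetical stand-ins, not RightScale alert specifications.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlertRule:
    """One alert: a monitored metric, a threshold, a severity, and an action."""
    metric: str
    threshold: float
    severity: str              # "critical" or "warning"
    action: Callable[[str], str]


def page_on_call(message: str) -> str:
    return f"PAGE: {message}"     # stand-in for a paging integration


def email_team(message: str) -> str:
    return f"EMAIL: {message}"    # stand-in for sending email


def evaluate(rules: List[AlertRule], sample: dict) -> List[str]:
    """Run the action of every rule whose metric exceeds its threshold."""
    fired = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append(rule.action(
                f"{rule.severity} {rule.metric}={value} > {rule.threshold}"))
    return fired


# Hypothetical rules loosely modeled on the examples in the slide.
rules = [
    AlertRule("cassandra_pending_reads", 100, "critical", page_on_call),
    AlertRule("swap_used_percent", 50, "warning", email_team),
]
fired = evaluate(rules, {"cassandra_pending_reads": 250, "swap_used_percent": 20})
```

Keeping the action on the rule makes it easy to swap an email for a script run, a relaunch, or a call to a paging service without touching the evaluation loop.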






Best practices: Monitoring and alerts
• Monitor your critical processes off-the-shelf
    • Set monitors with scripts on your ServerTemplates
    • Use mon_process (e.g. Ruby)


• Customize to your application needs
    • Use collectd plug-ins or easily build your own
    • The monitor is graphed in the RightScale dashboard


• Plan out your critical alerts
    • Set your response plan: warnings vs. critical







   When servers fail



How to think about server failure in the cloud
• Design for failure
    •   Make sure your application remains healthy after the failure of a node
    •   Don’t use sticky sessions
    •   Distribute your application services
•   Debug ServerTemplates and not servers
•   Use alerts to reboot and/or relaunch
•   Auto-scale app server arrays
•   Use dynamic DNS and static IPs for load balancers
    •   Your app servers and databases will always know where to look







Deep dive on database failure
• Use database backups for rollbacks or disaster scenarios
   • Restore from backups in event of complete system failure
   • One-click with fully automated RightScale Database Managers


• Use database redundancy for high availability (for example, master/slave)
   •   Promote slave if master fails
   •   Possible to prime your slave database to make failover more seamless
   •   After promotion is complete, quick to launch a new slave
   •   Worry about troubleshooting when you have time
   •   One-click with fully automated RightScale Database Managers
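
The promote-then-replace flow can be sketched in a few lines. This models only the bookkeeping; the `DbNode` type and `fail_over` function are hypothetical, and a real promotion would also repoint replication and DNS.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DbNode:
    name: str
    role: str           # "master", "slave", or "retired"
    healthy: bool = True


def fail_over(nodes: List[DbNode]) -> List[DbNode]:
    """Promote a healthy slave if the master is down, then add a new slave."""
    master = next(n for n in nodes if n.role == "master")
    if master.healthy:
        return nodes
    candidates = [n for n in nodes if n.role == "slave" and n.healthy]
    if not candidates:
        raise RuntimeError("no healthy slave to promote; restore from backup")
    promoted = candidates[0]
    promoted.role = "master"
    # Keep the failed master out of rotation for later troubleshooting,
    # and launch a replacement slave behind the new master.
    master.role = "retired"
    nodes.append(DbNode(name=f"{promoted.name}-replica", role="slave"))
    return nodes


cluster = [DbNode("db1", "master", healthy=False), DbNode("db2", "slave")]
cluster = fail_over(cluster)
roles = {n.name: n.role for n in cluster}
```

The failed master is retired instead of repaired in place, so troubleshooting can wait until service is restored.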







Backups to block volumes and object stores
• Block volumes: EBS snapshots
   • + Easy to snapshot
   • + Easy to rotate
   • + Easy consistency
   • + Instant restore (mount)
   • - Difficult to move between clouds/regions
   • - Must back up entire volume
   • What we do: EBS for databases

• Object stores: S3/Cloud Files
   • + Back up into other clouds
   • + Back up individual folders or files
   • + Incremental backups (e.g. as files/data are flushed)
   • - More coding, customization
   • - Custom rotation strategy
   • - Download time
   • What we do: S3 for the monitoring system (Cassandra in the future)





Best practices: Planning for failure
• No excuse for not backing up your servers
   • RightScale Database Manager + EBS tools make it easy to take backups
• Plan your rotation policy
   • Database Manager helps you tailor daily, weekly, and monthly backups
• Back up across clouds and regions
   • Database Managers for MySQL and SQL Server make it easy to back up to S3 or
     Cloud Files from AWS, CloudStack, Eucalyptus, and Rackspace
• Organize your backups
   • Keep track with lineages and timelines using the Database Managers
• Test your backups!
   • It is easy and cheap in the cloud
   • A crisis is the worst time to find out your backups are corrupted
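
A rotation policy of the daily, weekly, and monthly kind can be sketched as a function that decides which snapshots to keep. The retention counts (7, 4, 12) are assumptions for illustration, and `snapshots_to_keep` is a hypothetical helper, not part of the Database Manager.

```python
from datetime import date, timedelta
from typing import List, Set


def snapshots_to_keep(snapshots: List[date], today: date,
                      dailies: int = 7, weeklies: int = 4,
                      monthlies: int = 12) -> Set[date]:
    """Pick which daily snapshots survive a daily/weekly/monthly rotation."""
    keep = set()
    ordered = sorted((d for d in snapshots if d <= today), reverse=True)
    # The most recent snapshots are kept as dailies.
    keep.update(ordered[:dailies])
    # Keep the newest snapshot of each recent ISO week and calendar month.
    weeks_seen, months_seen = [], []
    for snap in ordered:
        week = snap.isocalendar()[:2]
        if week not in weeks_seen and len(weeks_seen) < weeklies:
            weeks_seen.append(week)
            keep.add(snap)
        month = (snap.year, snap.month)
        if month not in months_seen and len(months_seen) < monthlies:
            months_seen.append(month)
            keep.add(snap)
    return keep


# Ninety days of daily snapshots ending on the webinar date.
today = date(2011, 10, 27)
all_snaps = [today - timedelta(days=n) for n in range(90)]
kept = snapshots_to_keep(all_snaps, today)
expired = sorted(set(all_snaps) - kept)
```

Anything in `expired` is a candidate for deletion; restoring one of the kept snapshots into a scratch environment is a cheap way to follow the "test your backups" advice.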





  Our best practices



Best practices for operating in the cloud
• Keep your environment organized and consistent
   • Accounts, deployments, ServerTemplates, and macros
• Change and debug configurations, not servers
   • ServerTemplates, MultiCloudImages, fail-forward
• Monitor your servers efficiently
   • Off-the-shelf and custom monitoring and alerts
• Automate, automate and also automate
   • Server arrays, macros/API for more complex flows, alert actions …
• Back up your databases (organize, multi-cloud, rotate, test)
   • Database Manager ServerTemplates







Getting Started and Q&A
Contact RightScale
(866) 720-0208
sales@rightscale.com
www.rightscale.com

RightScale Conference
Nov 9 in Santa Clara, CA
www.RightScale.com/Conference
• Attend technical breakout sessions
• Talk with RightScale customers
• Ask questions at the Expert Bar
• Training on 11/8 and 11/10


More Info
Webinar archive: RightScale.com/webinars
White Papers: RightScale.com/whitepapers
Free Edition: RightScale.com/Free






Operational Best Practices in the Cloud

  • 1. Operational Best Practices in the Cloud October 27, 2011 Watch the video of this webinar
  • 2. 2# Your Panel Today Presenting • Rafael H. Saavedra, VP Engineering, RightScale • Josep Blanquer, Sr. Systems Architect, RightScale Q&A • David Manriquez, Account Manager, RightScale Please use the “Questions” window to ask questions any time! Cloud Management Platform
  • 3. 3# Agenda • RightScale architecture • The release cycle • Monitoring, alerts and escalations • When servers fail • Our best practices Today’s material will discuss how we run RightScale in the cloud. From this, we distill best practices that are relevant for all. Please use the “Questions” window to ask questions any time! Cloud Management Platform
  • 4. Operational Best Practices in the Cloud RightScale architecture
  • 5. 5# The scale of RightScale • > 3M servers launched by RightScale • RightScale continuously monitors > 100K servers • Every day at RightScale: • 2,000 array resize actions are executed • 35,000 alert escalations are triggered • 20,000 escalation emails are sent to users • 9.0TB of monitoring data is exchanged with our servers • 1.6TB of logging data is sent to our servers Cloud Management Platform
  • 6. 6# Architecture of a cloud-based SaaS app • RightScale is a SaaS application that runs completely in the cloud • Databases • Core web app and API • Services such as monitoring, logging, and MultiCloud Marketplace Cloud Management Platform
  • 7. 7# A quick primer on ServerTemplates Configuring servers through bundling Images: Configuring servers with ServerTemplates: Custom MySQL 5.0.24 (CentOS 5.2) Custom MySQL 5.0.24 (CentOS 5.4) MySQL 5.0.36 (CentOS 5.4) Setup DNS and IPs MySQL 5.0.36 (Ubuntu 8.10) boot sequence MySQL 5.0.36 (Ubuntu 8.10) 64bit A set Restore last backup of configuration directives that will install and Frontend Apache 1.3 (Ubuntu 8.10) configure Configure MySQL of software on top Frontend Apache 2.0 (Ubuntu 9.10) - patched the base image CMS v1.0 (CentOS 5.4) Install MySQL Server CMS v1.1 (CentOS 5.4) Install monitoring My ASP appserver (windows 2008) My ASP.net (windows 2008) – security update 1 Base Image My ASP.net (windows 2008) – security update 8 MultiCloudImage Very few and basic SharePoint v4 (windows 2003) – 32bit SharePoint v4 (windows 2003) –64bit SharePoint v4.5 (windows 2003) –64bit CentOS 5.2 Ubuntu 8.10 Win 2003 CentOS 5.4 Ubuntu 9.10 Win 2007 … Cloud Management Platform
  • 8. 8# We use the same ServerTemplates our customers do • RightScale uses 15-20 different ServerTemplates in Production • We don’t build images, we use pre-built MultiCloud Images with RightLink • We make heavy use of RightScale provided tool boxes (EBS, DNS, LB) • Off-the shelf: 1 template (MySQL) • Customized: App servers and load balancers • Written with RightScripts in Ruby, Bash, etc. • Mostly Rail apps to run our core services: front-end, API, Marketplace, etc. • From MultiCloud Image: Messaging and databases • RabbitMQ, Cassandra Cloud Management Platform
  • 9. 9# Deployments group RightScale services Cloud Management Platform
  • 10. 10# Best practices: Architecture • ServerTemplates can be used off the shelf or customized • Don’t bundle images • Make heavy use of MCI’s instead of hardcoding base RightImages • Deployments let you stage servers in the cloud • The use of inputs guarantee consistency across all servers • Easily test or failover • Macros/API automation can quickly stand up entire deployments Cloud Management Platform
  • 11. Operational Best Practices in the Cloud The release cycle
  • 12. 12# Challenges of the release cycle • Limited resources and lead time for procuring and provisioning equipment • Maintaining multiple environments from development through production • Maintaining consistency for reusability and QA • Distributed teams and team members Cloud Management Platform
  • 13. 13# A typical release cycle flow Cloud Management Platform
  • 14. 14# Our development environment • We keep a number of different deployments • Each development team has its own mini-environment • A larger integrated staging environment • One production environment • Accounts keep things organized and secure • We keep a separate accounts for staging and production • One team of sys admins manage all environments Cloud Management Platform
  • 15. 15# RightScale release cycle • One set of scripts and ServerTemplates are used everywhere • Gate accounts for security, development vs. production, etc. • Less test variance between Production and Staging • Only difference is size of environment • Easy to bring up development environment on demand using deployments and macros • Get it up and running, on demand in less than an hour • Cloud is pay-by-the-hour, so it is cheap to run temporary environments Cloud Management Platform
  • 16. 16# Best practices: Release cycle • Don’t be afraid to run many environments • Dynamically clone, launch and teardown environments for quick tests • Configure a fixed set of environment for development, integration, staging • Use different accounts to segregate users and configurations. • Sys admins are expensive. Cloud servers are cheap. • Reuse ServerTemplates to keep environments consistent • Make use of the versioning and freeze software repositories • Share or Publish them through the MultiCloud Marketplace • Create all-in-one ServerTemplates from the same RightScripts and recipes • Avoid upgrading existing servers, fail forward instead • Keep old servers running so you can rollback, or do post-mortem later on • For databases: Launch additional slaves. Freeze replication at upgrade point. Take snapshots! Cloud Management Platform
  • 17. 17# Release night steps (diagram: Front Ends, Main App, Databases with DB Master and DB Slaves)
      1) Servers with current code
      2) Servers with new code
      3) Add second slave
      4) Cut access to site
      5) Stop all access to databases
      6) Stop replication
      7) Take snapshot at cutoff
      8) Update schema
      9) Reconnect all servers
      10) Open access to site
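The release-night steps on this slide can be read as an ordered runbook: each step runs only if the previous one succeeded, and a failure halts the rollout while the servers with the current code are still running. The following is a minimal sketch of that idea, not RightScale's actual tooling. `Step`, `run_release`, and the step names (paraphrased from the slide) are hypothetical, and each action is a placeholder callable that returns True on success.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    name: str
    action: Callable[[], bool]  # hypothetical hook; returns True on success


# Step 1 on the slide ("servers with current code") is the starting state,
# so the runbook begins at step 2.
RELEASE_NIGHT = [
    "launch servers with new code",
    "add second slave",
    "cut access to site",
    "stop all access to databases",
    "stop replication",
    "take snapshot at cutoff",
    "update schema",
    "reconnect all servers",
    "open access to site",
]


def run_release(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order; stop at the first failure.

    Returns the names of the completed steps and the name of the failed
    step (None if every step succeeded).
    """
    completed: List[str] = []
    for step in steps:
        if not step.action():
            return completed, step.name
        completed.append(step.name)
    return completed, None
```

Because the old servers are never modified, a failed step leaves a known-good environment to reopen, which is the point of failing forward instead of upgrading in place.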
  • 18. Operational Best Practices in the Cloud Monitoring, alerts and escalations
  • 19. 19# Monitoring and alerts: Diagnose & optimize • Off-the-shelf monitoring • OS: CPU, Disk, Memory, Network, Processes, System • App: Apache, IIS, MySQL, Nginx, SQL Server • Plus many more CollectD plug-ins! • Custom monitoring • Cluster monitoring • Alerts & escalations Cloud Management Platform
  • 20. 20# Monitoring, alerts & escalations • We monitor as much relevant data as possible and display it in insightful ways to quickly detect patterns and abnormalities • We proactively eliminate the conditions that raise critical alerts • "No broken windows" policy: no critical alert can remain unresolved. (Graphs: API Network Activity; Dashboard Network Activity)
  • 21. 21# Off-the-shelf: MySQL Collectd Plugin Cloud Management Platform
  • 22. 22# Off-the-shelf: MySQL reads graphs • Read-random-next represents a table scan • Read-next represents an index scan Cloud Management Platform
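One way to turn the two series on this slide into a single number is to compute the share of row reads that came from table scans over an interval. The sketch below assumes the graphed series correspond to MySQL's `Handler_read_rnd_next` and `Handler_read_next` status counters (the slide does not name them explicitly). The function name `scan_fraction` and the dictionary inputs are illustrative.

```python
def scan_fraction(prev: dict, curr: dict) -> float:
    """Fraction of sequential row reads in the interval that were table scans.

    `prev` and `curr` are two snapshots of MySQL status counters taken at
    the start and end of the interval (for example from SHOW GLOBAL STATUS).
    """
    table_scan_reads = curr["Handler_read_rnd_next"] - prev["Handler_read_rnd_next"]
    index_scan_reads = curr["Handler_read_next"] - prev["Handler_read_next"]
    total = table_scan_reads + index_scan_reads
    if total <= 0:
        return 0.0  # no reads in the interval (or counters were reset)
    return table_scan_reads / total
```

A fraction that climbs over time suggests queries are missing indexes, which is the kind of pattern the read graphs are meant to expose.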
  • 23. 23# Custom: Whatever you want with collectd • Any statistic you can think of can easily be added as a monitor. • All of these are graph-able and alert-able in our dashboard! • Many can be written in less than an hour. • As easy as printing a line of formatted numbers every few seconds • support.rightscale.com is an authority on collectd • How we do it: • We use Ruby to write our custom monitors • Cassandra: jcollectd with JMX to pull out monitoring data from JavaBeans • Passenger: Ruby script that parses data from Passenger command line interface Cloud Management Platform
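The "printing a line of formatted numbers" remark matches how collectd's exec plugin works: a script writes `PUTVAL` lines to standard output and collectd ingests them. The slide says RightScale writes its monitors in Ruby. The sketch below uses Python purely for illustration, and the host, plugin, and metric names are made up.

```python
import sys


def format_putval(host, plugin, type_name, value, interval=10,
                  plugin_instance=None, type_instance=None):
    """Build one line in collectd's plain-text PUTVAL format.

    The identifier is host/plugin[-instance]/type[-instance]; "N" asks
    collectd to timestamp the value with the current time.
    """
    plugin_part = plugin if plugin_instance is None else f"{plugin}-{plugin_instance}"
    type_part = type_name if type_instance is None else f"{type_name}-{type_instance}"
    return f'PUTVAL "{host}/{plugin_part}/{type_part}" interval={interval} N:{value}'


def emit(samples, host, plugin, interval=10, out=sys.stdout):
    """Write one PUTVAL line per (type_instance, value) pair and flush.

    Flushing matters because collectd reads the script's output via a pipe.
    """
    for type_instance, value in samples.items():
        line = format_putval(host, plugin, "gauge", value,
                             interval=interval, type_instance=type_instance)
        out.write(line + "\n")
    out.flush()
```

A real monitor would call `emit` in a loop, sleeping for `interval` seconds between iterations, with `samples` filled from whatever the script measures (for Passenger, the parsed output of its command-line status tool).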
  • 24. 24# Custom: Cassandra monitors Cloud Management Platform
  • 25. 25# Cluster: Monitor hundreds of servers • We leverage a monitoring data warehouse to develop heat maps & stacked graphs Cloud Management Platform
  • 26. 26# Automated actions using alerts from monitors • Create an alert for any monitor, even your custom ones • RightScale example: Cassandra pending reads signals overloading • Break alerts into critical and warning • Critical: Wake me up! Page me! • Warning: Send email to team. • Trigger many actions: email, run script, scale, relaunch, reboot,… • Customize to your monitor, situation, and IT processes • RightScale example: Run a RightScript if swap is too high • Integrate with 3rd party services like PagerDuty Cloud Management Platform
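The split between critical and warning alerts can be sketched as a small rule evaluator: an alert fires only when its condition has held for several consecutive samples, and its severity decides which actions run. All names here (`AlertSpec`, `ACTIONS`, the action strings, and the thresholds in the usage note) are hypothetical and do not reflect RightScale's alert specification format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class AlertSpec:
    name: str
    threshold: float
    sustained_samples: int  # consecutive samples above threshold (>= 1)
    severity: str           # "critical" or "warning"


# Hypothetical escalation table: critical pages someone, warning only emails.
ACTIONS = {
    "critical": ["page_on_call", "email_team"],
    "warning": ["email_team"],
}


def triggered(spec: AlertSpec, samples: Sequence[float]) -> bool:
    """True if the most recent `sustained_samples` values all exceed the threshold."""
    if len(samples) < spec.sustained_samples:
        return False
    recent = samples[-spec.sustained_samples:]
    return all(value > spec.threshold for value in recent)


def actions_for(specs: Sequence[AlertSpec], samples: Sequence[float]) -> List[str]:
    """Collect the de-duplicated actions for every alert that fires."""
    actions: List[str] = []
    for spec in specs:
        if triggered(spec, samples):
            for action in ACTIONS[spec.severity]:
                if action not in actions:
                    actions.append(action)
    return actions
```

For example, with a warning at 50% swap for three samples and a critical at 90% for two samples, the series `[40, 60, 70, 95]` only emails the team, because the critical condition has not yet held for two samples in a row.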
  • 27. 27# Best practices: Monitoring and alerts • Monitor your critical processes off-the-shelf • Set monitors with scripts on your ServerTemplates • Use mon_process (e.g. Ruby) • Customize to your application needs • Use collectd plug-ins or easily build your own • The monitor is graphed in the RightScale dashboard • Plan out your critical alerts • Set your response plan: warnings vs. critical Cloud Management Platform
  • 28. Operational Best Practices in the Cloud When servers fail
  • 29. 29# How to think about server failure in the cloud • Design for failure • Make sure your application remains healthy after the failure of a node • Don’t use sticky sessions • Distribute your application services • Debug ServerTemplates and not servers • Use alerts to reboot and/or relaunch • Auto-scale app server arrays • Use dynamic DNS and static IPs for load balancers • Your app servers and databases will always know where to look Cloud Management Platform
  • 30. 30# Deep dive on database failure • Use database backups for rollbacks or disaster scenarios • Restore from backups in event of complete system failure • One-click with fully automated RightScale Database Managers • Use database redundancy for high availability (example master/slave) • Promote slave if master fails • Possible to prime your slave database to make failover more seamless • After promotion is complete, quick to launch a new slave • Worry about troubleshooting when you have time • One-click with fully automated RightScale Database Managers Cloud Management Platform
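The failover order on this slide (promote a slave, then launch a replacement, and fall back to backups only on complete failure) can be expressed as a small decision function. This is a sketch under stated assumptions: `Replica`, `failover_plan`, and the action strings are hypothetical, and choosing the slave with the least replication lag is one reasonable policy, not something the slide specifies.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Replica:
    name: str
    healthy: bool
    lag_seconds: float  # replication lag behind the master


def pick_promotion_candidate(replicas: Sequence[Replica]) -> Optional[Replica]:
    """Choose the healthy slave that is closest to the master's state."""
    healthy = [r for r in replicas if r.healthy]
    if not healthy:
        return None
    return min(healthy, key=lambda r: r.lag_seconds)


def failover_plan(master_healthy: bool, replicas: Sequence[Replica]) -> List[str]:
    """Return the ordered actions to take for the current database state."""
    if master_healthy:
        return []
    candidate = pick_promotion_candidate(replicas)
    if candidate is None:
        # Complete system failure: no slave left to promote.
        return ["restore_from_backup"]
    # Promote first, restore redundancy second, troubleshoot later.
    return [f"promote:{candidate.name}", "launch_new_slave"]
```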
  • 31. 31# Backups to block volumes and object stores
      • Block volumes: EBS snapshots
          + Easy to snapshot
          + Easy to rotate
          + Easy consistency
          + Instant restore (mount)
          - Difficult to move between clouds/regions
          - Must back up entire volume
          • What we do: EBS: Databases
      • Object stores: S3/Cloud Files
          + Back up into other clouds
          + Back up individual folders or files
          + Incremental backups (e.g. as files/data are flushed)
          - More coding, customization
          - Custom rotation strategy
          - Download time
          • What we do: S3: Monitoring system (Cassandra in the future)
  • 32. 32# Best practices: Planning for failure • No excuse for not backing up your servers • RightScale Database Manager + EBS tools make it easy to take backups • Plan your rotation policy • Database Manager helps you tailor daily, weekly, and monthly backups • Back up across clouds and regions • Database Manager for MySQL and SQL Server makes it easy to back up to S3 or Cloud Files from AWS, CloudStack, Eucalyptus, and Rackspace • Organize your backups • Keep track with lineages and timelines using the Database Managers • Test your backups! • It is easy and cheap on the cloud • A crisis is the worst time to find out your backups are corrupted
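A daily/weekly/monthly rotation policy can be sketched as a pure function that, given the dates of existing backups, returns the ones to keep. The retention counts and the function name `backups_to_keep` are assumptions made for illustration. This is not how the Database Manager implements rotation.

```python
from datetime import date, timedelta


def backups_to_keep(backup_dates, daily=7, weekly=4, monthly=12):
    """Return the sorted backup dates to retain.

    Keeps the newest `daily` backups, plus the newest backup from each of
    the most recent `weekly` ISO weeks and `monthly` calendar months.
    """
    ordered = sorted(set(backup_dates), reverse=True)  # newest first
    keep = set(ordered[:daily])

    newest_per_week = {}
    newest_per_month = {}
    for d in ordered:
        # setdefault keeps the first date seen, which is the newest.
        newest_per_week.setdefault(tuple(d.isocalendar())[:2], d)
        newest_per_month.setdefault((d.year, d.month), d)

    keep.update(sorted(newest_per_week.values(), reverse=True)[:weekly])
    keep.update(sorted(newest_per_month.values(), reverse=True)[:monthly])
    return sorted(keep)


# Example: 60 consecutive daily backups ending on 27 Oct 2011.
end = date(2011, 10, 27)
history = [end - timedelta(days=i) for i in range(60)]
print(backups_to_keep(history, daily=3, weekly=2, monthly=2))
```

Everything not returned is a candidate for deletion. Whatever policy is chosen, the slide's advice still applies: restore from the retained backups periodically to confirm they are usable.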
  • 33. Operational Best Practices in the Cloud Our best practices
  • 34. 34# Best practices for operating in the cloud • Keep your environment organized and consistent • Accounts, deployments, ServerTemplates, and macros • Change and debug configurations, not servers • ServerTemplates, MultiCloudImages, fail-forward • Monitor your servers efficiently • Off-the-shelf and custom monitoring and alerts • Automate, automate and also automate • Server arrays, macros/API for more complex flows, alert actions … • Back up your databases (organize, multi-cloud, rotate, test) • Database Manager ServerTemplates
  • 35. 35# Getting Started and Q&A
      • Contact RightScale: (866) 720-0208, sales@rightscale.com, www.rightscale.com
      • RightScale Conference: Nov 9 in Santa Clara, CA, www.RightScale.com/Conference
          • Attend technical breakout sessions
          • Talk with RightScale customers
          • Ask questions at the Expert Bar
          • Training on 11/8 and 11/10
      • More Info
          • Webinar archive: RightScale.com/webinars
          • White Papers: RightScale.com/whitepapers
          • Free Edition: RightScale.com/Free

Editor's Notes

  • #6: RightScale's ServerTemplates allow you to capture best practices for provisioning and automating cloud infrastructure. In this breakout session, we will explore how you can leverage the RightScale platform to share ServerTemplates with others. Specifically, we'll walk through the steps to share and update ServerTemplates across your organization. We'll also show you how to publish ServerTemplates publicly for the whole world to use. This topic is best for: IT members who are responsible for maintaining server configurations within the organization, developers who would like to share work product within their group or ISVs wishing to reach cloud users by publishing through RightScale.
  • #7: The cluster monitoring is very powerful in that it provides different types of views into the operation of large clusters of servers
  • #13: More specifically, we hear the following challenges: (Again, use this to unearth where they are having challenges.) Limited resources – In almost every phase, limited hardware poses problems. In architecting new systems there are rarely enough resources to experiment with alternative architectures or new technologies. For developers, limited resources usually means sharing hardware for testing. Testers rarely have enough hardware or time to do all the testing they would like to do: full performance and load testing, testing on complete production architectures, or testing disaster recovery scenarios. And delays in development often put pressure on testers to do their work faster to still reach the same deadline. The inability to spin up additional testing resources at these times causes quality to suffer. The result is that errors are found later in the cycle, where they are more expensive to fix. Limited equipment also means staff are constantly provisioning, tearing down, and re-provisioning the same equipment. It takes time, and if environments are not completely wiped clean, additional errors are potentially introduced. Time to procure and provision equipment – As the load on IT departments increases and the release cycles shorten, the wait for equipment to be procured and provisioned takes time away from valuable work. One customer stated it took 3-5 weeks to procure and provision new hardware. Maintaining consistent environments – As code moves through development, test, staging and production, changes to configurations in one stage rarely make it back into earlier stages. As new code is implemented from environments that haven’t been updated, the same errors are re-introduced.
Maintaining multiple environments – As if maintaining one consistent environment across many servers isn’t hard enough, most software requires testing on several different types of configurations (different versions of stacks, for different end user environments), one for each possible production scenario. For example, a software company may need to test their software on different operating systems or alongside various software packages. Most companies need to clone production environments to debug problems without impacting the current users. Whether it happens in development or QA, maintaining and reproducing environments is a time-consuming task. If the task is distributed across multiple administrators, coordinating the changes becomes challenging. If the task is consolidated under one administrator, there is a limit to the number of different environments s/he can reliably maintain. Distributed teams or team members – These add collaboration requirements and exacerbate all of the issues mentioned.
  • #14: With RightScale it’s easy to create consistent, reproducible configurations in each stage. In a typical development lifecycle, the systems architect creates a reference architecture that serves as a model for production, and then that architecture specifies what components are needed in each configuration.