Servers fail, who cares? (Answer: I do, sort of)
Gregg Ulrich, Netflix – @eatupmartha #netflixcloud #cassandra12


June 29, 2012: the night of the AWS US-East outage that took down Netflix and Instagram [1]
From the Netflix tech blog:

"Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability." [2]
Topics
• Cassandra at Netflix
• Constructing clusters in AWS with Priam
• Resiliency
• Observations on AWS, Cassandra, and AWS/Cassandra
• Monitoring and maintenance
• References
Cassandra by the numbers

41         Number of production clusters
13         Number of multi-region clusters
4          Max regions, one cluster
90         Total TB of data across all clusters
621        Number of Cassandra nodes
72/34      Largest Cassandra cluster (nodes/data in TB)
80k/250k   Max read/writes per second on a single cluster
3*         Size of Operations team

           * We are hiring DevOps engineers and developers. Stop by our booth!
Netflix Deployed on AWS

Content:  Content Management, EC2 Encoding, S3 (petabytes)
Logs:     S3 (terabytes), EMR, Hive & Pig, Business Intelligence
Play:     DRM, CDN routing, Bookmarks, Logging
WWW:      Sign-Up, Search, Movie Choosing, Ratings
API:      Metadata, Device Configuration, TV Movie Choosing, Social (Facebook)
CS:       International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics

Delivery: CDNs, ISPs (terabits), Customers
Constructing clusters in AWS with Priam
• Tomcat webapp for Cassandra administration
• Token management
• Full and incremental backups
• JMX metrics collection
• cassandra.yaml configuration
• REST API for most nodetool commands (see the sketch after this list)
• AWS Security Groups for multi-region clusters
• Open sourced, available on GitHub [3]
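Because Priam exposes these operations over HTTP, routine tasks can be scripted against the sidecar instead of shelling into every node. A minimal sketch, assuming a local Priam running in Tomcat; the port, context path, and endpoint names below are illustrative assumptions, not Priam's documented routes (see the GitHub wiki [3] for the real API):

```python
# Sketch: hit a node's Priam sidecar over HTTP.
# ASSUMPTION: the base URL and endpoint paths are illustrative only;
# check the Priam wiki on GitHub [3] for the actual REST routes.
import urllib.request

PRIAM_BASE = "http://localhost:8080/Priam/REST/v1"  # assumed port and context path

def priam_get(path: str) -> str:
    """GET a Priam REST endpoint and return the raw response body."""
    with urllib.request.urlopen(f"{PRIAM_BASE}/{path}", timeout=10) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    # Hypothetical calls: fetch this node's token, then trigger a snapshot backup.
    print(priam_get("cassconfig/get_token"))
    print(priam_get("backup/do_snapshot"))
```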
Autoscaling Groups
Constructing a cluster in AWS (A): AWS terminology

• Region, Availability Zone (AZ), Instance
• Autoscaling Group (ASG): does not map directly to nodetool ring output, but is used to define the cluster (number of instances, AZs, etc.)
• Amazon Machine Image (AMI): image loaded on to an AWS instance; all packages needed to run an application
• Security Group: defines access control between ASGs

Address          DC        Rack   Status   State    Load        Owns     Token
###.##.##.###    eu-west   1a     Up       Normal   108.97 GB   16.67%   …
###.##.#.##      us-east   1e     Up       Normal   103.72 GB   0.00%    …
##.###.###.###   eu-west   1b     Up       Normal   104.82 GB   16.67%   …
##.##.##.###     us-east   1c     Up       Normal   111.87 GB   0.00%    …
##.###.##.###    eu-west   1c     Up       Normal   95.51 GB    16.67%   …
##.##.##.##      us-east   1d     Up       Normal   105.85 GB   0.00%    …
##.###.##.###    eu-west   1a     Up       Normal   91.25 GB    16.67%   …
###.##.##.###    us-east   1e     Up       Normal   102.71 GB   0.00%    …
##.###.###.###   eu-west   1b     Up       Normal   101.87 GB   16.67%   …
##.##.###.##     us-east   1c     Up       Normal   102.83 GB   0.00%    …
###.##.###.##    eu-west   1c     Up       Normal   96.66 GB    16.67%   …
##.##.##.###     us-east   1d     Up       Normal   99.68 GB    0.00%    …
Constructing a cluster in AWS (B): Cassandra configuration

App = cass_cluster. APP is not an AWS entity but one that we use internally to denote a service; it is part of Asgard [4], our open-sourced cloud application web interface.

ASG #1: Availability Zone = A, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
ASG #2: Availability Zone = B, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
ASG #3: Availability Zone = C, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
(See the sketch after this slide for this layout expressed as API calls.)

Multi-region clusters have the same configuration in each region; just repeat what you see here.

Full and incremental backups go to local-region S3 via Priam. External full backups go to an alternate region and are saved for 30 days.
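To make the one-ASG-per-availability-zone layout above concrete, here is a minimal sketch using today's boto3 SDK (which did not exist when this deck was written); the group and launch configuration names are placeholders, and in practice Asgard and Priam handle this for you:

```python
# Sketch: one ASG per AZ, six m2.4xlarge instances each, mirroring the slide above.
# ASSUMPTION: a launch configuration "cass_cluster-lc" (AMI, instance type,
# security group) already exists; all names here are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

for zone in ("us-east-1a", "us-east-1b", "us-east-1c"):
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=f"cass_cluster-{zone}",  # one ASG per availability zone
        LaunchConfigurationName="cass_cluster-lc",
        AvailabilityZones=[zone],
        MinSize=6,
        MaxSize=6,
        DesiredCapacity=6,                            # instance count = 6, as on the slide
    )
```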
Constructing a cluster in AWS (C): Putting it all together

The AMI contains the OS, base Netflix packages, Cassandra, and Priam (running under Tomcat).

(1) Alternate availability zones (a, b, c) around the ring to ensure data is written to multiple data centers.
(2) Survive the loss of a data center by ensuring that we only lose one node from each replication set.

Priam runs on each node and will:
• Assign tokens to each node, alternating the AZs around the ring per (1) and (2); see the sketch after this list
• Perform nightly snapshot backups to S3
• Perform incremental SSTable backups to S3
• Bootstrap replacement nodes to use vacated tokens
• Collect JMX metrics for our monitoring systems
• Serve REST API calls to most nodetool functions
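The alternating-AZ token assignment is easiest to see in code. A simplified sketch of the idea: tokens are spaced evenly around the RandomPartitioner ring and AZs rotate a, b, c, so any three adjacent replicas land in three different zones. Priam's real implementation also adds a per-region token offset for multi-region clusters, which is omitted here:

```python
# Sketch: evenly spaced RandomPartitioner tokens with AZs alternating around the
# ring, so each RF=3 replication set spans three different availability zones.
# Simplified: Priam's actual assignment also offsets tokens per region (omitted).
RING_SIZE = 2 ** 127                       # RandomPartitioner token space
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

def assign_tokens(node_count: int):
    """Return (zone, token) pairs for a new cluster of node_count nodes."""
    spacing = RING_SIZE // node_count
    return [(ZONES[i % len(ZONES)], i * spacing) for i in range(node_count)]

if __name__ == "__main__":
    for zone, token in assign_tokens(6):   # 6-node cluster, 2 nodes per AZ
        print(f"{zone}  token={token}")
```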
Resiliency – Instance
• RF = AZ = 3
• Cassandra bootstrapping works really well
• Replace nodes immediately
• Repair often
Resiliency – One availability zone
• RF = AZ = 3
• Alternating AZs ensures that each AZ has a full replica of data
• Provision the cluster to run at 2/3 capacity
• Ride out a zone outage; do not move to another zone
• Bootstrap one node at a time
• Repair after recovery
What happened on June 29th?
• During the outage
  • All Cassandra instances in us-east-1a were inaccessible
  • nodetool ring showed all of those nodes as DOWN
  • Monitored the other AZs to ensure availability
• Recovery – power restored to us-east-1a
  • The majority of instances rejoined the cluster without issue
  • Most of the remainder required a reboot to fix
  • The remaining nodes needed to be replaced, one at a time
Resiliency – Multiple availability zones
• Outage; can no longer satisfy quorum
• Restore from backup and repair
Resiliency – Region
• Connectivity loss between regions: operate as island clusters until service is restored
• Repair data between regions
• If an entire region disappears, watch DVDs instead
Observations: AWS
• Ephemeral drive performance is better than EBS
• S3-backed AMIs help us weather EBS outages
• Instances seldom die on their own
• Use as many availability zones as you can afford
• Understand how AWS launches instances
• I/O is constrained in most AWS instance types
  • Repairs are very I/O intensive
  • Large size-tiered compactions can impact latency
• SSDs [5] are game changers [6]
Observations: Cassandra
• A slow node is worse than a down node
• A cold cache increases load and kills latency
• Use whatever dials you can find in an emergency (see the sketch after this list)
  • Remove the node from the coordinator list
  • Compaction throttling
  • Min/max compaction thresholds
  • Enable/disable gossip
• Leveled compaction performance is very promising
• 1.1.x and 1.2.x should address some big issues
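Most of those dials map directly onto nodetool commands. A rough sketch of flipping them on a struggling node, assuming nodetool is on the PATH; the keyspace and column family names are placeholders:

```python
# Sketch: emergency "dials" reachable through nodetool on a single node.
# ASSUMPTIONS: nodetool is on PATH with local JMX access; "my_keyspace" and
# "my_cf" are placeholder names.
import subprocess

def nodetool(*args: str) -> None:
    subprocess.run(["nodetool", *args], check=True)

# Throttle compaction so it stops competing with reads and writes for I/O.
nodetool("setcompactionthroughput", "8")                             # MB/s

# Raise the size-tiered compaction thresholds on a hot column family.
nodetool("setcompactionthreshold", "my_keyspace", "my_cf", "4", "64")

# Take the node out of gossip without killing the process...
nodetool("disablegossip")
# ...and bring it back once it has caught up.
nodetool("enablegossip")
```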
Monitoring
• Actionable
  • Hardware and network issues (a sample down-node check follows this list)
  • Cluster consistency
• Cumulative trends
• Informational
  • Schema changes
  • Log file errors/exceptions
  • Recent restarts
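As one example of an actionable check, a cron job can scrape nodetool ring and page when a node is not Up/Normal. A sketch; the parsing is approximate and the column layout varies between Cassandra versions:

```python
# Sketch: flag nodes that are not Up/Normal in `nodetool ring` output.
# NOTE: column layout differs between Cassandra versions, so the parsing
# here is approximate; adjust the field indexes for your release.
import subprocess

def down_nodes() -> list[str]:
    out = subprocess.run(["nodetool", "ring"], capture_output=True, text=True, check=True)
    bad = []
    for line in out.stdout.splitlines():
        fields = line.split()
        # Data rows look like: address, DC, rack, status, state, load, ...
        if len(fields) >= 5 and fields[3] in ("Up", "Down"):
            if fields[3] != "Up" or fields[4] != "Normal":
                bad.append(fields[0])
    return bad

if __name__ == "__main__":
    for address in down_nodes():
        print(f"ALERT: {address} is not Up/Normal")
```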
Dashboards – identify anomalies
Maintenance
• Repair clusters regularly (see the rolling-repair sketch after this list)
• Run offline major compactions to avoid latency
  • SSDs will make this unnecessary
• Always replace nodes when they fail
• Periodically replace all nodes in the cluster
• Upgrade to new versions
  • Binary (rpm) for major upgrades or emergencies
  • Rolling AMI push over time
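Repairs are easiest to keep up with when they are scripted and walk the ring one node at a time, so only one replication set is under repair load at any moment. A sketch, where the host list and the ssh invocation stand in for whatever orchestration you already have:

```python
# Sketch: rolling `nodetool repair -pr` across a cluster, one node at a time.
# -pr repairs only each node's primary range, so one full pass repairs every
# range exactly once. Host names and the ssh call are placeholders.
import subprocess

NODES = ["cass-node-01", "cass-node-02", "cass-node-03"]   # placeholder host list

def repair(host: str) -> None:
    subprocess.run(["ssh", host, "nodetool", "repair", "-pr"], check=True)

if __name__ == "__main__":
    for host in NODES:
        print(f"repairing {host} ...")
        repair(host)
```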
References
1. A bad night: Netflix and Instagram go down amid Amazon Web Services outage (theverge.com)
2. Lessons Netflix learned from AWS Storm (techblog.netflix.com)
3. github / Netflix / priam (github.com)
4. github / Netflix / asgard (github.com)
5. Announcing High I/O Instances for Amazon (aws.amazon.com)
6. Benchmarking High Performance I/O with SSD for Cassandra on AWS (techblog.netflix.com)



Editor's Notes

  • #2: Outline of the presentation: the June 29 outage; context (Cassandra and AWS, updated usage numbers, architecture diagram with Cassandra called out); how clusters are constructed (blueprint diagrams should include: #1 AWS make-up, ASGs and AZs; #2 instance particulars; #3 Priam and S3); resiliency (node, zone, and region outages); Priam (bootstrapping, monitoring, backup and restore, open source); monitoring (what we monitor; tools we use: Epic/Atlas and dashboards) and maintenance tasks (Jenkins); things we monitor; issues we have; a note on SSDs.
  • #9: Minimum cluster size = 6
  • #21: Developer in house: quickly find problems by looking into code; documentation/tools for troubleshooting are scarce. Repairs: affect an entire replication set and cause very high latency in an I/O-constrained environment. Multi-tenant: hard to track changes being made; shared resources mean that one service can affect another; individual usage only grows; moving services to a new cluster while the service is live is non-trivial. Smaller per-node data: instance-level operations (bootstrap, compact, etc.) are faster.
  • #24: Extension of Epic, using preconfigured dashboards for each cluster; add additional metrics as we learn which to monitor.