Servers fail, who cares? (Answer: I do, sort of)
Gregg Ulrich, Netflix – @eatupmartha #netflixcloud #cassandra12


June 29, 2012: the night of the AWS US-East outage that took down Netflix and Instagram [1]
From the Netflix tech blog:

"Cassandra, our distributed cloud persistence store which is distributed across all zones and regions, dealt with the loss of one third of its regional nodes without any loss of data or availability." [2]
Topics
• Cassandra at Netflix
• Constructing clusters in AWS with Priam
• Resiliency
• Observations on AWS, Cassandra, and AWS/Cassandra
• Monitoring and maintenance
• References
Cassandra by the numbers

41         Number of production clusters
13         Number of multi-region clusters
4          Max regions, one cluster
90         Total TB of data across all clusters
621        Number of Cassandra nodes
72/34      Largest Cassandra cluster (nodes/data in TB)
80k/250k   Max read/writes per second on a single cluster
3*         Size of Operations team

           * We are hiring DevOps engineers and developers. Stop by our booth!
Netflix Deployed on AWS

Content:  Content Management, EC2 Encoding, S3 (petabytes)
Logs:     S3 (terabytes), EMR, Hive & Pig, Business Intelligence
Play:     DRM, CDN routing, Bookmarks, Logging
WWW:      Sign-Up, Search, Movie Choosing, Ratings
API:      Metadata, Device Configuration, TV Movie Choosing, Social (Facebook)
CS:       International CS lookup, Diagnostics & Actions, Customer Call Log, CS Analytics

Delivery: CDNs, ISPs (terabits), Customers
Constructing clusters in AWS with Priam
• Tomcat webapp for Cassandra administration
• Token management
• Full and incremental backups
• JMX metrics collection
• cassandra.yaml configuration
• REST API for most nodetool commands (see the sketch after this list)
• AWS Security Groups for multi-region clusters
• Open sourced, available on GitHub [3]
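Because Priam exposes these operations over HTTP, routine tasks can be scripted against the sidecar instead of shelling into every node. A minimal sketch, assuming a local Priam running in Tomcat; the port, context path, and endpoint names below are illustrative assumptions, not Priam's documented routes (see the GitHub wiki [3] for the real API):

```python
# Sketch: hit a node's Priam sidecar over HTTP.
# ASSUMPTION: the base URL and endpoint paths are illustrative only;
# check the Priam wiki on GitHub [3] for the actual REST routes.
import urllib.request

PRIAM_BASE = "http://localhost:8080/Priam/REST/v1"  # assumed port and context path

def priam_get(path: str) -> str:
    """GET a Priam REST endpoint and return the raw response body."""
    with urllib.request.urlopen(f"{PRIAM_BASE}/{path}", timeout=10) as resp:
        return resp.read().decode()

if __name__ == "__main__":
    # Hypothetical calls: fetch this node's token, then trigger a snapshot backup.
    print(priam_get("cassconfig/get_token"))
    print(priam_get("backup/do_snapshot"))
```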
Autoscaling Groups
Constructing a cluster in AWS (A): AWS terminology

• Region, Availability Zone (AZ), Instance
• Autoscaling Group (ASG): does not map directly to nodetool ring output, but is used to define the cluster (number of instances, AZs, etc.)
• Amazon Machine Image (AMI): image loaded on to an AWS instance; all packages needed to run an application
• Security Group: defines access control between ASGs

Address          DC        Rack   Status   State    Load        Owns     Token
###.##.##.###    eu-west   1a     Up       Normal   108.97 GB   16.67%   …
###.##.#.##      us-east   1e     Up       Normal   103.72 GB   0.00%    …
##.###.###.###   eu-west   1b     Up       Normal   104.82 GB   16.67%   …
##.##.##.###     us-east   1c     Up       Normal   111.87 GB   0.00%    …
##.###.##.###    eu-west   1c     Up       Normal   95.51 GB    16.67%   …
##.##.##.##      us-east   1d     Up       Normal   105.85 GB   0.00%    …
##.###.##.###    eu-west   1a     Up       Normal   91.25 GB    16.67%   …
###.##.##.###    us-east   1e     Up       Normal   102.71 GB   0.00%    …
##.###.###.###   eu-west   1b     Up       Normal   101.87 GB   16.67%   …
##.##.###.##     us-east   1c     Up       Normal   102.83 GB   0.00%    …
###.##.###.##    eu-west   1c     Up       Normal   96.66 GB    16.67%   …
##.##.##.###     us-east   1d     Up       Normal   99.68 GB    0.00%    …
Constructing a cluster in AWS (B): Cassandra configuration

App = cass_cluster. APP is not an AWS entity but one that we use internally to denote a service; it is part of Asgard [4], our open-sourced cloud application web interface.

ASG #1: Availability Zone = A, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
ASG #2: Availability Zone = B, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
ASG #3: Availability Zone = C, Region = us-east, Instance count = 6, Instance type = m2.4xlarge
(See the sketch after this slide for this layout expressed as API calls.)

Multi-region clusters have the same configuration in each region; just repeat what you see here.

Full and incremental backups go to local-region S3 via Priam. External full backups go to an alternate region and are saved for 30 days.
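To make the one-ASG-per-availability-zone layout above concrete, here is a minimal sketch using today's boto3 SDK (which did not exist when this deck was written); the group and launch configuration names are placeholders, and in practice Asgard and Priam handle this for you:

```python
# Sketch: one ASG per AZ, six m2.4xlarge instances each, mirroring the slide above.
# ASSUMPTION: a launch configuration "cass_cluster-lc" (AMI, instance type,
# security group) already exists; all names here are placeholders.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

for zone in ("us-east-1a", "us-east-1b", "us-east-1c"):
    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName=f"cass_cluster-{zone}",  # one ASG per availability zone
        LaunchConfigurationName="cass_cluster-lc",
        AvailabilityZones=[zone],
        MinSize=6,
        MaxSize=6,
        DesiredCapacity=6,                            # instance count = 6, as on the slide
    )
```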
Constructing a cluster in AWS (C): Putting it all together

The AMI contains the OS, base Netflix packages, Cassandra, and Priam (running under Tomcat).

(1) Alternate availability zones (a, b, c) around the ring to ensure data is written to multiple data centers.
(2) Survive the loss of a data center by ensuring that we only lose one node from each replication set.

Priam runs on each node and will:
• Assign tokens to each node, alternating the AZs around the ring per (1) and (2); see the sketch after this list
• Perform nightly snapshot backups to S3
• Perform incremental SSTable backups to S3
• Bootstrap replacement nodes to use vacated tokens
• Collect JMX metrics for our monitoring systems
• Serve REST API calls to most nodetool functions
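The alternating-AZ token assignment is easiest to see in code. A simplified sketch of the idea: tokens are spaced evenly around the RandomPartitioner ring and AZs rotate a, b, c, so any three adjacent replicas land in three different zones. Priam's real implementation also adds a per-region token offset for multi-region clusters, which is omitted here:

```python
# Sketch: evenly spaced RandomPartitioner tokens with AZs alternating around the
# ring, so each RF=3 replication set spans three different availability zones.
# Simplified: Priam's actual assignment also offsets tokens per region (omitted).
RING_SIZE = 2 ** 127                       # RandomPartitioner token space
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

def assign_tokens(node_count: int):
    """Return (zone, token) pairs for a new cluster of node_count nodes."""
    spacing = RING_SIZE // node_count
    return [(ZONES[i % len(ZONES)], i * spacing) for i in range(node_count)]

if __name__ == "__main__":
    for zone, token in assign_tokens(6):   # 6-node cluster, 2 nodes per AZ
        print(f"{zone}  token={token}")
```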
Resiliency – Instance
• RF = AZ = 3
• Cassandra bootstrapping works really well
• Replace nodes immediately
• Repair often
Resiliency – One availability zone
• RF = AZ = 3
• Alternating AZs ensures that each AZ has a full replica of data
• Provision the cluster to run at 2/3 capacity
• Ride out a zone outage; do not move to another zone
• Bootstrap one node at a time
• Repair after recovery
What happened on June 29th?
• During the outage
  • All Cassandra instances in us-east-1a were inaccessible
  • nodetool ring showed all of those nodes as DOWN
  • Monitored the other AZs to ensure availability
• Recovery – power restored to us-east-1a
  • The majority of instances rejoined the cluster without issue
  • Most of the remainder required a reboot to fix
  • The remaining nodes needed to be replaced, one at a time
Resiliency – Multiple availability zones
• Outage; can no longer satisfy quorum
• Restore from backup and repair
Resiliency – Region
• Connectivity loss between regions: operate as island clusters until service is restored
• Repair data between regions
• If an entire region disappears, watch DVDs instead
Observations: AWS
• Ephemeral drive performance is better than EBS
• S3-backed AMIs help us weather EBS outages
• Instances seldom die on their own
• Use as many availability zones as you can afford
• Understand how AWS launches instances
• I/O is constrained in most AWS instance types
  • Repairs are very I/O intensive
  • Large size-tiered compactions can impact latency
• SSDs [5] are game changers [6]
Observations: Cassandra
• A slow node is worse than a down node
• A cold cache increases load and kills latency
• Use whatever dials you can find in an emergency (see the sketch after this list)
  • Remove the node from the coordinator list
  • Compaction throttling
  • Min/max compaction thresholds
  • Enable/disable gossip
• Leveled compaction performance is very promising
• 1.1.x and 1.2.x should address some big issues
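Most of those dials map directly onto nodetool commands. A rough sketch of flipping them on a struggling node, assuming nodetool is on the PATH; the keyspace and column family names are placeholders:

```python
# Sketch: emergency "dials" reachable through nodetool on a single node.
# ASSUMPTIONS: nodetool is on PATH with local JMX access; "my_keyspace" and
# "my_cf" are placeholder names.
import subprocess

def nodetool(*args: str) -> None:
    subprocess.run(["nodetool", *args], check=True)

# Throttle compaction so it stops competing with reads and writes for I/O.
nodetool("setcompactionthroughput", "8")                             # MB/s

# Raise the size-tiered compaction thresholds on a hot column family.
nodetool("setcompactionthreshold", "my_keyspace", "my_cf", "4", "64")

# Take the node out of gossip without killing the process...
nodetool("disablegossip")
# ...and bring it back once it has caught up.
nodetool("enablegossip")
```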
Monitoring
• Actionable
  • Hardware and network issues (a sample down-node check follows this list)
  • Cluster consistency
• Cumulative trends
• Informational
  • Schema changes
  • Log file errors/exceptions
  • Recent restarts
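As one example of an actionable check, a cron job can scrape nodetool ring and page when a node is not Up/Normal. A sketch; the parsing is approximate and the column layout varies between Cassandra versions:

```python
# Sketch: flag nodes that are not Up/Normal in `nodetool ring` output.
# NOTE: column layout differs between Cassandra versions, so the parsing
# here is approximate; adjust the field indexes for your release.
import subprocess

def down_nodes() -> list[str]:
    out = subprocess.run(["nodetool", "ring"], capture_output=True, text=True, check=True)
    bad = []
    for line in out.stdout.splitlines():
        fields = line.split()
        # Data rows look like: address, DC, rack, status, state, load, ...
        if len(fields) >= 5 and fields[3] in ("Up", "Down"):
            if fields[3] != "Up" or fields[4] != "Normal":
                bad.append(fields[0])
    return bad

if __name__ == "__main__":
    for address in down_nodes():
        print(f"ALERT: {address} is not Up/Normal")
```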
Dashboards – identify anomalies
Maintenance
• Repair clusters regularly (see the rolling-repair sketch after this list)
• Run offline major compactions to avoid latency
  • SSDs will make this unnecessary
• Always replace nodes when they fail
• Periodically replace all nodes in the cluster
• Upgrade to new versions
  • Binary (rpm) for major upgrades or emergencies
  • Rolling AMI push over time
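Repairs are easiest to keep up with when they are scripted and walk the ring one node at a time, so only one replication set is under repair load at any moment. A sketch, where the host list and the ssh invocation stand in for whatever orchestration you already have:

```python
# Sketch: rolling `nodetool repair -pr` across a cluster, one node at a time.
# -pr repairs only each node's primary range, so one full pass repairs every
# range exactly once. Host names and the ssh call are placeholders.
import subprocess

NODES = ["cass-node-01", "cass-node-02", "cass-node-03"]   # placeholder host list

def repair(host: str) -> None:
    subprocess.run(["ssh", host, "nodetool", "repair", "-pr"], check=True)

if __name__ == "__main__":
    for host in NODES:
        print(f"repairing {host} ...")
        repair(host)
```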
References
1. A bad night: Netflix and Instagram go down amid Amazon Web Services outage (theverge.com)
2. Lessons Netflix learned from AWS Storm (techblog.netflix.com)
3. github / Netflix / priam (github.com)
4. github / Netflix / asgard (github.com)
5. Announcing High I/O Instances for Amazon (aws.amazon.com)
6. Benchmarking High Performance I/O with SSD for Cassandra on AWS (techblog.netflix.com)



Editor's Notes

  • #2: Outline of the presentation: the June 29 outage; context (Cassandra and AWS, updated usage numbers, architecture diagram with Cassandra called out); how clusters are constructed (blueprint diagrams should include: #1 AWS make-up, ASGs and AZs; #2 instance particulars; #3 Priam and S3); resiliency (node, zone, and region outages); Priam (bootstrapping, monitoring, backup and restore, open source); monitoring (what we monitor; tools we use: Epic/Atlas and dashboards) and maintenance tasks (Jenkins); things we monitor; issues we have; a note on SSDs.
  • #9: Minimum cluster size = 6
  • #21: Developer in house: quickly find problems by looking into code; documentation/tools for troubleshooting are scarce. Repairs: affect an entire replication set and cause very high latency in an I/O-constrained environment. Multi-tenant: hard to track changes being made; shared resources mean that one service can affect another; individual usage only grows; moving services to a new cluster while the service is live is non-trivial. Smaller per-node data: instance-level operations (bootstrap, compact, etc.) are faster.
  • #24: Extension of Epic, using preconfigured dashboards for each cluster; add additional metrics as we learn which to monitor.