Running Cassandra in AWS
Patrick Eaton, PhD
patrick@stackdriver.com
@PatrickREaton

Joey Imbasciano
joey@stackdriver.com
@_joeyi
Stackdriver at a Glance

Stackdriver's hosted intelligent monitoring service helps
SaaS companies innovate more by reducing the burden of
day-to-day operations
● Cloud-native and cloud-aware
● Designed for complex distributed applications
● Founded by cloud/infrastructure industry veterans
(Microsoft, VMware, EMC, Endeca, Red Hat) with deep
systems and DevOps expertise
● Team of ~25, based in Downtown Boston
Intelligent Monitoring
Discover customer’s cloud-hosted applications
● Infrastructure inventory
● Logical units, like groups/clusters
● Services, hosted and self-managed
● Elastic resources

Monitor
● Various data sources
○ Provider metrics
○ Host metrics
○ Custom metrics
○ Endpoints
○ Events
○ Health
● Rich visualizations

Analyze
● Integrate data sources
● Aggregate metrics
● Report utilization, cost, etc.
● Detect policy violations
● Recommend actions
Lambda Architecture
● Typical of modern architectures for on-line applications
● Formalized by Nathan Marz
● Composed of "batch", "speed", and "serving" layers
● Batch layer
○ Store of record
○ Compute arbitrary views
● Speed layer
○ Low latency updates
○ Streaming algorithms
● Serving layer
○ Combine data from batch and speed layers to answer queries
[Diagram components: Data; Batch; Speed; Serving]
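
As a rough illustration of how a serving layer can combine the two paths, here is a minimal Python sketch. The function names, the hourly bucket size, and the idea of a "batch horizon" timestamp are assumptions made for this example; the slides do not describe an implementation.

```python
from collections import defaultdict

def batch_view(points, bucket_seconds=3600):
    """Batch layer: recompute per-bucket sums from the full store of record."""
    view = defaultdict(float)
    for ts, value in points:
        view[ts - ts % bucket_seconds] += value
    return dict(view)

def serve(batch, batch_horizon, recent_points, bucket_seconds=3600):
    """Serving layer: use batch results up to the horizon, speed-layer data after it."""
    merged = dict(batch)
    for ts, value in recent_points:
        if ts >= batch_horizon:  # not yet covered by the last batch run
            bucket = ts - ts % bucket_seconds
            merged[bucket] = merged.get(bucket, 0.0) + value
    return merged

# Hypothetical data: (timestamp in seconds, value)
archive = [(0, 1.0), (1800, 2.0), (3600, 4.0)]
live = [(3600, 4.0), (7200, 8.0), (7300, 1.0)]

batch = batch_view(archive)
result = serve(batch, batch_horizon=7200, recent_points=live)
print(result)
```

The point at timestamp 3600 appears in both inputs but is counted once, because the serving layer trusts the batch view for everything before the horizon.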
Stackdriver Architecture
● Shares characteristics of lambda architecture
● Indexing (speed) path
○ Make "live" data available "pre-analysis"
● Analysis (batch) path
○ Compute aggregations
○ Create recommendations
● Query (serving) layer
○ Combine "live" and analyzed data to answer queries
○ May require on-the-fly analysis
● Alerting (speed) path (not discussed here)
○ Stream processing to detect policy-based anomalies
[Diagram components: Data; Indexing (Speed); Analysis (Batch); Alerting (Speed); Database; Query (Serving); Notification (Serving)]
Database Options
● We chose Cassandra!
○ True P2P architecture
○ Good support for write-heavy workloads
○ Compatible data model for time series data
■ Column per metric type, timestamps as columns
● Why not MySQL?
○ Experience with operating large, sharded deployments
○ Relational data model not a good match
● Why not HBase?
○ Operational complexity - zk, hadoop, hdfs, ...
○ Special "Master" role
● Why not Dynamo?
○ Avoid vendor lock-in and high cost
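
The time-series data model mentioned above can be pictured with a small in-memory sketch. This is one plausible reading of "timestamps as columns" (a wide row per resource and metric, with one column per timestamp); the class name, key layout, and sample values are assumptions, not the schema Stackdriver used.

```python
import bisect

class WideRow:
    """One row per (resource, metric); columns are kept sorted by timestamp."""

    def __init__(self):
        self.timestamps = []
        self.values = []

    def append(self, ts, value):
        # Insert in timestamp order so range reads stay a contiguous slice.
        i = bisect.bisect_right(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def slice(self, start, end):
        """Return (timestamp, value) pairs with start <= timestamp <= end."""
        lo = bisect.bisect_left(self.timestamps, start)
        hi = bisect.bisect_right(self.timestamps, end)
        return list(zip(self.timestamps[lo:hi], self.values[lo:hi]))

table = {}

def write(resource, metric, ts, value):
    table.setdefault((resource, metric), WideRow()).append(ts, value)

# Hypothetical resource and metric names
write("i-abc123", "cpu", 120, 0.7)
write("i-abc123", "cpu", 60, 0.5)
write("i-abc123", "cpu", 180, 0.9)
window = table[("i-abc123", "cpu")].slice(60, 120)
print(window)
```

Writes only ever add columns, and a time-range query touches a single row, which is why this layout suits write-heavy, append-only workloads.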
Stackdriver Architecture ++
● Archival pipeline stores all data
○ Very small surface area, battle-tested
○ Critical for disaster recovery
○ S3 considered durable enough
○ Replicated for availability
● Archive means Cassandra is "soft state"
● C* consolidates analysis and indexing results
● Properties of data in C*
○ Immutable data
○ Append-only
○ Read-1, write-1 consistency
● Scales out easily
○ Indexers, archivers, analyzers, query servers
[Diagram components: Data; Index; Archive; S3; Analyze; Cassandra; Inventory; Data Series; Roll-ups; Analysis; Recs; Query]
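
To make the read-1, write-1 trade-off concrete: a read is only guaranteed to see the latest write when the read and write replica sets must overlap, that is, when R + W > RF. The helper below is an illustrative sketch (the function name is ours); the replication factor of 3 comes from the cluster configuration slide.

```python
def read_sees_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

# Read-1, write-1 with RF = 3: a read may hit a replica the write has not reached yet.
one_one = read_sees_latest_write(3, write_replicas=1, read_replicas=1)

# Quorum reads and writes (2 of 3) would always overlap, at the cost of latency.
quorum = read_sees_latest_write(3, write_replicas=2, read_replicas=2)

print(one_one, quorum)
```

Because the data is immutable and append-only, a read that misses a recent write returns older data rather than conflicting data, which is presumably what makes the weaker setting acceptable here.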
Cassandra at Stackdriver Cluster Configuration
● Version: DataStax Community Edition 1.2.10
● Replication Factor: 3
● Vnodes
● Murmur3Partitioner
● Ec2Snitch
○ Aids in request efficiency
○ Enables Cassandra to ensure replicas are in different Availability Zones
● phi_convict_threshold: 8 -> 12
○ Used to determine when nodes are down
○ AWS network can be spotty
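
To see what raising phi_convict_threshold from 8 to 12 buys, here is a simplified model of a phi accrual failure detector. It assumes heartbeat gaps are exponentially distributed, so phi = t / (mean × ln 10); Cassandra's real detector estimates the mean from a window of observed intervals and differs in detail, and the 1-second mean interval below is an assumption for illustration.

```python
import math

def phi(seconds_since_heartbeat, mean_interval):
    """Suspicion level: -log10 of the chance that a live node is this late."""
    return seconds_since_heartbeat / (mean_interval * math.log(10))

def silence_before_conviction(threshold, mean_interval):
    """Seconds of silence needed before phi reaches the convict threshold."""
    return threshold * mean_interval * math.log(10)

old = silence_before_conviction(8, mean_interval=1.0)
new = silence_before_conviction(12, mean_interval=1.0)
print(f"threshold 8: {old:.1f}s, threshold 12: {new:.1f}s")
```

Under this model the higher threshold tolerates 1.5 times as much silence before a node is marked down, which trades slower failure detection for fewer false convictions on a spotty network.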
Cassandra Topology in AWS
[Diagram: two panels, "Where we started..." and "Where we are...", showing nodes placed in Availability Zones us-east-1a, us-east-1b, and us-east-1c]
Keep it balanced!
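
One simple way to "keep it balanced" when adding nodes is to always place the next node in the Availability Zone that currently has the fewest. The helper below is a hypothetical sketch, not the tooling described in the talk.

```python
from collections import Counter

ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

def next_zone(existing_node_zones, zones=ZONES):
    """Pick the least-populated zone; ties go to the alphabetically first zone."""
    counts = Counter({zone: 0 for zone in zones})
    counts.update(existing_node_zones)
    return min(zones, key=lambda zone: (counts[zone], zone))

current = ["us-east-1a", "us-east-1a", "us-east-1b"]
choice = next_zone(current)
print(choice)
```

With a replication factor of 3 and a snitch that spreads replicas across three zones, each zone ends up holding roughly one copy of the data, so a zone with fewer nodes puts more load on each of them.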
Cassandra EC2 Node Configuration
● m1.xlarge
○ 4 cores
○ 15 GB RAM
○ 4 ephemeral disks available

● 4 disks RAID-0 for Data Volume and CommitLog
○ ext4 - defaults,noatime
○ mdadm RAID-0
○ Compactions
○ Heavy Read/Write IO
Cassandra Automation and Operations
● Combination of Boto, Fabric, & Puppet
○ Boto for AWS API
○ Fabric + Puppet for Bootstrapping
○ Fabric for Operations

● One command to:
○ Launch a new cluster
○ Upsize a cluster
○ Replace a dead node
○ Remove existing nodes
○ List nodes in a cluster
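
The "one command" idea can be sketched as a small command-line dispatcher. Everything here is hypothetical: the program name, subcommand names, arguments, and step descriptions are ours, and the plan only describes which kind of tool would handle each step rather than calling Boto, Fabric, or Puppet.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="cassandra-ops")
    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("launch", "upsize"):
        p = sub.add_parser(name)
        p.add_argument("cluster")
        p.add_argument("--nodes", type=int, required=True)
    for name in ("replace", "remove"):
        p = sub.add_parser(name)
        p.add_argument("cluster")
        p.add_argument("node")
    sub.add_parser("list").add_argument("cluster")
    return parser

def plan(args):
    """Describe the steps for a command without performing them."""
    if args.command in ("launch", "upsize"):
        return [f"AWS API: start {args.nodes} instance(s) for {args.cluster}",
                "bootstrap: install and configure Cassandra on each instance"]
    if args.command == "replace":
        return [f"AWS API: start a replacement for {args.node}",
                "bootstrap: configure the new node to take over for the dead one"]
    if args.command == "remove":
        return [f"operations: decommission {args.node}",
                f"AWS API: terminate {args.node}"]
    return [f"AWS API: list instances belonging to {args.cluster}"]

args = build_parser().parse_args(["upsize", "metrics", "--nodes", "3"])
steps = plan(args)
print("\n".join(steps))
```

Keeping each operation behind a single subcommand means the sequence of provisioning, bootstrapping, and ring changes is encoded once instead of being repeated by hand.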
Our (Internal) Slogan
Cassandra Backups using S3
● No Cassandra Powered Backups
● Restore from S3
● Useful for major version upgrades
1. Data is archived when it is received
2. Bulk loader reads from S3
3. M/R re-analyzes data
4. Cassandra is repopulated
[Diagram components: Data; S3; Bulk Loader; MapReduce; Cassandra]
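
The four restore steps can be walked through with a toy in-memory model. The record format, function names, and one-minute roll-up bucket are assumptions for illustration; the real pipeline used S3, a bulk loader, and MapReduce jobs.

```python
from collections import defaultdict

def archive(s3_objects, record):
    """Step 1: every record is archived as it is received."""
    s3_objects.append(record)

def bulk_load(s3_objects):
    """Step 2: the bulk loader reads the archived records back."""
    return list(s3_objects)

def reanalyze(records, bucket_seconds=60):
    """Step 3: recompute roll-ups (here, per-minute sums) from raw records."""
    sums = defaultdict(float)
    for metric, ts, value in records:
        sums[(metric, ts - ts % bucket_seconds)] += value
    return dict(sums)

def repopulate(records, rollups):
    """Step 4: an empty cluster is filled with raw series and roll-ups."""
    return {"data_series": list(records), "rollups": rollups}

s3 = []
for rec in [("cpu", 0, 1.0), ("cpu", 30, 3.0), ("cpu", 60, 5.0)]:
    archive(s3, rec)

loaded = bulk_load(s3)
restored = repopulate(loaded, reanalyze(loaded))
print(restored["rollups"])
```

Because derived data is recomputed from the archive, the same path can rebuild a cluster after a failure or populate a fresh one during a major version upgrade.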
Disaster Recovery in the Wild
● October 23, Stackdriver suffered a total loss of our C* cluster
○ Exhausted memory due to number of open file descriptors (see graph)
● We did not notice the problem until it was too late
○ Nodes began crashing, resulting in an inconsistent view of the ring
● Attempted to restart the cluster unsuccessfully for ~2 hours
● Provisioned a new 36-node cluster in ~2 hours
● Directed “live” data to new cluster
● Started bulk restore operation from archive
○ Full-fidelity data and aggregations
● No data loss due to archival pipeline
● See http://www.stackdriver.com/post-mortem-october-23-stackdriver-outage/
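
Since the outage went unnoticed until nodes were already failing, a basic check on open file descriptors is the kind of early warning that might have helped. The sketch below is Linux-only and illustrative: the 80% warning ratio and the function names are assumptions, and this is not the monitoring Stackdriver put in place.

```python
import os
import resource

def count_open_fds(pid="self"):
    """Linux-only: each entry in /proc/<pid>/fd is one open descriptor."""
    return len(os.listdir(f"/proc/{pid}/fd"))

def fd_alert(open_fds, soft_limit, warn_ratio=0.8):
    """True when descriptor usage reaches warn_ratio of the soft limit."""
    if soft_limit <= 0:  # unlimited or unknown limit: nothing to compare against
        return False
    return open_fds >= warn_ratio * soft_limit

if os.path.isdir("/proc/self/fd"):
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("alert:", fd_alert(count_open_fds(), soft))
```

Checking another process, such as a Cassandra node, means passing its process ID and running with enough privileges to read its /proc entry. Tracking the count over time, rather than only comparing against the limit, would also catch steady growth like the one described above.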
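Cluster Restoration Process
[Diagram: historical data is restored from S3 into the new cluster via MapReduce and the bulk loader, while the gateway directs new data to the new cluster; the UI and API servers and the old cluster are also shown]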
Thank you!
Yes, we are hiring!
Patrick Eaton - patrick@stackdriver.com - @PatrickREaton
Joey Imbasciano - joey@stackdriver.com - @_joeyi

