Ben Bromhead
Cassandra… Every day I’m scaling
© DataStax, All Rights Reserved. 2
Who am I and What do I do?
• Co-founder and CTO of Instaclustr -> www.instaclustr.com
• Instaclustr provides Cassandra-as-a-Service in the cloud.
• We currently support AWS, Azure, Heroku, SoftLayer and private DCs, with more to come.
• Approaching 1000 nodes under management
• Yes… we are hiring! Come live in Australia!
© DataStax, All Rights Reserved. 3
1 Why scaling sucks in Cassandra
2 It gets better
3 Then it gets really awesome
© DataStax, All Rights Reserved. 4
Linear Scalability – In theory
© DataStax, All Rights Reserved. 5
Linear Scalability – In practice
© DataStax, All Rights Reserved. 6
What’s supposed to happen
• Scaling Cassandra is just “bootstrap new nodes” (sketched below)
• That works if your cluster is under-utilized and sitting at around 30% disk usage
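To be fair, the happy path really is short. A minimal sketch, assuming a stock package install with cluster_name and seeds already set in cassandra.yaml:
```
# auto_bootstrap defaults to true, so a fresh node streams its ranges on first start
sudo service cassandra start
nodetool netstats   # watch streaming progress
nodetool status     # new node goes UJ (joining) -> UN (normal)
```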
© DataStax, All Rights Reserved. 7
What actually happens
• Add 1 node
• Bootstrapping node fails (1 day)
• WTF - Full disk on bootstrapping node? (5 minutes)
• If using STCS, run sstablesplit on the large SSTables on the original nodes (2 days)
• Attach super-sized network storage (EBS) and bind-mount it on the bootstrapping node (see the sketch below).
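A rough sketch of that recovery; the volume ID, instance ID, device names, and paths are all placeholders, and nodes must be stopped while their SSTables are split in place:
```
# On each original node (STCS): split oversized SSTables into ~10 GB chunks
# (sstablesplit --size takes megabytes; the node must be stopped first)
sudo service cassandra stop
sstablesplit --no-snapshot --size 10240 /var/lib/cassandra/data/myks/mytable-*/*-Data.db
sudo service cassandra start

# On the bootstrapping node: attach an oversized EBS volume for headroom,
# then bind-mount it where the (wiped) data directory lives
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0fedcba9876543210 --device /dev/sdf
sudo mkfs.ext4 /dev/xvdf && sudo mount /dev/xvdf /mnt/bigdisk
sudo mount --bind /mnt/bigdisk /var/lib/cassandra/data
```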
© DataStax, All Rights Reserved. 8
What actually happens
• Restart bootstrapping process
• Disk alert 70% (2 days later)
• Throttle streaming throughput to below compaction throughput (sketch after this list)
• Bootstrapping finishes (5 days later)
• Cluster latency spikes because, although the bootstrap finished, there were a million
compactions remaining
• Take node offline and let compaction finish
• Run repair on node (10 years)
• Add next node.
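The throttling step in nodetool form. The numbers are illustrative, and note the units differ: streaming is set in megabits/s, compaction in megabytes/s.
```
# On the nodes streaming to the bootstrapping node:
nodetool setstreamthroughput 60       # ~7.5 MB/s of outbound streaming
# On the bootstrapping node: uncap compaction so it can keep up (0 = unthrottled)
nodetool setcompactionthroughput 0
nodetool compactionstats              # watch the pending-compaction count fall
```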
© DataStax, All Rights Reserved. 9
What actually happens
© DataStax, All Rights Reserved. 10
Scalability in Cassandra sucks
• So much over-streaming
• LCS and bootstrap – over-stream, then compact all the data!
• STCS and bootstrap – over-stream all the data and run out of disk space
© DataStax, All Rights Reserved. 11
Scalability in Cassandra sucks
• No vnodes? – Can only double your cluster
• Vnodes? – Can only add one node at a time
• Bootstrap – Fragile and not guaranteed to be consistent
© DataStax, All Rights Reserved. 12
Why does it suck for you?
Your database never meets your business requirements from a capacity perspective
(bad), and if you try to fix that…
• You could interrupt availability and performance (really bad)
• You could lose data (really, really bad)
© DataStax, All Rights Reserved. 13
How did it get this way?
It’s actually a hard problem:
• Moving large amounts of data between nodes needs just as much care from a CAP
perspective as the client-facing path
• New features don’t tend to consider their impact on scaling operations
• Features that help ops tend to be less sexy
© DataStax, All Rights Reserved. 14
Does it get better?
© DataStax, All Rights Reserved. 15
Yes!
Does it get better? Consistent bootstrap
Strongly consistent membership and ownership – CASSANDRA-9667
• Using LWT (lightweight transactions) to propose and claim ownership of new token
allocations in a consistent manner
• Work in progress
• You can do this today by pre-assigning non-overlapping (including replicas) vnode
tokens and setting cassandra.consistent.simultaneousmoves.allow=true as a JVM property
before bootstrapping your nodes (sketch below)
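A sketch of that workaround on a joining node; the token values and file paths are illustrative placeholders:
```
# cassandra.yaml: pre-assign this node's vnode tokens explicitly, chosen so
# that simultaneously joining nodes (and their replicas) don't overlap:
#   initial_token: 10000,20000,30000

# Then opt in to simultaneous consistent moves before starting the node:
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.simultaneousmoves.allow=true"' \
    | sudo tee -a /etc/cassandra/cassandra-env.sh
sudo service cassandra start
```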
© DataStax, All Rights Reserved. 16
Does it get better? Bootstrap stability
Keep-alives for all streaming operations – CASSANDRA-11841
• Streaming currently implements a timeout; you can reduce it to fail faster, but large
SSTables will then never finish streaming
Resumable bootstrap – CASSANDRA-8942 & CASSANDRA-8838
• You can do this in 2.2+ (sketch below)
Incremental bootstrap – CASSANDRA-8494
• Being worked on; hard to do with vnodes right now (try it… the error message uses
the word “thusly”). Instead, throttle streaming and uncap compaction so the
node doesn’t get overloaded during bootstrap
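In shell terms, with the timeout value purely illustrative and version-dependent:
```
# Pre-CASSANDRA-11841, streaming failure detection is a blunt socket timeout:
#   streaming_socket_timeout_in_ms: 3600000   # in cassandra.yaml; set it too
#                                             # low and large SSTables never finish
# 2.2+: if a bootstrap dies mid-stream, resume it instead of re-streaming:
nodetool bootstrap resume
```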
© DataStax, All Rights Reserved. 17
Can we make it even better?
© DataStax, All Rights Reserved. 18
Yes!
Can we make it even better?
© DataStax, All Rights Reserved. 19
• Let’s try scaling without data ownership changes
• Take advantage of Cassandra’s normal partitioning and availability mechanisms
• With a little help from our cloud providers!
Introducing token pinned scaling
© DataStax, All Rights Reserved. 20
• Probably needs a better name
• Here is how it works
Introducing token pinned scaling
© DataStax, All Rights Reserved. 21
With the introduction of:
• Partitioning SSTables by Range (CASSANDRA-6696)
• Range Aware Compaction (CASSANDRA-10540)
• A few extra lines of code to save/load a map of tokens to disks (coming soon)
Cassandra will now keep the data associated with specific tokens in a single data directory;
this lets us treat a disk as the unit we scale around!
But first, what do these two features actually let us do?
Introducing token pinned scaling
© DataStax, All Rights Reserved. 22
Before Partitioning SSTables by Range and Range Aware Compaction:
[Diagram: SSTables spanning token ranges 1 - 100, 901 - 1000, and 1401 - 1500 mixed across Disk0 and Disk1]
Introducing token pinned scaling
© DataStax, All Rights Reserved. 23
After Partitioning SSTables by Range and Range Aware Compaction:
[Diagram: the same ranges (1 - 100, 901 - 1000, 1401 - 1500), each now confined to its own SSTables on a single disk, Disk0 or Disk1]
Data within a token range is now kept on a specific disk
Introducing token pinned scaling
© DataStax, All Rights Reserved. 24
Your SSTables will converge to contain a single vnode range when things get big enough
[Diagram: at scale, each SSTable on Disk0 or Disk1 converges to a single vnode range (1 - 100, 901 - 1000, 1401 - 1500)]
Leveraging EBS to separate I/O from CPU
© DataStax, All Rights Reserved. 25
• Amazon Web Services provides a network-attached block store called EBS (Elastic
Block Store)
• Volumes are isolated to a single availability zone
• We can attach and reattach EBS volumes ad hoc, in seconds to minutes (sketch below)
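Reattaching a volume is two AWS CLI calls; the volume ID, instance ID, and device name below are placeholders:
```
# Detach a data volume from the lightly loaded node...
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
# ...and attach it to a freshly launched instance in the same AZ
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0fedcba9876543210 --device /dev/sdf
```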
Adding it all together
© DataStax, All Rights Reserved. 26
• Make each EBS disk a data directory in Cassandra
• Cassandra guarantees only data from a specific token range will exist on a given disk
• When throughput is low, attach all the disks in a single AZ to a single node and specify
the ranges from each disk via a comma-separated list of tokens (configuration sketch below)
• Up to 40 disks per instance!
• When load is high, launch more instances and spread the disks across the new
instances.
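A sketch of the node-side configuration; the mount points and token values are illustrative, and the token-to-disk map loader is the “few extra lines of code” mentioned earlier:
```
# cassandra.yaml fragment for a low-load node holding every disk in its AZ
cat >> /etc/cassandra/cassandra.yaml <<'EOF'
data_file_directories:
    - /mnt/ebs0    # pinned to tokens 1,5,10
    - /mnt/ebs1    # pinned to tokens 2,6,11
num_tokens: 6
initial_token: 1,2,5,6,10,11   # union of the tokens pinned to the attached disks
EOF
```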
Adding it all together
© DataStax, All Rights Reserved. 27
• Make each EBS disk a data directory in Cassandra
[Diagram: six Amazon EBS volumes (sda, sdb, sdc, sdd, sde, sdf) attached to an instance, one per data directory]
Adding it all together
© DataStax, All Rights Reserved. 28
• Cassandra guarantees only data from a specific token range will exist on a given disk
[Diagram: each Amazon EBS volume holding exactly one token range’s data]
Adding it all together
© DataStax, All Rights Reserved. 29
• When throughput is low, attach all disks in a single AZ to a single node
[Diagram: at 200 op/s, all Amazon EBS volumes attached to a single node]
Adding it all together
© DataStax, All Rights Reserved. 30
• When load is high, launch more instances and spread the disks across the new instances.
[Diagram: at 10,000 op/s, the Amazon EBS volumes spread across many instances]
How it works - Scaling
© DataStax, All Rights Reserved. 31
• Normally you have to provision your cluster at your maximum operations per second +
30% (headroom in case you get it wrong).
• Provision enough IOPS, CPU, RAM etc
• Makes Cassandra an $$$ solution
[Chart: a flat provisioned-workload line sized well above a fluctuating actual workload]
© DataStax, All Rights Reserved. 32
How it works - Scaling
© DataStax, All Rights Reserved. 33
• Let’s make our resources match our workload
[Chart: provisioned IOPS tracking the actual workload, with a separate, lower line for provisioned CPU & RAM]
How it works - Consistency
© DataStax, All Rights Reserved. 35
• No range movements! You don’t need a Jepsen test to see how bad range
movements are for consistency.
• Tokens and ranges are fixed during all scaling operations
• Range movements are where you see most of the consistency badness in Cassandra
(bootstrap, node replacement, decommission) and where you need to rely on repair.
How it works - Consistency
© DataStax, All Rights Reserved. 36
• Treats racks as a giant “meta-node”; NetworkTopologyStrategy ensures replicas are
on different racks (example below).
• AWS rack == AZ
• As a node’s tokens change based on the disks it holds, the replica topology stays the
same
• You can only swap disks between instances within the same AZ
• Scale one rack at a time… scale your cluster in constant time!
• If you want to do this with a single rack, you will have a bad time
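The rack-as-meta-node property comes from the keyspace’s replication strategy. A minimal example; the keyspace name, DC name (which must match your snitch), and RF are assumptions:
```
# RF 3 across 3 racks (AZs): NetworkTopologyStrategy places one replica per rack
cqlsh -e "CREATE KEYSPACE demo WITH replication = \
          {'class': 'NetworkTopologyStrategy', 'us_east_1': 3};"
```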
How it works - Consistency
© DataStax, All Rights Reserved. 37
[Diagram: six nodes across three racks with token sets 1,5,10 / 2,6,11 / 3,22,44 / 4,23,45 / 102,134,167 / 101,122,155; scaled in, a single node per rack owns the union 1,2,3,4,5,6,10,11,22,23,44,45,101 …]
How it works - Consistency
© DataStax, All Rights Reserved. 38
[Diagram: the same token sets (1,5,10 / 2,6,11 / 3,22,44 / 4,23,45 / 102,134,167 / 101,122,155) redistributed so each token set is again served by its own node]
How it works - TODO
© DataStax, All Rights Reserved. 39
Some issues remain:
• Hinted handoff breaks (handoff is based on endpoint rather than token)
• Gossip takes time to settle on any decent-sized cluster
• Currently we just clear out the system.local data directory to allow booting (sketch below)
• Can’t do this while repair is running… for some people this is all the time
• You’ll need to run repair more often as scaling intentionally introduces outages
• Breaks consistency (and everything else) for keyspaces where RF > the number of racks
(usually the system_auth keyspace).
• More work needed!
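The system.local hack from the list above, as a sketch; the data path is version-dependent and illustrative:
```
# Stop the node, wipe its persisted identity (tokens, host ID), then restart so
# it boots with the tokens implied by its newly attached disks. Crude but works.
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/system/local-*
sudo service cassandra start
```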
How it works – Real world
© DataStax, All Rights Reserved. 40
• No production tests yet :(
• We have gone from a 3-node cluster to a 36-node cluster in around 50 minutes.
• Plenty left to optimize (e.g. bake everything into an AMI to reduce startup time)
• Could get this down to 10 minutes per rack depending on how responsive AWS is!
• No performance overhead compared to Cassandra on EBS.
• Check out the code here: https://github.com/benbromhead/Cassandra/tree/ic-token-pinning
How it works – Real world
© DataStax, All Rights Reserved. 41
• Really this is bending some new and impending changes to do funky stuff :)
Questions?