SOMETHING SOMETHING
COORDINATOR NODES
Cheating Our Way to Better Performance
JON HADDAD
LEARN DATA MODELING BY EXAMPLE
THIS IS
AWESOME!!
GO TO ROOM 210A NOW!
Eric Lubow @elubow #CassandraSummit
PERSONAL VANITY
• CTO of SimpleReach
• Co-Author of Practical Cassandra
• Skydiver, Mixed Martial Artist, Motorcyclist, Dog Dad (IG: @charliedognyc), NY Giants fan
SIMPLEREACH
• Help marketers organize
• Identify the best content
• Use engagement metrics
• Workflow solution
• Stream processing ingest
• Many metrics, time sliced
• Multiple data stores
CONCEPTS YOU SHOULD UNDERSTAND
1. Thick clients and thin clients
2. CPU utilization and load average
3. Database tuning may not have anything to do with the database
WHAT IS A FAT CLIENT?
"A fat client (also called heavy, rich, or thick client) is a computer (client) in client–server architecture or networks that typically provides rich functionality independent of the central server." (Wikipedia)
THIN CLIENTS AND THICK CLIENTS
• Thin clients typically can't operate without the "server"
• Thick clients try to do more locally compared to thin clients, which try to do more remotely
• Thick clients require more resources, but fewer servers
• Thin clients require fewer resources and more servers
ONCE MORE, BUT WITH CASSANDRA
• Fat clients have the Cassandra binary, but no data
• Data nodes are denser and more focused on storage
• For context, we can call the fat clients proxy nodes
• Proxy nodes are more compute heavy
• Fat clients only handle coordination
WHAT’S REALLY GOING ON HERE?
• Fat clients are effectively just a settings change
• Cassandra < 2.2: -Dcassandra.join_ring=false (hack)
• No data on the nodes, just coordination responsibility
• Intentionally sidestepping Cassandra's homogeneous nature in favor of performance
• There can be lots of room for adding proxy nodes without incurring additional performance loss from increasing the ring size
• Reduces per-node work on the data nodes
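The claim that a proxy tier reduces per-node work on the data nodes can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name, the replication factor of 3, the request count, and the idea that requests spread evenly and each touches every replica (as a write would) are illustrative choices, not figures from the talk.

```python
def per_data_node_work(requests, data_nodes, replication_factor, proxy_tier):
    """Toy model: operations handled by each data node.

    Assumes requests are spread evenly and every request touches
    `replication_factor` replicas (both are simplifying assumptions).
    Returns (coordination_ops, replica_ops) per data node.
    """
    replica_ops = requests * replication_factor / data_nodes
    # With a proxy tier, coordination happens on the proxies instead.
    coordination_ops = 0.0 if proxy_tier else requests / data_nodes
    return coordination_ops, replica_ops


# Hypothetical numbers: 30,000 requests across 30 data nodes, RF = 3.
without_proxies = per_data_node_work(30_000, 30, 3, proxy_tier=False)
with_proxies = per_data_node_work(30_000, 30, 3, proxy_tier=True)
print(without_proxies)
print(with_proxies)
```

In this model the replica work per data node is unchanged; only the coordination share moves off the data nodes and onto the proxies.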
NORMAL CASSANDRA SETUP
[Diagram: an application tier (App 1, App 2) alongside a Cassandra cluster of eight C* nodes]
CASSANDRA PROXY TIER SETUP
[Diagram: an application tier (App 1, App 2), a proxy tier (two Proxy nodes), and a Cassandra cluster of eight Data nodes]
Data Nodes: c3.4xlarge
Proxy Nodes: c3.4xlarge
TRADEOFFS
• More compute power for token calculations
• More compute power for writing data
• More focused compute on coordination tasks
• Smarter allocation of instance types
• Cheaper hardware for proxy instances

• More instance types to manage
• More infrastructure overhead
• Requires different monitoring
• High potential for a nasty accident (forgetting to configure a node as a proxy)
AVERAGE CLUSTER CPU UTILIZATION
[Figure: Before and After charts of average cluster CPU utilization]
HOW DID WE DO?
HOW DO WE KNOW?
• Why are we talking about CPU utilization and load average?
• Terminology is important
• Understanding gains/losses is important
• Let's talk about CPU utilization and load average
WHAT IS LOAD AVERAGE?
• Let's use a traffic analogy for load average
• Imagine you are the bridge operator of a single-lane bridge (single CPU):
  • 0.00 means there's no traffic on the bridge at all. In fact, anything between 0.00 and 1.00 means there's no backup, and an arriving car will just go right on.
  • 1.00 means the bridge is exactly at capacity. All is still good, but if traffic gets a little heavier, things are going to slow down.
  • Over 1.00 means there's backup. How much? Well, 2.00 means that there are two lanes' worth of cars total: one lane's worth on the bridge, and one lane's worth waiting. 3.00 means there are three lanes' worth total: one lane's worth on the bridge, and two lanes' worth waiting. Etc.
• Best load average for a single-CPU system is between 0.7 and 0.8 (headroom)
• Different for multi-core systems: capacity scales with the number of cores (lanes)
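The bridge analogy translates into a few lines of arithmetic. The sketch below is illustrative: the helper names `load_per_cpu` and `lanes_waiting` are made up for this example, and the live reading at the end relies on `os.getloadavg()`, which is only available on Unix-like systems.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Normalize a load average by the number of CPUs (bridge lanes)."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def lanes_waiting(load_average, cpu_count):
    """Lanes' worth of work queued beyond what the CPUs can run at once."""
    return max(0.0, load_average - cpu_count)


# The single-lane examples from the analogy.
print(lanes_waiting(2.00, 1))  # one lane's worth waiting
print(lanes_waiting(3.00, 1))  # two lanes' worth waiting
# The same load of 3.00 on a 4-CPU machine leaves headroom.
print(load_per_cpu(3.00, 4))

# Live reading; skipped quietly where the call is unavailable.
try:
    one_minute, _, _ = os.getloadavg()
    print(load_per_cpu(one_minute, os.cpu_count() or 1))
except (AttributeError, OSError):
    pass
```

Dividing by the CPU count is what makes a load average comparable across machines of different sizes.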
NOTES ABOUT CPU UTILIZATION
• Each core on a CPU has its own utilization graph
• CPU utilization isn't straightforward
• Assume you have a single-core processor fixed at a frequency of 2.0 GHz. CPU utilization in this scenario is the percentage of time the processor spends doing work (as opposed to being idle). If this 2.0 GHz processor does 1 billion cycles' worth of work in a second, it is 50% utilized for that second.
• Current multi-core processors have dynamically changing frequencies, hardware multithreading, and shared caches, all of which affect reporting.
• Resource sharing makes monitoring CPU utilization difficult
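The 2.0 GHz example works out as follows. This is a minimal sketch: `utilization_percent` is a hypothetical helper, and the second call uses an assumed clock speed of 1.0 GHz purely to show why a changing frequency muddies the number.

```python
def utilization_percent(busy_cycles, frequency_hz, interval_s=1.0):
    """Percentage of available cycles spent doing work in the interval."""
    available_cycles = frequency_hz * interval_s
    return 100.0 * busy_cycles / available_cycles


# The slide's example: 1 billion busy cycles in one second at 2.0 GHz.
print(utilization_percent(1e9, 2.0e9))
# Same work, but the clock has scaled down to 1.0 GHz (hypothetical).
print(utilization_percent(1e9, 1.0e9))
```

The same amount of work reads as 50% or 100% depending on the clock speed in effect, which is one reason utilization graphs need careful interpretation.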
AVERAGE CLUSTER CPU UTILIZATION
[Figure: Before and After charts of average cluster CPU utilization]
WHAT ACTUALLY HAPPENED 1/2
• Batches work better when there is a coordinator dispatching batches to the correct data node without additional processing on the part of the data node
• One of the many downsides of vnodes is massive coordination requirements
• Removing coordination responsibilities from data nodes makes them more performant:
  • less context switching
  • less network traffic/gossip/GC
  • less CPU utilization
WHAT ACTUALLY HAPPENED 2/2
• 30 nodes * 256 tokens/node = 7,680 token ranges
• Queries go through a list of nearly 8,000 items: slow, with context switches and lots of GCable objects
• Considering just reads, at 30k requests per second, this is a significant reduction in work on a per-query basis
• We are able to tune the JVMs differently
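To give a feel for the size of that token list, here is a toy ring with 30 nodes and 256 tokens each. It is a sketch under assumptions, not Cassandra's internal lookup: random 64-bit integers stand in for partitioner tokens, and `build_ring` and `owner_of` are hypothetical helpers that use a binary search chosen for this example.

```python
import bisect
import random


def build_ring(nodes, tokens_per_node, seed=0):
    """Assign random tokens to nodes and return parallel sorted lists."""
    rng = random.Random(seed)
    pairs = sorted(
        (rng.randrange(-2**63, 2**63), node)
        for node in range(nodes)
        for _ in range(tokens_per_node)
    )
    tokens = [token for token, _ in pairs]
    owners = [node for _, node in pairs]
    return tokens, owners


def owner_of(tokens, owners, key_token):
    """Find the node owning key_token: first ring token >= key_token."""
    index = bisect.bisect_left(tokens, key_token)
    return owners[index % len(owners)]  # wrap around past the last token


tokens, owners = build_ring(nodes=30, tokens_per_node=256)
print(len(tokens))  # 30 * 256 token ranges
print(owner_of(tokens, owners, key_token=0))
```

Every coordinated query has to resolve an owner against a structure of this size, and at 30k requests per second that work adds up on whichever node is doing the coordinating.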
RESULTS
• Went from 72 nodes down to 30 nodes
• All data is now stored on AWS ST1 EBS volumes
• Works best for write-heavy workloads
• Roughly 300% increase in available and burstable capacity
• Smaller footprint to watch over: fewer machines, more roles
FEATURE NOT HACK
• Command line option or cassandra.yaml option for coordinator-only mode
• Code path shortcuts for performance
• Specific JMX beans around query coordination
• Allow query mutation by coordinator nodes (Lua?)
WHAT DID I SAY?
• Fat clients can save you money
• Don't start out with complexity
• Know the basics
• Know what your baseline measurements are
• Monitor everything
• Sometimes database tuning doesn't require making changes to the database
QUESTIONS IN LIFE ARE GUARANTEED,
ANSWERS AREN’T.
Eric Lubow
@elubow

Optimizing Your Cluster with Coordinator Nodes (Eric Lubow, SimpleReach) | Cassandra Summit 2016
