Cassandra tuning - above and beyond
Matija Gobec
Co-founder & Senior Consultant @ SmartCat.io
© DataStax, All Rights Reserved.
Why this talk
We were challenged with an interesting requirement…
“99.999%”
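The slide does not say whether "99.999%" refers to availability or to the share of requests that must meet a latency target. The sketch below works out the budget under both readings; the function names are hypothetical and only the percentage comes from the slide.

```python
from decimal import Decimal

TARGET_PERCENT = Decimal("99.999")  # the figure quoted on the slide


def allowed_misses(total_requests: int, target_percent: Decimal = TARGET_PERCENT) -> int:
    """Requests that may miss the target out of total_requests (latency reading)."""
    miss_fraction = 1 - target_percent / 100
    return int(total_requests * miss_fraction)


def allowed_downtime_seconds(period_seconds: int, target_percent: Decimal = TARGET_PERCENT) -> Decimal:
    """Downtime budget over a period (availability reading)."""
    return period_seconds * (1 - target_percent / 100)


if __name__ == "__main__":
    # Out of one million requests, how many may miss the target?
    print(allowed_misses(1_000_000))
    # Downtime budget for a 365-day year, in seconds.
    print(allowed_downtime_seconds(365 * 24 * 3600))
```

Either way the budget is small: 10 requests per million, or a little over five minutes per year.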
1 Initial investigation and setup
2 Metrics and reporting
3 Test setup
4 AWS deployment
5 Did we make it?
What makes a distributed system?
A bunch of stuff that magically works together
How to start?
Investigate the current setup (if any)
Understand your use case
Understand your data
Set a base configuration
Define target performance (goal)
Initial investigation
• What type of deployment are you working with?
• What is the available hardware?
• CPU cores and threads
• Memory amount and type
• Storage size and type
• Network interfaces amount and type
• Limitations
Hardware and setup
Hardware configuration
8-16 cores
32 GB RAM
Commit log SSD
Data drive SSD
10GbE
Placement groups
Availability zones
Enhanced networking
OS - Swap, storage, cpu
1. Swap is bad
• remove swap from fstab
• disable swap: swapoff -a
2. Optimize block layer
• echo 1 > /sys/block/XXX/queue/nomerges
• echo 8 > /sys/block/XXX/queue/read_ahead_kb
• echo deadline > /sys/block/XXX/queue/scheduler
3. Disable CPU frequency scaling
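A minimal sketch for verifying that the block-layer values from this slide are actually in effect on a node. The device name stays a placeholder, as in the slide, and the helper names are hypothetical. Note that the scheduler file lists all available schedulers and marks the active one in brackets; on newer multi-queue kernels the deadline scheduler appears as mq-deadline, so the wanted value may need adjusting.

```python
from pathlib import Path

# Values taken from the slide.
WANTED = {"nomerges": "1", "read_ahead_kb": "8", "scheduler": "deadline"}


def active_scheduler(raw: str) -> str:
    """Return the scheduler shown in [brackets], e.g. 'noop [deadline] cfq'."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return raw.strip()


def check_block_device(device: str, sys_block: Path = Path("/sys/block")) -> dict:
    """Map each mismatching setting to a (current, wanted) pair."""
    mismatches = {}
    for name, wanted in WANTED.items():
        raw = (sys_block / device / "queue" / name).read_text()
        current = active_scheduler(raw) if name == "scheduler" else raw.strip()
        if current != wanted:
            mismatches[name] = (current, wanted)
    return mismatches
```

Calling check_block_device with the real device name in place of XXX returns an empty dict when all three values match.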
sysctl.d - network
# TCP read buffer sizes in bytes (min, default, max)
net.ipv4.tcp_rmem = 4096 87380 16777216
# TCP write buffer sizes in bytes (min, default, max)
net.ipv4.tcp_wmem = 4096 65536 16777216
# disable explicit congestion notification
net.ipv4.tcp_ecn = 0
# enable window scaling (higher throughput)
net.ipv4.tcp_window_scaling = 1
# allowed local port range
net.ipv4.ip_local_port_range = 10000 65535
# enable fast time-wait recycle (unsafe behind NAT; removed in Linux 4.12)
net.ipv4.tcp_tw_recycle = 1
# max socket receive buffer in bytes
net.core.rmem_max = 16777216
# max socket send buffer in bytes
net.core.wmem_max = 16777216
# max length of the listen queue for incoming connections
net.core.somaxconn = 4096
# max packets queued when an interface receives faster than the kernel can process
net.core.netdev_max_backlog = 16384
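One way to reason about the 16 MB buffer ceilings is the bandwidth-delay product: the amount of data that can be in flight on a link before the first acknowledgement returns. The sketch below is an illustration only; the 10 ms round-trip time is an assumed example value, not a figure from the talk, and the function names are hypothetical.

```python
def bdp_bytes(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """Bandwidth-delay product in bytes for a link speed and round-trip time."""
    return bandwidth_bits_per_s * rtt_ms // 8000


def max_rtt_ms(buffer_bytes: int, bandwidth_bits_per_s: int) -> float:
    """Longest round trip a buffer of this size can keep fully utilised."""
    return buffer_bytes * 8 * 1000 / bandwidth_bits_per_s


if __name__ == "__main__":
    ten_gbe = 10_000_000_000  # 10GbE, as on the hardware slide
    print(bdp_bytes(ten_gbe, 10))          # bytes in flight at an assumed 10 ms RTT
    print(max_rtt_ms(16_777_216, ten_gbe)) # RTT covered by the 16 MB maximum
```

At 10GbE the 16 MB maximum covers round trips of roughly 13 ms, which is ample within a data center.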
sysctl.d - vm and fs
# memory swapping threshold
vm.swappiness = 1
# max memory map areas a process can have
vm.max_map_count = 1073741824
# dirty memory in bytes at which background kernel writeback starts
vm.dirty_background_bytes = 10485760
# dirty memory in bytes at which a writing process starts flushing itself
vm.dirty_bytes = 1073741824
# max number of open files
fs.file-max = 1073741824
# min number of VM free kilobytes
vm.min_free_kbytes = 1048576
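Before rolling values like these out, it can help to parse the file and check a few relationships between them. The sketch below is a rough illustration: the parser handles plain key = value lines with # or ; comment lines, and the 10% ceiling on vm.min_free_kbytes is a hypothetical rule of thumb chosen for the example, not a recommendation from the talk.

```python
SLIDE_VALUES = """
vm.swappiness = 1
vm.max_map_count = 1073741824
vm.dirty_background_bytes = 10485760
vm.dirty_bytes = 1073741824
fs.file-max = 1073741824
vm.min_free_kbytes = 1048576
"""


def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines, skipping blanks and '#' or ';' comment lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_vm_settings(settings: dict, ram_kb: int) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []
    background = int(settings["vm.dirty_background_bytes"])
    hard_limit = int(settings["vm.dirty_bytes"])
    if background >= hard_limit:
        problems.append("vm.dirty_background_bytes should be below vm.dirty_bytes")
    reserve_kb = int(settings["vm.min_free_kbytes"])
    if reserve_kb * 10 > ram_kb:  # hypothetical 10% ceiling, for illustration
        problems.append("vm.min_free_kbytes reserves more than 10% of RAM")
    return problems


if __name__ == "__main__":
    # 32 GB of RAM, as on the hardware slide.
    print(check_vm_settings(parse_sysctl(SLIDE_VALUES), 32 * 1024 * 1024))
```

The 1 GB reserve is sized for the 32 GB machines from the hardware slide; on a much smaller instance the same value would take a large share of memory.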
JVM - CMS
MAX_HEAP_SIZE="8G" # Good starting point
HEAP_NEWSIZE="2G" # Good starting point
JVM_OPTS="$JVM_OPTS -XX:+PerfDisableSharedMem"
JVM_OPTS="$JVM_OPTS -XX:-UseBiasedLocking"
# Tunable settings
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=2"
JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=16"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions"
JVM_OPTS="$JVM_OPTS -XX:ParGCCardsPerStrideChunk=4096"
# Instagram settings
JVM_OPTS="$JVM_OPTS -XX:+CMSScavengeBeforeRemark"
JVM_OPTS="$JVM_OPTS -XX:CMSMaxAbortablePrecleanTime=60000"
JVM_OPTS="$JVM_OPTS -XX:CMSWaitDuration=30000"
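To see what SurvivorRatio=2 does to a 2 GB young generation, recall that HotSpot defines SurvivorRatio as the size of eden divided by the size of one survivor space, and that the young generation holds eden plus two survivor spaces. A small sketch of that arithmetic (the function name is hypothetical, and the JVM's own alignment may shift the real sizes slightly):

```python
def young_gen_layout(new_size_mb: int, survivor_ratio: int) -> dict:
    """Split the young generation into eden and two equal survivor spaces.

    SurvivorRatio = eden / one survivor space, so
    young = eden + 2 * survivor = (ratio + 2) * survivor.
    """
    survivor_mb = new_size_mb // (survivor_ratio + 2)
    eden_mb = new_size_mb - 2 * survivor_mb
    return {"eden_mb": eden_mb, "survivor_mb": survivor_mb}


if __name__ == "__main__":
    print(young_gen_layout(2048, 2))  # HEAP_NEWSIZE and SurvivorRatio from the slide
    print(young_gen_layout(2048, 8))  # a ratio of 8, for comparison
```

With the slide's values each survivor space gets 512 MB and eden 1 GB, which gives objects more room to age in the survivor spaces before being promoted.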
JVM - G1GC
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=16" # Set to number of full cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=16" # Set to number of full cores
Cassandra
concurrent_reads: 128
concurrent_writes: 128
concurrent_counter_writes: 128
memtable_allocation_type: heap_buffers
memtable_flush_writers: 8
memtable_cleanup_threshold: 0.15
memtable_heap_space_in_mb: 2048
memtable_offheap_space_in_mb: 2048
trickle_fsync: true
trickle_fsync_interval_in_kb: 1024
internode_compression: dc
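A rough sketch of how the memtable settings interact, assuming the behaviour described in the cassandra.yaml comments for the 2.1 to 3.x line: the cleanup threshold defaults to 1 / (memtable_flush_writers + 1), and a flush of the largest memtable is triggered once occupied memtable space exceeds the threshold times the total permitted space. Treat the formula as an assumption to verify against the version in use; the function names are hypothetical.

```python
def default_cleanup_threshold(flush_writers: int) -> float:
    """Default threshold when memtable_cleanup_threshold is not set (assumed formula)."""
    return 1 / (flush_writers + 1)


def flush_trigger_mb(heap_space_mb: int, offheap_space_mb: int, threshold: float) -> float:
    """Occupied memtable space, in MB, at which a flush is triggered."""
    return (heap_space_mb + offheap_space_mb) * threshold


if __name__ == "__main__":
    # Values from the slide: 8 flush writers, 2048 MB heap + 2048 MB off-heap, threshold 0.15.
    print(default_cleanup_threshold(8))
    print(flush_trigger_mb(2048, 2048, 0.15))
```

Under that assumption the explicit 0.15 is higher than the default implied by 8 flush writers (about 0.11), so memtables grow larger before being flushed.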
Data model and compaction strategy
Data model
Data model impacts performance a lot
Optimize so that you read from one partition
Make sure your data can be distributed
SSTable compression depending on the use case
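As an illustration of "read from one partition" while keeping data distributable, a common approach is to add a time bucket to the partition key so that one query touches one bounded partition and partitions spread across the cluster. The sensor-reading model below is entirely hypothetical and not taken from the talk.

```python
from datetime import datetime, timezone


def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Hypothetical partition key: one partition per sensor per UTC day."""
    day_bucket = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day_bucket)


if __name__ == "__main__":
    morning = datetime(2016, 9, 8, 6, 0, tzinfo=timezone.utc)
    evening = datetime(2016, 9, 8, 23, 30, tzinfo=timezone.utc)
    next_day = datetime(2016, 9, 9, 0, 5, tzinfo=timezone.utc)
    # Readings from the same day share a partition; the next day starts a new one.
    print(partition_key("sensor-1", morning) == partition_key("sensor-1", evening))
    print(partition_key("sensor-1", next_day))
```

The bucket size is a trade-off: too coarse and partitions grow large, too fine and a typical query has to read several partitions.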
Compaction strategy
1. Size tiered compaction strategy
• Good as a default
• Performance and size constraints
2. Leveled compaction strategy
• Great for low latency read requirements
• Constant compactions
3. Date tiered / Time window compaction strategy
• Good fit for time series use cases
Ok, what now?
After we set the base configuration it’s time for testing and observing
Metrics and reporting stack
OS metrics (SmartCat)
Metrics reporter config (AddThis)
Cassandra diagnostics (SmartCat)
Filebeat
Riemann
InfluxDB
Grafana
Elasticsearch
Logstash
Kibana
Grafana
Kibana
Slow queries
Track query execution times above some threshold
Gain insights into the long processing queries
Relate that to what’s going on on the node
Compare app and cluster slow queries
https://github.com/smartcat-labs/cassandra-diagnostics
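On the application side, threshold-based slow query tracking can be sketched as a thin timing wrapper around each call. This is a generic illustration and not how cassandra-diagnostics is implemented; the class and its API are hypothetical.

```python
import time


class SlowQueryLog:
    """Record calls whose execution time exceeds a threshold in milliseconds."""

    def __init__(self, threshold_ms: float, clock=time.perf_counter):
        self.threshold_ms = threshold_ms
        self.clock = clock  # injectable so tests can use a fake clock
        self.entries = []   # (statement, elapsed_ms) pairs

    def timed(self, statement: str, run):
        """Run the zero-argument callable and log it if it was slow."""
        start = self.clock()
        try:
            return run()
        finally:
            elapsed_ms = (self.clock() - start) * 1000
            if elapsed_ms > self.threshold_ms:
                self.entries.append((statement, elapsed_ms))


if __name__ == "__main__":
    log = SlowQueryLog(threshold_ms=10)
    log.timed("SELECT ... (slow)", lambda: time.sleep(0.05))
    log.timed("SELECT ... (fast)", lambda: None)
    print([statement for statement, _ in log.entries])
```

Comparing entries recorded this way with the slow queries reported by the cluster shows how much time is spent outside the node, for example in the network or the driver.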
Slow queries - cluster
Slow queries - cluster vs app
OpsCenter
Pros:
Great when starting out
Everything you need in a nice GUI
Cluster metrics
Cons:
Metrics stored in the same cluster
Issues with some of the services (repair, slow query,...)
Additional agents on the nodes
Test setup
Make sure you have repeatable tests
Fixed rate tests
Variable rate tests
Production-like tests
Cassandra Stress
Various loadgen tools (gatling, wrk, loader,...)
Coordinated omission
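The slide only names the problem, so here is a small deterministic simulation of it under stated assumptions: a single blocking client is supposed to send one request every interval, but a slow response delays the requests queued behind it. Measuring from the actual send time hides that delay; measuring from the intended send time exposes it. All numbers are made up for the illustration.

```python
def simulate(interval_ms: int, service_ms: list) -> tuple:
    """Return (naive, corrected) latencies for a single blocking client.

    service_ms[i] is how long request i takes once it is actually sent.
    naive     = completion time - actual send time
    corrected = completion time - intended send time on the fixed-rate schedule
    """
    naive, corrected = [], []
    free_at = 0  # time at which the client can send its next request
    for i, service in enumerate(service_ms):
        intended = i * interval_ms
        sent = max(intended, free_at)  # a stalled client sends late
        done = sent + service
        naive.append(done - sent)
        corrected.append(done - intended)
        free_at = done
    return naive, corrected


if __name__ == "__main__":
    # One 100 ms stall among 1 ms responses, at a target of one request per 10 ms.
    naive, corrected = simulate(10, [1, 100, 1, 1, 1])
    print(naive)
    print(corrected)
```

In the naive view only one request looks slow; in the corrected view every request scheduled during the stall carries the waiting time as well, which is what a real user would have experienced.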
Tuning methodology
AWS
AWS deployment
Choose your instance based on calculations
Use placement groups and availability zones
Don’t overdo it just because you can ($$$)
Are you sure you need ephemeral storage?
Go for EBS volumes (gp2)
EBS volumes
Pros:
A 3.4 TB+ volume has 10,000 IOPS
Average latency is ~0.38ms
Durable across reboots
AWS snapshots
Can be attached/detached
Easy to recreate
Cons:
Rare latency spikes
Average latency is ~0.38ms
Degrading factor
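The 10,000 IOPS figure follows from how gp2 volumes are provisioned: baseline performance scales at 3 IOPS per GiB with a floor of 100 IOPS, up to a per-volume cap. The sketch assumes the 10,000 IOPS cap quoted on the slide (AWS has since raised it) and ignores burst credits on small volumes; the function name is hypothetical.

```python
def gp2_baseline_iops(size_gib: int, cap: int = 10_000) -> int:
    """Baseline IOPS for a gp2 volume: 3 per GiB, at least 100, at most cap."""
    return min(max(100, 3 * size_gib), cap)


if __name__ == "__main__":
    for size_gib in (20, 1000, 3334):
        print(size_gib, gp2_baseline_iops(size_gib))
```

A volume of about 3,334 GiB is the smallest that reaches the cap, which is why sizing the volume for IOPS rather than for data size alone can make sense.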
EBS volumes - problems
End result
Did we meet our goal?
Can we go any further?
What's next?
Torture testing
Failure scenarios
Latency and delay inducers
Automate everything
Q&A
Thank you
Matija Gobec
matija@smartcat.io
@mad_max0204
smartcat-labs.github.io
smartcat.io

Cassandra Tuning - Above and Beyond (Matija Gobec, SmartCat) | Cassandra Summit 2016
