©2013 DataStax Confidential. Do not distribute without consent.
Jon Haddad, Technical Evangelist
@rustyrazorblade
Diagnosing Problems in Production
First Step: Preparation
DataStax OpsCenter
• Will help with 90% of problems you encounter
• Should be the first place you look when there's an issue
• Community version is free
• Enterprise version has additional features
Server Monitoring & Alerts
• Monit
• monitor processes
• monitor disk usage
• send alerts
• Munin / collectd
• system perf statistics
• Nagios / Icinga
• Various 3rd party services
• Use whatever works for you
Application Metrics
• Statsd / Graphite
• Grafana
• Gather constant metrics from your application
• Measure anything & everything
• Microtimers, counters
• Graph events
• user signup
• error rates
• Cassandra Metrics Integration
• jmxtrans
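
A minimal sketch of the counters and timers mentioned above, using the plain-text StatsD line format (name:value|type) sent over UDP. The metric names, host, and port 8125 are assumptions for illustration, not values from the talk.

```python
import socket


def statsd_line(name: str, value: int, metric_type: str) -> str:
    """Format one metric as a StatsD line: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


def send_metric(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send. Host and port are assumed defaults."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


# "c" is a counter, "ms" is a timer in milliseconds.
signup = statsd_line("app.user.signup", 1, "c")          # hypothetical name
query_time = statsd_line("app.query.get_user", 12, "ms")  # hypothetical name

print(signup)
print(query_time)
```

Calling send_metric(signup) from the signup code path would be enough to graph the event, assuming a StatsD daemon is listening on that address.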
Log Aggregation
• Hosted - Splunk, Loggly
• OSS - Logstash + Kibana, Graylog
• Many more…
• For best results all logs should be aggregated here
• Oh yeah, and log your errors.
Gotchas
Incorrect Server Times
• Everything is written with a timestamp
• Last write wins
• Usually supplied by coordinator
• Can also be supplied by client
• What if your timestamps are wrong because your clocks are off?
• Always install ntpd!
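[Diagram: two servers with skewed clocks, one reading 10 and one reading 20. An INSERT at real time 12 is stamped 20; a DELETE at real time 15 is stamped 10. Result: insert:20, delete:10, so the insert wins and the delete is lost.]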
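
The clock-skew problem can be sketched as a toy last-write-wins reconciliation. The Cell type and reconcile function are hypothetical names for illustration, and tie-breaking is simplified; this is not Cassandra's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    value: Optional[str]  # None stands in for a tombstone (a delete)
    timestamp: int        # write timestamp assigned by the coordinator


def reconcile(a: Cell, b: Cell) -> Cell:
    """Last write wins: keep the cell with the higher timestamp.

    Ties go to b here, which is a simplification.
    """
    return a if a.timestamp > b.timestamp else b


# Real-time order: INSERT at 12, then DELETE at 15.
# The coordinators' clocks are skewed, so the timestamps tell another story.
insert = Cell(value="hello", timestamp=20)  # coordinator clock reads 20
delete = Cell(value=None, timestamp=10)     # coordinator clock reads 10

winner = reconcile(insert, delete)
print(winner.value)  # the insert survives even though the delete came later
```

With correct clocks (timestamps 12 and 15) the same function returns the tombstone, which is the behavior the application expected.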
Tombstones
• Tombstones are a marker that data no longer exists
• Tombstones have a timestamp just like normal data
• They say "at time X, this no longer exists"
Tombstone Hell
• Queries on partitions with a lot of tombstones require a lot of filtering
• This can be reaaaaaaally slow
• Consider:
• 100,000 rows in a partition
• 99,999 are tombstones
• How long to get a single row?
• Cassandra is not a queue!
[Diagram: a read scans 99,999 tombstones before it finally gets the right data.]
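
A toy model of the scenario above: a partition is represented as a Python list and tombstones as None. This only illustrates why the read has to filter every tombstone before reaching live data; it says nothing about Cassandra's real storage format.

```python
from typing import List, Optional, Tuple

TOMBSTONE = None  # stand-in marker for a deleted row


def read_first_live(partition: List[Optional[str]]) -> Tuple[Optional[str], int]:
    """Scan rows in order; return (first live row, tombstones skipped)."""
    skipped = 0
    for row in partition:
        if row is TOMBSTONE:
            skipped += 1
            continue
        return row, skipped
    return None, skipped


# 100,000 rows in the partition, 99,999 of them tombstones.
partition = [TOMBSTONE] * 99_999 + ["live row"]
row, skipped = read_first_live(partition)
print(row, skipped)
```

A queue-style workload produces exactly this shape: consumed items become tombstones at the head of the partition, and every read has to step over them.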
Not using a Snitch
• Snitch lets us distribute data in a fault tolerant way
• Changing this with a large cluster is time consuming
• Dynamic Snitching
• use the fastest replica for reads
• RackInferringSnitch (uses IP to pick replicas)
• DC aware
• PropertyFileSnitch (cassandra-topology.properties)
• Ec2Snitch & Ec2MultiRegionSnitch
• GoogleCloudSnitch
• GossipingPropertyFileSnitch (recommended)
Version Mismatch
• SSTable format changed between versions, making streaming incompatible
• Version mismatch can break bootstrap, repair, and decommission
• Introducing new nodes? Stick w/ the same version
• Upgrade nodes in place
• One at a time
• One rack / AZ at a time (requires proper snitch)
Disk Space not Reclaimed
• When you add new nodes, data is streamed from existing nodes
• … but it's not deleted from them after
• You need to run a nodetool cleanup
• Otherwise you'll run out of space just by adding nodes
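
A back-of-the-envelope sketch of how much disk an original node keeps holding until cleanup runs. It assumes evenly distributed token ownership and an unchanged replication factor; the function names and the 800 GB figure are hypothetical.

```python
def stale_fraction(nodes_before: int, nodes_after: int) -> float:
    """Fraction of an original node's data it no longer owns after expansion.

    Assumes even ownership and a constant replication factor.
    """
    if nodes_before <= 0 or nodes_after < nodes_before:
        raise ValueError("expected 0 < nodes_before <= nodes_after")
    return 1 - nodes_before / nodes_after


def reclaimable_gb(disk_used_gb: float, nodes_before: int, nodes_after: int) -> float:
    """Disk space a cleanup could free on one original node."""
    return disk_used_gb * stale_fraction(nodes_before, nodes_after)


# Doubling a 4-node cluster: each original node still stores data it
# handed off to the new nodes until cleanup removes it.
print(reclaimable_gb(800, 4, 8))
```

Under these assumptions, half of each original node's data is stale after the cluster doubles, which is why adding nodes alone does not relieve disk pressure.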
Using Shared Storage
• Single point of failure
• High latency
• Expensive
• Performance is about latency
• Can increase throughput with more disks
• Avoid EBS, SAN, NAS
Compaction
• Compaction merges SSTables
• Too much compaction?
• OpsCenter provides insight into compaction cluster-wide
• nodetool
• compactionhistory
• getcompactionthroughput
• Leveled vs Size Tiered
• Leveled on SSD + Read Heavy
• Size tiered on Spinning rust
• Size tiered is great for write heavy time series workloads
Diagnostic Tools
htop
• Process overview - nicer than top
iostat
• Disk stats
• Queue size, wait times
• Ignore %util
vmstat
• virtual memory statistics
• Am I swapping?
• Reports at an interval, to an optional count
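
To answer "Am I swapping?" from captured vmstat output, the si (swap in) and so (swap out) columns can be pulled out by header name. The sample text below is invented for illustration and follows the usual Linux procps layout; other platforms may print different columns.

```python
from typing import List, Tuple

SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812340  10240 904512    0    0     5    12  110  220  3  1 96  0  0
 2  1  20480 102400   8192 512000  128  256   900  1400  450  800 20 10 50 20  0
"""


def swap_activity(vmstat_output: str) -> List[Tuple[int, int]]:
    """Return (si, so) for each sample row in vmstat output."""
    rows = [line.split() for line in vmstat_output.strip().splitlines()]
    # Locate the column-name row instead of hard-coding positions.
    header = next(cols for cols in rows if "si" in cols and "so" in cols)
    si, so = header.index("si"), header.index("so")
    return [
        (int(cols[si]), int(cols[so]))
        for cols in rows
        # Data rows start with a number; this skips any repeated headers.
        if len(cols) > so and cols[0].isdigit()
    ]


samples = swap_activity(SAMPLE)
swapping = any(si > 0 or so > 0 for si, so in samples)
print(samples, swapping)
```

Non-zero si or so on a Cassandra node usually deserves attention, since swapping stalls the JVM.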
dstat
• Flexible look at network, CPU, memory, disk
strace
• What is my process doing?
• See all system calls
• Filterable with -e
• Can attach to running processes
tcpdump
• Watch network traffic
nodetool tpstats
• What's blocked?
• MemtableFlushWriter? - Slow disks!
• also leads to GC issues
• Dropped mutations?
• need repair!
Histograms
• proxyhistograms
• High level read and write times
• Includes network latency
• cfhistograms <keyspace> <table>
• reports stats for a single table on a single node
• Used to identify tables with performance problems
Query Tracing
JVM Garbage Collection
JVM GC Overview
• What is garbage collection?
• Manual vs automatic memory management
• Generational garbage collection (ParNew & CMS)
• New Generation
• Old Generation
New Generation
• New objects are created in the new gen (eden)
• Comprised of Eden & 2 survivor spaces (SurvivorRatio)
• Space identified by HEAP_NEWSIZE in cassandra-env.sh
• Historically limited to 800MB
Minor GC
• Occurs when Eden fills up
• Stop the world
• Dead objects are removed
• Copy current survivor to empty survivor
• Live objects are promoted into survivor (S0 & S1) then old gen
• Survivor objects promoted to old gen (MaxTenuringThreshold)
• Spillover promoted to old gen
• Removing objects is fast, promoting objects is slow
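
The promotion rules above can be sketched as a toy model of one minor collection. Objects are counted rather than sized in bytes, and the choice to spill the oldest objects first is an assumption; a real JVM is considerably more involved.

```python
from typing import List, Tuple


def minor_gc(
    eden_live: int,
    survivor_ages: List[int],
    survivor_capacity: int,
    max_tenuring_threshold: int,
) -> Tuple[List[int], int]:
    """Simulate one minor GC.

    eden_live: live objects in eden (dead ones are simply dropped).
    survivor_ages: ages of live objects in the current survivor space.
    Returns (ages in the new survivor space, objects promoted to old gen).
    """
    # Every live object ages by one collection; eden objects start at age 1.
    aged = [age + 1 for age in survivor_ages] + [1] * eden_live
    # Objects past the tenuring threshold are promoted to old gen.
    tenured = [age for age in aged if age > max_tenuring_threshold]
    staying = sorted(age for age in aged if age <= max_tenuring_threshold)
    # Whatever does not fit in the empty survivor space spills into old gen.
    kept = staying[:survivor_capacity]
    spilled = staying[survivor_capacity:]
    return kept, len(tenured) + len(spilled)


kept, promoted = minor_gc(
    eden_live=3, survivor_ages=[1, 1, 4], survivor_capacity=4, max_tenuring_threshold=4
)
print(kept, promoted)
```

In this run one object is promoted for exceeding the threshold and one spills over because the survivor space is full, so two objects reach old gen.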
Old Generation
• Objects are promoted from new gen to old gen
• Major GC
• Mostly concurrent
• 2 short stop the world pauses
Full GC
• Occurs when old gen fills up or objects can’t be promoted
• Stop the world
• Collects all generations
• Defragments old gen
• These are bad!
• Massive pauses
Workload 1: Write Heavy
• Objects promoted: Memtables
• New gen too big
• Remember: promoting objects is slow!
• Huge new gen = potentially a lot of promotion
[Diagram: new gen to old gen, with too much promotion.]
Workload 2: Read Heavy
• Short lived objects being promoted into old gen
• Lots of minor GCs
• Read heavy workloads on SSD
• Results in frequent full GC
[Diagram: early promotion from new gen to old gen, which fills up quickly with short-lived objects.]
GC Profiling
• OpsCenter GC stats
• Look for correlations between GC spikes and read/write latency
• Cassandra GC Logging
• Can be activated in cassandra-env.sh
• jstat
• prints gc activity
GC Profiling
• What to look out for:
• Long, multi-second pauses
• Caused by Full GCs. Old gen is filling up faster than the concurrent GC can keep up with it. Typically means garbage is being promoted out of the new gen too soon
• Long minor GC
• Many of the objects in the new gen are being promoted to the old gen.
• Most commonly caused by new gen being too big
• Sometimes caused by objects being promoted prematurely
How much does it matter?
Stuff is broken, fix it!
Narrow Down the Problem
• Is it even Cassandra? Check your metrics!
• Nodes flapping / failing
• Check OpsCenter
• Dig into system metrics
• Slow queries
• Find your bottleneck
• Check system stats
• JVM GC
• Compaction
• Histograms
• Tracing
