Diagnosing Problems in Production 
Jon Haddad, Technical Evangelist, DataStax 
Blake Eggleston, Software Developer, DataStax 
©2013 DataStax Confidential. Do not distribute without consent. 
Preventative Measures 
• OpsCenter 
• Metrics Integration 
• Munin 
• Monit 
• Nagios / Icinga 
• Graphite / Statsd (application level) 
• Variety of 3rd party monitoring services
Narrow Down the Problem 
• Weird consistency issues - NTP? 
• Last write wins - if servers have different time, which is the last write? 
• Problems with Streaming / Repair - version conflicts 
• Cleanup after you add nodes (reclaim disk space) 
• Slow queries 
• Compaction 
• Histograms 
• Tracing 
• Nodes flapping / failing 
• Check OpsCenter 
• Dig into system metrics 
• JVM GC issues
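
The "last write wins" point above can be made concrete with a small sketch. It is a simplified model, not Cassandra's implementation: the `Write` class, the `stamped` helper, and the skew values are hypothetical, and tie-breaking on equal timestamps is glossed over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_us: int  # timestamp attached to the write, in microseconds


def resolve(a, b):
    """Last write wins: keep whichever write carries the larger timestamp."""
    return a if a.timestamp_us >= b.timestamp_us else b


def stamped(value, true_time_us, clock_skew_us):
    """Hypothetical helper: stamp a write using a clock that is off by clock_skew_us."""
    return Write(value, true_time_us + clock_skew_us)


# The first write goes through a node whose clock runs 5 seconds fast.
first = stamped("old", true_time_us=1_000_000, clock_skew_us=5_000_000)
# The second write happens one second later on a node with an accurate clock.
second = stamped("new", true_time_us=2_000_000, clock_skew_us=0)

winner = resolve(first, second)
print(winner.value)  # prints "old": the earlier write wins because of the skew
```

With synchronized clocks (NTP) the skew term disappears and the later write wins as expected.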
Compaction 
• Compaction merges SSTables 
• Too much compaction? 
• OpsCenter provides insight into compaction cluster wide 
• nodetool 
• compactionhistory 
• getcompactionthroughput 
• Leveled vs Size Tiered 
• Leveled on SSD + Read Heavy 
• Size tiered on Spinning rust 
• Size tiered is great for write heavy time series workloads
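
As a rough illustration of "compaction merges SSTables", here is a minimal sketch that models each SSTable as a dict of key to `(timestamp, value)`. The `compact` function and the sample data are hypothetical; real compaction also deals with tombstones, TTLs, and strategy-specific file selection, none of which is modeled here.

```python
def compact(*sstables):
    """Merge several SSTables into one, keeping the newest cell for each key.

    Each SSTable is modeled as a dict: key -> (timestamp, value).
    The result is ordered by key, as an SSTable would be.
    """
    merged = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            # Keep the cell with the highest timestamp seen so far.
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    return dict(sorted(merged.items()))


older = {"a": (1, "x"), "b": (2, "y")}
newer = {"b": (5, "z"), "c": (3, "w")}

result = compact(older, newer)
print(result)  # {'a': (1, 'x'), 'b': (5, 'z'), 'c': (3, 'w')}
```

After the merge a read for `b` touches one table instead of two, which is the read-side benefit compaction is buying.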
System Utilities 
• iostat 
• disk level statistics 
• htop 
• process overview 
• iftop & netstat 
• network utilities 
• dstat 
• all the above in 1 tool 
• strace 
• …for the hardcore
Histograms 
• proxyhistograms 
• High level read and write times 
• Includes network latency 
• cfhistograms <keyspace> <table> 
• reports stats for single table on a single node 
• Used to identify tables with performance problems
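
To show how a latency histogram turns into the percentiles people usually quote, here is a minimal sketch. The bucket layout and counts are invented for illustration and are not actual `nodetool` output; `percentile` is a hypothetical helper.

```python
def percentile(buckets, p):
    """Return the upper bound of the bucket that contains the p-th percentile.

    buckets: list of (upper_bound_us, count), sorted by upper_bound_us.
    """
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("empty histogram")
    running = 0
    for upper_bound_us, count in buckets:
        running += count
        # Integer-friendly form of: running / total >= p / 100
        if running * 100 >= p * total:
            return upper_bound_us
    return buckets[-1][0]


# Made-up read latencies for one table: (upper bound in microseconds, count).
read_latency = [(100, 50), (200, 30), (500, 15), (2000, 5)]

for p in (50, 95, 99):
    print(p, percentile(read_latency, p))
```

A table whose high percentiles sit far above its median is a good candidate for a closer look with tracing.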
Query Tracing
JVM GC Overview 
• What is garbage collection? 
• Manual vs automatic memory management 
• Generational garbage collection (ParNew & CMS) 
• New Generation 
• Old Generation
New Generation 
• New objects are created in the new gen 
• Minor GC 
• Occurs when new gen fills up 
• Stop the world 
• Dead objects are removed 
• Live objects are promoted into old gen 
• Removing objects is fast, promoting objects is slow
Old Generation 
• Objects are promoted from new gen into old gen 
• Major GC 
• Occurs when the old generation fills up to some percentage 
• Mostly concurrent 
• 2 short stop the world pauses 
• Full GC 
• Occurs when old gen fills up or objects can’t be promoted 
• Stop the world 
• Collects all generations 
• These are bad!
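
The new gen / old gen flow described on the last two slides can be sketched as a toy model. This is not how the JVM is implemented: `ToyHeap` is hypothetical, capacities count objects rather than bytes, survivor spaces and the concurrent major GC are left out, and only minor and full collections are modeled.

```python
class ToyHeap:
    """Toy generational heap: capacities are object counts, not bytes."""

    def __init__(self, new_capacity, old_capacity):
        self.new_capacity = new_capacity
        self.old_capacity = old_capacity
        self.new_gen = []
        self.old_gen = []
        self.live = set()  # names of objects that are still referenced
        self.minor_gcs = 0
        self.full_gcs = 0

    def allocate(self, name):
        # New objects are created in the new gen; a full new gen triggers a minor GC.
        if len(self.new_gen) >= self.new_capacity:
            self.minor_gc()
        self.new_gen.append(name)
        self.live.add(name)

    def release(self, name):
        # The object becomes garbage; it is reclaimed by a later collection.
        self.live.discard(name)

    def minor_gc(self):
        self.minor_gcs += 1
        survivors = [obj for obj in self.new_gen if obj in self.live]
        self.new_gen = []  # dead objects are simply dropped
        if len(self.old_gen) + len(survivors) > self.old_capacity:
            self.full_gc()  # survivors cannot be promoted as things stand
        if len(self.old_gen) + len(survivors) > self.old_capacity:
            raise MemoryError("old gen is full even after a full GC")
        self.old_gen.extend(survivors)  # live objects are promoted

    def full_gc(self):
        self.full_gcs += 1
        self.old_gen = [obj for obj in self.old_gen if obj in self.live]


heap = ToyHeap(new_capacity=2, old_capacity=2)
heap.allocate("a")
heap.allocate("b")
heap.release("a")   # "a" dies young and never reaches the old gen
heap.allocate("c")  # minor GC: "b" is promoted
heap.allocate("d")
heap.release("b")   # "b" is now garbage sitting in the old gen
heap.allocate("e")  # minor GC cannot promote "c" and "d", so a full GC runs first
print(heap.minor_gcs, heap.full_gcs, heap.old_gen, heap.new_gen)
```

In the toy run, `b` is promoted and then dies in the old gen, which is exactly the kind of garbage that ends up forcing a full GC.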
GC Profiling 
• OpsCenter GC stats 
• Look for correlations between GC spikes and read/write latency 
• Cassandra GC Logging 
• Can be activated in cassandra-env.sh 
• jstat 
• prints gc activity
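
As a sketch of what can be done with `jstat` output, the snippet below parses text in the shape printed by `jstat -gcutil <pid> <interval>` and reports how the GC counters moved between the first and last sample. The sample text is made up for illustration (Java 7 column layout with `P` for perm gen; Java 8 prints `M` and `CCS` instead), and `parse_jstat_gcutil` and `gc_deltas` are hypothetical helpers that assume every field is numeric.

```python
SAMPLE = """
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  45.12  63.40  71.05  59.80    120    3.412     2    0.950    4.362
  0.00  45.12  88.91  71.05  59.80    120    3.412     2    0.950    4.362
 37.50   0.00   4.20  71.90  59.80    121    3.440     2    0.950    4.390
"""


def parse_jstat_gcutil(text):
    """Turn jstat -gcutil style text into a list of {column: float} dicts."""
    lines = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # Column names are taken from the header, so other layouts parse as well.
    return [dict(zip(header, (float(field) for field in row))) for row in rows]


def gc_deltas(samples):
    """Change in GC counts and cumulative GC time between first and last sample."""
    first, last = samples[0], samples[-1]
    return {key: last[key] - first[key] for key in ("YGC", "YGCT", "FGC", "FGCT")}


samples = parse_jstat_gcutil(SAMPLE)
deltas = gc_deltas(samples)
print(deltas)
```

Here `YGC` grows by one while `FGC` stays flat: one minor collection and no full GCs during the sampled window.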
GC Profiling 
• What to look out for: 
• Long, multi-second pauses 
• Caused by full GCs. The old gen is filling up faster than the concurrent GC can clean it up. This typically means garbage is being promoted out of the new gen too soon 
• Long minor GC 
• Many of the objects in the new gen are being promoted to the old gen. 
• Most commonly caused by new gen being too big 
• Sometimes caused by objects being promoted prematurely
Jon: @rustyrazorblade 
Blake: @blakeeggleston 

Diagnosing Problems in Production: Cassandra Summit 2014
