DataStax / Cassandra Data Modeling Strategies
Avoiding The Three Stooges: Wide Partitions, Tombstones, Data Skew
Rahul Xavier Singh, Anant Corporation
TOC
Core Concepts
Wide Partitions
Data Modeling
Synthetic Sharding
Key Design
Tombstones
Data Skew
Avoid tombstones
Business Platform Success
We build business success platforms,
which are collections of systems that
serve business processes that have
information needs for people.
Platform Thinking
How?
Project Information
Client Service Information
Corporate Guides
Collaborative Documents
Assets & Files
Corporate Assets
Business Platform
● Curate a framework of systems.
● Work with a vetted team of experts.
● Connect it all together.
● Focus on finding, analyzing, and acting on knowledge & communication towards business success.
Streamline. Organize. Unify. Business Platform
Who we help Succeed
Cassandra / DataStax
Core Concepts
Cassandra Architecture
Cluster / Data Centers
01
Cassandra is not for tiny data. Do you NEED:
1. Fast read and write of terabytes of data?
2. Replication / availability around the world?
3. Never go down, always up?
Don’t use Cassandra if:
1. You have only gigabytes of data.
2. Your application can chill in one datacenter.
3. Your system can go down whenever it wants.
4. You just want to be cool.
Cassandra Data Model
Keyspaces & Tables
02
Cassandra Tables / Column Families look like SQL Server /
MySQL / Postgres tables & databases. They are not.
1. CQL supports queries by Partition Key and an optional Clustering Key.
2. CQL does not support arbitrary queries on columns.
3. Cassandra shouldn’t be managing more than 100-150 tables across any number of keyspaces.
Cassandra Operations
Read / Write Paths
03
Cassandra does these things well.
1. Write: It first appends the data to a commit log, adds it to the memtable so it is available for reads, and later flushes the memtable to disk as immutable SSTables.
2. Read: It figures out which SSTables on the node may hold the data (a Bloom filter is involved), reads from those SSTables, and reconciles the immutable data + deletes into the latest data.
3. It spreads the load around the ring so that you can have hundreds of nodes doing this and not break a sweat: beast-like performance.
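The write and read paths above can be sketched as a toy, single-node model. This is an illustration only, not Cassandra's implementation: the `ToyNode` class, its `flush_threshold`, and the integer clock are all hypothetical, and a plain `key in table` check stands in for the Bloom filter.

```python
from dataclasses import dataclass, field

TOMBSTONE = object()  # marker standing in for a delete


@dataclass
class ToyNode:
    flush_threshold: int = 2                       # hypothetical: flush after N keys
    commit_log: list = field(default_factory=list)
    memtable: dict = field(default_factory=dict)
    sstables: list = field(default_factory=list)   # each flushed table is never modified
    clock: int = 0                                 # stand-in for write timestamps

    def write(self, key, value):
        self.clock += 1
        self.commit_log.append((self.clock, key, value))  # durable append first
        self.memtable[key] = (self.clock, value)          # then visible to reads
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def delete(self, key):
        # A delete is just another write: it records a tombstone.
        self.write(key, TOMBSTONE)

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))     # snapshot goes "to disk"
            self.memtable = {}

    def read(self, key):
        # Gather every version of the key, then keep the newest one.
        versions = [table[key] for table in self.sstables if key in table]
        if key in self.memtable:
            versions.append(self.memtable[key])
        if not versions:
            return None
        _, value = max(versions, key=lambda version: version[0])
        return None if value is TOMBSTONE else value


node = ToyNode()
node.write("a", 1)
node.write("b", 2)   # memtable reaches the threshold and is flushed
node.write("a", 3)   # newer version of "a" lives in the memtable
node.delete("b")     # tombstone shadows the flushed value of "b"
print(node.read("a"), node.read("b"), len(node.sstables))
```

The read has to look at every table that may contain the key and reconcile the versions, which is why many SSTables, wide partitions, and tombstones all make reads more expensive.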
Cassandra Operational
Pitfalls Visualized
Wide Partitions
01
Wide partitions will completely screw you over on reads
and can take a node out if there’s traffic.
1. Monitor using cfstats (CompactedPartitionMaximumBytes)
2. Monitor in system.log “Compacting large partition”
3. Monitor using toppartitions
4. Monitor using OpsCenter (if using DataStax)
Data Skew
02
Bad key design can lead to really, really bad data skew. In some cases, if the number of keys is only 1 or 2, all of the data lives in just one or two partitions (plus their replicas).
1. Monitor using cfstats (NumberOfKeys, SpaceUsedLive, ReadCounts, WriteCounts)
2. Monitor using OpsCenter (if using DataStax)
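To see why a handful of distinct keys produces skew, here is a minimal simulation. It is a sketch under stated assumptions: an MD5 hash modulo the node count stands in for Cassandra's partitioner and token ring, replication is ignored, and the six-node cluster and key values are made up.

```python
import hashlib


def owner(partition_key: str, node_count: int) -> int:
    # Assumption: MD5 modulo node count stands in for the real partitioner.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    # Count how many rows each node would own for the given partition keys.
    counts = {node: 0 for node in range(node_count)}
    for key in partition_keys:
        counts[owner(key, node_count)] += 1
    return counts


NODES = 6  # hypothetical cluster size
skewed = rows_per_node(["US" if i % 2 else "CA" for i in range(1000)], NODES)
spread = rows_per_node([f"user-{i}" for i in range(1000)], NODES)
print("two distinct keys:  ", skewed)
print("1000 distinct keys: ", spread)
```

With only two distinct keys, at most two nodes receive any rows no matter how many rows are written; with a thousand distinct keys the rows land on every node.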
Tombstones
03
How to check for tombstones.
1. Monitor using cfstats (*Tombstones)
2. Monitor using system.log (“Tombstone Warn Threshold”)
3. Monitor using OpsCenter (if using DataStax)
Cassandra Data Modeling
Best Practices
Good Key Design
01
Some things to NOT DO.
1. Avoid using Integer/Long keys unless you couple them with another column in a composite partition key (unless you can somehow show through realistic data generation that they won’t coalesce data on some nodes).
2. Avoid using Time/Date-based keys or TimeUUID unless you know for damn sure that you are going to continuously create data at a given interval all day, every day.
3. Don’t just import relational data and expect it to magically work.
Some things TO DO.
1. UUID will most likely work fine for any given table, but how do you find it again? You will need to have another table that has that information.
2. If you must use human-readable keys, you can use a synthetic sharding mechanism. Next slide.
3. You can combine known things and take a chance, but you should test with load: (String, Integer, String, Integer).
Some things to REMEMBER.
1. Clustering Keys don’t spread data around the cluster.
2. Remember that (PartitionKey, ClusteringKey) is different from ((PartitionKey1, PartitionKey2)).
3. Use realistic data: to properly scale Cassandra or any other system, you need to create realistic data.
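The difference between (PartitionKey, ClusteringKey) and ((PartitionKey1, PartitionKey2)) can be shown by grouping the same rows both ways. This is a sketch only: the `sensor_id` and `reading_hour` columns are hypothetical, and grouping by tuple stands in for how rows are assigned to partitions.

```python
from collections import defaultdict


def group_into_partitions(rows, partition_columns):
    # Rows that share the same partition-column values end up in one partition.
    partitions = defaultdict(list)
    for row in rows:
        partitions[tuple(row[col] for col in partition_columns)].append(row)
    return dict(partitions)


# Hypothetical data: 3 sensors, 24 hourly readings each.
rows = [
    {"sensor_id": s, "reading_hour": h, "value": s * 100 + h}
    for s in range(3)
    for h in range(24)
]

# PRIMARY KEY (sensor_id, reading_hour):
# sensor_id is the partition key; reading_hour only orders rows inside a partition.
by_sensor = group_into_partitions(rows, ["sensor_id"])

# PRIMARY KEY ((sensor_id, reading_hour)):
# both columns together form the partition key.
by_sensor_hour = group_into_partitions(rows, ["sensor_id", "reading_hour"])

print(len(by_sensor), "partitions of", len(by_sensor[(0,)]), "rows each")
print(len(by_sensor_hour), "partitions of", len(by_sensor_hour[(0, 0)]), "row each")
```

The first layout gives a few partitions that keep growing; the second gives many small partitions, at the cost of needing both values to read a row.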
Spreading Data via Synthetic Sharding
01
Sometimes you need to use the key that you have, which is human-readable, because that is the query path. How do you deal with that?
1. Primary Key: ((CountryName, StateName, CityName, CompanyName))
2. Integer Shard Added: ((CountryName, StateName, CityName, CompanyName, ShardNumber))
3. ShardNumber could be 1-10 or 1-100, depending on how badly your data is spreading.
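One way to derive ShardNumber is sketched below. The slides do not say how the shard is chosen, so this assumes each row has some identifier (the hypothetical `row_id`) that can be hashed with CRC32; the helper names are made up for illustration.

```python
import zlib

SHARD_COUNT = 10  # the slide suggests 1-10 or 1-100


def shard_for(row_id: str, shard_count: int = SHARD_COUNT) -> int:
    # Deterministic shard in 1..shard_count, derived from a per-row identifier.
    return zlib.crc32(row_id.encode("utf-8")) % shard_count + 1


def write_key(country, state, city, company, row_id, shard_count=SHARD_COUNT):
    # Partition key used when writing one row.
    return (country, state, city, company, shard_for(row_id, shard_count))


def read_keys(country, state, city, company, shard_count=SHARD_COUNT):
    # A read by the human-readable key must fan out over every shard.
    return [(country, state, city, company, s) for s in range(1, shard_count + 1)]


key = write_key("US", "DC", "Washington", "Anant", "order-42")
all_keys = read_keys("US", "DC", "Washington", "Anant")
print(key, "is one of", len(all_keys), "partitions to query")
```

The trade-off is on the read side: one logical lookup becomes `shard_count` partition reads, so the shard count should be only as large as the skew requires.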
Let’s say you are using a time-based key and notice coalescing around a particular time of day. You could consider the weekday itself as a part of the key.
1. Primary Key: (CreatedDate)
2. Week Day Number Added: ((CreatedDate, WeekDay))
3. WeekDay would be 0-6, mapped to Sunday-Saturday.
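The 0-6 Sunday-Saturday mapping needs a small shift in Python, because `date.weekday()` counts from Monday. A minimal sketch, with hypothetical helper names; note that a weekday derived from a full calendar date adds no extra partitions on its own, so this only illustrates computing the column.

```python
from datetime import date


def week_day_number(created: date) -> int:
    # date.weekday() is Monday=0..Sunday=6; shift so Sunday=0..Saturday=6.
    return (created.weekday() + 1) % 7


def partition_key(created: date):
    # ((CreatedDate, WeekDay)) as a tuple of partition-key values.
    return (created.isoformat(), week_day_number(created))


print(partition_key(date(2024, 1, 7)))  # 2024-01-07 was a Sunday
```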
Avoiding Tombstones
01
Just say no to Tombstones! The reason tombstones exist is to make it possible to do insanely fast writes and updates and still be able to send the data back performantly. (Side conversation on Queues as Anti-pattern)
1. There is no need to set null values or delete data actively.
2. You can always do soft deletes or use TTL values that expire data automatically.
3. Watch out for prepared statements sending nulls.
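Binding a null to a column in a prepared statement writes a tombstone for that column. One way to avoid it is to leave null columns out of the statement altogether. The helper below is a sketch, not a driver API: `build_insert`, the `shop.orders` table, and its columns are hypothetical, and it only builds the CQL text and parameter list.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_insert(
    table: str, row: Dict[str, Any], ttl_seconds: Optional[int] = None
) -> Tuple[str, List[Any]]:
    # Keep only the columns that actually have a value, so no null is ever bound.
    present = {col: val for col, val in row.items() if val is not None}
    if not present:
        raise ValueError("row has no non-null columns to insert")
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    if ttl_seconds is not None:
        # Let the data expire on its own instead of deleting it later.
        cql += f" USING TTL {int(ttl_seconds)}"
    return cql, list(present.values())


cql, params = build_insert(
    "shop.orders", {"id": 1, "note": None, "status": "new"}, ttl_seconds=86400
)
print(cql)
print(params)
```

The cost of this approach is one prepared statement per combination of present columns, so it fits best when rows have only a few optional columns.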
Questions?
We’re Partnering / Hiring
1. Professional Services
Datastax, Sitecore, Spark, Docker, Solr, Cassandra, Kafka, Elastic, AWS, Azure
2. Digital Services
React/Angular, TypeScript, ASP.NET, Node, Python
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037
Data & Analytics
Cassandra, DataStax, Kafka, Spark
Customer Experience
Sitecore
Information Systems
Salesforce, Quickbooks, and more
