Migrating from Relational to
Cassandra ( SQL to CQL )
Rahul Xavier Singh Anant Corporation
TOC
Core Concepts
Detect Bad Models
Data Modeling in Cassandra
Synthetic Sharding
Key Design
Common Patterns
Avoid tombstones
Business Platform Success
We build realtime business platforms,
connecting customer experiences,
information systems with realtime data
& analytics platforms like … Cassandra,
Kafka, Spark
Platform Thinking
How?
[Diagram: Project Information, Client Service Information, Corporate Guides, Collaborative Documents, Assets & Files, Corporate Assets]
Business Platform
● Curate framework of
systems.
● Work with a vetted team
of experts.
● Connect it all together.
● Focus on finding,
analyzing, and acting on
knowledge &
communication towards
business success.
Streamline. Organize. Unify. Business Platform
Who we help Succeed
Differences between
Relational and Cassandra
Typical Use Cases in
RDBMS / Relational
01 Relational Use Cases
1. Master / Detail - 1 to N | has
2. Document Attributes - 1 to 1 | is/has
3. Lookup - n to 1 | is/ is part of
4. Connection - n to n | is /is related to
/has
5. ….
Why Cassandra
01
1. Familiar Structure: CQL tables ~ SQL
tables
2. Familiar Query language : CQL ~ SQL
3. Schema-Constrained Queries: no arbitrary queries, joins, or general-purpose transactions
4. CQL works like a subset of SQL: CQL < SQL
Differences between
RDBMS / Cassandra
01 SQL / Relational / RDBMS
1. Reduce redundancy
2. Store once / Relate / Query
3. ACID : “Atomicity, Consistency,
Isolation, and Durability”
4. Immediate consistency
5. Structured with types
6. Set schema for all Rows
7. Scale: master/slave, limited scale
8. Joins, Views, Arbitrary Queries
CQL / Non-Relational / Cassandra
1. Store as often as you need.
Redundancy okay!
2. Duplicate as needed.
3. Predefined Queries (no Joins)
4. BASE : “Basically Available, Soft
state, Eventual consistency”
5. Tunable consistency
6. Structured w/ types + Semi-Structured (Maps, Sets, Collections)
7. Malleable schema (via New Rows,
Columns)
8. Masterless + Multi DC (Workload
or Regional)
9. Globally Scalable
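"Tunable consistency" in the Cassandra column comes down to simple arithmetic. A minimal sketch, assuming the usual rule of thumb that a read is guaranteed to overlap the latest acknowledged write when read replicas plus write replicas exceed the replication factor. The function names are illustrative and are not part of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                            # 2
print(overlaps(quorum(rf), quorum(rf), rf))  # True: QUORUM reads + QUORUM writes
print(overlaps(1, 1, rf))                    # False: ONE + ONE is only eventually consistent
```

Picking lower levels buys latency and availability at the cost of possibly stale reads. That trade is what "tunable" means here.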
Cassandra Core Concepts
Cassandra
Architecture
Cluster / Data Centers
01
Cassandra is not for tiny data. Do you NEED:
1. Fast read and write of terabytes of data?
2. Replication / availability around the world?
3. Never go down, always up?
Don’t use Cassandra if:
1. You only have gigabytes of data.
2. Your application can chill in one datacenter.
3. Your system can go down whenever it wants.
4. You just want to be cool.
Cassandra Data Model
Keyspaces & Tables
02
Cassandra Tables / Column Families look like SQL Server / MySQL / Postgres tables & databases. They are not.
1. CQL supports queries by Partition Key and an optional Clustering Key.
2. CQL does not support arbitrary queries on columns.
3. Cassandra shouldn’t be managing more than 100-150 tables across any number of keyspaces.
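One way to picture why queries are constrained: a table behaves like a map from partition key to rows kept sorted by clustering key. The toy class below is an in-memory illustration only, not how Cassandra is implemented. `ToyTable` and the sample keys are made up for the example.

```python
from bisect import bisect_left, bisect_right


class ToyTable:
    """In-memory stand-in for one table: partition key -> rows sorted by clustering key."""

    def __init__(self):
        self.partitions = {}

    def insert(self, partition_key, clustering_key, value):
        rows = self.partitions.setdefault(partition_key, [])
        keys = [ck for ck, _ in rows]
        i = bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)      # same primary key: overwrite (upsert)
        else:
            rows.insert(i, (clustering_key, value))

    def select(self, partition_key, start=None, end=None):
        """Rows of ONE partition, optionally limited to start <= clustering key <= end."""
        rows = self.partitions.get(partition_key, [])
        keys = [ck for ck, _ in rows]
        lo = 0 if start is None else bisect_left(keys, start)
        hi = len(rows) if end is None else bisect_right(keys, end)
        return rows[lo:hi]


orders = ToyTable()
orders.insert("customer-42", "2024-01-03", "order C")
orders.insert("customer-42", "2024-01-01", "order A")
orders.insert("customer-42", "2024-01-02", "order B")
print(orders.select("customer-42", start="2024-01-02"))
```

There is deliberately no method for "find rows where value = X". Answering that would mean scanning every partition, which is the arbitrary query that CQL refuses to run by default.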
Cassandra Operations
Read / Write Paths
03
Cassandra does these things well.
1. Write: It appends the write to a commit log, adds it to the memtable so it is available for reads, and later flushes the memtable to disk as immutable SSTables.
2. Read: It works out which SSTables may hold the data (a Bloom filter is involved), reads from the memtable and those SSTables, and reconciles the immutable data + deletes into the latest data.
3. It spreads the load around the ring so that you can have hundreds of nodes doing this and not break a sweat: beast-like performance.
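The write and read paths above can be sketched with a toy single-node model. This is a teaching sketch under heavy simplification: no replication, no compaction, and a plain dictionary lookup stands in for the Bloom filter. `ToyNode` and its parameters are hypothetical names.

```python
import itertools

TOMBSTONE = object()  # marker written in place of a deleted value


class ToyNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []          # append-only record of every mutation
        self.memtable = {}            # newest in-memory (timestamp, value) per key
        self.sstables = []            # immutable flushed snapshots, oldest first
        self.memtable_limit = memtable_limit
        self._clock = itertools.count(1)

    def write(self, key, value):
        ts = next(self._clock)
        self.commit_log.append((ts, key, value))   # durability first
        self.memtable[key] = (ts, value)           # then make it readable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        self.write(key, TOMBSTONE)                 # a delete is just another write

    def flush(self):
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}

    def read(self, key):
        # "key in table" stands in for the Bloom filter check on each SSTable.
        candidates = [table[key] for table in self.sstables if key in table]
        if key in self.memtable:
            candidates.append(self.memtable[key])
        if not candidates:
            return None
        _, value = max(candidates, key=lambda c: c[0])   # newest timestamp wins
        return None if value is TOMBSTONE else value


node = ToyNode(memtable_limit=2)
node.write("a", 1)
node.write("b", 2)      # memtable is full, so it is flushed to the first SSTable
node.write("a", 10)
node.delete("b")        # second flush: the tombstone now lives in an SSTable
print(node.read("a"))   # 10
print(node.read("b"))   # None
```

Note that the old values of "a" and "b" are still sitting in the first SSTable. Reads have to look at both tables and reconcile, which is why piles of tombstones and overwrites make reads slower.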
Cassandra Operational
Pitfalls Visualized
Wide Partitions
01
Data Skew
02
Tombstones
03
Monitoring and
Continuous Detection
04
How to check for tombstones.
1. Monitor using cfstats (*Tombstones)
2. Monitor using system.log (“Tombstone Warn Threshold”)
3. Monitor using OpsCenter, Prometheus + Grafana,
Datadog, Sematext
Bad key design can lead to really, really bad data skew. If the number of keys is only 1 or 2, all of the data lives in just one or two partitions (plus their replicas).
1. Monitor using cfstats (NumberOfKeys,
SpaceUsedLive, ReadCounts, WriteCounts)
2. Monitor using OpsCenter, Prometheus + Grafana,
Datadog, Sematext
Wide partitions will completely screw you over on reads and
take a node out if there’s traffic.
1. Monitor using cfstats
(CompactedPartitionMaximumBytes)
2. Monitor in system.log “Compacting large partition”
3. Monitor using toppartitions
4. Monitor using OpsCenter, Prometheus + Grafana,
Datadog, Sematext
https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/
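The three checks above (tombstones, skew, wide partitions) can be scripted against saved stats output. A rough sketch, assuming "label: value" lines in the style of `nodetool tablestats`. The sample text is hand-written, the exact labels differ between Cassandra versions, and the thresholds are hypothetical, so adjust all of them to what your cluster actually prints.

```python
import re

# Hand-written excerpt in the style of `nodetool tablestats`; numbers are invented.
SAMPLE = """\
Table: orders_by_customer
Number of partitions (estimate): 2
Compacted partition maximum bytes: 943127981
Average tombstones per slice (last five minutes): 1200.5
Table: events_by_id
Number of partitions (estimate): 1850000
Compacted partition maximum bytes: 14237
Average tombstones per slice (last five minutes): 0.0
"""

# Hypothetical alert thresholds; tune them for your own cluster.
MAX_PARTITION_BYTES = 100 * 1024 * 1024
MIN_PARTITIONS = 10
MAX_TOMBSTONES_PER_SLICE = 1000


def parse_tablestats(text):
    """Return {table name: {metric label: float}} from 'label: value' lines."""
    tables, current = {}, None
    for line in text.splitlines():
        label, _, value = line.strip().partition(": ")
        if label == "Table":
            current = tables.setdefault(value, {})
        elif current is not None and re.fullmatch(r"-?\d+(\.\d+)?", value):
            current[label] = float(value)
    return tables


def warnings_for(stats):
    found = []
    if stats.get("Compacted partition maximum bytes", 0) > MAX_PARTITION_BYTES:
        found.append("wide partition")
    if stats.get("Number of partitions (estimate)", MIN_PARTITIONS) < MIN_PARTITIONS:
        found.append("possible data skew")
    if stats.get("Average tombstones per slice (last five minutes)", 0) > MAX_TOMBSTONES_PER_SLICE:
        found.append("tombstone-heavy reads")
    return found


report = {name: warnings_for(stats) for name, stats in parse_tablestats(SAMPLE).items()}
print(report)
```

Run it on a schedule and diff the report between runs. That is a cheap form of continuous detection until a proper monitoring stack is in place.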
Monitoring Options
Opscenter, Grafana
+ Prometheus, ...
Cassandra Vision
05
Objective : Provide a way to Visually Identify “Skew”
1. Extract cfstats and tablestats output and, soon, data from “virtual” system tables.
2. Transform it into a format that can be imported/exported
3. Transform it into an Excel workbook that’s easy to use
4. Provide a Web UI
5. Cassandra “Agnostic” Tooling
1. Visualizing distributed systems is
difficult.
2. Some information works best as a time-based view; other information needs a point-in-time view.
3. Sometimes managing Cassandra is like a
Business Intelligence / Data Analytics
job.
Objective: Provide / Support a standard Cassandra Monitoring Stack (Prometheus + Grafana)
1. Document clearly how to do it manually.
2. Document how to do it with automation.
3. Provide ansible playbooks
4. Provide dockerized containers
5. Cassandra “Agnostic” Tooling
https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/
Common Cassandra
Migration Patterns
Monolith to
Microservices
https://www.infoworld.com/article/3236291/how-to-choose-a-database-for-your-microservices.html
Lift and Shift
01
When it Works Great!
1. Partition Key is a GUID/UUID/TimeUUID
2. Partition Sizes are “Sane” because a
Clustering Key is a Natural Key
3. Ton of Columns and Most of them are Null
4. Ton of Text / Blobs / JSON / XML
5. Aren’t using JOINS or Arbitrary Queries
6. Aren’t using that many “Views” that are
basically JOINS or Arbitrary Queries
When it Fails Bigly!
1. Lookup Tables: there is a fixed number of Records, so there is no need for distribution.
2. When a Partition Key is “Popular” (a hot partition)
3. When Time Series Data doesn’t come in Consistently
4. Any type of JOIN / Arbitrary Query as the
primary Access Pattern
Some things to REMEMBER
1. CQL is Similar to SQL but is NOT SQL
2. You can’t Query on what is not a Key or Indexed
3. Indexes / Materialized Views can Have Skew
4. Empty (unset) Columns are better than Columns explicitly set to Null, because writing a Null creates a tombstone
5. Since you Lifted/Shifted, Performance Test
w/ Realistic Data
Denormalize / Consolidate
02 When it Works like Awesome Sauce!
1. Master Detail Objects: One to Many
where the “One” Owns the “Many”
2. Hierarchical Objects (1-n-n)
3. Normalized Data is not Massive
4. Natural “Objects” that can be organized
into Records (Row) and Folders
(Partition)
5. Whole Sets of Small Lookup Tables can be
put into a “Object_Reference_Table”
6. Objects that need to store History or
When it Fails Flat!
1. Normalized Data is Massive
2. The Standard Deviation of Partition Size
is High
3. You need to Query on a non-Key attribute
/ sub-attribute.
4. Need to pull “reports”
Some things to REMEMBER
1. Cassandra Stores Key/Values under the
Hood
2. Number of Rows / Columns Doesn’t Matter as long as each Partition stays under 100-200MB
3. Since Consolidated Objects can become
Big, Performance Test w/ Realistic Data
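"The Standard Deviation of Partition Size is High" is easy to turn into a number once you have partition sizes from a realistic test load. A small sketch, assuming you already have a list of sizes in bytes. The sample numbers are invented, and the cutoff you alert on is your call.

```python
from statistics import mean, pstdev


def partition_skew(sizes_bytes):
    """Coefficient of variation (standard deviation / mean) of partition sizes."""
    avg = mean(sizes_bytes)
    return pstdev(sizes_bytes) / avg if avg else 0.0


even = [10_000_000, 11_000_000, 9_500_000, 10_500_000]
skewed = [1_000_000, 2_000_000, 1_500_000, 400_000_000]   # one runaway partition

print(round(partition_skew(even), 2))
print(round(partition_skew(skewed), 2))
```

Dividing by the mean makes the figure comparable across tables of different sizes. A value well above 1 means a few partitions dominate, which is the case where consolidation fails flat.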
Microservices on
Cassandra
https://www.infoworld.com/article/3236291/how-to-choose-a-database-for-your-microservices.html
Read /Write Microservices
03
When it Works like a Champ!
1. Treat a Table/Keyspace/Data Center as a
model for a Microservice (Domain)
2. Design your Models as if you were
designing a REST API
3. Design your Models as if they were
messages being sent in a Queue
4. When Microservices are not waiting on
other Microservices (Non-Blocking)
5. Bunch of Writes and then a bunch of
Reads
When it Fails like a Champ!
1. You are trying to do too many things in
one operation. (Lifted and Shifted a
Monolith)
2. Instead of making 100 Queries you make
1 Query with 100 Keys in the “IN” Clause
3. You are trying to do many Read/Write/Read/Write cycles back to back
Some things to REMEMBER
1. Cassandra itself is a set of 15-20 Thread Pools that pass messages to one another and sometimes between nodes.
2. Do as many Writes as you want.
3. Reads should be 1 Partition / Query
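"1 Partition / Query" instead of one giant IN clause can look like the sketch below: many small single-partition reads issued concurrently. `fetch_partition` is a hypothetical stand-in for whatever single-partition read your service performs, and the in-memory `FAKE_TABLE` replaces the cluster so that the example runs on its own.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in data store: partition key -> rows. A real service would query the cluster.
FAKE_TABLE = {f"user-{i}": [f"row for user-{i}"] for i in range(100)}


def fetch_partition(key):
    """Hypothetical single-partition read, e.g. SELECT ... WHERE user_id = ?"""
    return FAKE_TABLE.get(key, [])


def fetch_many(keys, max_in_flight=16):
    """Issue one single-partition read per key, several at a time, instead of one big IN."""
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        results = pool.map(fetch_partition, keys)   # results come back in key order
        return dict(zip(keys, results))


rows = fetch_many([f"user-{i}" for i in range(100)])
print(len(rows))   # 100
```

Each small read can go straight to a replica that owns the partition. With a 100-key IN, one coordinator has to gather all 100 partitions before it can answer.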
CQRS
CQRS Microservices
04
When It’s “Web Scale”. Drop Microphone
1. All updates to data are “Events” with a
Payload that are processed via Command
Processors
2. Events are interpreted and can be used to
update multiple copies of data as may be
required. (Data Integrity)
3. Events can be sent to and sourced from
Database, Cache, Queues, or directly
from the Event source to the processor.
4. All Reads happen from “Query Tables” or
“Report Tables”
When it’s not “Web Scale”
1. Processors can’t seem to keep up and so
your queries show stale information.
2. Too many events take down the Queue /
Cache
3. Sending too much information as events.
Think smaller.
4. Didn’t really segregate Command from
Query Layer (Separate Scaling)
Some things to REMEMBER
1. The same things as in Read / Write Microservices
2. If Commands are materializing Data in
Different places, process those in
separate threads asynchronously.
3. Scale the Query and Command
processors as needed.
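The CQRS flow described on this slide (events in, command processor, several query tables out) fits in a few lines. This is an in-memory sketch: the deque stands in for the queue, the dictionaries for Cassandra query tables, and the event names are made up.

```python
from collections import defaultdict, deque

events = deque()                          # stand-in for the queue between layers
orders_by_customer = defaultdict(list)    # query table 1
orders_by_status = defaultdict(set)       # query table 2


def submit(event_type, payload):
    """Command side: record what happened; do not touch query tables here."""
    events.append({"type": event_type, "payload": payload})


def process_pending():
    """Command processor: turn each event into writes to every copy of the data."""
    while events:
        event = events.popleft()
        order = event["payload"]
        if event["type"] == "OrderPlaced":
            orders_by_customer[order["customer"]].append(order["id"])
            orders_by_status["placed"].add(order["id"])
        elif event["type"] == "OrderShipped":
            orders_by_status["placed"].discard(order["id"])
            orders_by_status["shipped"].add(order["id"])


submit("OrderPlaced", {"id": "o-1", "customer": "c-9"})
submit("OrderPlaced", {"id": "o-2", "customer": "c-9"})
submit("OrderShipped", {"id": "o-1"})
process_pending()

print(orders_by_customer["c-9"])             # ['o-1', 'o-2']
print(sorted(orders_by_status["shipped"]))   # ['o-1']
```

Until `process_pending()` runs, the query tables are stale. That is the "processors can’t keep up" failure mode in miniature, and it is why the two sides need to scale separately.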
Cassandra Data Modeling
Best Practices
Good Key Design
01
Some things to NOT DO.
1. Avoid using Integer/Long keys unless you couple them with another column in a composite partition key. (Unless you can show through realistic data generation that they won’t coalesce data on a few nodes.)
2. Avoid using Time/Date based keys or TimeUUID
unless you know for damn sure that you are going to
continuously create data at a given interval all day,
every day.
3. Don’t just import relational data and expect it to
magically work.
Some things TO DO.
1. UUID will most likely work fine for any given table,
but how do you find it again? You will need to have
another table that has that information.
2. If you must use human readable keys, you can use a
synthetic sharding mechanism. Next Slide.
3. You can combine known values into a composite key and take a chance, but you should test under load: (String, Integer, String, Integer).
Some things to REMEMBER
1. Clustering Keys don’t spread data around the
cluster.
2. Remember (Partition Key, Clustering Key) is different from ((Partition Key 1, Partition Key 2))
3. Use Realistic Data: To properly scale Cassandra or
any other System you need to create realistic data.
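The parentheses in item 2 are easy to get wrong, so here is a tiny helper that renders the PRIMARY KEY clause from explicit lists. The helper and the column names are illustrative only. The point is that the inner parentheses hold the partition key, and everything after them is clustering.

```python
def primary_key_clause(partition_columns, clustering_columns=()):
    """Render a CQL PRIMARY KEY clause; the inner parentheses hold the partition key."""
    partition = ", ".join(partition_columns)
    parts = [f"({partition})", *clustering_columns]
    return f"PRIMARY KEY ({', '.join(parts)})"


# One partition per sensor, rows ordered by reading_time inside it.
print(primary_key_clause(["sensor_id"], ["reading_time"]))
# PRIMARY KEY ((sensor_id), reading_time)

# One partition per (sensor, day) pair: this is what spreads data around the cluster.
print(primary_key_clause(["sensor_id", "day"]))
# PRIMARY KEY ((sensor_id, day))
```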
Spreading Data via
Synthetic Sharding
01
Sometimes you need to use the key that you have, which is human readable, because that is the query path. How do you deal with that?
1. Primary Key : ((CountryName, StateName,
CityName, CompanyName))
2. Integer Shard Added ((CountryName, StateName,
CityName, CompanyName, ShardNumber))
3. ShardNumber could be 1-10, or 1-100 depending on
how badly your data is spreading.
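A sketch of how ShardNumber might be computed and used, assuming each row has some stable identifier to hash. `row_id`, `SHARD_COUNT` and the helper names are hypothetical. A random shard per write would also work if you never need to locate a single row again.

```python
import zlib

SHARD_COUNT = 10  # hypothetical; the slide suggests 1-10 or 1-100


def shard_for(row_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Stable shard number in 1..shard_count derived from a per-row identifier."""
    return zlib.crc32(row_id.encode("utf-8")) % shard_count + 1


def write_key(country, state, city, company, row_id):
    """Partition key used on write: the readable columns plus the computed shard."""
    return (country, state, city, company, shard_for(row_id))


def read_keys(country, state, city, company, shard_count: int = SHARD_COUNT):
    """Every partition key a reader must query to see all rows for the company."""
    return [(country, state, city, company, shard) for shard in range(1, shard_count + 1)]


key = write_key("USA", "Virginia", "Reston", "Acme", "employee-1001")
print(key in read_keys("USA", "Virginia", "Reston", "Acme"))   # True
```

The price of sharding is on the read side: a full read of one company now fans out to `SHARD_COUNT` partitions, so keep the shard count as small as the skew allows.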
If you are using a time-based key and notice coalescing around a particular time, you could make the weekday itself part of the key.
1. Partition Key : (CreatedDate)
2. Week Day Number ((CreatedDate, WeekDay))
3. WeekDay would be 0-6 mapped to Sunday-Saturday
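Deriving the WeekDay component is a one-liner, but the numbering needs care. A minimal sketch, assuming CreatedDate is a calendar date. The function names are illustrative.

```python
from datetime import date


def weekday_number(created: date) -> int:
    """0-6 mapped to Sunday-Saturday (Python's own weekday() starts at Monday = 0)."""
    return (created.weekday() + 1) % 7


def partition_key(created: date):
    """Composite partition key ((CreatedDate, WeekDay)) as a tuple."""
    return (created.isoformat(), weekday_number(created))


print(partition_key(date(2024, 1, 7)))   # ('2024-01-07', 0), a Sunday
print(partition_key(date(2024, 1, 6)))   # ('2024-01-06', 6), a Saturday
```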
Avoiding Tombstones
01
Just say no to Tombstones! Tombstones exist to make insanely fast writes and updates possible while still letting Cassandra return the right data quickly. (Side conversation on Queues as Anti-pattern)
1. There is no need to set null values or delete data actively.
2. You can always do soft deletes or use TTL values that expire data automatically.
3. Watch out for prepared statements sending nulls.
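One way to keep nulls out of writes is to build the statement only from the columns that actually have values. A hedged sketch: `build_insert` is a hypothetical helper, the `?` placeholder style is an assumption about your driver, and the table and column names are invented. Only interpolate identifiers that you control.

```python
def build_insert(table, values, ttl_seconds=None):
    """Return (cql, params) for an INSERT that skips columns whose value is None."""
    present = {col: val for col, val in values.items() if val is not None}
    if not present:
        raise ValueError("nothing to insert")
    columns = ", ".join(present)
    placeholders = ", ".join("?" for _ in present)
    cql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    if ttl_seconds is not None:
        cql += f" USING TTL {int(ttl_seconds)}"   # let the data expire by itself
    return cql, list(present.values())


cql, params = build_insert(
    "sessions_by_user",
    {"user_id": "u-1", "session_id": "s-9", "device": None},
    ttl_seconds=86400,
)
print(cql)     # INSERT INTO sessions_by_user (user_id, session_id) VALUES (?, ?) USING TTL 86400
print(params)  # ['u-1', 's-9']
```

The cost is that you prepare one statement per distinct column set instead of a single statement with every column bound. That is usually a fair price for not writing a tombstone for every missing value.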
Questions?
Resources
Cassandra
● cassandra.link
● https://anant.github.io/awesome-cassandra
● https://www.sestevez.com/sestevez/cassandradatamodeler/
Microservices on Cassandra
● https://www.slideshare.net/JeffreyCarpenter/data-modeling-for-microservices-with-cassandra-and-spark
Data Modeling Problems in Cassandra
● https://blog.anant.us/common-problems-cassandra-data-models/
Monitoring Cassandra / Spark
● https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/
We’re Partnering / Hiring
Platforms
Datastax, Sitecore, Spark, Docker, Solr, Cassandra,
Kafka, Elastic, AWS, Azure
Frameworks
React/Angular, TypeScript, ASP.NET, Node, Python

Migrating from a Relational Database to Cassandra: Why, Where, When and How
