Cassandra
part 2
diegopacheco
@diego_pacheco
Diego Pacheco
❏ Cat's Father
❏ Principal Software Architect
❏ Agile Coach
❏ SOA Expert
❏ DevOps Practitioner
❏ Speaker
❏ Author
http://diego-pacheco.blogspot.com.br/
https://goo.gl/eEqvzl
About me...
Agenda
❏ RE-CAP
❏ Cassandra Write Path
❏ Tombstones
❏ Compaction Strategies
❏ Row Cache
❏ Bloom Filter
❏ SASI Index
❏ Materialized Views
❏ Counter Families
❏ Anti-Patterns
❏ Cassandra running at UBER in MESOS use case
❏ Q&A
http://cassandra.apache.org/
RE-CAP: Partition Strategy
Cassandra Write Path
❏ SSTable => Sorted String Table.
❏ Write to disk: data is merged and pre-sorted before it is written.
❏ SSTables are IMMUTABLE.
❏ Compaction happens:
❏ From time to time
❏ Prunes deleted data
❏ Has trade-offs
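The write path above can be sketched as a toy model: writes land in an in-memory buffer, a flush emits an immutable, key-sorted SSTable, and compaction merges SSTables while keeping the newest value per key. The names `Memtable`, `flush`, and `compact` are illustrative assumptions, not Cassandra's actual classes, and the commit log is left out.

```python
from typing import Dict, List, Tuple

# An SSTable is modelled as an immutable, key-sorted tuple of
# (key, (timestamp, value)) entries.
SSTable = Tuple[Tuple[str, Tuple[int, str]], ...]


class Memtable:
    """In-memory write buffer (hypothetical simplification)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[int, str]] = {}

    def write(self, key: str, value: str, timestamp: int) -> None:
        # Last write wins: keep the cell with the highest timestamp.
        current = self._data.get(key)
        if current is None or timestamp >= current[0]:
            self._data[key] = (timestamp, value)

    def flush(self) -> SSTable:
        # Flushing sorts by key and returns an immutable structure.
        table = tuple(sorted(self._data.items()))
        self._data = {}
        return table


def compact(sstables: List[SSTable]) -> SSTable:
    """Merge several SSTables into one, keeping the newest cell per key."""
    merged: Dict[str, Tuple[int, str]] = {}
    for table in sstables:
        for key, cell in table:
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    return tuple(sorted(merged.items()))
```

Because flushed tables are never modified, an update to an existing key only becomes a single entry again once compaction merges the tables that contain it.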
Tombstones
❏ Deleted data is MARKED as removed == Tombstone
❏ Data is physically removed only during compaction
❏ Compaction may only happen after a few days, depending on the
configuration.
❏ Queries on a partition with lots of tombstones require lots of
filtering, which can slow down Cassandra's performance.
❏ Collection operations can lead to tombstones, depending
on what you do.
❏ There are compaction trade-offs.
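A minimal sketch of the tombstone lifecycle described above, assuming a single in-memory map of cells. The parameter name `gc_grace_seconds` mirrors the Cassandra table option of the same name; the functions themselves are hypothetical and only illustrate the idea.

```python
from typing import Dict, Optional, Tuple

TOMBSTONE = None  # a deleted cell is stored as a marker, not removed in place

# key -> (write_time_seconds, value or TOMBSTONE)
Cells = Dict[str, Tuple[int, Optional[str]]]


def delete(cells: Cells, key: str, now: int) -> None:
    """A delete is just another write: it records a tombstone."""
    cells[key] = (now, TOMBSTONE)


def read_live(cells: Cells) -> Dict[str, str]:
    """Reads must scan and skip tombstones; many tombstones mean much filtering."""
    return {k: v for k, (_, v) in cells.items() if v is not TOMBSTONE}


def compact(cells: Cells, now: int, gc_grace_seconds: int) -> Cells:
    """Drop tombstones older than the grace period; keep everything else."""
    return {
        k: (ts, v)
        for k, (ts, v) in cells.items()
        if v is not TOMBSTONE or now - ts < gc_grace_seconds
    }
```

The tombstone stays on disk, and in the read path, until a compaction runs after the grace period has elapsed.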
Compaction Strategies
❏ STCS
❏ Default
❏ Insert-Heavy
❏ General Workloads
❏ LCS
❏ Read Heavy
❏ More Updates than
Inserts
❏ DTCS
❏ Time Series
❏ Inserts out of order
❏ Updates for old data
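To give a feel for how the default strategy (STCS) picks work, here is a simplified sketch of size-tiered bucketing: SSTables of similar size are grouped, and a group becomes a compaction candidate once it holds enough tables. The parameter names follow the STCS options `bucket_low`, `bucket_high`, and `min_threshold`, but the algorithm below is an assumption-laden simplification, not Cassandra's implementation.

```python
from typing import List


def size_tiered_buckets(sizes: List[int], bucket_low: float = 0.5,
                        bucket_high: float = 1.5) -> List[List[int]]:
    """Group SSTable sizes into buckets whose members are of similar size."""
    buckets: List[List[int]] = []
    for size in sorted(sizes):
        for bucket in buckets:
            avg = sum(bucket) / len(bucket)
            if avg * bucket_low <= size <= avg * bucket_high:
                bucket.append(size)
                break
        else:
            # No existing bucket is close enough in size: start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes: List[int],
                          min_threshold: int = 4) -> List[List[int]]:
    """Buckets with at least `min_threshold` SSTables are worth compacting."""
    return [b for b in size_tiered_buckets(sizes) if len(b) >= min_threshold]
```

With sizes `[10, 11, 9, 12, 100, 110, 1000]`, only the four small tables form a bucket large enough to compact; the bigger tables wait until more tables of their size appear.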
Cassandra ROW CACHE
❏ Buffers the FULL merged row in memory
❏ Can greatly increase throughput
❏ Row Cache works together with Key Cache
❏ Key Cache = where the partition is on DISK.
CREATE TABLE status (
user text,
status_id timeuuid,
status text,
PRIMARY KEY (user, status_id))
WITH CLUSTERING ORDER BY (status_id DESC)
AND caching = '{"keys":"ALL", "rows_per_partition":"10"}';
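The effect of `rows_per_partition` in the table definition above can be sketched as a cache that keeps only the head of each partition in memory. `RowCache` and `read_from_disk` are hypothetical names for illustration; eviction and memory limits are ignored.

```python
from typing import Callable, Dict, List


class RowCache:
    """Caches the first `rows_per_partition` rows of each partition read."""

    def __init__(self, rows_per_partition: int,
                 read_from_disk: Callable[[str], List[str]]) -> None:
        self.rows_per_partition = rows_per_partition
        self._read_from_disk = read_from_disk
        self._cache: Dict[str, List[str]] = {}
        self.disk_reads = 0

    def head(self, partition_key: str) -> List[str]:
        """Return the head of the partition, from memory when possible."""
        if partition_key not in self._cache:
            self.disk_reads += 1
            rows = self._read_from_disk(partition_key)
            self._cache[partition_key] = rows[: self.rows_per_partition]
        return self._cache[partition_key]

    def invalidate(self, partition_key: str) -> None:
        # In this sketch a write to the partition drops its cached rows.
        self._cache.pop(partition_key, None)
```

Combined with `CLUSTERING ORDER BY (status_id DESC)`, caching the head of the partition means the most recent statuses are the ones served from memory.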
Cassandra Bloom Filter
❏ Bloom Filter: technique created in the 1970s to filter DB matches.
❏ Space efficient
❏ Probabilistic data structure
❏ For each SSTable there is a Bloom Filter
❏ Used for index scans, not for range scans
❏ Stored OFF HEAP
❏ Tunable per TABLE
❏ Cassandra uses bloom filters to know whether an SSTable may contain the requested data or not.
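A minimal Bloom filter sketch makes the "probabilistic" part concrete: a lookup can return a false positive but never a false negative, so a negative answer lets a read skip an SSTable entirely. The class below is an illustration only; it uses SHA-256 from the standard library for convenience, which is an assumption and not the hash Cassandra uses.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Space-efficient set membership test with possible false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str) -> Iterator[int]:
        # Derive `num_hashes` bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))
```

More bits per key lower the false-positive rate at the cost of memory, which is the trade-off behind the per-table tuning mentioned above.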
Cassandra READ Path
SASI
❏ Secondary Index: an index on a column that is not the primary key.
❏ Lookup tables: bySomething
❏ Distributed Index
❏ LIKE-style search capabilities: %diego%
❏ Great when:
❏ Multi-field search
❏ You know the partition key
❏ Indexing static columns
❏ Issues when:
❏ More than 1000 rows are returned
❏ Searching in large partitions
❏ Aggressive read SLOs
❏ Searching for analytics (use Spark/Flink)
❏ Ordering of search results is important
SASI
Samples
❏ SELECT * FROM users WHERE firstname LIKE 'Die%';
❏ SELECT * FROM users WHERE lastname LIKE '%ie%';
❏ SELECT * FROM users WHERE
created_date > '2015-01-02' AND created_date < '2017-01-02';
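As a rough intuition for why a prefix query such as `LIKE 'Die%'` can be answered from an index, the sketch below keeps indexed values sorted and uses binary search to find where the prefix starts. This is a generic sorted-term index, assumed here for illustration; it does not reproduce SASI's on-disk format, and `build_index` and `prefix_search` are hypothetical names.

```python
from bisect import bisect_left
from typing import List, Tuple


def build_index(rows: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """rows are (row_id, indexed_value); the index is sorted by value."""
    return sorted((value, row_id) for row_id, value in rows)


def prefix_search(index: List[Tuple[str, str]], prefix: str) -> List[str]:
    """Equivalent of LIKE 'prefix%': binary search to the start, then scan."""
    start = bisect_left(index, (prefix, ""))
    matches = []
    for value, row_id in index[start:]:
        if not value.startswith(prefix):
            break  # sorted order: no later value can match
        matches.append(row_id)
    return matches
```

A range query like the `created_date` sample works the same way on a sorted index, while a contains query such as `'%ie%'` cannot use this shortcut and needs a different index layout.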
Materialized Views
❏ Automated: the table is managed for you (denormalization)
❏ Copies of the data in different partitions / replicas
❏ Some write penalty, but acceptable performance
❏ Stores results in a table which can be indexed
❏ Updated ASYNC
❏ Great for:
❏ Caching
❏ Result sets
❏ Dashboards
SAMPLE
CREATE MATERIALIZED VIEW all_time_high AS
SELECT user FROM scores WHERE
game IS NOT NULL AND
score IS NOT NULL
PRIMARY KEY (game,score) WITH CLUSTERING ORDER BY (score DESC);
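What the database does on your behalf for a view like `all_time_high` can be sketched as a second structure that is updated on every base-table write. The base-table key `(user, game)` is an assumption, since the slide does not show the `scores` schema, and the update here is synchronous for simplicity, whereas the slide notes that real view updates are asynchronous.

```python
from typing import Dict, List, Tuple


class Scores:
    """Base table keyed by (user, game) plus a view keyed by game."""

    def __init__(self) -> None:
        self.base: Dict[Tuple[str, str], int] = {}
        self.view: Dict[str, List[Tuple[int, str]]] = {}

    def upsert(self, user: str, game: str, score: int) -> None:
        # Read the old value first: this read-before-write is the write penalty.
        old = self.base.get((user, game))
        self.base[(user, game)] = score
        rows = self.view.setdefault(game, [])
        if old is not None:
            rows.remove((old, user))  # drop the stale view row
        rows.append((score, user))
        rows.sort(reverse=True)  # mirrors CLUSTERING ORDER BY (score DESC)

    def all_time_high(self, game: str) -> List[Tuple[int, str]]:
        """Read the view: rows for a game, highest score first."""
        return self.view.get(game, [])
```

Without a view, the application would have to maintain the second table itself and keep both writes consistent.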
Cassandra Counter Family
❏ Static vs. dynamic column families
❏ Dynamic column families, a.k.a. wide rows
❏ Wide rows are good for: ordering, grouping and filtering.
❏ Wide rows are not split across NODES.
❏ Counters internally:
❏ Value is calculated as the sum over all replicas
❏ Split into fragments called SHARDs.
❏ Logical clock increases monotonically
❏ 3-tuple = { NODE_COUNTER_ID, SHARD_LOGICAL_CLOCK, SHARD_VALUE }
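The 3-tuple above suggests a simple model: each node owns one shard, increments bump that shard's logical clock, and the counter value is the sum of all shard values. The functions below are a hypothetical sketch of that idea under those assumptions, not Cassandra's actual counter code.

```python
from typing import Dict, Tuple

# node_counter_id -> (shard_logical_clock, shard_value)
Shards = Dict[str, Tuple[int, int]]


def increment(shards: Shards, node_id: str, delta: int) -> None:
    """A node only ever updates its own shard and advances its clock."""
    clock, value = shards.get(node_id, (0, 0))
    shards[node_id] = (clock + 1, value + delta)


def merge(a: Shards, b: Shards) -> Shards:
    """Reconcile two replicas: per node, the shard with the higher clock wins."""
    merged = dict(a)
    for node_id, shard in b.items():
        if node_id not in merged or shard[0] > merged[node_id][0]:
            merged[node_id] = shard
    return merged


def counter_value(shards: Shards) -> int:
    """The visible counter is the sum of every shard's value."""
    return sum(value for _, value in shards.values())
```

Because each shard carries its own clock, merging replicas never double-counts an increment: the newer version of a shard simply replaces the older one.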
Anti-Patterns
❏ Using Cassandra as a queue or queue-like table
❏ Tombstones
❏ Lots of deleted columns (expiry) and slice queries don't play well together
❏ http://www.datastax.com/dev/blog/cassandra-anti-patterns-queues-and-queue-like-datasets
❏ CQL Nulls
❏ Reading tombstones
❏ Writing NULL creates tombstones
❏ Intensive updates on the SAME column
❏ Sensor table (ID, VALUE)
❏ Physical limits
❏ Solution: timestamp as clustering key.
Cassandra at UBER using MESOS (2016 data)
