©2013 DataStax Confidential. Do not distribute without consent.
Jon Haddad, Luke Tillman
Technical Evangelists, DataStax
@rustyrazorblade, @LukeTillman
Introduction to Apache Cassandra
Small Data
• 100's of MB to low GB, single user
• sed, awk, grep are great
• sqlite
• Limitations:
• bad for multiple concurrent users (file sharing!)
Medium Data
• Fits on 1 machine
• RDBMS is fine
• postgres
• mysql
• Supports hundreds of concurrent users
• ACID makes us feel good
• Scales vertically
Can RDBMS work for big data?
Replication: ACID is a lie
[Diagram: Client, Master, and Slave, with replication lag between Master and Slave]
Consistent results? Nope!
Third Normal Form Doesn't Scale
• Queries are unpredictable
• Users are impatient
• Data must be denormalized
• If data > memory, you = history
• Disk seeks are the worst
(SELECT
CONCAT(city_name,', ',region) value,
latitude,
longitude,
id,
population,
( 3959 * acos( cos( radians($latitude) ) *
cos( radians( latitude ) ) * cos( radians( longitude )
- radians($longitude) ) + sin( radians($latitude) ) *
sin( radians( latitude ) ) ) )
AS distance,
CASE region
WHEN '$region' THEN 1
ELSE 0
END AS region_match
FROM `cities`
$where and foo_count > 5
ORDER BY region_match desc, foo_count desc
limit 0, 11)
UNION
(SELECT
CONCAT(city_name,', ',region) value,
latitude,
longitude,
id,
population,
( 3959 * acos( cos( radians($latitude) ) *
cos( radians( latitude ) ) * cos( radians( longitude )
- radians($longitude) ) + sin( radians($latitude) ) *
sin( radians( latitude ) ) ) )
Sharding is a Nightmare
• Data is all over the place
• No more joins
• No more aggregations
• Denormalize all the things
• Querying secondary indexes requires hitting every shard
• Adding shards requires manually moving data
• Schema changes
High Availability... not really
• Master failover… who's responsible?
• Another moving part…
• Bolted on hack
• Multi-DC is a mess
• Downtime is frequent
• Change database settings (innodb buffer pool, etc.)
• Drive, power supply failures
• OS updates
Summary of Failure
• Scaling is a pain
• ACID is naive at best
• You aren't consistent
• Re-sharding is a manual process
• We're going to denormalize for performance
• High availability is complicated, requires additional operational overhead
Lessons Learned
• Consistency is not practical
• So we give it up
• Manual sharding & rebalancing is hard
• So let's build it in
• Every moving part makes systems more complex
• So let's simplify our architecture - no more master / slave
• Scaling up is expensive
• We want commodity hardware
• Scatter / gather no good
• We denormalize for real time query performance
• Goal is to always hit 1 machine
What is Apache Cassandra?
• Fast Distributed Database
• High Availability
• Linear Scalability
• Predictable Performance
• No SPOF
• Multi-DC
• Commodity Hardware
• Easy to manage operationally
• Not a drop-in replacement for RDBMS
Hash Ring
• No master / slave / replica sets
• No config servers, zookeeper
• Data is partitioned around the ring
• Data is replicated to RF=N servers
• All nodes hold data and can answer queries (both reads & writes)
• Location of data on ring is determined by partition key
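The ring placement described above can be sketched in a few lines of Python. This is a minimal illustration, not Cassandra's implementation: MD5 stands in for the real partitioner hash (Cassandra's default partitioner uses Murmur3), each node gets a single token (no virtual nodes), and the node names and the `HashRing` class are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (MD5 is a stand-in hash, an assumption)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Toy ring: one token per node, replicas chosen by walking clockwise."""

    def __init__(self, nodes, replication_factor=3):
        self.rf = min(replication_factor, len(nodes))
        # Sort nodes by token so the list order is the ring order.
        self.ring = sorted((token(node), node) for node in nodes)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key: str):
        """Return the RF nodes responsible for this partition key."""
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self.tokens, token(partition_key))
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


# Hypothetical four-node cluster with RF=3.
ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("JCVD")
print(owners)
```

Every node can run the same calculation, which is why any node can act as coordinator without asking a master or config server where the data lives.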
CAP Tradeoffs
• Cassandra chooses Availability & Partition Tolerance over Consistency
• Queries have tunable consistency level
• ALL, QUORUM, ONE
• Hinted Handoff to deal with failed nodes
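The effect of the consistency levels listed above comes down to simple arithmetic on the replication factor. The sketch below assumes the usual definition of QUORUM as a majority of replicas, and the standard overlap rule that a read is guaranteed to see the latest write when read replicas + write replicas > RF. The function names are illustrative, not part of any driver API.

```python
def quorum(replication_factor: int) -> int:
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Number of replicas that must respond for a given consistency level."""
    levels = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return levels[level]


def reads_see_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when the read set and the write set must share at least one replica."""
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf


print(quorum(3))                                         # 2
print(reads_see_latest_write("QUORUM", "QUORUM", 3))     # True
print(reads_see_latest_write("ONE", "ONE", 3))           # False
```

With RF=3, QUORUM reads and writes tolerate one failed replica and still overlap, while ONE/ONE favors availability and latency and accepts the possibility of stale reads.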
Data Modeling
Data Structures
• Like an RDBMS, Cassandra uses a Table to store data
• But that’s where the similarities end
• Partitions within tables
• Rows within partitions (or a single row)
• CQL to create tables & query data
• Partition keys determine where a partition is found
• Clustering keys determine ordering of rows within a partition
[Diagram: a Keyspace contains Tables, a Table contains Partitions, a Partition contains Rows]
Example: Single Row Partition
• Simple User system
• Identified by name (pk)
• 1 Row per partition
• This is familiar territory
name       age  job
jon        33   evangelist
luke       33   evangelist
old pete   108  retired
s. seagal  62   actor
JCVD       53   actor
cqlsh:demo> create table user
(name text primary key,
age int,
job text);
cqlsh:demo> select * from user where name = 'JCVD';
Example: Multiple Rows
• Comments on photos
• Comments are always selected by the photo_id
• There are only 4 rows in 2 partitions
• In the real world, use UUIDs instead of int for PK
photo_id  comment_id  user  comment
5         1           jon   hi
5         2           luke  oh hey
5         3           JCVD  AHHHHH!!!
6         4           jon   great pic
create table comment
( photo_id int,
comment_id int,
user text,
comment text,
primary key (photo_id, comment_id));
select * from comment where photo_id=5;
Partition with Clustering
photo_id  comment_id  user  comment    comment_id  user  comment  comment_id  user  comment
5         1           jon   hi         2           luke  oh hey   3           JCVD  AHHHHH!!!
6         4           jon   great pic
• Multiple rows are transposed into a single partition
• Partitions vary in size
• Old terminology - "wide row"
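One way to picture a partition with clustering is a map from partition key to a list of rows kept sorted by clustering key. The Python sketch below is an in-memory analogy only, using the comment table's columns; the `CommentTable` class is hypothetical and says nothing about Cassandra's on-disk format.

```python
import bisect
from collections import defaultdict


class CommentTable:
    """Toy model of: primary key (photo_id, comment_id)."""

    def __init__(self):
        # photo_id (partition key) -> rows sorted by comment_id (clustering key)
        self.partitions = defaultdict(list)

    def insert(self, photo_id, comment_id, user, comment):
        rows = self.partitions[photo_id]
        # Writing the same primary key again replaces the row (an upsert).
        rows[:] = [row for row in rows if row[0] != comment_id]
        bisect.insort(rows, (comment_id, user, comment))

    def select(self, photo_id):
        """Equivalent in spirit to: select * from comment where photo_id=?"""
        return list(self.partitions.get(photo_id, []))


table = CommentTable()
# Insert out of order; rows still come back ordered by comment_id.
table.insert(5, 3, "JCVD", "AHHHHH!!!")
table.insert(5, 1, "jon", "hi")
table.insert(6, 4, "jon", "great pic")
table.insert(5, 2, "luke", "oh hey")

for row in table.select(5):
    print(row)
```

Reading all comments for one photo touches a single partition, which is what keeps the query on one machine.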
Model Tables to Answer Queries
• This is not 3NF!!
• We always query by partition key
• Create many tables aka materialized views
• Manage in your app code
• Denormalize!!
user  age
jon   33
luke  33
JCVD  53

age  user  user
33   jon   luke
53   JCVD
CREATE TABLE age_to_user (
age int,
user text,
primary key (age, user)
);
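"Manage in your app code" means the application writes every view of the data itself. The sketch below shows the idea with two in-memory dictionaries standing in for the user table and the age_to_user table. The `save_user` and `users_with_age` helpers are hypothetical; a real application would issue the corresponding CQL statements through a driver instead.

```python
from collections import defaultdict

# Stand-ins for the two tables: one keyed by user, one keyed by age.
user_table = {}                   # user -> age
age_to_user = defaultdict(set)    # age  -> set of users


def save_user(name: str, age: int) -> None:
    """Write to both tables so each query can be answered by key."""
    old_age = user_table.get(name)
    if old_age is not None and old_age != age:
        # Keep the lookup table in sync when the age changes.
        age_to_user[old_age].discard(name)
    user_table[name] = age
    age_to_user[age].add(name)


def users_with_age(age: int):
    """Answered from a single partition of the lookup table."""
    return sorted(age_to_user.get(age, ()))


save_user("jon", 33)
save_user("luke", 33)
save_user("JCVD", 53)

print(users_with_age(33))   # ['jon', 'luke']
print(users_with_age(53))   # ['JCVD']
```

The cost is an extra write per view, which is the trade the deck argues for: writes are cheap, so spend them to keep reads to a single partition.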
CQL Data Types
Basic Types: text, int, decimal, blob, uuid, timeuuid, counter
Collections: map, list, set
Read the CQL documentation for the full list of types
Reads & Writes
The Write Path
• Writes can be sent to any node in the cluster; that node acts as the coordinator
• Each write goes to the commit log, then to the memtable
• Every write includes a timestamp
• The memtable is periodically flushed to disk as an SSTable
• A new memtable is then created in memory
• Deletes are a special case of a write: they write a marker called a “tombstone”
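The steps above can be sketched as a toy single-node storage engine. Everything here is a simplification under stated assumptions: the `Node` class is hypothetical, a counter stands in for real timestamps, the flush is triggered by a small row limit rather than by memory pressure, and the "disk" is a Python list.

```python
TOMBSTONE = object()  # marker written in place of a value on delete


class Node:
    """Toy write path: commit log -> memtable -> flush to an SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []     # append-only record of every write
        self.memtable = {}       # in-memory: key -> (value, timestamp)
        self.sstables = []       # flushed tables, never modified afterwards
        self.memtable_limit = memtable_limit
        self.clock = 0           # stand-in for a real timestamp source

    def write(self, key, value):
        self.clock += 1
        timestamp = self.clock
        self.commit_log.append((key, value, timestamp))  # durability first
        self.memtable[key] = (value, timestamp)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key):
        # A delete is just a write of a tombstone.
        self.write(key, TOMBSTONE)

    def flush(self):
        # Write the memtable out sorted by key, then start a fresh one.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}


node = Node(memtable_limit=2)
node.write("jon", "evangelist")
node.write("luke", "evangelist")   # second write triggers a flush
node.delete("jon")                 # tombstone lands in the new memtable

print(len(node.sstables), len(node.memtable), len(node.commit_log))
```

Note that the delete does not touch the flushed SSTable; the old value and the tombstone coexist until a later merge resolves them by timestamp.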
What is an SSTable?
• Immutable data file for row storage
• Deletes are written as tombstones
• Every write includes a timestamp of when it was written
• A partition can be spread across multiple SSTables
• The same column can be in multiple SSTables
• SSTables are merged through compaction; only the value with the latest timestamp is kept
• Easy backups!
[Diagram: several SSTables are merged into a single SSTable]
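The "latest timestamp wins" rule in compaction can be sketched as a merge of dictionaries. This is a simplified model: it resolves whole values per key rather than individual columns, the `compact` function and sample values are made up for illustration, and tombstones are kept in the output (real Cassandra only purges them after a grace period).

```python
TOMBSTONE = object()  # marker for a deleted value


def compact(sstables):
    """Merge SSTables into one, keeping the newest (value, timestamp) per key."""
    merged = {}
    for sstable in sstables:
        for key, (value, timestamp) in sstable.items():
            if key not in merged or timestamp > merged[key][1]:
                merged[key] = (value, timestamp)
    # Sorted output mirrors the "sorted" in sorted string table.
    return dict(sorted(merged.items()))


# Two hypothetical SSTables: the newer one overwrites jon and deletes pete.
older = {"jon": ("evangelist", 1), "pete": ("retired", 2)}
newer = {"jon": ("engineer", 5), "pete": (TOMBSTONE, 6)}

result = compact([older, newer])
print(result["jon"])                      # ('engineer', 5)
print(result["pete"][0] is TOMBSTONE)     # True
```

Because the inputs are never modified, the merge order does not matter: compacting `[newer, older]` gives the same result.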
The Read Path
• Any server may be queried; it acts as the coordinator
• The coordinator contacts the nodes that hold the requested key
• On each node, data is pulled from SSTables and merged
• Consistency < ALL performs read repair in the background (read_repair_chance)
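The interaction between consistency level and read repair can be sketched with replicas modeled as dictionaries. This is an illustration under assumptions, not the actual protocol: the coordinator simply asks the first N replicas, repair runs as an explicit function call rather than in the background with some probability, and the `read` and `read_repair` names and the sample values are hypothetical.

```python
def read(replicas, key, replicas_to_ask):
    """Coordinator read: ask N replicas and return the value with the newest timestamp."""
    cells = [replica[key] for replica in replicas[:replicas_to_ask] if key in replica]
    if not cells:
        return None
    value, _timestamp = max(cells, key=lambda cell: cell[1])
    return value


def read_repair(replicas, key):
    """Compare all replicas and push the newest cell to any that are stale."""
    cells = [replica[key] for replica in replicas if key in replica]
    if not cells:
        return
    newest = max(cells, key=lambda cell: cell[1])
    for replica in replicas:
        if replica.get(key) != newest:
            replica[key] = newest


# Three replicas of one key: one stale, one current, one that missed the write.
replicas = [
    {"JCVD": ("actor", 1)},
    {"JCVD": ("director", 2)},
    {},
]

print(read(replicas, "JCVD", 1))   # actor (stale: only the first replica was asked)
print(read(replicas, "JCVD", 2))   # director (two replicas asked, newest wins)

read_repair(replicas, "JCVD")
print(read(replicas, "JCVD", 1))   # director (replicas now agree)
```

Asking more replicas raises the chance of seeing the newest value at the cost of latency, and repair gradually removes the disagreement that makes low consistency levels return stale data.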
Analytics with Spark
Spark at a Glance
• Scala, Python, Java
• Hadoop alternative - batch analytics
• Distributed SQL
• Real time analytics via streaming
• Machine learning
• GraphX (in progress)
• Open source connector available
• Built into DSE
Picking a distribution
Open Source
• Latest, bleeding edge features
• File JIRAs
• Support via mailing list & IRC
• Fix bugs
• cassandra.apache.org
• Perfect for hacking
DataStax Enterprise
• Integrated Multi-DC Solr
• Integrated Spark
• Free Startup Program
• <3MM rev & <$30M funding
• Extended support
• Additional QA
• Focused on stable releases for enterprise
• Included on USB drive
Summary
• How do I query my data if I can only query by key?
• Denormalize!
• Create multiple views into your data (multiple tables)
• Cassandra is built for fast writes
• Use fast writes to do as few reads as possible
• Use Spark for advanced analytics and real time analysis