Intro to Cassandra
  Tyler Hobbs
History


Dynamo                     BigTable
(clustering)               (data model)




               Cassandra
Users
Clustering

    Every node plays the same role
    – No masters, slaves, or special nodes
    – No single point of failure
Consistent Hashing

           0

     50          10




     40          20

           30
Consistent Hashing
                        Key: “www.google.com”
           0
                        md5(“www.google.com”)
     50          10

                                   14

     40          20

           30
                Replication Factor = 3
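
A rough Python sketch of the lookup pictured above. It assumes a toy ring of 60 tokens with nodes at 0, 10, ..., 50 as in the diagram, and the usual convention that a key belongs to the first node at or after its token, with further replicas on the next nodes clockwise. `RING_SIZE`, `NODE_TOKENS`, `token_for`, and `replicas_for_token` are illustrative names, not Cassandra APIs, and reducing the md5 value modulo 60 is only there to fit the toy ring.

```python
import bisect
import hashlib

RING_SIZE = 60                          # assumed toy token space
NODE_TOKENS = [0, 10, 20, 30, 40, 50]   # node positions from the diagram


def token_for(key):
    """Map a key onto the toy ring via md5, reduced modulo RING_SIZE."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for_token(token, replication_factor):
    """Return the node tokens that store a key with the given token.

    The first replica is the first node at or after the token; the
    remaining replicas are the next nodes clockwise around the ring.
    """
    start = bisect.bisect_left(NODE_TOKENS, token) % len(NODE_TOKENS)
    return [NODE_TOKENS[(start + i) % len(NODE_TOKENS)]
            for i in range(replication_factor)]


# The slide's key hashes to 14; with Replication Factor = 3:
print(replicas_for_token(14, 3))   # [20, 30, 40]
# Tokens past the last node wrap around to the node at 0:
print(replicas_for_token(55, 2))   # [0, 10]
```

Any client-side or server-side component can compute the same answer from the key alone, which is why no central lookup table is needed.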
Clustering

    Client can talk to any node
Scaling

RF = 2             0


              50        10


                        20

                   30

The node at 50 owns the red portion.
Scaling

RF = 2               0


                50        10



                40        20

                     30

   Add a new node at 40.
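
A small Python sketch of why adding a node is cheap. It assumes each node owns the range from the previous node's token (exclusive) up to its own token (inclusive), and it only tracks primary ownership; with RF = 2 the replica placement shifts as well, which this sketch leaves out. `primary_ranges` is a hypothetical helper, not part of Cassandra.

```python
def primary_ranges(node_tokens):
    """Map each node token to the (start, end] range it owns on the ring."""
    tokens = sorted(node_tokens)
    # tokens[i - 1] with i == 0 wraps around to the last node on the ring.
    return {tok: (tokens[i - 1], tok) for i, tok in enumerate(tokens)}


before = primary_ranges([0, 10, 20, 30, 50])
after = primary_ranges([0, 10, 20, 30, 40, 50])

print(before[50])            # (30, 50)
print(after[40], after[50])  # (30, 40) (40, 50)

# Only the node at 50 gives up part of its range; the others are untouched.
changed = {tok for tok in before if before[tok] != after[tok]}
print(changed)               # {50}
```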
Node Failures

RF = 2               0


                50        10

   Replicas
                40        20

                     30
Node Failures

RF = 2               0


                50        10




                40        20

                     30
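
A hedged Python sketch of the failure case. It assumes the six-node ring from the diagram and that each range is stored on its primary node plus the next RF - 1 nodes clockwise; `replica_sets` and `unavailable_ranges` are illustrative names only.

```python
NODE_TOKENS = [0, 10, 20, 30, 40, 50]   # node positions from the diagram


def replica_sets(rf):
    """For each primary node, list the rf nodes that hold its range."""
    n = len(NODE_TOKENS)
    return {NODE_TOKENS[i]: [NODE_TOKENS[(i + k) % n] for k in range(rf)]
            for i in range(n)}


def unavailable_ranges(failed, rf):
    """Return primary tokens whose range has no live replica left."""
    return [tok for tok, reps in replica_sets(rf).items()
            if all(r in failed for r in reps)]


# One failed node: every range still has a live replica.
print(unavailable_ranges({40}, rf=2))       # []
# Two adjacent failed nodes: the range owned by 40 has lost both copies.
print(unavailable_ranges({40, 50}, rf=2))   # [40]
```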
Consistency, Availability

    Consistency
    – Can I read stale data?

    Availability
    – Can I write/read at all?

    Tunable Consistency
Consistency

    N = Total number of replicas

    R = Number of replicas read from
    – (before the response is returned)

    W = Number of replicas written to
    – (before the write is considered a success)


    W + R > N gives strong consistency
Consistency
 W + R > N gives strong consistency

 N=3
 W=2
 R=2

 2 + 2 > 3 ==> strongly consistent

 Only 2 of the 3 replicas must be
 available.
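
The rule works because any set of W replicas and any set of R replicas must share at least one member when W + R > N, so a read always touches a replica that saw the latest write. The brute-force check below is only an illustration of that overlap argument; `always_overlaps` is a made-up helper.

```python
from itertools import combinations


def always_overlaps(n, w, r):
    """True if every w-replica write set intersects every r-replica read set."""
    replicas = range(n)
    return all(set(write_set) & set(read_set)
               for write_set in combinations(replicas, w)
               for read_set in combinations(replicas, r))


print(always_overlaps(3, 2, 2))   # True:  2 + 2 > 3
print(always_overlaps(3, 1, 1))   # False: 1 + 1 <= 3, a read can miss the write
```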
Consistency

    Tunable Consistency
    – Specify N (Replication Factor) per data set
    – Specify R, W per operation
    – Quorum: N/2 + 1 (N/2 rounded down)
       • R = W = Quorum
       • Strong consistency
       • Tolerate the loss of N – Quorum replicas
    – R, W can also be 1 or N
Availability

    Can tolerate the loss of:
    – N – R replicas for reads
    – N – W replicas for writes
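
The quorum and tolerance arithmetic can be checked with a few lines of Python. This is a sketch with hypothetical helper names; it assumes N/2 means integer division.

```python
def quorum(n):
    """Quorum size for n replicas: N/2 rounded down, plus 1."""
    return n // 2 + 1


def tolerated_losses(n, level):
    """Replicas that can be down while an operation needing `level` replicas still succeeds."""
    return n - level


for n in (3, 4, 5):
    q = quorum(n)
    print(n, q, tolerated_losses(n, q))
# 3 2 1
# 4 3 1
# 5 3 2
```

Note that going from N = 3 to N = 4 raises the quorum without tolerating any additional failures, which is one reason odd replication factors are a common choice.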
CAP Theorem
During node or network failure:



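    [Chart: Availability (vertical axis, up to 100%) against Consistency (horizontal axis, up to 100%). The region toward the origin is labeled "Possible"; the corner where both reach 100% is labeled "Not Possible". The label "Cassandra" runs along the diagonal boundary between the two regions.]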
Clustering

    No single point of failure

    Replication that works

    Scales linearly
    – 2x nodes = 2x performance
       • For both writes and reads
    – Up to hundreds of nodes

    Operationally simple

    Multi-Datacenter Replication
Data Model

    Comes from Google BigTable

    Goals
    – Minimize disk seeks
    – High throughput
    – Low latency
    – Durable
Data Model

    Keyspace
    – A collection of Column Families
    – Controls replication settings

    Column Family
    – Kinda resembles a table
Column Families

    Static
    – Object data
    – Similar to a table in a relational database

    Dynamic
    – Pre-calculated query results
    – Materialized views
Static Column Families
                   Users
   zznate    password: *    name: Nate


   driftx    password: *   name: Brandon


   thobbs    password: *    name: Tyler


   jbellis   password: *   name: Jonathan   site: riptano.com
Dynamic Column Families

    Rows
    – Each row has a unique primary key
    – Sorted list of (name, value) tuples
       • Like a sorted map or dictionary
    – The (name, value) tuple is called a “column”
Dynamic Column Families
                     Following
zznate    driftx:   thobbs:


driftx


thobbs    zznate:


jbellis   driftx:   mdennis:   pcmanus:   thobbs:   xedin:   zznate:
Dynamic Column Families

    Column Timestamps
    – Each column (tuple) has a timestamp
    – In the case of a collision, the latest timestamp wins
    – Client specifies timestamp with write
    – Writes are idempotent
       • Infinite retries allowed
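
A minimal Python sketch of last-timestamp-wins, showing why retries are harmless. The row is modeled as a plain dict and `apply_write` is a hypothetical helper; ties between equal timestamps are simply ignored here, which is a simplification.

```python
def apply_write(row, name, value, timestamp):
    """Store the column only if its timestamp is newer than what is already there."""
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)
    return row


row = {}
apply_write(row, "age", 24, timestamp=100)
apply_write(row, "age", 34, timestamp=200)
apply_write(row, "age", 34, timestamp=200)   # a retry of the same write: no effect
apply_write(row, "age", 24, timestamp=100)   # an old write arriving late: ignored
print(row)   # {'age': (34, 200)}
```

Because the client supplies the timestamp, replaying a write any number of times, in any order, leaves the row in the same state.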
Dynamic Column Families

    Other Examples:
    – Timeline of tweets by a user
    – Timeline of tweets by all of the people a user is
      following
    – List of comments sorted by score
    – List of friends grouped by state
The Data API

    Two choices
    – RPC-based API
    – CQL
       • Cassandra Query Language
Inserting Data
 INSERT INTO users (KEY, name, age)
     VALUES ('thobbs', 'Tyler', 24);
Updating Data
 Updates are the same as inserts:
 INSERT INTO users (KEY, age)
     VALUES ('thobbs', 34);


 Or
 UPDATE users SET age = 34
     WHERE KEY = 'thobbs';
Fetching Data
 Whole row select:
 SELECT * FROM users WHERE KEY = 'thobbs';
Fetching Data
 Explicit column select:
 SELECT name, age FROM users
     WHERE KEY = 'thobbs';
Fetching Data
 Get a slice of columns
 UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e'
     WHERE KEY = 'key';

 SELECT 1..3 FROM letters WHERE KEY = 'key';


 Returns [(1, a), (2, b), (3, c)]
Fetching Data
 Get a slice of columns
 SELECT FIRST 2 FROM letters WHERE KEY = 'key';


 Returns [(1, a), (2, b)]

 SELECT FIRST 2 REVERSED FROM letters
     WHERE KEY = 'key';


 Returns [(5, e), (4, d)]
Fetching Data
 Get a slice of columns
 SELECT 3..'' FROM letters WHERE KEY = 'key';


 Returns [(3, c), (4, d), (5, e)]

 SELECT FIRST 2 REVERSED 4..'' FROM letters
     WHERE KEY = 'key';


 Returns [(4, d), (3, c)]
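
The slice queries above behave like range lookups on a sorted map. The Python sketch below reproduces the slides' results under that assumption; `column_slice` and its parameters are hypothetical and say nothing about how Cassandra implements slices internally.

```python
import bisect


def column_slice(columns, start=None, finish=None, first=None, reverse=False):
    """Return (name, value) pairs from a row whose columns are sorted by name.

    start/finish of None mean an open end (the '' in the queries above).
    With reverse=True the slice begins at `start` and walks toward smaller names.
    """
    names = sorted(columns)
    if not reverse:
        lo = 0 if start is None else bisect.bisect_left(names, start)
        hi = len(names) if finish is None else bisect.bisect_right(names, finish)
        selected = names[lo:hi]
    else:
        hi = len(names) if start is None else bisect.bisect_right(names, start)
        lo = 0 if finish is None else bisect.bisect_left(names, finish)
        selected = names[lo:hi][::-1]
    if first is not None:
        selected = selected[:first]
    return [(name, columns[name]) for name in selected]


letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}

print(column_slice(letters, 1, 3))                           # 1..3
print(column_slice(letters, first=2))                        # FIRST 2
print(column_slice(letters, first=2, reverse=True))          # FIRST 2 REVERSED
print(column_slice(letters, start=3))                        # 3..''
print(column_slice(letters, start=4, first=2, reverse=True)) # FIRST 2 REVERSED 4..''
```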
Deleting Data
 Delete a whole row:
 DELETE FROM users WHERE KEY = 'thobbs';

 Delete specific columns:
 DELETE age FROM users
     WHERE KEY = 'thobbs';
Secondary Indexes
 Built-in basic indexes
 CREATE INDEX ageIndex ON users (age);

 SELECT name FROM users
     WHERE age = 24 AND state = 'TX';
Performance

    Writes
    – 10k – 30k per second per node
    – Sub-millisecond latency

    Reads
    – 1k – 10k per second per node
    – Depends on data set, caching
    – Usually 0.1 to 10ms latency
Other Features

    Distributed Counters
    – Can support millions of high-volume counters

    Excellent Multi-datacenter Support
    – Disaster recovery
    – Locality

    Hadoop Integration
    – Isolation of resources
    – Hive and Pig drivers

    Compression
What Cassandra Can't Do

    Transactions
    – Unless you use a distributed lock
    – Atomicity, Isolation
    – These aren't needed as often as you'd think

    Limited support for ad-hoc queries
    – Know what you want to do with the data
Not One-size-fits-all

    Use alongside an RDBMS
    – Use the RDBMS for highly transactional or highly
      relational data
       • Usually a small set of data
    – Let Cassandra scale to handle the rest
Language Support

    Good:
    – Java
    – Python
    – Ruby
    – PHP
    – C#

    Coming Soon:
    – Everything else, now that we have CQL
Questions?

          Tyler Hobbs
               @tylhobbs
       tyler@datastax.com
