Intro to Cassandra
  Tyler Hobbs
History

    [Diagram: Dynamo (clustering) and BigTable (data model) both feed into Inbox search, which led to Cassandra]
Users
Clustering

    Every node plays the same role
    – No masters, slaves, or special nodes
    – No single point of failure
Consistent Hashing

    [Figure: ring of six nodes at tokens 0, 10, 20, 30, 40, 50]
Consistent Hashing

    Key: "www.google.com"
    md5("www.google.com") maps to 14 on the ring
    [Figure: ring of six nodes at tokens 0, 10, 20, 30, 40, 50, with the key placed at position 14, between nodes 10 and 20]
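Consistent Hashing

    Key: "www.google.com"
    md5("www.google.com") maps to 14 on the ring
    Replication Factor = 3
    [Figure: ring of six nodes at tokens 0, 10, 20, 30, 40, 50, with the key at position 14 stored on three nodes]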
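The ring lookup on these slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cassandra's implementation: the ring size of 60, the `token_for` helper, and the clockwise walk used to pick replicas are all chosen here to match the toy ring, and the slide's position 14 is passed in directly rather than derived from MD5.

```python
import bisect
import hashlib

# Toy ring from the slides: six nodes at these token positions.
NODE_TOKENS = [0, 10, 20, 30, 40, 50]
RING_SIZE = 60  # assumption: the toy token space wraps around after 59


def token_for(key):
    """Hash a key onto the toy ring (hypothetical helper: MD5 reduced mod RING_SIZE)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % RING_SIZE


def replicas_for(token, replication_factor):
    """Walk clockwise from the token and return the first `replication_factor` nodes."""
    # The first node whose token is >= the key's token holds the first replica;
    # the modulo wraps tokens above the highest node back to the first node.
    start = bisect.bisect_left(NODE_TOKENS, token) % len(NODE_TOKENS)
    count = min(replication_factor, len(NODE_TOKENS))
    return [NODE_TOKENS[(start + i) % len(NODE_TOKENS)] for i in range(count)]


print(replicas_for(14, 3))  # position 14 from the slide
print(replicas_for(55, 3))  # a position past the last node wraps around
```

Under these assumptions, position 14 lands on the nodes at 20, 30, and 40. Every node can compute this placement locally, with no master involved in the lookup.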
Clustering

    Client can talk to any node
Scaling

    RF = 2
    The node at 50 owns the red portion
    [Figure: ring of five nodes at tokens 0, 10, 20, 30, 50, with part of the ring highlighted in red]
Scaling

    RF = 2
    Add a new node at 40
    [Figure: the ring with a new node at token 40, between the nodes at 30 and 50]
Node Failures

    RF = 2
    [Figure: ring of six nodes at tokens 0, 10, 20, 30, 40, 50, with a "Replicas" label]
Node Failures

    RF = 2
    [Figure: ring of six nodes at tokens 0, 10, 20, 30, 40, 50]
Consistency, Availability

    Consistency
    – Can I read stale data?

    Availability
    – Can I write/read at all?

    Tunable Consistency
Consistency

    N = Total number of replicas

    R = Number of replicas read from
    – (before the response is returned)

    W = Number of replicas written to
    – (before the write is considered a success)


    W + R > N gives strong consistency
Consistency
 W + R > N gives strong consistency

 N=3
 W=2
 R=2

 2 + 2 > 3 ==> strongly consistent

 Only 2 of the 3 replicas must be
 available.
Consistency

    Tunable Consistency
    – Specify N (Replication Factor) per data set
    – Specify R, W per operation
    – Quorum: floor(N/2) + 1
       • R = W = Quorum
       • Strong consistency
       • Tolerate the loss of N – Quorum replicas
    – R, W can also be 1 or N
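The quorum arithmetic above can be checked with a short Python sketch. The function names are hypothetical and the code models only the counting rule from the slides, not how Cassandra coordinates reads and writes.

```python
def quorum(n):
    """Smallest majority of n replicas (integer division)."""
    return n // 2 + 1


def is_strongly_consistent(n, r, w):
    """W + R > N means every read set overlaps every write set in at least one replica."""
    return r + w > n


n = 3
q = quorum(n)
print("quorum:", q)
print("R = W = quorum is strongly consistent:", is_strongly_consistent(n, q, q))
print("replicas that can be lost:", n - q)
print("R = W = 1 is strongly consistent:", is_strongly_consistent(n, 1, 1))
```

With N = 3 the quorum is 2, so R = W = 2 satisfies 2 + 2 > 3 while still tolerating the loss of one replica; R = W = 1 does not satisfy the rule and may return stale data.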
Availability

    Can tolerate the loss of:
    – N – R replicas for reads
    – N – W replicas for writes
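As a rough sketch of the trade-off, the following Python tabulates how many replicas may be down for a few choices of R and W with N = 3. The helper names are hypothetical, and the model only counts live replicas; it ignores timeouts and other failure modes.

```python
def tolerable_losses(n, r, w):
    """Replicas that may be down while reads at R and writes at W still succeed."""
    return {"reads": n - r, "writes": n - w}


def can_serve(required, n, down):
    """True if enough replicas remain up to satisfy the required count."""
    return n - down >= required


n = 3
for r, w in [(1, 3), (2, 2), (3, 1)]:
    print(f"R={r}, W={w}:", tolerable_losses(n, r, w))

# With one replica down, quorum operations still work but R = N or W = N does not.
print(can_serve(2, n, down=1), can_serve(3, n, down=1))
```

All three settings satisfy W + R > N, but they distribute the availability cost differently: R = 1, W = 3 keeps reads available through two failures while a single failure blocks writes.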
CAP Theorem
During node or network failure:

    [Figure: plot of Availability (vertical axis, up to 100%) against Consistency (horizontal axis, up to 100%), divided into a "Possible" region and a "Not Possible" region, with "Cassandra" labeled between the two regions]
Clustering

    No single point of failure

    Replication that works

    Scales linearly
    – 2x nodes = 2x performance
       • For both reads and writes
    – Up to hundreds of nodes
    – See “Netflix: 1 million writes/sec on AWS”

    Operationally simple

    Multi-Datacenter Replication
Data Model

    Comes from Google BigTable

    Goals
    – Commodity Hardware
       • Spinning disks
    – Handle data sets much larger than memory
       • Minimize disk seeks
    – High throughput
    – Low latency
    – Durable
Column Families

    Static
    – Object data
    – Similar to a table in a relational database

    Dynamic
    – Precomputed query results
    – Materialized views

    (these are just educational classifications)
Static Column Families
                   Users
   zznate    password: *   name: Nate
   driftx    password: *   name: Brandon
   thobbs    password: *   name: Tyler
   jbellis   password: *   name: Jonathan   site: riptano.com
Dynamic Column Families

    Rows
    – Each row has a unique primary key
    – Sorted list of (name, value) tuples
       • Like an ordered hash
    – The (name, value) tuple is called a “column”
Dynamic Column Families
                     Following
zznate    driftx:   thobbs:
driftx
thobbs    zznate:
jbellis   driftx:   mdennis:   pcmanus:   thobbs:   xedin:   zznate:
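One way to picture a dynamic column family such as Following is as a map from row key to a list of columns kept sorted by name. The Python below is a toy in-memory model under that assumption; the `Row` class and its methods are hypothetical and say nothing about how Cassandra stores rows on disk.

```python
import bisect


class Row:
    """One row: (name, value) columns kept sorted by name, like an ordered hash."""

    def __init__(self):
        self._names = []   # column names, always in sorted order
        self._values = {}  # column name -> value

    def set(self, name, value=""):
        # Insert the name at its sorted position the first time it is seen.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def columns(self):
        return [(name, self._values[name]) for name in self._names]


# Row key -> row; the column names carry the data and the values are empty.
following = {}
for user, followed in [("zznate", ["thobbs", "driftx"]), ("thobbs", ["zznate"])]:
    row = following.setdefault(user, Row())
    for name in followed:
        row.set(name)

print(following["zznate"].columns())
```

Columns come back in name order regardless of insertion order, which is what makes precomputed, ordered results such as timelines cheap to read.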
Dynamic Column Families

    Other Examples:
    – Timeline of tweets by a user
    – Timeline of tweets by all of the people a user is
      following
    – List of comments sorted by score
    – List of friends grouped by state
The Data API

    RPC-based API
    – github.com/twitter/cassandra

    CQL (Cassandra Query Language)
    – code.google.com/a/apache-extras.org/p/cassandra-ruby/
Inserting Data
 INSERT INTO users (KEY, 'name', 'age')
     VALUES ('thobbs', 'Tyler', 24);
Updating Data
 Updates are the same as inserts:
 INSERT INTO users (KEY, 'age')
     VALUES ('thobbs', 34);

 Or
 UPDATE users SET 'age' = 34
     WHERE KEY = 'thobbs';
Fetching Data
 Whole row select:
 SELECT * FROM users WHERE KEY = 'thobbs';
Fetching Data
 Explicit column select:
 SELECT 'name', 'age' FROM users
     WHERE KEY = 'thobbs';
Fetching Data
 Get a slice of columns
 UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e'
     WHERE KEY = 'key';

 SELECT 1..3 FROM letters WHERE KEY = 'key';

 Returns [(1, a), (2, b), (3, c)]
Fetching Data
 Get a slice of columns
 SELECT FIRST 2 FROM letters WHERE KEY = 'key';

 Returns [(1, a), (2, b)]

 SELECT FIRST 2 REVERSED FROM letters
     WHERE KEY = 'key';

 Returns [(5, e), (4, d)]
Fetching Data
 Get a slice of columns
 SELECT 3..'' FROM letters WHERE KEY = 'key';

 Returns [(3, c), (4, d), (5, e)]

 SELECT FIRST 2 REVERSED 4..'' FROM letters
     WHERE KEY = 'key';

 Returns [(4, d), (3, c)]
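The slice semantics in the last few slides can be modeled in Python to make the ordering rules explicit. This is a sketch that reproduces the results shown, not the server's algorithm; `get_slice` and its parameters are hypothetical names, and `None` stands in for the empty bound `''`.

```python
def get_slice(columns, start=None, finish=None, first=None, reverse=False):
    """Return columns between start and finish from a name-sorted list.

    A bound of None is open-ended. With reverse=True the row is walked from
    the highest name downward, so `start` acts as the upper bound.
    """
    ordered = list(reversed(columns)) if reverse else list(columns)

    def in_range(name):
        if reverse:
            return (start is None or name <= start) and (finish is None or name >= finish)
        return (start is None or name >= start) and (finish is None or name <= finish)

    selected = [(name, value) for name, value in ordered if in_range(name)]
    return selected if first is None else selected[:first]


letters = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]

print(get_slice(letters, start=1, finish=3))               # SELECT 1..3
print(get_slice(letters, first=2))                         # SELECT FIRST 2
print(get_slice(letters, first=2, reverse=True))           # SELECT FIRST 2 REVERSED
print(get_slice(letters, start=3))                         # SELECT 3..''
print(get_slice(letters, start=4, first=2, reverse=True))  # SELECT FIRST 2 REVERSED 4..''
```

The reversed case is the one that tends to surprise: the start bound is where the walk begins, so `4..''` reversed yields columns 4 and 3 rather than 4 and 5.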
Deleting Data
 Delete a whole row:
 DELETE FROM users WHERE KEY = 'thobbs';

 Delete specific columns:
 DELETE 'age' FROM users
     WHERE KEY = 'thobbs';
Secondary Indexes
 Built-in basic indexes
 CREATE INDEX ageIndex ON users (age);

 SELECT name FROM users
     WHERE age = 24 AND state = 'TX';
Performance

    Writes
    – 10k – 30k per second per node
    – Sub-millisecond latency

    Reads
    – 1k – 20k per second per node (depends on data
      set, caching)
    – 0.1 to 10ms latency
Other Features

    Distributed Counters
    – Can support millions of high-volume counters

    Excellent Multi-datacenter Support
    – Disaster recovery
    – Locality

    Hadoop Integration
    – Isolation of resources
    – Hive and Pig drivers

    Compression
What Cassandra Can't Do

    Transactions
    – Unless you use a distributed lock
    – Atomicity, Isolation
    – These aren't needed as often as you'd think

    Limited support for ad-hoc queries
    – Know what you want to do with the data
Not One-size-fits-all

    Use alongside an RDBMS
Problems you shouldn't solve with C*

    Prototyping

    Distributed Locking

    Small datasets
    – (When you don't need availability)

    Complex graph processing
    – Shallow graph queries work well, though

    Fundamentally highly relational/transactional
    data
The sweet spot for Cassandra

    Large dataset, low latency queries

    Simple to medium complexity queries
    – Key/value
    – Time series, ordered data
    – Lists, sets, maps

    High Availability
The sweet spot for Cassandra

    Social
    – Texts, comments, check-ins, collaboration

    Activity
    – Feeds, timelines, clickstreams, logs, sensor data

    Metrics
    – Performance data over time
    – CloudKick, DataStax OpsCenter

    Text Search
    – Inbox search at Facebook
ORMs

    Poor integration

    ORMs are not a natural fit for Cassandra
    – In C*, we mainly care about queries, not objects
    – Beyond simple K/V, abstraction breaks

    Suggestion: don't waste time with an ORM
    – C* will only be used for a specific subset of your
      data/queries
    – Use the C* API directly in your model
Questions?

          Tyler Hobbs
               @tylhobbs
       tyler@datastax.com

Cassandra for Ruby/Rails Devs
