Cassandra Explained


      Disruptive Code
    September 22, 2010

            Eric Evans
    eevans@rackspace.com
           @jericevans
    http://blog.sym-link.com
Outline
●   Background
●   Description
●   API(s)
●   Code Samples
Background
The Digital Universe
Consolidation
Old Guard
Vertical Scaling Sucks
Influential Papers
●   Bigtable
    ●   Strong consistency
    ●   Sparse map data model
    ●   GFS, Chubby, et al
●   Dynamo
    ●   O(1) distributed hash table (DHT)
    ●   BASE (aka eventual consistency)
    ●   Client tunable consistency/availability
NoSQL
●   HBase          ●   Hypertable
●   MongoDB        ●   HyperGraphDB
●   Riak           ●   Memcached
●   Voldemort      ●   Tokyo Cabinet
●   Neo4J          ●   Redis
●   Cassandra      ●   CouchDB
NoSQL Big data
●   HBase           ●   Hypertable
●   MongoDB         ●   HyperGraphDB
●   Riak            ●   Memcached
●   Voldemort       ●   Tokyo Cabinet
●   Neo4J           ●   Redis
●   Cassandra       ●   CouchDB
Bigtable / Dynamo
    Bigtable           Dynamo
●   HBase          ●   Riak
●   Hypertable     ●   Voldemort

            Cassandra ??
Dynamo-Bigtable Lovechild
CAP Theorem “Pick Two”
●   CP               ●   AP
    ●   Bigtable         ●   Dynamo
    ●   Hypertable       ●   Voldemort
    ●   HBase            ●   Cassandra
CAP Theorem “Pick Two”
●   Consistency
●   Availability
●   Partition Tolerance
Description
Properties
●   Symmetric
    ●   No single point of failure
    ●   Linearly scalable
    ●   Ease of administration
●   Flexible partitioning, replica placement
●   Automated provisioning
●   High availability (eventual consistency)
P2P Routing
Partitioning
●   Random
    ●   128-bit namespace (MD5)
    ●   Good distribution
●   Order Preserving
    ●   Tokens determine namespace
    ●   Natural order (lexicographical)
    ●   Range / cover queries
●   Yours ??
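Random partitioning can be sketched in a few lines of Python. This is a simplified illustration, not Cassandra's implementation: it assumes the token is the MD5 digest read as an unsigned 128-bit integer, and that a node owns every token up to and including its own. The ring layout and node names are hypothetical.

```python
import hashlib
from bisect import bisect_left

SPACE = 2 ** 128  # size of the MD5 token namespace


def md5_token(key):
    """Map a row key (bytes) to an integer token in the 128-bit MD5 space."""
    return int(hashlib.md5(key).hexdigest(), 16)


def owner(ring, token):
    """Return the node whose token is the first one >= `token`, wrapping around."""
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, token) % len(ring)
    return ring[index][1]


# Hypothetical three-node ring with evenly spaced tokens, sorted by token.
ring = [(i * SPACE // 3, "node%d" % i) for i in range(3)]

for key in (b"org.apache", b"org.debian"):
    print(key, owner(ring, md5_token(key)))
```

Because MD5 spreads keys evenly over the namespace, adjacent keys such as the two above usually land on unrelated nodes, which is why range queries need the order-preserving partitioner instead.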
Replica Placement
●   SimpleSnitch
    ●   Default
    ●   N-1 successive nodes
●   RackInferringSnitch
    ●   Infers DC/rack from IP
●   PropertyFileSnitch
    ●   Configured w/ a properties file
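The default placement ("N-1 successive nodes") can be sketched as follows. This is a minimal model that assumes the ring is a list of node names in token order and that the primary replica's position is already known; the function name and node names are hypothetical.

```python
def replicas(ring_nodes, primary_index, n):
    """Return the primary node plus the next n-1 nodes clockwise on the ring."""
    count = min(n, len(ring_nodes))  # cannot place more replicas than nodes
    return [ring_nodes[(primary_index + i) % len(ring_nodes)] for i in range(count)]


nodes = ["A", "B", "C", "D", "E"]
print(replicas(nodes, 3, 3))  # ['D', 'E', 'A']
```

A rack- or datacenter-aware snitch would replace the simple "next node" walk with one that skips nodes in a rack already holding a replica.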
Bootstrap
Choosing Consistency

          Write                                  Read
Level     Description                  Level     Description
ZERO      Hail Mary                    ZERO      N/A
ANY       1 replica (hinted handoff)   ANY       N/A
ONE       1 replica                    ONE       1 replica
QUORUM    (N / 2) + 1                  QUORUM    (N / 2) + 1
ALL       All replicas                 ALL       All replicas

R + W > N  (read and write replica sets overlap)
Quorum ((N/2) + 1)
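The quorum arithmetic is small enough to check directly. In this sketch, N is the replication factor and R and W are the number of replicas consulted on read and write; the function names are illustrative only.

```python
def quorum(n):
    """Number of replicas in a quorum for replication factor n: (n / 2) + 1."""
    return n // 2 + 1


def overlaps(r, w, n):
    """True when every read set must intersect every write set (R + W > N)."""
    return r + w > n


n = 3
q = quorum(n)
print(q)                  # 2
print(overlaps(q, q, n))  # True: QUORUM reads see QUORUM writes
print(overlaps(1, 1, n))  # False: ONE/ONE may read a stale replica
```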
Data Model
Overview
●   Keyspace
    ●   Uppermost namespace
    ●   Typically one per application
●   ColumnFamily
    ●   Associates records of a similar kind
    ●   Record-level Atomicity
    ●   Indexed
●   Column
    ●   Basic unit of storage
Sparse Table
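One way to picture the model is as nested maps: keyspace, then column family, then row key, then columns. The sketch below uses plain dictionaries as a stand-in; the second record and the helper name are made up for illustration. Note that the two rows do not share the same set of columns, which is what makes the table sparse.

```python
# Column family: row key -> {column name: value}
addresses = {
    "key1": {"first": "Eric", "last": "Evans", "city": "San Antonio"},
    "key2": {"first": "Ada", "email": "ada@example.org"},  # hypothetical record
}

# Keyspace: column family name -> column family
keyspace = {"Addresses": addresses}


def columns_of(keyspace, column_family, row_key):
    """Return a row's (name, value) pairs sorted by column name."""
    row = keyspace[column_family].get(row_key, {})
    return sorted(row.items())


print(columns_of(keyspace, "Addresses", "key2"))
# [('email', 'ada@example.org'), ('first', 'Ada')]
```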
Column
●   name
    ●   byte[]
    ●   Queried against (predicates)
    ●   Determines sort order
●   value
    ●   byte[]
    ●   Opaque to Cassandra
●   timestamp
    ●   long
    ●   Conflict resolution (Last Write Wins)
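Last-write-wins reconciliation can be sketched with a three-field tuple. This is an illustration under simplifying assumptions: timestamps are client-supplied integers, and a tie simply returns the second argument (the real tie-breaking rule is not covered here).

```python
from collections import namedtuple

# name and value are opaque byte strings; timestamp is an integer (a long).
Column = namedtuple("Column", ["name", "value", "timestamp"])


def reconcile(a, b):
    """Pick between two versions of the same column: the higher timestamp wins."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    return a if a.timestamp > b.timestamp else b


old = Column(b"city", b"Austin", 1000)
new = Column(b"city", b"San Antonio", 2000)
print(reconcile(old, new).value)
print(reconcile(new, old).value)
```

Both calls return the version written at timestamp 2000, regardless of the order in which replicas report them.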
Column Comparators
●    Bytes
●    UTF8
●    TimeUUID
●    Long
●    LexicalUUID
●    Composite (third-party)


    http://github.com/edanuff/CassandraCompositeType
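The choice of comparator changes the order in which columns come back. The sketch below (written for Python 3) sorts the same three numeric column names twice: once as UTF-8 text, and once as 8-byte big-endian integers, which is assumed here as the encoding for a Long comparator. It only covers non-negative values, where byte order and numeric order agree.

```python
import struct

names = [9, 10, 100]

# Text ordering: names compared as UTF-8 strings, character by character.
as_text = sorted(str(n).encode("utf-8") for n in names)

# Long ordering: names packed as 8-byte big-endian integers.
as_long = sorted(struct.pack(">q", n) for n in names)

print([b.decode("utf-8") for b in as_text])          # ['10', '100', '9']
print([struct.unpack(">q", b)[0] for b in as_long])  # [9, 10, 100]
```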
API
RPC




THRIFT
RPC




AVRO
Idiomatic Client Libraries
●    Pelops, Hector (Java)
●    Pycassa (Python)
●    Cassandra (Ruby)
●    Others …




    http://wiki.apache.org/cassandra/ClientOptions
Code Samples
Pycassa – Python Client API
# creating a connection
from pycassa import connect
hosts = ['host1:9160', 'host2:9160']
client = connect('Keyspace1', hosts)

# creating a column family instance
from pycassa import ColumnFamily
cf = ColumnFamily(client, "Standard1")

# reading/writing a column
cf.insert('key1', {'name': 'value'})
print cf.get('key1')['name']

   1. http://github.com/vomjom/pycassa
Address Book – Setup
# conf/cassandra.yaml
keyspaces:
  - name: AddressBook
    column_families:
      - name: Addresses
        compare_with: BytesType
        rows_cached: 10000
        keys_cached: 50
        comment: 'No comment'
Adding an entry
key = uuid()

columns = {
    'first':   'Eric',
    'last':    'Evans',
    'email':   'eevans@rackspace.com',
    'city':    'San Antonio',
    'zip':     78250
}

addresses.insert(key, columns)
Fetching a record
# fetching the record by key
record = addresses.get(key)

# accessing columns by name
zipcode = record['zip']
city = record['city']
Indexing (manual)
# conf/cassandra.yaml
keyspaces:
  - name: AddressBook
    column_families:
      - name: Addresses
        compare_with: BytesType
        rows_cached: 10000
        keys_cached: 50
        comment: 'No comment'
      - name: ByCity
        compare_with: UTF8Type
Updating the index
key = uuid()

columns = {
    'first':   'Eric',
    'last':    'Evans',
    'email':   'eevans@rackspace.com',
    'city':    'San Antonio',
    'zip':     78250
}

addresses.insert(key, columns)
byCity.insert('San Antonio', {key: ''})
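The read side of a manual index takes two steps: fetch the index row for the city (its column names are the address keys), then fetch each address by key. The sketch below models both column families with plain dictionaries rather than a live cluster; the helper names and the second record are hypothetical.

```python
addresses = {}  # stand-in for Addresses: row key -> columns
by_city = {}    # stand-in for ByCity: city -> {address key: ''}


def add_entry(key, columns):
    """Write the record, then add its key as a column in the city's index row."""
    addresses[key] = columns
    by_city.setdefault(columns["city"], {})[key] = ""


def find_by_city(city):
    """Read the index row, then look up each referenced record."""
    keys = sorted(by_city.get(city, {}))
    return [addresses[k] for k in keys]


add_entry("k1", {"first": "Eric", "last": "Evans", "city": "San Antonio"})
add_entry("k2", {"first": "Ada", "last": "Byron", "city": "London"})
print(find_by_city("San Antonio"))
```

The two writes in `add_entry` are not atomic across rows, so the application has to tolerate or repair an index entry that points at a missing record, and must remove the old index column when a record's city changes.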
Indexing (auto)
# conf/cassandra.yaml
keyspaces:
  - name: AddressBook
    column_families:
      - name: Addresses
        compare_with: BytesType
        rows_cached: 10000
        keys_cached: 50
        comment: 'No comment'
        column_metadata:
           - name: city
             index_type: KEYS
Querying the Index
from pycassa.index import create_index_expression
from pycassa.index import create_index_clause

e = create_index_expression('city', 'San Antonio')
clause = create_index_clause([e])

results = addresses.get_indexed_slices(clause)

for (key, columns) in results.items():
    print "%(first)s %(last)s" % columns
Timeseries
# conf/cassandra.yaml
keyspaces:
  - name: Sites
    column_families:
      - name: Stats
        compare_with: LongType
Logging values
from time import time

# time as long (microseconds since epoch)
tstmp = long(time() * 1e6)

stats.insert('org.apache', {tstmp: value})
Slicing
begin = long(start * 1e6)

stats.get_range('org.apache',
                column_start=begin)

end = long((start + 86400) * 1e6)

stats.get_range(start='org.apache',
                finish='org.debian',
                column_start=begin,
                column_finish=end)
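What a column slice does inside a single row can be modeled without a cluster. The sketch below keeps a row as a dictionary keyed by microsecond timestamps and returns the columns whose names fall in a range. It assumes both ends of the range are inclusive; the function name and sample values are illustrative.

```python
from bisect import bisect_left, bisect_right

DAY = 86400 * 10 ** 6  # one day in microseconds


def column_slice(row, column_start, column_finish):
    """Return (name, value) pairs with column_start <= name <= column_finish."""
    names = sorted(row)  # a Long comparator keeps columns in numeric order
    lo = bisect_left(names, column_start)
    hi = bisect_right(names, column_finish)
    return [(name, row[name]) for name in names[lo:hi]]


# Hypothetical samples: timestamp (microseconds) -> logged value
row = {0: 1, DAY // 2: 2, DAY: 3, 2 * DAY: 4}
print(column_slice(row, 0, DAY))
# [(0, 1), (43200000000, 2), (86400000000, 3)]
```

Because columns are stored sorted by name, a time-range query is a contiguous read within the row rather than a scan.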
Questions?
