Cassandra
Structured Storage System over a P2P Network

Avinash Lakshman, Prashant Malik
Why Cassandra?
• Lots of data
  – Copies of messages, reverse indices of
    messages, per user data.
• Many incoming requests resulting in a lot
  of random reads and random writes.
• No existing production-ready solutions on the market meet these requirements.
Design Goals
• High availability
• Eventual consistency
  – trade off strong consistency in favor of high availability
• Incremental scalability
• Optimistic Replication
• “Knobs” to tune tradeoffs between consistency,
  durability and latency
• Low total cost of ownership
• Minimal administration
Data Model
• Column Families are declared upfront; Columns and SuperColumns are added and modified dynamically.
• ColumnFamily1 (Name: MailList, Type: Simple, Sort: Name): for a given KEY it holds columns Name: tid1 … tid4, each with a Value: <Binary> and a TimeStamp: t1 … t4.
• ColumnFamily2 (Name: WordList, Type: Super, Sort: Time): each SuperColumn holds its own set of columns, e.g. "aloha" with columns C1–C4, values V1–V4 and timestamps T1–T4, and "dude" with columns C2, C6, values V2, V6 and timestamps T2, T6.
• ColumnFamily3 (Name: System, Type: Super, Sort: Name): SuperColumns hint1 … hint4, each holding a <Column List>.
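
As a rough mental model only (not Cassandra's actual API), the structures above can be pictured as nested maps; the names MailList, WordList, tid1, aloha, etc. come from the diagram:

    # Illustrative sketch of the data model as nested Python dicts.
    # Simple column family: row key -> column name -> (value, timestamp).
    mail_list = {
        "user_key": {
            "tid1": (b"<binary>", "t1"),
            "tid2": (b"<binary>", "t2"),
        }
    }

    # Super column family: row key -> supercolumn name -> column name -> (value, timestamp).
    word_list = {
        "user_key": {
            "aloha": {"C1": ("V1", "T1"), "C2": ("V2", "T2")},
            "dude":  {"C2": ("V2", "T2"), "C6": ("V6", "T6")},
        }
    }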
Write Operations
• A client issues a write request to a random
  node in the Cassandra cluster.
• The “Partitioner” determines the nodes
  responsible for the data.
• Locally, write operations are logged and
  then applied to an in-memory version.
• Commit log is stored on a dedicated disk
  local to the machine.
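
A minimal sketch of that local write path, with a hypothetical hash-based partitioner (names such as pick_node and Node.write are illustrative, not the real implementation):

    import hashlib

    def pick_node(key, nodes):
        # Hypothetical partitioner: hash the key to a ring position and take
        # the first node clockwise from it.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16) / float(1 << 128)
        return min(nodes, key=lambda n: (n.position - h) % 1.0)

    class Node:
        def __init__(self, position):
            self.position = position
            self.commit_log = []   # append-only log on a dedicated disk
            self.memtable = {}     # in-memory version of the data

        def write(self, key, columns):
            self.commit_log.append((key, columns))              # log first, for durability
            self.memtable.setdefault(key, {}).update(columns)   # then apply in memory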
Write cont’d
• A write for a Key (CF1, CF2, CF3) may touch several column families; each column family has its own Memtable.
• The Commit Log holds the binary-serialized Key (CF1, CF2, CF3) and sits on a dedicated disk local to the machine.
• A Memtable is flushed to a data file on disk based on its data size, number of objects, and lifetime.
• Data file layout: <Key name><Size of key Data><Index of columns/supercolumns><Serialized column family>, repeated for each key.
• A BLOCK Index (<Key Name> Offset, <Key Name> Offset, …) is recorded at every 128th key (K128 Offset, K256 Offset, K384 Offset, …) and kept in memory together with a Bloom Filter over the keys.
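
A sketch of how a memtable flush could produce that layout, with a block-index entry every 128 keys and a Bloom filter kept in memory (the per-key column index is omitted and all names are illustrative):

    import pickle

    SAMPLE_EVERY = 128   # index entries at K128, K256, K384, ...

    def flush_memtable(memtable, data_path):
        """Write keys in sorted order; return (block_index, bloom_keys) to keep in memory."""
        block_index, bloom_keys = {}, set()
        with open(data_path, "wb") as f:
            for i, key in enumerate(sorted(memtable)):
                bloom_keys.add(key)                    # stand-in for a real Bloom filter
                if i % SAMPLE_EVERY == 0:
                    block_index[key] = f.tell()        # <Key Name> -> offset in the data file
                payload = pickle.dumps(memtable[key])  # serialized column family
                f.write(key.encode() + len(payload).to_bytes(4, "big") + payload)
        return block_index, bloom_keys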
Compactions
• Several sorted data files on disk (for example one holding K1, K2, K3, another K2, K10, K30, and another K4, K5, K10, each with <Serialized data>) are combined with a MERGE SORT into a single sorted data file (K1, K2, K3, K4, K5, K10, K30).
• Keys marked DELETED are dropped during the merge.
• A new Index File (K1 Offset, K5 Offset, K30 Offset, …) and Bloom Filter are built for the merged Data File and loaded in memory.
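
A compaction sketch under these assumptions: each input file yields (key, timestamp, value) rows already sorted by key, the newest timestamp wins when a key appears in several files, and tombstoned (deleted) keys are dropped:

    import heapq
    from itertools import groupby

    TOMBSTONE = object()   # marker stored when a key has been deleted

    def compact(sorted_files):
        """sorted_files: iterables of (key, timestamp, value) rows, each sorted by key."""
        merged = heapq.merge(*sorted_files, key=lambda row: row[0])   # merge sort on key
        out = []
        for key, rows in groupby(merged, key=lambda row: row[0]):
            _, ts, value = max(rows, key=lambda row: row[1])          # newest version wins
            if value is not TOMBSTONE:                                # drop deleted keys
                out.append((key, ts, value))
        return out   # sorted rows for the new data file, index file and Bloom filter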
Write Properties
•   No locks in the critical path
•   Sequential disk access
•   Behaves like a write-back cache
•   Append support without read-ahead
•   Atomicity guarantee for a key
• “Always Writable”
    – accept writes during failure scenarios
Read
• The client sends a query to a node in the Cassandra cluster and receives the result from that node.
• The query is forwarded to the closest replica (Replica A), which returns the actual data.
• Digest queries go to the other replicas (Replica B, Replica C); if their digest responses differ from the data returned, a read repair is performed.
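
A sketch of that read path with hypothetical helpers (read, read_digest and repair are stand-ins, not the real API):

    import hashlib

    def digest(value):
        return hashlib.md5(repr(value).encode()).hexdigest()

    def repair(key, replicas):
        # Stub: a real read repair would push the freshest version to stale replicas.
        pass

    def coordinator_read(key, replicas):
        """replicas: node objects ordered by proximity, closest first."""
        closest, others = replicas[0], replicas[1:]
        value = closest.read(key)                          # full data from the closest replica
        for node in others:
            if node.read_digest(key) != digest(value):     # digest query to the other replicas
                repair(key, replicas)                      # read repair if digests differ
                break
        return value                                       # result returned to the client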
Partitioning And Replication
• Nodes (A, B, C, D, E, F) are placed on a consistent-hashing ring covering positions 0 to 1.
• A key is hashed onto the ring (h(key1), h(key2)); the first node encountered walking clockwise from that position is responsible for the key.
• Each key is replicated on the next N-1 nodes along the ring (N = 3 in the figure).
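
A consistent-hashing sketch matching the ring picture (positions on [0, 1), walk clockwise from h(key), place the key on the next N nodes); the hash function and node names are illustrative:

    import bisect
    import hashlib

    def h(key):
        # Map a key (or node name) onto the unit ring [0, 1).
        return int(hashlib.md5(key.encode()).hexdigest(), 16) / float(1 << 128)

    def replicas_for(key, ring, n=3):
        """ring: sorted list of (position, node); returns the N nodes responsible for key."""
        positions = [pos for pos, _ in ring]
        start = bisect.bisect_left(positions, h(key)) % len(ring)   # first node clockwise
        return [ring[(start + i) % len(ring)][1] for i in range(n)]

    ring = sorted((h(name), name) for name in "ABCDEF")
    print(replicas_for("key1", ring))   # the three nodes that store key1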
Cluster Membership and Failure Detection
•   Gossip protocol is used for cluster membership.
•   Super lightweight with mathematically provable properties.
•   State is disseminated in O(log N) rounds, where N is the number of nodes in the cluster.
•   Every T seconds each member increments its heartbeat counter and selects one other member to send its list to.
•   A member merges the received list with its own list (see the sketch below).
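
A minimal sketch of one gossip round as described above (every T seconds: bump your own heartbeat counter, send your membership list to one random peer, and let the receiver merge it):

    import random

    class Member:
        def __init__(self, name):
            self.name = name
            # membership list: member name -> highest heartbeat counter seen
            self.heartbeats = {name: 0}

        def gossip_round(self, peers):
            self.heartbeats[self.name] += 1                      # increment own heartbeat counter
            random.choice(peers).merge(dict(self.heartbeats))    # send the list to one other member

        def merge(self, incoming):
            for name, hb in incoming.items():                    # keep the highest counter per member
                self.heartbeats[name] = max(self.heartbeats.get(name, 0), hb)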
Accrual Failure Detector
•   Valuable for system management, replication, load balancing etc.
•   Defined as a failure detector that outputs a value, PHI, associated
    with each process.
•   Also known as Adaptive Failure detectors - designed to adapt to
    changing network conditions.
•   The value output, PHI, represents a suspicion level.
•   Applications set an appropriate threshold; when PHI crosses it, a suspicion is triggered and the appropriate actions are performed.
•   In Cassandra the average time taken to detect a failure is 10-15
    seconds with the PHI threshold set at 5.
Properties of the Failure Detector
•   If a process p is faulty, the suspicion level Φ(t) → ∞ as t → ∞.
•   If a process p is faulty, there is a time after which Φ(t) is monotonically increasing.
•   A process p is correct ⇒ Φ(t) has an upper bound over an infinite execution.
•   If process p is correct, then for any time T, Φ(t) = 0 for some t >= T.
Implementation
•   PHI estimation is done in three phases
     – Inter arrival times for each member are stored in a sampling
       window.
     – Estimate the distribution of the above inter arrival times.
      – Gossip inter-arrival times are assumed to follow an exponential distribution.
      – The value of PHI is now computed as follows (see the sketch below):
          • Φ(t) = -log10( P(tnow – tlast) ), where P(t) is the probability that a heartbeat arrives more than t time units after the previous one; for an exponential distribution with rate λ, P(t) = e^(-tλ).
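
A sketch of that computation under the stated exponential assumption: λ is estimated from the mean of the sampled inter-arrival times, and Φ grows as the current silence (tnow – tlast) stretches beyond what the samples predict. The class name and window size are illustrative:

    import math
    from collections import deque

    class AccrualFailureDetector:
        def __init__(self, window=1000):
            self.samples = deque(maxlen=window)   # sampling window of inter-arrival times
            self.last = None                      # time of the last heartbeat

        def heartbeat(self, now):
            if self.last is not None:
                self.samples.append(now - self.last)
            self.last = now

        def phi(self, now):
            if not self.samples:
                return 0.0
            lam = len(self.samples) / sum(self.samples)      # rate of the fitted exponential
            p_later = math.exp(-(now - self.last) * lam)     # P(t) = e^(-t*lambda)
            return -math.log10(p_later)

    # A member is suspected once phi(now) exceeds the configured threshold (5 in Cassandra).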
The overall mechanism is described in the figure below.
Information Flow in the Implementation
Performance Benchmark
• Loading of data - limited by network
  bandwidth.
• Read performance for Inbox Search in
  production:

              Search Interactions Term Search
    Min       7.69 ms            7.78 ms
    Median    15.69 ms           18.27 ms
    Average   26.13 ms           44.41 ms
MySQL Comparison
• MySQL > 50 GB Data
  Writes Average : ~300 ms
  Reads Average : ~350 ms
• Cassandra > 50 GB Data
  Writes Average : 0.12 ms
  Reads Average : 15 ms
Lessons Learnt
• Add fancy features only when absolutely
  required.
• Many types of failures are possible.
• Big systems need proper systems-level
  monitoring.
• Value simple designs
Future work
•   Atomicity guarantees across multiple keys
•   Analysis support via Map/Reduce
•   Distributed transactions
•   Compression support
•   Granular security via ACLs
Questions?
