Cassandra on Castle
                                   Tim Moreton
                                   @timmoreton




Saturday, 24 September 2011
Outline

                    • Why Castle?
                    • A [quick] tour of Castle
                    • Cassandra on Castle
                    • An aside into Memcache
                    • Cross-cluster snapshots and clones

Before the Flood
                                       1990

                                   Small databases
                                    BTree indexes
                                  BTree File systems
                                        RAID
                                    Old hardware



Two Revolutions
                                                        2010
                                   Distributed, shared-nothing databases
                              Write-optimised indexes          Write-optimised indexes

                          BTree file systems                    BTree file systems
                                     RAID                ...          RAID
                               New hardware                     New hardware




Bridging the Gap
                                               2011

                                Distributed, shared-nothing databases


                                 Castle                      Castle
                                                 ...
                              New hardware               New hardware




[Figure: Castle architecture, layers from top to bottom. Userspace sits above the Acunu Kernel, which sits above the Linux Kernel.]
   • Userspace interface: shared memory interface (keys, values, async shared memory ring, shared buffers); in-kernel workloads
   • Kernelspace interface: streaming interface (range queries, key insert, buffered value insert, key get, buffered value get)
   • Doubling array mapping layer: doubling arrays (insert queues, arrays range queries, key insert, key get, Bloom filters, arrays management, merges)
   • Modlist btree mapping layer: arrays (key insert, key get, btree range queries, version tree, btree, value arrays)
   • Block mapping & cacheing layer: cache and "Extent" layer (freespace manager, extent allocator & mapper, extent block cache, prefetcher, flusher)
   • Linux's block & MM layers: block layer, page cache, memory manager
Castle
                    • Like ZFS+BDB for Big Data
                    • Open source (GPLv2, MIT for user libraries)
                    • http://bitbucket.org/acunu
                    • Loadable Kernel Module, targeting CentOS’s 2.6.18
                    • http://www.acunu.com/blogs/andy-twigg/why-acunu-kernel/
The Interface
   [Figure: top of the Castle architecture diagram. Userspace interface (shared memory interface: keys, values, async shared memory ring, shared buffers) and kernelspace interface (streaming interface: range queries, key insert, buffered value insert, key get, buffered value get), sitting above the doubling array mapping layer. Source files: castle_{back,objects}.c]
The Interface
   Tree of versions
            •   Create, snapshot, clone
            •   Attach/detach
   Attachment
            •   Keys: any number of dimensions
            •   Values: any size
            •   Simple get, put, delete
            •   Iterator, slice interfaces
            •   Streaming interface
   [Figure: tree of versions rooted at v0, with descendants v1, v3, v12, v13, v15, v16, v24]
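
The tree-of-versions idea above can be illustrated with a toy in-memory model. This is only a sketch under assumptions, not Castle's API: `VersionTree`, `derive`, `put`, `delete` and `get` are hypothetical names, and a single `derive` operation stands in for both snapshot and clone. Each version stores only its own writes, and a lookup walks towards the root until it finds the key.

```python
class VersionTree:
    """Toy model: each version stores only its own writes plus a parent link."""

    _TOMBSTONE = object()  # marks a key as deleted in a given version

    def __init__(self):
        self.parent = {0: None}   # version id -> parent version id
        self.writes = {0: {}}     # version id -> {key: value}
        self._next_id = 1

    def derive(self, parent):
        """Create a child version (hypothetical stand-in for snapshot/clone)."""
        vid = self._next_id
        self._next_id += 1
        self.parent[vid] = parent
        self.writes[vid] = {}
        return vid

    def put(self, version, key, value):
        self.writes[version][key] = value

    def delete(self, version, key):
        self.writes[version][key] = self._TOMBSTONE

    def get(self, version, key):
        """Return the value from the nearest ancestor that wrote key, else None."""
        v = version
        while v is not None:
            if key in self.writes[v]:
                value = self.writes[v][key]
                return None if value is self._TOMBSTONE else value
            v = self.parent[v]
        return None


tree = VersionTree()
tree.put(0, "a", 1)
v1 = tree.derive(0)      # two children of the root version
v2 = tree.derive(0)
tree.put(v1, "a", 2)     # visible only in v1
tree.delete(v2, "a")     # hidden only in v2
```

In this toy, a version should be treated as read-only once children have been derived from it, since later writes to the parent would show through to the children.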


Doubling Array
   [Figure: Castle architecture diagram, highlighting the doubling array mapping layer (insert queues, arrays range queries, key insert, key get, Bloom filters, arrays management, merges) between the kernelspace interface above and the modlist btree mapping layer below. Source files: castle_{da,bloom}.c]
Doubling Array
   Inserts
   [Figure: keys 2 and 9 are inserted and form the sorted array (2, 9)]

   Buffer arrays in memory until we have > B of them
Doubling Array
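   Inserts
   [Figure: keys 11 and 8 are inserted and form the sorted array (8, 11); arrays (2, 9) and (8, 11) merge into (2, 8, 9, 11); etc...]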
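
The insert-and-merge behaviour on these two slides can be sketched as follows. This is a minimal in-memory illustration, not Castle's implementation: the class name `DoublingArray` and its methods are hypothetical, keys are assumed to be distinct, and the buffering of up to B arrays in memory mentioned on the slide is left out. Level i holds either nothing or one sorted run of 2^i keys, and an insert carries upwards like binary addition.

```python
import heapq
from bisect import bisect_left


class DoublingArray:
    """Toy doubling array: levels[i] is None or a sorted run of 2**i keys."""

    def __init__(self):
        self.levels = []

    def insert(self, key):
        run = [key]
        i = 0
        # Merge with each occupied level in turn until a free level is found.
        while i < len(self.levels) and self.levels[i] is not None:
            run = list(heapq.merge(self.levels[i], run))
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = run

    def contains(self, key):
        # Binary-search every occupied level.
        for run in self.levels:
            if run is not None:
                j = bisect_left(run, key)
                if j < len(run) and run[j] == key:
                    return True
        return False


da = DoublingArray()
for k in (2, 9, 11, 8):      # the keys from the slides
    da.insert(k)
print(da.levels)             # [None, None, [2, 8, 9, 11]]
```

Every merge reads and writes whole sorted runs, which is why the update cost is paid in sequential rather than random IO.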




                              Update              Range Query (Size Z)
   B-Tree                     O(log_B N)          O(Z/B)
                              random IOs          random IOs
   Doubling Array             O((log N)/B)        O(Z/B)
                              sequential IOs      sequential IOs

   B = “block size”, say 8KB at 100 bytes/entry ~= 100 entries

   B-Tree:           ~ log (2^30)/log 100 = 5 IOs/update
                     8KB @ 100MB/s, w/ 8ms seek = 100 IOs/s
                     100 / 5 = 20 updates/s
   Doubling Array:   ~ log (2^30)/100 = 0.2 IOs/update
                     8KB @ 100MB/s = 13k IOs/s
                     13k / 0.2 = 65k updates/s
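
The back-of-envelope figures on this slide can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: the variable names are illustrative, natural logarithms are assumed (they give the slide's 0.2 IOs/update), and the slide rounds more aggressively than the code does, so the results agree in order of magnitude rather than digit for digit.

```python
import math

N = 2 ** 30               # number of entries, as on the slide
B = 100                   # entries per 8KB block at ~100 bytes/entry
BLOCK_BYTES = 8 * 1024
BANDWIDTH = 100e6         # 100MB/s transfer rate
SEEK_S = 0.008            # 8ms seek

transfer_s = BLOCK_BYTES / BANDWIDTH

# B-tree: about log_B(N) random IOs per update, each paying a seek.
btree_ios_per_update = math.log(N) / math.log(B)
random_ios_per_s = 1 / (SEEK_S + transfer_s)
btree_updates_per_s = random_ios_per_s / btree_ios_per_update

# Doubling array: about log(N)/B sequential IOs per update, no seek.
da_ios_per_update = math.log(N) / B
sequential_ios_per_s = 1 / transfer_s
da_updates_per_s = sequential_ios_per_s / da_ios_per_update

print(f"B-tree:         {btree_updates_per_s:,.0f} updates/s")
print(f"Doubling array: {da_updates_per_s:,.0f} updates/s")
```

The gap of three to four orders of magnitude comes from two factors: the doubling array needs far fewer IOs per update, and each of its IOs avoids the seek.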

“Mod-list” B-Tree
   [Figure: Castle architecture diagram, highlighting the modlist btree mapping layer (key insert, key get, btree range queries, version tree, btree, value arrays) between the doubling array mapping layer above and the block mapping & cacheing layer below. Source files: castle_{btree,versions}.c]

   So how to do snapshots and clones?
Copy-on-Write BTree
   Idea:
    • Apply path-copying [DSST] to the B-tree
   Problems:
    • Space blowup: each update may rewrite an entire path
    • Slow updates: as above
   A log file system makes updates sequential, but relies on
   random access and garbage collection (Achilles heel!)
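
Path copying, and the space blowup it causes, can be seen in a small sketch. As a simplifying assumption, a plain binary search tree stands in for the B-tree here, and `Node`, `insert` and `contains` are illustrative names. An insert never modifies existing nodes: it copies every node on the root-to-leaf path and shares the rest with the previous version.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return (new_root, nodes_written); the old root stays a valid version."""
    if root is None:
        return Node(key), 1
    if key < root.key:
        child, written = insert(root.left, key)
        if child is root.left:          # key was already present below
            return root, 0
        return Node(root.key, child, root.right), written + 1
    if key > root.key:
        child, written = insert(root.right, key)
        if child is root.right:
            return root, 0
        return Node(root.key, root.left, child), written + 1
    return root, 0                      # key already present at this node


def contains(root, key):
    while root is not None:
        if key == root.key:
            return True
        root = root.left if key < root.key else root.right
    return False


v = None
for k in (5, 3, 8):
    v, _ = insert(v, k)
v3 = v
v4, written = insert(v3, 4)   # copies nodes 5 and 3, adds leaf 4: written == 3
```

A single new key costs one freshly written node per level of the tree, while the untouched subtree (here the node holding 8) is shared between `v3` and `v4`.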


                                      Update              Range Query         Space
   CoW B-Tree                         O(log_B Nv)         O(Z/B)              O(N B log_B Nv)
                                      random IOs          random IOs
   “BigTable” style DA (LevelDB)      O((log N)/B)        O(Z/B)              O(VN)
                                      sequential IOs      sequential IOs
   “Mod-list” in a DA (Castle)        O((log N)/B)        O(Z/B)              O(N)
                                      sequential IOs      sequential IOs

   Nv = #keys live (accessible) at version v


Stratified B-Trees
   • Retires Copy-On-Write B-Trees, the bedrock of modern storage (Sun ZFS, NetApp WAFL, ...)
   • Patent-pending, next-generation data structure
   • Theoretically optimal, yet highly practical
   • http://goo.gl/INTb1
   • http://goo.gl/gzihe

   [Paper excerpt shown on the slide]

   Copy-on-write B-tree finally beaten.

   Andy Twigg∗, Andrew Byde∗, Grzegorz Miłoś∗, Tim Moreton∗, John Wilkes†∗ and Tom Wilkie∗
   ∗Acunu, †Google
   firstname@acunu.com

   Abstract

   A classic versioned data structure in storage and computer science is the copy-on-write (CoW) B-tree – it underlies many of today’s file systems and databases, including WAFL, ZFS, Btrfs and more. Unfortunately, it doesn’t inherit the B-tree’s optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. Yet, nothing better has been developed since. We describe the ‘stratified B-tree’, which beats the CoW B-tree in every way. In particular, it is the first versioned dictionary to achieve optimal tradeoffs between space, query and update performance. Therefore, we believe there is no longer a good reason to use CoW B-trees for versioned data stores.

   1 Introduction

   The B-tree was presented in 1972 [1], and it survives …

   This paper presents some recent results on new constructions for B-trees that go beyond copy-on-write, that we call ‘stratified B-trees’. They solve two open problems: Firstly, they offer a fully-versioned B-tree with optimal space and the same lookup time as the CoW B-tree. Secondly, they are the first to offer other points on the Pareto optimal query/update tradeoff curve, and in particular, our structures offer fully-versioned updates in o(1) IOs, while using linear space. Experimental results indicate 100,000s updates/s on a large SATA disk, two orders of magnitude faster than a CoW B-tree.

   Since stratified B-trees subsume CoW B-trees (and indeed all other known versioned external-memory dictionaries), we believe there is no longer a good reason to use them for versioned data stores. Acunu is developing a commercial in-kernel implementation of stratified B-trees, which we hope to release soon.
Disk Layout: RDA

                    [Architecture diagram, highlighting castle_{cache,extent,freespace,rebuild}.c.
                     Modlist btree mapping layer: btree key insert, btree key get,
                     btree range queries, version tree, value arrays.
                     Block mapping & cacheing layer: "Extent" layer with freespace manager,
                     extent allocator & mapper, extent block cache, page cache, prefetcher,
                     flusher.
                     Linux's block & MM layers: block layer, memory manager.]

Disk Layout: RDA
                                        random duplicate allocation

                               4    2      1    4    5    2    5    3    1    3

                               7    10     7    6    8    9    9    10   6    8

                               15   12     14   11   13   14   11   12   13   15

                                                               16        16
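
The grid above shows 16 chunks, each stored twice across 10 disks, with the two copies of a chunk always on different disks. The following is a minimal Python sketch of one way to produce such a layout. The function name `rda_layout` and the row-by-row shuffle are assumptions made for illustration; the slide does not say how Castle's allocator actually picks disks.

```python
import random


def rda_layout(num_chunks, num_disks, seed=0):
    """Return rows of length num_disks; each chunk id appears exactly twice.

    Hypothetical sketch: chunks are grouped num_disks // 2 at a time,
    every chunk in the group is duplicated, and the copies are shuffled
    across the disks. Within a row each disk holds one slot, so the two
    copies of a chunk always land on different disks.
    """
    if num_disks < 2:
        raise ValueError("need at least two disks to duplicate a chunk")
    rng = random.Random(seed)
    per_row = num_disks // 2
    rows = []
    for start in range(1, num_chunks + 1, per_row):
        chunks = list(range(start, min(start + per_row, num_chunks + 1)))
        # Two copies of each chunk, padded with None for unused slots.
        row = chunks * 2 + [None] * (num_disks - 2 * len(chunks))
        rng.shuffle(row)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in rda_layout(num_chunks=16, num_disks=10):
        print(" ".join("--" if c is None else f"{c:2d}" for c in row))
```

Each printed row corresponds to one row of the slide's grid, and each column to one disk. Losing any single disk therefore leaves one copy of every chunk on some other disk.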




SSD tiering [taster]

                    • Why? Key to random reads on larger-than-cache data
                    • v1: SSD for metadata structures
                     • Redundancy provided by disk
                    • SSD for selected collection data (CFs)
                    • 10x the write rate on SSDs compared with regular file systems

                    [Architecture diagram of the full stack, from userspace down to the Linux kernel.
                     Userspace interface: shared memory interface (async, shared memory ring;
                     shared buffers carrying keys and values). In-kernel workloads attach
                     inside the Acunu Kernel.
                     Kernelspace interface: streaming interface (range queries, key insert,
                     buffered value insert, key get, buffered value get).
                     Doubling array mapping layer: insert queues, Bloom filters, key insert,
                     key get, arrays range queries, arrays management, merges.
                     Modlist btree mapping layer: btree key insert, btree key get,
                     btree range queries, version tree, value arrays.
                     Block mapping & cacheing layer: "Extent" layer with freespace manager,
                     extent allocator & mapper, extent block cache, page cache, prefetcher,
                     flusher.
                     Linux's block & MM layers: block layer, memory manager.]

Cassandra on Castle
                  • Eliminate all ‘storage heavy lifting’
                  • Extend ColumnFamilyStore
                  • Efficient JNI bindings to libcastle C library
                  • row, col, value, t: (row, col) -> (t,value)
                  • row, a|b|c|d, value, t:
                          (row, a, b, c, d, col) -> (t,value)
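
The last two bullets describe how a Cassandra write is flattened into a multi-dimensional Castle key that maps to a (timestamp, value) pair. The sketch below is one reading of that mapping in Python. The names `castle_key`, `put` and `slice_row`, the in-memory `store` dictionary, and the highest-timestamp-wins rule are assumptions for illustration only; they are not the libcastle or JNI API.

```python
# Hypothetical in-memory stand-in for a Castle collection.
store = {}


def castle_key(row, col, composite=None, sep="|"):
    """Build the key tuple: (row, col) or (row, a, b, c, d, col)."""
    parts = tuple(composite.split(sep)) if composite else ()
    return (row, *parts, col)


def put(row, col, value, t, composite=None):
    """Store (t, value) under the key, keeping the newest timestamp.

    Assumption: conflicts resolve as in Cassandra, where the write
    with the highest timestamp wins.
    """
    key = castle_key(row, col, composite)
    current = store.get(key)
    if current is None or t >= current[0]:
        store[key] = (t, value)


def slice_row(row):
    """Return all (key, (t, value)) entries for a row, in key order."""
    return sorted((k, v) for k, v in store.items() if k[0] == row)


if __name__ == "__main__":
    put("user42", "name", "Ada", t=1)
    put("user42", "visits", "7", t=2, composite="2011|09|24|eu")
    for key, (t, value) in slice_row("user42"):
        print(key, "->", (t, value))
```

Because every dimension of the key is kept separate rather than concatenated into one string, a slice over a row (or over a prefix of the composite) becomes a range query over adjacent keys.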


Small random inserts
                    [Chart: inserting 3 billion rows. Series: Acunu powered Cassandra; ‘standard’ Cassandra.]

Insert latency
                    [Chart: latency while inserting 3 billion rows. Series: Acunu powered Cassandra (x); ‘standard’ Cassandra (+).]

Small random range queries
                    [Chart: range queries performed immediately after inserts. Series: Acunu powered Cassandra; ‘standard’ Cassandra.]

Memcache + Cassandra
                    [Diagram: a Cassandra client (get/insert) and a memcached client (get/put)
                     access the same data. On each node, replication logic sits above the
                     Cassandra and memcache interfaces, which both run on Castle on top of
                     the hardware (H/W). Annotation: 100k random inserts/sec!]

v2: Cross-cluster versions
                    • Eventually consistent
                    • Spans data centers
                    • Tolerates node failure, network partition
                    • High performance, no space overhead
                    • Dev/Test/Staging on Prod clusters



So...
                    • Castle = ZFS + BDB for Big Data
                    • Cassandra on Castle runs apps unmodified
                    • Up to 100x throughput under load
                    • No GC pauses: very predictable latencies
                    • v2: Cross-cluster snapshot and clone
                    • SSD optimisation
Questions?


                                        Tim Moreton // @timmoreton



                                   http://goo.gl/INTb1              http://goo.gl/gzihe



 Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and
 elephant logos are trademarks of the Apache Software Foundation.


