Dynamo: not just datastores
          Susan Potter




         September 2011
Overview: In a nutshell




      Figure: "a highly available key-value storage system"
Overview: Not for all apps
Overview: Agenda


  Distribution & consistency
  t-Fault tolerance
  N, R, W
  Dynamo techniques
  Riak abstractions
  Building an application
  Considerations
  Oh, the possibilities!
Distributed models & consistency
     Are you on ACID?
     Are your apps ACID?
     Few apps require
     strong consistency
     Embrace BASE
     for high availability
Distributed models & consistency



                   BASE
 Basically Available Soft-state Eventual consistency

                  </RANT>
t-Fault tolerance: Two kinds of failure
t-Fault tolerance: More on failure


     Failstop
     fail in well understood / deterministic ways

     t-Fault tolerance means t + 1 nodes

     Byzantine
     fail in arbitrary non-deterministic ways

     t-Fault tolerance means 2t + 1 nodes
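
     The replica counts above can be sketched as follows. This is a minimal
     illustration, assuming replicated services whose outputs are compared:
     fail-stop replicas never answer wrongly, while up to t Byzantine
     replicas may answer arbitrarily. The names replicas_needed and vote are
     hypothetical and do not come from the talk.

```python
from collections import Counter

def replicas_needed(t, byzantine=False):
    """Minimum replicas to tolerate t faulty ones (t + 1 or 2t + 1, as on the slide)."""
    return 2 * t + 1 if byzantine else t + 1

def vote(outputs, t):
    """Return the value reported by at least t + 1 replicas, or None if there is none.

    With 2t + 1 replicas and at most t faulty ones, the correct value always
    collects t + 1 votes, while any single wrong value collects at most t.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= t + 1 else None

# Fail-stop, t = 2: three replicas suffice, since any survivor can be trusted.
assert replicas_needed(2) == 3
# Byzantine, t = 2: five replicas, two of which report arbitrary values.
assert replicas_needed(2, byzantine=True) == 5
assert vote(["ok", "ok", "bad", "ok", "evil"], t=2) == "ok"
```

     The extra t replicas in the Byzantine case exist only so that correct
     answers can outvote wrong ones; with fail-stop failures there is nothing
     to outvote.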
CAP Controls: N, R, W

    N = number of replicas
    must be 1 ≤ N ≤ nodes

    R = number of responding read nodes
    can be set per request

    W = number of responding write nodes
    can be set per request

    Q = quorum "majority rules"
    set in stone: Q = ⌊N/2⌋ + 1

    Tunability
    when R = W = N ⇒ strong consistency

    when R + W > N ⇒ quorum, e.g. W = 1, R = N; or W = N, R = 1; or W = R = Q
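
    The tunability rules above can be checked with a few lines of
    arithmetic. This is a sketch only: quorum and reads_see_latest_write
    are hypothetical helper names, and integer division stands in for the
    floor in Q = ⌊N/2⌋ + 1.

```python
def quorum(n):
    """Majority of n replicas: floor(n / 2) + 1."""
    return n // 2 + 1

def reads_see_latest_write(n, r, w):
    """True when every read set must overlap every write set (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n

n = 3
q = quorum(n)
assert q == 2
assert reads_see_latest_write(n, r=q, w=q)      # W = R = Q
assert reads_see_latest_write(n, r=n, w=1)      # W = 1, R = N
assert reads_see_latest_write(n, r=1, w=n)      # W = N, R = 1
assert not reads_see_latest_write(n, r=1, w=1)  # fast, but reads may be stale
```

    Because R and W can be set per request, an application can pay for
    overlap only on the requests that need it.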
Dynamo: Properties

    Decentralized
    no masters exist

    Homogeneous
    nodes have same capabilities

    No Global State
    no SPOFs for global state

    Deterministic Replica Placement
    each node can calculate where replicas should exist for key

    Logical Time
    No reliance on physical time
Dynamo: Techniques


       Consistent Hashing

       Vector Clocks

       Gossip Protocol

       Hinted Handoff
Dynamo: Consistent hashing




         Figure: PrefsList for K is [C, D, E]
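
         A preference list like the one in the figure can be computed by any
         node from the ring alone. The sketch below is an assumption-laden
         simplification: it hashes names with SHA-1 and gives each node a
         single position on the ring, which is not riak_core's actual
         partition scheme. Ring, ring_position and preference_list are
         hypothetical names.

```python
import bisect
import hashlib

def ring_position(name):
    """Map a string onto the ring as an integer (SHA-1 is an assumption here)."""
    return int(hashlib.sha1(name.encode("utf-8")).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # One position per node keeps the sketch short; real systems use many.
        self._tokens = sorted((ring_position(node), node) for node in nodes)
        self._positions = [pos for pos, _ in self._tokens]

    def preference_list(self, key, n):
        """Return the first n distinct nodes clockwise from the key's position."""
        n = min(n, len(self._tokens))
        start = bisect.bisect_right(self._positions, ring_position(key))
        return [self._tokens[(start + i) % len(self._tokens)][1] for i in range(n)]

ring = Ring(["A", "B", "C", "D", "E"])
prefs = ring.preference_list("K", 3)
assert len(prefs) == 3 and len(set(prefs)) == 3
# Any node holding the same ring computes the same list, with no coordination.
assert prefs == Ring(["E", "D", "C", "B", "A"]).preference_list("K", 3)
```

         Which three nodes come back depends on the hash values, so the
         result will not necessarily be [C, D, E] as drawn in the figure.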
Dynamo: Vector clocks



   -type vclock() :: [{actor(), counter()}].

         A occurred before B (or equals B) if, for each actor, the counter
         in A is less than or equal to the counter in B
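
         The comparison rule above can be sketched with dictionaries in
         place of the Erlang list of {actor, counter} pairs. This is an
         illustration, not riak_core's vclock module: descends, compare and
         increment are hypothetical names, and an actor missing from a clock
         is assumed to have counter 0.

```python
def descends(b, a):
    """True if clock b has seen everything clock a has (a <= b for each actor)."""
    return all(b.get(actor, 0) >= count for actor, count in a.items())

def compare(a, b):
    """Classify two vector clocks given as {actor: counter} dicts."""
    a_le_b, b_le_a = descends(b, a), descends(a, b)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a occurred before b
    if b_le_a:
        return "after"       # b occurred before a
    return "concurrent"      # neither descends from the other: reconcile

def increment(clock, actor):
    """Return a copy of clock with the actor's counter bumped by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated

v1 = increment({}, "x")   # {"x": 1}
v2 = increment(v1, "y")   # {"x": 1, "y": 1}
v3 = increment(v1, "z")   # {"x": 1, "z": 1}
assert compare(v1, v2) == "before"
assert compare(v2, v1) == "after"
assert compare(v2, v3) == "concurrent"
```

         The concurrent case is the interesting one: both updates descend
         from v1 but not from each other, so neither can be discarded
         without application-level reconciliation.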
Dynamo: Gossip protocol




        Figure:   "You would never guess what I heard down the pub..."
Dynamo: Hinted handoff




       Figure:   "Here, take the baton and take over from me. KTHXBAI"
riak_core: Abstractions

         Coordinator
         enforces consistency requirements, performing anti-entropy, gen_fsm, coordinates vnodes

         VNodes
         Erlang process, vnode to hashring partition, delegated work for its partition, unit of replication

         Watchers
         gen_event, listens for ring and service events, used to calculate fallback nodes

         Ring Manager
         stores local node copy of ring (and cluster?) data

         Ring Event Handlers
         notified about ring (and cluster?) changes and broadcasts plus metadata changes
riak_core: Coordinator




             Figure:   Header of a coordinator
riak_core: Commands



  -type riak_cmd() :: {verb(), key(), payload()}.

  Implement a handle_command/3 clause for each command in your VNode callback module
riak_core: VNodes




        Figure:   Sample handoff functions from RTS example app
riak_core: Project Structure

         rebar ;)
         also written by the Basho team, makes OTP building and deploying much less painful



         dependencies
         add your_app, riak_core, riak_kv, etc. as dependencies to shell project



         new in 1.0 stuff
         cluster vs ring membership, riak_pipe, etc.
Considerations

                 Other:stuff()
     Interface layer
     riak_core apps need to implement their own interface layer, e.g. HTTP, XMPP, AMQP, MsgPack, ProtoBuffers



     Securing application
     riak_core gossip does not address identity/authZ/authN between nodes; relies on Erlang cookies



     Distribution models
     e.g. pipelined, laned vs tiered



     Query/execution models
     e.g. map-reduce (M/R), ASsociated SET (ASSET)
Oh, the possibilities!

                    What:next()
     Concurrency models
     e.g. actor, disruptor, evented/reactor, threading



     Consistency models
     e.g. vector-field, causal, FIFO



     Computation optimizations
     e.g. General Purpose GPU programming, native interfacing (NIFs, JInterface, Scalang?)



     Other optimizations
     e.g. Client discovery, Remote Direct Memory Access (RDMA)
# finger $(whoami)


Login: susan               Name: Susan Potter
Directory: /home/susan     Shell: /bin/zsh
On since Mon 29 Sep 1997 21:18 (GMT) on tty1 from :0
Too much unread mail on me@susanpotter.net
Now working at Assistly! Looking for smart developers!;)
Plan:
  github: mbbx6spp
  twitter: @SusanPotter
Slides & Material

 https://github.com/mbbx6spp/riak_core-templates




      Figure:   http://susanpotter.net/talks/strange-loop/2011/dynamo-not-just-for-datastores/
Examples


       Rebar Templates
       https://github.com/mbbx6spp/riak_core-templates

       Riak Pipes
       https://github.com/basho/riak_pipe

       Riak Zab
       https://github.com/jtuple/riak_zab
Questions?




         Figure:   http://www.flickr.com/photos/42682395@N04/




         @SusanPotter

Dynamo: Not Just For Datastores

  • 1. Dynamo: not just datastores Susan Potter September 2011
  • 2. Overview: In a nutshell Figure: "a highly available key-value storage system"
  • 3. Overview: Not for all apps
  • 4. Overview: Agenda Distribution & consistency Riak abstractions t - Fault tolerance Building an application N, R, W Considerations Dynamo techniques Oh, the possibilities!
  • 5. Distributed models & consistency Are you on ACID? Are your apps ACID? Few apps require strong consistency Embrace BASE for high availability
  • 6. Distributed models & consistency Are you on ACID? Are your apps ACID? Few apps require strong consistency Embrace BASE for high availability
  • 7. Distributed models & consistency Are you on ACID? Are your apps ACID? Few apps require strong consistency Embrace BASE for high availability
  • 8. Distributed models & consistency Are you on ACID? Are your apps ACID? Few apps require strong consistency Embrace BASE for high availability
  • 9. Distributed models & consistency BASE Basically Available Soft-state Eventual consistency
  • 10. Distributed models & consistency BASE Basically Available Soft-state Eventual consistency </RANT>
  • 11. t-Fault tolerance: Two kinds of failure
  • 12. t-Fault tolerance: More on failure Failstop fail in well understood / deterministic ways Byzantine fail in arbitrary non-deterministic ways
  • 13. t-Fault tolerance: More on failure Failstop fail in well understood / deterministic ways t-Fault tolerance means t + 1 nodes Byzantine fail in arbitrary non-deterministic ways
  • 14. t-Fault tolerance: More on failure Failstop fail in well understood / deterministic ways t-Fault tolerance means t + 1 nodes Byzantine fail in arbitrary non-deterministic ways
  • 15. t-Fault tolerance: More on failure Failstop fail in well understood / deterministic ways t-Fault tolerance means t + 1 nodes Byzantine fail in arbitrary non-deterministic ways t-Fault tolerance means 2t + 1 nodes
  • 16. CAP Controls: N, R, W N = number of replicas must be 1 ≤ N ≤ nodes R = number of responding read nodes can be set per request W = number of responding write nodes can be set per request Q = quorum "majority rules" set in stone: Q = N/2 + 1 Tunability when R = W = N =⇒ strong when R + W > N =⇒ quorum where W = 1, R = N or W = N, R = 1 or W = R = Q
  • 17. CAP Controls: N, R, W N = number of replicas must be 1 ≤ N ≤ nodes R = number of responding read nodes can be set per request W = number of responding write nodes can be set per request Q = quorum "majority rules" set in stone: Q = N/2 + 1 Tunability when R = W = N =⇒ strong when R + W > N =⇒ quorum where W = 1, R = N or W = N, R = 1 or W = R = Q
  • 18. CAP Controls: N, R, W N = number of replicas must be 1 ≤ N ≤ nodes R = number of responding read nodes can be set per request W = number of responding write nodes can be set per request Q = quorum "majority rules" set in stone: Q = N/2 + 1 Tunability when R = W = N =⇒ strong when R + W > N =⇒ quorum where W = 1, R = N or W = N, R = 1 or W = R = Q
  • 19. CAP Controls: N, R, W N = number of replicas must be 1 ≤ N ≤ nodes R = number of responding read nodes can be set per request W = number of responding write nodes can be set per request Q = quorum "majority rules" set in stone: Q = N/2 + 1 Tunability when R = W = N =⇒ strong when R + W > N =⇒ quorum where W = 1, R = N or W = N, R = 1 or W = R = Q
  • 20. CAP Controls: N, R, W N = number of replicas must be 1 ≤ N ≤ nodes R = number of responding read nodes can be set per request W = number of responding write nodes can be set per request Q = quorum "majority rules" set in stone: Q = N/2 + 1 Tunability when R = W = N =⇒ strong when R + W > N =⇒ quorum where W = 1, R = N or W = N, R = 1 or W = R = Q
  • 21. Dynamo: Properties Decentralized no masters exist Homogeneous nodes have same capabilities No Global State no SPOFs for global state Deterministic Replica Placement each node can calculate where replicas should exist for key Logical Time No reliance on physical time
  • 22. Dynamo: Properties Decentralized no masters exist Homogeneous nodes have same capabilities No Global State no SPOFs for global state Deterministic Replica Placement each node can calculate where replicas should exist for key Logical Time No reliance on physical time
  • 23. Dynamo: Properties Decentralized no masters exist Homogeneous nodes have same capabilities No Global State no SPOFs for global state Deterministic Replica Placement each node can calculate where replicas should exist for key Logical Time No reliance on physical time
  • 24. Dynamo: Properties Decentralized no masters exist Homogeneous nodes have same capabilities No Global State no SPOFs for global state Deterministic Replica Placement each node can calculate where replicas should exist for key Logical Time No reliance on physical time
  • 25. Dynamo: Properties Decentralized no masters exist Homogeneous nodes have same capabilities No Global State no SPOFs for global state Deterministic Replica Placement each node can calculate where replicas should exist for key Logical Time No reliance on physical time
  • 26. Dynamo: Techniques Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoff
  • 27. Dynamo: Techniques Consistent Hashing Vector Clocks Gossip Protocol Hinted Handoff
  • 28. Dynamo: Consistent hashing Figure: PrefsList for K is [C, D, E]
  • 29. Dynamo: Vector clocks -type vclock() :: [{actor(), counter()}] A occurred-before B if counters in A are less than or equal to those in B for each actor
  • 30. Dynamo: Gossip protocol Figure: "You would never guess what I heard down the pub..."
  • 31. Dynamo: Hinted handoff Figure: "Here, take the baton and take over from me. KTHXBAI"
  • 32. riak_core: Abstractions Coordinator enforces consistency requirements, performing anti-entropy, gen_fsm, coordinates vnodes VNodes Erlang process, vnode to hashring partition, delegated work for its partition, unit of replication Watchers gen_event, listens for ring and service events, used to calculate fallback nodes Ring Manager stores local node copy of ring (and cluster?) data Ring Event Handlers notified about ring (and cluster?) changes and broadcasts plus metadata changes
  • 37. riak_core: Coordinator Figure: Header of a coordinator
  • 38. riak_core: Commands. -type riak_cmd() :: {verb(), key(), payload()}. Implement a handle_command/3 clause for each command in your VNode callback module.
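A minimal sketch of what "one clause per command" might look like, assuming a toy key-value vnode. The module name `kv_vnode_sketch`, the verbs `put`/`get`/`delete`, and the map-based state are assumptions for illustration; only the `riak_cmd()` shape and the `handle_command/3` callback name come from the slide. The module deliberately does not depend on riak_core so it compiles on its own; a real vnode would also declare `-behaviour(riak_core_vnode)` and implement the remaining callbacks.

```erlang
-module(kv_vnode_sketch).
-export([init/1, handle_command/3]).
-export_type([riak_cmd/0]).

%% Command shape from the slide; the concrete verbs are assumptions.
-type verb()     :: put | get | delete.
-type key()      :: term().
-type payload()  :: term().
-type riak_cmd() :: {verb(), key(), payload()}.

%% State is a plain map: the partition index plus an in-memory store.
init([Partition]) ->
    {ok, #{partition => Partition, store => #{}}}.

%% One clause per command; each returns {reply, Reply, NewState}.
-spec handle_command(riak_cmd(), term(), map()) -> {reply, term(), map()}.
handle_command({put, Key, Value}, _Sender, State = #{store := Store}) ->
    {reply, ok, State#{store := Store#{Key => Value}}};
handle_command({get, Key, _Payload}, _Sender, State = #{store := Store}) ->
    {reply, maps:get(Key, Store, not_found), State};
handle_command({delete, Key, _Payload}, _Sender, State = #{store := Store}) ->
    {reply, ok, State#{store := maps:remove(Key, Store)}}.
```

Pattern matching on the verb in the function head keeps each command's logic isolated, and an unknown command fails fast with a `function_clause` error rather than being silently ignored.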
  • 40. riak_core: VNodes Figure: Sample handoff functions from RTS example app
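The figure itself did not survive in this transcript. As a stand-in, here is a hedged sketch of four of the vnode handoff callbacks; it is not the code from the RTS example app. The module name `handoff_sketch` and the use of a plain map as vnode state are assumptions, and the callback names and return shapes follow the riak_core_vnode behaviour as I understand it. The module has no riak_core dependency so that it compiles standalone.

```erlang
-module(handoff_sketch).
-export([is_empty/1, encode_handoff_item/2, handle_handoff_data/2, delete/1]).

%% The vnode state is assumed to be a plain map of Key => Value.

%% Tells the handoff machinery whether there is anything to transfer.
is_empty(Store) ->
    {maps:size(Store) =:= 0, Store}.

%% Sender side: serialize one key/value pair for the wire.
encode_handoff_item(Key, Value) ->
    term_to_binary({Key, Value}).

%% Receiver side: decode one item and merge it into local state.
handle_handoff_data(Bin, Store) ->
    {Key, Value} = binary_to_term(Bin),
    {reply, ok, Store#{Key => Value}}.

%% Called on the sender once handoff completes: drop the local data.
delete(_Store) ->
    {ok, #{}}.
```

A full vnode would also implement `handoff_starting/2`, `handoff_cancelled/1`, `handoff_finished/2`, and `handle_handoff_command/3`; those are omitted here to keep the sketch short.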
  • 41. riak_core: Project Structure. rebar ;) : also written by the Basho team; makes building and deploying OTP applications much less painful. Dependencies: add your_app, riak_core, riak_kv, etc. as dependencies of the shell project. New in 1.0: cluster vs. ring membership, riak_pipe, etc.
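As a rough illustration of the dependency step, a rebar (version 2 syntax) `rebar.config` for the shell project might contain something like the fragment below. The revision `"HEAD"` is an assumption made only to keep the example short; a real project would pin a specific tag or commit, and would list your_app and any other dependencies in the same way.

```erlang
%% rebar.config for the shell project (rebar 2 syntax).
%% "HEAD" is a placeholder revision; pin a tag or commit in practice.
{deps, [
    {riak_core, ".*",
     {git, "https://github.com/basho/riak_core.git", "HEAD"}}
    %% your_app, riak_kv, etc. would be added here as further entries.
]}.
```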
  • 44. Considerations: Other:stuff(). Interface layer: riak_core apps need to implement their own interface layer, e.g. HTTP, XMPP, AMQP, MsgPack, ProtoBuffers. Securing the application: riak_core gossip does not address identity/authZ/authN between nodes; it relies on Erlang cookies. Distribution models: e.g. pipelined, laned vs. tiered. Query/execution models: e.g. map-reduce (M/R), ASsociated SET (ASSET).
  • 48. Oh, the possibilities! What:next(). Concurrency models: e.g. actor, disruptor, evented/reactor, threading. Consistency models: e.g. vector-field, causal, FIFO. Computation optimizations: e.g. General Purpose GPU programming, native interfacing (NIFs, JInterface, Scalang?). Other optimizations: e.g. client discovery, Remote Direct Memory Access (RDMA).
  • 52. # finger $(whoami) Login: susan Name: Susan Potter Directory: /home/susan Shell: /bin/zsh On since Mon 29 Sep 1997 21:18 (GMT) on tty1 from :0 Too much unread mail on me@susanpotter.net Now working at Assistly! Looking for smart developers!;) Plan: github: mbbx6spp twitter: @SusanPotter
  • 53. Slides & Material: https://github.com/mbbx6spp/riak_core-templates Figure: http://susanpotter.net/talks/strange-loop/2011/dynamo-not-just-for-datastores/
  • 54. Examples. Rebar Templates: https://github.com/mbbx6spp/riak_core-templates Riak Pipes: https://github.com/basho/riak_pipe Riak Zab: https://github.com/jtuple/riak_zab
  • 55. Questions? Figure: http://www.flickr.com/photos/42682395@N04/ @SusanPotter