Dynamo: Amazon’s Highly
Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun,
Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan
Sivasubramanian, Peter Vosshall
and Werner Vogels
Motivation
 Build a distributed storage system:
 Scale
 Simple: key-value
 Highly available
 Guarantee Service Level Agreements (SLA)
System Assumptions and Requirements
 Query Model: simple read and write operations to a data
item that is uniquely identified by a key.
 ACID Properties: Dynamo trades strong consistency (the
“C” in ACID) for availability and provides no isolation guarantees.
 Efficiency: latency requirements which are in general
measured at the 99.9th percentile of the distribution.
 Other Assumptions: operation environment is assumed
to be non-hostile and there are no security related requirements
such as authentication and authorization.
Service Level Agreements (SLA)
 Application can deliver its
functionality in a bounded
time: every dependency in the
platform needs to deliver its
functionality with even tighter
bounds.
 Example: service guaranteeing
that it will provide a response within
300ms for 99.9% of its requests for a
peak client load of 500 requests per
second.
Service-oriented architecture of
Amazon’s platform
Design Consideration
 Sacrifice strong consistency for availability
 Conflict resolution is executed during read
instead of write, i.e. “always writeable”.
 Other principles:
 Incremental scalability.
 Symmetry.
 Decentralization.
 Heterogeneity.
Summary of techniques used in Dynamo
and their advantages
Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantees when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids a centralized registry for membership and node liveness information
Partition Algorithm
 Consistent hashing: the output
range of a hash function is treated as a
fixed circular space or “ring”.
 “Virtual Nodes”: each node can
be responsible for more than one
virtual node.
Advantages of using virtual nodes
 If a node becomes unavailable the
load handled by this node is evenly
dispersed across the remaining
available nodes.
 When a node becomes available
again, the newly available node
accepts a roughly equivalent
amount of load from each of the
other available nodes.
 The number of virtual nodes that a
node is responsible for can be
decided based on its capacity,
accounting for heterogeneity in the
physical infrastructure.
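The partitioning scheme on these slides can be sketched in a few lines. Below is a toy consistent-hash ring with virtual nodes, not Dynamo's actual implementation; the node names, the token count per node, and the MD5-prefix placement are illustrative assumptions.

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    # Position on the ring: first 8 bytes of an MD5 digest
    # (an assumption for this sketch)
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node claims several positions ("virtual nodes")
        self.tokens = sorted(
            (ring_pos(f"{node}#vn{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )

    def lookup(self, key: str) -> str:
        # Walk clockwise to the first token at or after the key's position
        i = bisect.bisect_left(self.tokens, (ring_pos(key),)) % len(self.tokens)
        return self.tokens[i][1]

ring = Ring(["A", "B", "C", "D"])
owner = ring.lookup("user:42")
```

Removing a node from the ring only remaps the keys that node owned, which is the incremental-scalability property the summary table refers to.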
Replication
 Each data item is
replicated at N hosts.
 “preference list”: The list of
nodes that is responsible
for storing a particular key.
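A preference list can be derived from the same ring by walking clockwise from the key's position and skipping virtual nodes that map to an already-chosen host. The `preference_list` helper below is a hypothetical sketch, not Dynamo's code:

```python
import bisect
import hashlib

def ring_pos(s: str) -> int:
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

def preference_list(key, tokens, n=3):
    # Hypothetical helper: walk clockwise from the key's position and
    # collect the first N *distinct physical* hosts, skipping extra
    # virtual nodes of hosts already chosen.
    i = bisect.bisect_left(tokens, (ring_pos(key),))
    chosen = []
    while len(chosen) < n:
        node = tokens[i % len(tokens)][1]
        if node not in chosen:
            chosen.append(node)
        i += 1
    return chosen

nodes = ["A", "B", "C", "D"]
tokens = sorted(
    (ring_pos(f"{node}#vn{i}"), node) for node in nodes for i in range(8)
)
plist = preference_list("user:42", tokens)   # N = 3 distinct hosts
```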
Data Versioning
 A put() call may return to its caller before the
update has been applied at all replicas.
 A get() call may return many versions of the
same object.
 Challenge: an object having distinct version sub-histories,
which the system will need to reconcile in the future.
 Solution: uses vector clocks in order to capture causality
between different versions of the same object.
Vector Clock
 A vector clock is a list of (node, counter)
pairs.
 Every version of every object is associated
with one vector clock.
 If every counter on the first object’s clock is
less than or equal to the corresponding
counter in the second clock, then the first is
an ancestor of the second and can be forgotten.
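The ancestor test above reduces to a pointwise comparison of counters. A minimal sketch, assuming clocks are stored as `{node: counter}` dicts with absent entries treated as zero:

```python
def descends(a: dict, b: dict) -> bool:
    # a descends from b iff every counter in b is <= the matching
    # counter in a (a missing entry counts as zero)
    return all(a.get(node, 0) >= counter for node, counter in b.items())

v1 = {"Sx": 1}
v2 = {"Sx": 2, "Sy": 1}   # descends from v1, so v1 can be forgotten
v3 = {"Sx": 2, "Sz": 1}   # neither descends from the other: v2 and v3 conflict
```

When neither clock descends from the other, both versions are kept and handed to the client for semantic reconciliation on a later read.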
Vector clock example
Execution of get() and put()
operations
1. Route its request through a generic load
balancer that will select a node based on
load information.
2. Use a partition-aware client library that
routes requests directly to the appropriate
coordinator nodes.
Sloppy Quorum
 R (W) is the minimum number of nodes that
must participate in a successful read (write)
operation.
 Setting R + W > N yields a quorum-like
system.
 In this model, the latency of a get (or put)
operation is dictated by the slowest of the R
(or W) replicas. For this reason, R and W are
usually configured to be less than N, to
provide better latency.
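Why R + W > N behaves like a quorum can be checked by brute force: every possible read set of size R must intersect every possible write set of size W, so a read always contacts at least one replica that saw the latest write. The (3, 2, 2) values below are one example configuration:

```python
from itertools import combinations

N, R, W = 3, 2, 2        # example configuration
replicas = range(N)

assert R + W > N         # the quorum-like condition from the slide
for write_set in combinations(replicas, W):
    for read_set in combinations(replicas, R):
        # every read set shares at least one replica with every write set
        assert set(write_set) & set(read_set)
```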
Hinted handoff
 Assume N = 3. When A
is temporarily down or
unreachable during a
write, send replica to D.
 D is hinted that the
replica belongs to A, and D
will deliver it back to A once
A recovers.
 Again: “always writeable”
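The handoff on this slide can be sketched as an illustrative simulation of where replicas land, with the hint recorded alongside the stand-in copy; this is not the actual Dynamo protocol:

```python
def write(key, value, preference_list, alive, n=3):
    # Sloppy-quorum write sketch: store each of the N replicas on the
    # first reachable node, walking down the preference list past failed
    # hosts. A copy held by a stand-in carries a hint naming its home.
    intended = preference_list[:n]
    candidates = iter(preference_list)
    stored = []
    for home in intended:
        if home in alive:
            stored.append((home, key, value, None))       # normal replica
        else:
            standin = next(nd for nd in candidates
                           if nd in alive and nd not in intended)
            stored.append((standin, key, value, home))    # hinted replica
    return stored

# N = 3, preference list A..E; A is down during the write, so D holds
# A's replica together with a hint to deliver it back to A on recovery.
placements = write("k", "v", ["A", "B", "C", "D", "E"],
                   alive={"B", "C", "D", "E"})
```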
Other techniques
 Replica synchronization:
 Merkle hash tree.
 Membership and Failure Detection:
 Gossip
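Merkle-tree anti-entropy works because two replicas whose key ranges hold the same data produce the same root hash. A toy root computation follows; a real implementation compares the two trees level by level to localize exactly which ranges diverge, rather than just the roots:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaf_hashes):
    # Pairwise-hash each level until one root remains; an odd last
    # node is promoted by duplicating it.
    level = list(leaf_hashes) or [h(b"")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [h(b"k1=v1"), h(b"k2=v2")]
replica_b = [h(b"k1=v1"), h(b"k2=stale")]
in_sync = merkle_root(replica_a) == merkle_root(replica_b)   # False here
```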
Implementation
 Java
 Local persistence component allows for
different storage engines to be plugged in:
 Berkeley Database (BDB) Transactional Data
Store: for objects up to tens of kilobytes
 MySQL: for objects larger than tens of kilobytes
 BDB Java Edition, etc.
Evaluation