PaxosStore: High-availability Storage Made Practical in WeChat
• Powered by the CohAna Engine
WeChat: the new way to connect
Chat, Moments, Contacts, Search, Pay
800 million monthly active users
Applications (frontend)
Services (backend)
Storage (PaxosStore)
Evolution of Storage System in WeChat
1st generation (2011–2015): based on the quorum protocol (NWR)
2nd generation (2015–now): based on the Paxos algorithm
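As background for the first-generation design: in quorum (NWR) replication, N conventionally denotes the number of replicas, W the number of acknowledgements a write waits for, and R the number of replicas a read consults. A minimal sketch of the usual overlap condition follows. The function name and the sample values are illustrative only; the slides do not state the N, W, R settings used in WeChat.

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Return True when every read quorum intersects every write quorum."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("W and R must be between 1 and N")
    # R + W > N guarantees that any R replicas and any W replicas share a member,
    # so a read always touches at least one replica holding the latest write.
    return r + w > n
```

For example, `quorums_overlap(3, 2, 2)` holds, while `quorums_overlap(3, 1, 1)` does not.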
PaxosStore
Application Clients
Programming Model: Key-Value, Table, Queue, Set, ...
Consensus Layer: Paxos-based Storage Protocol
Storage Layer: Bitcask, Main/Delta Table, LSM-tree, ...
Effective & efficient consensus guarantee
Elastic for dynamic workload
Cross-datacenter fault tolerance
Storage Protocol Stack
Consistent Read/Write: data access based on PaxosLog
PaxosLog: each log entry is determined by Paxos
Paxos: determining value with consensus
PaxosStore implements the Paxos procedure using semi-symmetry message passing (read our paper for details)
Prepare phase -- making a preliminary agreement
Accept phase -- reaching the eventual consensus
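The two phases can be illustrated with a textbook single-decree Paxos acceptor. This is a sketch of classic Paxos run in-process, not of PaxosStore's semi-symmetry message passing, which the slides defer to the paper. All class, method, and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass
class Acceptor:
    """Textbook single-decree Paxos acceptor (illustrative names)."""
    promised_no: int = 0                  # highest proposal number promised so far
    accepted_no: int = 0                  # number of the accepted proposal, 0 if none
    accepted_value: Optional[Any] = None  # value of the accepted proposal, if any

    def on_prepare(self, proposal_no: int) -> Tuple[bool, int, Optional[Any]]:
        """Prepare phase: promise to ignore proposals numbered below proposal_no."""
        if proposal_no > self.promised_no:
            self.promised_no = proposal_no
            return True, self.accepted_no, self.accepted_value
        return False, self.accepted_no, self.accepted_value

    def on_accept(self, proposal_no: int, value: Any) -> bool:
        """Accept phase: accept unless a higher-numbered promise was made."""
        if proposal_no >= self.promised_no:
            self.promised_no = proposal_no
            self.accepted_no = proposal_no
            self.accepted_value = value
            return True
        return False


def propose(acceptors: List[Acceptor], proposal_no: int, value: Any) -> Optional[Any]:
    """Run both phases; return the chosen value, or None if no majority is reached."""
    majority = len(acceptors) // 2 + 1
    replies = [a.on_prepare(proposal_no) for a in acceptors]
    promises = [(no, val) for ok, no, val in replies if ok]
    if len(promises) < majority:
        return None
    # If some acceptor already accepted a value, the proposer must adopt the one
    # with the highest proposal number instead of its own value.
    prior_no, prior_value = max(promises, key=lambda p: p[0])
    if prior_no > 0:
        value = prior_value
    accepted = sum(a.on_accept(proposal_no, value) for a in acceptors)
    return value if accepted >= majority else None
```

Once a value is chosen, a later proposal with a higher number converges on the same value rather than overwriting it, which is the property each PaxosLog entry relies on.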
PaxosLog: Entry, Entry, ⋯
Entry: Promise No., Proposal No., Value, Proposer ID, Request ID
Request ID: Timestamp (16 bits), Request Seq. (16 bits), Client ID (32 bits)
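The three request ID fields add up to 64 bits, so a request ID fits in a single 64-bit integer. The sketch below packs and unpacks one. The bit order (client ID in the high bits, then timestamp, then request sequence) is an assumption made for illustration; the slide gives only the field widths. The function names are hypothetical.

```python
from typing import Tuple

CLIENT_ID_BITS = 32
TIMESTAMP_BITS = 16
SEQ_BITS = 16


def pack_request_id(client_id: int, timestamp: int, request_seq: int) -> int:
    """Pack the three fields into one 64-bit integer (assumed field order)."""
    if not 0 <= client_id < (1 << CLIENT_ID_BITS):
        raise ValueError("client_id must fit in 32 bits")
    if not 0 <= timestamp < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp must fit in 16 bits")
    if not 0 <= request_seq < (1 << SEQ_BITS):
        raise ValueError("request_seq must fit in 16 bits")
    return (client_id << (TIMESTAMP_BITS + SEQ_BITS)) | (timestamp << SEQ_BITS) | request_seq


def unpack_request_id(request_id: int) -> Tuple[int, int, int]:
    """Inverse of pack_request_id: return (client_id, timestamp, request_seq)."""
    request_seq = request_id & ((1 << SEQ_BITS) - 1)
    timestamp = (request_id >> SEQ_BITS) & ((1 << TIMESTAMP_BITS) - 1)
    client_id = request_id >> (TIMESTAMP_BITS + SEQ_BITS)
    return client_id, timestamp, request_seq
```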
[Figure: the PaxosLog of a data object r, located by its data key. Log entries ⋯, i − 2, i − 1, i, i + 1 are each marked as chosen or pending.]
PaxosLog-as-Value (for key-value storage)
[Figure: under a data key, the PaxosLog holds the entries r_i and r_(i+1).]
Consistent Read
For a data object r,
1) the system reads its value from any of the up-to-date replicas of r, and
2) these up-to-date replicas must form a majority of all replicas of r.
For read-frequent data, these criteria are likely to be satisfied.
Under data contention, use a trial Paxos procedure to sync replicas. The trial procedure does not correspond to any substantive write operation.
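The two read criteria can be sketched as a check over per-replica versions. This sketch assumes that each replica reports the index of its latest chosen PaxosLog entry, that the replicas with the highest reported index count as up to date, and that "dominate" means a strict majority. The function name is hypothetical.

```python
from typing import Dict, Optional


def pick_consistent_read_replica(versions: Dict[str, int]) -> Optional[str]:
    """Return a replica that is safe to read from, or None if a sync is needed.

    versions maps a replica name to the index of its latest chosen PaxosLog
    entry (an assumed way to compare replica freshness).
    """
    if not versions:
        return None
    newest = max(versions.values())
    up_to_date = sorted(name for name, v in versions.items() if v == newest)
    # Criterion 2: the up-to-date replicas must be a strict majority.
    if len(up_to_date) * 2 > len(versions):
        # Criterion 1: any up-to-date replica may serve the read.
        return up_to_date[0]
    # Otherwise the caller would fall back to a trial Paxos procedure.
    return None
```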
Consistent Write
Relying on the Paxos procedures
Liveness
PaxosLog-entry batched applying
Deployment & Fault Tolerance
Datacenter 1: N_A, N_D
Datacenter 2: N_B, N_E
Datacenter 3: N_C, N_F
[Figure: Paxos links connect the three datacenters.]
Failures in WeChat Production
[Figure: the same deployment, with the nodes grouped into mini-clusters.]
[Figure: the same deployment, highlighting the data hosted by N_A.]
[Figure: the same deployment, with queries shown.]
Data Recovery
Recovery starts
Incremental PaxosLog entries exist?
  Yes: recover through PaxosLog
  No: data object is append-only?
    Yes: recover through delta updates of data image
    No: recover through whole data image
Recovery time decreases from whole data image, to delta updates, to PaxosLog.
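One reading of the recovery decision flow above, written as a small function. The mapping of the Yes/No branches to the three recovery methods is an interpretation of the flattened flowchart, and the names are hypothetical.

```python
from enum import Enum


class RecoveryMethod(Enum):
    PAXOS_LOG = "recover through PaxosLog"
    DELTA_UPDATES = "recover through delta updates of data image"
    WHOLE_IMAGE = "recover through whole data image"


def choose_recovery_method(incremental_log_entries_exist: bool,
                           append_only: bool) -> RecoveryMethod:
    """Pick the cheapest applicable recovery method for a stale replica."""
    # Replaying the missing PaxosLog entries is preferred when they still exist.
    if incremental_log_entries_exist:
        return RecoveryMethod.PAXOS_LOG
    # An append-only object only needs the part of the data image it is missing.
    if append_only:
        return RecoveryMethod.DELTA_UPDATES
    # Otherwise the whole data image has to be copied.
    return RecoveryMethod.WHOLE_IMAGE
```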
Lazy Recovery
Obsolete data replicas are not recovered immediately upon node recovery, but are recovered when they are subsequently accessed.
Failover reads
De-duplicated processing
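The lazy recovery idea can be sketched as a store that marks its replicas stale when the node comes back and refreshes each one only on first access. This is a toy in-memory illustration, not PaxosStore's implementation; the class name and the `fetch_latest` callback (standing in for fetching the current value from peer replicas) are hypothetical.

```python
from typing import Any, Callable, Dict, Set


class LazyReplicaStore:
    """Toy store that defers replica recovery until a key is read."""

    def __init__(self, fetch_latest: Callable[[str], Any]) -> None:
        self._fetch_latest = fetch_latest  # hypothetical: asks peers for the latest value
        self._data: Dict[str, Any] = {}
        self._stale: Set[str] = set()

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value
        self._stale.discard(key)

    def mark_all_stale(self) -> None:
        """Called when the node resumes: nothing is recovered yet."""
        self._stale = set(self._data)

    def get(self, key: str) -> Any:
        # Recover the replica only when it is actually accessed.
        if key in self._stale:
            self._data[key] = self._fetch_latest(key)
            self._stale.discard(key)
        return self._data[key]
```

Keys that are never read after the node resumes incur no recovery traffic in this sketch.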
Implementation
• Use coroutines to program asynchronous procedures in the synchronous paradigm
libco: https://github.com/Tencent/libco
Much more efficient than Boost.Coroutine, while easy to use
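To illustrate what "asynchronous procedures in the synchronous paradigm" looks like, here is a small sketch using Python's standard `asyncio` coroutines. It does not use or mirror libco's API (libco is a C/C++ library); the function names, node names, and delays are made up for the example.

```python
import asyncio
from typing import List


async def fetch_replica(name: str, delay: float) -> str:
    # Stand-in for a network call; awaiting yields control to other coroutines.
    await asyncio.sleep(delay)
    return f"reply from {name}"


async def query_all() -> List[str]:
    # Reads top to bottom like synchronous code, with no callbacks,
    # yet the three simulated requests run concurrently.
    replies = await asyncio.gather(
        fetch_replica("N_B", 0.02),
        fetch_replica("N_C", 0.01),
        fetch_replica("N_E", 0.03),
    )
    return list(replies)
```

Calling `asyncio.run(query_all())` returns the replies in the order the requests were listed, regardless of which simulated request finishes first.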
Failure Recovery in WeChat Production
• Read/Write ratio is 15:1 on average
Failure happens at 14:20
Node resumes at 15:27
Restored to 95% of normal throughput within 3 minutes
Summary
• What is covered in the paper
– The design of PaxosStore, with emphasis on the construction of the consistent read/write protocol
– Fault-tolerant scheme and data recovery strategies
– Pragmatic optimizations that come from our engineering practice
• Key lessons learned
– Apart from faults and failures, system overload is also a critical factor that affects system availability
o In particular, the potential avalanche effect caused by overload deserves close attention when designing the system's fault-tolerant scheme.
– Use coroutines and socket hooks to program asynchronous procedures in a pseudo-synchronous style
o This helps eliminate error-prone function callbacks and simplifies the implementation of asynchronous logic.
Thank You ALL!
https://github.com/tencent/paxosstore
