APACHE PEGASUS (INCUBATING) - A DISTRIBUTED KEY-VALUE STORAGE SYSTEM
Yuchen He & Shuo Jia
Software engineers at Xiaomi, Apache Pegasus PPMC members
Speakers
Yuchen He
• Graduated from Renmin University of China
• Software engineer at Xiaomi
• Pegasus project leader at Xiaomi
• Apache Pegasus PPMC member
Shuo Jia
• Graduated from Beijing Jiaotong University
• Software engineer at Xiaomi
• Apache Pegasus PPMC member
• Has worked on Pegasus development for 2 years
Outline
• Basic Introduction
– Architecture, Data Model, Dual WAL, Performance
• New Features
– Duplication, Bulk load, Access control, Partition split
• Surrounding Ecosystems
– Pegasus-Spark, Meta proxy, Disk Migration tools
• Community
Basic Introduction
Introduction
• Redis or HBase
– Non-Volatile vs Consistent
– Remote Access
• Pegasus
– C++
– Local persistent storage
– Strongly consistent
– High performance
– Horizontally scalable
Architecture
Meta server
• Cluster controller
• Configuration manager
Replica server
• Data node
• Hash partitioning
• PacificA (strongly consistent)
• RocksDB instance for each replica
Zookeeper
• Meta server election
• Metadata storage
ClientLib
• Caches the data routing table
• Accesses replica servers directly
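As a rough illustration of how hash partitioning and a client-side routing cache could fit together, the sketch below maps a key to a partition and then looks up that partition's replica server locally, without asking the meta server. The CRC32-modulo hash, the `RoutingCache` class, and the server names are all assumptions made for this example; the slides name neither the hash function nor the client API.

```python
import zlib


def partition_index(hash_key: bytes, partition_count: int) -> int:
    """Map a key to a partition index. CRC32 is a stand-in hash (assumption)."""
    return zlib.crc32(hash_key) % partition_count


class RoutingCache:
    """Hypothetical client-side cache: partition index -> primary replica server."""

    def __init__(self, primaries):
        # primaries[i] is the server currently holding the primary of partition i.
        self.primaries = list(primaries)

    def locate(self, hash_key: bytes) -> str:
        # Resolved from the local cache, so the request can go straight
        # to the replica server.
        return self.primaries[partition_index(hash_key, len(self.primaries))]


if __name__ == "__main__":
    cache = RoutingCache(["server-a", "server-b", "server-c", "server-d"])
    print(cache.locate(b"user42"))
```

A real client would also need to refresh the cached table when a replica server reports that it no longer holds the partition; that part is left out here.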
Data Model
Dual WAL
[Diagram: a client writes to Replica1, Replica2, and Replica3; each replica keeps its Data and its Log on the same disk]
Traditional solution
• Data background compaction may strongly affect WAL sync performance
Dual WAL
[Diagram: Replica1, Replica2, and Replica3 each keep Data and a Private Log on the data disk; a single Shared Log sits on a separate log disk]
• Separate the WAL from the data: write the shared log synchronously and the private logs asynchronously
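The write path described here can be sketched as follows: every write is appended to one shared log and synced before it is acknowledged, while each replica's private log is filled in later by a background flush. This is a minimal sketch under stated assumptions, not Pegasus's implementation; the `DualWal` class, the file names, and the newline-delimited record format are hypothetical.

```python
import os
import tempfile


class DualWal:
    """Hypothetical dual-WAL writer: sync shared log, async private logs."""

    def __init__(self, log_dir: str, data_dir: str):
        # The shared log lives on the log disk, private logs on the data disk.
        self.shared = open(os.path.join(log_dir, "shared.log"), "ab")
        self.data_dir = data_dir
        self.pending = {}  # replica name -> records not yet in its private log

    def write(self, replica: str, record: bytes) -> None:
        # Synchronous part: the record is durable in the shared log on return.
        self.shared.write(record + b"\n")
        self.shared.flush()
        os.fsync(self.shared.fileno())
        # Asynchronous part: only queue the record for the private log.
        self.pending.setdefault(replica, []).append(record)

    def flush_private(self) -> None:
        # Would run in the background in a real system (assumption).
        for replica, records in self.pending.items():
            path = os.path.join(self.data_dir, replica + ".private.log")
            with open(path, "ab") as f:
                f.writelines(r + b"\n" for r in records)
        self.pending.clear()


if __name__ == "__main__":
    wal = DualWal(tempfile.mkdtemp(), tempfile.mkdtemp())
    wal.write("replica1", b"set k v")
    wal.flush_private()
```

Because only the shared log is on the synchronous path, compaction traffic on the data disk does not sit between a client and its acknowledgement.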
Performance
Read:Write  Client*Thread  Op     QPS     Avg latency (us)  P99 latency (us)
0:1         3*15           read   ---     ---               ---
0:1         3*15           write  46128   972               5591
1:0         3*50           read   282648  542               1674
1:0         3*50           write  ---     ---               ---
1:1         3*30           read   36014   1068              15345
1:1         3*30           write  36016   1421              8197
1:3         3*15           read   11622   779               10417
1:3         3*15           write  34989   1021              5467
2.2.0 (newest release) benchmark
New Features
Duplication
Basic introduction
• Designed for cross-region online backup
• Transfers logs and writes them asynchronously
• Supports single-master and multi-master
[Diagram: a table in Region1 and a table in Region2 linked by async-duplication]
Duplication
Case1: Online Migration
[Diagram: a client, a table in the source cluster, a table in the target cluster, and remote storage]
1. Reserve logs
2. Cold backup
3. Restore
4. Duplication
5. Switch
Duplication
Case2: Master-Slave cluster
[Diagram: clients, a master-region table duplicated to a slave-region table, and eventually consistent reads; Region1 and Region2]
Duplication
Future enhancements
• Master-master duplication in practice
• Duplication across more than two regions in practice
• Facilities for a remote disaster-tolerant system
• Automatic master/slave switch
• Better user experience
• Extension:
– CDC support on demand
– e.g., ES, MQ…
Bulk Load
Fast offline import of large amounts of data
[Diagram: original data is converted into SST files on a file provider; replica servers download the files and ingest them into the table]
1. Generate files
2. Download files
3. Ingest files
• Client access: read/write, with writes rejected during ingestion
Access Control
Authentication: Kerberos
Authorization: whitelist-based, coarse-grained, table-level access control
[Diagram: a cluster with TableA (KeytabA) and TableB (KeytabB); a client holding KeytabA; one access path is marked X (rejected)]
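A table-level whitelist check can be sketched in a few lines. The example below assumes authentication (Kerberos in the slides) has already produced a principal name, and only shows the authorization step. The `TableAcl` class, the principal names, and the deny-by-default behaviour for tables without a whitelist are assumptions for illustration, not Pegasus's actual API or policy.

```python
class TableAcl:
    """Hypothetical coarse-grained ACL: one whitelist of principals per table."""

    def __init__(self):
        self.whitelist = {}  # table name -> set of allowed principals

    def allow(self, table: str, principal: str) -> None:
        self.whitelist.setdefault(table, set()).add(principal)

    def is_allowed(self, table: str, principal: str) -> bool:
        # Assumption: a table with no whitelist entry denies everyone.
        return principal in self.whitelist.get(table, set())


if __name__ == "__main__":
    acl = TableAcl()
    acl.allow("TableA", "userA")
    acl.allow("TableB", "userB")
    print(acl.is_allowed("TableA", "userA"))
    print(acl.is_allowed("TableB", "userA"))
```

The check is per table only: a whitelisted principal can read and write every key in that table, which is what "coarse-grained" means here.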
Partition Split
Basic introduction
• Each replica divides into two replicas
• Replica[i] -> Replica[i], Replica[i + original_partition_count]
[Diagram: four replica groups; Replica0 splits into Replica0 and Replica4, Replica1 into Replica1 and Replica5, Replica2 into Replica2 and Replica6, Replica3 into Replica3 and Replica7]
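The index rule Replica[i] -> Replica[i], Replica[i + original_partition_count] lines up with modulo-based hash routing: if a key lands in partition `h % n` before the split, then after doubling the partition count it lands in `h % 2n`, which is always either the same index or that index plus `n`. The sketch below checks this. That Pegasus routes by `hash % partition_count` is an assumption here (the slides only say "hash partitioning"), and the function names are made up for the example.

```python
def split_targets(i: int, original_partition_count: int) -> tuple:
    """Partition i splits into itself and i + original_partition_count."""
    return (i, i + original_partition_count)


def keys_stay_in_split_targets(original_partition_count: int, samples: int) -> bool:
    """Check that each key's new partition is one of its old partition's targets."""
    n = original_partition_count
    for key_hash in range(samples):
        old_index = key_hash % n        # routing before the split
        new_index = key_hash % (2 * n)  # routing after the split
        if new_index not in split_targets(old_index, n):
            return False
    return True


if __name__ == "__main__":
    print(split_targets(0, 4))
    print(keys_stay_in_split_targets(4, 10_000))
```

This is why a child only needs its own parent's data: no key moves between unrelated partitions during a split.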
Partition Split
Stage1: async-learn
[Diagram: three replica servers hold the primary and two secondaries; on each server a child replica copies data; a client]
• Parent is the old replica; child is the new replica
• The child replica copies data from its parent
• The client only knows about the parent replica
Partition Split
Stage2: register
[Diagram: the child replicas on the three replica servers register with the meta server; client access is marked X (rejected)]
• Starts once the child has copied all of the parent's data
• Reads and writes are rejected while registering
Partition Split
Partition split succeeded
[Diagram: each replica server now holds two replicas (two primaries on one server, two secondaries on each of the other two); a client]
• Merged into master; will be released in 2.3.0
• Duplicated data is garbage-collected by compaction
Surrounding Ecosystem
Pegasus-Spark
Best practices
• Large offline data analysis (SQL)
• Large offline data load (BulkLoad)
Pegasus-Spark
Offline Analysis
• Convert the data into Hive (Parquet)
• Use SparkSQL for analysis
[Diagram: replica servers, HDFS, Schema RDD, Hive]
Pegasus-Spark
Convert to SST file for Bulk load
[Diagram: original data on several nodes passes through Transform (Pegasus-Spark): Distinct, Repartition, Sort; the output is written to HDFS as SST files]
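The Distinct, Repartition, Sort pipeline can be sketched in plain Python to show why each step is there: SST files hold keys in sorted order, and each file should contain only the keys of one partition. This is not Pegasus-Spark code. The `(hash_key, sort_key, value)` record shape, the CRC32-modulo partitioning, and the last-write-wins handling of duplicate keys are assumptions for this example.

```python
import zlib


def to_sorted_partitions(records, partition_count: int):
    """Return one key-sorted, duplicate-free record list per partition."""
    # Distinct: keep one value per key (last one wins, an assumption).
    unique = {}
    for hash_key, sort_key, value in records:
        unique[(hash_key, sort_key)] = value

    # Repartition: group records by the partition their hash key maps to.
    partitions = [[] for _ in range(partition_count)]
    for (hash_key, sort_key), value in unique.items():
        index = zlib.crc32(hash_key) % partition_count  # stand-in hash
        partitions[index].append((hash_key, sort_key, value))

    # Sort: an SST file must be written in key order.
    for part in partitions:
        part.sort(key=lambda r: (r[0], r[1]))
    return partitions


if __name__ == "__main__":
    rows = [(b"u2", b"b", b"1"), (b"u1", b"a", b"2"), (b"u2", b"b", b"3")]
    for i, part in enumerate(to_sorted_partitions(rows, 4)):
        print(i, part)
```

In a Spark job the same three steps would run distributed across the nodes, with each sorted partition then written out as an SST file.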
Meta Proxy
Basic introduction
• Unified access point
• Manages primary and standby clusters
[Diagram: without MetaProxy, clients connect to the meta servers of Cluster A, Cluster B, and Cluster C directly; with MetaProxy, clients connect through MetaProxy]
Meta Proxy
Switch primary and standby cluster
[Diagram: clients connect through MetaProxy; the primary cluster duplicates to the secondary cluster; after a switch, the two clusters exchange the primary and secondary roles]
Disk migration tool
Balances disk usage on a replica server
[Diagram: before migration Disk1 70%, Disk2 75%, Disk3 85%, Disk4 40%; the disk migrator loops over select disk, select replica, migrate replica until the disks are balanced; after migration Disk1 70%, Disk2 65%, Disk3 70%, Disk4 65%]
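The select disk, select replica, migrate loop can be sketched as a small greedy balancer. The selection rules below (move from the fullest disk to the emptiest, pick the largest replica smaller than the usage gap, stop at a 10-point spread) are assumptions chosen for the example, not the tool's documented policy. The starting percentages match the slide; the per-replica sizes are made up so that they sum to those percentages.

```python
def balance(disks: dict, threshold: int = 10, max_moves: int = 100) -> list:
    """Move replicas between disks until usage differs by at most `threshold`.

    `disks` maps a disk name to a list of replica sizes (percent of capacity)
    and is modified in place. Returns the moves as (source, target, size).
    """
    moves = []
    for _ in range(max_moves):
        used = {name: sum(sizes) for name, sizes in disks.items()}
        high = max(used, key=used.get)  # select disk: fullest
        low = min(used, key=used.get)   # select disk: emptiest
        gap = used[high] - used[low]
        if gap <= threshold:
            break  # balanced
        # Select replica: a move only helps if the replica is smaller than the gap.
        candidates = [size for size in disks[high] if size < gap]
        if not candidates:
            break  # nothing movable improves the balance
        size = max(candidates)
        disks[high].remove(size)        # migrate replica
        disks[low].append(size)
        moves.append((high, low, size))
    return moves


if __name__ == "__main__":
    disks = {
        "Disk1": [35, 35],
        "Disk2": [40, 25, 10],
        "Disk3": [55, 15, 15],
        "Disk4": [40],
    }
    print(balance(disks))
    print({name: sum(sizes) for name, sizes in disks.items()})
```

Each move takes a replica smaller than the gap from the fullest disk, so the two disks move closer together without swapping places, and the loop cannot oscillate.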
Community
Process
[Timeline spanning 2015, 2016, 2017.9, 2020.6, 2020.9, and 2021.8, with the milestones Start, Open GitHub, Release 1.0.0, Join Apache, Release 2.0.0, and Meetup]
Tools
Start contributing through the APIs and tools
[Diagram: Pegasus core surrounded by user-cli, client, HTTP API, RPC API, monitoring, admin-cli, deploy tools, and other tools …]
In future
Issues, Roadmap, RFC
• New Features
– Cluster load balance
– Table Migrator Tools
– Read throughput throttling
– Support K8S
– ...
• Feature enhancement
– Duplication
– Bulk load
– Hot partition detection
– …
• Tests
• Documents
Activities
• August 21st, Beijing: the first offline meetup is coming soon
THANK YOU
QUESTIONS?
https://github.com/apache/incubator-pegasus
https://pegasus.apache.org/