Bratislava, Slovakia. June 3-5, 2024
Apache Ratis
A High Performance Raft Library
Tsz-Wo Nicholas Sze
2024.6.3
About Me
Tsz-Wo Nicholas Sze, Ph.D.
● Software Engineer / Amateur Mathematician
● PMC member of Apache Hadoop, Ozone & Ratis
● Fun facts
○ Used Hadoop to create a π computation record (2010)
■ https://www.bbc.com/news/technology-11313194
○ Discovered a deterministic primality proving algorithm for Proth numbers (2008)
■ https://en.wikipedia.org/wiki/Proth_prime#Primality_testing
Agenda
● A Brief Introduction of Raft
● Apache Ratis Community
● Apache Ratis Features
© 2024 Cloudera, Inc. All rights reserved.
Consensus
● What is consensus?
○ Multiple servers agree on a value
● Typical use cases
○ Log replication
○ Replicated state machines (high availability)
Consensus Algorithms
● Paxos (1990)
○ Works but hard to understand
○ Hard to implement (correctly)
● Raft (2014) – “In Search of an Understandable Consensus Algorithm”
○ Easy to understand
○ Easy to prove
○ Easy to implement
Raft Basic – Leader Election
● Servers are started as Followers
● Randomly timeout to become a Candidate and start a leader election
○ Candidate sends requestVote to other servers
○ A server votes for a Candidate only if the Candidate has an up-to-date state.
● The Candidate becomes the Leader once it gets a majority of the votes
○ At most one Leader can be elected in any given term.
RPC: requestVote
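The voting rules above can be sketched in a few lines of plain Java. This is a toy model, not Ratis code: the class and method names are hypothetical, and the "up-to-date" test follows the comparison described in the Raft paper (last log term first, then last log index). It leaves out the term and voted-for bookkeeping that a real server also performs.

```java
/** Toy model of the Raft voting rules; names are illustrative, not Ratis APIs. */
final class ElectionSketch {
  /**
   * Returns true if the candidate's log is at least as up-to-date as the voter's:
   * compare the term of the last entry first, then the index of the last entry.
   */
  static boolean isUpToDate(long candLastTerm, long candLastIndex,
                            long voterLastTerm, long voterLastIndex) {
    if (candLastTerm != voterLastTerm) {
      return candLastTerm > voterLastTerm;
    }
    return candLastIndex >= voterLastIndex;
  }

  /** A candidate wins once its votes (its own included) exceed half of the group. */
  static boolean hasMajority(int votes, int groupSize) {
    return votes > groupSize / 2;
  }

  public static void main(String[] args) {
    int votes = 1; // the candidate votes for itself
    // Candidate's last entry: term 2, index 5. Two other servers are asked to vote.
    if (isUpToDate(2, 5, 2, 4)) votes++; // same term, voter's log is shorter: vote granted
    if (isUpToDate(2, 5, 3, 1)) votes++; // voter's last entry has a newer term: vote refused
    System.out.println("votes=" + votes + ", elected=" + hasMajority(votes, 3));
  }
}
```

With a group of three, the candidate's own vote plus one granted vote is already a majority.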
Raft Basic – Log Replication
● Clients send requests to the Leader
○ A non-leader will reject client requests.
● The Leader forwards requests as log entries to the Followers via appendEntries.
○ Once a majority of the servers have stored a log entry, the request is committed.
○ Committed requests are safe regardless of failures or restarts.
● Heartbeat to maintain leadership
○ The Leader sends empty appendEntries to Followers when idle.
RPC: appendEntries
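A minimal sketch of the majority rule for commits, assuming the Leader tracks the highest log index stored on each server (itself included). The class and method names are hypothetical. Raft additionally requires the entry to belong to the Leader's current term before it is committed, which this sketch leaves out.

```java
import java.util.Arrays;

/** Toy computation of the highest log index stored on a majority of servers. */
final class CommitIndexSketch {
  /**
   * @param matchIndexes highest log index known to be stored on each server, Leader included
   * @return the highest index that a majority of the servers have stored
   */
  static long majorityIndex(long[] matchIndexes) {
    long[] sorted = matchIndexes.clone();
    Arrays.sort(sorted); // ascending order
    // With n servers, every element from position (n - 1) / 2 upward is >= this value,
    // and that covers at least n / 2 + 1 servers, i.e. a majority.
    return sorted[(sorted.length - 1) / 2];
  }

  public static void main(String[] args) {
    // Leader has stored index 10, the two Followers have stored 7 and 9.
    System.out.println(majorityIndex(new long[] {10, 7, 9}));
  }
}
```

In the example, indexes up to 9 are stored on two of three servers, so 9 is the highest index that can be committed.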
Raft Library
● Our Motivation (2016)
○ Use Raft in Hadoop Ozone – HA and Write Pipeline
● “In Search of a Usable Raft Library”
○ A long list of Raft implementations is available
○ None of them is a general library ready to be consumed by other projects.
○ Most of them are tied to another project or a part of another project.
● We need a Raft library!
Apache Ratis — A Brief History
● 2016-03: Started at Hortonworks.
● 2017-01: Entered Apache incubation.
● 2017-05: Released first version 0.1.0-alpha.
● 2017-04: Hadoop Ozone branch started using Ratis!
● 2020-07: Released first GA version 1.0.0.
● 2021-02: Became a top level Apache project.
● 2021-03: Released the version 2.0.0.
● 2023-11: Released the version 3.0.0.
● 2024-01: Released the latest version 3.0.1.
Who uses Apache Ratis?
● Apache Ozone (Object Store)
○ SCM/OM high availability
○ Write Pipeline
● Alluxio (Data Orchestration for the Cloud)
○ Originally called Tachyon
○ Managing all of Alluxio's journaled state
● Apache IoTDB (Database for Internet of Things)
○ Replicating application state (HA)
● Apache Celeborn (Intermediate Data Service)
○ Replication for high availability
General Ratis Use Case
● You already have a service running on a single server
● You want to
○ (1) replicate the server data to multiple machines
■ The replication number/cluster membership can be changed at runtime.
■ It can tolerate server failures.
or
○ (2) have an HA (highly available) service
■ When a server fails, another server will automatically take over.
■ Clients automatically fail over to the new server.
Apache Ratis is for you!
Apache Ratis — A Raft Library
● Open source, open development, community driven
● Apache License 2.0
● Written in Java 8
Contributions are welcome!
Apache Ratis — Standard Raft Features
● Leader Election + Log Replication
○ Automatically elect a leader among the servers in a Raft group
○ Randomized timeout for avoiding split votes
○ Log is replicated in the Raft group
● Membership Changes
○ Members in a Raft group can be reconfigured at runtime.
○ Replication factor can be changed at runtime.
● Log Compaction
○ Snapshots are taken periodically
○ Send snapshot instead of a long log history.
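The randomized timeout mentioned above can be illustrated with a short sketch. The class name and the 150 to 300 ms range are assumptions for illustration only (the range is the example used in the Raft paper, not a Ratis default).

```java
import java.util.concurrent.ThreadLocalRandom;

/** Toy randomized election timeout; the range used below is an assumption. */
final class ElectionTimeoutSketch {
  /** Picks a timeout uniformly in [minMs, maxMs) so that servers rarely time out together. */
  static long randomTimeoutMs(long minMs, long maxMs) {
    return ThreadLocalRandom.current().nextLong(minMs, maxMs);
  }

  public static void main(String[] args) {
    // Each server draws its own timeout; the one with the smallest value
    // usually becomes a Candidate first, which makes split votes unlikely.
    for (int server = 1; server <= 3; server++) {
      System.out.println("server " + server + " timeout: " + randomTimeoutMs(150, 300) + " ms");
    }
  }
}
```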
Apache Ratis — Pluggability
● Pluggable state machine – the application logic
○ An application must define its state machine
● Pluggable metrics
○ Default is implemented using Dropwizard metrics v4
○ An alternative using Dropwizard metrics v3 is provided
○ Applications may provide a custom metrics implementation
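To make the idea of a pluggable state machine concrete, here is a deliberately simplified sketch. The `ToyStateMachine` interface and the `PUT key value` command format are hypothetical and much smaller than the real Ratis `StateMachine` interface; the point is only that the library delivers committed entries in log order and the application decides what they mean.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of application logic plugged in as a state machine. */
final class StateMachineSketch {
  /** Hypothetical, simplified contract; not the Ratis StateMachine interface. */
  interface ToyStateMachine {
    /** Applies a committed log entry and returns a reply for the client. */
    String apply(long index, String command);
  }

  /** Example application logic: a key-value map driven by "PUT key value" commands. */
  static final class KeyValueStateMachine implements ToyStateMachine {
    private final Map<String, String> data = new HashMap<>();
    private long lastApplied = 0;

    @Override
    public String apply(long index, String command) {
      if (index != lastApplied + 1) {
        throw new IllegalStateException("entries must be applied in log order");
      }
      lastApplied = index;
      String[] parts = command.split(" ", 3);
      if (parts.length == 3 && parts[0].equals("PUT")) {
        data.put(parts[1], parts[2]);
        return "OK";
      }
      return "UNKNOWN COMMAND";
    }

    String get(String key) {
      return data.get(key);
    }
  }

  public static void main(String[] args) {
    KeyValueStateMachine sm = new KeyValueStateMachine();
    System.out.println(sm.apply(1, "PUT x 2") + " x=" + sm.get("x"));
  }
}
```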
Apache Ratis — Pluggability
● Pluggable RPC
○ Applications may provide their own RPC implementation
○ Provided implementations: gRPC, Netty, Hadoop RPC
● Pluggable Raft log
○ Users may provide their own log implementation
○ The default implementation stores the log in local files
Apache Ratis — High Performance
● Asynchronous Event Driven Architecture
○ Implemented with
■ gRPC bi-directional stream API
■ Netty asynchronous event-driven network application framework
■ Java CompletableFuture & Executor APIs
○ Server-to-server: asynchronous appendEntries
○ Client-to-server: asynchronous client requests
■ Blocking APIs are also supported in the client
Apache Ratis — Client APIs
● Async API
CompletableFuture<RaftClientReply> send(Message message);
CompletableFuture<RaftClientReply> sendReadOnly(Message message);
● Blocking API (simply call CompletableFuture.get())
RaftClientReply send(Message message) throws IOException;
RaftClientReply sendReadOnly(Message message) throws IOException;
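The relationship between the two API styles can be sketched with the standard library alone. The class below is a stand-in, not the Ratis client: `sendAsync` simply echoes the message, and the reply type is a plain `String` instead of `RaftClientReply`. It shows one way a blocking call can be layered on an asynchronous one by waiting on the future and translating failures into `IOException`.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

/** Toy client showing a blocking call built on top of an asynchronous one. */
final class ClientApiSketch {
  /** Stand-in for an async send: completes with an echo reply on another thread. */
  static CompletableFuture<String> sendAsync(String message) {
    return CompletableFuture.supplyAsync(() -> "reply:" + message);
  }

  /** Blocking variant: waits for the async result and reports failures as IOException. */
  static String send(String message) throws IOException {
    try {
      return sendAsync(message).get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status for the caller
      throw new IOException("interrupted while waiting for the reply", e);
    } catch (ExecutionException e) {
      throw new IOException(e.getCause());
    }
  }

  public static void main(String[] args) throws IOException {
    // Async style: register a callback and keep going.
    CompletableFuture<Void> done = sendAsync("a").thenAccept(System.out::println);
    // Blocking style: wait for the reply.
    System.out.println(send("b"));
    done.join();
  }
}
```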
Apache Ratis — High Performance
● Data Intensive Applications
○ In Raft,
■ All transactions and the data are written to the log
● Log entry: data + metadata
● State machine: data
■ Write amplification
● A single write may become two or more internal writes
○ In Ratis
■ An application can choose not to write all the data to the log
● Log entry: only metadata
● State machine: data
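The write amplification described above is simple arithmetic, sketched here with hypothetical names and sizes. It counts only the bytes written on a single server and ignores file-system and replication overheads.

```java
/** Toy arithmetic for write amplification; sizes are illustrative assumptions. */
final class WriteAmplificationSketch {
  /** Log entry holds data + metadata, and the state machine stores the data again. */
  static long bytesWithDataInLog(long dataBytes, long metadataBytes) {
    return (dataBytes + metadataBytes) + dataBytes;
  }

  /** Log entry holds only metadata; the data is written once, by the state machine. */
  static long bytesWithMetadataOnlyLog(long dataBytes, long metadataBytes) {
    return metadataBytes + dataBytes;
  }

  public static void main(String[] args) {
    long data = 4L * 1024 * 1024; // a 4 MiB write (assumed size)
    long metadata = 100;          // assumed metadata size in bytes
    System.out.println("data in log:       " + bytesWithDataInLog(data, metadata) + " bytes");
    System.out.println("metadata-only log: " + bytesWithMetadataOnlyLog(data, metadata) + " bytes");
  }
}
```

For large writes the metadata is negligible, so keeping the data out of the log roughly halves the bytes written per server in this model.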
Apache Ratis — Streaming API
● Allows a client to write to the closest peer instead of the Leader
● The first peer forwards the data to the other peers
● The client may provide a routing table that specifies how to forward the data.
● Netty zero buffer copy – never create/copy buffers in the code
○ Not necessarily the OS level zero buffer copy
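One way to picture a routing table is as a map from each peer to the peers it forwards to. The sketch below is a toy model with hypothetical names (it does not use the Ratis routing table classes); it only computes the order in which the data would reach the peers, starting from the peer the client writes to.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy routing table: peer id -> peers that it forwards the stream data to. */
final class RoutingTableSketch {
  /** Returns the peers in the order the data reaches them, starting from the first peer. */
  static List<String> deliveryOrder(String first, Map<String, List<String>> routes) {
    Set<String> visited = new LinkedHashSet<>(); // keeps first-arrival order
    Deque<String> queue = new ArrayDeque<>();
    queue.add(first);
    while (!queue.isEmpty()) {
      String peer = queue.remove();
      if (visited.add(peer)) { // forward only on first arrival
        queue.addAll(routes.getOrDefault(peer, Collections.emptyList()));
      }
    }
    return new ArrayList<>(visited);
  }
}
```

For a chain such as s1 -> s2 -> s3, the client sends the data once to s1, and each peer both receives and forwards, so no single machine has to send every copy.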
Apache Ratis — Streaming Performance
● Performance can reach 3x that of writing to the Leader
○ Streaming can use the full power of all three machines.
Apache Ratis — Read from Followers
● Linearizable Read
○ Clients can read up-to-date data from Followers
○ Read-index algorithm & Leader lease
● Stale Read
RaftClientReply sendStaleRead(Message message, long minIndex, RaftPeerId server) throws IOException;
CompletableFuture<RaftClientReply> sendStaleRead(Message message, long minIndex, RaftPeerId server);
○ minIndex is the minimum log index that must already be committed on the given server
○ The returned data can be outdated (stale)
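The role of minIndex can be shown with a toy replica. Everything here is hypothetical: the `Replica` class stands in for a server, and returning an empty `Optional` when the server has not yet committed minIndex is an assumption of the sketch, not a statement about how Ratis reports that case.

```java
import java.util.Optional;

/** Toy model of a stale read guarded by a minimum committed index. */
final class StaleReadSketch {
  /** A replica that has committed and applied the log up to commitIndex. */
  static final class Replica {
    private long commitIndex = 0;
    private int x = 1;

    void apply(long index, int newX) {
      commitIndex = index;
      x = newX;
    }

    /** Returns the local value if this replica has committed at least minIndex. */
    Optional<Integer> staleRead(long minIndex) {
      return commitIndex >= minIndex ? Optional.of(x) : Optional.empty();
    }
  }

  public static void main(String[] args) {
    Replica follower = new Replica();
    follower.apply(1, 2); // the follower has applied "x = 2" at index 1
    // The leader may already be at index 2 with x = 3; the follower's answer is stale but allowed.
    System.out.println(follower.staleRead(1)); // minIndex 1 is satisfied
    System.out.println(follower.staleRead(2)); // minIndex 2 is not yet committed here
  }
}
```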
Apache Ratis — Read After Write Consistency
● Suppose x = 1 and a client sends the following operations asynchronously:
○ Write x = 2
○ Write x = 3
○ Read x
● What will it get for “Read x”?
○ Intuitively, it should get x = 3.
○ However, reads and writes are both asynchronous and are executed on different paths.
■ Although the same client sends all the operations, it is as if the reads and writes were sent by two different clients.
○ “Read x” may return 1, 2 or 3.
● Ratis supports read-after-write consistency (configurable)
○ “Read x” must return 3.
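The guarantee can be illustrated with a toy client that orders each read behind the writes it has already submitted. This is only one possible way to obtain the behavior, written with `CompletableFuture` chaining; it is not a description of how Ratis implements read-after-write consistency, and all names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy client in which a read always observes the client's earlier writes. */
final class ReadAfterWriteSketch {
  private final AtomicInteger x = new AtomicInteger(1); // stands in for the server-side value
  private CompletableFuture<Void> lastWrite = CompletableFuture.completedFuture(null);

  /** Queues a write behind the previous one so that writes apply in submission order. */
  synchronized CompletableFuture<Void> writeAsync(int value) {
    lastWrite = lastWrite.thenRunAsync(() -> x.set(value));
    return lastWrite;
  }

  /** Queues the read behind every write this client has already submitted. */
  synchronized CompletableFuture<Integer> readAsync() {
    return lastWrite.thenApplyAsync(ignored -> x.get());
  }

  public static void main(String[] args) {
    ReadAfterWriteSketch client = new ReadAfterWriteSketch();
    client.writeAsync(2);
    client.writeAsync(3);
    System.out.println("Read x = " + client.readAsync().join());
  }
}
```

Without the chaining in `readAsync`, the read could run before either write finished and return 1 or 2.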
Apache Ratis — Watch Requests
● Watch a log index until it satisfies the given replication level.
Client API:
RaftClientReply watch(long index, ReplicationLevel replication) throws IOException;
CompletableFuture<RaftClientReply> watch(long index, ReplicationLevel replication);
■ MAJORITY (default) – committed at the leader and replicated to a majority of peers (Raft)
■ ALL – committed at the leader and replicated to all peers.
■ MAJORITY_COMMITTED – committed at a majority of peers.
■ ALL_COMMITTED – committed at all peers.
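The four replication levels can be expressed as predicates over per-peer indexes. The sketch below is a toy evaluation with hypothetical names (`Level`, `isSatisfied`); only the level names come from the slide. It assumes that "replicated" means the peer has stored the entry and "committed" means the peer's own commit index has reached it.

```java
/** Toy evaluation of the watch replication levels. */
final class WatchSketch {
  enum Level { MAJORITY, ALL, MAJORITY_COMMITTED, ALL_COMMITTED }

  /**
   * @param index         the watched log index
   * @param leaderCommit  commit index at the leader
   * @param matchIndexes  highest index stored on each peer (leader included)
   * @param commitIndexes commit index of each peer (leader included)
   */
  static boolean isSatisfied(Level level, long index, long leaderCommit,
                             long[] matchIndexes, long[] commitIndexes) {
    switch (level) {
      case MAJORITY:
        return leaderCommit >= index
            && countAtLeast(matchIndexes, index) > matchIndexes.length / 2;
      case ALL:
        return leaderCommit >= index
            && countAtLeast(matchIndexes, index) == matchIndexes.length;
      case MAJORITY_COMMITTED:
        return countAtLeast(commitIndexes, index) > commitIndexes.length / 2;
      case ALL_COMMITTED:
        return countAtLeast(commitIndexes, index) == commitIndexes.length;
      default:
        throw new IllegalArgumentException("unknown level: " + level);
    }
  }

  /** Counts how many peers have reached the given index. */
  private static int countAtLeast(long[] values, long index) {
    int count = 0;
    for (long v : values) {
      if (v >= index) {
        count++;
      }
    }
    return count;
  }
}
```

For example, with three peers that have stored indexes 5, 5 and 3 and have commit indexes 5, 4 and 3, index 5 satisfies MAJORITY but none of the stronger levels.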
Apache Ratis — Multi-Raft
● A server can join multiple Raft groups
Apache Ratis — Security
● Supports TLS connections in gRPC and Ratis Streaming
○ Applications may pass a Ratis TlsConf to build a Ratis client/server
○ The client/server uses the TlsConf to build a Netty SslContext
○ The SslContext is used to establish a secure connection that provides
■ Authentication
■ Encryption
Apache Ratis — Leadership
● Transfer Leadership
○ Change the leader from one server to another
● Leader Election Management
○ Pause and resume leader elections
Apache Ratis — Snapshot Management
● Standard Raft Snapshot
○ Auto-triggering snapshot
○ Configurable snapshot retention policy
● Install Snapshot Notification
○ Notify a follower to install a snapshot from an external source
● Manual Snapshot Creation
○ API to trigger creating a snapshot
Apache Ratis — Server Priority
● Priority
○ Servers can be assigned a priority
○ A higher priority server won’t vote for a lower priority server
■ unless the higher priority server is outdated
○ A lower priority server won’t become the Leader
■ unless all higher priority servers voted for it
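The priority rules can be read as an extra check on top of the normal Raft vote. The sketch below is one interpretation of the bullets above, not the Ratis implementation: it assumes that a larger number means a higher priority, and the method and parameter names are hypothetical.

```java
/** Toy vote decision that takes server priority into account. */
final class PrioritySketch {
  /**
   * @param voterPriority     priority of the server deciding how to vote
   * @param candidatePriority priority of the server asking for the vote
   * @param candidateUpToDate whether the candidate's log is at least as up-to-date as the voter's
   * @param voterOutdated     whether the voter's log is strictly behind the candidate's
   */
  static boolean grantVote(int voterPriority, int candidatePriority,
                           boolean candidateUpToDate, boolean voterOutdated) {
    if (!candidateUpToDate) {
      return false; // standard Raft rule: never vote for a candidate with an older log
    }
    if (candidatePriority >= voterPriority) {
      return true; // equal or higher priority candidates are treated as usual
    }
    return voterOutdated; // a higher priority voter yields only when it is behind
  }

  public static void main(String[] args) {
    // A priority-2 voter with a current log refuses a priority-1 candidate.
    System.out.println(grantVote(2, 1, true, false));
    // The same voter grants the vote once its own log is behind.
    System.out.println(grantVote(2, 1, true, true));
  }
}
```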
Apache Ratis - Non-voting members
● Listeners – for listening to log entries
○ They receive appendEntries.
○ They do not vote.
○ They are not counted toward the majority.
○ They can serve as hot standbys.
Apache Ratis — CLI
● Group: info, list
● Peer: add, remove, setPriority
● Snapshot: create
● Election: transfer, stepDown, pause, resume
● Local: raftMetaConf
Thank you
Contributions are welcome!
Apache Ratis - Driven by Community
● 93 contributors (according to GitHub)
● 30 committers
● 19 PMC members
Apache Ratis - New Releases
● We are currently voting on the 3.1.0 release
○ Thanks to William Song from Apache IoTDB for being the Release Manager
● Ratis Thirdparty Release 1.0.6
○ Vote passed on 2024.5.14
○ Thanks also to Attila Doroszlai from Apache Ozone for being the Release Manager
● Thanks to everyone for voting, verifying and testing the releases!
Agenda
● A Brief Introduction of Raft
● Apache Ratis Community
● Apache Ratis Features (selected)
● A Short Example
Example: FileStore
● Maintain a file map (key -> file)
○ Supported operations: Read, Write, Delete
○ Unsupported operations: List, Rename, etc.
● Asynchronous & In-order
○ A client may submit multiple write requests concurrently
■ Writes may go to multiple files at the same time
■ Each file may have multiple write requests
● File data is managed by the state machine
○ but is not stored in the Raft log
● Performance (depending on hardware)
○ Write throughput: 1.8 GB/s (large files, old results)
○ IOPS: 16,000 txns/s (small files, recent results by Siyao Meng)
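The shape of the FileStore example can be sketched as an in-memory toy. This is not the Ratis FileStore code: the class is hypothetical, files are held as strings, and the "log" is a plain list. It only illustrates the split described above, where the log records metadata and the state machine holds the file data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy file map (key -> file) with a metadata-only log. */
final class FileStoreSketch {
  private final Map<String, StringBuilder> files = new HashMap<>(); // state machine data
  private final List<String> log = new ArrayList<>();               // metadata-only log

  /** Appends data to the file; only the key and the length are logged. */
  void write(String key, String data) {
    log.add("WRITE " + key + " length=" + data.length());
    files.computeIfAbsent(key, k -> new StringBuilder()).append(data);
  }

  /** Returns the file content, or null if the file does not exist. */
  String read(String key) {
    StringBuilder content = files.get(key);
    return content == null ? null : content.toString();
  }

  void delete(String key) {
    log.add("DELETE " + key);
    files.remove(key);
  }

  List<String> log() {
    return log;
  }

  public static void main(String[] args) {
    FileStoreSketch store = new FileStoreSketch();
    store.write("a", "hello");
    store.write("a", " world");
    System.out.println(store.read("a"));
    System.out.println(store.log());
  }
}
```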
