Adventures in Thread-per-core Async
with Redpanda & Seastar
Travis Downs
Software Engineer at Redpanda
Travis Downs (He/Him)
Software Engineer at Redpanda
■ I love going deep on performance – all the way to assembly, if necessary
■ I’ve held principal staff positions at Salesforce & architect roles at SAP and Business Objects
■ I had hobbies like writing a software performance blog, but now I’m a parent, so…
Redpanda in 60 seconds
Redpanda is a streaming storage engine
Clients speak the Apache Kafka API to Redpanda nodes to produce and consume from topic partitions.
Partitions are logs (~10,000s per cluster)
Each partition is a Raft group (~3 members)
Scale up and scale out should be ~equivalent
Thread-per-core
What is thread-per-core?
One thread per core and pinned: make scheduling decisions in userspace.
This thread must not block.
Question: how do we replace blocking calls?
Answer: …
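A minimal sketch (not from the deck) of where the following slides are headed, using the Seastar sleep API that appears later in these slides: the blocking call is replaced by a call that returns a future immediately, with the remaining work chained as a continuation, so the reactor thread never stops.
#include <seastar/core/future.hh>
#include <seastar/core/sleep.hh>
#include <iostream>

using namespace std::chrono_literals;

// Blocking version (forbidden on a reactor thread):
//   std::this_thread::sleep_for(100ms); // stalls every task on this core
// Future-based replacement: returns immediately; the continuation runs
// when the timer fires, while the reactor keeps running other tasks.
seastar::future<> wait_then_work() {
    return seastar::sleep(100ms).then([] {
        std::cout << "ran ~100ms later without blocking the core\n";
    });
}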
Seastar
Seastar was created by the ScyllaDB project.
Redpanda is built on Seastar. We 😍 it.
Shared-nothing architecture made up of “shards”:
■ A CPU core
■ A pool of memory NUMA-local to that core
■ All-to-all mesh of SPSC message queues
■ Cooperative multitasking
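A hedged sketch of what the shard model and SPSC queue mesh buy you: instead of sharing memory across cores, a lambda is sent to the shard that owns the data and runs there. seastar::smp::submit_to is the real Seastar call; local_counter is hypothetical per-shard state invented for this example.
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

// Hypothetical per-shard state: each shard owns its copy, never shared.
static thread_local uint64_t local_counter = 0;

// Read another shard's counter by sending a lambda across the SPSC mesh;
// it executes on the owning shard, so no atomics or locks are needed.
seastar::future<uint64_t> read_counter_on(unsigned shard) {
    return seastar::smp::submit_to(shard, [] {
        return local_counter; // runs on `shard`, touching only its memory
    });
}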
Async C++ with coroutines
Continuation style
ss::future<> consensus::stop() {
return _event_manager.stop()
.then([this] { return _append_requests_buffer.stop(); })
.then([this] { return _batcher.stop(); })
.then([this] { return _bg.close(); })
.then([this] {
if (likely(!_snapshot_writer)) {
return ss::now();
}
return _snapshot_writer->close().then(
[this] { _snapshot_writer.reset(); });
});
}
C++ coroutines
seastar::future<std::string> my_coroutine() {
co_await seastar::sleep(100ms); // returns future<>
co_return "hello world";
}
New in C++20: three new keywords
co_await
co_yield
co_return
The language provides a future concept but not an implementation: Seastar still defines the future/promise type.
When the compiler sees a co_* keyword, the function is rewritten to stash stack variables on the heap as needed to support suspension/resumption of execution.
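A sketch (not from the deck) of what that rewrite must preserve, reusing the input_stream::read1() API from the case study later in this deck: total lives across two suspension points, and the stack frame is torn down whenever the coroutine suspends, so the compiler keeps total in the heap-allocated coroutine frame.
// `total` lives across suspension points; the stack frame is torn down on
// suspend, so `total` is stored in the heap-allocated coroutine frame.
seastar::future<int> sum_two_reads(input_stream& s) {
    int total = co_await s.read1(); // suspend #1: frame preserves `total`
    total += co_await s.read1();    // suspend #2: same
    co_return total;
}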
C++20 coroutines: after
ss::future<> consensus::stop() {
…
co_await _event_manager.stop();
co_await _append_requests_buffer.stop();
co_await _batcher.stop();
_op_lock.broken();
co_await _bg.close();
if (unlikely(_snapshot_writer)) {
co_await _snapshot_writer->close();
_snapshot_writer.reset();
}
}
New vs old
ss::future<> consensus::stop() {
…
co_await _event_manager.stop();
co_await _append_requests_buffer.stop();
co_await _batcher.stop();
_op_lock.broken();
co_await _bg.close();
if (unlikely(_snapshot_writer)) {
co_await _snapshot_writer->close();
_snapshot_writer.reset();
}
}
ss::future<> consensus::stop() {
…
return _event_manager.stop()
.then([this] { return _append_requests_buffer.stop(); })
.then([this] { return _batcher.stop(); })
.then([this] { return _bg.close(); })
.then([this] {
if (likely(!_snapshot_writer)) {
return ss::now();
}
return _snapshot_writer->close().then(
[this] { _snapshot_writer.reset(); });
});
}
Coroutine Performance
Coroutine performance depends on both the framework implementing the promise type and the compiler.
Here we discuss Seastar’s implementation and clang++.
Preview: coroutines are not transparent when it comes to performance
Frame allocations
Observation: almost every coroutine allocates
Exception: when the compiler can statically prove the coroutine never suspends
- No suspension points (co_await or co_yield) in the function
- Suspension points are never reachable
- A suspension point is reachable but never suspends
Frame allocations 2
This coroutine:
- Never suspends
- Never even executes co_await
- ~200 instructions and ~80 cycles
- Always allocates
seastar::future<> empty_coro() {
if (always_false) { // false at runtime, but not provably so at compile time
co_await seastar::make_ready_future<>();
}
}
Case study: varint decode
Let’s look at a case study drawn from Redpanda code
Decode an unsigned 32-bit varint
1–5 bytes; an MSB of 0 indicates the final byte
Widely used in the Kafka protocol (and other places)
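For concreteness, here is a hypothetical stand-in for the detail::var_decoder the decoders below rely on (the real one is Redpanda-internal); it implements the 1–5 byte, MSB-terminated encoding just described.
#include <cstdint>

// Sketch of the decoder state machine assumed by the examples below.
struct var_decoder {
    uint32_t value = 0;
    unsigned shift = 0;
    // Feed one byte; returns true once the final byte (MSB == 0) arrives.
    bool accept(char c) {
        auto b = static_cast<uint8_t>(c);
        value |= static_cast<uint32_t>(b & 0x7f) << shift;
        shift += 7;
        return (b & 0x80) == 0;
    }
    uint32_t result() const { return value; }
};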
Case study: coroutine decoder
read1() is async
Almost the same as the synchronous version
Allocates once per decode
// Note: a coroutine cannot have a deduced (auto) return type, so it is spelled out.
ss::future<result_type> coro_decode(input_stream& s) {
detail::var_decoder decoder;
while (true) {
char c = co_await s.read1();
if (decoder.accept(c)) {
break;
}
}
co_return decoder.result();
}
~680 instructions
~220 cycles
176 bytes allocated
Case study: continuation decoder
Much harder to read (and write)
Does not allocate
Recursion is bounded by the decoder (at most 5 bytes)
auto cont_recurse(iobuf_reader& s, var_decoder decoder) {
return s.read1().then([&s, decoder](char c) mutable {
if (decoder.accept(c)) {
return ss::make_ready_future<result_type>(decoder.result());
}
return cont_recurse(s, decoder);
});
}
So is it faster?
Case study: runtime comparison
Case study: mystery method 1
Optimistic approach
Avoid any async machinery if possible
Doubles the amount of code
auto cont_tricky(iobuf_reader& s, var_decoder decoder) {
auto f = s.read1();
while (f.available()) {
if (decoder.accept(f.get())) {
return decoder.result_as_future();
}
f = s.read1();
}
return std::move(f).then([&s, decoder](char c) mutable {
if (decoder.accept(c)) {
return decoder.result_as_future();
}
return cont_tricky(s, decoder);
});
}
Case study: mystery method 2
Synchronous version
Almost identical to coro version
Speedup varies from 4x to 9x
auto sync_decode(input_stream& s) {
detail::var_decoder decoder;
while (true) {
char c = s.read1_sync();
if (decoder.accept(c)) {
break;
}
}
return decoder.result();
}
Sync with async fallback
So how should we really do this? Use sync with an async fallback: peek at 5 bytes and fall back if they aren’t available.
The fallback must be in its own method: if the whole function were a coroutine, even the fast path would pay for a frame allocation.
auto decode_fallback(iobuf_reader& s) {
auto [buf, filled] = s.peek<5>();
if (filled) {
auto result = decode_u32(buf.data()); // (value, bytes consumed)
s.skip(result.second);
return ss::make_ready_future<result_type>(result);
}
return coro_decode(s);
}
Performance Bottom Line
Async is still cheap in the large
- Context switches are 1,000s of cycles, large cache impact
Very short coroutines may be expensive: consider continuations
Continuations have a per-continuation cost: consider coroutines
Consider sync with async fallback
Drive the above decisions via profiling
Summary: are coroutines “async made easy”?
⚠️ C++ is not memory safe, and async makes it (even) easier to write a segfault with a careless reference. Sometimes coroutines help with this.
ℹ️ Compiler bugs: LLVM is great and things get fixed fast, but coroutines are at the “early adopter” stage. Use the latest release! (e.g. llvm/llvm-project#51843).
ℹ️ Performance: it’s complicated.
✅ Net win for maintainability and robust use of RAII, and it opens the door to future compiler optimization of async code.
Thank you! Let’s connect.
Travis Downs
travis.downs@redpanda.com
@trav_downs
travisdowns
Trade offs
Alternatives
Seastar is not the only option for writing fast async code:
■ C++/asio
■ Rust/tokio
■ Various GC language options (goroutines, Java lightweight threads)
Main difference: these do not adopt Seastar’s strict share-nothing model, do not avoid atomics, and tend to only softly bind tasks to a core (e.g. tokio does work stealing).
Possibility of hybrid approaches (e.g. use Biased Reference Counting to avoid atomics while avoiding pinning all memory to cores).
Seastar also has “alien threads” for mixing in non-async code (Redpanda uses this for Kerberos libs).
Trade offs
Using C++20 & Seastar is a clear net benefit for Redpanda.
It might be right for you too if one or more of the following are true:
■ You are starting a new project where high throughput and low latency are important
■ Your work decomposes into shard-affine units
■ You need to scale to more than a few cores
■ C++ is your language of choice
What makes good high-throughput software?
Keep the disk/network fed with I/Os
Conform to the system’s topology
Not just high throughput: reliably low latency
Primary success metric: P99.9 latency
Why Redpanda?
Fast
● 10x lower tail latency vs Apache Kafka
● 6x faster transactions
● Written in C++ with async, shared-nothing design
● No page cache, no virtual memory
Easy
● Fully Kafka API-compatible
● Single binary
● No JVM, no ZooKeeper
● Auto tuning & balancing
● Prometheus metrics
Efficient
● Thread-per-core architecture
● Saturates your infrastructure
● Extreme throughput
● Scales both vertically and horizontally
Cost-Effective
● Reduces Kafka infra costs by 6x
● Lower admin overhead
● Limitless data ingestion and retention without local disk
Coroutines and lifetimes: example 1
A real example: a helper function for constructing and writing a message batch, from PR #9154
ss::future<std::error_code> metadata::mark_clean(model::offset clean_offset) {
// Construct a batch builder
auto builder = batch_start();
// Add one message
builder.mark_clean(clean_offset);
// Replicate using raft, return future for replication complete
return builder.replicate();
}
Coroutines and lifetimes: example 1
ss::future<std::error_code> metadata::mark_clean(model::offset clean_offset) {
auto builder = batch_start();
builder.mark_clean(clean_offset);
return builder.replicate();
// … builder falls out of scope here, the returned future still references it
}
// Imagine replicate() might generate a future that captures this
ss::future<> batch_builder::replicate() {
return something.then([this]{
// update some member variable here
});
}
Coroutines and lifetimes: example 1
ss::future<std::error_code> metadata::mark_clean(model::offset clean_offset) {
auto builder = batch_start();
builder.mark_clean(clean_offset);
co_return co_await builder.replicate();
}
co_awaiting futures inline ensures the future completes before the referenced object falls out of scope.
Thank you, coroutines! 🎉
Coroutines and lifetimes: example 2
// Print a string after a delay
seastar::future<> delayed_print(const std::string& msg) {
co_await seastar::sleep(100ms);
std::cout << "delayed_print: " << msg << std::endl;
}
// Print hello world after a delay
seastar::future<> delayed_hello_world() {
return delayed_print(std::string("hello world!"));
}
Coroutines and lifetimes: example 2
Spot the bug: the temporary std::string created in delayed_hello_world dies when that call returns, because the caller hands back the future without co_awaiting it. When delayed_print resumes after the sleep, msg is a dangling reference.
Coroutines and lifetimes: example 2
// Print a string after a delay
seastar::future<> delayed_print(std::string msg) { // by value: msg lives in the coroutine frame
co_await seastar::sleep(100ms);
std::cout << "delayed_print: " << msg << std::endl;
}
// Print hello world after a delay
seastar::future<> delayed_hello_world() {
return delayed_print(std::string("hello world!"));
}
Pass by value is not expensive in this case: temporaries are rvalues, will be moved, not copied.
Always pass by value if you can, to avoid this class of issue.
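An alternative fix, tying back to example 1 and sketched here rather than taken from the deck: make the caller a coroutine and co_await inline, so the temporary outlives the call even with the original by-reference parameter.
// Works even with the const std::string& parameter: the temporary lives
// until the end of the full expression, and the inline co_await completes
// delayed_print before that expression ends.
seastar::future<> delayed_hello_world() {
    co_await delayed_print(std::string("hello world!"));
}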
Hardware evolution
Not just CPUs:
■ Disk (SSD -> NVMe)
■ Network (100Gbps, 400Gbps)
Usually partitioned for virtualized workloads.
What if we want to run one high-throughput application on the whole machine?