Extreme HTTP Performance Tuning:
1.2M API req/s on a 4 vCPU EC2 Instance
Marc Richards
Chief Problem Solver at Talawah Solutions
■ Based in Kingston, Jamaica
■ Cloud Computing Consultant for almost a decade
■ Solutions Architect / DevOps Engineer / Performance Engineer
■ No low-level systems performance tuning experience before this project!
Demystifying Systems Performance Tuning
■ You don't need to be a kernel developer or a wizard sysadmin.
■ FlameGraph and bpftrace have changed the game.
■ New eBPF-based tools coming out will only make things easier!
Overview
■ I accidentally fell down this optimization rabbit hole.
■ Started with a simple, high-performance API server written in C.
■ Used FlameGraph and bpftrace to analyze and optimize the entire stack.
Overview
■ Cloud: AWS
■ Hardware: 4 vCPU c5n.xlarge** (server) / 16 vCPU c5n.4xlarge (client)
■ Benchmark: TechEmpower JSON Serialization test
■ Server: TechEmpower libreactor implementation
** To minimize inconsistencies at the platform level, I did the final benchmark run on a c5n.9xlarge that was restricted to 4 vCPUs using the EC2 CPU Options feature.
Blog post with even more details
https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/
Optimizations
Optimization                                  Gain   Req/s
Ground Zero                                   -      224k
Application Optimizations                     55%    347k
Disabling Speculative Execution Mitigations   28%    446k
Disabling Syscall Auditing / Blocking         11%    495k
Disabling iptables / netfilter                22%    603k
Perfect Locality                              38%    834k
Interrupt Optimizations                       28%    1.06M
The Case of the Nosy Neighbor                 6%     1.12M
The Battle Against the Spin Lock              2%     1.15M
This Goes to Twelve                           4%     1.20M
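The per-stage gains in the table compound multiplicatively rather than adding up. As a quick sanity check, the sketch below multiplies the rounded percentages from the table and compares the result with the overall 224k -> 1.20M ratio. The variable names are illustrative and not from the talk.

```python
from math import prod

# Rounded per-stage gains copied from the table above (percent).
stage_gains_pct = [55, 28, 11, 22, 38, 28, 6, 2, 4]

# Each stage multiplies throughput by (1 + gain), so the overall
# speedup is the product of those factors, not the sum of the gains.
compound_speedup = prod(1 + g / 100 for g in stage_gains_pct)

# Overall ratio taken from the first and last rows of the table (req/s).
measured_speedup = 1_200_000 / 224_000

print(f"compounded from stage gains: {compound_speedup:.2f}x")
print(f"measured end to end:         {measured_speedup:.2f}x")
```

The two figures come out at roughly 5.34x and 5.36x; the small gap is consistent with the percentages and req/s values in the table being rounded.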
Ground Zero
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 1.14ms
90.00% 1.21ms
99.00% 1.26ms
99.99% 1.32ms
2243551 requests in 10.00s, 331.64MB read
Requests/sec: 224,353.73
* I modified nginx.conf to send back a hardcoded JSON response. This is not part of the TechEmpower implementation.
Application Optimizations
■ Run on all logical cores/vCPUs: ~25%
■ gcc -O3 and -march=native: ~15%
■ send/recv instead of write/read: ~5%
■ Remove pthread overhead: ~3%
Application Optimizations
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 723.00us
90.00% 0.88ms
99.00% 0.94ms
99.99% 1.08ms
3470892 requests in 10.00s, 483.27MB read
Requests/sec: 347,087.15
[Figure: Application Optimizations, before and after]
Disabling...
Speculative Execution Mitigations
Syscall Auditing / Blocking
iptables/netfilter
Disabling...
■ Speculative Execution Mitigations: 28%
● nospectre_v1 nospectre_v2 pti=off mds=off tsx_async_abort=off
■ Syscall Auditing/Blocking: 11%
● auditctl -a never,task
● docker run -d --security-opt seccomp=unconfined libreactor
■ iptables/netfilter: 22%
● modprobe -rv ip_tables
● ExecStart=/usr/bin/dockerd --bridge=none --iptables=false --ip-forward=false
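Before benchmarking, it helps to confirm that the kernel actually booted with the mitigation flags listed above. A minimal sketch, assuming the flags are passed on the kernel command line (readable from /proc/cmdline on Linux). The helper name `missing_boot_flags` is hypothetical and not something from the talk.

```python
from pathlib import Path

# Boot parameters from the slide that disable speculative execution mitigations.
MITIGATION_FLAGS = (
    "nospectre_v1",
    "nospectre_v2",
    "pti=off",
    "mds=off",
    "tsx_async_abort=off",
)

def missing_boot_flags(cmdline, wanted=MITIGATION_FLAGS):
    """Return the wanted flags that are not whole tokens of a kernel command line."""
    present = set(cmdline.split())
    return [flag for flag in wanted if flag not in present]

if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")  # Linux only
    if cmdline_path.exists():
        missing = missing_boot_flags(cmdline_path.read_text())
        print("missing flags:", missing or "none")
```

An empty result means every flag from the slide is present on the running kernel's command line.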
Disabling...
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 419.00us
90.00% 479.00us
99.00% 517.00us
99.99% 575.00us
6031161 requests in 10.00s, 839.76MB read
Requests/sec: 603,112.18
[Figure: Disabling..., before and after]
Perfect Locality + Interrupt Optimizations
■ Perfect Locality
● Pin processes to CPUs
● Pin network queues to CPUs (RSS + XPS)
● SO_REUSEPORT + SO_ATTACH_REUSEPORT_CBPF
■ Interrupt Moderation
● ethtool -C eth0 adaptive-rx on
■ Busy polling
● net.core.busy_poll=1
■ Perfect Locality + Interrupt Moderation + Busy Polling = 💯
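The first two ingredients of perfect locality, one pinned worker per CPU and one SO_REUSEPORT listening socket per worker, can be sketched as follows. This is an illustration in Python rather than the C used by libreactor, it is Linux-only, and the helper names are hypothetical. It does not attach the SO_ATTACH_REUSEPORT_CBPF program or configure RSS/XPS.

```python
import os
import socket

def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})

def reuseport_listener(host, port):
    """Create a listening TCP socket that shares its port via SO_REUSEPORT."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock

if __name__ == "__main__":
    allowed = sorted(os.sched_getaffinity(0))
    # Port 0 lets the kernel pick a free port for the first listener.
    first = reuseport_listener("127.0.0.1", 0)
    port = first.getsockname()[1]
    # One extra listener per remaining CPU, all bound to the same port.
    listeners = [first] + [reuseport_listener("127.0.0.1", port) for _ in allowed[1:]]
    print(f"{len(listeners)} listeners share port {port}")
    # In a real server each worker process would own one listener
    # and call pin_to_cpu(cpu) for its own CPU before serving.
    for sock in listeners:
        sock.close()
```

With only SO_REUSEPORT, the kernel spreads incoming connections across the listeners by hashing, so a connection may land on a socket owned by a different CPU than the one that received its packets. Closing that gap is what the SO_ATTACH_REUSEPORT_CBPF bullet refers to.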
Perfect Locality + Interrupt Optimizations
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 233.00us
90.00% 263.00us
99.00% 292.00us
99.99% 348.00us
10660410 requests in 10.00s, 1.45GB read
Requests/sec: 1,066,034.60
[Figure: Perfect Locality + Interrupt Optimizations, before and after]
The Case of the Nosy Neighbor + The Battle Against the Spin Lock
The Case of the Nosy Neighbor
Someone, somewhere was spying on all my packets (kinda)
■ dev_queue_xmit_nit() -> packet_rcv()
■ packet_rcv() implicates AF_PACKET
■ sudo ss --packet --processes -> (("dhclient",pid=3191,fd=5))
■ My (extreme) solution was to disable dhclient after boot
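To make that `ss` check scriptable, the process entries can be pulled out with a small parser. A sketch, assuming entries always have the `("name",pid=N,fd=N)` shape shown on the slide; the function name is hypothetical.

```python
import re

# Matches one entry such as ("dhclient",pid=3191,fd=5)
ENTRY = re.compile(r'\("(?P<name>[^"]+)",pid=(?P<pid>\d+),fd=(?P<fd>\d+)\)')

def packet_socket_owners(ss_output):
    """Return (process name, pid, fd) for every process entry found in ss output."""
    return [
        (m["name"], int(m["pid"]), int(m["fd"]))
        for m in ENTRY.finditer(ss_output)
    ]

print(packet_socket_owners('(("dhclient",pid=3191,fd=5))'))
# prints [('dhclient', 3191, 5)]
```

Any process this returns holds an AF_PACKET socket, which makes it a candidate for the extra packet_rcv() work seen on the transmit path.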
The Case of the Nosy Neighbor
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 218.00us
90.00% 254.00us
99.00% 285.00us
99.99% 341.00us
11279049 requests in 10.00s, 1.53GB read
Requests/sec: 1,127,894.86
[Figure: The Case of the Nosy Neighbor, before and after]
The Battle Against the Spin Lock
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 212.00us
90.00% 246.00us
99.00% 276.00us
99.99% 338.00us
11551707 requests in 10.00s, 1.57GB read
Requests/sec: 1,155,162.15
[Figure: The Battle Against the Spin Lock, before and after]
This Goes to Twelve
■ Disabling Generic Receive Offload (GRO)
■ TCP Congestion Control: cubic -> reno
■ Static Interrupt Moderation
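The slide names these three tweaks without showing commands. The sketch below builds the usual ethtool and sysctl invocations for them without running anything. The interface name and the microsecond values are placeholders rather than the settings used in the talk, and the function name is hypothetical.

```python
import shlex

def tuning_commands(iface, rx_usecs, tx_usecs):
    """Build (but do not run) the commands for the three tweaks listed above."""
    return [
        # Disable Generic Receive Offload on the interface.
        ["ethtool", "-K", iface, "gro", "off"],
        # Switch the default TCP congestion control algorithm from cubic to reno.
        ["sysctl", "-w", "net.ipv4.tcp_congestion_control=reno"],
        # Replace adaptive moderation with fixed interrupt coalescing intervals.
        ["ethtool", "-C", iface, "adaptive-rx", "off",
         "rx-usecs", str(rx_usecs), "tx-usecs", str(tx_usecs)],
    ]

# Placeholder values, chosen only to show the shape of the commands.
for argv in tuning_commands("eth0", rx_usecs=100, tx_usecs=100):
    print(shlex.join(argv))
```

Applying them needs root. Each list can be passed to `subprocess.run(argv, check=True)` once suitable values have been chosen for the system under test.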
This Goes to Twelve
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 203.00us
90.00% 236.00us
99.00% 265.00us
99.99% 317.00us
12031718 requests in 10.00s, 1.64GB read
Requests/sec: 1,203,164.22
Conclusion
436% increase in requests per second. 79% reduction in p99 latency.
■ Throughput: 224k req/s -> 1.2M req/s
■ p99 latency: 1.26ms -> 265.00us
■ p99.99 latency: 1.32ms -> 317.00us
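The headline percentages follow directly from the first and last wrk runs. A short worked check, using the Requests/sec and latency figures reported in those runs (the helper names are illustrative):

```python
def pct_increase(before, after):
    """Percentage increase going from before to after."""
    return (after - before) / before * 100

def pct_reduction(before, after):
    """Percentage reduction going from before to after."""
    return (before - after) / before * 100

# Requests/sec reported by wrk for the Ground Zero and final runs.
print(f"throughput: +{pct_increase(224_353.73, 1_203_164.22):.0f}%")

# Latencies converted to microseconds (1.26ms -> 265us, 1.32ms -> 317us).
print(f"p99:        -{pct_reduction(1260, 265):.0f}%")
print(f"p99.99:     -{pct_reduction(1320, 317):.0f}%")
```

This reproduces the 436% throughput increase and the 79% p99 reduction, and gives a reduction of about 76% at p99.99.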
All 11 implementations on a c5n.xlarge using the stock Amazon Linux 2 AMI without any OS/Networking optimizations
All 11 implementations on a c5n.xlarge with all OS/Networking optimizations applied
Next Steps
■ Next gen kernel: 5.10 LTS
■ Next gen technologies: io_uring
■ Next gen instances: ARM vs Intel vs AMD
■ Driving performance from the bottom up using Rust, Java, etc.
Marc Richards
https://talawah.io/contact
@talawahtech
