Extreme HTTP Performance Tuning: 1.2M API req/s on a 4 vCPU EC2 Instance

Marc Richards
Chief Problem Solver at Talawah Solutions
■ Based in Kingston, Jamaica
■ Cloud Computing Consultant for almost a decade
■ Solutions Architect / DevOps Engineer / Performance Engineer
■ No low-level systems performance tuning experience before this project!

Demystifying Systems Performance Tuning
■ You don't need to be a kernel developer or a wizard sysadmin.
■ FlameGraph and bpftrace have changed the game.
■ New eBPF-based tools will only make things easier!

Overview
■ I accidentally fell down this optimization rabbit hole.
■ Started with a simple, high-performance API server written in C.
■ Used FlameGraph and bpftrace to analyze and optimize the entire stack.

Overview
■ Cloud: AWS
■ Hardware: 4 vCPU c5n.xlarge** (server) / 16 vCPU c5n.4xlarge (client)
■ Benchmark: TechEmpower JSON Serialization test
■ Server: TechEmpower libreactor implementation
** To minimize inconsistencies at the platform level, I did the final benchmark run on a c5n.9xlarge that was restricted to 4 vCPUs using the EC2 CPU Options feature.

Blog post with even more details
https://talawah.io/blog/extreme-http-performance-tuning-one-point-two-million/

Optimizations
Optimization                                   Gain   Req/s
Ground Zero                                    -      224k
Application Optimizations                      55%    347k
Disabling Speculative Execution Mitigations    28%    446k
Disabling Syscall Auditing / Blocking          11%    495k
Disabling iptables / netfilter                 22%    603k
Perfect Locality                               38%    834k
Interrupt Optimizations                        28%    1.06M
The Case of the Nosy Neighbor                  6%     1.12M
The Battle Against the Spin Lock               2%     1.15M
This Goes to Twelve                            4%     1.20M
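
The Gain column is the step-over-step increase in throughput. As a quick arithmetic check, the Python sketch below recomputes each gain from the Req/s figures on this slide. Those figures are rounded, so a recomputed percentage can differ from the listed one by about a point. The names `STEPS` and `step_gains` are illustrative and not from the talk.

```python
# Throughput after each step, in thousands of requests per second,
# copied from the rounded Req/s column on the slide.
STEPS = [
    ("Ground Zero", 224),
    ("Application Optimizations", 347),
    ("Disabling Speculative Execution Mitigations", 446),
    ("Disabling Syscall Auditing / Blocking", 495),
    ("Disabling iptables / netfilter", 603),
    ("Perfect Locality", 834),
    ("Interrupt Optimizations", 1060),
    ("The Case of the Nosy Neighbor", 1120),
    ("The Battle Against the Spin Lock", 1150),
    ("This Goes to Twelve", 1200),
]


def step_gains(steps):
    """Return (name, percent gain over the previous step) for every step after the first."""
    gains = []
    for (_, prev), (name, cur) in zip(steps, steps[1:]):
        gains.append((name, (cur - prev) / prev * 100.0))
    return gains


if __name__ == "__main__":
    for name, gain in step_gains(STEPS):
        print(f"{name:<45} {gain:5.1f}%")
```

Because each gain is relative to the previous step, the gains compound rather than add up.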

Ground Zero
Running 10s test @ http://server.tfb:8080/json
16 threads and 256 connections
Latency Distribution
50.00% 1.14ms
90.00% 1.21ms
99.00% 1.26ms
99.99% 1.32ms
2243551 requests in 10.00s, 331.64MB read
Requests/sec: 224,353.73

* I modified nginx.conf to send back a hardcoded JSON response. This is not a part of the TechEmpower implementation.

Application Optimizations
■ Run on all logical cores/vCPUs: ~25%
■ gcc -O3 and -march=native: ~15%
■ send/recv instead of write/read: ~5%
■ Remove pthread overhead: ~3%

50.00% 723.00us
90.00% 0.88ms
99.00% 0.94ms
99.99% 1.08ms

[Figure: before and after]

Disabling...
Speculative Execution Mitigations
Syscall Auditing / Blocking
iptables/netfilter

Disabling...
■ Speculative Execution Mitigations: 28%
● nospectre_v1 nospectre_v2 pti=off mds=off tsx_async_abort=off
■ Syscall Auditing/Blocking: 11%
● auditctl -a never,task
● docker run -d --security-opt seccomp=unconfined libreactor
■ iptables/netfilter: 22%
● modprobe -rv ip_tables
● ExecStart=/usr/bin/dockerd --bridge=none --iptables=false --ip-forward=false
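
After rebooting with the boot parameters above, it can help to confirm what the kernel actually reports. The sketch below is not from the talk. It reads the standard Linux sysfs directory `/sys/devices/system/cpu/vulnerabilities`, where each file holds a one-line status for one vulnerability. The function name `mitigation_status` is hypothetical.

```python
from pathlib import Path

# Standard sysfs directory where Linux reports per-vulnerability mitigation status.
VULN_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def mitigation_status(vuln_dir=VULN_DIR):
    """Map each vulnerability name to the kernel's one-line status string.

    Returns an empty dict when the directory does not exist
    (for example on non-Linux systems or very old kernels).
    """
    if not vuln_dir.is_dir():
        return {}
    status = {}
    for entry in sorted(vuln_dir.iterdir()):
        try:
            status[entry.name] = entry.read_text().strip()
        except OSError:
            status[entry.name] = "unreadable"
    return status


if __name__ == "__main__":
    for name, state in mitigation_status().items():
        print(f"{name:<24} {state}")
```

Entries whose status begins with "Mitigation:" are still being mitigated, which suggests the corresponding boot parameter did not take effect.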

Disabling...
50.00% 419.00us
90.00% 479.00us
99.00% 517.00us
99.99% 575.00us

Perfect Locality + Interrupt Optimizations
■ Perfect Locality
● Pin processes to CPUs
● Pin network queues to CPUs (RSS + XPS)
● SO_REUSEPORT + SO_ATTACH_REUSEPORT_CBPF
■ Interrupt Moderation
● ethtool -C eth0 adaptive-rx on
■ Busy polling
● net.core.busy_poll=1
■ Perfect Locality + Interrupt Moderation + Busy Polling = 💯
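
Two of the building blocks listed under Perfect Locality can be sketched in a few lines: pinning a process to one CPU, and giving each process its own listening socket on a shared port with SO_REUSEPORT. The talk's server is written in C, so this Python version (Linux only) is just an illustration with hypothetical helper names. It leaves out the SO_ATTACH_REUSEPORT_CBPF step and the RSS/XPS queue pinning, so the kernel still spreads connections across the listeners by hash rather than by receiving CPU.

```python
import os
import socket


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})


def reuseport_listener(port, host="127.0.0.1", backlog=128):
    """Create a TCP listener that can share (host, port) with other SO_REUSEPORT sockets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before bind() on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(backlog)
    return sock


def demo():
    """Pin this process to its first allowed CPU and open two listeners on one port."""
    cpu = min(os.sched_getaffinity(0))
    pin_to_cpu(cpu)
    first = reuseport_listener(0)        # port 0: let the kernel pick a free port
    port = first.getsockname()[1]
    second = reuseport_listener(port)    # succeeds only because of SO_REUSEPORT
    print(f"pinned to CPU {cpu}; two listeners share port {port}")
    first.close()
    second.close()


if __name__ == "__main__":
    demo()
```

In a real server each worker process would call `pin_to_cpu` with its own CPU number and create its own listener, so that one process, one socket, and one network queue all live on the same core.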

50.00% 233.00us
90.00% 263.00us
99.00% 292.00us
99.99% 348.00us
10660410 requests in 10.00s, 1.45GB read
Requests/sec: 1,066,034.60

[Figure: before and after]

The Case of the Nosy Neighbor + The Battle Against the Spin Lock

Someone, somewhere was spying on all my packets (kinda)
■ dev_queue_xmit_nit() -> packet_rcv()
■ packet_rcv() implicates AF_PACKET
■ sudo ss --packet --processes -> (("dhclient",pid=3191,fd=5))
■ My (extreme) solution was to disable dhclient after boot
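
The same kind of check can be scripted. Linux lists open AF_PACKET sockets in `/proc/net/packet`, and the sketch below parses that file by its header row and then looks for processes holding the matching socket inode. It is an illustration rather than the method used in the talk (which used `ss`). The function names are hypothetical, and finding owners of other users' sockets requires root.

```python
import os
from pathlib import Path


def parse_proc_net_packet(text):
    """Parse the contents of /proc/net/packet into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def find_owners(inode, proc=Path("/proc")):
    """Return PIDs that hold a file descriptor for the socket with this inode."""
    target = f"socket:[{inode}]"
    pids = []
    for fd_dir in proc.glob("[0-9]*/fd"):
        try:
            if any(os.readlink(fd) == target for fd in fd_dir.iterdir()):
                pids.append(int(fd_dir.parent.name))
        except OSError:
            # Process exited, or we lack permission to inspect it.
            continue
    return pids


if __name__ == "__main__":
    path = Path("/proc/net/packet")
    rows = parse_proc_net_packet(path.read_text()) if path.exists() else []
    for row in rows:
        print(row.get("Inode"), find_owners(row.get("Inode")))
```

Any process that shows up here has a packet socket open, which means the kernel may be handing it a copy of outgoing traffic.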

50.00% 218.00us
90.00% 254.00us
99.00% 285.00us
99.99% 341.00us

[Figure: before and after]

50.00% 212.00us
90.00% 246.00us
99.00% 276.00us
99.99% 338.00us

[Figure: before and after]

This Goes to Twelve
■ Disabling Generic Receive Offload (GRO)
■ TCP Congestion Control: cubic -> reno
■ Static Interrupt Moderation

This Goes to Twelve
50.00% 203.00us
90.00% 236.00us
99.00% 265.00us
99.99% 317.00us

Conclusion
436% increase in requests per second. 79% reduction in p99 latency.
■ Throughput: 224k req/s -> 1.2M req/s
■ p99 latency: 1.26ms -> 265.00us
■ p99.99 latency: 1.32ms -> 317.00us
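
The headline percentages follow directly from the before and after figures. A small Python check, using the rounded throughput numbers and the latencies from this slide (helper names are illustrative):

```python
def percent_increase(before, after):
    """Relative increase from before to after, in percent."""
    return (after - before) / before * 100.0


def percent_reduction(before, after):
    """Relative reduction from before to after, in percent."""
    return (before - after) / before * 100.0


throughput_gain = percent_increase(224_000, 1_200_000)  # requests per second
p99_cut = percent_reduction(1260, 265)                  # microseconds
p9999_cut = percent_reduction(1320, 317)                # microseconds

if __name__ == "__main__":
    print(f"throughput: +{throughput_gain:.0f}%")
    print(f"p99:        -{p99_cut:.0f}%")
    print(f"p99.99:     -{p9999_cut:.0f}%")
```

This reproduces the 436% and 79% figures quoted above, and puts the p99.99 reduction at about 76%.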

All 11 implementations on a c5n.xlarge using the stock Amazon Linux 2 AMI without any OS/Networking optimizations

All 11 implementations on a c5n.xlarge with all OS/Networking optimizations applied

Next Steps
■ Next gen kernel: 5.10 LTS
■ Next gen technologies: io_uring
■ Next gen instances: ARM vs Intel vs AMD
■ Driving performance from the bottom up using Rust, Java, etc.

Marc Richards
https://talawah.io/contact
@talawahtech
