When the OS
gets in the way
(and what you can do about it)
Mark Price
@epickrram
LMAX Exchange
When the OS
gets in the way
(and what you can do about it)
Linux
Mark Price
@epickrram
LMAX Exchange
It’s not the OS’s fault
● Linux is an excellent general-purpose OS
● Many target platforms
● Scheduling is actually fairly complicated
● Low-latency is a special use-case
● We need to provide some hints
Why should I care?
Useful in some scenarios
● Low latency applications
● Response times < 1ms
● Compute-intensive workloads
● Long-running jobs
A real-world scenario: LMAX
System
Latency = T1 - T0
Before tuning: 250us / 10+ms (mean / max)
After tuning: 80us / <1ms (mean / max)
Jitter
● “slight irregular movement, variation, or unsteadiness, especially in an electrical signal or electronic device”
● Variation in response time latency
● Long-tail in response time
Dealing with it
● First take care of the low-hanging fruit
○ e.g. Garbage collection (gc-free / Zing)
○ e.g. Slow I/O
● Once response times are < 10ms the fun begins
● Make sure your code is running!
Measure first
● Need to validate changes are good
● End-to-end tests
● Using realistic load
● Change one thing and observe
● A refresher...
Modern hardware layout
Multi-tasking
● num(tasks) > num(HyperThreads)
● OS must share out hardware resources
● Clever? Dumb? Fast? Slow?
● Fair...
Linux CFS
● Completely Fair Scheduler
● Maintains a task ‘queue’ per HT
● Runs the task with the lowest runtime
● Updates task runtime after execution
● Higher priority implies longer execution time
● Tasks are load-balanced across HTs
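The bullets above can be illustrated with a toy model. This is only a sketch, not the kernel's algorithm: it assumes a single run queue, fixed-length time slices, and made-up task names and weights (`hot`, `background`, and the `slice_ns` default are all hypothetical). It always runs the task with the lowest weighted ("virtual") runtime, so a task with a higher weight ends up with more CPU time.

```python
import heapq


def simulate_cfs(weights, slices, slice_ns=1_000_000):
    """Toy model of one run queue: always run the task with the lowest
    virtual runtime, then charge it for the slice it just used."""
    # Heap entries are (virtual_runtime, task_name); ties break on the name.
    queue = [(0.0, name) for name in sorted(weights)]
    heapq.heapify(queue)
    cpu_time = {name: 0 for name in weights}

    for _ in range(slices):
        vruntime, name = heapq.heappop(queue)   # lowest virtual runtime wins
        cpu_time[name] += slice_ns              # real time spent on the CPU
        # A higher weight makes virtual runtime advance more slowly,
        # so the task is picked more often.
        vruntime += slice_ns / weights[name]
        heapq.heappush(queue, (vruntime, name))
    return cpu_time


if __name__ == "__main__":
    print(simulate_cfs({"hot": 2, "background": 1}, slices=300))
```

With a weight of 2 against 1, the `hot` task receives twice the CPU time of `background` in this model.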
An example application ...
Threads
… running on a language runtime
… running on an operating system
Optimise for locality - PCI/memory
Target deployment
How do I start?
Start with the metal
● BIOS settings for maximum performance
● That’s a whole other talk...
Discover what’s available
● lstopo is a useful tool for looking at hardware
● Provided by the hwloc package
● Displays:
○ HyperThreads
○ Physical cores
○ NUMA nodes
○ PCI locality
lstopo
[Figure: lstopo output, annotated with HyperThread, Core, Caches and NUMA-local RAM]
Reserve & use specific resource
● Use isolcpus to reserve cpu resource
○ kernel boot parameter
○ isolcpus=0-5,10-13
● Use taskset to pin your application to cpus:
○ taskset -c 10-13 java …
● Set affinity of hot threads:
○ sched_setaffinity(...)
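As a rough illustration of the last step, the sketch below pins a single thread to one CPU. It uses Python's `os.sched_setaffinity`, which wraps the Linux call of the same name; passing a native thread id instead of a process id is Linux-specific behaviour. The helper names `pin_current_thread` and `run_pinned` are invented for this example, and the CPU you pass would normally be one of the reserved ones.

```python
import os
import threading


def pin_current_thread(cpu):
    """Pin the calling thread to a single CPU and return its new affinity."""
    # Native (kernel) thread id; on Linux sched_setaffinity accepts it
    # in place of a pid and affects only that thread.
    tid = threading.get_native_id()
    os.sched_setaffinity(tid, {cpu})
    return os.sched_getaffinity(tid)


def run_pinned(cpu, work):
    """Run work() on a new thread that pins itself to `cpu` first."""
    result = {}

    def target():
        result["affinity"] = pin_current_thread(cpu)
        result["value"] = work()

    thread = threading.Thread(target=target, name=f"hot-{cpu}")
    thread.start()
    thread.join()
    return result


if __name__ == "__main__":
    # Pick any CPU this process is currently allowed to use.
    cpu = min(os.sched_getaffinity(0))
    print(run_pinned(cpu, lambda: "done"))
```

Only the new thread's affinity changes; the rest of the process keeps the mask it inherited, for example from taskset.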
Deploy the application
[Diagram labels: sched_setaffinity(), !{isolcpus}, taskset]
You have no load-balancer
Pile-up
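CPUs listed in isolcpus are removed from the scheduler's load balancing, so threads started on several isolated CPUs are not spread out for you. One way to spot a pile-up is to check which CPU each thread last ran on. The sketch below reads that from `/proc/<pid>/task/<tid>/stat` (field 39, `processor`); the function names are hypothetical and it is Linux-only.

```python
import os
from collections import Counter


def last_cpu_of_threads(pid):
    """Map each thread id of `pid` to the CPU it last ran on."""
    cpus = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as stat_file:
                stat = stat_file.read()
        except (FileNotFoundError, ProcessLookupError):
            continue  # the thread exited while we were scanning
        # The command name (field 2) may contain spaces, so split after
        # its closing ')'. The remaining list then starts at field 3.
        fields = stat.rsplit(")", 1)[1].split()
        cpus[int(tid)] = int(fields[36])  # field 39: 'processor'
    return cpus


def threads_per_cpu(pid):
    """Count how many threads of `pid` last ran on each CPU."""
    return Counter(last_cpu_of_threads(pid).values())


if __name__ == "__main__":
    print(threads_per_cpu(os.getpid()))
```

If most of an application's threads report the same CPU while its neighbours sit idle, that is the pile-up described above.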
A solution: cpusets
● Create hierarchical sets of reserved resource
● CPU, memory
● Userland tools: cset (SUSE)
Isolate OS processes
● cset set --set=/system --cpu=6-9
○ create a cpuset with cpus 6-9
○ create it at the path /system
● cset proc --move --fromset=/ --toset=/system
○ move all processes from / to /system
○ -k => move unbound kernel threads
○ --threads => move child threads
○ --force => erm... force
Run the application
● cset set --cpu=0-5,10-13 --set=/app
● cset proc --exec /app taskset -c 10-13 java …
○ start a process in the /app cpuset
○ run the program on cpus 10-13
● sched_setaffinity() to pin the hot threads to cpus 1,3,5
Isolated threads
[Diagram labels: /app with sched_setaffinity(), /system, /app with taskset]
No more jitter?
perf_events
● Sampling tracer
● Static/dynamic trace points
● Very low overhead
● A good starting point for digging deeper
● perf list to view available trace points
○ network, file-system, scheduler, etc
What’s happening on the CPU?
● perf record -e "sched:sched_switch" -C 3
○ Sample task switches on CPU 3
● perf report (best for multiple events)
● perf script (best for single events)
Rogue process
java 36049 [003] 3011858.780856: sched:sched_switch: java:36049 [110] R ==> kworker/3:1:13991 [120]
kworker/3:1 13991 [003] 3011858.780861: sched:sched_switch: kworker/3:1:13991 [120] S ==> java:36049 [110]
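When a recording contains many switches, it helps to count which tasks displaced the application thread. The sketch below parses `sched:sched_switch` lines in the layout shown on this slide and counts the tasks that were switched in while the victim was still runnable (state `R`). The regular expression and the name `preempted_by` are assumptions made for this example; perf's output layout differs between versions, so the pattern may need adjusting.

```python
import re
from collections import Counter

# Matches "... sched:sched_switch: prev:pid [prio] STATE ==> next:pid [prio]"
SWITCH = re.compile(
    r"sched:sched_switch:\s+(?P<prev>\S+):(?P<prev_pid>\d+) \[\d+\] (?P<state>\S+)"
    r" ==> (?P<next>\S+):(?P<next_pid>\d+) \[\d+\]"
)


def preempted_by(lines, victim):
    """Count tasks that replaced `victim` while it was still runnable."""
    culprits = Counter()
    for line in lines:
        match = SWITCH.search(line)
        if match is None:
            continue  # not a sched_switch line in the expected layout
        # State R means the previous task wanted to keep running.
        if match["prev"] == victim and match["state"].startswith("R"):
            culprits[match["next"]] += 1
    return culprits


if __name__ == "__main__":
    sample = [
        "java 36049 [003] 3011858.780856: sched:sched_switch: "
        "java:36049 [110] R ==> kworker/3:1:13991 [120]",
        "kworker/3:1 13991 [003] 3011858.780861: sched:sched_switch: "
        "kworker/3:1:13991 [120] S ==> java:36049 [110]",
    ]
    print(preempted_by(sample, "java"))
```

On the two lines from the slide, the only task reported is `kworker/3:1`.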
ftrace
● Function tracer
● Static/dynamic trace points
● Higher overhead
● But captures everything
● Can provide function graphs
● trace-cmd is the usable front-end
So what is that kernel thread doing?
● trace-cmd record -P <pid> -p function_graph
○ Trace functions called by process <pid>
● trace-cmd report
○ Display captured trace data
Some things can’t be deferred
kworker/3:1-13991 [003] 3013287.180771: funcgraph_entry: | process_one_work() {
kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: | cache_reap() {
kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: 0.137 us | mutex_trylock();
kworker/3:1-13991 [003] 3013287.180772: funcgraph_entry: 0.289 us | drain_array();
kworker/3:1-13991 [003] 3013287.180773: funcgraph_entry: 0.040 us | _cond_resched();
………
………
kworker/3:1-13991 [003] 3013287.180859: funcgraph_exit: +86.735 us | }
+86.735 us
Things to look out for
● cache_reap() - SLAB allocator
● vmstat_update() - kernel stats
● other workqueue events
○ perf record -e "workqueue:*" -C 3
● Interrupts - set affinity in /proc/irq
● Timer ticks - tickless mode
● CPU governor - set to performance
○ /sys/devices/system/cpu/cpuN/cpufreq/scaling_governor
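The governor setting is easy to check from a script. The sketch below lists every CPU whose `scaling_governor` file does not read `performance`. The function name and the `sysfs_cpu_dir` parameter are hypothetical (the parameter exists only so the check can be pointed at a test directory), and on machines without cpufreq support the result is simply empty.

```python
from pathlib import Path


def non_performance_cpus(sysfs_cpu_dir="/sys/devices/system/cpu"):
    """Return {cpu_name: governor} for CPUs not using 'performance'."""
    offenders = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for governor_file in sorted(Path(sysfs_cpu_dir).glob(pattern)):
        governor = governor_file.read_text().strip()
        if governor != "performance":
            # .../cpuN/cpufreq/scaling_governor -> "cpuN"
            offenders[governor_file.parent.parent.name] = governor
    return offenders


if __name__ == "__main__":
    for cpu, governor in non_performance_cpus().items():
        print(f"{cpu}: governor is {governor!r}")
```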
Some numbers
● Inter-thread latency is a good proxy
● 2 busy-spinning threads passing a message
● Time taken between producer & consumer
● Record times over several seconds
● Compare tuned/untuned
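Recorded times can be reduced to the kind of summary used in the results that follow. This sketch assumes the samples are latencies in nanoseconds and uses the nearest-rank method for percentiles; the benchmark behind the slides may compute them differently, and `summarise` is a name made up for this example.

```python
import math

PERCENTILES = (50.0, 90.0, 99.0, 99.9, 99.99)


def summarise(samples_ns):
    """Return mean, min, selected percentiles and max of the samples."""
    ordered = sorted(samples_ns)
    count = len(ordered)
    summary = {"mean": sum(ordered) / count, "min": ordered[0]}
    for percentile in PERCENTILES:
        # Nearest rank: the smallest value with at least p% of samples
        # at or below it.
        rank = min(count, max(1, math.ceil(percentile * count / 100)))
        summary[f"{percentile:.2f}%"] = ordered[rank - 1]
    summary["max"] = ordered[-1]
    return summary


if __name__ == "__main__":
    for label, value in summarise(range(1, 10001)).items():
        print(f"{label:>7} {value}")
```

Running it once on the untuned samples and once on the tuned samples gives two rows that can be compared column by column.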
Results
== Latency (ns) ==
             mean      min   50.00%   90.00%   99.00%   99.90%   99.99%      max
untuned       466      200      464      608      768      992     2432    69632
tuned         216      128      208      288      336      544     1664    69632
tuned vs untuned
tuned vs untuned (log scale)
Results (loaded system)
== Latency (ns) ==
             mean      min   50.00%   90.00%   99.00%   99.90%   99.99%      max
untuned       545      144      464      544      736     2944   294913   884739
tuned         332      216      336      352      448      544      704    36864
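To see where tuning matters most, the loaded-system figures can be turned into untuned/tuned ratios. The numbers below are copied from the results above; the variable and function names are only for illustration.

```python
LABELS = ["mean", "min", "50.00%", "90.00%", "99.00%", "99.90%", "99.99%", "max"]
UNTUNED_LOADED = [545, 144, 464, 544, 736, 2944, 294913, 884739]
TUNED_LOADED = [332, 216, 336, 352, 448, 544, 704, 36864]


def improvement(untuned, tuned):
    """Ratio untuned/tuned per column; values above 1 mean tuning helped."""
    return {label: u / t for label, u, t in zip(LABELS, untuned, tuned)}


if __name__ == "__main__":
    for label, ratio in improvement(UNTUNED_LOADED, TUNED_LOADED).items():
        print(f"{label:>7} {ratio:8.2f}x")
```

The mean improves by well under 2x and the minimum is actually slightly worse, while the 99.99th percentile improves by a factor of roughly 419 and the maximum by roughly 24. The benefit is almost entirely in the tail.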
tuned vs untuned (loaded system)
Summary
● Select threads that need access to CPU
● Isolate CPUs from the OS
● Pin important threads to isolated CPUs
● Don’t forget interrupts
● There will be more things…
● Always test assumptions!
● Run validation tests to ensure tunings are as expected
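A validation test can be as small as checking that the kernel was booted with the expected isolcpus list and that a process is confined to the CPUs you intended. The sketch below is one possible shape for such a check. The helper names are hypothetical, and the parser handles only plain lists such as `0-5,10-13`, skipping any non-numeric flags that newer kernels accept in front of the list.

```python
import os


def parse_cpu_list(spec):
    """Expand a CPU list such as '0-5,10-13' into a set of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part or not part[0].isdigit():
            continue  # skip flags such as 'domain'
        low, _, high = part.partition("-")
        cpus.update(range(int(low), int(high or low) + 1))
    return cpus


def isolated_cpus(cmdline):
    """Return the CPUs named by isolcpus= on a kernel command line."""
    for arg in cmdline.split():
        if arg.startswith("isolcpus="):
            return parse_cpu_list(arg.split("=", 1)[1])
    return set()


def unexpected_cpus(expected_cpus, pid=0):
    """CPUs that `pid` may run on but that are not in expected_cpus."""
    return os.sched_getaffinity(pid) - set(expected_cpus)


if __name__ == "__main__":
    with open("/proc/cmdline") as cmdline_file:
        print("isolated:", sorted(isolated_cpus(cmdline_file.read())))
    print("this process may run on:", sorted(os.sched_getaffinity(0)))
```

For the configuration used in this talk, `isolated_cpus` should return CPUs 0-5 and 10-13, and `unexpected_cpus({10, 11, 12, 13}, pid)` should be empty for an application launched with `taskset -c 10-13`.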
Thank you
● lmax.com/blog/staff-blogs/
● epickrram.blogspot.com
● github.com/epickrram/perf-workshop
● @epickrram
