sched-freq
integrating the scheduler
and cpufreq
Steve Muckle
BKK16-104 March 7, 2016
Linaro Connect BKK16
outline
- how things (don’t) work today
- schedfreq design
- latest test results
- an upstream surprise
- next steps
clarifications/quick questions here
proposals and debate in hacking session
Tuesday 8th March, 15:00-15:50
HACKING-2 (lobby lounge - 23rd floor)
outline
- how things (don’t) work today
- schedfreq design
- latest test results
- an upstream surprise
- next steps
how does cpufreq currently work?
- plugin architecture (governors)
- popular governors are sampling-based
- let's assume:
- fmin = 100 MHz, fmax = 1000 MHz
- a policy which goes to util*fmax + 100 MHz (sketched below)
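A minimal sketch of this assumed policy (illustrative code using the example fmin/fmax above, not any real governor's):

    /* freq = util*fmax + 100 MHz, clamped to fmax */
    #define FMIN_KHZ  100000   /* 100 MHz */
    #define FMAX_KHZ 1000000   /* 1000 MHz */

    static unsigned int pick_freq_khz(unsigned int busy_pct)
    {
            unsigned int f = busy_pct * (FMAX_KHZ / 100) + 100000;

            if (f > FMAX_KHZ)
                    f = FMAX_KHZ;
            return f;   /* e.g. 80% busy -> 900000 kHz, 10% -> 200000 kHz */
    }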
cpufreq governor sampling
[figure: CPU busy time overlaid with sampling windows; each window's
measured busy% picks a frequency via the policy above: 80% -> 900 MHz,
30% -> 400 MHz, 20% -> 300 MHz, 100% -> 1000 MHz, 100% -> 1000 MHz,
70% -> 800 MHz, 10% -> 200 MHz, 10% -> 200 MHz]
oops: the sampled average trails the actual busy pattern, so bursts of work
can run at a frequency chosen from a mostly idle sample
more problems: new tasks
[figure: a new task starts. Window samples and resulting frequencies:
0% -> 100 MHz, 60% -> 700 MHz, then 100% -> 1000 MHz for the remaining
six windows]
oops: the governor has to observe busy samples before ramping up, so a new
task initially runs at the lowest frequency
more problems: exiting tasks
[figure: a 100%-busy task exits. Window samples and resulting frequencies:
100% -> 1000 MHz (x4), 0% -> 100 MHz, 100% -> 1000 MHz, 90% -> 1000 MHz,
0% -> 100 MHz]
oops: samples taken while the task was running keep the CPU at high
frequency after the work is gone
more problems: task migration
[figure: a task migrates between CPUs; only per-CPU busy time is sampled,
so the source CPU keeps requesting a high frequency while the destination
CPU starts out slow ("oops" on both CPUs)]
more problems: tuning
- ondemand has 7 knobs
- interactive has 11 knobs
- ...and more get added by OEMs
- along with hacks
the line in the sand
Ingo Molnar, May 31 2013
Note that I still disagree with the whole design notion of
having an "idle back-end" (and a 'cpufreq back end')
separate from scheduler power saving policy…
This is a "line in the sand", a 'must have' design property
for any scheduler power saving patches to be acceptable…
https://lwn.net/Articles/552889/
outline
- how things (don’t) work today
- schedfreq design
- latest test results
- an upstream surprise
- next steps
schedfreq design
[diagram: capacity requests from the CFS, RT, and DL scheduling classes
feed schedfreq, a cpufreq governor, which drives the cpufreq core and the
frequency driver for CPU N]
- schedfreq is a cpufreq governor
- can cpufreq be removed?
estimating CFS capacity
- per-entity load tracking (PELT)
- introduced in 3.8
- exponential moving average
- sum = is_running() + sum*y (see sketch below)
- frequency invariance required
- partially merged (core bits in, ARM support not)
- microarch invariance required
- partially merged (core bits in, ARM support not)
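A toy model of that recurrence (floating point for clarity; the kernel uses fixed-point arithmetic over ~1ms periods):

    #include <math.h>

    /* PELT-style geometric average: each period contributes 1024 if the
     * entity was running, and the old sum decays by y, where y^32 = 0.5. */
    static double pelt_step(double sum, int was_running)
    {
            const double y = pow(0.5, 1.0 / 32.0);  /* half-life: 32 periods */

            return (was_running ? 1024.0 : 0.0) + sum * y;
    }
    /* An always-running task converges toward 1024/(1-y), roughly 47.7k
     * (the kernel's fixed-point limit is LOAD_AVG_MAX), which then gets
     * rescaled into the 0..1024 utilization range. */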
estimating CFS capacity - PELT
- initial task load
- was 0 when a task was fork-balanced to a different CPU,
due to a bug
- fix is in mainline/4.5
- http://thread.gmane.org/gmane.linux.kernel/2106780/
- blocked load is included in util_avg
estimating DL capacity
- runtime utilization tracking not strictly required
- DL tasks have runtime, deadline, and period parameters
- this describes the task’s bandwidth reservation
- util = runtime/period
- track DL bandwidth admitted into the system
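For example, a DL task with runtime 5 ms and period 20 ms reserves util = 5/20 = 25% of a CPU. A minimal sketch of that conversion (illustrative, not the kernel's dl_bw accounting):

    /* util in the scheduler's 0..1024 capacity scale */
    static unsigned long dl_task_util(unsigned long long runtime_ns,
                                      unsigned long long period_ns)
    {
            return (unsigned long)(runtime_ns * 1024 / period_ns);
    }
    /* 5 ms runtime / 20 ms period -> 256, a quarter of one CPU's capacity */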
estimating DL capacity
- util = runtime/period has drawbacks
- it’s worst case
- it’s always there
- better solution - track active utilization
- related to bandwidth reclaiming
- both solutions under discussion
estimating RT capacity
- task priority but no constraints
- monitor RT utilization
- use rt_avg
- no way to react to short latency constraints
- focus on long term constraints and soft real time
- may not be optimal but it already exists
- be sure to budget capacity for RT
- do not steal from other classes’ capacity requests
aggregation of sched classes
- sched class capacities are summed
- headroom added to CFS and RT
- (CFS + RT) * 1.25
- no headroom for DL tasks
- DL tasks have precise capacity parameters
- total capacity converted to frequency
- scale using policy->max (sketched below)
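A sketch of this aggregation, assuming capacities in the scheduler's 0..1024 scale and policy->max passed in as max_khz (illustrative names, not the schedfreq code):

    static unsigned int total_req_khz(unsigned long cfs, unsigned long rt,
                                      unsigned long dl, unsigned int max_khz)
    {
            /* 25% headroom for CFS and RT; DL is a precise reservation */
            unsigned long cap = (cfs + rt) + (cfs + rt) / 4 + dl;

            if (cap > 1024)
                    cap = 1024;
            return (unsigned int)(cap * max_khz / 1024);
    }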
aggregation of CPUs
[diagram: four CPUs in one frequency domain requesting 1.6 GHz, 1.3 GHz,
1.0 GHz, and 300 MHz; the domain runs at the highest request]
- CPU with max request in freq domain drives frequency (sketch below)
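The per-domain decision under that rule, as a sketch (hypothetical helper):

    /* The domain runs at the maximum of its CPUs' requests,
     * e.g. {1.6, 1.3, 1.0, 0.3} GHz -> 1.6 GHz. */
    static unsigned int domain_freq_khz(const unsigned int *req_khz, int ncpus)
    {
            unsigned int max = 0;

            for (int i = 0; i < ncpus; i++)
                    if (req_khz[i] > max)
                            max = req_khz[i];
            return max;
    }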
setting the frequency
- tricky to do from hot scheduler paths
- locking
- performance implications
- varying CPU frequency driver behavior
- does driver sleep?
- is driver slow?
setting the frequency - fast path
- target freq can always be calculated
- if…
- the driver doesn’t sleep and isn’t slow AND
- schedfreq isn’t throttled AND
- a freq transition isn’t underway AND
- the slow path isn’t active
then we can set freq in the fast path (see sketch below)
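The gating condition as a sketch (flag names are invented for illustration; when it returns false, the request is handed to the slow-path kthread):

    #include <stdbool.h>

    static bool can_use_fast_path(bool driver_slow, bool driver_sleeps,
                                  bool throttled, bool transition_ongoing,
                                  bool slowpath_active)
    {
            return !driver_slow && !driver_sleeps && !throttled &&
                   !transition_ongoing && !slowpath_active;
    }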
setting the frequency - slow path
- kthread spawned by schedfreq
- safe to sleep
- safe to do more work
but
- task wake overhead ($$$)
locking
- lots of cleanup going on in cpufreq
- ongoing work from several people
- no blocking locking issues seen
...for now
locking
- sched hooks hold rq lock
- protect per-CPU data
- avoid accessing policy->rwsem
- freq_table
- min/max
- transition_ongoing
- not required to initiate freq transitions
locking
- schedfreq has 3 internal locks
- gov_enable_lock (mutex)
- GOV_START/STOP for static key control
- fastpath_lock (spinlock)
- solve race to re-evaluate frequency domain
- slowpath_lock (mutex)
- solve race between slow path, fast path,
GOV_START/STOP
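The three locks above, gathered into an illustrative per-governor struct (a sketch, not the actual schedfreq layout):

    #include <linux/mutex.h>
    #include <linux/spinlock.h>

    struct schedfreq_gov_data {               /* hypothetical container */
            struct mutex    gov_enable_lock;  /* GOV_START/STOP, static keys */
            spinlock_t      fastpath_lock;    /* race to re-evaluate domain */
            struct mutex    slowpath_lock;    /* slow vs fast path vs
                                                 GOV_START/STOP */
    };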
scheduler hooks
- enqueue_task_fair, dequeue_task_fair*
- set CFS capacity
- CFS load balance paths
- set CFS capacity at src, dest
- scheduler_tick()
- jump to fmax if headroom is impacted
- pick_next_task_rt, task_tick_rt
- set RT capacity
scheduler hooks - todo
- DL
- migration paths in kernel/sched/core.c
- changing task affinity
- hotplug
- balance on exec()
- NUMA balancing
policy summary
- re-evaluate and set freq when tasks
- wake
- block (except when CPU goes idle)
- migrate
- event driven - too many events?
- go to fmax at tick if headroom is impacted
policy summary
- when CPU goes idle, clear vote but don’t re-evaluate/set domain freq
- don’t initiate more work when going idle
- right thing to do?
policy summary
- PELT is very important, will need work
- Patrick Bellasi’s util_est
- buffer utilization value to yield more stable estimate
- Vincent Guittot’s invariance improvements
- same amount of work over the same period at different freqs => different utilizations
- tuning via schedtune
outline
- how things (don’t) work today
- schedfreq design
- latest test results
- an upstream surprise
- next steps
rt-app cpufreq test case
- simple periodic workload
- each test case uses a different duty cycle
- 16 different test cases/duty cycles
- either 10, 100 or 1000 loops
- varies from ~1% to ~43% busy
rt-app cpufreq test case
test case  busy (ms)  idle (ms)  loops  duration (s)  busy%
 1             1         100      100       10.1       0.99%
 2            10        1000       10       10.1       0.99%
 3             1          10     1000       11         9.09%
 4            10         100      100       11         9.09%
 5           100        1000       10       11         9.09%
 6             6          33      300       11.7      15.38%
 7            66         333       30       11.97     16.54%
 8             4          10     1000       14        28.57%
 9            40         100      100       14        28.57%
10           400        1000       10       14        28.57%
11             5           9     1000       14        35.71%
12            50          90      100       14        35.71%
13           500         900       10       14        35.71%
14             9          12     1000       21        42.86%
15            90         120      100       21        42.86%
16           900        1200       10       21        42.86%
rt-app cpufreq test case
- for each loop, record
- time to execute busy work
- whether busy work overran period
- for each test case, report
- average time to complete busy work
- number of overruns
rt-app cpufreq test case
- define overhead as ...
(avg_time_test_gov - avg_time_perf_gov) /
(avg_time_pwrsv_gov - avg_time_perf_gov)
- 0% = completes as fast as perf gov
- 100% = completes as fast as powersave gov
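Worked example with made-up numbers: if the busy work averages 100 ms under the performance governor, 200 ms under powersave, and 130 ms under the governor being tested, overhead = (130 - 100) / (200 - 100) = 30%.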
Samsung Chromebook 2
- “Peach Pi”
- Exynos 5800
- CPUs 0-3: 200 MHz - 2000 MHz A15
- CPUs 4-7: 200 MHz - 1300 MHz A7
- A15 fmax is 1800 MHz with the most recent clock support
- no power numbers yet
Samsung Chromebook 2
SCHED_OTHER (CFS): OR = overruns, OH = overhead vs. the performance governor
                           ondemand     interactive    sched
run (ms)  idle (ms)  loops  OR  OH       OR  OH        OR  OH
1 100 100 0 62.07% 0 100.02% 0 78.49%
10 1000 10 0 21.80% 0 22.74% 0 72.56%
1 10 1000 0 21.72% 0 63.08% 0 52.40%
10 100 100 0 8.09% 0 15.53% 0 17.33%
100 1000 10 0 1.83% 0 1.77% 0 0.29%
6 33 300 0 15.32% 0 8.60% 0 17.34%
66 333 30 0 0.79% 0 3.18% 0 12.26%
4 10 1000 0 5.87% 0 10.21% 0 6.15%
40 100 100 0 0.41% 0 0.04% 0 2.68%
400 1000 10 0 0.42% 0 0.50% 0 1.22%
5 9 1000 2 3.82% 1 6.10% 0 2.51%
50 90 100 0 0.19% 0 0.05% 0 1.71%
500 900 10 0 0.37% 0 0.38% 0 1.82%
9 12 1000 6 1.79% 1 0.77% 0 0.26%
90 120 100 0 0.16% 1 0.05% 0 0.49%
900 1200 10 0 0.09% 0 0.26% 0 0.62%
Looks mostly good…
Samsung Chromebook 2
SCHED_FIFO (RT): OR = overruns, OH = overhead vs. the performance governor
                           ondemand     interactive    sched
run (ms)  idle (ms)  loops  OR  OH       OR  OH        OR  OH
1 100 100 0 39.61% 0 100.49% 0 99.57%
10 1000 10 0 73.51% 0 21.09% 0 96.66%
1 10 1000 0 18.01% 0 61.46% 0 67.68%
10 100 100 0 31.31% 0 18.62% 0 77.01%
100 1000 10 0 58.80% 0 1.90% 0 15.40%
6 33 300 251 85.99% 0 9.20% 1 30.09%
66 333 30 24 84.03% 0 3.38% 0 33.23%
4 10 1000 0 6.23% 0 12.21% 10 11.54%
40 100 100 100 62.08% 0 0.11% 1 11.85%
400 1000 10 10 62.09% 0 0.51% 0 7.00%
5 9 1000 999 12.29% 1 6.03% 0 0.04%
50 90 100 99 61.47% 0 0.05% 2 6.53%
500 900 10 10 43.37% 0 0.39% 0 6.30%
9 12 1000 999 9.83% 0 0.01% 14 1.69%
90 120 100 99 61.47% 0 0.01% 28 2.29%
900 1200 10 10 43.31% 0 0.22% 0 2.15%
rt_avg mechanism not reacting fast enough (sched_time_avg_ms = 50).
MediaTek 8173 EVB
- CPUs 0-1: 507 MHz - 1573 MHz A53
- CPUs 2-3: 507 MHz - 1989 MHz A72
- power measured via onboard TI INA219s
- thanks to Freedom Tan for this data
MediaTek 8173 EVB
SCHED_OTHER (CFS): OR = overruns, OH = overhead vs. the performance governor
                           ondemand     interactive    sched
run (ms)  idle (ms)  loops  OR  OH       OR  OH        OR  OH
1 100 100 0 98.04% 0 100.41% 0 98.04%
10 1000 10 0 34.00% 0 67.68% 0 99.95%
1 10 1000 0 56.32% 0 101.11% 0 100.69%
10 100 100 0 18.31% 0 31.57% 0 100.02%
100 1000 10 0 2.77% 0 6.79% 0 7.72%
6 33 300 0 41.29% 0 100.27% 0 100.28%
66 333 30 0 1.27% 0 10.38% 0 39.31%
4 10 1000 0 21.45% 2 18.90% 6 65.66%
40 100 100 0 1.35% 0 8.16% 0 24.30%
400 1000 10 0 1.02% 0 1.74% 0 6.55%
5 9 1000 0 13.43% 2 14.14% 5 52.51%
50 90 100 0 1.31% 0 1.39% 1 12.62%
500 900 10 0 0.54% 0 1.32% 0 5.21%
9 12 1000 1 7.19% 1 8.59% 3 27.47%
90 120 100 0 0.88% 0 0.75% 1 3.80%
900 1200 10 0 0.16% 0 0.79% 0 2.83%
Trouble with most workloads with run < 100ms.
MediaTek 8173 EVB
SCHED_OTHER (CFS): sched governor power delta vs. ondemand and interactive,
with sched perf overhead repeated from the last slide for reference
run (ms)  idle (ms)  loops  power delta w/ondemand  power delta w/interactive  sched overhead
1 100 100 -7.56% 0.01% 98.04%
10 1000 10 -0.41% 1.21% 99.95%
1 10 1000 12.97% 3.92% 100.69%
10 100 100 -3.19% -4.90% 100.02%
100 1000 10 -7.00% 1.37% 7.72%
6 33 300 -11.84% -9.95% 100.28%
66 333 30 -8.15% -2.09% 39.31%
4 10 1000 -0.93% -8.59% 65.66%
40 100 100 -3.18% -9.88% 24.30%
400 1000 10 -2.99% 0.07% 6.55%
5 9 1000 1.67% -9.33% 52.51%
50 90 100 -5.97% -10.89% 12.62%
500 900 10 -2.29% 1.73% 5.21%
9 12 1000 -5.90% -9.75% 27.47%
90 120 100 -6.88% -5.12% 3.80%
900 1200 10 5.23% 6.57% 2.83%
avg delta w/interactive: -3.48%
avg delta w/ondemand: -2.9%
Power savings seen, but not meaningful given the observed perf losses.
outline
- how things (don’t) work today
- schedfreq design
- latest test results
- an upstream surprise
- next steps
an upstream surprise
- scheduler - cpufreq hooks
posted by Rafael Wysocki
- Jan 29th 2016
- now in linux-next
- sched utilization driven gov
also posted by Rafael
- Feb 21st 2016
important differences
- ondemand-like freq algorithm
- possible to get stuck due to freq invariance
- weird semantics w.r.t. headroom
- no aggregation of sched class capacities
- currently goes to fmax for RT, DL
- uses workqueue rather than kthread
what’s it mean?
- more engagement from upstream
- what’s the value of schedfreq?
words of encouragement
“What I'd like to see from a scheduler metrics
usage POV is a single central place,
kernel/sched/cpufreq.c, where all the high level
('governor') decisions are made.
This is the approach Steve's series takes.”
Ingo Molnar, 03/03/2016
Note: Mike Turquette and Juri Lelli conceived and
authored much of the schedfreq series.
outline
- how things (don’t) work today
- schedfreq design
- latest test results
- an upstream surprise
- next steps
next steps
- address shortcomings in schedutil
- freq algorithm
- scheduler hooks
- better RT, DL response
- more in-depth testing and analysis
- experiments with real-world usecases
- Android UI, games, benchmarks, etc.
- merging with EAS
- integration and testing with schedtune
- window-based load tracking
the end
backup
ondemand
- sample every sampling_rate usec
- cpu usage = busy% of last sampling_rate usec
- if busy% > up_threshold, go to fmax
- otherwise scale with load
- freq_next = fmin + busy% * (fmax - fmin) (sketched below)
- stay at fmax longer with sampling_down_factor
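The selection logic above as a sketch (simplified; the real governor lives in drivers/cpufreq/cpufreq_ondemand.c):

    static unsigned int ondemand_next_freq(unsigned int busy_pct,
                                           unsigned int fmin,
                                           unsigned int fmax,
                                           unsigned int up_threshold)
    {
            if (busy_pct > up_threshold)    /* default up_threshold is 80 */
                    return fmax;            /* jump straight to fmax */
            return fmin + busy_pct * (fmax - fmin) / 100;
    }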
interactive
- sample every timer_rate usec
- cpu usage = busy% of last timer_rate usec
- if busy% > go_hispeed_load go to hispeed_freq
- otherwise scale CPU according to target_loads
- e.g. “85 1000000:90 1700000:95”: target 85% load below 1 GHz,
90% from 1 GHz up to 1.7 GHz, and 95% at 1.7 GHz and above
- can prevent slowdown with min_sample_time
- delay speedups past hispeed_freq with
above_hispeed_delay
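A sketch of that target_loads lookup, encoding the example string above as a table (illustrative, not the governor's actual parser):

    #include <stddef.h>

    struct target_load { unsigned int freq_khz, load_pct; };

    /* "85 1000000:90 1700000:95": each load applies from its freq upward */
    static const struct target_load tl[] = {
            {       0, 85 },
            { 1000000, 90 },
            { 1700000, 95 },
    };

    static unsigned int target_load_for(unsigned int freq_khz)
    {
            unsigned int load = tl[0].load_pct;

            for (size_t i = 0; i < sizeof(tl) / sizeof(tl[0]); i++)
                    if (freq_khz >= tl[i].freq_khz)
                            load = tl[i].load_pct;
            return load;
    }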