1
Q2.12: Research Update on
big.LITTLE MP Scheduling
Morten Rasmussen
Technology Researcher
2
Why is big.LITTLE different from SMP?
 SMP:
 Scheduling goal is to distribute work evenly across all available CPUs
to get maximum performance.
 If we have DVFS support we can even save power this way too.
 big.LITTLE:
 Scheduling goal is to maximize power efficiency with only a modest
performance sacrifice.
 Tasks should be distributed unevenly. Only critical tasks should
execute on big CPUs to minimize power consumption.
 Contrary to SMP, it matters where a task is scheduled.
3
What is the (mainline) status?
 Example: Android UI render thread execution time.
[Plot: execution time on 4-core SMP vs. emulated 2+2 big.LITTLE]
It matters where a task is scheduled.
4
What is the (mainline) status?
 Example: Android UI render thread execution time.
[Plot: execution time on 4-core SMP vs. emulated 2+2 big.LITTLE with
big.LITTLE aware scheduling]
It matters where a task is scheduled.
5
Mainline Linux Scheduler
 Linux has two schedulers to handle the scheduling policies:
 RT: Real-time scheduler for very high priority tasks.
 CFS: Completely Fair Scheduler for everything else; it is used for
almost all tasks.
 We need proper big.LITTLE/heterogeneous platform support
in CFS.
 Load-balancing is currently based on an expression of CPU load
which is basically:
 The scheduler does not know how much CPU time is consumed by
each task.
 The current scheduler can distribute tasks fairly evenly based on
cpu_power on a big.LITTLE system, but this is not what we want for
power efficiency.
cpu_load = cpu_power * Σ_task prio_task
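The expression above can be made concrete with a small sketch. Load weights here are derived from nice values using the mainline convention that nice 0 maps to 1024 and each nice step changes the weight by roughly 1.25x; the exact kernel table (prio_to_weight[]) differs slightly, so treat these numbers as an approximation.

```python
# Sketch of the per-CPU load expression above:
#   cpu_load = cpu_power * sum(prio_task for tasks on the CPU)
# Weights are priority-derived only: the scheduler has no idea how much CPU
# time each task actually consumes.

def nice_to_weight(nice: int) -> int:
    # nice 0 -> 1024; each nice step ~1.25x (approximation of the kernel table)
    return round(1024 / (1.25 ** nice))

def cpu_load(cpu_power: int, nice_values) -> int:
    return cpu_power * sum(nice_to_weight(n) for n in nice_values)

# The same two nice-0 tasks produce twice the load figure on a core with
# twice the cpu_power, which is why balancing on this alone spreads tasks
# evenly rather than power-efficiently.
little = cpu_load(1024, [0, 0])
big = cpu_load(2048, [0, 0])
```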
6
Tracking task load
 The load contribution of a particular task is needed to make
an appropriate scheduling decision.
 We have experimented internally with identifying task
characteristics based on the tasks’ time slice utilization.
 Recently, Paul Turner (Google) posted an RFC patch set on
LKML with similar features.
 LKML: https://lkml.org/lkml/2012/2/1/763
7
Entity load-tracking summary
 Patch set for improving fair group scheduling, but adds some
essential bits that are very useful for big.LITTLE.
 Tracks the time each task spends on the runqueue (executing or
waiting), sampled approximately every ms. Note that: t_runqueue ≥ t_executing
 The contributed load is a geometric series over the history of time
spent on the runqueue scaled by the task priority.
[Figure: task load vs. task state; load rises while the task is executing
and decays while it sleeps]
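The decay can be sketched numerically. This assumes the commonly cited parameters of the entity load-tracking patches, roughly 1 ms accounting periods and a decay factor y chosen so that y^32 = 0.5 (a period's contribution halves after 32 ms); the kernel implements this in fixed-point arithmetic, not floats.

```python
# Sketch of entity load tracking: each ~1 ms period contributes the fraction
# of that period the task spent on the runqueue, decayed geometrically by y
# per period of age (y^32 = 0.5 is an assumed parameter choice).
Y = 0.5 ** (1 / 32)

def runnable_avg(history):
    """history[0] is the most recent period; entries are runqueue fractions."""
    return sum(frac * Y ** i for i, frac in enumerate(history))

def residency(history):
    """Normalize against an always-runnable task (full geometric series)."""
    return runnable_avg(history) * (1 - Y)

# An always-runnable task converges to ~1.0, a 50% duty cycle to ~0.5, and
# a task that recently went to sleep decays toward 0.
```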
8
big.LITTLE scheduling: First stab
 Policy: Keep all tasks on little cores unless:
1. The task load (runqueue residency) is above a fixed threshold, and
2. The task priority is default or higher (nice ≤ 0)
 Goal: Only use big cores when it is necessary.
 Frequent but low-intensity tasks are assumed to suffer minimally from
being stuck on a little core.
 High-intensity, low-priority tasks are not scheduled on big cores just
to finish earlier when that is not necessary.
 Tasks can migrate to match current requirements.
[Figure: two task-state traces with their tracked loads; a task migrates to
big when its load crosses the threshold and back to LITTLE when it drops]
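The two conditions can be written as a small predicate. The slides only say the threshold is fixed; the 0.8 value here is illustrative, not taken from the patches.

```python
# "First stab" big.LITTLE policy sketch: a task is eligible for a big core
# only if its runqueue residency exceeds a fixed threshold AND its nice value
# is 0 or below. BIG_THRESHOLD is a made-up illustrative value.
BIG_THRESHOLD = 0.8

def eligible_for_big(residency: float, nice: int) -> bool:
    return residency > BIG_THRESHOLD and nice <= 0

# A busy default-priority task qualifies; a busy nice-10 task and a mostly
# idle task both stay on LITTLE.
```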
9
Experimental Implementation
 Scheduler modifications:
 Apply PJT’s (Paul Turner’s) load-tracking patch set.
 Set up big and little sched_domains with
no load-balancing between them.
 select_task_rq_fair() checks task load
history to select appropriate target CPU
for tasks waking up.
 Add a forced-migration mechanism to
push the currently running task to a big
core, similar to the existing active load
balancing mechanism.
 Periodically check (in
run_rebalance_domains()) the current task
on each little runqueue for tasks that
need to be force-migrated to a big core.
 Note: There are known issues related to
global load-balancing.
[Diagram: separate big and LITTLE sched_domains, each with its own
load_balance; select_task_rq_fair()/forced migration moves tasks between them]
Forced migration latency:
~160 us on vexpress-a9 (migration -> schedule)
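The periodic check can be sketched as follows. The Task class and the rebalance_tick name are illustrative stand-ins; the real implementation hooks into run_rebalance_domains() and the active-balancing machinery.

```python
# Sketch of the forced-migration check: on each rebalance tick, inspect the
# task currently running on every little runqueue and push it to an idle big
# CPU if its tracked load qualifies (toy model, not kernel code).
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    residency: float  # tracked runqueue residency, 0..1
    nice: int = 0

def rebalance_tick(little_rqs, big_rqs, threshold=0.8):
    """little_rqs/big_rqs map cpu id -> currently running Task (None if idle)."""
    migrations = []
    for cpu, task in little_rqs.items():
        if task is None or task.residency <= threshold or task.nice > 0:
            continue  # task does not qualify for a big core
        target = next((b for b, t in big_rqs.items() if t is None), None)
        if target is None:
            continue  # no idle big CPU to push to
        big_rqs[target], little_rqs[cpu] = task, None
        migrations.append((task.name, cpu, target))
    return migrations
```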
10
Evaluation Platforms
 ARM Cortex-A9x4 on Versatile Express platform (SMP)
 4x ARM Cortex-A9 @ 400 MHz; no GPU, no DVFS, no idle states.
 Base kernel: Linaro vexpress-a9 Android kernel
 File system: Android 2.3
 LinSched for Linux 3.3-rc7
 Scheduler wrapper/simulator
 https://lkml.org/lkml/2012/3/14/590
 Scheduler ftrace output extension.
 Extended to support simple modelling of performance heterogeneous
systems.
11
Bbench on Android
 Browser benchmark
 Renders a new webpage every ~50s using JavaScript.
 Scrolls each page after a fixed delay.
 Two main threads involved:
 WebViewCoreThread: Webkit rendering thread.
 SurfaceFlinger: Android UI rendering thread.
12
vexpress: Vanilla Scheduler
[Trace: all four cores balanced by load_balance, ~wakeups marked; time spent
in idle is roughly equivalent to idle states]
Note: big and little CPUs have equal performance in this setup.
13
vexpress: big.LITTLE optimizations
[Trace: idle switching minimized, deep sleep most of the time, key tasks
mainly on big cores; load_balance within the big and LITTLE domains,
select_task_rq_fair()/forced migration between them, ~wakeups marked]
Note: big and little CPUs have equal performance in this setup.
14
big.LITTLE emulation
 Goal: Slow down selected cores on Versatile Express SMP
platform to emulate big.LITTLE performance heterogeneity.
 How: Abusing perf
 Tool for sampling performance counters.
 Set up to sample every 10000 instructions on the little cores.
 The sampling overhead reduces the perceived performance.
 Details:
perf record -a -e instructions -C 1,3 -c 10000
-o /dev/null sleep 7200
 Experiments determined that a sampling rate of 10000 slows the cores
down by around 50%.
 Very short tasks might not get hit by a perf sample, thus they might
not experience the performance reduction.
15
vexpress+b.L-emu: Vanilla kernel
[Trace: high little residency; load_balance across the emulated big and
LITTLE cores, ~wakeups marked]
Note: Task affinity is more or less random.
This is just one example run.
16
vexpress+b.L-emu: b.L optimizations
[Trace: shorter execution time; key tasks have higher big residency;
frequent short tasks have higher little residency. load_balance within the
big and LITTLE domains, select_task_rq_fair()/forced migration between them,
~wakeups marked]
17
vexpress+b.L-emu: SurfaceFlinger
 Android UI render task
 Total execution time for 20
runs:
 SMP: 4xA9 no slow-down
(upper bound for performance).
 b.L: 2xA9 with perf slow-down
+ 2xA9 without.
 Execution time varies
significantly on b.L vanilla.
 Task affinity is more or less
random.
 The b.L optimizations solve
this issue.
[s]      SMP    b.L van.  b.L opt.
AVG    10.10       12.68     10.27
MIN     9.78       10.27      9.48
MAX    10.54       16.30     10.92
STDEV   0.12        1.24      0.23
18
vexpress+b.L-emu: Page render time
 Web page render times
 WebViewCore start ->
SurfaceFlinger done
 Render #2: Page scroll
 Render #6: Load new page
 b.L optimizations reduce
render time variations.
 Note: No GPU and low CPU
frequency (400 MHz).
[s]         SMP   b.L van.  b.L opt.
#2 AVG     1.45       1.58      1.45
#2 STDEV   0.01       0.11      0.01
#6 AVG     2.58       2.88      2.62
#6 STDEV   0.05       0.24      0.06
19
LinSched Test Case
 Synthetic workload inspired by Bbench processes on Android
 Setup: 2 big + 2 LITTLE
 big CPUs are 2x faster than LITTLE in this model.
 Task definitions:
Task  nice  busy*  sleep*  Description
1+2      0      3      40  Background noise, too short for big
3        0    200     100  CPU intensive, big candidate
4        0    200     120  CPU intensive, big candidate
5       10    200     400  Low priority, CPU intensive
6       10    100     300  Low priority, CPU intensive
7       10    100     250  Low priority, CPU intensive
* [ms]
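The long-run runqueue residency the load tracker converges to for each task is roughly busy/(busy + sleep), which makes the expected big/LITTLE split easy to check by hand. The threshold below is illustrative:

```python
# Rough arithmetic behind the table: with the first-stab policy (residency
# above a threshold and nice <= 0), tasks 3 and 4 qualify for big; tasks 1+2
# are too short (3/43 ~ 7% residency) and tasks 5-7 are too low priority.
tasks = {  # name: (nice, busy_ms, sleep_ms)
    "1+2": (0, 3, 40),
    "3": (0, 200, 100),
    "4": (0, 200, 120),
    "5": (10, 200, 400),
    "6": (10, 100, 300),
    "7": (10, 100, 250),
}

def big_candidates(tasks, threshold=0.5):  # threshold is illustrative
    out = []
    for name, (nice, busy, sleep) in tasks.items():
        if nice <= 0 and busy / (busy + sleep) > threshold:
            out.append(name)
    return out
```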
20
LinSched: Vanilla Linux Scheduler
Processes:
1-2: Background noise tasks
3-4: Important tasks
5-7: Low priority tasks
[Trace: frequent wakeups on big; important tasks marked, ~wakeups marked]
21
LinSched: big.LITTLE optimized sched.
Processes:
1-2: Background noise tasks
3-4: Important tasks
5-7: Low priority tasks
[Trace: important tasks completed faster on big; idle switching minimized,
~wakeups marked]
22
Next: Improve big.LITTLE support
 big.LITTLE sched_domain balancing
 Use all cores, including LITTLE, for heavy multi-threaded workloads.
 Fixes the sysbench CPU benchmark use case.
 Requires appropriate CPU_POWER to be set for each domain.
[Diagram: big cores run active tasks T0-T3 at 100% load while the LITTLE
cores sit idle at 0%]
23
Next: Improve big.LITTLE support
 Per sched domain scheduling policies
 Support for different load-balancing policies for big and LITTLE
domains. For example:
 LITTLE: Spread tasks to minimize frequency.
 big: Consolidate tasks onto as few cores as possible.
[Diagram: big domain consolidates active tasks T2-T5 on one core at 100%
load while the other big core idles at 0%; LITTLE domain spreads T0 and T1
across both cores at 50% load each]
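A toy contrast of the two policies; the uniform capacities and the packing heuristic are simplifying assumptions, not the proposed implementation:

```python
# Spread on LITTLE (keep per-core load low so DVFS can keep frequency down)
# vs. consolidate on big (pack onto few cores so idle big cores can power
# down). Loads are fractions of one core's capacity.

def spread(loads, n_cpus):
    """Place each task on the currently least-loaded CPU."""
    cpus = [[] for _ in range(n_cpus)]
    for load in sorted(loads, reverse=True):
        min(cpus, key=sum).append(load)
    return cpus

def consolidate(loads, n_cpus, capacity=1.0):
    """First-fit pack tasks onto as few CPUs as possible."""
    cpus = [[] for _ in range(n_cpus)]
    for load in sorted(loads, reverse=True):
        target = next((c for c in cpus if sum(c) + load <= capacity), cpus[-1])
        target.append(load)
    return cpus
```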
24
Next: Improve big.LITTLE support
 CPUfreq -> scheduler feedback
 Let the scheduler know about current OPP and max. OPP for each
core to improve load-balancer power awareness.
 This could improve SMP as well.
 Ongoing discussions on LKML about topology/scheduler interface:
 http://lkml.indiana.edu/hypermail/linux/kernel/1205.1/02641.html
 Linaro Connect session: What inputs could the scheduler use?
[Diagram: active tasks T1 and T2 share a LITTLE core at 100% load but only
50% frequency while the big core idles at 0%; increase the LITTLE frequency
instead of migrating to big]
25
Questions/Discussion
26
Backup slides
27
Forced Migration Latency
 Measured on vexpress-a9
 Latency from migration ->
schedule on target
 ~160 us (immediate schedule)
 Much longer if target is
already busy (~10 ms)
[Histogram: forced-migration latency, tasks scheduled immediately vs.
scheduled later]
28
sched_domain configurations
[ 0.364272] CPU0 attaching sched-domain:
[ 0.364306] domain 0: span 0,2 level MC
[ 0.364336] groups: 0 2
[ 0.364380] domain 1: does not load-balance
[ 0.364474] CPU1 attaching sched-domain:
[ 0.364500] domain 0: span 1,3 level MC
[ 0.364526] groups: 1 3
[ 0.364567] domain 1: does not load-balance
[ 0.364611] CPU2 attaching sched-domain:
[ 0.364633] domain 0: span 0,2 level MC
[ 0.364658] groups: 2 0
[ 0.364700] domain 1: does not load-balance
[ 0.364742] CPU3 attaching sched-domain:
[ 0.364764] domain 0: span 1,3 level MC
[ 0.364788] groups: 3 1
[ 0.364829] domain 1: does not load-balance
big.LITTLE optimizations (above) vs. Vanilla (below):
[ 0.372939] CPU0 attaching sched-domain:
[ 0.373014] domain 0: span 0-3 level MC
[ 0.373044] groups: 0 1 2 3
[ 0.373172] CPU1 attaching sched-domain:
[ 0.373196] domain 0: span 0-3 level MC
[ 0.373222] groups: 1 2 3 0
[ 0.373293] CPU2 attaching sched-domain:
[ 0.373313] domain 0: span 0-3 level MC
[ 0.373337] groups: 2 3 0 1
[ 0.373404] CPU3 attaching sched-domain:
[ 0.373423] domain 0: span 0-3 level MC
[ 0.373446] groups: 3 0 1 2
