Using tracing to tune and optimize EAS
Leo Yan & Daniel Thompson
Linaro Support and Solutions Engineering
ENGINEERS AND DEVICES WORKING TOGETHER
Agenda
● Background
○ Review of typical workflow for GTS tuning
○ Introduce a workflow for EAS tuning
○ Quick introduction of the tools that support the new
workflow
● Worked examples
○ Development platform for the worked examples
○ Analyze the task ping-pong issue
○ Analyze a small task staying on a big core
● Further reading
Typical workflow for optimizing GTS
This simple workflow is easy to understand
but has problems in practice.
Tunables are complex and interact with each other (making it hard to decide which tunable to adjust).
Tuning for multiple use-cases is difficult.
Tuning is SoC specific, optimizations will not
necessarily apply to other SoCs.
Workflow loop: Benchmark → Adjust tunables → Benchmark (repeat)
GTS tunables
● GTS: up_threshold, down_threshold, packing_enable, load_avg_period_ms, frequency_invariant_load_scale
● CPUFreq interactive governor: hispeed_freq, go_hispeed_load, target_loads, timer_rate, min_sample_time, above_hispeed_delay
Typical workflow for optimizing EAS systems
Workflow loop: Benchmark → Trace a use-case → Examine traces → Improve decisions (repeat)
Workflow is knowledge intensive.
Decisions can be improved by
improving the power model or by
finding new opportunities in the
scheduler (a.k.a. debugging).
Optimizations are more portable.
● Can be shared for review
● Likely to benefit your new SoC
Trace points for EAS
The kernel has a set of stock trace points for diving into EAS debugging.
Trace points are added by patches marked “DEBUG”; these are not posted to LKML and are currently only found in product-focused patchsets.
Enable the kernel config: CONFIG_FTRACE
Trace points for EAS - cont.
● PELT signals: sched_contrib_scale_f, sched_load_avg_task, sched_load_avg_cpu
● Scheduler default events (tracepoints in the mainline kernel): sched_switch, sched_migrate_task, sched_wakeup, sched_wakeup_new
● SchedFreq: cpufreq_sched_throttled, cpufreq_sched_request_opp, cpufreq_sched_update_capacity
● EAS core: sched_energy_diff, sched_overutilized
● SchedTune: sched_tune_config, sched_boost_cpu, sched_tune_tasks_update, sched_tune_boostgroup_update, sched_boost_task, sched_tune_filter
All but the scheduler default events are tracepoints added by the EAS extension. LISA can be easily extended to support these trace points.
E.g. enable trace points:
trace-cmd start -e sched_energy_diff -e sched_wakeup
Summary
Features: EAS vs GTS
● Decision-making strategy: power modeling (EAS) vs heuristic thresholds (GTS)
● Frequency selection: sched-freq or sched-util, integrated with the scheduler (EAS) vs the governor’s cascaded parameters (GTS)
● Scenario-based tuning: SchedTune (cgroup) for EAS; none for GTS
Energy aware scheduling (EAS) has very few tunables and thus requires a significantly different approach to tuning and optimization when compared to global task scheduling (GTS).
LISA - interactive analysis and testing
● “Distro” of python libraries for interactive analysis and automatic testing
● Library support includes
○ Target control and manipulation (set cpufreq mode, run this workload, initiate trace)
○ Gather power measurement data and calculate energy
○ Analyze and graph trace results
○ Test assertions about the trace results (e.g. big CPU does not run more than 20ms)
● Interactive analysis using ipython and jupyter
○ Provides a notebook framework similar to Maple, Mathematica or Sage
○ Notebooks mix together documentation with executable code fragments
○ Notebooks record the output of an interactive session
○ All permanent file storage is on the host
○ Trace files and graphs can be reexamined in the future without starting the target
● Automatic testing
○ Notebooks containing assertion based tests that can be converted to normal python
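The kind of assertion-based check mentioned above (e.g. “big CPU does not run more than 20ms”) can be sketched with plain python. The residency dict and helper below are illustrative stand-ins for data a LISA notebook would extract from a trace, not the real LISA API.

```python
# Sketch of an assertion-style test like those LISA notebooks encode.
# The residency data and helper are hypothetical, for illustration only.

def assert_max_residency(residency_ms, cpus, limit_ms):
    """Fail if any of the given CPUs was busy for longer than limit_ms."""
    for cpu in cpus:
        busy = residency_ms.get(cpu, 0.0)
        assert busy <= limit_ms, \
            "CPU%d ran %.1fms, limit is %.1fms" % (cpu, busy, limit_ms)

# Toy per-CPU busy-time figures extracted from a trace (milliseconds)
residency_ms = {0: 180.0, 1: 120.5, 4: 12.3, 5: 0.0}

big_cpus = [4, 5, 6, 7]
assert_max_residency(residency_ms, big_cpus, limit_ms=20.0)  # passes
```

A notebook built this way can later be converted to a normal python script for automated regression testing.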
General workflow for LISA
http://events.linuxfoundation.org/sites/events/files/slides/ELC16_LISA_20160326.pdf
LISA interactive test mode
● http://127.0.0.1:8888 opens the ipython notebook file
● Menu & control buttons
● Markdown (headers)
● Execute box containing python code
● Result box: the results of experiments are recorded and shown the next time the file is reopened
kernelshark
Task scheduling
Filters for events, tasks, and CPUs
Details for events
Development platform for the worked examples
● All examples use artificial workloads to provoke a specific behavior
○ It turned out to be quite difficult to deliberately provoke undesired behavior!
● Examples are reproducible on 96Boards HiKey
○ Octo-A53 multi-cluster (2x4) SMP device with five OPPs per cluster
■ Not big.LITTLE, and not using a fast/slow silicon process
○ We are able to fake a fast/slow system by using asymmetric power modeling parameters and artificially reducing the running/runnable delta time for the “fast” CPUs, so the metrics indicate that they have a higher performance
● Most plots shown in these slides are copied from a LISA notebook
○ Notebooks and trace files have been shared for use after training
Testing environment
● CPU capacity info
○ The LITTLE core’s highest capacity is 447@850MHz
○ The big core’s highest capacity is 1024@1.1GHz
○ This case is running with correct power model parameters
● Test case
○ 16 small tasks are running with 15% utilization of little CPU (util ~= 67)
○ A single large task is running with 40% utilization of little CPU (util ~= 180)
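The util figures above follow directly from the capacity numbers, since task utilization is normalized against CPU capacity. A quick sanity check:

```python
# Sanity-check the normalized utilization figures quoted above.
# A task using X% of a LITTLE CPU whose highest capacity is 447
# contributes roughly X% * 447 to its util signal.

LITTLE_CAPACITY = 447

small_task_util = round(0.15 * LITTLE_CAPACITY)  # 16 small tasks at 15%
large_task_util = round(0.40 * LITTLE_CAPACITY)  # one large task at 40%

print(small_task_util)  # 67
print(large_task_util)  # 179 (the slides round this to ~180)
```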
General analysis steps
Step 1: Run workload and generate trace data
● LISA::TestEnv(): connect with the target (using platform.json, the platform description file)
● LISA::rtapp(): generate the workload
Step 2: Analyze trace data
● LISA::Trace(): analyze events
● LISA::Filters(): filter out tasks
● LISA::TraceAnalysis(): analyze tasks
Connect with target board
Specify target board info for
connection
Calibration for CPUs
Tools copied to target board
Enable ftrace events
Create connection
Generate and execute workload
Define workload
Capture Ftrace data
Execute workload
Capture energy data
Graph showing task placement in LISA
Specify events to be extracted
Specify time interval
Display task placement graph
First make a quick graph showing task placement...
There are too many tasks; we need a method to quickly filter out statistics for each task.
… and decide how to tackle step 2 analysis
Step 1: Run workload and generate trace data
● LISA::TestEnv(): connect with the target (using platform.json, the platform description file)
● LISA::rtapp(): generate the workload
Step 2: Analyze trace data
● LISA::Trace(): analyze events
● LISA::Filters(): filter out tasks
● LISA::TraceAnalysis(): analyze tasks
Analyze trace data for events
events_to_parse = [
    "sched_switch",
    "sched_wakeup",
    "sched_wakeup_new",
    "sched_contrib_scale_f",
    "sched_load_avg_cpu",
    "sched_load_avg_task",
    "sched_tune_config",
    "sched_tune_tasks_update",
    "sched_tune_boostgroup_update",
    "sched_tune_filter",
    "sched_boost_cpu",
    "sched_boost_task",
    "sched_energy_diff",
    "cpu_frequency",
    "cpu_capacity",
]
platform.json (truncated; the real file contains further fields):
{
    "clusters": {
        "big": [4, 5, 6, 7],
        "little": [0, 1, 2, 3]
    },
    "cpus_count": 8,
    "freqs": {
        "big": [208000, 432000, 729000, 960000, 1200000],
        "little": [208000, 432000, 729000, 960000, 1200000]
    },
    [...]
}
trace.dat format: “SYSTRACE” or “Ftrace”
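Because platform.json is plain JSON, it can be inspected with nothing more than the standard json module. The fragment below embeds a copy of the description shown above (the real file contains further fields elided here):

```python
import json

# Minimal copy of the platform.json fragment shown on this slide;
# fields elided with [...] in the slide are omitted here.
platform_json = """
{
  "clusters": {"big": [4, 5, 6, 7], "little": [0, 1, 2, 3]},
  "cpus_count": 8,
  "freqs": {
    "big": [208000, 432000, 729000, 960000, 1200000],
    "little": [208000, 432000, 729000, 960000, 1200000]
  }
}
"""

platform = json.loads(platform_json)
print(platform["clusters"]["big"])       # [4, 5, 6, 7]
print(len(platform["freqs"]["little"]))  # 5 OPPs per cluster
```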
Selecting only tasks of interest (big tasks)
top_big_tasks
{'mmcqd/0': 733,
'Task01' : 2441,
'task010': 2442,
'task011': 2443,
'task012': 2444,
'task015': 2447,
'task016': 2448,
'task017': 2449,
'task019': 2451,
'task020': 2452,
'task021': 2453,
'task022': 2454,
'task023': 2455,
'task024': 2456,
'task025': 2457}
Plot big tasks with TraceAnalysis
TraceAnalysis graph of task residency on CPUs
1: At the beginning the task is placed on a big core
2: Then it ping-pongs between big cores and LITTLE cores
TraceAnalysis graph of task PELT signals
Big core’s highest capacity
LITTLE core’s highest capacity
Big core’s tipping point
LITTLE core’s tipping point
util_avg = PELT(running time)
load_avg = PELT(running time + runnable time) * weight
         = PELT(running time + runnable time)   (if NICE = 0)
The difference between load_avg and util_avg is the task’s runnable time on the rq (for NICE = 0).
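PELT is a geometrically decayed average: each period’s contribution is halved after 32 periods. A heavily simplified continuous sketch of how util_avg converges for a periodic task (the kernel uses fixed-point arithmetic over 1024us windows; the helper below is an illustrative approximation, not the kernel algorithm):

```python
# Simplified sketch of a PELT-style geometric average. Each 1ms period
# decays accumulated history by y, where y^32 = 0.5, and adds the
# current period's running contribution. Illustrative only.

Y = 0.5 ** (1 / 32.0)   # decay factor per period
MAX_UTIL = 1024         # a fully busy CPU saturates toward 1024

def pelt_track(duty_cycle, periods):
    """Track util for a task running duty_cycle of every period."""
    util = 0.0
    for _ in range(periods):
        util = util * Y + duty_cycle * MAX_UTIL * (1 - Y)
    return util

# A 50% duty-cycle task converges toward ~512 after a few hundred ms
print(round(pelt_track(0.5, 500)))  # ~512
```

The fixed point of the recurrence is duty_cycle * 1024, which is why a task’s util_avg tracks the fraction of time it spends running.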
System crosses the tipping point into “over-utilized”:

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    [...]
    if (!se) {
        add_nr_running(rq, 1);
        if (!task_new && !rq->rd->overutilized &&
            cpu_overutilized(rq->cpu))
            rq->rd->overutilized = true;   /* over tipping point */
    [...]
    }
}

The wake-up path chooses between the EAS path and SMP load balance:

static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                               int sd_flag, int wake_flags)
{
    [...]
    if (!sd) {
        if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
            new_cpu = energy_aware_wake_cpu(p, prev_cpu);  /* EAS path */
        else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
            new_cpu = select_idle_sibling(p, new_cpu);     /* SMP load balance */
    } else while (sd) {
        [...]
    }
}

Load balance also checks the tipping point:

static struct sched_group *find_busiest_group(struct lb_env *env)
{
    if (energy_aware() && !env->dst_rq->rd->overutilized)
        goto out_balanced;   /* EAS path */
    [...]                    /* SMP load balance */
}
Write a function to analyze the tipping point
If the LISA toolkit does not include the plotting function you need, you can write a plot function yourself.
Plot for tipping point
System is over tipping point, migrate task
from CPU3 (little core) to CPU4 (big core)
System is under tipping point, migrate task
from CPU4 (big core) to CPU3 (little core)
Detailed trace log for migration to big core
nohz_idle_balance() for task migration
Migrate big task from CPU3 to CPU4
Issue 1: migrating big task back to LITTLE core
Migrate task to LITTLE core
Migrate big task from CPU4 to CPU3
Issue 2: migrating small tasks to big core
Migrate small tasks to big core
CPU is overutilized again
Tipping point criteria
Over tipping point: ANY CPU has cpu_util(cpu) > cpu_capacity(cpu) * 80%
● E.g. LITTLE core: cpu_capacity(cpu0) = 447, util at 90% of capacity is over
● E.g. big core: cpu_capacity(cpu4) = 1024, util at 90% of capacity is over
Under tipping point: ALL CPUs have cpu_util(cpu) < cpu_capacity(cpu) * 80%
● E.g. LITTLE core: cpu_capacity(cpu0) = 447, util at 70% of capacity is under
● E.g. big core: cpu_capacity(cpu4) = 1024, util at 70% of capacity is under
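The criteria above reduce to a simple predicate over (util, capacity) pairs. A sketch with the capacities from the worked example (the real kernel expresses the 80% margin in fixed-point arithmetic; this is an illustration of the rule, not the kernel code):

```python
# Sketch of the 80% tipping-point test described on this slide.
CAPACITY_MARGIN = 0.8

def cpu_overutilized(util, capacity):
    return util > capacity * CAPACITY_MARGIN

def system_over_tipping_point(cpu_states):
    """cpu_states: list of (util, capacity); ANY over-utilized CPU tips it."""
    return any(cpu_overutilized(u, c) for u, c in cpu_states)

# LITTLE capacity 447, big capacity 1024 (as in the worked example)
assert cpu_overutilized(0.9 * 447, 447)        # 90% of LITTLE: over
assert not cpu_overutilized(0.7 * 1024, 1024)  # 70% of big: under
assert system_over_tipping_point([(100, 447), (0.9 * 1024, 1024)])
```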
Phenomenon for ping-pong issue
When the system is over the tipping point, Task1 is migrated from the LITTLE cluster to the big cluster; once the system is back under the tipping point, Task1 is migrated from the big cluster back to the LITTLE cluster, and the cycle repeats.
Fixes for ping-pong issue
● Filter out small tasks to avoid migrating them to a big core when the system goes over the tipping point
● Avoid migrating the big task back to the LITTLE cluster when the system drops back under the tipping point
Fallback to LITTLE cluster after it is idle
Once the system is under the tipping point and the LITTLE cluster is idle, migrate the big task back to the LITTLE cluster.
Filter out small tasks for (tick, idle) load balance
static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
    [...]
    if (energy_aware() &&
        (capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu))) {
        if (task_util(p) * 4 < capacity_orig_of(env->src_cpu))
            return 0;
    }
    [...]
}

Filter out small tasks: task utilization < ¼ of the LITTLE CPU capacity. These tasks will NOT be migrated to a big core after “return 0”. Result: only big tasks have a chance to migrate to a big core.
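The filter condition can be checked against the worked example’s numbers. A sketch mirroring the `task_util(p) * 4 < capacity_orig_of(...)` test (the capacities are those from the faked HiKey model):

```python
# Sketch of the small-task filter from can_migrate_task() above:
# tasks whose utilization is below 1/4 of the source CPU's capacity
# are kept off the big cluster during (tick, idle) load balance.

LITTLE_CAPACITY = 447

def is_small_task(task_util, src_capacity=LITTLE_CAPACITY):
    # mirrors: task_util(p) * 4 < capacity_orig_of(env->src_cpu)
    return task_util * 4 < src_capacity

# With the worked example's utilizations:
assert is_small_task(67)       # 15% tasks (util ~67) stay on LITTLE
assert not is_small_task(180)  # the 40% task (util ~180) may migrate
```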
Avoid migrating big task to LITTLE cluster
static bool need_spread_task(int cpu)
{
    struct sched_domain *sd;
    int spread = 0, i;

    if (cpu_rq(cpu)->rd->overutilized)
        return 1;

    sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
    if (!sd)
        return 0;

    for_each_cpu(i, sched_domain_span(sd)) {
        if (cpu_rq(i)->cfs.h_nr_running >= 1 &&
            cpu_halfutilized(i)) {
            spread = 1;
            break;
        }
    }

    return spread;
}

static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                               int sd_flag, int wake_flags)
{
    [...]
    if (!sd) {
        if (energy_aware() &&
            (!need_spread_task(cpu) || need_filter_task(p)))
            new_cpu = energy_aware_wake_cpu(p, prev_cpu);
        else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
            new_cpu = select_idle_sibling(p, new_cpu);
    } else while (sd) {
        [...]
    }
}

Check whether the cluster is busy as well as checking the system tipping point:
● Easier to spread tasks within the cluster if the cluster is busy
● Fall back to migrating the big task when the cluster is idle
Filter out small tasks for wake-up balance

static bool need_filter_task(struct task_struct *p)
{
    int cpu = task_cpu(p);
    int origin_max_cap = capacity_orig_of(cpu);
    int target_max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val;
    struct sched_domain *sd;
    struct sched_group *sg;

    sd = rcu_dereference(per_cpu(sd_ea, cpu));
    sg = sd->groups;

    do {
        int first_cpu = group_first_cpu(sg);

        if (capacity_orig_of(first_cpu) < target_max_cap &&
            task_util(p) * 4 < capacity_orig_of(first_cpu))
            target_max_cap = capacity_orig_of(first_cpu);
    } while (sg = sg->next, sg != sd->groups);

    if (target_max_cap < origin_max_cap)
        return 1;

    return 0;
}

Two purposes of this function:
● Select small tasks (task utilization < ¼ of the LITTLE CPU capacity) and keep them on the energy aware path
● Prevent the energy aware path for big tasks on the big core from doing harm to little tasks
Results after applying patches
The big task always runs on CPU6 and the small tasks run on LITTLE cores!
Testing environment
● Testing environment
○ The LITTLE core’s highest capacity is 447@850MHz
○ The big core’s highest capacity is 1024@1.1GHz
○ A single small task is running with 9% utilization of the big CPU (util ~= 95)
● Phenomenon
○ The single small task runs on the big CPU for a long time, even though its utilization is well below the tipping point
Global view of task placement
The small task runs on a big core for about 3s; during this period the system is not busy.
Analyze task utilization
Filter only related tasks
Analyze task’s utilization signal
PELT Signals for task utilization
The task utilization is normalized to ~95 on the big core. This does not exceed the LITTLE core’s tipping point of 447 * 80% = 358, so the LITTLE core can meet the task’s capacity requirement and the scheduler should place this task on a LITTLE core.
Use kernelshark to check wake up path
In the energy aware path we would expect to see “sched_boost_task”, but in this case the event is missing, implying the scheduler performed normal load balancing because the “overutilized” flag is set. Thus the balancer runs to select an idle CPU in the lowest scheduling domain; if the previous CPU is idle the task will stick to it so it can benefit from a “hot cache”.
The “tipping point” has been set for a long time
static inline void update_sg_lb_stats(struct lb_env *env,
            struct sched_group *group, int load_idx,
            int local_group, struct sg_lb_stats *sgs,
            bool *overload, bool *overutilized)
{
    unsigned long load;
    int i, nr_running;

    memset(sgs, 0, sizeof(*sgs));

    for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
        [...]
        if (cpu_overutilized(i)) {
            *overutilized = true;
            if (!sgs->group_misfit_task && rq->misfit_task)
                sgs->group_misfit_task = capacity_of(i);
        }
        [...]
    }
}

*overutilized is initialized to ‘false’ before we commence the update, so if any CPU is over-utilized, that is enough to keep us over the tipping point. So we need to analyze the load of every CPU.
Plot for CPU utilization and idle state
CPU utilization does not update during idle
CPU utilization is only updated when the CPU is woken up after a long time
Fix Method: ignore overutilized state for idle CPUs
static inline void update_sg_lb_stats(struct lb_env *env,
            struct sched_group *group, int load_idx,
            int local_group, struct sg_lb_stats *sgs,
            bool *overload, bool *overutilized)
{
    unsigned long load;
    int i, nr_running;

    memset(sgs, 0, sizeof(*sgs));

    for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
        [...]
        if (cpu_overutilized(i) && !idle_cpu(i)) {
            *overutilized = true;
            if (!sgs->group_misfit_task && rq->misfit_task)
                sgs->group_misfit_task = capacity_of(i);
        }
        [...]
    }
}
Code flow is altered so
we only consider the
overutilized state for
non-idle CPUs
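The effect of the fix is that a stale utilization figure on an idle CPU can no longer keep the system flagged as over-utilized. A sketch of before and after, using illustrative (util, capacity, is_idle) tuples:

```python
# Sketch of the fixed tipping-point scan: an idle CPU's stale
# utilization is ignored, so only busy CPUs can set "overutilized".
# The tuple format (util, capacity, is_idle) is illustrative only.

CAPACITY_MARGIN = 0.8

def system_overutilized(cpus):
    return any(util > cap * CAPACITY_MARGIN and not is_idle
               for util, cap, is_idle in cpus)

cpus = [(900, 1024, True),   # idle big CPU with stale (high) utilization
        (95, 1024, False)]   # the small task actually running

assert not system_overutilized(cpus)  # idle CPU no longer tips the system
```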
After applying the patch to fix this...
Related materials
● Notebooks and related materials for both worked examples
○ https://fileserver.linaro.org/owncloud/index.php/s/5gpVpzN0FdxMmGl
○ ipython notebooks for workload generation and analysis
○ Trace data before and after fixing together with platform.json
● Patches are under discussion on eas-dev mailing list
○ sched/fair: support to spread task in lowest schedule domain
○ sched/fair: avoid small task to migrate to higher capacity CPU
○ sched/fair: filter task for energy aware path
○ sched/fair: consider over utilized only for CPU is not idle
Next steps
● You can debug the scheduler
○ Try to focus on decision making, not hacks
○ New decisions should be as generic as possible (ideally based on normalized units)
○ Sharing resulting patches for review is highly recommended
■ Perhaps the fix can be improved, or is already expressed differently by someone else
● Understanding tracepoint patches and the tooling from ARM
○ Basic python coding experience is needed to utilize LISA libraries
● Understanding SchedTune
○ SchedTune biases the task utilization levels used for CPU selection and the CPU utilization levels used for CPU and OPP selection decisions
○ Evaluate the energy-performance trade-off
○ Without tools, it’s hard to define and debug the SchedTune boost margin on a specific platform
Thank You
#LAS16
For further information: www.linaro.org or support@linaro.org
LAS16 keynotes and videos on: connect.linaro.org
  • 7. ENGINEERS AND DEVICES WORKING TOGETHER Trace points for EAS The kernel has a set of stock trace points for diving into debugging. Trace points are added by patches marked “DEBUG”; these are not posted to LKML and are currently only found in product-focused patchsets. Enable the kernel config option: CONFIG_FTRACE
  • 8. ENGINEERS AND DEVICES WORKING TOGETHER Trace points for EAS - cont. sched_contrib_scale_f sched_load_avg_task sched_load_avg_cpu PELT signals EAS core SchedTune sched_switch sched_migrate_task sched_wakeup sched_wakeup_new Scheduler default events SchedFreq LISA can be easily extended to support these trace points cpufreq_sched_throttled cpufreq_sched_request_opp cpufreq_sched_update_capacity sched_energy_diff sched_overutilized sched_tune_config sched_boost_cpu sched_tune_tasks_update sched_tune_boostgroup_update sched_boost_task sched_tune_filter Tracepoints in mainline kernel Tracepoints for EAS extension E.g. enable trace points: trace-cmd start -e sched_energy_diff -e sched_wakeup
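When enabling many of the tracepoints listed above at once, the trace-cmd invocation shown on the slide can be built programmatically; a small illustrative sketch (the helper is a convenience, not part of trace-cmd or LISA):

```python
# Build a trace-cmd command line that enables a list of ftrace events.
# The event names are the EAS tracepoints listed on this slide.
def trace_cmd_start(events):
    return "trace-cmd start " + " ".join("-e " + e for e in events)

cmd = trace_cmd_start(["sched_energy_diff", "sched_wakeup"])
# → "trace-cmd start -e sched_energy_diff -e sched_wakeup"
```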
  • 9. ENGINEERS AND DEVICES WORKING TOGETHER Summary — Feature | EAS | GTS: Decision-making strategy | Power modeling | Heuristic thresholds; Frequency selection | Sched-freq or sched-util, integrated with the scheduler | Governor’s cascaded parameters; Scenario-based tuning | SchedTune (cgroup) | None. Energy-aware scheduling (EAS) has very few tunables and thus requires a significantly different approach to tuning and optimization when compared to global task scheduling (GTS).
  • 10. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 11. ENGINEERS AND DEVICES WORKING TOGETHER LISA - interactive analysis and testing ● “Distro” of python libraries for interactive analysis and automatic testing ● Library support includes ○ Target control and manipulation (set cpufreq mode, run this workload, initiate trace) ○ Gather power measurement data and calculate energy ○ Analyze and graph trace results ○ Test assertions about the trace results (e.g. big CPU does not run more than 20ms) ● Interactive analysis using ipython and jupyter ○ Provides a notebook framework similar to Maple, Mathematica or Sage ○ Notebooks mix together documentation with executable code fragments ○ Notebooks record the output of an interactive session ○ All permanent file storage is on the host ○ Trace files and graphs can be reexamined in the future without starting the target ● Automatic testing ○ Notebooks containing assertion based tests that can be converted to normal python
  • 12. ENGINEERS AND DEVICES WORKING TOGETHER General workflow for LISA http://events.linuxfoundation.org/sites/events/files/slides/ELC16_LISA_20160326.pdf
  • 13. ENGINEERS AND DEVICES WORKING TOGETHER LISA interactive test mode http://127.0.0.1:8888 with ipython file Menu & control buttons Markdown (headers) Execute box containing python code Result box; recorded experiment results are shown again the next time this file is reopened
  • 14. ENGINEERS AND DEVICES WORKING TOGETHER kernelshark Task scheduling Filters for events, tasks, and CPUs Details for events
  • 15. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 16. ENGINEERS AND DEVICES WORKING TOGETHER Development platform for the worked examples ● All examples use artificial workloads to provoke a specific behaviour ○ It turned out to be quite difficult to deliberately provoke undesired behavior! ● Examples are reproducible on 96Boards HiKey ○ Octo-A53 multi-cluster (2x4) SMP device with five OPPs per cluster ■ Not big.LITTLE, and not using a fast/slow silicon process ○ We are able to fake a fast/slow system by using asymmetric power modeling parameters and artificially reducing the running/runnable delta time for the “fast” CPUs so the metrics indicate that they have higher performance ● Most plots shown in these slides are copied from a LISA notebook ○ Notebooks and trace files have been shared for use after training
  • 17. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 18. ENGINEERS AND DEVICES WORKING TOGETHER Testing environment ● CPU capacity info ○ The little core’s highest capacity is 447@850MHz ○ The big core’s highest capacity is 1024@1.1GHz ○ This case is running with correct power model parameters ● Test case ○ 16 small tasks are running with 15% utilization of little CPU (util ~= 67) ○ A single large task is running with 40% utilization of little CPU (util ~= 180)
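The utilization figures above follow from the little core's capacity of 447: a task busy for a given fraction of the time settles at roughly that fraction of the CPU's capacity in PELT units. A quick arithmetic check (the helper name is illustrative):

```python
LITTLE_CAPACITY = 447  # little core capacity at its highest OPP (850MHz)

def util_from_pct(pct, capacity=LITTLE_CAPACITY):
    # A task busy pct% of the time on a CPU of this capacity settles at
    # roughly pct% of that capacity in PELT utilization units.
    return pct / 100.0 * capacity

small = util_from_pct(15)   # ≈ 67, matching "util ~= 67" above
large = util_from_pct(40)   # ≈ 179, matching "util ~= 180" above
```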
  • 19. ENGINEERS AND DEVICES WORKING TOGETHER General analysis steps LISA:: Trace() LISA:: Filters() Generate workload Filter out tasks LISA:: TraceAnalysis() Analyze tasks LISA:: TestEnv() Connect with target Analyze events LISA:: rtapp() Step 1: Run workload and generate trace data Step 2: Analyze trace data platform.json Platform description file
  • 20. ENGINEERS AND DEVICES WORKING TOGETHER Connect with target board Specify target board info for connection Calibration for CPUs Tools copied to target board Enable ftrace events Create connection
  • 21. ENGINEERS AND DEVICES WORKING TOGETHER Generate and execute workload Define workload Capture Ftrace data Execute workload Capture energy data
  • 22. ENGINEERS AND DEVICES WORKING TOGETHER Graph showing task placement in LISA Specify events to be extracted Specify time interval Display task placement graph
  • 23. ENGINEERS AND DEVICES WORKING TOGETHER First make a quick graph showing task placement... Too many tasks; we need a method to quickly filter out statistics for each task.
  • 24. ENGINEERS AND DEVICES WORKING TOGETHER … and decide how to tackle step 2 analysis LISA:: Trace() LISA:: Filters() Generate workload Filter out tasks LISA:: TraceAnalysis() Analyze tasks LISA:: TestEnv() Connect with target Analyze events LISA:: rtapp() Step 1: Run workload and generate trace data Step 2: Analyze trace data platform.json Platform description file
  • 25. ENGINEERS AND DEVICES WORKING TOGETHER Analyze trace data for events events_to_parse = [ "sched_switch", "sched_wakeup", "sched_wakeup_new", "sched_contrib_scale_f", "sched_load_avg_cpu", "sched_load_avg_task", "sched_tune_config", "sched_tune_tasks_update", "sched_tune_boostgroup_update", "sched_tune_filter", "sched_boost_cpu", "sched_boost_task", "sched_energy_diff", "cpu_frequency", "cpu_capacity" ] platform.json { "clusters": { "big": [ 4, 5, 6, 7 ], "little": [ 0, 1, 2, 3 ] }, "cpus_count": 8, "freqs": { "big": [ 208000, 432000, 729000, 960000, 1200000 ], "little": [ 208000, 432000, 729000, 960000, 1200000 ] }, [...] } trace.dat (format: “SYSTRACE” or “Ftrace”)
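The platform.json fragment above is plain JSON and can be consumed directly; a sketch that inverts the clusters map so trace rows keyed by CPU number can be labelled with their cluster name (field names are copied from the file shown, the inversion itself is illustrative):

```python
import json

# Abbreviated copy of the platform.json shown on this slide.
PLATFORM_JSON = """{
  "clusters": {"big": [4, 5, 6, 7], "little": [0, 1, 2, 3]},
  "cpus_count": 8,
  "freqs": {"big": [208000, 432000, 729000, 960000, 1200000],
            "little": [208000, 432000, 729000, 960000, 1200000]}
}"""

platform = json.loads(PLATFORM_JSON)

# Invert the clusters map: cpu number -> cluster name.
cpu_to_cluster = {cpu: name
                  for name, cpus in platform["clusters"].items()
                  for cpu in cpus}
```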
  • 26. ENGINEERS AND DEVICES WORKING TOGETHER Selecting only tasks of interest (big tasks) top_big_tasks {'mmcqd/0': 733, 'Task01' : 2441, 'task010': 2442, 'task011': 2443, 'task012': 2444, 'task015': 2447, 'task016': 2448, 'task017': 2449, 'task019': 2451, 'task020': 2452, 'task021': 2453, 'task022': 2454, 'task023': 2455, 'task024': 2456, 'task025': 2457}
  • 27. ENGINEERS AND DEVICES WORKING TOGETHER Plot big tasks with TraceAnalysis
  • 28. ENGINEERS AND DEVICES WORKING TOGETHER TraceAnalysis graph of task residency on CPUs At beginning task is placed on big core1 Then it ping-pongs between big cores and LITTLE cores 2
  • 29. ENGINEERS AND DEVICES WORKING TOGETHER TraceAnalysis graph of task PELT signals Big core’s highest capacity LITTLE core’s highest capacity Big core’s tipping point LITTLE core’s tipping point util_avg = PELT(running time) load_avg = PELT(running time + runnable time) * weight = PELT(running time + runnable time) (if NICE = 0) The difference between load_avg and util_avg is task’s runnable time on rq (for NICE=0)
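The util_avg relationship above can be illustrated with a toy PELT model: each 1 ms window's contribution decays geometrically with a 32 ms half-life, and the signal is scaled to 1024. This is a floating-point sketch of the principle, not the kernel's fixed-point implementation:

```python
HALF_LIFE_MS = 32
Y = 0.5 ** (1.0 / HALF_LIFE_MS)   # per-millisecond decay factor, y^32 = 0.5
SCALE = 1024

def pelt(busy_fractions):
    """Fold per-millisecond running fractions (oldest first) into a
    PELT-style geometrically decayed utilization signal."""
    signal = 0.0
    for frac in busy_fractions:
        signal = signal * Y + frac * SCALE * (1.0 - Y)
    return signal

# A task running 50% of every millisecond converges towards ~512,
# i.e. half of the 1024 scale.
steady = pelt([0.5] * 1000)
```

load_avg differs only in also accumulating runnable (waiting) time, which is why the gap between the two signals measures time spent runnable on the rq.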
  • 30. ENGINEERS AND DEVICES WORKING TOGETHER static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { [...] if (!sd) { if (energy_aware() && !cpu_rq(cpu)->rd->overutilized) new_cpu = energy_aware_wake_cpu(p, prev_cpu); else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ new_cpu = select_idle_sibling(p, new_cpu); } else while (sd) { [...] } } System cross tipping point for “over-utilized” static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { [...] if (!se) { add_nr_running(rq, 1); if (!task_new && !rq->rd->overutilized && cpu_overutilized(rq->cpu)) rq->rd->overutilized = true; [...] } } Over tipping point EAS path SMP load balance static struct sched_group *find_busiest_group(struct lb_env *env) { if (energy_aware() && !env->dst_rq->rd->overutilized) goto out_balanced; [...] } EAS path SMP load balance
  • 31. ENGINEERS AND DEVICES WORKING TOGETHER Write a function to analyze the tipping point If the LISA toolkit does not include the plotting function you need, you can write a plot function yourself
  • 32. ENGINEERS AND DEVICES WORKING TOGETHER Plot for tipping point System is over tipping point, migrate task from CPU3 (little core) to CPU4 (big core) System is under tipping point, migrate task from CPU4 (big core) to CPU3 (little core)
  • 33. ENGINEERS AND DEVICES WORKING TOGETHER Detailed trace log for migration to big core nohz_idle_balance() for tasks migration Migrate big task from CPU3 to CPU4
  • 34. ENGINEERS AND DEVICES WORKING TOGETHER Issue 1: big task migrated back to LITTLE core Migrate task to LITTLE core Migrate big task from CPU4 to CPU3
  • 35. ENGINEERS AND DEVICES WORKING TOGETHER Issue 2: small tasks migrated to big core Migrate small tasks to big core CPU is overutilized again
  • 36. ENGINEERS AND DEVICES WORKING TOGETHER Tipping point criteria Over tipping point Util: 90% Any CPU: cpu_util(cpu) > cpu_capacity(cpu) * 80% Under tipping point 80% of capacity E.g. LITTLE core: cpu_capacity(cpu0) = 447 ALL CPUs: cpu_util(cpu) < cpu_capacity(cpu) * 80% Util: 90% 80% of capacity E.g. Big core: cpu_capacity(cpu4) = 1024 Util: 70% 80% of capacity E.g. LITTLE core: cpu_capacity(cpu0) = 447 Util: 70% 80% of capacity E.g. Big core: cpu_capacity(cpu4) = 1024
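The criteria above reduce to a simple predicate: the system is over the tipping point when any one CPU's utilization exceeds 80% of its capacity, and under it only when all CPUs are below that margin. A sketch using the capacities from this platform:

```python
# Per-CPU capacities for the faked fast/slow HiKey used in these slides.
CAPACITY = {0: 447, 1: 447, 2: 447, 3: 447,
            4: 1024, 5: 1024, 6: 1024, 7: 1024}
MARGIN = 0.8  # the 80% capacity margin from the slide

def cpu_overutilized(cpu, util):
    return util > CAPACITY[cpu] * MARGIN

def system_over_tipping_point(cpu_utils):
    # Any single over-utilized CPU tips the whole system over.
    return any(cpu_overutilized(cpu, u) for cpu, u in cpu_utils.items())

# A LITTLE core at 90% utilization (447 * 0.9 ≈ 402 > 357.6) tips the
# system; a big core at 70% (1024 * 0.7 ≈ 717 < 819.2) does not.
```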
  • 37. ENGINEERS AND DEVICES WORKING TOGETHER Phenomenon for ping-pong issue LITTLE cluster Over tipping point Under tipping point big cluster LITTLE cluster big cluster Task1 Task1 Task1 Task1 Migration
  • 38. ENGINEERS AND DEVICES WORKING TOGETHER Fixes for ping-pong issue LITTLE cluster Over tipping point Under tipping point big cluster LITTLE cluster big cluster Task1 Task1 Task1 Migration Filter out small tasks to avoid migrating them to the big core Avoid migrating big task back to LITTLE cluster
  • 39. ENGINEERS AND DEVICES WORKING TOGETHER Fallback to LITTLE cluster after it is idle LITTLE cluster Over tipping point Under tipping point big cluster LITTLE cluster big cluster Task1 Task1 Task1 Migration Migrate big task back to LITTLE cluster if it’s idle Task1
  • 40. ENGINEERS AND DEVICES WORKING TOGETHER Filter out small tasks for (tick, idle) load balance static int can_migrate_task(struct task_struct *p, struct lb_env *env) { [...] if (energy_aware() && (capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu))) { if (task_util(p) * 4 < capacity_orig_of(env->src_cpu)) return 0; } [...] } Filter out small tasks: task running time < ¼ LITTLE CPU capacity. Returning 0 means these tasks will NOT be migrated to the big core. Result: only big tasks have a chance to migrate to the big core.
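The ¼-capacity filter in can_migrate_task() above can be checked in isolation: with the workload from this example, the util ≈ 67 small tasks are filtered while the util ≈ 180 big task remains migratable. A direct transcription of the condition (illustrative, not kernel code):

```python
def is_small_task(task_util, src_cpu_capacity):
    # Mirrors the kernel condition: a task is "small" when four times
    # its utilization is still below the source CPU's capacity.
    return task_util * 4 < src_cpu_capacity

LITTLE_CAPACITY = 447
# Small tasks (util ~= 67) are filtered:  67 * 4 = 268 < 447.
# The big task (util ~= 180) may migrate: 180 * 4 = 720 >= 447.
```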
  • 41. ENGINEERS AND DEVICES WORKING TOGETHER Avoid migrating big task to LITTLE cluster static bool need_spread_task(int cpu) { struct sched_domain *sd; int spread = 0, i; if (cpu_rq(cpu)->rd->overutilized) return 1; sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); if (!sd) return 0; for_each_cpu(i, sched_domain_span(sd)) { if (cpu_rq(i)->cfs.h_nr_running >= 1 && cpu_halfutilized(i)) { spread = 1; break; } } return spread; } static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { [...] if (!sd) { if (energy_aware() && (!need_spread_task(cpu) || need_filter_task(p))) new_cpu = energy_aware_wake_cpu(p, prev_cpu); else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ new_cpu = select_idle_sibling(p, new_cpu); } else while (sd) { [...] } } Check if cluster is busy or not as well as checking system tipping point: ● Easier to spread tasks within cluster if cluster is busy ● Fallback to migrating big task when cluster is idle
  • 42. ENGINEERS AND DEVICES WORKING TOGETHER static bool need_filter_task(struct task_struct *p) { int cpu = task_cpu(p); int origin_max_cap = capacity_orig_of(cpu); int target_max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val; struct sched_domain *sd; struct sched_group *sg; sd = rcu_dereference(per_cpu(sd_ea, cpu)); sg = sd->groups; do { int first_cpu = group_first_cpu(sg); if (capacity_orig_of(first_cpu) < target_max_cap && task_util(p) * 4 < capacity_orig_of(first_cpu)) target_max_cap = capacity_orig_of(first_cpu); } while (sg = sg->next, sg != sd->groups); if (target_max_cap < origin_max_cap) return 1; return 0; } Filter out small tasks for wake up balance Two purposes of this function: ● Select small tasks (task running time < ¼ LITTLE CPU capacity) and keep them on the energy aware path ● Prevent energy aware path for big tasks on the big core from doing harm to little tasks.
  • 43. ENGINEERS AND DEVICES WORKING TOGETHER Results after applying patches The big task always runs on CPU6 and the small tasks run on LITTLE cores!
  • 44. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 45. ENGINEERS AND DEVICES WORKING TOGETHER Testing environment ● Testing environment ○ The LITTLE core’s highest capacity is 447@850MHz ○ The big core’s highest capacity is 1024@1.1GHz ○ A single small task is running with 9% utilization of the big CPU (util ~= 95) ● Phenomenon ○ The single small task runs on the big CPU for a long time, even though its utilization is well below the tipping point
  • 46. ENGINEERS AND DEVICES WORKING TOGETHER Global view of task placement The small task runs on a big core for about 3s; during this period the system is not busy
  • 47. ENGINEERS AND DEVICES WORKING TOGETHER Analyze task utilization Filter only related tasks Analyze task’s utilization signal
  • 48. ENGINEERS AND DEVICES WORKING TOGETHER PELT signals for task utilization The task utilization is normalized to a value of ~95 on the big core; this does not exceed the LITTLE core’s tipping point of 447 * 80% = 358. Thus the LITTLE core can meet the task’s capacity requirement, so the scheduler should place this task on a LITTLE core.
  • 49. ENGINEERS AND DEVICES WORKING TOGETHER Use kernelshark to check the wake-up path On the energy-aware path we would expect to see “sched_boost_task”, but in this case the event is missing, implying the scheduler performed normal load balancing because the “overutilized” flag is set. The balancer therefore selects an idle CPU in the lowest scheduling domain; if the previous CPU is idle, the task sticks to it so it can benefit from a “hot cache”.
  • 50. ENGINEERS AND DEVICES WORKING TOGETHER The “tipping point” has been set for a long time static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, int local_group, struct sg_lb_stats *sgs, bool *overload, bool *overutilized) { unsigned long load; int i, nr_running; memset(sgs, 0, sizeof(*sgs)); for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { [...] if (cpu_overutilized(i)) { *overutilized = true; if (!sgs->group_misfit_task && rq->misfit_task) sgs->group_misfit_task = capacity_of(i); } [...] } } *overutilized is initialized as ‘false’ before we commence the update, so if any CPU is over-utilized, then this is enough to keep us over the tipping point. So we need to analyze the load of every CPU.
  • 51. ENGINEERS AND DEVICES WORKING TOGETHER Plot for CPU utilization and idle state
  • 52. ENGINEERS AND DEVICES WORKING TOGETHER CPU utilization does not update during idle CPU utilization is only updated when the CPU wakes up after a long idle period
  • 53. ENGINEERS AND DEVICES WORKING TOGETHER Fix Method: ignore overutilized state for idle CPUs static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, int local_group, struct sg_lb_stats *sgs, bool *overload, bool *overutilized) { unsigned long load; int i, nr_running; memset(sgs, 0, sizeof(*sgs)); for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { [...] if (cpu_overutilized(i) && !idle_cpu(i)) { *overutilized = true; if (!sgs->group_misfit_task && rq->misfit_task) sgs->group_misfit_task = capacity_of(i); } [...] } } Code flow is altered so we only consider the overutilized state for non-idle CPUs
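The before/after behaviour of update_sg_lb_stats() can be modelled as a simple flag computation: a stale utilization value on an idle CPU keeps the flag set before the fix, and is ignored after it. An illustrative sketch, not kernel code:

```python
def overutilized_flag(cpus, skip_idle):
    # cpus: list of (util, capacity, is_idle) tuples. The flag starts
    # false and any qualifying CPU sets it, as in update_sg_lb_stats().
    flag = False
    for util, capacity, is_idle in cpus:
        if skip_idle and is_idle:
            continue  # the fix: idle CPUs are not considered
        if util > capacity * 0.8:
            flag = True
    return flag

# One idle LITTLE core with a stale util of 400 (> 447 * 0.8 = 357.6),
# plus a quiet LITTLE core and the small task on a big core.
stale = [(400, 447, True), (50, 447, False), (95, 1024, False)]
```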
  • 54. ENGINEERS AND DEVICES WORKING TOGETHER After applying patch to fix this...
  • 55. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 56. ENGINEERS AND DEVICES WORKING TOGETHER Related materials ● Notebooks and related materials for both worked examples ○ https://fileserver.linaro.org/owncloud/index.php/s/5gpVpzN0FdxMmGl ○ ipython notebooks for workload generation and analysis ○ Trace data before and after fixing together with platform.json ● Patches are under discussion on eas-dev mailing list ○ sched/fair: support to spread task in lowest schedule domain ○ sched/fair: avoid small task to migrate to higher capacity CPU ○ sched/fair: filter task for energy aware path ○ sched/fair: consider over utilized only for CPU is not idle
  • 57. ENGINEERS AND DEVICES WORKING TOGETHER Next steps ● You can debug the scheduler ○ Try to focus on decision making, not hacks ○ New decisions should be as generic as possible (ideally based on normalized units) ○ Sharing resulting patches for review is highly recommended ■ Perhaps the fix can be improved, or someone else has already expressed it differently ● Understanding tracepoint patches and the tooling from ARM ○ Basic python coding experience is needed to utilize LISA libraries ● Understanding SchedTune ○ SchedTune biases task and CPU utilization levels to influence CPU and OPP selection decisions ○ Evaluate the energy-performance trade-off ○ Without tools, it’s hard to define and debug the SchedTune boost margin on a specific platform
  • 58. Thank You #LAS16 For further information: www.linaro.org or support@linaro.org LAS16 keynotes and videos on: connect.linaro.org