Using tracing to tune and optimize EAS
Leo Yan & Daniel Thompson
Linaro Support and Solutions Engineering
ENGINEERS AND DEVICES WORKING TOGETHER
Agenda
● Background
○ Review of typical workflow for GTS tuning
○ Introduce a workflow for EAS tuning
○ Quick introduction of the tools that support the new
workflow
● Worked examples
○ Development platform for the worked examples
○ Analyze the task ping-pong issue
○ Analyze a small task staying on a big core
● Further reading
Typical workflow for optimizing GTS
This simple workflow is easy to understand
but has problems in practice.
Tunables are complex and interact with each other (making it hard to decide which tunable to adjust).
Tuning for multiple use-cases is difficult.
Tuning is SoC specific, optimizations will not
necessarily apply to other SoCs.
Workflow loop: Benchmark → Adjust tunables → Benchmark (repeat)
GTS tunables
● GTS: up_threshold, down_threshold, packing_enable, load_avg_period_ms, frequency_invariant_load_scale
● CPUFreq interactive governor: hispeed_freq, go_hispeed_load, target_loads, timer_rate, min_sample_time, above_hispeed_delay
Typical workflow for optimizing EAS systems
Workflow loop: Benchmark → Trace a use-case → Examine traces → Improve decisions (repeat)
Workflow is knowledge intensive.
Decisions can be improved by
improving the power model or by
finding new opportunities in the
scheduler (a.k.a. debugging).
Optimizations are more portable.
● Can be shared for review
● Likely to benefit your new SoC
Trace points for EAS
The kernel has a set of stock trace points for diving into EAS debugging.
Trace points are added by patches marked “DEBUG”; these are not posted to LKML and are currently only found in product-focused patchsets.
Enable the kernel config: CONFIG_FTRACE
Trace points for EAS - cont.
● PELT signals: sched_contrib_scale_f, sched_load_avg_task, sched_load_avg_cpu
● Scheduler default events (tracepoints in the mainline kernel): sched_switch, sched_migrate_task, sched_wakeup, sched_wakeup_new
● SchedFreq: cpufreq_sched_throttled, cpufreq_sched_request_opp, cpufreq_sched_update_capacity
● EAS core: sched_energy_diff, sched_overutilized
● SchedTune: sched_tune_config, sched_boost_cpu, sched_tune_tasks_update, sched_tune_boostgroup_update, sched_boost_task, sched_tune_filter
All but the scheduler default events are tracepoints added by the EAS extension. LISA can be easily extended to support these trace points.
E.g. enable trace points:
trace-cmd start -e sched_energy_diff -e sched_wakeup
Summary
Features: EAS vs GTS
● Decision-making strategy: power modeling (EAS) vs heuristic thresholds (GTS)
● Frequency selection: sched-freq or sched-util, integrated with the scheduler (EAS) vs the governor’s cascaded parameters (GTS)
● Scenario-based tuning: SchedTune (cgroup) for EAS; none for GTS
Energy aware scheduling (EAS) has very few tunables and thus requires a significantly different approach to tuning and optimization when compared to global task scheduling (GTS).
LISA - interactive analysis and testing
● “Distro” of python libraries for interactive analysis and automatic testing
● Library support includes
○ Target control and manipulation (set cpufreq mode, run this workload, initiate trace)
○ Gather power measurement data and calculate energy
○ Analyze and graph trace results
○ Test assertions about the trace results (e.g. big CPU does not run more than 20ms)
● Interactive analysis using ipython and jupyter
○ Provides a notebook framework similar to Maple, Mathematica or Sage
○ Notebooks mix together documentation with executable code fragments
○ Notebooks record the output of an interactive session
○ All permanent file storage is on the host
○ Trace files and graphs can be reexamined in the future without starting the target
● Automatic testing
○ Notebooks containing assertion based tests that can be converted to normal python
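The kind of assertion-based check mentioned above (e.g. “big CPU does not run more than 20ms”) can be sketched with plain python. The residency dict and helper below are illustrative stand-ins for data a LISA notebook would extract from a trace, not the real LISA API.

```python
# Sketch of an assertion-style test like those LISA notebooks encode.
# The residency data and helper are hypothetical, for illustration only.

def assert_max_residency(residency_ms, cpus, limit_ms):
    """Fail if any of the given CPUs was busy for longer than limit_ms."""
    for cpu in cpus:
        busy = residency_ms.get(cpu, 0.0)
        assert busy <= limit_ms, \
            "CPU%d ran %.1fms, limit is %.1fms" % (cpu, busy, limit_ms)

# Toy per-CPU busy-time figures extracted from a trace (milliseconds)
residency_ms = {0: 180.0, 1: 120.5, 4: 12.3, 5: 0.0}

big_cpus = [4, 5, 6, 7]
assert_max_residency(residency_ms, big_cpus, limit_ms=20.0)  # passes
```

A notebook built this way can later be converted to a normal python script for automated regression testing.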
General workflow for LISA
http://events.linuxfoundation.org/sites/events/files/slides/ELC16_LISA_20160326.pdf
LISA interactive test mode
● http://127.0.0.1:8888 opens the ipython notebook file
● Menu & control buttons
● Markdown (headers)
● Execute box containing python code
● Result box: the results of experiments are recorded and shown the next time the file is reopened
kernelshark
Task scheduling
Filters for events, tasks, and CPUs
Details for events
Development platform for the worked examples
● All examples use artificial workloads to provoke a specific behavior
○ It turned out to be quite difficult to deliberately provoke undesired behavior!
● Examples are reproducible on 96Boards HiKey
○ Octo-A53 multi-cluster (2x4) SMP device with five OPPs per cluster
■ Not big.LITTLE, and not using a fast/slow silicon process
○ We are able to fake a fast/slow system by using asymmetric power modeling parameters and artificially reducing the running/runnable delta time for the “fast” CPUs, so the metrics indicate that they have a higher performance
● Most plots shown in these slides are copied from a LISA notebook
○ Notebooks and trace files have been shared for use after training
Testing environment
● CPU capacity info
○ The LITTLE core’s highest capacity is 447@850MHz
○ The big core’s highest capacity is 1024@1.1GHz
○ This case is running with correct power model parameters
● Test case
○ 16 small tasks are running with 15% utilization of little CPU (util ~= 67)
○ A single large task is running with 40% utilization of little CPU (util ~= 180)
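The util figures above follow directly from the capacity numbers, since task utilization is normalized against CPU capacity. A quick sanity check:

```python
# Sanity-check the normalized utilization figures quoted above.
# A task using X% of a LITTLE CPU whose highest capacity is 447
# contributes roughly X% * 447 to its util signal.

LITTLE_CAPACITY = 447

small_task_util = round(0.15 * LITTLE_CAPACITY)  # 16 small tasks at 15%
large_task_util = round(0.40 * LITTLE_CAPACITY)  # one large task at 40%

print(small_task_util)  # 67
print(large_task_util)  # 179 (the slides round this to ~180)
```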
General analysis steps
Step 1: Run workload and generate trace data
● LISA::TestEnv(): connect with the target (using platform.json, the platform description file)
● LISA::rtapp(): generate the workload
Step 2: Analyze trace data
● LISA::Trace(): analyze events
● LISA::Filters(): filter out tasks
● LISA::TraceAnalysis(): analyze tasks
Connect with target board
Specify target board info for
connection
Calibration for CPUs
Tools copied to target board
Enable ftrace events
Create connection
Generate and execute workload
Define workload
Capture Ftrace data
Execute workload
Capture energy data
Graph showing task placement in LISA
Specify events to be extracted
Specify time interval
Display task placement graph
First make a quick graph showing task placement...
There are too many tasks; we need a method to quickly filter out statistics for each task.
… and decide how to tackle step 2 analysis
Step 1: Run workload and generate trace data
● LISA::TestEnv(): connect with the target (using platform.json, the platform description file)
● LISA::rtapp(): generate the workload
Step 2: Analyze trace data
● LISA::Trace(): analyze events
● LISA::Filters(): filter out tasks
● LISA::TraceAnalysis(): analyze tasks
Analyze trace data for events
events_to_parse = [
    "sched_switch",
    "sched_wakeup",
    "sched_wakeup_new",
    "sched_contrib_scale_f",
    "sched_load_avg_cpu",
    "sched_load_avg_task",
    "sched_tune_config",
    "sched_tune_tasks_update",
    "sched_tune_boostgroup_update",
    "sched_tune_filter",
    "sched_boost_cpu",
    "sched_boost_task",
    "sched_energy_diff",
    "cpu_frequency",
    "cpu_capacity",
]
platform.json (truncated; the real file contains further fields):
{
    "clusters": {
        "big": [4, 5, 6, 7],
        "little": [0, 1, 2, 3]
    },
    "cpus_count": 8,
    "freqs": {
        "big": [208000, 432000, 729000, 960000, 1200000],
        "little": [208000, 432000, 729000, 960000, 1200000]
    },
    [...]
}
trace.dat format: “SYSTRACE” or “Ftrace”
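Because platform.json is plain JSON, it can be inspected with nothing more than the standard json module. The fragment below embeds a copy of the description shown above (the real file contains further fields elided here):

```python
import json

# Minimal copy of the platform.json fragment shown on this slide;
# fields elided with [...] in the slide are omitted here.
platform_json = """
{
  "clusters": {"big": [4, 5, 6, 7], "little": [0, 1, 2, 3]},
  "cpus_count": 8,
  "freqs": {
    "big": [208000, 432000, 729000, 960000, 1200000],
    "little": [208000, 432000, 729000, 960000, 1200000]
  }
}
"""

platform = json.loads(platform_json)
print(platform["clusters"]["big"])       # [4, 5, 6, 7]
print(len(platform["freqs"]["little"]))  # 5 OPPs per cluster
```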
Selecting only tasks of interest (big tasks)
top_big_tasks
{'mmcqd/0': 733,
'Task01' : 2441,
'task010': 2442,
'task011': 2443,
'task012': 2444,
'task015': 2447,
'task016': 2448,
'task017': 2449,
'task019': 2451,
'task020': 2452,
'task021': 2453,
'task022': 2454,
'task023': 2455,
'task024': 2456,
'task025': 2457}
Plot big tasks with TraceAnalysis
TraceAnalysis graph of task residency on CPUs
1: At the beginning the task is placed on a big core
2: Then it ping-pongs between big cores and LITTLE cores
TraceAnalysis graph of task PELT signals
Big core’s highest capacity
LITTLE core’s highest capacity
Big core’s tipping point
LITTLE core’s tipping point
util_avg = PELT(running time)
load_avg = PELT(running time + runnable time) * weight
         = PELT(running time + runnable time)   (if NICE = 0)
The difference between load_avg and util_avg is the task’s runnable time on the rq (for NICE = 0).
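PELT is a geometrically decayed average: each period’s contribution is halved after 32 periods. A heavily simplified continuous sketch of how util_avg converges for a periodic task (the kernel uses fixed-point arithmetic over 1024us windows; the helper below is an illustrative approximation, not the kernel algorithm):

```python
# Simplified sketch of a PELT-style geometric average. Each 1ms period
# decays accumulated history by y, where y^32 = 0.5, and adds the
# current period's running contribution. Illustrative only.

Y = 0.5 ** (1 / 32.0)   # decay factor per period
MAX_UTIL = 1024         # a fully busy CPU saturates toward 1024

def pelt_track(duty_cycle, periods):
    """Track util for a task running duty_cycle of every period."""
    util = 0.0
    for _ in range(periods):
        util = util * Y + duty_cycle * MAX_UTIL * (1 - Y)
    return util

# A 50% duty-cycle task converges toward ~512 after a few hundred ms
print(round(pelt_track(0.5, 500)))  # ~512
```

The fixed point of the recurrence is duty_cycle * 1024, which is why a task’s util_avg tracks the fraction of time it spends running.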
System crosses the tipping point into “over-utilized”:

static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
    [...]
    if (!se) {
        add_nr_running(rq, 1);
        if (!task_new && !rq->rd->overutilized &&
            cpu_overutilized(rq->cpu))
            rq->rd->overutilized = true;   /* over tipping point */
    [...]
    }
}

The wake-up path chooses between the EAS path and SMP load balance:

static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                               int sd_flag, int wake_flags)
{
    [...]
    if (!sd) {
        if (energy_aware() && !cpu_rq(cpu)->rd->overutilized)
            new_cpu = energy_aware_wake_cpu(p, prev_cpu);  /* EAS path */
        else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
            new_cpu = select_idle_sibling(p, new_cpu);     /* SMP load balance */
    } else while (sd) {
        [...]
    }
}

Load balance also checks the tipping point:

static struct sched_group *find_busiest_group(struct lb_env *env)
{
    if (energy_aware() && !env->dst_rq->rd->overutilized)
        goto out_balanced;   /* EAS path */
    [...]                    /* SMP load balance */
}
Write a function to analyze the tipping point
If the LISA toolkit does not include the plotting function you need, you can write a plot function yourself.
Plot for tipping point
System is over tipping point, migrate task
from CPU3 (little core) to CPU4 (big core)
System is under tipping point, migrate task
from CPU4 (big core) to CPU3 (little core)
Detailed trace log for migration to big core
nohz_idle_balance() for task migration
Migrate big task from CPU3 to CPU4
Issue 1: migrating big task back to LITTLE core
Migrate task to LITTLE core
Migrate big task from CPU4 to CPU3
Issue 2: migrating small tasks to big core
Migrate small tasks to big core
CPU is overutilized again
Tipping point criteria
Over tipping point: ANY CPU has cpu_util(cpu) > cpu_capacity(cpu) * 80%
● E.g. LITTLE core: cpu_capacity(cpu0) = 447, util at 90% of capacity is over
● E.g. big core: cpu_capacity(cpu4) = 1024, util at 90% of capacity is over
Under tipping point: ALL CPUs have cpu_util(cpu) < cpu_capacity(cpu) * 80%
● E.g. LITTLE core: cpu_capacity(cpu0) = 447, util at 70% of capacity is under
● E.g. big core: cpu_capacity(cpu4) = 1024, util at 70% of capacity is under
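The criteria above reduce to a simple predicate over (util, capacity) pairs. A sketch with the capacities from the worked example (the real kernel expresses the 80% margin in fixed-point arithmetic; this is an illustration of the rule, not the kernel code):

```python
# Sketch of the 80% tipping-point test described on this slide.
CAPACITY_MARGIN = 0.8

def cpu_overutilized(util, capacity):
    return util > capacity * CAPACITY_MARGIN

def system_over_tipping_point(cpu_states):
    """cpu_states: list of (util, capacity); ANY over-utilized CPU tips it."""
    return any(cpu_overutilized(u, c) for u, c in cpu_states)

# LITTLE capacity 447, big capacity 1024 (as in the worked example)
assert cpu_overutilized(0.9 * 447, 447)        # 90% of LITTLE: over
assert not cpu_overutilized(0.7 * 1024, 1024)  # 70% of big: under
assert system_over_tipping_point([(100, 447), (0.9 * 1024, 1024)])
```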
Phenomenon for ping-pong issue
When the system is over the tipping point, Task1 is migrated from the LITTLE cluster to the big cluster; once the system is back under the tipping point, Task1 is migrated from the big cluster back to the LITTLE cluster, and the cycle repeats.
Fixes for ping-pong issue
● Filter out small tasks to avoid migrating them to a big core when the system goes over the tipping point
● Avoid migrating the big task back to the LITTLE cluster when the system drops back under the tipping point
Fallback to LITTLE cluster after it is idle
Once the system is under the tipping point and the LITTLE cluster is idle, migrate the big task back to the LITTLE cluster.
Filter out small tasks for (tick, idle) load balance
static
int can_migrate_task(struct task_struct *p, struct lb_env *env)
{
    [...]
    if (energy_aware() &&
        (capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu))) {
        if (task_util(p) * 4 < capacity_orig_of(env->src_cpu))
            return 0;
    }
    [...]
}

Filter out small tasks: task utilization < ¼ of the LITTLE CPU capacity. These tasks will NOT be migrated to a big core after “return 0”. Result: only big tasks have a chance to migrate to a big core.
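The filter condition can be checked against the worked example’s numbers. A sketch mirroring the `task_util(p) * 4 < capacity_orig_of(...)` test (the capacities are those from the faked HiKey model):

```python
# Sketch of the small-task filter from can_migrate_task() above:
# tasks whose utilization is below 1/4 of the source CPU's capacity
# are kept off the big cluster during (tick, idle) load balance.

LITTLE_CAPACITY = 447

def is_small_task(task_util, src_capacity=LITTLE_CAPACITY):
    # mirrors: task_util(p) * 4 < capacity_orig_of(env->src_cpu)
    return task_util * 4 < src_capacity

# With the worked example's utilizations:
assert is_small_task(67)       # 15% tasks (util ~67) stay on LITTLE
assert not is_small_task(180)  # the 40% task (util ~180) may migrate
```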
Avoid migrating big task to LITTLE cluster
static bool need_spread_task(int cpu)
{
    struct sched_domain *sd;
    int spread = 0, i;

    if (cpu_rq(cpu)->rd->overutilized)
        return 1;

    sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
    if (!sd)
        return 0;

    for_each_cpu(i, sched_domain_span(sd)) {
        if (cpu_rq(i)->cfs.h_nr_running >= 1 &&
            cpu_halfutilized(i)) {
            spread = 1;
            break;
        }
    }

    return spread;
}

static int select_task_rq_fair(struct task_struct *p, int prev_cpu,
                               int sd_flag, int wake_flags)
{
    [...]
    if (!sd) {
        if (energy_aware() &&
            (!need_spread_task(cpu) || need_filter_task(p)))
            new_cpu = energy_aware_wake_cpu(p, prev_cpu);
        else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
            new_cpu = select_idle_sibling(p, new_cpu);
    } else while (sd) {
        [...]
    }
}

Check whether the cluster is busy as well as checking the system tipping point:
● Easier to spread tasks within the cluster if the cluster is busy
● Fall back to migrating the big task when the cluster is idle
Filter out small tasks for wake-up balance

static bool need_filter_task(struct task_struct *p)
{
    int cpu = task_cpu(p);
    int origin_max_cap = capacity_orig_of(cpu);
    int target_max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val;
    struct sched_domain *sd;
    struct sched_group *sg;

    sd = rcu_dereference(per_cpu(sd_ea, cpu));
    sg = sd->groups;

    do {
        int first_cpu = group_first_cpu(sg);

        if (capacity_orig_of(first_cpu) < target_max_cap &&
            task_util(p) * 4 < capacity_orig_of(first_cpu))
            target_max_cap = capacity_orig_of(first_cpu);
    } while (sg = sg->next, sg != sd->groups);

    if (target_max_cap < origin_max_cap)
        return 1;

    return 0;
}

Two purposes of this function:
● Select small tasks (task utilization < ¼ of the LITTLE CPU capacity) and keep them on the energy aware path
● Prevent the energy aware path for big tasks on the big core from doing harm to little tasks
Results after applying patches
The big task always runs on CPU6 and the small tasks run on LITTLE cores!
Testing environment
● Testing environment
○ The LITTLE core’s highest capacity is 447@850MHz
○ The big core’s highest capacity is 1024@1.1GHz
○ A single small task is running with 9% utilization of the big CPU (util ~= 95)
● Phenomenon
○ The single small task runs on the big CPU for a long time, even though its utilization is well below the tipping point
Global view of task placement
The small task runs on a big core for about 3s; during this period the system is not busy.
Analyze task utilization
Filter only related tasks
Analyze task’s utilization signal
PELT Signals for task utilization
The task utilization is normalized to ~95 on the big core. This does not exceed the LITTLE core’s tipping point of 447 * 80% = 358, so the LITTLE core can meet the task’s capacity requirement and the scheduler should place this task on a LITTLE core.
Use kernelshark to check wake up path
In the energy aware path we would expect to see “sched_boost_task”, but in this case the event is missing, implying the scheduler performed normal load balancing because the “overutilized” flag is set. Thus the balancer runs to select an idle CPU in the lowest scheduling domain; if the previous CPU is idle the task will stick to it so it can benefit from a “hot cache”.
The “tipping point” has been set for a long time
static inline void update_sg_lb_stats(struct lb_env *env,
            struct sched_group *group, int load_idx,
            int local_group, struct sg_lb_stats *sgs,
            bool *overload, bool *overutilized)
{
    unsigned long load;
    int i, nr_running;

    memset(sgs, 0, sizeof(*sgs));

    for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
        [...]
        if (cpu_overutilized(i)) {
            *overutilized = true;
            if (!sgs->group_misfit_task && rq->misfit_task)
                sgs->group_misfit_task = capacity_of(i);
        }
        [...]
    }
}

*overutilized is initialized to ‘false’ before we commence the update, so if any CPU is over-utilized, that is enough to keep us over the tipping point. So we need to analyze the load of every CPU.
Plot for CPU utilization and idle state
CPU utilization does not update during idle
CPU utilization is only updated when the CPU is woken up after a long time
Fix Method: ignore overutilized state for idle CPUs
static inline void update_sg_lb_stats(struct lb_env *env,
            struct sched_group *group, int load_idx,
            int local_group, struct sg_lb_stats *sgs,
            bool *overload, bool *overutilized)
{
    unsigned long load;
    int i, nr_running;

    memset(sgs, 0, sizeof(*sgs));

    for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
        [...]
        if (cpu_overutilized(i) && !idle_cpu(i)) {
            *overutilized = true;
            if (!sgs->group_misfit_task && rq->misfit_task)
                sgs->group_misfit_task = capacity_of(i);
        }
        [...]
    }
}
Code flow is altered so
we only consider the
overutilized state for
non-idle CPUs
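The effect of the fix is that a stale utilization figure on an idle CPU can no longer keep the system flagged as over-utilized. A sketch of before and after, using illustrative (util, capacity, is_idle) tuples:

```python
# Sketch of the fixed tipping-point scan: an idle CPU's stale
# utilization is ignored, so only busy CPUs can set "overutilized".
# The tuple format (util, capacity, is_idle) is illustrative only.

CAPACITY_MARGIN = 0.8

def system_overutilized(cpus):
    return any(util > cap * CAPACITY_MARGIN and not is_idle
               for util, cap, is_idle in cpus)

cpus = [(900, 1024, True),   # idle big CPU with stale (high) utilization
        (95, 1024, False)]   # the small task actually running

assert not system_overutilized(cpus)  # idle CPU no longer tips the system
```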
After applying the patch to fix this...
Related materials
● Notebooks and related materials for both worked examples
○ https://fileserver.linaro.org/owncloud/index.php/s/5gpVpzN0FdxMmGl
○ ipython notebooks for workload generation and analysis
○ Trace data before and after fixing together with platform.json
● Patches are under discussion on eas-dev mailing list
○ sched/fair: support to spread task in lowest schedule domain
○ sched/fair: avoid small task to migrate to higher capacity CPU
○ sched/fair: filter task for energy aware path
○ sched/fair: consider over utilized only for CPU is not idle
Next steps
● You can debug the scheduler
○ Try to focus on decision making, not hacks
○ New decisions should be as generic as possible (ideally based on normalized units)
○ Sharing resulting patches for review is highly recommended
■ Perhaps the fix can be improved, or is already expressed differently by someone else
● Understanding tracepoint patches and the tooling from ARM
○ Basic python coding experience is needed to utilize LISA libraries
● Understanding SchedTune
○ SchedTune biases the task utilization levels used for CPU selection and the CPU utilization levels used for CPU and OPP selection decisions
○ Evaluate the energy-performance trade-off
○ Without tools, it’s hard to define and debug the SchedTune boost margin on a specific platform
Thank You
#LAS16
For further information: www.linaro.org or support@linaro.org
LAS16 keynotes and videos on: connect.linaro.org
  • 7. ENGINEERS AND DEVICES WORKING TOGETHER Trace points for EAS The kernel has a set of stock trace points for diving into debugging. Trace points are added by patches marked “DEBUG”; these are not posted to LKML and are currently only found in product-focused patchsets. Enable the kernel config option: CONFIG_FTRACE
  • 8. ENGINEERS AND DEVICES WORKING TOGETHER Trace points for EAS - cont. sched_contrib_scale_f sched_load_avg_task sched_load_avg_cpu PELT signals EAS core SchedTune sched_switch sched_migrate_task sched_wakeup sched_wakeup_new Scheduler default events SchedFreq LISA can be easily extended to support these trace points cpufreq_sched_throttled cpufreq_sched_request_opp cpufreq_sched_update_capacity sched_energy_diff sched_overutilized sched_tune_config sched_boost_cpu sched_tune_tasks_update sched_tune_boostgroup_update sched_boost_task sched_tune_filter Tracepoints in mainline kernel Tracepoints for EAS extension E.g. enable trace points: trace-cmd start -e sched_energy_diff -e sched_wakeup
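When enabling many of the tracepoints listed above at once, the trace-cmd invocation shown on the slide can be built programmatically; a small illustrative sketch (the helper is a convenience, not part of trace-cmd or LISA):

```python
# Build a trace-cmd command line that enables a list of ftrace events.
# The event names are the EAS tracepoints listed on this slide.
def trace_cmd_start(events):
    return "trace-cmd start " + " ".join("-e " + e for e in events)

cmd = trace_cmd_start(["sched_energy_diff", "sched_wakeup"])
# → "trace-cmd start -e sched_energy_diff -e sched_wakeup"
```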
  • 9. ENGINEERS AND DEVICES WORKING TOGETHER Summary — Feature | EAS | GTS: Decision-making strategy | Power modeling | Heuristic thresholds; Frequency selection | Sched-freq or sched-util, integrated with the scheduler | Governor’s cascaded parameters; Scenario-based tuning | SchedTune (cgroup) | None. Energy-aware scheduling (EAS) has very few tunables and thus requires a significantly different approach to tuning and optimization when compared to global task scheduling (GTS).
  • 10. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 11. ENGINEERS AND DEVICES WORKING TOGETHER LISA - interactive analysis and testing ● “Distro” of python libraries for interactive analysis and automatic testing ● Library support includes ○ Target control and manipulation (set cpufreq mode, run this workload, initiate trace) ○ Gather power measurement data and calculate energy ○ Analyze and graph trace results ○ Test assertions about the trace results (e.g. big CPU does not run more than 20ms) ● Interactive analysis using ipython and jupyter ○ Provides a notebook framework similar to Maple, Mathematica or Sage ○ Notebooks mix together documentation with executable code fragments ○ Notebooks record the output of an interactive session ○ All permanent file storage is on the host ○ Trace files and graphs can be reexamined in the future without starting the target ● Automatic testing ○ Notebooks containing assertion based tests that can be converted to normal python
  • 12. ENGINEERS AND DEVICES WORKING TOGETHER General workflow for LISA http://events.linuxfoundation.org/sites/events/files/slides/ELC16_LISA_20160326.pdf
  • 13. ENGINEERS AND DEVICES WORKING TOGETHER LISA interactive test mode http://127.0.0.1:8888 with ipython file Menu & control buttons Markdown (headers) Execute box containing python code Result box; recorded experiment results are shown again the next time this file is reopened
  • 14. ENGINEERS AND DEVICES WORKING TOGETHER kernelshark Task scheduling Filters for events, tasks, and CPUs Details for events
  • 15. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 16. ENGINEERS AND DEVICES WORKING TOGETHER Development platform for the worked examples ● All examples use artificial workloads to provoke a specific behaviour ○ It turned out to be quite difficult to deliberately provoke undesired behavior! ● Examples are reproducible on 96Boards HiKey ○ Octo-A53 multi-cluster (2x4) SMP device with five OPPs per cluster ■ Not big.LITTLE, and not using a fast/slow silicon process ○ We are able to fake a fast/slow system by using asymmetric power modeling parameters and artificially reducing the running/runnable delta time for the “fast” CPUs so the metrics indicate that they have higher performance ● Most plots shown in these slides are copied from a LISA notebook ○ Notebooks and trace files have been shared for use after training
  • 17. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 18. ENGINEERS AND DEVICES WORKING TOGETHER Testing environment ● CPU capacity info ○ The little core’s highest capacity is 447@850MHz ○ The big core’s highest capacity is 1024@1.1GHz ○ This case is running with correct power model parameters ● Test case ○ 16 small tasks are running with 15% utilization of little CPU (util ~= 67) ○ A single large task is running with 40% utilization of little CPU (util ~= 180)
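The utilization figures above follow from the little core's capacity of 447: a task busy for a given fraction of the time settles at roughly that fraction of the CPU's capacity in PELT units. A quick arithmetic check (the helper name is illustrative):

```python
LITTLE_CAPACITY = 447  # little core capacity at its highest OPP (850MHz)

def util_from_pct(pct, capacity=LITTLE_CAPACITY):
    # A task busy pct% of the time on a CPU of this capacity settles at
    # roughly pct% of that capacity in PELT utilization units.
    return pct / 100.0 * capacity

small = util_from_pct(15)   # ≈ 67, matching "util ~= 67" above
large = util_from_pct(40)   # ≈ 179, matching "util ~= 180" above
```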
  • 19. ENGINEERS AND DEVICES WORKING TOGETHER General analysis steps LISA:: Trace() LISA:: Filters() Generate workload Filter out tasks LISA:: TraceAnalysis() Analyze tasks LISA:: TestEnv() Connect with target Analyze events LISA:: rtapp() Step 1: Run workload and generate trace data Step 2: Analyze trace data platform.json Platform description file
  • 20. ENGINEERS AND DEVICES WORKING TOGETHER Connect with target board Specify target board info for connection Calibration for CPUs Tools copied to target board Enable ftrace events Create connection
  • 21. ENGINEERS AND DEVICES WORKING TOGETHER Generate and execute workload Define workload Capture Ftrace data Execute workload Capture energy data
  • 22. ENGINEERS AND DEVICES WORKING TOGETHER Graph showing task placement in LISA Specify events to be extracted Specify time interval Display task placement graph
  • 23. ENGINEERS AND DEVICES WORKING TOGETHER First make a quick graph showing task placement... Too many tasks; we need a method to quickly filter out statistics for each task.
  • 24. ENGINEERS AND DEVICES WORKING TOGETHER … and decide how to tackle step 2 analysis LISA:: Trace() LISA:: Filters() Generate workload Filter out tasks LISA:: TraceAnalysis() Analyze tasks LISA:: TestEnv() Connect with target Analyze events LISA:: rtapp() Step 1: Run workload and generate trace data Step 2: Analyze trace data platform.json Platform description file
  • 25. ENGINEERS AND DEVICES WORKING TOGETHER Analyze trace data for events events_to_parse = [ "sched_switch", "sched_wakeup", "sched_wakeup_new", "sched_contrib_scale_f", "sched_load_avg_cpu", "sched_load_avg_task", "sched_tune_config", "sched_tune_tasks_update", "sched_tune_boostgroup_update", "sched_tune_filter", "sched_boost_cpu", "sched_boost_task", "sched_energy_diff", "cpu_frequency", "cpu_capacity" ] platform.json { "clusters": { "big": [ 4, 5, 6, 7 ], "little": [ 0, 1, 2, 3 ] }, "cpus_count": 8, "freqs": { "big": [ 208000, 432000, 729000, 960000, 1200000 ], "little": [ 208000, 432000, 729000, 960000, 1200000 ] }, [...] } trace.dat (format: “SYSTRACE” or “Ftrace”)
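The platform.json fragment above is plain JSON and can be consumed directly; a sketch that inverts the clusters map so trace rows keyed by CPU number can be labelled with their cluster name (field names are copied from the file shown, the inversion itself is illustrative):

```python
import json

# Abbreviated copy of the platform.json shown on this slide.
PLATFORM_JSON = """{
  "clusters": {"big": [4, 5, 6, 7], "little": [0, 1, 2, 3]},
  "cpus_count": 8,
  "freqs": {"big": [208000, 432000, 729000, 960000, 1200000],
            "little": [208000, 432000, 729000, 960000, 1200000]}
}"""

platform = json.loads(PLATFORM_JSON)

# Invert the clusters map: cpu number -> cluster name.
cpu_to_cluster = {cpu: name
                  for name, cpus in platform["clusters"].items()
                  for cpu in cpus}
```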
  • 26. ENGINEERS AND DEVICES WORKING TOGETHER Selecting only tasks of interest (big tasks) top_big_tasks {'mmcqd/0': 733, 'Task01' : 2441, 'task010': 2442, 'task011': 2443, 'task012': 2444, 'task015': 2447, 'task016': 2448, 'task017': 2449, 'task019': 2451, 'task020': 2452, 'task021': 2453, 'task022': 2454, 'task023': 2455, 'task024': 2456, 'task025': 2457}
  • 27. ENGINEERS AND DEVICES WORKING TOGETHER Plot big tasks with TraceAnalysis
  • 28. ENGINEERS AND DEVICES WORKING TOGETHER TraceAnalysis graph of task residency on CPUs At beginning task is placed on big core1 Then it ping-pongs between big cores and LITTLE cores 2
  • 29. ENGINEERS AND DEVICES WORKING TOGETHER TraceAnalysis graph of task PELT signals Big core’s highest capacity LITTLE core’s highest capacity Big core’s tipping point LITTLE core’s tipping point util_avg = PELT(running time) load_avg = PELT(running time + runnable time) * weight = PELT(running time + runnable time) (if NICE = 0) The difference between load_avg and util_avg is task’s runnable time on rq (for NICE=0)
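The util_avg relationship above can be illustrated with a toy PELT model: each 1 ms window's contribution decays geometrically with a 32 ms half-life, and the signal is scaled to 1024. This is a floating-point sketch of the principle, not the kernel's fixed-point implementation:

```python
HALF_LIFE_MS = 32
Y = 0.5 ** (1.0 / HALF_LIFE_MS)   # per-millisecond decay factor, y^32 = 0.5
SCALE = 1024

def pelt(busy_fractions):
    """Fold per-millisecond running fractions (oldest first) into a
    PELT-style geometrically decayed utilization signal."""
    signal = 0.0
    for frac in busy_fractions:
        signal = signal * Y + frac * SCALE * (1.0 - Y)
    return signal

# A task running 50% of every millisecond converges towards ~512,
# i.e. half of the 1024 scale.
steady = pelt([0.5] * 1000)
```

load_avg differs only in also accumulating runnable (waiting) time, which is why the gap between the two signals measures time spent runnable on the rq.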
  • 30. ENGINEERS AND DEVICES WORKING TOGETHER static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { [...] if (!sd) { if (energy_aware() && !cpu_rq(cpu)->rd->overutilized) new_cpu = energy_aware_wake_cpu(p, prev_cpu); else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ new_cpu = select_idle_sibling(p, new_cpu); } else while (sd) { [...] } } System cross tipping point for “over-utilized” static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags) { [...] if (!se) { add_nr_running(rq, 1); if (!task_new && !rq->rd->overutilized && cpu_overutilized(rq->cpu)) rq->rd->overutilized = true; [...] } } Over tipping point EAS path SMP load balance static struct sched_group *find_busiest_group(struct lb_env *env) { if (energy_aware() && !env->dst_rq->rd->overutilized) goto out_balanced; [...] } EAS path SMP load balance
  • 31. ENGINEERS AND DEVICES WORKING TOGETHER Write a function to analyze the tipping point If the LISA toolkit does not include the plotting function you need, you can write a plot function yourself
  • 32. ENGINEERS AND DEVICES WORKING TOGETHER Plot for tipping point System is over tipping point, migrate task from CPU3 (little core) to CPU4 (big core) System is under tipping point, migrate task from CPU4 (big core) to CPU3 (little core)
  • 33. ENGINEERS AND DEVICES WORKING TOGETHER Detailed trace log for migration to big core nohz_idle_balance() for tasks migration Migrate big task from CPU3 to CPU4
  • 34. ENGINEERS AND DEVICES WORKING TOGETHER Issue 1: big task migrated back to LITTLE core Migrate task to LITTLE core Migrate big task from CPU4 to CPU3
  • 35. ENGINEERS AND DEVICES WORKING TOGETHER Issue 2: small tasks migrated to big core Migrate small tasks to big core CPU is overutilized again
  • 36. ENGINEERS AND DEVICES WORKING TOGETHER Tipping point criteria Over tipping point Util: 90% Any CPU: cpu_util(cpu) > cpu_capacity(cpu) * 80% Under tipping point 80% of capacity E.g. LITTLE core: cpu_capacity(cpu0) = 447 ALL CPUs: cpu_util(cpu) < cpu_capacity(cpu) * 80% Util: 90% 80% of capacity E.g. Big core: cpu_capacity(cpu4) = 1024 Util: 70% 80% of capacity E.g. LITTLE core: cpu_capacity(cpu0) = 447 Util: 70% 80% of capacity E.g. Big core: cpu_capacity(cpu4) = 1024
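The criteria above reduce to a simple predicate: the system is over the tipping point when any one CPU's utilization exceeds 80% of its capacity, and under it only when all CPUs are below that margin. A sketch using the capacities from this platform:

```python
# Per-CPU capacities for the faked fast/slow HiKey used in these slides.
CAPACITY = {0: 447, 1: 447, 2: 447, 3: 447,
            4: 1024, 5: 1024, 6: 1024, 7: 1024}
MARGIN = 0.8  # the 80% capacity margin from the slide

def cpu_overutilized(cpu, util):
    return util > CAPACITY[cpu] * MARGIN

def system_over_tipping_point(cpu_utils):
    # Any single over-utilized CPU tips the whole system over.
    return any(cpu_overutilized(cpu, u) for cpu, u in cpu_utils.items())

# A LITTLE core at 90% utilization (447 * 0.9 ≈ 402 > 357.6) tips the
# system; a big core at 70% (1024 * 0.7 ≈ 717 < 819.2) does not.
```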
  • 37. ENGINEERS AND DEVICES WORKING TOGETHER Phenomenon for ping-pong issue LITTLE cluster Over tipping point Under tipping point big cluster LITTLE cluster big cluster Task1 Task1 Task1 Task1 Migration
  • 38. ENGINEERS AND DEVICES WORKING TOGETHER Fixes for ping-pong issue LITTLE cluster Over tipping point Under tipping point big cluster LITTLE cluster big cluster Task1 Task1 Task1 Migration Filter out small tasks to avoid migrating them to the big core Avoid migrating big task back to LITTLE cluster
  • 39. ENGINEERS AND DEVICES WORKING TOGETHER Fallback to LITTLE cluster after it is idle LITTLE cluster Over tipping point Under tipping point big cluster LITTLE cluster big cluster Task1 Task1 Task1 Migration Migrate big task back to LITTLE cluster if it’s idle Task1
  • 40. ENGINEERS AND DEVICES WORKING TOGETHER Filter out small tasks for (tick, idle) load balance static int can_migrate_task(struct task_struct *p, struct lb_env *env) { [...] if (energy_aware() && (capacity_orig_of(env->dst_cpu) > capacity_orig_of(env->src_cpu))) { if (task_util(p) * 4 < capacity_orig_of(env->src_cpu)) return 0; } [...] } Filter out small tasks: task running time < ¼ LITTLE CPU capacity. Returning 0 means these tasks will NOT be migrated to the big core. Result: only big tasks have a chance to migrate to the big core.
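The ¼-capacity filter in can_migrate_task() above can be checked in isolation: with the workload from this example, the util ≈ 67 small tasks are filtered while the util ≈ 180 big task remains migratable. A direct transcription of the condition (illustrative, not kernel code):

```python
def is_small_task(task_util, src_cpu_capacity):
    # Mirrors the kernel condition: a task is "small" when four times
    # its utilization is still below the source CPU's capacity.
    return task_util * 4 < src_cpu_capacity

LITTLE_CAPACITY = 447
# Small tasks (util ~= 67) are filtered:  67 * 4 = 268 < 447.
# The big task (util ~= 180) may migrate: 180 * 4 = 720 >= 447.
```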
  • 41. ENGINEERS AND DEVICES WORKING TOGETHER Avoid migrating big task to LITTLE cluster static bool need_spread_task(int cpu) { struct sched_domain *sd; int spread = 0, i; if (cpu_rq(cpu)->rd->overutilized) return 1; sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd); if (!sd) return 0; for_each_cpu(i, sched_domain_span(sd)) { if (cpu_rq(i)->cfs.h_nr_running >= 1 && cpu_halfutilized(i)) { spread = 1; break; } } return spread; } static int select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags) { [...] if (!sd) { if (energy_aware() && (!need_spread_task(cpu) || need_filter_task(p))) new_cpu = energy_aware_wake_cpu(p, prev_cpu); else if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */ new_cpu = select_idle_sibling(p, new_cpu); } else while (sd) { [...] } } Check if cluster is busy or not as well as checking system tipping point: ● Easier to spread tasks within cluster if cluster is busy ● Fallback to migrating big task when cluster is idle
  • 42. ENGINEERS AND DEVICES WORKING TOGETHER static bool need_filter_task(struct task_struct *p) { int cpu = task_cpu(p); int origin_max_cap = capacity_orig_of(cpu); int target_max_cap = cpu_rq(cpu)->rd->max_cpu_capacity.val; struct sched_domain *sd; struct sched_group *sg; sd = rcu_dereference(per_cpu(sd_ea, cpu)); sg = sd->groups; do { int first_cpu = group_first_cpu(sg); if (capacity_orig_of(first_cpu) < target_max_cap && task_util(p) * 4 < capacity_orig_of(first_cpu)) target_max_cap = capacity_orig_of(first_cpu); } while (sg = sg->next, sg != sd->groups); if (target_max_cap < origin_max_cap) return 1; return 0; } Filter out small tasks for wake up balance Two purposes of this function: ● Select small tasks (task running time < ¼ LITTLE CPU capacity) and keep them on the energy aware path ● Prevent energy aware path for big tasks on the big core from doing harm to little tasks.
  • 43. ENGINEERS AND DEVICES WORKING TOGETHER Results after applying patches The big task always runs on CPU6 and the small tasks run on LITTLE cores!
  • 44. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 45. ENGINEERS AND DEVICES WORKING TOGETHER Testing environment ● Testing environment ○ The LITTLE core’s highest capacity is 447@850MHz ○ The big core’s highest capacity is 1024@1.1GHz ○ A single small task is running with 9% utilization of the big CPU (util ~= 95) ● Phenomenon ○ The single small task runs on the big CPU for a long time, even though its utilization is well below the tipping point
  • 46. ENGINEERS AND DEVICES WORKING TOGETHER Global view of task placement The small task runs on a big core for about 3s; during this period the system is not busy
  • 47. ENGINEERS AND DEVICES WORKING TOGETHER Analyze task utilization Filter only related tasks Analyze task’s utilization signal
  • 48. ENGINEERS AND DEVICES WORKING TOGETHER PELT signals for task utilization The task utilization is normalized to a value of ~95 on the big core; this does not exceed the LITTLE core’s tipping point of 447 * 80% = 358. Thus the LITTLE core can meet the task’s capacity requirement, so the scheduler should place this task on a LITTLE core.
  • 49. ENGINEERS AND DEVICES WORKING TOGETHER Use kernelshark to check the wake-up path On the energy-aware path we would expect to see “sched_boost_task”, but in this case the event is missing, implying the scheduler performed normal load balancing because the “overutilized” flag is set. The balancer therefore selects an idle CPU in the lowest scheduling domain; if the previous CPU is idle, the task sticks to it so it can benefit from a “hot cache”.
  • 50. ENGINEERS AND DEVICES WORKING TOGETHER The “tipping point” has been set for a long time static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, int local_group, struct sg_lb_stats *sgs, bool *overload, bool *overutilized) { unsigned long load; int i, nr_running; memset(sgs, 0, sizeof(*sgs)); for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { [...] if (cpu_overutilized(i)) { *overutilized = true; if (!sgs->group_misfit_task && rq->misfit_task) sgs->group_misfit_task = capacity_of(i); } [...] } } *overutilized is initialized as ‘false’ before we commence the update, so if any CPU is over-utilized, then this is enough to keep us over the tipping point. So we need to analyze the load of every CPU.
  • 51. ENGINEERS AND DEVICES WORKING TOGETHER Plot for CPU utilization and idle state
  • 52. ENGINEERS AND DEVICES WORKING TOGETHER CPU utilization does not update during idle CPU utilization is only updated when the CPU wakes up after a long idle period
  • 53. ENGINEERS AND DEVICES WORKING TOGETHER Fix Method: ignore overutilized state for idle CPUs static inline void update_sg_lb_stats(struct lb_env *env, struct sched_group *group, int load_idx, int local_group, struct sg_lb_stats *sgs, bool *overload, bool *overutilized) { unsigned long load; int i, nr_running; memset(sgs, 0, sizeof(*sgs)); for_each_cpu_and(i, sched_group_cpus(group), env->cpus) { [...] if (cpu_overutilized(i) && !idle_cpu(i)) { *overutilized = true; if (!sgs->group_misfit_task && rq->misfit_task) sgs->group_misfit_task = capacity_of(i); } [...] } } Code flow is altered so we only consider the overutilized state for non-idle CPUs
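The before/after behaviour of update_sg_lb_stats() can be modelled as a simple flag computation: a stale utilization value on an idle CPU keeps the flag set before the fix, and is ignored after it. An illustrative sketch, not kernel code:

```python
def overutilized_flag(cpus, skip_idle):
    # cpus: list of (util, capacity, is_idle) tuples. The flag starts
    # false and any qualifying CPU sets it, as in update_sg_lb_stats().
    flag = False
    for util, capacity, is_idle in cpus:
        if skip_idle and is_idle:
            continue  # the fix: idle CPUs are not considered
        if util > capacity * 0.8:
            flag = True
    return flag

# One idle LITTLE core with a stale util of 400 (> 447 * 0.8 = 357.6),
# plus a quiet LITTLE core and the small task on a big core.
stale = [(400, 447, True), (50, 447, False), (95, 1024, False)]
```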
  • 54. ENGINEERS AND DEVICES WORKING TOGETHER After applying patch to fix this...
  • 55. ENGINEERS AND DEVICES WORKING TOGETHER Agenda ● Background ○ Review of typical workflow for GTS tuning ○ Introduce a workflow for EAS tuning ○ Quick introduction of the tools that support the new workflow ● Worked examples ○ Development platform for the worked examples ○ Task ping-pong issue ○ Small task staying on big core ● Further reading
  • 56. ENGINEERS AND DEVICES WORKING TOGETHER Related materials ● Notebooks and related materials for both worked examples ○ https://fileserver.linaro.org/owncloud/index.php/s/5gpVpzN0FdxMmGl ○ ipython notebooks for workload generation and analysis ○ Trace data before and after fixing together with platform.json ● Patches are under discussion on eas-dev mailing list ○ sched/fair: support to spread task in lowest schedule domain ○ sched/fair: avoid small task to migrate to higher capacity CPU ○ sched/fair: filter task for energy aware path ○ sched/fair: consider over utilized only for CPU is not idle
  • 57. ENGINEERS AND DEVICES WORKING TOGETHER Next steps ● You can debug the scheduler ○ Try to focus on decision making, not hacks ○ New decisions should be as generic as possible (ideally based on normalized units) ○ Sharing resulting patches for review is highly recommended ■ Perhaps the fix can be improved, or someone else has already expressed it differently ● Understanding tracepoint patches and the tooling from ARM ○ Basic python coding experience is needed to utilize LISA libraries ● Understanding SchedTune ○ SchedTune biases task and CPU utilization levels to influence CPU and OPP selection decisions ○ Evaluate the energy-performance trade-off ○ Without tools, it’s hard to define and debug the SchedTune boost margin on a specific platform
  • 58. Thank You #LAS16 For further information: www.linaro.org or support@linaro.org LAS16 keynotes and videos on: connect.linaro.org