Process Scheduling
Hao-Ran Liu
Objective
• Decide which process runs, when, and for
how long
• Considering the overhead of context
switches, we need to balance between
conflicting goals:
– CPU utilization (high throughput)
– Interactive performance (low latency)
Multitasking
• Cooperative
– A process does not stop running until it voluntarily
decides to do so
– Any process can monopolize the processor; a hung
process that never yields can lock the entire system
– A technique used in many user-mode threading
libraries.
• Preemptive
– A running process can be suspended at any time
(usually because it exhausts its time slice)
Types of processes
• I/O-bound processes
– spend most of their time waiting for I/O
– should be executed often (for short durations)
when they are runnable
• CPU-bound processes
– spend most of their time executing code; tend
to run until they are preempted
– should be executed for longer durations (to
improve throughput)
Scheduling policies
• Check sched(7) man page for more details
• Normal
• Real-time
Name Description
SCHED_NORMAL The standard time-sharing policy for regular tasks
SCHED_BATCH For CPU-bound tasks that do not need to preempt often
SCHED_IDLE For running very low priority background jobs (lower
than a +19 nice value)
SCHED_FIFO FIFO without time slice
SCHED_RR Round robin with maximum time slice
SCHED_DEADLINE Earliest Deadline First + Constant Bandwidth Server;
accepts a task only if its periodic job can be completed
before its deadline
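As a quick user-space illustration (a sketch of mine, not from the slides), a process can request one of these policies with the POSIX sched_setscheduler() call; the real-time policies normally require root or CAP_SYS_NICE, and the priority value 50 below is an arbitrary example:

#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* 1..99 for SCHED_FIFO/SCHED_RR; ignored for SCHED_NORMAL */
	struct sched_param sp = { .sched_priority = 50 };

	/* pid 0 means the calling process */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("now running under policy %d\n", sched_getscheduler(0));
	return 0;
}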
Process priority
• Processes with a higher priority
– run before those with a lower priority
– receive a longer time slice
• Priority range [static, dynamic]
– Normal, batch: [always 0, -20~+19], default: [0, 0],
dynamic priority is the nice value you adjust in user
space. A larger “nice” value corresponds to a lower
priority
– FIFO, RR: [0~99, 0], higher value means greater
priority. FIFO, RR processes are at a higher priority
than normal processes
– Deadline: Not applicable. Deadline processes are
always the highest priority in the system
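To make the “nice value you adjust in user space” point concrete, here is a small sketch (mine, not from the slides) using the standard nice() and setpriority() interfaces; the values +5 and +10 are arbitrary examples:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	/* Add 5 to our nice value (lower priority); nice() returns the new value */
	printf("nice(5) -> %d\n", nice(5));

	/* Set the nice value of the calling process (who = 0) to +10 */
	if (setpriority(PRIO_PROCESS, 0, 10) == -1)
		perror("setpriority");
	printf("current nice value: %d\n", getpriority(PRIO_PROCESS, 0));
	return 0;
}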
Time slice
• How long a task can run until it is
preempted
• The value of the time slice:
– higher: better throughput
– lower: better interactive performance (shorter
scheduling latency), but more CPU time
wasted on context switches
– Default value is usually pretty small (for good
interactive performance)
Completely Fair Scheduler
• The scheduler for SCHED_NORMAL,
SCHED_BATCH, SCHED_IDLE classes
• CFS assigns a proportion of the processor,
instead of time slices, to processes
– A process with higher nice value receives a smaller
proportion of the CPU
• If a process enters runnable state and has
consumed a smaller proportion of the CPU than
the currently executing one, it runs immediately,
preempting the current one.
CFS scheduler in action
• Two processes
– Video encoder (CPU-bound) and text editor (I/O-bound)
– Both processes have the same nice value
• We want the text editor to preempt the video encoder
when the editor is runnable
– the text editor consumes a smaller proportion of the
CPU than the video encoder, so it will preempt the
video encoder once it is runnable.
“timeslice” in CFS
• Target latency
– /proc/sys/kernel/sched_latency_ns
– the period in which all run queue tasks are scheduled at least
once
• Timeslice_CFS = target latency × (task weight /
total weight of all runnable tasks)
– Ex: target latency = 20ms, two runnable processes at the same
priority, each will run for 10ms before preemption
• As the number of runnable processes → ∞,
timeslice_CFS → 0
– Unacceptable switching costs
– CFS imposes a floor on the “timeslice”:
/proc/sys/kernel/sched_min_granularity_ns, default value is 1ms
• CFS is not “fair” if the number of processes is extremely
large
CFS example again
• Two processes, nice values = 0 and 5
– The weight for a nice value of 5 is about 1/3 of the
nice-0 weight
– If target latency = 20ms, the two processes receive 15ms
and 5ms “timeslices”, respectively
– If we change the nice values to 10 and 15, they still
receive the same “timeslices”
• The proportion of processor time that any
process receives is determined only by the
relative difference in niceness between it and
the other runnable processes
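The arithmetic above can be sketched in a few lines (my example, using the slide's simplified 3:1 weight ratio between nice 0 and nice 5 and the 1 ms minimum granularity; the real kernel uses a per-nice weight table, but the proportional idea is the same):

#include <stdio.h>

#define TARGET_LATENCY_MS  20.0
#define MIN_GRANULARITY_MS  1.0

/* "timeslice" = target latency * (task weight / total weight), floored */
static double timeslice_ms(double weight, double total_weight)
{
	double slice = TARGET_LATENCY_MS * weight / total_weight;
	return slice < MIN_GRANULARITY_MS ? MIN_GRANULARITY_MS : slice;
}

int main(void)
{
	double w_nice0 = 3.0, w_nice5 = 1.0;          /* 3:1 ratio as on the slide */
	double total = w_nice0 + w_nice5;

	printf("nice 0: %.1f ms, nice 5: %.1f ms\n",  /* prints 15.0 and 5.0 */
	       timeslice_ms(w_nice0, total), timeslice_ms(w_nice5, total));
	return 0;
}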
CFS group scheduling
• Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group
• Kernel config required:
– CONFIG_FAIR_GROUP_SCHED
– CONFIG_RT_GROUP_SCHED
• Example:
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu
# mkdir multimedia # create "multimedia" group of tasks
# mkdir browser # create "browser" group of tasks
# #Configure the multimedia group to receive twice the CPU bandwidth
# #of the browser group
# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares
# firefox & # Launch firefox and move it to "browser" group
# echo <firefox_pid> > browser/tasks
# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
Sporadic task model
deadline scheduling
• Each SCHED_DEADLINE task is characterized by the
"runtime", "deadline", and "period" parameters
• The kernel performs an admittance test when setting or
changing the SCHED_DEADLINE policy and attributes with
the sched_setattr() system call.
arrival/wakeup absolute deadline
| start time |
| | |
v v v
-----x--------xooooooooooooooooo--------x--------x---
|<-- Runtime ------->|
|<----------- Deadline ----------->|
|<-------------- Period ------------------->|
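The following sketch (based on the sched(7)/sched_setattr(2) man pages, not on the slides) sets these three parameters for the calling thread; glibc traditionally does not wrap sched_setattr(), so the attribute structure is declared locally and the raw syscall is used. The 10 ms / 100 ms / 100 ms values are arbitrary examples, and the call fails (e.g. with EBUSY) if the admission test rejects the reservation:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* struct sched_attr is not exported by (older) glibc, so declare it here */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;       /* used by SCHED_NORMAL/BATCH */
	uint32_t sched_priority;   /* used by SCHED_FIFO/RR */
	uint64_t sched_runtime;    /* SCHED_DEADLINE parameters, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  =  10 * 1000 * 1000;   /* 10 ms of CPU ...     */
	attr.sched_deadline = 100 * 1000 * 1000;   /* ... before 100 ms    */
	attr.sched_period   = 100 * 1000 * 1000;   /* in every 100 ms      */

	/* The kernel runs its admission test here */
	if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0 /* flags */) == -1) {
		perror("sched_setattr");
		return 1;
	}
	/* ... periodic real-time work would go here ... */
	return 0;
}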
Some tools for real-time tasks
• chrt sets or retrieves the real-time scheduling attributes
of an existing pid, or runs command with the given
attributes.
• Limiting the CPU usage of real-time and deadline
processes
– A nonblocking infinite loop in a thread scheduled under the
FIFO, RR, or DEADLINE policy will block all threads with lower
priority forever
– two /proc files can be used to reserve a certain amount of CPU
time to be used by non-real-time processes.
• /proc/sys/kernel/sched_rt_period_us (default: 1000000)
• /proc/sys/kernel/sched_rt_runtime_us (default: 950000)
chrt [options] [<policy>] <priority> [-p <pid> | <command> [<arg>...]]
Context switches
• schedule() calls context_switch() after a new
process has been selected to run
• context_switch()
– switch_mm(): switch virtual memory mapping
– switch_to(): switch processor state.
• The kernel knows a reschedule is needed when the
need_resched flag is set
– Set by scheduler_tick() when a process should be
preempted
– Set by try_to_wake_up() when a process with higher
priority than the current process is awakened
Example: creating kernel thread
#include <linux/module.h>
#include <linux/kthread.h>

#define DPRINTK(fmt, args...) \
	printk("%s(): " fmt, __func__, ##args)

static struct task_struct *kth_test_task;
static int data;

static int kth_test(void *arg)
{
	unsigned int timeout;
	int *d = (int *) arg;

	while (!kthread_should_stop()) {
		DPRINTK("data=%d\n", ++(*d));
		set_current_state(TASK_INTERRUPTIBLE);
		timeout = schedule_timeout(10 * HZ);
		if (timeout)
			DPRINTK("schedule_timeout returned early.\n");
	}
	DPRINTK("exit.\n");
	return 0;
}

static int __init init_modules(void)
{
	int ret;

	kth_test_task = kthread_create(kth_test, &data, "kth_test");
	if (IS_ERR(kth_test_task)) {
		ret = PTR_ERR(kth_test_task);
		kth_test_task = NULL;
		goto out;
	}
	wake_up_process(kth_test_task);
	return 0;
out:
	return ret;
}

static void __exit exit_modules(void)
{
	/* block until kth_test_task exits */
	kthread_stop(kth_test_task);
}

module_init(init_modules);
module_exit(exit_modules);
Process sleeping
• Processes need to sleep when requests cannot be
satisfied immediately
– Kernel output buffer is full or no data is available
• Rules for sleeping
– Never sleep in an atomic context
• Holding a spinlock, seqlock or RCU lock
• Interrupts are disabled
– Always check to ensure that the condition the process
was waiting for is indeed true after the process wakes up
Wait queue
• A wait queue contains a list of processes, all
waiting for a specific event
• Declaration and initialization of a wait queue
// defined and initialized statically with
DECLARE_WAIT_QUEUE_HEAD(name);
// initialized dynamically
wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);
wait_event macros
// queue: the wait queue head to use. Note that it is passed “by value”
// condition: arbitrary boolean expression, evaluated by the macro before
// and after sleeping until the condition becomes true. It may
// be evaluated an arbitrary number of times, so it should not
// have any side effects.
// timeout: wait for the specified number of clock ticks (in jiffies)
// uninterruptible sleep until a condition gets true
wait_event(queue, condition);
// interruptible sleep until a condition gets true, return -ERESTARTSYS if
// interrupted by a signal, return 0 if condition evaluated to be true
wait_event_interruptible(queue, condition);
// uninterruptible sleep until a condition gets true or a timeout elapses
// return 0 if the timeout elapsed, and the remaining jiffies if the
// condition evaluated to true before the timeout elapsed
wait_event_timeout(queue, condition, timeout);
// interruptible sleep until a condition gets true or a timeout elapses
// return 0 if the timeout elapsed, -ERESTARTSYS if interrupted by a
// signal, and the remaining jiffies if the condition evaluated to true
// before the timeout elapsed
wait_event_interruptible_timeout(queue, condition, timeout);
wake_up macros
// Wake processes that are sleeping on the queue q. The _interruptible
// form wakes only interruptible processes. Normally, only one exclusive
// waiter is awakened (to avoid thundering herd problem), but that
// behavior can be changed with the _nr or _all forms. The _sync version
// does not reschedule the CPU before returning.
void wake_up(wait_queue_head_t *q);
void wake_up_interruptible(wait_queue_head_t *q);
void wake_up_nr(wait_queue_head_t *q, int nr);
void wake_up_interruptible_nr(wait_queue_head_t *q, int nr);
void wake_up_all(wait_queue_head_t *q);
void wake_up_interruptible_all(wait_queue_head_t *q);
void wake_up_interruptible_sync(wait_queue_head_t *q);
• Within a real device driver, a process blocked in a read call is
awakened when data arrives; usually the hardware issues an
interrupt to signal such an event, and the driver awakens
waiting processes as part of handling the interrupt
A simple example of putting
processes to sleep
• sleepy device behavior: any process that
attempts to read from the device is put to
sleep. Whenever a process writes to the
device, all sleeping processes are awakened
• Note that on a single processor, the second
process to wake up would immediately go
back to sleep
sleepy’s read and write
/* Assumed module-scope declarations for this example (not shown on the slide):
   static DECLARE_WAIT_QUEUE_HEAD(wq);
   static int flag = 0; */

ssize_t sleepy_read (struct file *filp, char __user *buf,
                     size_t count, loff_t *pos) {
	printk(KERN_DEBUG "process %i (%s) going to sleep\n",
	       current->pid, current->comm);
	wait_event_interruptible(wq, flag != 0);
	flag = 0;
	printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
	return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char __user *buf,
                      size_t count, loff_t *pos) {
	printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
	       current->pid, current->comm);
	flag = 1;
	wake_up_interruptible(&wq);
	return count; /* succeed, to avoid retrial */
}
Implementation of wait_event:
How to implement sleep manually
#define wait_event(wq, condition) \
do { \
	if (condition) \
		break; \
	__wait_event(wq, condition); \
} while (0)

#define __wait_event(wq, condition) \
do { \
	DEFINE_WAIT(__wait); \
	for (;;) { \
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE); \
		if (condition) \
			break; \
		schedule(); \
	} \
	finish_wait(&wq, &__wait); \
} while (0)
Implementation of wait_event:
How to implement sleep manually
• prepare_to_wait
– adds a wait queue entry to the wait queue and sets the
process state
• finish_wait
– sets the task state to TASK_RUNNING and removes the wait
queue entry from the wait queue
• Questions:
– What if the ‘if (condition) ..’ statement is moved to
the front of prepare_to_wait()?
– What if the ‘wake_up’ event happens just after the ‘if
(condition) ..’ statement but before the execution of
the schedule() function?
User Preemption
• It can occur if need_resched is true when
returning to user-space
– from a system call
– from an interrupt handler
Kernel Preemption
• In nonpreemptive kernels, kernel code runs until
completion.
– The scheduler cannot reschedule a task while it is in
the kernel
– kernel code is scheduled cooperatively, not
preemptively
• Since the 2.6 kernel, however, the Linux kernel
has been preemptive:
– It is now possible to preempt a task at any point, so
long as the kernel is in a state in which it is safe to
reschedule
• Safe => preempt_count == 0 (kernel doesn’t hold any lock
and isn’t in any atomic context like softirq or hardirq)
Kernel Preemption
• preempt_count
– a variable in each process’s thread_info
– Begins at zero, increments when the kernel
enters an atomic context, and decrements when
it leaves
– If this counter is zero, the kernel is preemptible
Cases that need preemption disabled
• Per-CPU data structures
• Some registers must be protected
– On x86, the kernel does not save FPU state
except for user tasks. Entering and exiting
FPU mode is a critical section that must occur
while preemption is disabled
struct this_needs_locking tux[NR_CPUS];
tux[smp_processor_id()] = some_value;
/* task is preempted here... */
something = tux[smp_processor_id()];
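One conventional fix for the race sketched above (my sketch, reusing the hypothetical tux array) is to disable preemption around the per-CPU access, for example with get_cpu()/put_cpu(), which bump and drop preempt_count:

int cpu;

cpu = get_cpu();              /* disables preemption, returns current CPU id */
tux[cpu] = some_value;
something = tux[cpu];         /* still guaranteed to be on the same CPU */
put_cpu();                    /* re-enables preemption */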
preempt_count
/*
* We put the hardirq and softirq counter into the preemption
* counter. The bitmask has the following meaning:
*
* - bits 0-7 are the preemption count (max preemption depth: 256)
* - bits 8-15 are the softirq count (max # of softirqs: 256)
*
* The hardirq count can in theory reach the same as NR_IRQS.
* In reality, the number of nested IRQS is limited to the stack
* size as well. For archs with over 1000 IRQS it is not practical
* to expect that they will all nest. We give a max of 10 bits for
* hardirq nesting. An arch may choose to give less than 10 bits.
* m68k expects it to be 8.
*
* - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024)
* - bit 26 is the NMI_MASK
* - bit 28 is the PREEMPT_ACTIVE flag
*
* PREEMPT_MASK: 0x000000ff
* SOFTIRQ_MASK: 0x0000ff00
* HARDIRQ_MASK: 0x03ff0000
* NMI_MASK: 0x04000000
*/
include/linux/hardirq.h
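These bit fields are what the usual context-test helpers are built on; a rough sketch (the exact layout varies between kernel versions, as noted above) of how kernel code consumes them:

#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/hardirq.h>

static void report_context(void)
{
	if (in_irq())                 /* hardirq count bits nonzero */
		pr_info("in hardirq context\n");
	else if (in_softirq())        /* softirq count bits nonzero */
		pr_info("in softirq context\n");
	else if (in_atomic())         /* preemption count bits nonzero */
		pr_info("in atomic (non-preemptible) context\n");
	else
		pr_info("in preemptible process context\n");
}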
References
• Linux Kernel Development, 3rd Edition,
Robert Love, 2010
• Linux kernel source, http://lxr.free-electrons.com