Process Scheduling
Hao-Ran Liu
Objective
• Decide which process runs, when, and for
how long
• Considering the overhead of context
switches, we need to balance between
conflicting goals:
– CPU utilization (high throughput)
– Interactive performance (low latency)
Multitasking
• Cooperative
– A process does not stop running until it voluntarily
decides to do so
– Any process can monopolize the processor; a hung
process that never yields can lock the entire system
– A technique used in many user-mode threading
libraries.
• Preemptive
– A running process can be suspended at any time
(usually because it exhausts its time slice)
Types of processes
• I/O-bound processes
– spend most of their time waiting for I/O
– should be executed often (for short durations)
when they are runnable
• CPU-bound processes
– spend most of their time executing code; tend
to run until they are preempted
– should be executed for longer durations (to
improve throughput)
Scheduling policies
• Check sched(7) man page for more details
• Normal
• Real-time
Name Description
SCHED_NORMAL The standard time-sharing policy for regular tasks
SCHED_BATCH For CPU-bound tasks that do not need to preempt often
SCHED_IDLE For running very low priority background jobs (lower
than a +19 nice value)
SCHED_FIFO FIFO without time slice
SCHED_RR Round robin with maximum time slice
SCHED_DEADLINE Earliest Deadline First + Constant Bandwidth Server;
accepts a task only if its periodic job can be completed
before its deadline
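As a quick user-space illustration (a sketch of mine, not from the slides), a process can request one of these policies with the POSIX sched_setscheduler() call; the real-time policies normally require root or CAP_SYS_NICE, and the priority value 50 below is an arbitrary example:

#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* 1..99 for SCHED_FIFO/SCHED_RR; ignored for SCHED_NORMAL */
	struct sched_param sp = { .sched_priority = 50 };

	/* pid 0 means the calling process */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("now running under policy %d\n", sched_getscheduler(0));
	return 0;
}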
Process priority
• Processes with a higher priority
– run before those with a lower priority
– receive a longer time slice
• Priority range [static, dynamic]
– Normal, batch: [always 0, -20~+19], default: [0, 0],
dynamic priority is the nice value you adjust in user
space. A larger “nice” value corresponds to a lower
priority
– FIFO, RR: [0~99, 0], higher value means greater
priority. FIFO, RR processes are at a higher priority
than normal processes
– Deadline: Not applicable. Deadline processes are
always the highest priority in the system
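To make the “nice value you adjust in user space” point concrete, here is a small sketch (mine, not from the slides) using the standard nice() and setpriority() interfaces; the values +5 and +10 are arbitrary examples:

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void)
{
	/* Add 5 to our nice value (lower priority); nice() returns the new value */
	printf("nice(5) -> %d\n", nice(5));

	/* Set the nice value of the calling process (who = 0) to +10 */
	if (setpriority(PRIO_PROCESS, 0, 10) == -1)
		perror("setpriority");
	printf("current nice value: %d\n", getpriority(PRIO_PROCESS, 0));
	return 0;
}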
Time slice
• How long a task can run until it is
preempted
• The value of the time slice:
– higher: better throughput
– lower: better interactive performance (shorter
scheduling latency), but more CPU time
wasted on context switches
– Default value is usually pretty small (for good
interactive performance)
Completely Fair Scheduler
• The scheduler for SCHED_NORMAL,
SCHED_BATCH, SCHED_IDLE classes
• CFS assigns a proportion of the processor,
instead of time slices, to processes
– A process with higher nice value receives a smaller
proportion of the CPU
• If a process enters runnable state and has
consumed a smaller proportion of the CPU than
the currently executing one, it runs immediately,
preempting the current one.
CFS scheduler in action
• Two processes
– Video encoder (CPU-bound) and text editor (I/O-bound)
– Both processes have the same nice value
• We want the text editor to preempt the video encoder
when the editor is runnable
– the text editor consumes a smaller proportion of the
CPU than the video encoder, so it will preempt the
video encoder once it is runnable.
“timeslice” in CFS
• Target latency
– /proc/sys/kernel/sched_latency_ns
– the period in which all run queue tasks are scheduled at least
once
• Timeslice_CFS = target latency × (task weight /
total weight of all runnable tasks)
– Ex: target latency = 20ms, two runnable processes at the same
priority, each will run for 10ms before preemption
• As the number of runnable processes → ∞,
timeslice_CFS → 0
– Unacceptable switching costs
– CFS imposes a floor on the “timeslice”:
/proc/sys/kernel/sched_min_granularity_ns, default value is 1ms
• CFS is not “fair” if the number of processes is extremely
large
CFS example again
• Two processes, nice values = 0 and 5
– The weight for a nice value of 5 is about 1/3 of the
nice-0 weight
– If target latency = 20ms, the two processes receive 15ms
and 5ms “timeslices”, respectively
– If we change the nice values to 10 and 15, they still
receive the same “timeslices”
• The proportion of processor time that any
process receives is determined only by the
relative difference in niceness between it and
the other runnable processes
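The arithmetic above can be sketched in a few lines (my example, using the slide's simplified 3:1 weight ratio between nice 0 and nice 5 and the 1 ms minimum granularity; the real kernel uses a per-nice weight table, but the proportional idea is the same):

#include <stdio.h>

#define TARGET_LATENCY_MS  20.0
#define MIN_GRANULARITY_MS  1.0

/* "timeslice" = target latency * (task weight / total weight), floored */
static double timeslice_ms(double weight, double total_weight)
{
	double slice = TARGET_LATENCY_MS * weight / total_weight;
	return slice < MIN_GRANULARITY_MS ? MIN_GRANULARITY_MS : slice;
}

int main(void)
{
	double w_nice0 = 3.0, w_nice5 = 1.0;          /* 3:1 ratio as on the slide */
	double total = w_nice0 + w_nice5;

	printf("nice 0: %.1f ms, nice 5: %.1f ms\n",  /* prints 15.0 and 5.0 */
	       timeslice_ms(w_nice0, total), timeslice_ms(w_nice5, total));
	return 0;
}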
CFS group scheduling
• Sometimes, it may be desirable to group tasks and
provide fair CPU time to each such task group
• Kernel config required:
– CONFIG_FAIR_GROUP_SCHED
– CONFIG_RT_GROUP_SCHED
• Example:
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu
# mount -t cgroup -ocpu none /sys/fs/cgroup/cpu
# cd /sys/fs/cgroup/cpu
# mkdir multimedia # create "multimedia" group of tasks
# mkdir browser # create "browser" group of tasks
# #Configure the multimedia group to receive twice the CPU bandwidth
# #of the browser group
# echo 2048 > multimedia/cpu.shares
# echo 1024 > browser/cpu.shares
# firefox & # Launch firefox and move it to "browser" group
# echo <firefox_pid> > browser/tasks
# #Launch gmplayer (or your favourite movie player)
# echo <movie_player_pid> > multimedia/tasks
Sporadic task model
deadline scheduling
• Each SCHED_DEADLINE task is characterized by the
"runtime", "deadline", and "period" parameters
• The kernel performs an admittance test when setting or
changing the SCHED_DEADLINE policy and attributes with
the sched_setattr() system call.
arrival/wakeup absolute deadline
| start time |
| | |
v v v
-----x--------xooooooooooooooooo--------x--------x---
|<-- Runtime ------->|
|<----------- Deadline ----------->|
|<-------------- Period ------------------->|
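The following sketch (based on the sched(7)/sched_setattr(2) man pages, not on the slides) sets these three parameters for the calling thread; glibc traditionally does not wrap sched_setattr(), so the attribute structure is declared locally and the raw syscall is used. The 10 ms / 100 ms / 100 ms values are arbitrary examples, and the call fails (e.g. with EBUSY) if the admission test rejects the reservation:

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* struct sched_attr is not exported by (older) glibc, so declare it here */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;       /* used by SCHED_NORMAL/BATCH */
	uint32_t sched_priority;   /* used by SCHED_FIFO/RR */
	uint64_t sched_runtime;    /* SCHED_DEADLINE parameters, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  =  10 * 1000 * 1000;   /* 10 ms of CPU ...     */
	attr.sched_deadline = 100 * 1000 * 1000;   /* ... before 100 ms    */
	attr.sched_period   = 100 * 1000 * 1000;   /* in every 100 ms      */

	/* The kernel runs its admission test here */
	if (syscall(SYS_sched_setattr, 0 /* self */, &attr, 0 /* flags */) == -1) {
		perror("sched_setattr");
		return 1;
	}
	/* ... periodic real-time work would go here ... */
	return 0;
}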
Some tools for real-time tasks
• chrt sets or retrieves the real-time scheduling attributes
of an existing pid, or runs command with the given
attributes.
• Limiting the CPU usage of real-time and deadline
processes
– A nonblocking infinite loop in a thread scheduled under the
FIFO, RR, or DEADLINE policy will block all threads with lower
priority forever
– two /proc files can be used to reserve a certain amount of CPU
time to be used by non-real-time processes.
• /proc/sys/kernel/sched_rt_period_us (default: 1000000)
• /proc/sys/kernel/sched_rt_runtime_us (default: 950000)
chrt [options] [<policy>] <priority> [-p <pid> | <command> [<arg>...]]
Context switches
• schedule() calls context_switch() after a new
process has been selected to run
• context_switch()
– switch_mm(): switch virtual memory mapping
– switch_to(): switch processor state.
• The kernel knows a reschedule is needed when the
need_resched flag is set
– Set by scheduler_tick() when a process should be
preempted
– Set by try_to_wake_up() when a process with higher
priority than the current process is awakened
Example: creating kernel thread
#include <linux/module.h>
#include <linux/kthread.h>

#define DPRINTK(fmt, args...) \
	printk("%s(): " fmt, __func__, ##args)

static struct task_struct *kth_test_task;
static int data;

static int kth_test(void *arg)
{
	unsigned int timeout;
	int *d = (int *) arg;

	while (!kthread_should_stop()) {
		DPRINTK("data=%d\n", ++(*d));
		set_current_state(TASK_INTERRUPTIBLE);
		timeout = schedule_timeout(10 * HZ);
		if (timeout)
			DPRINTK("schedule_timeout returned early.\n");
	}
	DPRINTK("exit.\n");
	return 0;
}

static int __init init_modules(void)
{
	int ret;

	kth_test_task = kthread_create(kth_test, &data, "kth_test");
	if (IS_ERR(kth_test_task)) {
		ret = PTR_ERR(kth_test_task);
		kth_test_task = NULL;
		goto out;
	}
	wake_up_process(kth_test_task);
	return 0;
out:
	return ret;
}

static void __exit exit_modules(void)
{
	/* block until kth_test_task exits */
	kthread_stop(kth_test_task);
}

module_init(init_modules);
module_exit(exit_modules);
Process sleeping
• Processes need to sleep when requests cannot be
satisfied immediately
– Kernel output buffer is full or no data is available
• Rules for sleeping
– Never sleep in an atomic context
• Holding a spinlock, seqlock or RCU lock
• Interrupts are disabled
– Always check to ensure that the condition the process
was waiting for is indeed true after the process wakes up
Wait queue
• A wait queue contains a list of processes, all
waiting for a specific event
• Declaration and initialization of a wait queue
// defined and initialized statically with
DECLARE_WAIT_QUEUE_HEAD(name);
// initialized dynamically
wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);
wait_event macros
// queue: the wait queue head to use. Note that it is passed “by value”
// condition: arbitrary boolean expression, evaluated by the macro before
// and after sleeping until the condition becomes true. It may
// be evaluated an arbitrary number of times, so it should not
// have any side effects.
// timeout: wait for the specified number of clock ticks (in jiffies)
// uninterruptible sleep until a condition gets true
wait_event(queue, condition);
// interruptible sleep until a condition gets true, return -ERESTARTSYS if
// interrupted by a signal, return 0 if condition evaluated to be true
wait_event_interruptible(queue, condition);
// uninterruptible sleep until a condition gets true or a timeout elapses
// return 0 if the timeout elapsed, and the remaining jiffies if the
// condition evaluated to true before the timeout elapsed
wait_event_timeout(queue, condition, timeout);
// interruptible sleep until a condition gets true or a timeout elapses
// return 0 if the timeout elapsed, -ERESTARTSYS if interrupted by a
// signal, and the remaining jiffies if the condition evaluated to true
// before the timeout elapsed
wait_event_interruptible_timeout(queue, condition, timeout);
wake_up macros
// Wake processes that are sleeping on the queue q. The _interruptible
// form wakes only interruptible processes. Normally, only one exclusive
// waiter is awakened (to avoid thundering herd problem), but that
// behavior can be changed with the _nr or _all forms. The _sync version
// does not reschedule the CPU before returning.
void wake_up(wait_queue_head_t *q);
void wake_up_interruptible(wait_queue_head_t *q);
void wake_up_nr(wait_queue_head_t *q, int nr);
void wake_up_interruptible_nr(wait_queue_head_t *q, int nr);
void wake_up_all(wait_queue_head_t *q);
void wake_up_interruptible_all(wait_queue_head_t *q);
void wake_up_interruptible_sync(wait_queue_head_t *q);
• Within a real device driver, a process blocked in a read call is
awakened when data arrives; usually the hardware issues an
interrupt to signal such an event, and the driver awakens
waiting processes as part of handling the interrupt
A simple example of putting
processes to sleep
• sleepy device behavior: any process that
attempts to read from the device is put to
sleep. Whenever a process writes to the
device, all sleeping processes are awakened
• Note that on a single processor, the second
process to wake up would immediately go
back to sleep
sleepy’s read and write
/* Assumed module-scope declarations for this example (not shown on the slide):
   static DECLARE_WAIT_QUEUE_HEAD(wq);
   static int flag = 0; */

ssize_t sleepy_read (struct file *filp, char __user *buf,
                     size_t count, loff_t *pos) {
	printk(KERN_DEBUG "process %i (%s) going to sleep\n",
	       current->pid, current->comm);
	wait_event_interruptible(wq, flag != 0);
	flag = 0;
	printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
	return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char __user *buf,
                      size_t count, loff_t *pos) {
	printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
	       current->pid, current->comm);
	flag = 1;
	wake_up_interruptible(&wq);
	return count; /* succeed, to avoid retrial */
}
Implementation of wait_event:
How to implement sleep manually
#define wait_event(wq, condition) \
do { \
	if (condition) \
		break; \
	__wait_event(wq, condition); \
} while (0)

#define __wait_event(wq, condition) \
do { \
	DEFINE_WAIT(__wait); \
	for (;;) { \
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE); \
		if (condition) \
			break; \
		schedule(); \
	} \
	finish_wait(&wq, &__wait); \
} while (0)
Implementation of wait_event:
How to implement sleep manually
• prepare_to_wait
– adds a wait queue entry to the wait queue and sets the
process state
• finish_wait
– sets the task state to TASK_RUNNING and removes the wait
queue entry from the wait queue
• Questions:
– What if the ‘if (condition) ..’ statement is moved to
the front of prepare_to_wait()?
– What if the ‘wake_up’ event happens just after the ‘if
(condition) ..’ statement but before the execution of
the schedule() function?
User Preemption
• It can occur if need_resched is true when
returning to user-space
– from a system call
– from an interrupt handler
Kernel Preemption
• In nonpreemptive kernels, kernel code runs until
completion.
– The scheduler cannot reschedule a task while it is in
the kernel
– kernel code is scheduled cooperatively, not
preemptively
• Since the 2.6 kernel, however, the Linux kernel
has been preemptive:
– It is now possible to preempt a task at any point, so
long as the kernel is in a state in which it is safe to
reschedule
• Safe => preempt_count == 0 (kernel doesn’t hold any lock
and isn’t in any atomic context like softirq or hardirq)
Kernel Preemption
• preempt_count
– a variable in each process’s thread_info
– Begins at zero, increments when the kernel
enters an atomic context, and decrements when
it leaves
– If this counter is zero, the kernel is preemptible
Cases that need preemption disabled
• Per-CPU data structures
• Some registers must be protected
– On x86, the kernel does not save FPU state
except for user tasks. Entering and exiting
FPU mode is a critical section that must occur
while preemption is disabled
struct this_needs_locking tux[NR_CPUS];
tux[smp_processor_id()] = some_value;
/* task is preempted here... */
something = tux[smp_processor_id()];
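One conventional fix for the race sketched above (my sketch, reusing the hypothetical tux array) is to disable preemption around the per-CPU access, for example with get_cpu()/put_cpu(), which bump and drop preempt_count:

int cpu;

cpu = get_cpu();              /* disables preemption, returns current CPU id */
tux[cpu] = some_value;
something = tux[cpu];         /* still guaranteed to be on the same CPU */
put_cpu();                    /* re-enables preemption */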
preempt_count
/*
* We put the hardirq and softirq counter into the preemption
* counter. The bitmask has the following meaning:
*
* - bits 0-7 are the preemption count (max preemption depth: 256)
* - bits 8-15 are the softirq count (max # of softirqs: 256)
*
* The hardirq count can in theory reach the same as NR_IRQS.
* In reality, the number of nested IRQS is limited to the stack
* size as well. For archs with over 1000 IRQS it is not practical
* to expect that they will all nest. We give a max of 10 bits for
* hardirq nesting. An arch may choose to give less than 10 bits.
* m68k expects it to be 8.
*
* - bits 16-25 are the hardirq count (max # of nested hardirqs: 1024)
* - bit 26 is the NMI_MASK
* - bit 28 is the PREEMPT_ACTIVE flag
*
* PREEMPT_MASK: 0x000000ff
* SOFTIRQ_MASK: 0x0000ff00
* HARDIRQ_MASK: 0x03ff0000
* NMI_MASK: 0x04000000
*/
include/linux/hardirq.h
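These bit fields are what the usual context-test helpers are built on; a rough sketch (the exact layout varies between kernel versions, as noted above) of how kernel code consumes them:

#include <linux/kernel.h>
#include <linux/preempt.h>
#include <linux/hardirq.h>

static void report_context(void)
{
	if (in_irq())                 /* hardirq count bits nonzero */
		pr_info("in hardirq context\n");
	else if (in_softirq())        /* softirq count bits nonzero */
		pr_info("in softirq context\n");
	else if (in_atomic())         /* preemption count bits nonzero */
		pr_info("in atomic (non-preemptible) context\n");
	else
		pr_info("in preemptible process context\n");
}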
References
• Linux Kernel Development, 3rd Edition,
Robert Love, 2010
• Linux kernel source, http://lxr.free-electrons.com