Process Scheduling with Linux

This post is continuing the OS series loosely based on Operating Systems: Three Easy Pieces and diving into concepts discussed therein. In particular, I look at the Linux scheduler CFS and use ftrace to track some of its functions to see how runtimes and weights of tasks are updated.

Scheduling Principles

The scheduling of tasks on a CPU running Linux is a compromise between fairness and overhead. While frequent switches between tasks increases the overall fairness by ensuring that they can all chew on their chores, the switching overhead starts to waste away our compute. Hence, the kernel tries to strike a fine balance between the two and the user is expected to leave the scheduler do its job. Modern Linux systems come equipped with the completely fair scheduler CFS. On a high level, CFS executes a task for a small amount of time and then either intervenes itself or the task stops on its own to account for CPU usage. If after accounting the task still has the lowest runtime + granularity among the tasks in the run-queue it continues running otherwise it is preempted and the next task in line is scheduled.

Only a few control parameters can be tweaked by the user without re-compiling the kernel. The time sched_latency within which scheduled tasks are guaranteed to run at least once is split between tasks on a given CPU. On my system it happens to be defined in /sys/kernel/debug/sched/latency_ns and is set to 23ms. High priority tasks can benefit from more CPU time by being assigned a high weight as sched_latency is distributed across the processes according to the formula

1
2
3
                              curr->load.weight              
time_slice = sched_latency * -------------------  
                               sum(load.weight)

At task initialization a weight is obtained according to the nice value of the task ranging from -20 to 19. The bigger of a bully you are as a process the lower your niceness value are. Nice values are mapped to weights according to a weight table constructed to preserve CPU proportionally ratios when the difference in nice values is constant. On 32bit or 64bit architecture a nice value of 0, NICE_LOAD_0, has an associated weight of 1024 or 1048576 (1024 « 10) respectively. Furthermore, min_granularity specifies a lower limit that sched_latency can be divided into and can be queried from /sys/kernel/debug/sched/sched_min_granularity_ns - 3ms in my system. These parameters used to be accessible through sysctl -A | grep sched. Not only are higher priority tasks assigned a longer time-slice, their accounted time (virtual runtime) is also lower than for lower priority tasks that ran for the same amount of real CPU time. This is again accomplished by using weights.

1
2
3
                             NICE_LOAD_0              
vruntime += delta_exec * -------------------  
                          curr->load.weight

Once a task has either voluntarily or non-voluntarily stopped execution, its accumulated vruntime is added to a data-structure called red-black tree with its nodes arranged by vruntime. Being an approximately balanced binary-search tree, lookups can be performed in logarithmic time. What makes red-black trees a good choice, are fast and provably bounded insertions and deletions. The task next in line to be executed on the CPU is the left-most entry in the red-black tree and corresponds to the node with the lowest runtime and is for performance reasons cached. To avoid newly queued tasks using up all the CPU time until they catch up with senior tasks, their vruntime is initialized with min_vruntime.

Important Data Structures

Several data structures are used extensively in the scheduler: rq, cfs_rq, sched_entity, load_weight and task_struct.

rq

The per CPU run queue data structure that links to the individual schedulers CFS, real-time and deadline specified in cfs_rq, rt_rq and dl_rq. Some fields of interest with their offsets are

1
2
3
u64                        nr_switches;          /*   112     8 */
struct task_struct *       curr;                 /*  2528     8 */
int                        cpu;                  /*  2784     4 */

We can for example hook into update_curr to see the monotonically increasing context-switch counter nr_switches by setting up a kprobe.

1
2
3
4
5
6
7
8
9
10
PROMPT> echo 'p:update_curr update_curr nr_switches=+112(+304(%di)):u64 ' > /sys/kernel/tracing/kprobe_events
PROMPT> trace-cmd record -e sched_switch -e update_curr sleep 0.2
PROMPT> trace-cmd report --cpu 9
  MediaSu~isor #3-273892 [009] 13235.594093: update_curr:          (ffffffffa26e5710) nr_switches=10915771
  MediaSu~isor #3-273892 [009] 13235.594099: sched_switch:         prev_comm=MediaSu~isor #3 prev_pid=273892 prev_prio=120 prev_state=S ==> next_comm=swapper/9 next_pid=0 next_prio=120
          <idle>-0     [009] 13235.598506: update_curr:          (ffffffffa26e5710) nr_switches=10915772
          <idle>-0     [009] 13235.598511: sched_switch:         prev_comm=swapper/9 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/u24:0 next_pid=287430 next_prio=120
   kworker/u24:0-287430 [009] 13235.598519: update_curr:          (ffffffffa26e5710) nr_switches=10915773
   kworker/u24:0-287430 [009] 13235.598522: sched_switch:         prev_comm=kworker/u24:0 prev_pid=287430 prev_prio=120 prev_state=I ==> next_comm=swapper/9 next_pid=0 next_prio=120
          <idle>-0     [009] 13235.600406: update_curr:          (ffffffffa26e5710) nr_switches=10915774

The timestamps, e.g. 13235.594093, capture the time in seconds the event occured with the zero corresponding to the booting of the system (compare to /proc/uptime). The sched_switch event lists the state of the previous process in prev_state. The states are defined in /sys/kernel/tracing/events/sched/sched_switch/format. We can see that only eight states are parsed from the ones defined in include/linux/sched.h. Interestingly, in the listing above: when the event sched_switch occurred in the kworker process, the task was dead already (prev_state=I).

state comment
S TASK_INTERRUPTIBLE
R TASK_RUNNING
D TASK_UNINTERRUPTIBLE uninterruptible sleep
Z TASK_ZOMBIE child waits to be reaped by parent
T TASK_STOPPED
P TASK_PARKED
X EXIT_DEAD
I TASK_DEAD
t TASK_traced

Further information can be added to these states, and a command such as ps l may reveal modifiers such as s and + to denote session leaders and being a member of the foreground process group respectively. Also, the command top displays information such as process state, PIDs and priorities.

1
2
3
4
5
6
7
8
9
10
11
12
13
PROMPT> top -u root
top - 21:39:41 up 14:15,  1 user,  load average: 0.28, 0.42, 0.39
Tasks: 266 total,   1 running, 265 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.7 us,  0.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.2 hi,  0.1 si,  0.0 st
MiB Mem :  31947.9 total,  17313.0 free,   2498.2 used,  12136.7 buff/cache
MiB Swap:   8192.0 total,   8192.0 free,      0.0 used.  28246.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  28408 root      -2   0       0      0      0 S   1.0   0.0   2:03.01 mt76-usb-rx phy2
  37695 root      20   0       0      0      0 I   0.7   0.0   0:01.61 kworker/u24:3-events_power_efficient
  38304 root      20   0       0      0      0 I   0.3   0.0   0:00.01 kworker/4:1-events
      1 root      20   0  167916  13592   9140 S   0.0   0.0   0:01.61 systemd
      2 root      20   0       0      0      0 S   0.0   0.0   0:00.02 kthreadd

The priority of -2 of the process mt76-usb-rx identifies it as being managed by the real-time scheduler. The listed priority is mapped to niceness level by calculating PR - 20. The use of the real-time scheduler for the process with PID 28408 can also be corroborated through

1
2
3
PROMPT> chrt -p 28408
pid 28408's current scheduling policy: SCHED_FIFO
pid 28408's current scheduling priority: 1

Here SCHED_FIFO indicates a real-time process, SCHED_OTHER would indicate a process handled by e.g. CFS.

cfs_rq

Each CPU and task group has their own running queue cfs_rq. Some important members with their offsets are

1
2
3
4
5
struct load_weight         load;                  /*    0      16   sum of se weights on queue */
unsigned int               nr_running             /*    16     4    number of se on queue */
struct sched_entity *      curr;                 /*    80     8 */
struct sched_entity *      next;                 /*    88     8 */
struct sched_entity *      last;                 /*    96     8 */

sched_entity

Scheduling entities keep track of runtime and loads. They were introduced in order to implement group scheduling, so that CFS will provide fair CPU time not only for individual tasks but also to groups of tasks. The member my_rq points to a cfs_rq structure for a group, and to NULL for a task. While nice values are used to define the weights of tasks, task groups rely on cpu.shares files stored in /sys/fs/cgroup/cpu/, an example is given here. The default value of cpu.shares is 1024. A few offsets of interest in the sched_enity structure are

1
2
3
4
5
struct load_weight         load;                 /*     0    16 */
u64                        exec_start;           /*    64     8 */
u64                        sum_exec_runtime;     /*    72     8 */
u64                        vruntime;             /*    80     8 */
u64                        prev_sum_exec_runtime; /*    88     8 */

The struct members vruntime and sum_exec_runtime account time in ns. All updates to the runtime are initiated by the function update_curr().

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
static void update_curr(struct cfs_rq *cfs_rq)
{
  struct sched_entity *curr = cfs_rq->curr;
  u64 now = rq_clock_task(rq_of(cfs_rq));
  u64 delta_exec;

  if (unlikely(!curr))
    return;

  delta_exec = now - curr->exec_start;
  if (unlikely((s64)delta_exec <= 0))
    return;

  curr->exec_start = now;

  if (schedstat_enabled()) {
    struct sched_statistics *stats;

    stats = __schedstats_from_se(curr);
    __schedstat_set(stats->exec_max,
                          max(delta_exec, stats->exec_max));
  }

  curr->sum_exec_runtime += delta_exec;
  schedstat_add(cfs_rq->exec_clock, delta_exec);

  curr->vruntime += calc_delta_fair(delta_exec, curr);
  update_min_vruntime(cfs_rq);

  if (entity_is_task(curr)) {
    struct task_struct *curtask = task_of(curr);

    trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
    cgroup_account_cputime(curtask, delta_exec);
    account_group_exec_runtime(curtask, delta_exec);
  }

  account_cfs_rq_runtime(cfs_rq, delta_exec);
}

sum_exec_runtime will be updated with delta_exec directly, while delta_exec might be scaled before being added to vruntime. Checking the calc_delta_fair routine scaling will only occur if load.weight doesn’t match NICE_0_LOAD. The unlikely macro tells the branch predictor that this is, well, unlikely to occur.

1
2
3
4
5
6
7
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
  if (unlikely(se->load.weight != NICE_0_LOAD))
    delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

  return delta;
}

We can hook a new kprobe into update_curr() to extract runtimes of a task. update_curr() has the call signature static void update_curr(struct cfs_rq *cfs_rq) and enables us to offset (+80) from the first argument %rdi to the struct sched_entity and its members exec_start (+64), sum_exec_runtime (+72), vruntime (+80) and prev_sum_exec_runtime (+88). Furthermore, we can walk through the structs to output the task priority also: cfs_rq->rq->task_struct.

1
2
3
4
5
PROMPT> echo 'p:uc_timers update_curr exec_start=+64(+80(%di)):u64 sum_exec=+72(+80(%di)):u64 vruntime=+80(+80(%di)):u64 prev_sum_exec=+88(+80(%di)):u64 prio=+108(+2528(+304(%di))):u32' > /sys/kernel/tracing/kprobe_events
PROMPT> trace-cmd record -e uc_timers sleep 1
PROMPT> trace-cmd report --cpu 7 | grep -E "kworker.*uc_timers"
  kworker/u24:3-11374 [007]  8726.032729: uc_timers:            (ffffffff992e5710) exec_start=8719581063365 sum_exec=631138571 vruntime=194806346164 prev_sum_exec=631138571 prio=120
  kworker/u24:3-11374 [007]  8726.032943: uc_timers:            (ffffffff992e5710) exec_start=8719581277954 sum_exec=631214362 vruntime=194806421955 prev_sum_exec=631214362 prio=120

In the case of the kworker the same delta was added to both sum_exec and vruntime in the update_curr() calls. From the context switch event we can see that the kworker process with PID=11374 has the DEFAULT_PRIORITY defined in include/linux/sched/prio.h of 120 which corresponds to a niceness value of 0 and associated load on 64bit architecture of 1048576 (1024 << 10). Feeling lucky, we can also attach a return probe to update_curr() and use the rdi register to access cfs_rq , given that it was not overwritten previously.

1
2
3
4
5
6
7
8
9
10
11
12
PROMPT> echo 'p:uc_other update_curr q_weight=+0(%di):u64 prio=+108(+2528(+304(%di))):u32 vruntime_curr=+80(+80(%di)):u64 exec_start=+64(+80(%di)):u64' > /sys/kernel/tracing/kprobe_events
PROMPT> echo 'r:uc_out update_curr weight=+0(%di):u64 vruntime_curr=+80(+80(%di)):u64 exec_start=+64(+80(%di)):u64' >> /sys/kernel/tracing/kprobe_events
PROMPT> echo 'p:delta_exec cpuacct_charge delta_time=%si:u64 pid=+1512(%di):u32 tgid=+1516(%di):u32' >> /sys/kernel/tracing/kprobe_events
PROMPT> trace-cmd record -e uc_other -e delta_exec -e uc_out sleep 1
PROMPT> trace-cmd record
   StyleThread#4-9384  [005] 21198.133807: uc_other:             (ffffffff914e5710) weight=1048576 prio=120 vruntime_curr=786288984603 exec_start=21157551308351
   StyleThread#4-9384  [005] 21198.133809: delta_exec:           (ffffffff915022d4) delta_time=23743 pid=9384 tgid=9307
   StyleThread#4-9384  [005] 21198.133811: uc_out:               (ffffffff914e5710 <- ffffffff914ee241) q_weight=18446634536175554560 vruntime_curr=225908 exec_start=154539425890761520
   StyleThread#4-9384  [005] 21198.133830: uc_other:             (ffffffff914e5710) weight=1048576 prio=120 vruntime_curr=786289008346 exec_start=21157551332094
   StyleThread#4-9384  [005] 21198.133832: delta_exec:           (ffffffff915022d4) delta_time=29782 pid=9384 tgid=9307
   StyleThread#4-9384  [005] 21198.133833: uc_out:               (ffffffff914e5710 <- ffffffff914e7941) q_weight=18446634536175554560 vruntime_curr=0 exec_start=9098485526505481549
   StyleThread#4-9384  [005] 21198.133847: uc_other:             (ffffffff914e5710) weight=736953 prio=120 vruntime_curr=1927733407963 exec_start=21157551332094

We can actually see that whenever the delta_exec event occurs our rdi register doesn’t point to the cfs_rq struct as the values for weight, vruntime and exec_start are all messed up. However, the delta_time = 23743 (units of ns) corresponds to the difference between the exec_start reported in the surrounding uc_other events. I hooked into the call cpuacct_charge() that receives delta_exec as the second argument and lets me avoid doing the math myself. As the weight is 1048576, corresponding again to a niceness of zero, vruntime is updated by the same amount.

load_weight

The weight of a task that is initialized according to its niceness level. The data structure load_weight is used by both cfs_rq and sched_entity

1
2
long unsigned int          weight;               /*     0     8 */
u32                        inv_weight;           /*     8     4 */

For 32bit architecture weight is set to the entry corresponding to its niceness level from the array sched_prio_to_weight in kernel/sched/core.c.

1
2
3
4
5
6
7
8
9
10
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45, 
 /*  15 */        36,        29,        23,        18,        15,
};

inv_weight are the precomputed inverses 2^32/weight retrieved from sched_prio_to_wmult[40]. Insight about the origins of the weights can be found here. During their lifetime the weights of a task can be adjusted with the function reweight_entity(). If we run a a stack-trace, reweight_entity is generally called from either dequeue_entity() or enqueue_entity(). For example

1
2
3
4
5
6
7
8
9
PROMPT> trace-cmd record -p function -l reweight_entity --func-stack sleep 1
PROMPT> trace-cmd report
  => reweight_entity
  => dequeue_entity
  => dequeue_task_fair
  => __schedule
  => schedule
  => futex_wait_queue
  => futex_wait

When a task gives up control it will call schedule and reweight_entity is called along the way. reweight_entity is calling update_curr, as this snippet shows

1
2
3
4
5
6
7
8
9
10
11
12
13
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
                            unsigned long weight)
{
        if (se->on_rq) {
                /* commit outstanding execution time */
                if (cfs_rq->curr == se)
                        update_curr(cfs_rq);
                update_load_sub(&cfs_rq->load, se->load.weight);
        }
        dequeue_load_avg(cfs_rq, se);

        update_load_set(&se->load, weight);
...

All calls but the one to update_curr() are inlined, this shows also in the disassembly of reweight_entity(). For example, update_load_sub() is translated without any additional calls to

1
2
3
4
5
0xffffffff810e5912 <+18>:	mov    edx,DWORD PTR [rsi+0x38]
0xffffffff810e5915 <+21>:	test   edx,edx
0xffffffff810e5917 <+23>:	je     0xffffffff810e5930 <reweight_entity+48>
0xffffffff810e5919 <+25>:	cmp    QWORD PTR [rdi+0x50],rsi
0xffffffff810e591d <+29>:	je     0xffffffff810e5a44 <reweight_entity+324>

We can also use the disassembly, to check which registers we can use for kprobes. mov edx,DWORD PTR [rsi+0x38] checks whether se is in the queue at the moment, as sched_entity is passed in rsi. After the call to update_curr(), rdi is restored from the stack and not overwritten anymore before the function exits.

1
2
3
4
0xffffffff810e5a44 <+324>:	mov    QWORD PTR [rsp],rdi
0xffffffff810e5a48 <+328>:	call   0xffffffff810e5710 <update_curr>
0xffffffff810e5a4d <+333>:	mov    rdi,QWORD PTR [rsp]
0xffffffff810e5a51 <+337>:	jmp    0xffffffff810e5923 <reweight_entity+35>

Hence, in the case of a dequeue event, se->curr will be the element being updated and we have access to it through the rdi register at the function call and when the function exits.

1
2
3
4
5
6
7
8
PROMPT> echo 'p:dequeue dequeue_entity' > /sys/kernel/tracing/kprobe_events
PROMPT> echo 'p:reweight reweight_entity se_weight=+0(%si):u64 new_weight=%dx:u64 se_curr_weight=+0(+80(%di)):u64 se_next_weight=+0(+88(%di)):u64 se_last_weight=+0(+96(%di)):u64' >> /sys/kernel/tracing/kprobe_events
PROMPT> echo 'r:reweight_return reweight_entity se_curr_weight=+0(+80(%di)):u64 se_next_weight=+0(+88(%di)):u64 se_last_weight=+0(+96(%di)):u64' >> /sys/kernel/tracing/kprobe_events
PROMPT> trace-cmd record -e reweight -e reweight_return -e enqueue -e dequeue sleep 1
PROMPT> trace-cmd report
  MediaSu~sor #24-8052  [001] 17222.181895: dequeue:              (ffffffffb4ee7920)
  MediaSu~sor #24-8052  [001] 17222.181897: reweight:             (ffffffffb4ee5900) se_weight=998829 new_weight=76260 se_curr_weight=998829 se_next_weight=0 se_last_weight=0
  MediaSu~sor #24-8052  [001] 17222.181907: reweight_return:      (ffffffffb4ee5900 <- ffffffffb4ee7a01) se_curr_weight=76260 se_next_weight=0 se_last_weight=0

Here we can see that the weight of task with PID 8052 after the dequeue event is updated from 998829 to 76260. This return probe confirms that the update was succesful.

task_struct

This massive structure is used by every kernel and user thread in a Linux system. The structure encompasses information of task usage of memory, CPU scheduling details, file descriptors, credentials and much more. Some information such as PID, locks and timers are not inherited accross fork(). Some of task_struct members and offsets are as follows

1
2
3
pid_t                      pid;                  /*  1512     4 */
pid_t                      tgid;                 /*  1516     4 */
struct files_struct *files                       /*  2112     8 */

The task list is a circular doubly linked list that chains task structures together. Identifiers for tasks are swapped in user and kernel space. In kernel space every thread has a unique PID and if they belong to the same process they share a thread group identifier TGID.

1
2
3
4
5
6
7
8
9
10
PROMPT> echo 'p:uc_other update_curr pid=+1512(+2528(+304(%di))):u32 tgid=+1516(+2528(+304(%di))):u32' > /sys/kernel/tracing/kprobe_events
PROMPT> trace-cmd record -e uc_other sleep 0.5
PROMPT> trace-cmd report
 MediaSu~isor #1-7303  [003]  8435.578595: uc_other:             (ffffffff896e5710) pid=7303 tgid=4774
          <idle>-0     [000]  8435.578602: uc_other:             (ffffffff896e5710) pid=0 tgid=0
   kworker/u24:2-11507 [000]  8435.578619: uc_other:             (ffffffff896e5710) pid=11507 tgid=11507
          <idle>-0     [011]  8435.578634: uc_other:             (ffffffff896e5710) pid=0 tgid=0
          <idle>-0     [011]  8435.578638: uc_other:             (ffffffff896e5710) pid=0 tgid=0
 MediaDe~hine #1-7302  [009]  8435.578658: uc_other:             (ffffffff896e5710) pid=7302 tgid=4774
 MediaDe~hine #1-7302  [009]  8435.578660: uc_other:             (ffffffff896e5710) pid=7302 tgid=4774

We see that several different tasks belong to the same thread group 4774. In user space threads belonging to the same process share the same PID, while having distinct LeightWeight Process LWP identifiers, see also Ref. Linux Kernel Programming. We can display these two identifiers PID and LWP using

1
2
3
4
PROMPT> ps -LA | grep 4774
    PID     LWP TTY          TIME CMD
   4774    7302 ?        00:00:43 MediaDe~hine #1
   4774    7303 ?        00:00:19 MediaSu~isor #1

As expected, the PID corresponds to the TGID from kernel space and the LWP matches the PID

How the Kernel Periodically Regains Control

The linux kernel is of monolithic nature. All kernel functions are either entered from a system call or from a peripheral hardware interrupt and hence kernel code runs in one of two exclusive contexts: process or interrupt. As the kernel doesn’t run in the background to watch over the next tasks to be executed, scheduling can be only initiated two ways: from the process itself schedule() or because of an interrupt scheduler_tick(). scheduler_tick() is called through the hrtimer_interrupt as the following stack trace shows.

1
2
3
4
5
6
7
8
9
10
11
PROMPT> trace-cmd record -p function -l scheduler_tick --func-stack sleep 0.5
PROMPT> trace-cmd report
    <idle>-0     [001] 40860.004710: function:             scheduler_tick <-- update_process_times
    <idle>-0     [001] 40860.004732: kernel_stack:         	=> ftrace_trampoline
  => scheduler_tick
  => update_process_times
  => tick_sched_handle
  => tick_sched_timer
  => __hrtimer_run_queues
  => hrtimer_interrupt
  => __sysvec_apic_timer_interrupt

This interrupt fires HZ-times a second. The period itself is defined in the kernel config options ( zcat /proc/config.gz) - for me as 300 times a second, i.e. every 3.3ms. (CONFIG_HZ_300=y; CONFIG_HZ=300). In addition, my kernel is configured with the flag CONFIG_NO_HZ=y per kernel documentation recommendation to omit scheduling on idle CPUs. Weirdly, against documentation my kernel is also configured with the flag CONFIG_NO_HZ_FULL=y that is recommended for running HPC workloads or real-time applications and keeps the kernel from bothering tasks. The following listing shows the events hrtimer_interrupt, schedule and sched_switch and the time difference between events (bracketed value behind the timestamp). We can see that in the idle-block in the middle hrtimer fires mostly at around 3.3ms as configured.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
PROMPT> echo 'p:hrtimer hrtimer_interrupt' > /sys/kernel/tracing/kprobe_events
PROMPT> echo 'p:schedule schedule' >> /sys/kernel/tracing/kprobe_events
PROMPT> trace-cmd record -e sched_switch -e hrtimer -e schedule sleep 1
PROMPT> trace-cmd report --cpu 1 --ts-diff
     <idle>-0     [001] 50575.103257: (+3333)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.203388: (+100131) hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.203409: (+21)    sched_switch:         prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=i3status next_pid=1059 next_prio=120
   i3status-1059  [001] 50575.203547: (+138)   schedule:             (ffffffff939e8894)
   i3status-1059  [001] 50575.203552: (+5)     sched_switch:         prev_comm=i3status prev_pid=1059 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
     <idle>-0     [001] 50575.206647: (+3095)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.210048: (+3401)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.213391: (+3343)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.216654: (+3263)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.220071: (+3417)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.223419: (+3348)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.226661: (+3242)  hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.302309: (+75648) hrtimer:              (ffffffff92f4ee54)
     <idle>-0     [001] 50575.302342: (+33)    sched_switch:         prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=Timer next_pid=36616 next_prio=120
      Timer-36616 [001] 50575.302412: (+70)    schedule:             (ffffffff939e8894)
      Timer-36616 [001] 50575.302422: (+10)    sched_switch:         prev_comm=Timer prev_pid=36616 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
     <idle>-0     [001] 50575.379271: (+76849) sched_switch:         prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=trace-cmd next_pid=37943 next_prio=120
  trace-cmd-37943 [001] 50575.379278: (+7)     schedule:             (ffffffff939e8894)
  trace-cmd-37943 [001] 50575.379286: (+8)     sched_switch:         prev_comm=trace-cmd prev_pid=37943 prev_prio=120 prev_state=S ==> next_comm=swapper/1 next_pid=0 next_prio=120
     <idle>-0     [001] 50575.650179: (+270893) sched_switch:         prev_comm=swapper/1 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/1:0 next_pid=37674 next_prio=120
kworker/1:0-37674 [001] 50575.650190: (+11)    schedule:             (ffffffff939e8894)
kworker/1:0-37674 [001] 50575.650210: (+20)    sched_switch:         prev_comm=kworker/1:0 prev_pid=37674 prev_prio=120 prev_state=I ==> next_comm=swapper/1 next_pid=0 next_prio=120

Once the context is switched, indicated by the event sched_switch, the other tasks such as trace-cmd, kworker and alike never run long enough to preempted by the kernel but instead initiate scheduling themselves through calls to schedule(). The initiation from user space shows also in the stack trace of schedule() that lists a syscall at the beginning of the stack.

1
2
3
4
5
6
7
8
9
10
11
PROMPT> trace-cmd record -p function -l schedule --func-stack sleep 0.5
PROMPT> trace-cmd report
  MediaDe~hine #3-29224 [006] 40994.048576: function:             schedule <-- futex_wait_queue
  MediaDe~hine #3-29224 [006] 40994.048589: kernel_stack:         	=> ftrace_trampoline
	=> schedule
	=> futex_wait_queue
	=> futex_wait
	=> do_futex
	=> __x64_sys_futex
	=> do_syscall_64
	=> entry_SYSCALL_64_after_hwframe

Conclusion

This post was a beast. I learned a bunch and feel like I have still only scratched the surface of the complex topic of the scheduler. ftrace truly proved its worth in this one.

Additional Resources

Evolution of the x86 context switch in Linux