PatchworkOS  da8a090
A non-POSIX operating system.
The EEVDF Scheduler

The Earliest Eligible Virtual Deadline First (EEVDF) scheduler. More...


Detailed Description

The Earliest Eligible Virtual Deadline First (EEVDF) scheduler.

The scheduler is responsible for allocating CPU time to threads. It does this in such a way as to create the illusion that multiple threads are running simultaneously on a single CPU. Consider that a video is in reality just a series of still images, rapidly displayed one after the other. The scheduler works in the same way, rapidly switching between threads to give the illusion of simultaneous execution.

PatchworkOS uses the Earliest Eligible Virtual Deadline First (EEVDF) algorithm for its scheduler, which is a proportional share scheduling algorithm that aims to fairly distribute CPU time among threads based on their weights. This is in contrast to more traditional scheduling algorithms like round-robin or priority queues.

The algorithm is relatively simple conceptually, but it is also very fragile: even small mistakes can easily result in highly unfair scheduling. Therefore, if you find issues or bugs with the scheduler, please open an issue in the GitHub repository.

Included below is an overview of how the scheduler works and the relevant concepts. If you are unfamiliar with mathematical notation, don't worry; we will explain everything in plain English as well.

Weight and Priority

First, we need to assign each thread a "weight", denoted as \(w_i\), where \(i\) uniquely identifies the thread. For completeness, let's also define the set \(A(t)\) which contains all active threads at real time \(t\).

A thread's weight is calculated as the sum of its process's priority and the constant SCHED_WEIGHT_BASE. The constant is needed to ensure that all threads have a weight greater than zero, as a weight of zero would result in division by zero errors later on.

The weight is what determines the share of CPU time a thread ought to receive, with a higher weight receiving a larger share. Specifically, the fraction of CPU time a thread receives is proportional to its weight relative to the total weight of all active threads. This is implemented using "virtual time", as described below.
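
As a minimal sketch of that computation, assuming the definitions from sched.h are in scope (the helper name and parameter are illustrative, not the kernel's exact code):

// Sketch: a thread's weight is its process's priority plus SCHED_WEIGHT_BASE.
// SCHED_WEIGHT_BASE guarantees a strictly positive result, so later divisions
// by the weight (and by the total weight) are safe.
static inline uint64_t sched_weight_from_priority(uint64_t priority)
{
    return priority + SCHED_WEIGHT_BASE;
}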

See also
EEVDF page 2.

Virtual Time

The first relevant concept that the EEVDF algorithm introduces is "virtual time". Each scheduler maintains a "virtual clock" that runs at a rate inversely proportional to the total weight of all active threads (all threads in the runqueue). So, if the total weight is \(10\) then each unit of virtual time corresponds to \(10\) units of real CPU time.

Each thread should receive an amount of real time equal to its weight for each virtual time unit that passes. For example, if we have two threads, A and B, with weights \(2\) and \(3\) respectively, then for every \(1\) unit of virtual time, thread A should receive \(2\) units of real time and thread B should receive \(3\) units of real time. This is equivalent to saying that for every \(5\) units of real time, thread A should receive \(2\) units and thread B should receive \(3\) units.

Using this definition of virtual time, we can determine the amount of virtual time \(v\) that has passed between two points in real time \(t_1\) and \(t_2\) as

\begin{equation*} v = \frac{t_2 - t_1}{\sum_{i \in A(t_2)} w_i} \end{equation*}

under the assumption that \(A(t_1) = A(t_2)\), i.e. the set of active threads has not changed between \(t_1\) and \(t_2\).

Note how the denominator containing the \(\sum\) symbol evaluates to the sum of all weights \(w_i\) for each active thread \(i\) in \(A\) at \(t_2\), i.e. the total weight of the scheduler cached in sched->totalWeight. In pseudocode, this can be expressed as

vclock_t vtime = (sys_time_uptime() - oldTime) / sched->totalWeight;

Additionally, the amount of real time a thread should receive \(r_i\) in a given duration of virtual time \(v\) can be calculated as

\begin{equation*} r_i = v \cdot w_i. \end{equation*}

In practice, all we are doing is taking a duration of real time equal to the total weight of all active threads, and saying that each thread ought to receive a portion of that time equal to its weight. Virtual time is just a trick to simplify the math.
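
Converting a virtual duration back into a thread's real-time share is therefore a single multiplication. A minimal sketch, using plain integers and ignoring the fixed-point representation introduced under "Rounding Errors" below (names are illustrative):

// Sketch: r_i = v * w_i. Over `vElapsed` units of virtual time, a thread is
// owed its weight in real time per virtual unit, e.g. thread A (w=2) is owed
// 2 units and thread B (w=3) is owed 3 units per virtual unit.
static inline uint64_t sched_due_real_time(uint64_t vElapsed, uint64_t weight)
{
    return vElapsed * weight;
}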

Note
All variables storing virtual time values will be prefixed with 'v' and use the vclock_t type. Variables storing real time values will use the clock_t type as normal.
See also
EEVDF pages 8-9.

Lag

Now we can move on to the metrics used to select threads. As the name "Earliest Eligible Virtual Deadline First" suggests, there are two main concepts relevant to this process: a thread's "eligibility" and its "virtual deadline". We will start with eligibility, which is determined by the concept of "lag".

The lag \(l_i\) of a thread \(i\) is defined as

\begin{equation*} l_i = r_{i}^{should} - r_{i}^{actual} \end{equation*}

where \(r_{i}^{should}\) is the amount of real time thread \(i\) should have received and \(r_{i}^{actual}\) is the amount of real time thread \(i\) has actually received.

If we assume that all real time is allocated among the threads in \(A(t)\), which is always the case, we can see that

\begin{equation*} \sum_{i \in A(t)} \left(r_{i}^{should} - r_{i}^{actual}\right) = 0 \end{equation*}

or in other words, the sum of all lag values across all active threads is always zero.

As an example, let's say we have three threads A, B and C with equal weights. To start with, each thread is supposed to have run for 0ms and has actually run for 0ms, so their lag values are:

Thread    Lag (ms)
A         0
B         0
C         0

Now, let's say we give a 30ms (in real time) time slice to thread A, while threads B and C do not run at all. After this, the lag values would be:

Thread    Lag (ms)
A         -20
B         10
C         10

What just happened is that each thread should have received one third of the real time (since all weights are equal, each thread's weight is one third of the total weight), which is 10ms. Thread A actually received 30ms of real time, so it has run for 20ms more than it should have. Meanwhile, threads B and C have not received any real time at all, so they are each "owed" 10ms.

Additionally notice that \(0 + 0 + 0 = 0\) and \(-20 + 10 + 10 = 0\), i.e. the sum of all lag values is still zero.

Finally, this lets us determine the eligibility of a thread. A thread is considered eligible if, and only if, its lag is greater than or equal to zero. In the above example, threads B and C are eligible to run, while thread A is not. Notice that because the lag values always sum to zero, there will always be at least one eligible thread as long as there is at least one active thread: if some thread has negative lag, at least one other thread must have positive lag to balance it out.
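
The eligibility test itself is trivial. A conceptual sketch only; as described under "Eligible Time" below, the real implementation derives eligibility from the eligible time rather than storing lag directly:

// Conceptual sketch: a thread may run if it is not ahead of its fair share,
// i.e. its lag is non-negative.
static inline bool sched_lag_is_eligible(lag_t lag)
{
    return lag >= SCHED_FIXED_ZERO;
}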

Note
Fairness is achieved over the long run: the proportion of real time each thread has received converges to the share it ought to receive. The scheduler does not guarantee that each individual time slice is exactly correct, hence it's acceptable for thread A to receive 30ms of real time in the above example.
See also
EEVDF pages 3-5.
Completing the EEVDF Scheduler.

Eligible Time

In most cases, it's undesirable to track lag directly, as doing so would require updating the lag of every thread whenever the scheduler's virtual time is updated, which would violate the desired \(O(\log n)\) complexity of the scheduler.

Instead, EEVDF defines the concept of "eligible time" as the virtual time at which a thread's lag becomes zero, which is equivalent to the virtual time at which the thread becomes eligible to run.

When a thread enters the scheduler for the first time, its eligible time \(v_{ei}\) is set to the scheduler's current virtual time, which is equivalent to a lag of \(0\). Whenever the thread runs, its eligible time is advanced by the amount of virtual time corresponding to the real time it used. This can be calculated as

\begin{equation*} v_{ei} = v_{ei} + \frac{t_{used}}{w_i} \end{equation*}

where \(t_{used}\) is the amount of real time the thread has used, and \(w_i\) is the thread's weight.
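
As a hedged sketch of that update in the 128-bit fixed point described later (variable and field names here are assumptions, not the kernel's exact code):

// Sketch: advance the eligible time by t_used / w_i, in fixed point.
client->vEligible += SCHED_FIXED_TO(tUsed) / (vclock_t)weight;

// The thread is then eligible exactly when the scheduler's virtual clock has
// caught up to its eligible time (field names assumed).
bool eligible = client->vEligible <= sched->vTime;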

See also
EEVDF pages 10-12 and 14.

Virtual Deadlines

We can now move on to the other part of the name, "virtual deadline", which is defined as the earliest time at which a thread should have received its due share of CPU time, rounded to some quantum. The scheduler always selects the eligible thread with the earliest virtual deadline to run next.

We can calculate the virtual deadline \(v_{di}\) of a thread as

\begin{equation*} v_{di} = v_{ei} + \frac{Q}{w_i} \end{equation*}

where \(Q\) is a constant time slice defined by the scheduler, in our case CONFIG_TIME_SLICE.
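
With illustrative field names, and again deferring to the fixed-point representation described below, the deadline update is a single line:

// Sketch: v_di = v_ei + Q / w_i, where Q = CONFIG_TIME_SLICE.
client->vDeadline = client->vEligible + SCHED_FIXED_TO(CONFIG_TIME_SLICE) / (vclock_t)client->weight;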

See also
EEVDF page 3.

Rounding Errors

Before describing the implementation, it is important to note that due to the nature of integer division, rounding errors are inevitable when calculating virtual time and lag.

For example, when computing \(10/3 = 3.333...\) we instead get \(3\), losing the fractional part. Over time, these small errors can accumulate and lead to unfair scheduling.

It might be tempting to use floating point to mitigate these errors; however, using floating point in a kernel is generally considered very bad practice. Ideally, only user space should be using floating point.

Instead, we use a simple technique to mitigate the impact of rounding errors. We represent virtual time and lag using 128-bit fixed-point arithmetic, where the lower 63 bits represent the fractional part.

There are two reasons for the decision to use 128 bits over 64 bits despite the performance cost. First, it means that even the maximum possible value of uptime, stored using 64 bits, can still be represented in the fixed-point format without overflowing the integer part, meaning we don't need to worry about overflow at all.

Second, testing shows that lag accumulates an error of about \(10^3\) to \(10^4\) in the fractional part every second under heavy load. With 64 bits and a fixed-point offset of 20 bits, this would result in an error of approximately 1 nanosecond per minute, and since the testing was not particularly rigorous, it might be significantly worse in practice. Note that each division can introduce an error of up to the divisor minus one in the fractional part.

If we instead use 128 bits with a fixed point offset of 63 bits, the same error of \(10^4\) in the fractional part results in an error of approximately \(1.7 \cdot 10^{-9}\) nanoseconds per year, which is obviously negligible even if the actual error is in reality several orders of magnitude worse.

For comparisons between vclock_t values, we consider two values equal if the difference between their whole parts is less than or equal to SCHED_EPSILON.

Some might feel concerned about the performance impact of using 128-bit arithmetic. However, consider that by using 128-bit arithmetic, we no longer need any other means of reducing rounding errors: we don't need to worry about preserving remainders from divisions, dividing to the nearest integer instead of rounding down, and so on. This not only simplifies the code drastically, making it more approachable, but it also means that, in practice, the performance impact is negligible. It's a very simple brute-force solution, but simple does not mean bad.
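
As a sketch of how the fixed-point macros documented below fit together (`vclock_eq` is a hypothetical helper name, not necessarily the kernel's):

// Sketch: compare two fixed-point values by whole part, within SCHED_EPSILON.
static inline bool vclock_eq(vclock_t a, vclock_t b)
{
    int64_t diff = SCHED_FIXED_FROM(a - b);
    return diff >= -SCHED_EPSILON && diff <= SCHED_EPSILON;
}

// Example: SCHED_FIXED_TO(10) / 3 keeps ~0.333... in its low 63 bits instead
// of truncating it away; SCHED_FIXED_FROM of that value still yields 3.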

See also
Fixed Point Arithmetic

Scheduling

With the central concepts introduced, we can now describe how the scheduler works. As mentioned, the goal is to always run the eligible thread with the earliest virtual deadline. To achieve this, each scheduler maintains a runqueue in the form of a Red-Black tree sorted by each thread's virtual deadline.

To select the next thread to run, we find the first eligible thread in the runqueue and switch to it. If no eligible thread is found (which, since lag sums to zero, can only happen when the runqueue is empty), we switch to the idle thread. This lookup is optimized by storing the minimum eligible time of each subtree in each node of the runqueue, allowing us to skip entire subtrees that contain no eligible threads.
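
A simplified sketch of that lookup, assuming each node caches the minimum eligible time of its subtree (node layout and accessor names are illustrative, not the kernel's exact code):

// Sketch: in-order walk of the deadline-sorted tree, pruning any subtree
// whose cached minimum eligible time is still in the future. The first
// eligible thread found is the one with the earliest virtual deadline.
static sched_client_t* sched_pick_next(node_t* node, vclock_t vNow)
{
    if (node == NULL || node->minVEligible > vNow)
    {
        return NULL; // no eligible thread anywhere in this subtree
    }

    // Earliest deadlines are leftmost, so prefer the left subtree.
    sched_client_t* best = sched_pick_next(node->left, vNow);
    if (best != NULL)
    {
        return best;
    }

    sched_client_t* client = CONTAINER_OF(node, sched_client_t, node); // accessor assumed
    if (client->vEligible <= vNow)
    {
        return client;
    }

    return sched_pick_next(node->right, vNow);
}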

Preemption

If, at any point in time, a thread with an earlier virtual deadline becomes available to run (for example, when a thread is unblocked), the scheduler will preempt the currently running thread and switch to the newly available thread.

Idle Thread

The idle thread is a special thread that is not considered active (not stored in the runqueue) and simply runs an infinite loop that halts the CPU while waiting for an interrupt signaling that a non-idle thread is available to run. Each CPU has its own idle thread.

Load Balancing

Each CPU has its own scheduler and associated runqueue, so we need to balance the load between CPUs, ideally without causing too many cache misses. This means we want to keep threads which have recently run on a CPU on the same CPU when possible. As such, we define a thread to be "cache-cold" on a CPU if the time since it last ran on that CPU is greater than CONFIG_CACHE_HOT_THRESHOLD; otherwise it's considered "cache-hot".
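
A minimal sketch of that test (the thread field names are assumptions; sys_time_uptime and CONFIG_CACHE_HOT_THRESHOLD are as documented elsewhere):

// Sketch: a thread is cache-hot on a CPU if it last ran there recently.
static inline bool sched_is_cache_hot(const thread_t* thread, const cpu_t* cpu)
{
    return thread->lastCpu == cpu->id &&
        sys_time_uptime() - thread->lastRunTime <= CONFIG_CACHE_HOT_THRESHOLD;
}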

We use two mechanisms to balance the load between CPUs, one push mechanism and one pull mechanism.

The push mechanism is used when a thread is submitted to the scheduler, i.e. when it is created or unblocked. In this case, if the thread is cache-cold, it will be added to the runqueue of the CPU with the lowest total weight. Otherwise, it will be added to the runqueue of the CPU it last ran on.

The pull mechanism, also called work stealing, is used when a CPU is about to become idle. The CPU will find the CPU with the highest total weight and steal the first cache-cold thread from its runqueue. If no cache-cold thread is found, it will simply run the idle thread.
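
Putting both rules together, a hedged sketch of the push-side CPU choice, reusing the cache-hot test sketched above (`cpu_get` and `cpu_with_lowest_total_weight` are hypothetical accessors):

// Sketch: cache-hot threads return to their last CPU, cache-cold threads go
// to the least loaded CPU.
static cpu_t* sched_pick_target_cpu(thread_t* thread)
{
    cpu_t* last = cpu_get(thread->lastCpu); // hypothetical accessor
    if (last != NULL && sched_is_cache_hot(thread, last))
    {
        return last;
    }
    return cpu_with_lowest_total_weight(); // hypothetical accessor
}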

Note
The reason we want to avoid a global runqueue is to avoid lock contention. Even a small amount of lock contention in the scheduler will quickly degrade performance; as such, a CPU is only ever allowed to hold one scheduler's lock at a time. This does cause race conditions while pulling or pushing threads, but the worst-case outcome is imperfect load balancing, which is acceptable.

Testing

The scheduler is tested using a combination of asserts and tests that are enabled in debug builds (NDEBUG not defined). These tests verify that the runqueue is sorted, that lag sums to zero (within a margin allowing for rounding errors), and other invariants of the scheduler.
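
For example, the zero-lag-sum invariant could be checked with something along these lines (the iteration macro and field names are assumptions for illustration; the kernel's assert is assumed to be in scope):

#ifndef NDEBUG
// Sketch: in debug builds, assert that lag sums to (approximately) zero.
static void sched_assert_lag_sum(sched_t* sched)
{
    lag_t sum = SCHED_FIXED_ZERO;
    SCHED_FOR_EACH_CLIENT(sched, client) // hypothetical iteration macro
    {
        sum += client->lag;
    }
    // Allow a small margin for accumulated rounding error.
    assert(SCHED_FIXED_FROM(sum) >= -SCHED_EPSILON &&
        SCHED_FIXED_FROM(sum) <= SCHED_EPSILON);
}
#endif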

References

References were accessed on 2025-12-02.

Ion Stoica, Hussein Abdel-Wahab, "Earliest Eligible Virtual Deadline First", Old Dominion University, 1996.

Jonathan Corbet, "An EEVDF CPU scheduler for Linux", LWN.net, March 9, 2023.

Jonathan Corbet, "Completing the EEVDF Scheduler", LWN.net, April 11, 2024.

Modules

 Program Loader
 Program loading and user stack management.
 
 Processes
 Processes.
 
 Time subsystem
 System time.
 
 Threads
 Thread of execution.
 
 Timer subsystem
 Per-CPU timers.
 
 Waiting subsystem
 Wait queue implementation.
 

Data Structures

struct  sched_client_t
 Per-thread scheduler context. More...
 
struct  sched_t
 Per-CPU scheduler. More...
 
struct  lag_t
 Lag type. More...
 

Macros

#define SCHED_FIXED_POINT   (63LL)
 The bits used for the fractional part of a virtual clock or lag value.
 
#define SCHED_FIXED_ZERO   ((int128_t)0)
 Fixed-point zero.
 
#define SCHED_EPSILON   (10LL)
 The minimum difference between two virtual clock or lag values to consider them unequal.
 
#define SCHED_FIXED_TO(x)   (((int128_t)(x)) << SCHED_FIXED_POINT)
 Convert a regular integer to fixed-point representation.
 
#define SCHED_FIXED_FROM(x)   ((int64_t)((int128_t)(x) >> SCHED_FIXED_POINT))
 Convert a fixed-point value to a regular integer.
 
#define SCHED_WEIGHT_MAX   (PRIORITY_MAX + SCHED_WEIGHT_BASE)
 The maximum weight a thread can have.
 
#define SCHED_WEIGHT_BASE   1
 Base weight added to all threads.
 

Typedefs

typedef int128_t vclock_t
 Virtual clock type.
 
typedef int128_t lag_t
 

Functions

void sched_client_init (sched_client_t *client)
 Initialize the scheduler context for a thread.
 
void sched_init (sched_t *sched)
 Initialize the scheduler for a CPU.
 
_NORETURN void sched_start (thread_t *bootThread)
 Starts the scheduler by jumping to the boot thread.
 
void sched_submit (thread_t *thread)
 Submits a thread to the scheduler.
 
void sched_do (interrupt_frame_t *frame, cpu_t *self)
 Perform a scheduling operation.
 
bool sched_is_idle (cpu_t *cpu)
 Checks if the CPU is currently idle.
 
thread_t * sched_thread (void)
 Retrieves the currently running thread.
 
process_t * sched_process (void)
 Retrieves the process of the currently running thread.
 
thread_t * sched_thread_unsafe (void)
 Retrieves the currently running thread without disabling interrupts.
 
process_t * sched_process_unsafe (void)
 Retrieves the process of the currently running thread without disabling interrupts.
 
uint64_t sched_nanosleep (clock_t timeout)
 Sleeps the current thread for a specified duration in nanoseconds.
 
void sched_yield (void)
 Yield the current thread's time slice to allow other threads to run.
 
_NORETURN void sched_process_exit (int32_t status)
 Terminates the currently executing process and all its threads.
 
_NORETURN void sched_thread_exit (void)
 Terminates the currently executing thread.
 
_NORETURN void sched_idle_loop (void)
 The idle loop for the scheduler.
 

Macro Definition Documentation

◆ SCHED_FIXED_POINT

#define SCHED_FIXED_POINT   (63LL)

The bits used for the fractional part of a virtual clock or lag value.

One sign bit, 64 integer bits, 63 fractional bits.

Definition at line 329 of file sched.h.

◆ SCHED_FIXED_ZERO

#define SCHED_FIXED_ZERO   ((int128_t)0)

Fixed-point zero.

Definition at line 334 of file sched.h.

◆ SCHED_EPSILON

#define SCHED_EPSILON   (10LL)

The minimum difference between two virtual clock or lag values to consider them unequal.

Definition at line 339 of file sched.h.

◆ SCHED_FIXED_TO

#define SCHED_FIXED_TO (   x)    (((int128_t)(x)) << SCHED_FIXED_POINT)

Convert a regular integer to fixed-point representation.

Definition at line 344 of file sched.h.

◆ SCHED_FIXED_FROM

#define SCHED_FIXED_FROM (   x)    ((int64_t)((int128_t)(x) >> SCHED_FIXED_POINT))

Convert a fixed-point value to a regular integer.

Definition at line 349 of file sched.h.

◆ SCHED_WEIGHT_MAX

#define SCHED_WEIGHT_MAX   (PRIORITY_MAX + SCHED_WEIGHT_BASE)

The maximum weight a thread can have.

Definition at line 354 of file sched.h.

◆ SCHED_WEIGHT_BASE

#define SCHED_WEIGHT_BASE   1

Base weight added to all threads.

Used to prevent division by zero.

Definition at line 361 of file sched.h.

Typedef Documentation

◆ vclock_t

typedef int128_t vclock_t

Virtual clock type.

Definition at line 316 of file sched.h.

◆ lag_t

typedef int128_t lag_t

Definition at line 322 of file sched.h.

Function Documentation

◆ sched_client_init()

void sched_client_init ( sched_client_t * client )

Initialize the scheduler context for a thread.

Parameters
client  The scheduler context to initialize.

Definition at line 102 of file sched.c.


◆ sched_init()

void sched_init ( sched_t * sched )

Initialize the scheduler for a CPU.

Parameters
sched  The scheduler to initialize.

Definition at line 324 of file sched.c.


◆ sched_start()

_NORETURN void sched_start ( thread_t * bootThread )

Starts the scheduler by jumping to the boot thread.

Will never return.

Parameters
bootThread  The initial thread to switch to.

Definition at line 350 of file sched.c.


◆ sched_submit()

void sched_submit ( thread_t * thread )

Submits a thread to the scheduler.

If the thread has previously run within CONFIG_CACHE_HOT_THRESHOLD nanoseconds, it will be submitted to the same CPU it last ran on; otherwise it will be submitted to the least loaded CPU.

Parameters
thread  The thread to submit.

Definition at line 375 of file sched.c.


◆ sched_do()

void sched_do ( interrupt_frame_t * frame, cpu_t * self )

Perform a scheduling operation.

This function is called on every interrupt to provide a scheduling opportunity.

Parameters
frame  The interrupt frame.
self  The CPU performing the scheduling operation.

Definition at line 521 of file sched.c.


◆ sched_is_idle()

bool sched_is_idle ( cpu_t * cpu )

Checks if the CPU is currently idle.

Parameters
cpu  The CPU to check.
Returns
true if the CPU is idle, false otherwise.

Definition at line 604 of file sched.c.


◆ sched_thread()

thread_t * sched_thread ( void  )

Retrieves the currently running thread.

Returns
The currently running thread.

Definition at line 612 of file sched.c.


◆ sched_process()

process_t * sched_process ( void  )

Retrieves the process of the currently running thread.

Note
Will not increment the reference count of the returned process, as we consider the currently running thread to always be referencing its process.
Returns
The process of the currently running thread.

Definition at line 620 of file sched.c.


◆ sched_thread_unsafe()

thread_t * sched_thread_unsafe ( void  )

Retrieves the currently running thread without disabling interrupts.

Returns
The currently running thread.

Definition at line 626 of file sched.c.


◆ sched_process_unsafe()

process_t * sched_process_unsafe ( void  )

Retrieves the process of the currently running thread without disabling interrupts.

Note
Will not increment the reference count of the returned process, as we consider the currently running thread to always be referencing its process.
Returns
The process of the currently running thread.

Definition at line 632 of file sched.c.


◆ sched_nanosleep()

uint64_t sched_nanosleep ( clock_t  timeout)

Sleeps the current thread for a specified duration in nanoseconds.

Parameters
timeout  The duration to sleep in nanoseconds.
Returns
On success, 0. On failure, ERR and errno is set.

Definition at line 638 of file sched.c.


◆ sched_yield()

void sched_yield ( void  )

Yield the current thread's time slice to allow other threads to run.

Note
Currently just sleeps for 1ms as we can't really yield without weird lag math.

Definition at line 643 of file sched.c.


◆ sched_process_exit()

_NORETURN void sched_process_exit ( int32_t  status)

Terminates the currently executing process and all its threads.

Note
Will never return, instead it triggers an interrupt that kills the current thread.
Parameters
status  The exit status of the process.

Definition at line 648 of file sched.c.


◆ sched_thread_exit()

_NORETURN void sched_thread_exit ( void  )

Terminates the currently executing thread.

Note
Will never return, instead it triggers an interrupt that kills the thread.

Definition at line 657 of file sched.c.


◆ sched_idle_loop()

_NORETURN void sched_idle_loop ( void  )
extern

The idle loop for the scheduler.

This is where idle threads will run when there is nothing else to do.
