The Linux Scheduling Algorithm
In the previous sections, we discussed process scheduling theory in the abstract, with only occasional mention of how Linux applies a given concept to reality. With the foundation of scheduling now built, we can dive into Linux's very own process scheduler.

The Linux scheduler is defined in kernel/sched.c. The scheduler algorithm and supporting code went through a large rewrite early in the 2.5 kernel development series. Consequently, the scheduler code is entirely new and unlike the scheduler in previous kernels. The new scheduler was designed to accomplish specific goals:

- Implement fully O(1) scheduling. Every algorithm in the new scheduler completes in constant time, regardless of the number of running processes.
- Implement perfect SMP scalability. Each processor has its own locking and individual runqueue.
- Implement improved SMP affinity. Attempt to group tasks to a specific CPU and continue to run them there. Only migrate tasks from one CPU to another to resolve imbalances in runqueue sizes.
- Provide good interactive performance. Even during considerable system load, the system should react and schedule interactive tasks immediately.
- Provide fairness. No process should find itself starved of timeslice for any reasonable amount of time. Likewise, no process should receive an unfairly high amount of timeslice.
- Optimize for the common case of only one or two runnable processes, yet scale well to multiple processors, each with many processes.
The new scheduler accomplished these goals.
Runqueues
The basic data structure in the scheduler is the runqueue. The runqueue is defined in kernel/sched.c[3] as struct runqueue. The runqueue is the list of runnable processes on a given processor; there is one runqueue per processor. Each runnable process is on exactly one runqueue. The runqueue additionally contains per-processor scheduling information. Consequently, the runqueue is the primary scheduling data structure for each processor.

[3] Why kernel/sched.c and not <linux/sched.h>? Because it is desired to abstract away the scheduler code and provide only certain interfaces to the rest of the kernel. Placing the runqueue code in a header file would allow code outside of the scheduler to get at the runqueues, and this is not desired.
Let's look at the structure, with comments describing each field:
struct runqueue {
        spinlock_t lock;                        /* spin lock that protects this runqueue */
        unsigned long nr_running;               /* number of runnable tasks */
        unsigned long nr_switches;              /* context switch count */
        unsigned long expired_timestamp;        /* time of last array swap */
        unsigned long nr_uninterruptible;       /* uninterruptible tasks */
        unsigned long long timestamp_last_tick; /* last scheduler tick */
        struct task_struct *curr;               /* currently running task */
        struct task_struct *idle;               /* this processor's idle task */
        struct mm_struct *prev_mm;              /* mm_struct of last ran task */
        struct prio_array *active;              /* active priority array */
        struct prio_array *expired;             /* the expired priority array */
        struct prio_array arrays[2];            /* the actual priority arrays */
        struct task_struct *migration_thread;   /* migration thread */
        struct list_head migration_queue;       /* migration queue */
        atomic_t nr_iowait;                     /* number of tasks waiting on I/O */
};

Because runqueues are the core data structure in the scheduler, a group of macros is used to obtain the runqueue associated with a given processor or process. The macro cpu_rq(processor) returns a pointer to the runqueue associated with the given processor; the macro this_rq() returns the runqueue of the current processor; and the macro task_rq(task) returns a pointer to the runqueue on which the given task is queued.

Before a runqueue can be manipulated, it must be locked (locking is discussed in depth in Chapter 8, "Kernel Synchronization Introduction"). Because each runqueue is unique to the current processor, it is rare when a processor desires to lock a different processor's runqueue. (It does happen, however, as we will see.) The locking of the runqueue prohibits any changes to it while the lock-holder is reading or writing the runqueue's members. The most common runqueue locking scenario is when you want to lock the runqueue on which a specific task runs. In that case, the task_rq_lock() and task_rq_unlock() functions are used:
struct runqueue *rq;
unsigned long flags;
rq = task_rq_lock(task, &flags);
/* manipulate the task's runqueue, rq */
task_rq_unlock(rq, &flags);

Alternatively, the method this_rq_lock() locks the current runqueue and rq_unlock() unlocks the given runqueue:
struct runqueue *rq;
rq = this_rq_lock();
/* manipulate this process's current runqueue, rq */
rq_unlock(rq);

To avoid deadlock, code that wants to lock multiple runqueues needs always to obtain the locks in the same order: by ascending runqueue address. (Again, Chapter 8 offers a full explanation.) For example,
/* to lock ... */
if (rq1 == rq2)
        spin_lock(&rq1->lock);
else {
        if (rq1 < rq2) {
                spin_lock(&rq1->lock);
                spin_lock(&rq2->lock);
        } else {
                spin_lock(&rq2->lock);
                spin_lock(&rq1->lock);
        }
}
/* manipulate both runqueues ... */

/* to unlock ... */
spin_unlock(&rq1->lock);
if (rq1 != rq2)
        spin_unlock(&rq2->lock);

These steps are made automatic by the double_rq_lock() and double_rq_unlock() functions. The preceding steps would then become
double_rq_lock(rq1, rq2);
/* manipulate both runqueues ... */
double_rq_unlock(rq1, rq2);

A quick example should help you see why the order of obtaining the locks is important. The topic of deadlock is covered in Chapters 8 and 9 because this is not a problem unique to the runqueues; nested locks always need to be obtained in the same order. The spin locks are used to prevent multiple tasks from simultaneously manipulating the runqueues. They work like a key to a door. The first task to reach the door grabs the key and enters the door, locking the door behind it. If another task reaches the door and finds it locked (because another task is already inside), it must sit and wait for the first task to exit the door and return the key. This waiting is called spinning because the task actually sits in a tight loop, repeatedly checking for the return of the key.

Now, consider if one task wants to lock the first runqueue and then the second while another task wants to lock the second runqueue and then the first. Assume the first task succeeds in locking the first runqueue while simultaneously the second task succeeds in locking the second runqueue. Now the first task tries to lock the second runqueue and the second task tries to lock the first runqueue. Neither task succeeds because the other task holds the lock. Both tasks sit, waiting forever for each other. Like an impasse creating a traffic deadlock, this out-of-order locking results in the tasks waiting for each other, forever, and thus deadlocking. If both tasks obtained the locks in the same order, this scenario could not happen. See Chapters 8 and 9 for the full scoop on locking.
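As a concrete illustration of the ordering rule, here is a minimal sketch of how double_rq_lock() and double_rq_unlock() can be structured. It simply wraps the ascending-address steps shown earlier; the actual helpers live in kernel/sched.c and may differ in detail.

/*
 * Sketch: always take the lower-addressed lock first, so every path
 * acquires the two runqueue locks in the same order.
 */
static void double_rq_lock_sketch(struct runqueue *rq1, struct runqueue *rq2)
{
        if (rq1 == rq2)
                spin_lock(&rq1->lock);
        else if (rq1 < rq2) {
                spin_lock(&rq1->lock);
                spin_lock(&rq2->lock);
        } else {
                spin_lock(&rq2->lock);
                spin_lock(&rq1->lock);
        }
}

static void double_rq_unlock_sketch(struct runqueue *rq1, struct runqueue *rq2)
{
        spin_unlock(&rq1->lock);
        if (rq1 != rq2)
                spin_unlock(&rq2->lock);
}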
The Priority Arrays
Each runqueue contains two priority arrays, the active and the expired array. Priority arrays are defined in kernel/sched.c as struct prio_array. Priority arrays are the data structures that provide O(1) scheduling. Each priority array contains one queue of runnable processes per priority level. These queues contain lists of the runnable processes at each priority level. The priority arrays also contain a priority bitmap used to efficiently discover the highest-priority runnable task in the system.
struct prio_array {
        int nr_active;                          /* number of tasks in the queues */
        unsigned long bitmap[BITMAP_SIZE];      /* priority bitmap */
        struct list_head queue[MAX_PRIO];       /* priority queues */
};

MAX_PRIO is the number of priority levels on the system. By default, this is 140. Thus, there is one struct list_head for each priority. BITMAP_SIZE is the size that an array of unsigned long typed variables would have to be to provide one bit for each valid priority level. With 140 priorities and 32-bit words, this is five. Thus, bitmap is an array with five elements and a total of 160 bits.

Each priority array contains a bitmap field that has at least one bit for every priority on the system. Initially, all the bits are zero. When a task of a given priority becomes runnable (that is, its state is set to TASK_RUNNING), the corresponding bit in the bitmap is set to one. For example, if a task with priority seven is runnable, then bit seven is set. Finding the highest priority task on the system is therefore only a matter of finding the first set bit in the bitmap. Because the number of priorities is static, the time to complete this search is constant and unaffected by the number of running processes on the system. Furthermore, each supported architecture in Linux implements a fast find first set algorithm to quickly search the bitmap. This method is called sched_find_first_bit(). Many architectures provide a find-first-set instruction that operates on a given word[4]. On these systems, finding the first set bit is as trivial as executing this instruction at most a couple of times.

[4] On the x86 architecture, this instruction is called bsfl. On PPC, cntlzw is used for this purpose.
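To make the bitmap search concrete, the following is a minimal, portable sketch of scanning such a priority bitmap for its first set bit. It is an illustration of the idea only; the real sched_find_first_bit() is architecture-specific and uses the hardware find-first-set instruction where available. The names MAX_PRIO_SKETCH, BITS_PER_WORD, and WORDS are local to this example.

#define MAX_PRIO_SKETCH 140
#define BITS_PER_WORD   (8 * sizeof(unsigned long))
#define WORDS           ((MAX_PRIO_SKETCH + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Return the index of the lowest set bit, or MAX_PRIO_SKETCH if none is set. */
static int find_first_set_bit_sketch(const unsigned long *bitmap)
{
        unsigned int word, bit;

        for (word = 0; word < WORDS; word++) {
                if (!bitmap[word])
                        continue;       /* no runnable task at these priorities */
                for (bit = 0; bit < BITS_PER_WORD; bit++)
                        if (bitmap[word] & (1UL << bit))
                                return word * BITS_PER_WORD + bit;
        }
        return MAX_PRIO_SKETCH;         /* bitmap empty: no runnable tasks */
}

Because the number of words is fixed, this scan takes constant time no matter how many tasks are runnable, which is exactly the property the O(1) scheduler relies on.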
Each priority array also contains an array named queue of struct list_head queues, one queue for each priority. Each list corresponds to a given priority and in fact contains all the runnable processes of that priority that are on this processor's runqueue. Finding the next task to run is as simple as selecting the next element in the list. Within a given priority, tasks are scheduled round robin. The priority array also contains a counter, nr_active. This is the number of runnable tasks in this priority array.
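Tying the pieces together, adding a runnable task to a priority array amounts to three constant-time steps: link the task into the list for its priority, set that priority's bit, and increment nr_active. The sketch below illustrates this; it mirrors the shape of the scheduler's enqueue_task() but is simplified, and the task's back-pointer to its array (p->array here) is shown for illustration only.

/*
 * Sketch of enqueueing a task onto a priority array.  A simplified
 * illustration; the scheduler's actual enqueue_task() in
 * kernel/sched.c differs in details.
 */
static void enqueue_task_sketch(struct task_struct *p, struct prio_array *array)
{
        list_add_tail(&p->run_list, array->queue + p->prio);   /* round-robin tail insert */
        __set_bit(p->prio, array->bitmap);      /* this priority now has a runnable task */
        array->nr_active++;
        p->array = array;                       /* remember which array the task lives on */
}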
Recalculating Timeslices
Many operating systems (older versions of Linux included) have an explicit method for recalculating each task's timeslice when they have all reached zero. Typically, this is implemented as a loop over each task, such as
for (each task on the system) {
        recalculate priority
        recalculate timeslice
}
The priority and other attributes of the task are used to determine a new timeslice. This approach has some problems:

- It potentially can take a long time. Worse, it scales O(n) for n tasks on the system.
- The recalculation must occur under some sort of lock protecting the task list and the individual process descriptors. This results in high lock contention.
- The nondeterminism of a randomly occurring recalculation of the timeslices is a problem with deterministic real-time programs.
- It is just gross (which is a quite legitimate reason for improving something in the Linux kernel).
The new Linux scheduler alleviates the need for a recalculate loop. Instead, it maintains two priority arrays for each processor: both an active array and an expired array. The active array contains all the tasks in the associated runqueue that have timeslice left. The expired array contains all the tasks in the associated runqueue that have exhausted their timeslice. When each task's timeslice reaches zero, its timeslice is recalculated before it is moved to the expired array. Recalculating all the timeslices is then as simple as just switching the active and expired arrays. Because the arrays are accessed only via pointer, switching them is as fast as swapping two pointers. This is performed in schedule():
struct prio_array *array = rq->active;
if (!array->nr_active) {
        rq->active = rq->expired;
        rq->expired = array;
}
This swap is a key feature of the new O(1) scheduler. Instead of recalculating each process's priority and timeslice all the time, the O(1) scheduler performs a simple two-step array swap. This resolves the previously discussed problems.
schedule()
The act of picking the next task to run and switching to it is implemented via the schedule() function. This function is called explicitly by kernel code that wants to sleep and it is invoked whenever a task is to be preempted. The schedule() function is run independently by each processor. Consequently, each CPU makes its own decisions on what process to run next. The schedule() function is relatively simple for all it must accomplish. The following code determines the highest priority task:
struct task_struct *prev, *next;
struct list_head *queue;
struct prio_array *array;
int idx;
prev = current;
array = rq->active;
idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, struct task_struct, run_list);

First, the active priority array is searched to find the first set bit. This bit corresponds to the highest priority task that is runnable. Next, the scheduler selects the first task in the list at that priority. This is the highest priority runnable task on the system and is the task the scheduler will run. See Figure 4.2.
Figure 4.2. The Linux O(1) scheduler algorithm.

Calculating Priority and Timeslice
At the beginning of this chapter, you saw how priority and timeslice are used to influence the decisions that the scheduler makes. Additionally, you learned about I/O-bound and processor-bound tasks and why it is beneficial to boost the priority of interactive tasks. Now it's time to look at the actual code that implements this design.

Processes have an initial priority that is called the nice value. This value ranges from -20 to +19 with a default of zero. Nineteen is the lowest and -20 is the highest priority. This value is stored in the static_prio member of the process's task_struct. The variable is called the static priority because it does not change from what the user specifies. The scheduler, in turn, bases its decisions on the dynamic priority that is stored in prio. The dynamic priority is calculated as a function of the static priority and the task's interactivity.

Timeslice, on the other hand, is a simpler calculation because it is based on the static priority: The higher a task's priority, the larger the timeslice it receives. The minimum timeslice is a small fixed value or one timer tick (see Chapter 10, "Timers and Time Management"), whichever is larger. Tasks with the default priority (a nice value of zero) receive a timeslice of 100 milliseconds. See Table 4.1.

When a task exhausts its timeslice, the kernel must decide whether to reinsert it into the active array (the reward given to sufficiently interactive tasks) or move it to the expired array. This decision is made from the scheduler's timer-driven code (the timer interrupt is discussed in Chapter 10, "Timers and Time Management"):
struct task_struct *task;
struct runqueue *rq;
task = current;
rq = this_rq();
if (!--task->time_slice) {
        if (!TASK_INTERACTIVE(task) || EXPIRED_STARVING(rq))
                enqueue_task(task, rq->expired);
        else
                enqueue_task(task, rq->active);
}
First, the code decrements the process's timeslice and checks whether it is now zero. If it is, the task is expired and it needs to be inserted into an array, so this code first checks whether the task is interactive via the TASK_INTERACTIVE() macro. This macro computes whether a task is "interactive enough" based on its nice value. The lower the nice value (the higher the priority), the less interactive a task needs to be. A nice +19 task can never be interactive enough to be reinserted. Conversely, a nice -20 task would need to be a heavy processor hog not to be reinserted. A task at the default nice value, zero, needs to be relatively interactive to be reinserted, but it is not too difficult. Next, the EXPIRED_STARVING() macro checks whether there are processes on the expired array that are starving; that is, whether the arrays have not been switched in a relatively long time. If they have not been switched recently, reinserting the current task into the active array further delays the switch, additionally starving the tasks on the expired array. If this is not the case, the process can be inserted into the active array. Otherwise, it is inserted into the expired array, which is the normal practice.
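As a rough illustration of the starvation check, the sketch below builds such a test from the runqueue fields shown earlier: the longer the arrays go without being switched (tracked via expired_timestamp), scaled by the number of running tasks, the more likely newly expired tasks are forced onto the expired array. The threshold and exact form here are assumptions for illustration, not the kernel's actual EXPIRED_STARVING() macro.

/*
 * Illustrative starvation check (a sketch, not the kernel's actual
 * EXPIRED_STARVING() macro): treat the expired array as starving if
 * the arrays have not been switched for longer than a limit that
 * scales with the number of running tasks.
 */
#define STARVATION_LIMIT_SKETCH (10 * HZ)       /* hypothetical threshold */

static int expired_starving_sketch(struct runqueue *rq)
{
        unsigned long age = jiffies - rq->expired_timestamp;

        return age >= STARVATION_LIMIT_SKETCH * (rq->nr_running + 1);
}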
Sleeping and Waking Up
Tasks that are sleeping (blocked) are in a special non-runnable state. This is important because without this special state, the scheduler would select tasks that did not want to run or, worse, sleeping would have to be implemented as busy looping. A task sleeps for a number of reasons, but always while it is waiting for some event. The event can be a specified amount of time, more data from a file I/O, or another hardware event. A task can also involuntarily go to sleep when it tries to obtain a contended semaphore in the kernel (this is covered in Chapter 9, "Kernel Synchronization Methods"). A common reason to sleep is file I/O; for example, the task issued a read() request on a file, which needs to be read in from disk. As another example, the task could be waiting for keyboard input.

Whatever the case, the kernel behavior is the same: The task marks itself as sleeping, puts itself on a wait queue, removes itself from the runqueue, and calls schedule() to select a new process to execute. Waking back up is the inverse: The task is set as runnable, removed from the wait queue, and added back to the runqueue. As discussed in the previous chapter, two states are associated with sleeping, TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE. They differ only in that tasks in the TASK_UNINTERRUPTIBLE state ignore signals, whereas tasks in the TASK_INTERRUPTIBLE state wake up prematurely and respond to a signal if one is issued. Both types of sleeping tasks sit on a wait queue, waiting for an event to occur, and are not runnable.

Sleeping is handled via wait queues. A wait queue is a simple list of processes waiting for an event to occur. Wait queues are represented in the kernel by wait_queue_head_t. Wait queue heads are created statically via DECLARE_WAIT_QUEUE_HEAD() or dynamically via init_waitqueue_head(). Processes put themselves on a wait queue and mark themselves not runnable. When the event associated with the wait queue occurs, the processes on the queue are awakened. It is important to implement sleeping and waking correctly, to avoid race conditions.

Some simple interfaces for sleeping used to be in wide use. These interfaces, however, have races: It is possible to go to sleep after the condition becomes true. In that case, the task might sleep indefinitely. Therefore, the recommended method for sleeping in the kernel is a bit more complicated:
/* 'q' is the wait queue we wish to sleep on */
DECLARE_WAITQUEUE(wait, current);

add_wait_queue(q, &wait);
while (!condition) {    /* condition is the event that we are waiting for */
        set_current_state(TASK_INTERRUPTIBLE); /* or TASK_UNINTERRUPTIBLE */
        if (signal_pending(current)) {
                /* handle signal */
        }
        schedule();
}
set_current_state(TASK_RUNNING);
remove_wait_queue(q, &wait);

The task performs the following steps to add itself to a wait queue:
1. Creates a wait queue entry via DECLARE_WAITQUEUE().
2. Adds itself to a wait queue via add_wait_queue(). This wait queue awakens the process when the condition for which it is waiting occurs. Of course, there needs to be code elsewhere that calls wake_up() on the queue when the event actually does occur (see the sketch after this list).
3. Changes the process state to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
4. If the state is set to TASK_INTERRUPTIBLE, a signal wakes the process up. This is called a spurious wake up (a wake-up not caused by the occurrence of the event). So check and handle signals.
5. Tests whether the condition is true. If it is, there is no need to sleep. If it is not true, the task calls schedule(). When the task awakens, it again checks whether the condition is true. If it is, it exits the loop. Otherwise, it again calls schedule() and repeats.
6. Now that the condition is true, the task can set itself to TASK_RUNNING and remove itself from the wait queue via remove_wait_queue().
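For completeness, here is a minimal sketch of the waking side, using a hypothetical wait queue head and condition flag (my_event_queue and my_event_occurred are made-up names). The waker makes the condition true and then calls wake_up() on the same queue the sleeper passed as 'q' above.

/* Hypothetical wait queue head and condition shared with the sleeping code. */
static DECLARE_WAIT_QUEUE_HEAD(my_event_queue);
static int my_event_occurred;

/* Invoked from whatever code detects that the event has happened. */
static void signal_my_event(void)
{
        my_event_occurred = 1;          /* make the sleeper's condition true first... */
        wake_up(&my_event_queue);       /* ...then wake any tasks on the wait queue */
}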
Figure 4.3. Sleeping and waking up.

The Load Balancer
As discussed, the Linux scheduler implements separate runqueues and locking for each processor on a symmetrical multiprocessing system. That is, each processor maintains its own list of processes and operates the scheduler on only those tasks. The entire scheduling system is, in effect, unique to each processor. How, then, does the scheduler enforce any sort of global scheduling policy on multiprocessing systems? What if the runqueues become unbalanced, say with five processes on one processor's runqueue, but only one on another? The solution is the load balancer, which works to ensure that the runqueues are balanced. The load balancer compares the current processor's runqueue to the other runqueues in the system. If it finds an imbalance, it pulls processes from the busier runqueue to the current runqueue. Ideally, every runqueue will have the same number of processes. That is a lofty goal, but the load balancer comes close.

The load balancer is implemented in kernel/sched.c as load_balance(). It has two methods of invocation. It is called by schedule() whenever the current runqueue is empty. It is also called via timer: every 1 millisecond when the system is idle and every 200 milliseconds otherwise. On uniprocessor systems, load_balance() is never called and in fact is not even compiled into the kernel image because there is only a single runqueue and thus no balancing is needed.

The load balancer is called with the current processor's runqueue locked and with interrupts disabled to protect the runqueues from concurrent access. In the case where schedule() calls load_balance(), its job is pretty clear because the current runqueue is empty and finding any process and pulling it onto this runqueue is advantageous. When the load balancer is called via timer, however, its job might be less apparent: It needs to resolve any imbalance between the runqueues to keep them about even. See Figure 4.4.
Figure 4.4. The load balancer.

Here is load_balance(), slightly cleaned up but otherwise in all its glory:
static int load_balance(int this_cpu, runqueue_t *this_rq,
                        struct sched_domain *sd, enum idle_type idle)
{
        struct sched_group *group;
        runqueue_t *busiest;
        unsigned long imbalance;
        int nr_moved;

        spin_lock(&this_rq->lock);

        /* Find the busiest group in this scheduler domain, if any. */
        group = find_busiest_group(sd, this_cpu, &imbalance, idle);
        if (!group)
                goto out_balanced;

        /* Find the busiest runqueue within that group. */
        busiest = find_busiest_queue(group);
        if (!busiest)
                goto out_balanced;

        nr_moved = 0;
        if (busiest->nr_running > 1) {
                /* Lock both runqueues and pull tasks over to this processor. */
                double_lock_balance(this_rq, busiest);
                nr_moved = move_tasks(this_rq, this_cpu, busiest,
                                      imbalance, sd, idle);
                spin_unlock(&busiest->lock);
        }
        spin_unlock(&this_rq->lock);

        if (!nr_moved) {
                /*
                 * Nothing was moved.  If this keeps failing, kick the busiest
                 * runqueue's migration thread so it pushes tasks to us.
                 */
                sd->nr_balance_failed++;

                if (unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2)) {
                        int wake = 0;

                        spin_lock(&busiest->lock);
                        if (!busiest->active_balance) {
                                busiest->active_balance = 1;
                                busiest->push_cpu = this_cpu;
                                wake = 1;
                        }
                        spin_unlock(&busiest->lock);
                        if (wake)
                                wake_up_process(busiest->migration_thread);
                        sd->nr_balance_failed = sd->cache_nice_tries;
                }
        } else
                sd->nr_balance_failed = 0;

        sd->balance_interval = sd->min_interval;

        return nr_moved;

out_balanced:
        spin_unlock(&this_rq->lock);

        /* No imbalance found; back off by lengthening the balance interval. */
        if (sd->balance_interval < sd->max_interval)
                sd->balance_interval *= 2;

        return 0;
}