Linux Device Drivers (3rd Edition)

Jonathan Corbet, Greg Kroah-Hartman, Alessandro Rubini

6.2. Blocking I/O


Back in Chapter 3, we looked at how to implement the read and
write driver methods. At that point, however, we
skipped over one important issue: how does a driver respond if it
cannot immediately satisfy the request? A call to
read may come when no data is available, but
more is expected in the future. Or a process could attempt to
write, but your device is not ready to accept
the data, because your output buffer is full. The calling process
usually does not care about such issues; the programmer simply
expects to call read or
write and have the call return after the
necessary work has been done. So, in such cases, your driver should
(by default) block the process, putting it to
sleep until the request can proceed.

This section shows how to put a process to sleep and wake it up again
later on. As usual, however, we have to explain a few concepts first.


6.2.1. Introduction to Sleeping


What does it mean for a process to
"sleep"? When a process is put to
sleep, it is marked as being in a special state and removed from the
scheduler's run queue. Until something comes along
to change that state, the process will not be scheduled on any CPU
and, therefore, will not run. A sleeping process has been shunted off
to the side of the system, waiting for some future event to happen.

Causing a process to sleep is an easy thing for a Linux device driver
to do. There are, however, a couple of rules that you must keep in
mind to be able to code sleeps in a safe manner.

The first of these rules is: never sleep when you are running in an
atomic context.
An atomic context is
simply a state where multiple steps must be performed without any
sort of concurrent access. What that means, with regard to sleeping,
is that your driver cannot sleep while holding a spinlock, seqlock,
or RCU lock. You also cannot sleep if you have disabled interrupts.
It is legal to sleep while holding a semaphore,
but you should look very carefully at any code that does so. If code
sleeps while holding a semaphore, any other thread waiting for that
semaphore also sleeps. So any sleeps that happen while holding
semaphores should be short, and you should convince yourself that, by
holding the semaphore, you are not blocking the process that will
eventually wake you up.

Another thing to remember with sleeping is that, when you wake up,
you never know how long your process may have been out of the CPU or
what may have changed in the meantime. You also do not usually know
if another process may have been sleeping for the same event; that
process may wake before you and grab whatever resource you were
waiting for. The end result is that you can make no assumptions about
the state of the system after you wake up, and you must check to
ensure that the condition you were waiting for is, indeed, true.

One other relevant point, of course, is that your process cannot
sleep unless it is assured that somebody else, somewhere, will wake
it up. The code doing the awakening must also be able to find your
process to be able to do its job. Making sure that a wakeup happens
is a matter of thinking through your code and knowing, for each
sleep, exactly what series of events will bring that sleep to an end.
Making it possible for your sleeping process to be found is, instead,
accomplished through a data structure called a wait queue.
A wait queue is just what it sounds like: a list of processes, all
waiting for a specific event.

In Linux, a wait queue is managed by means of a
"wait queue head," a structure of
type wait_queue_head_t, which is defined in
<linux/wait.h>. A wait queue head can be
defined and initialized statically with:

DECLARE_WAIT_QUEUE_HEAD(name);

or dynamically as follows:

wait_queue_head_t my_queue;
init_waitqueue_head(&my_queue);

We will return to the structure of wait queues shortly, but we know
enough now to take a first look at sleeping and waking up.


6.2.2. Simple Sleeping


When a process sleeps, it does so in expectation that some condition
will become true in the future. As we noted before, any process that
sleeps must check to be sure that the condition it was waiting for is
really true when it wakes up again. The simplest way of sleeping in
the Linux kernel is a macro called wait_event (with a few variants);
it combines handling the details of sleeping with a check on the
condition a process is waiting for. The forms of
wait_event are:

wait_event(queue, condition)
wait_event_interruptible(queue, condition)
wait_event_timeout(queue, condition, timeout)
wait_event_interruptible_timeout(queue, condition, timeout)

In all of the above forms, queue is the wait queue
head to use. Notice that it is passed "by
value." The condition is an
arbitrary boolean expression that is evaluated by the macro before
and after sleeping; until condition evaluates to a
true value, the process continues to sleep. Note that
condition may be evaluated an arbitrary number of
times, so it should not have any side effects.

If you use wait_event, your process is put into
an uninterruptible sleep which, as we have mentioned before, is
usually not what you want. The preferred alternative is
wait_event_interruptible, which can be
interrupted by signals. This version returns an integer value that
you should check; a nonzero value means your sleep was interrupted by
some sort of signal, and your driver should probably return
-ERESTARTSYS. The final versions
(wait_event_timeout and
wait_event_interruptible_timeout) wait for a
limited time; after that time period (expressed in jiffies, which we
will discuss in Chapter 7)
expires, the macros return with a value of 0
regardless of how condition evaluates.
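
To make the return conventions concrete, here is a minimal sketch
(ours, not from the sample drivers) of a driver fragment using the
timeout form; dev->wq and dev->data_ready are hypothetical names:

int ret;

/* sleep for at most one second (HZ jiffies) waiting for data */
ret = wait_event_interruptible_timeout(dev->wq, dev->data_ready != 0, HZ);
if (ret < 0)
    return -ERESTARTSYS;   /* a signal interrupted the sleep */
if (ret == 0 && !dev->data_ready)
    return -ETIMEDOUT;     /* the timeout expired and no data arrived */
/* otherwise the condition became true; go on and handle the data */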

The other half of the picture, of course, is waking up. Some other
thread of execution (a different process, or an interrupt handler,
perhaps) has to perform the wakeup for you, since your process is, of
course, asleep. The basic function that wakes up sleeping processes
is called wake_up. It comes in several forms (but we look at only two
of them now):

void wake_up(wait_queue_head_t *queue);
void wake_up_interruptible(wait_queue_head_t *queue);

wake_up wakes up all processes waiting on the
given queue (though the situation is a little more
complicated than that, as we will see later). The other form
(wake_up_interruptible) restricts itself to
processes performing an interruptible sleep. In general, the two are
indistinguishable (if you are using interruptible sleeps); in
practice, the convention is to use wake_up if
you are using wait_event and
wake_up_interruptible if you use
wait_event_interruptible.

We now know enough to look at a simple example of sleeping and waking
up. In the sample source, you can find a module called
sleepy. It implements a device with simple
behavior: any process that attempts to read from the device is put to
sleep. Whenever a process writes to the device, all sleeping
processes are awakened. This behavior is implemented with the
following read and write
methods:

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int flag = 0;

ssize_t sleepy_read (struct file *filp, char __user *buf, size_t count, loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
            current->pid, current->comm);
    wait_event_interruptible(wq, flag != 0);
    flag = 0;
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}

ssize_t sleepy_write (struct file *filp, const char __user *buf, size_t count,
        loff_t *pos)
{
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
            current->pid, current->comm);
    flag = 1;
    wake_up_interruptible(&wq);
    return count; /* succeed, to avoid retrial */
}

Note the use of the flag variable in this example.
Since wait_event_interruptible checks for a
condition that must become true, we use flag to
create that condition.

It is interesting to consider what happens if
two processes are waiting when
sleepy_write is called. Since
sleepy_read resets flag to
0 once it wakes up, you might think that the
second process to wake up would immediately go back to sleep. On a
single-processor system, that is almost always what happens. But it
is important to understand why you cannot count on that behavior. The
wake_up_interruptible call
will cause both sleeping processes to wake up.
It is entirely possible that they will both note that
flag is nonzero before either has the opportunity
to reset it. For this trivial module, this race condition is
unimportant. In a real driver, this kind of race can create rare
crashes that are difficult to diagnose. If correct operation required
that exactly one process see the nonzero value, it would have to be
tested in an atomic manner. One way to do so is sketched below; we
will see how a real driver handles such situations shortly.
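
The sketch below is our own and not part of the sleepy module;
sleepy_lock is a hypothetical spinlock:

static spinlock_t sleepy_lock = SPIN_LOCK_UNLOCKED;  /* hypothetical */

    /* in sleepy_read, after wait_event_interruptible returns: */
    int we_won = 0;

    spin_lock(&sleepy_lock);
    if (flag != 0) {
        flag = 0;    /* only the first reader through the lock sees 1 */
        we_won = 1;
    }
    spin_unlock(&sleepy_lock);
    if (!we_won) {
        /* someone else consumed the event; go back to sleep */
    }

But first, before looking at that real driver, we have to cover one
other topic.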


6.2.3. Blocking and Nonblocking Operations


One last point we need to touch on before we look at the
implementation of full-featured
read and write methods is
deciding when to put a process to sleep. There are times when
implementing proper Unix semantics requires that an operation not
block, even if it cannot be completely carried out.

There are also times when the calling process informs you that it
does not want to block, whether or not its I/O
can make any progress at all. Explicitly nonblocking I/O is indicated
by the O_NONBLOCK flag in
filp->f_flags. The flag is defined in
<linux/fcntl.h>, which is automatically
included by <linux/fs.h>. The flag gets
its name from "open-nonblock,"
because it can be specified at open time (and originally could be
specified only there). If you browse the source code, you find some
references to an O_NDELAY flag; this is an
alternate name for O_NONBLOCK, accepted for
compatibility with System V code. The flag is cleared by default,
because the normal behavior of a process waiting for data is just to
sleep. In the case of a blocking operation, which is the default, the
following behavior should be implemented in order to adhere to the
standard semantics:

  • If a process calls read but no data is (yet)
    available, the process must block. The process is awakened as soon as
    some data arrives, and that data is returned to the caller, even if
    there is less than the amount requested in the
    count argument to the method.

  • If a process calls
    write and there is no space in the buffer, the
    process must block, and it must be on a different wait queue from the
    one used for reading. When some data has been written to the hardware
    device, and space becomes free in the output buffer, the process is
    awakened and the write call succeeds, although
    the data may be only partially written if there
    isn't room in the buffer for the
    count bytes that were requested.


Both these statements assume that there are both input and output
buffers; in practice, almost every device driver has
them. The input buffer is required to avoid losing data that arrives
when nobody is reading. In contrast, data can't be
lost on write, because if the system call
doesn't accept data bytes, they remain in the
user-space buffer. Even so, the output buffer is almost always useful
for squeezing more performance out of the hardware.


The performance gain of
implementing an output buffer in the driver results from the reduced
number of context switches and user-level/kernel-level transitions.
Without an output buffer (assuming a slow device), only one or a few
characters are accepted by each system call, and while one process
sleeps in write, another process runs
(that's one context switch). When the first process
is awakened, it resumes (another context switch),
write returns (kernel/user transition), and the
process reiterates the system call to write more data (user/kernel
transition); the call blocks and the loop continues. The addition of
an output buffer allows the driver to accept larger chunks of data
with each write call, with a corresponding
increase in performance. If that buffer is big enough, the
write call succeeds on the first attempt (the buffered data will be
pushed out to the device later) without control needing to go back
to user space for a second or third write call. The choice of a
suitable size for the output buffer is clearly device-specific.

We don't use an input buffer in
scull, because data is already available when
read is issued. Similarly, no output buffer is
used, because data is simply copied to the memory area associated
with the device. Essentially, the device is a
buffer, so the implementation of additional buffers would be
superfluous. We'll see the use of buffers in Chapter 10.

The behavior of read and
write is different if
O_NONBLOCK is specified. In this case, the calls
simply return -EAGAIN ("TRy it
again") if a process calls read
when no data is available or if it calls write
when there's no space in the buffer.

As you might expect, nonblocking operations return immediately,
allowing the application to poll for data. Applications must be
careful when using the stdio functions while
dealing with nonblocking files, because they can easily mistake a
nonblocking return for EOF. They always have to
check errno.
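
As a rough illustration, a user-space reader can tell the two cases
apart like this (a minimal sketch of our own; nothing here comes from
the book's examples):

#include <unistd.h>
#include <errno.h>

/* returns: >0 bytes read, 0 on real EOF, -1 if no data yet */
ssize_t try_read(int fd, char *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n < 0 && errno == EAGAIN)
        return -1;   /* nonblocking "no data yet", not end-of-file */
    return n;        /* 0 really is EOF; positive means data arrived */
}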

Naturally, O_NONBLOCK is meaningful in the
open method also. This happens when the call can
actually block for a long time; for example, when opening (for read
access) a FIFO that has no writers (yet), or accessing a disk file
with a pending lock. Usually, opening a device either succeeds or
fails, without the need to wait for external events. Sometimes,
however, opening the device requires a long initialization, and you
may choose to support O_NONBLOCK in your
open method by returning immediately with
-EAGAIN if the flag is set, after starting the
device initialization process. The driver may also implement a
blocking open to support access policies in a
way similar to file locks. We'll see one such
implementation in Section 6.6.3 later in this chapter.
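
A minimal sketch of such an open method follows (ours, not from
scull; start_device_init and device_is_ready are hypothetical
helpers, and init_wq is a hypothetical wait queue that the
initialization code wakes when it finishes):

static DECLARE_WAIT_QUEUE_HEAD(init_wq);    /* hypothetical */

static int sample_open(struct inode *inode, struct file *filp)
{
    start_device_init();    /* hypothetical: begin the slow setup */
    if (filp->f_flags & O_NONBLOCK)
        return device_is_ready() ? 0 : -EAGAIN;
    /* blocking open: wait until initialization completes */
    if (wait_event_interruptible(init_wq, device_is_ready()))
        return -ERESTARTSYS;
    return 0;
}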

Some drivers may also implement special semantics for
O_NONBLOCK; for example, an open of a tape device
usually blocks until a tape has been inserted. If the tape drive is
opened with O_NONBLOCK, the open succeeds
immediately regardless of whether the media is present or not.

Only the read, write, and
open file operations are affected by the
nonblocking flag.


6.2.4. A Blocking I/O Example


Finally, we get to an example of a real driver method that implements
blocking I/O. This example is taken from the
scullpipe driver; it is a special form of
scull that implements a pipe-like device.

Within a driver, a process blocked in a read
call is awakened when data arrives; usually the hardware issues an
interrupt to signal such an event, and the driver awakens waiting
processes as part of handling the interrupt. The
scullpipe driver works differently, so that it
can be run without requiring any particular hardware or an interrupt
handler. We chose to use another process to generate the data and
wake the reading process; similarly, reading processes are used to
wake writer processes that are waiting for buffer space to become
available.

The device driver uses a device structure that contains two wait
queues and a buffer. The size of the buffer is configurable in the
usual ways (at compile time, load time, or runtime).

struct scull_pipe {
    wait_queue_head_t inq, outq;       /* read and write queues */
    char *buffer, *end;                /* begin of buf, end of buf */
    int buffersize;                    /* used in pointer arithmetic */
    char *rp, *wp;                     /* where to read, where to write */
    int nreaders, nwriters;            /* number of openings for r/w */
    struct fasync_struct *async_queue; /* asynchronous readers */
    struct semaphore sem;              /* mutual exclusion semaphore */
    struct cdev cdev;                  /* Char device structure */
};

The read implementation manages both blocking
and nonblocking input and looks like this:

static ssize_t scull_p_read (struct file *filp, char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    while (dev->rp == dev->wp) { /* nothing to read */
        up(&dev->sem); /* release the lock */
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" reading: going to sleep\n", current->comm);
        if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        /* otherwise loop, but first reacquire the lock */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    /* ok, data is there, return something */
    if (dev->wp > dev->rp)
        count = min(count, (size_t)(dev->wp - dev->rp));
    else /* the write pointer has wrapped, return data up to dev->end */
        count = min(count, (size_t)(dev->end - dev->rp));
    if (copy_to_user(buf, dev->rp, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->rp += count;
    if (dev->rp == dev->end)
        dev->rp = dev->buffer; /* wrapped */
    up (&dev->sem);

    /* finally, awake any writers and return */
    wake_up_interruptible(&dev->outq);
    PDEBUG("\"%s\" did read %li bytes\n",current->comm, (long)count);
    return count;
}

As you can see, we left some PDEBUG statements in
the code. When you compile the driver, you can enable messaging to
make it easier to follow the interaction of different processes.

Let us look carefully at how scull_p_read
handles waiting for data. The while loop tests the
buffer with the device semaphore held. If there is data there, we
know we can return it to the user immediately without sleeping, so
the entire body of the loop is skipped. If, instead, the buffer is
empty, we must sleep. Before we can do that, however, we must drop
the device semaphore; if we were to sleep holding it, no writer would
ever have the opportunity to wake us up. Once the semaphore has been
dropped, we make a quick check to see if the user has requested
nonblocking I/O, and return if so. Otherwise, it is time to call
wait_event_interruptible.

Once we get past that call, something has woken us up, but we do not
know what. One possibility is that the process received a signal.
The
if statement that contains the
wait_event_interruptible call checks for this
case. This statement ensures the proper and expected reaction to
signals, which could have been responsible for waking up the process
(since we were in an interruptible sleep). If a signal has arrived
and it has not been blocked by the process, the proper behavior is to
let upper layers of the kernel handle the event. To this end, the
driver returns -ERESTARTSYS to the caller; this
value is used internally by the virtual filesystem (VFS) layer, which
either restarts the system call or returns -EINTR
to user space. We use the same type of check to deal with signal
handling for every read and
write implementation.

However, even in the absence of a signal, we do not yet know for sure
that there is data there for the taking. Somebody else could have
been waiting for data as well, and they might win the race and get
the data first. So we must acquire the device semaphore again; only
then can we test the read buffer again (in the
while loop) and truly know that we can return the
data in the buffer to the user. The end result of all this code is
that, when we exit from the while loop, we know
that the semaphore is held and the buffer contains data that we can
use.

Just for completeness, let us note that
scull_p_read can sleep in another spot after we
take the device semaphore: the call to
copy_to_user. If scull
sleeps while copying data between kernel and user space, it sleeps
with the device semaphore held. Holding the semaphore in this case is
justified since it does not deadlock the system (we know that the
kernel will perform the copy to user space and wake us up without
trying to lock the same semaphore in the process), and since it is
important that the device memory array not change while the driver
sleeps.


6.2.5. Advanced Sleeping


Many drivers are able to meet their sleeping requirements with the
functions we have covered so far. There are situations, however, that
call for a deeper understanding of how the Linux wait queue mechanism
works. Complex locking or performance requirements can force a driver
to use lower-level functions to effect a sleep. In this section, we
look at the lower level to get an understanding of what is really
going on when a process sleeps.


6.2.5.1 How a process sleeps

If you look inside <linux/wait.h>, you see
that the data structure behind the
wait_queue_head_t type is quite simple; it
consists of a spinlock and a linked list. What goes on to that list
is a wait queue entry, which is declared with the type
wait_queue_t. This structure contains information
about the sleeping process and exactly how it would like to be woken
up.

The first step in putting a process to sleep is usually the
allocation and initialization of a wait_queue_t
structure, followed by its addition to the proper wait queue. When
everything is in place, whoever is charged with doing the wakeup will
be able to find the right processes.

The next step is to set the state of the process to mark it as being
asleep. There are several task states defined in
<linux/sched.h>.
TASK_RUNNING means that the process is able to
run, although it is not necessarily executing in the processor at any
specific moment. There are two states that indicate that a process is
asleep: TASK_INTERRUPTIBLE and
TASK_UNINTERRUPTIBLE; they correspond, of course,
to the two types of sleep. The other states are not normally of
concern to driver writers.

In the 2.6 kernel, it is not normally necessary for driver code to
manipulate the process state directly. However, should you need to do
so, the call to use is:

void set_current_state(int new_state);

In older code, you often see something like this instead:

current->state = TASK_INTERRUPTIBLE;

But changing current directly in that manner is
discouraged; such code breaks easily when data structures change. The
above code does show, however, that changing the current state of a
process does not, by itself, put it to sleep. By changing the current
state, you have changed the way the scheduler treats a process, but
you have not yet yielded the processor.

Giving up the processor is the final step, but there is one thing to
do first: you must check the condition you are sleeping for.
Failure to do this check invites a race condition; what happens if
the condition came true while you were engaged in the above process,
and some other thread has just tried to wake you up? You could miss
the wakeup altogether and sleep longer than you had intended.
Consequently, down inside code that sleeps, you typically see
something such as:

if (!condition)
    schedule();

By checking our condition after setting the
process state, we are covered against all possible sequences of
events. If the condition we are waiting for had come about before
setting the process state, we notice it in this check and do not actually
sleep. If the wakeup happens thereafter, the process is made runnable
whether or not we have actually gone to sleep yet.

The call to schedule is, of course, the way to
invoke the scheduler and yield the CPU. Whenever you call this
function, you are telling the kernel to consider which process should
be running and to switch control to that process if necessary. So you
never know how long it will be before schedule
returns to your code.

After the if test and possible call to (and return
from) schedule, there is some cleanup to be
done. Since the code no longer intends to sleep, it must ensure that
the task state is reset to TASK_RUNNING. If the
code just returned from schedule, this step is
unnecessary; that function does not return until the process is in a
runnable state. But if the call to schedule was
skipped because it was no longer necessary to sleep, the process
state will be incorrect. It is also necessary to remove the process
from the wait queue, or it may be awakened more than once.
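
Pulling these steps together, a minimal sketch of the classic
sequence (assembled by us from the description above; queue and
condition stand in for your own wait queue head and boolean test)
looks like this:

DECLARE_WAITQUEUE(wait, current);        /* our wait queue entry */

add_wait_queue(&queue, &wait);           /* 1: join the wait queue */
set_current_state(TASK_INTERRUPTIBLE);   /* 2: mark ourselves asleep */
if (!condition)                          /* 3: check the condition */
    schedule();                          /* 4: yield the processor */
set_current_state(TASK_RUNNING);         /* 5: reset the task state */
remove_wait_queue(&queue, &wait);        /* 6: leave the queue */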


6.2.5.2 Manual sleeps

In previous versions of the Linux kernel, nontrivial sleeps required
the programmer to handle
all of the above steps manually. It was a tedious process involving a
fair amount of error-prone boilerplate code. Programmers can still
code a manual sleep in that manner if they want to;
<linux/sched.h> contains all the requisite
definitions, and the kernel source abounds with examples. There is an
easier way, however.

The first step is the creation and initialization of a wait queue
entry. That is usually done with this macro:

DEFINE_WAIT(my_wait);

in which my_wait is the name of the wait queue entry
variable. You can also do things in two steps:

wait_queue_t my_wait;
init_wait(&my_wait);

But it is usually easier to put a DEFINE_WAIT line
at the top of the loop that implements your sleep.

The next step is to add your wait queue entry to the queue, and set
the process state. Both of those tasks are handled by this function:

void prepare_to_wait(wait_queue_head_t *queue,
wait_queue_t *wait,
int state);

Here, queue and wait are the
wait queue head and the process entry, respectively.
state is the new state for the process; it should
be either TASK_INTERRUPTIBLE (for interruptible
sleeps, which is usually what you want) or
TASK_UNINTERRUPTIBLE (for uninterruptible
sleeps).

After calling prepare_to_wait, the process can
call schedule, after it has checked to be
sure it still needs to wait. Once schedule
returns, it is cleanup time. That task, too, is handled by a special
function:

void finish_wait(wait_queue_head_t *queue, wait_queue_t *wait);

Thereafter, your code can test its state and see if it needs to wait
again.

We are far past due for an example. Previously we looked at the
read method for scullpipe,
which uses wait_event. The
write method in the same driver does its waiting
with prepare_to_wait and
finish_wait, instead. Normally you would not mix
methods within a single driver in this way, but we did so in order to
be able to show both ways of handling sleeps.

First, for completeness, let's look at the
write method itself:

/* How much space is free? */
static int spacefree(struct scull_pipe *dev)
{
    if (dev->rp == dev->wp)
        return dev->buffersize - 1;
    return ((dev->rp + dev->buffersize - dev->wp) % dev->buffersize) - 1;
}

static ssize_t scull_p_write(struct file *filp, const char __user *buf, size_t count,
                loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;
    int result;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;

    /* Make sure there's space to write */
    result = scull_getwritespace(dev, filp);
    if (result)
        return result; /* scull_getwritespace called up(&dev->sem) */

    /* ok, space is there, accept something */
    count = min(count, (size_t)spacefree(dev));
    if (dev->wp >= dev->rp)
        count = min(count, (size_t)(dev->end - dev->wp)); /* to end-of-buf */
    else /* the write pointer has wrapped, fill up to rp-1 */
        count = min(count, (size_t)(dev->rp - dev->wp - 1));
    PDEBUG("Going to accept %li bytes to %p from %p\n", (long)count, dev->wp, buf);
    if (copy_from_user(dev->wp, buf, count)) {
        up (&dev->sem);
        return -EFAULT;
    }
    dev->wp += count;
    if (dev->wp == dev->end)
        dev->wp = dev->buffer; /* wrapped */
    up(&dev->sem);

    /* finally, awake any reader */
    wake_up_interruptible(&dev->inq); /* blocked in read() and select() */

    /* and signal asynchronous readers, explained later in this chapter */
    if (dev->async_queue)
        kill_fasync(&dev->async_queue, SIGIO, POLL_IN);
    PDEBUG("\"%s\" did write %li bytes\n",current->comm, (long)count);
    return count;
}

This code looks similar to the read method,
except that we have pushed the code that sleeps into a separate
function called scull_getwritespace.
Its job is to ensure that there is space in the buffer for new data,
sleeping if need be until that space comes available. Once the space
is there, scull_p_write can simply copy the
user's data there, adjust the pointers, and wake up
any processes that may have been waiting to read data.

The code that handles the actual sleep is:

/* Wait for space for writing; caller must hold device semaphore.  On
 * error the semaphore will be released before returning. */
static int scull_getwritespace(struct scull_pipe *dev, struct file *filp)
{
    while (spacefree(dev) == 0) { /* full */
        DEFINE_WAIT(wait);

        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;
        PDEBUG("\"%s\" writing: going to sleep\n",current->comm);
        prepare_to_wait(&dev->outq, &wait, TASK_INTERRUPTIBLE);
        if (spacefree(dev) == 0)
            schedule();
        finish_wait(&dev->outq, &wait);
        if (signal_pending(current))
            return -ERESTARTSYS; /* signal: tell the fs layer to handle it */
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }
    return 0;
}

Note once again the containing while loop. If
space is available without sleeping, this function simply returns.
Otherwise, it must drop the device semaphore and wait. The code uses
DEFINE_WAIT to set up a wait queue entry and
prepare_to_wait to get ready for the actual
sleep. Then comes the obligatory check on the buffer; we must handle
the case in which space becomes available in the buffer after we have
entered the while loop (and dropped the semaphore)
but before we put ourselves onto the wait queue. Without that check,
if the reader processes were able to completely empty the buffer in
that time, we could miss the only wakeup we would ever get and sleep
forever. Having satisfied ourselves that we must sleep, we can call
schedule.

It is worth looking again at this case: what happens if the wakeup
happens between the test in the if statement and
the call to schedule? In that case, all is well.
The wakeup resets the process state to
TASK_RUNNING and schedule
returns, although not necessarily right away. As long as the
test happens after the process has put itself on the wait queue and
changed its state, things will work.

To finish up, we call finish_wait. The call to
signal_pending tells us whether we were awakened
by a signal; if so, we need to return to the user and let them try
again later. Otherwise, we reacquire the semaphore, and test again
for free space as usual.


6.2.5.3 Exclusive waits

We have seen that when a process calls wake_up on a wait queue, all
processes waiting on that queue are made
runnable. In many cases, that is the correct behavior. In others,
however, it is possible to know ahead of time that only one of the
processes being awakened will succeed in obtaining the desired
resource, and the rest will simply have to sleep again. Each one of
those processes, however, has to obtain the processor, contend for
the resource (and any governing locks), and explicitly go back to
sleep. If the number of processes in the wait queue is large, this
"thundering herd" behavior can
seriously degrade the performance of the system.

In response to real-world thundering herd problems, the
kernel developers added an
"exclusive wait" option to the
kernel. An exclusive wait acts very much like a normal sleep, with
two important differences:

  • When a wait queue entry has the
    WQ_FLAG_EXCLUSIVE flag set, it is added to the end of
    the wait queue. Entries without that flag are, instead, added to the
    beginning.

  • When wake_up is called on a wait queue, it stops
    after waking the first process that has the
    WQ_FLAG_EXCLUSIVE flag set.


The end result is that processes performing exclusive waits are
awakened one at a time, in an orderly manner, and do not create
thundering herds. The kernel still wakes up all nonexclusive waiters
every time, however.

Employing exclusive waits within a driver is worth considering if two
conditions are met: you expect significant contention for a resource,
and waking a single process is sufficient to completely consume the
resource when it becomes available. Exclusive waits work well for the
Apache web server, for example; when a new connection comes in,
exactly one of the (often many) Apache processes on the system should
wake up to deal with it. We did not use exclusive waits in the
scullpipe driver, however; it is rare to see
readers contending for data (or writers for buffer space), and we
cannot know that one reader, once awakened, will consume all of the
available data.

Putting a process into an exclusive wait is a simple matter of
calling prepare_to_wait_exclusive:

void prepare_to_wait_exclusive(wait_queue_head_t *queue,
wait_queue_t *wait,
int state);

This call, when used in place of
prepare_to_wait, sets the
"exclusive" flag in the wait queue
entry and adds the process to the end of the wait queue. Note that
there is no way to perform exclusive waits with
wait_event and its variants.
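
To see it in context, here is a minimal sketch (ours, patterned on
the scull_getwritespace loop shown earlier; dev->queue and
resource_available are hypothetical names):

DEFINE_WAIT(wait);

while (!resource_available(dev)) {
    prepare_to_wait_exclusive(&dev->queue, &wait, TASK_INTERRUPTIBLE);
    if (!resource_available(dev))
        schedule();            /* sleep at the end of the queue */
    finish_wait(&dev->queue, &wait);
    if (signal_pending(current))
        return -ERESTARTSYS;   /* let the upper layers handle it */
}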


6.2.5.4 The details of waking up

The view we have presented of the wakeup process is simpler than what
really happens inside the kernel. The actual behavior that results
when a process is awakened is controlled by a function in the wait
queue entry. The default wakeup function[3] sets the process into a runnable state and, possibly,
performs a context switch to that process if it has a higher
priority. Device drivers should never need to supply a different wake
function; should yours prove to be the exception, see
<linux/wait.h> for information on how to
do it.

[3] It has the
imaginative name default_wake_function.


We have not yet seen all the variations of
wake_up. Most driver writers never need the
others, but, for completeness, here is the full set:

wake_up(wait_queue_head_t *queue);

wake_up_interruptible(wait_queue_head_t *queue);


wake_up awakens every process on the queue that
is not in an exclusive wait, and exactly one exclusive waiter, if it
exists. wake_up_interruptible does the same,
with the exception that it skips over processes in an uninterruptible
sleep. These functions can, before returning, cause one or more of
the processes awakened to be scheduled (although this does not happen
if they are called from an atomic context).


wake_up_nr(wait_queue_head_t *queue, int nr);

wake_up_interruptible_nr(wait_queue_head_t *queue, int nr);


These functions perform similarly to wake_up,
except they can awaken up to nr exclusive waiters,
instead of just one. Note that passing 0 is interpreted as asking for
all of the exclusive waiters to be awakened,
rather than none of them.


wake_up_all(wait_queue_head_t *queue);

wake_up_interruptible_all(wait_queue_head_t *queue);


This form of wake_up awakens all processes
whether they are performing an exclusive wait or not (though the
interruptible form still skips processes doing uninterruptible
waits).


wake_up_interruptible_sync(wait_queue_head_t *queue);


Normally, a process that is awakened may preempt the current process
and be scheduled into the processor before
wake_up returns. In other words, a call to
wake_up may not be atomic. If the process
calling wake_up is running in an atomic context
(it holds a spinlock, for example, or is an interrupt handler), this
rescheduling does not happen. Normally, that protection is adequate.
If, however, you need to explicitly ask to not be scheduled out of
the processor at this time, you can use the
"sync" variant of
wake_up_interruptible. This function is most
often used when the caller is about to reschedule anyway, and it is
more efficient to simply finish what little work remains first.



If all of the above is not entirely clear on a first reading,
don't worry. Very few drivers ever need to call
anything except wake_up_interruptible.


6.2.5.5 Ancient history: sleep_on

If you spend any time digging through the kernel source, you will
likely encounter two functions that we have neglected to discuss so
far:

void sleep_on(wait_queue_head_t *queue);
void interruptible_sleep_on(wait_queue_head_t *queue);

As you might expect, these functions unconditionally put the current
process to sleep on the given queue. These
functions are strongly deprecated, however, and you should never use
them. The problem is obvious if you think about it: sleep_on offers
no way to protect against race conditions. There is always a
window between when your code decides it must sleep and when
sleep_on actually effects that sleep. A wakeup
that arrives during that window is missed. For this reason, code that
calls sleep_on is never entirely safe.

Current plans call for sleep_on and its variants
(there are a couple of time-out forms we haven't
shown) to be removed from the kernel in the not-too-distant future.


6.2.6. Testing the Scullpipe Driver


We have seen
how the scullpipe
driver implements blocking I/O. If you wish to try it out, the source
to this driver can be found with the rest of the book examples.
Blocking I/O in action can be seen by opening two windows. The first
can run a command such as cat
/dev/scullpipe. If you then, in another window,
copy a file to /dev/scullpipe, you should see
that file's contents appear in the first window.

Testing nonblocking activity is trickier, because the conventional
programs available to a shell don't perform
nonblocking operations. The
misc-progs
source directory contains the
following simple program, called
nbtest
,
for testing nonblocking operations. All it does is copy its input to
its output, using nonblocking I/O and delaying between retries. The
delay time is passed on the command line and is one second by
default.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>

char buffer[4096];

int main(int argc, char **argv)
{
    int delay = 1, n, m = 0;

    if (argc > 1)
        delay = atoi(argv[1]);
    fcntl(0, F_SETFL, fcntl(0,F_GETFL) | O_NONBLOCK); /* stdin */
    fcntl(1, F_SETFL, fcntl(1,F_GETFL) | O_NONBLOCK); /* stdout */

    while (1) {
        n = read(0, buffer, 4096);
        if (n >= 0)
            m = write(1, buffer, n);
        if ((n < 0 || m < 0) && (errno != EAGAIN))
            break;
        sleep(delay);
    }
    perror(n < 0 ? "stdin" : "stdout");
    exit(1);
}

If you run this program under a process tracing utility such
as strace, you can see the success or failure of
each operation, depending on whether data is
available when the operation is tried.

