Linux Device Drivers (3rd Edition)

Jonathan Corbet, Greg Kroah-Hartman, Alessandro Rubini


6.3. poll and select


Applications that use nonblocking I/O often use the
poll, select, and
epoll system calls as well.
poll, select, and
epoll have essentially the same functionality:
each allows a process to determine whether it can read from or write
to one or more open files without blocking. These calls can also
block a process until any of a given set of file descriptors becomes
available for reading or writing. Therefore, they are often used in
applications that must use multiple input or output streams without
getting stuck on any one of them. The same functionality is offered
by multiple functions, because two were implemented in Unix almost at
the same time by two different groups: select
was introduced in BSD Unix, whereas poll was the
System V solution. The epoll call[4] was added in 2.5.45 as a way of making the polling
function scale to thousands of file descriptors.

[4] Actually, epoll is a set of three calls
that together can be used to achieve the polling functionality. For
our purposes, though, we can think of it as a single call.


Support for any of these calls requires support from the device
driver. This support (for all three calls) is provided through the
driver's poll method. This
method has the following prototype:

unsigned int (*poll) (struct file *filp, poll_table *wait);

The driver method is called whenever the user-space program performs
a poll, select, or
epoll system call involving a file descriptor
associated with the driver. The device method is in charge of these
two steps:


  1. Call
    poll_wait on one or more wait queues that could
    indicate a change in the poll status. If no file descriptors are
    currently available for I/O, the kernel causes the process to wait on
    the wait queues for all file descriptors passed to the system call.

  2. Return a bit mask describing the operations (if any) that could be
    immediately performed without blocking.

Both of these operations are usually straightforward and tend to look
very similar from one driver to the next. They rely, however, on
information that only the driver can provide and, therefore, must be
implemented individually by each driver.


The poll_table
structure, the second argument to the poll
method, is used within the kernel to implement the
poll, select, and
epoll calls; it is declared in
<linux/poll.h>, which must be included by
the driver source. Driver writers do not need to know anything about
its internals and must use it as an opaque object; it is passed to
the driver method so that the driver can load it with every wait
queue that could wake up the process and change the status of the
poll operation. The driver adds a wait queue to
the poll_table structure by calling the function
poll_wait:

 void poll_wait (struct file *, wait_queue_head_t *, poll_table *);

The second task performed by the poll method is
returning the bit mask describing which operations could be completed
immediately; this is also straightforward. For example, if the device
has data available, a read would complete
without sleeping; the poll method should
indicate this state of affairs. Several flags (defined via
<linux/poll.h>) are used to indicate the
possible operations:

POLLIN

This bit must be set if the device can be read without blocking.


POLLRDNORM

This bit must be set if "normal" data is available for reading. A
readable device returns (POLLIN | POLLRDNORM).


POLLRDBAND

This bit indicates that out-of-band data is available for reading
from the device. It is currently used only in one place in the Linux
kernel (the DECnet code) and is not generally applicable to device
drivers.


POLLPRI

High-priority data (out-of-band) can be read without blocking. This
bit causes select to report that an exception condition occurred on
the file, because select reports out-of-band data as an exception
condition.


POLLHUP

When a process reading this device sees end-of-file, the driver must
set POLLHUP (hang-up). A process calling select
is told that the device is readable, as dictated by the
select functionality.


POLLERR

An error condition has occurred on the device. When
poll is invoked, the device is reported as both
readable and writable, since both read and
write return an error code without blocking.


POLLOUT

This bit is set in the return value if the device can be written to
without blocking.


POLLWRNORM

This bit has the same meaning as POLLOUT, and sometimes
it actually is the same number. A writable device returns
(POLLOUT | POLLWRNORM).


POLLWRBAND

Like POLLRDBAND, this bit means that data with nonzero
priority can be written to the device. Only the datagram
implementation of poll uses this bit, since a
datagram can transmit out-of-band data.


It's worth repeating that
POLLRDBAND and POLLWRBAND are
meaningful only with file descriptors associated with sockets: device
drivers won't normally use these flags.

The description of poll takes up a lot of space
for something that is relatively simple to use in practice. Consider
the scullpipe implementation of the
poll method:

static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;

    /*
     * The buffer is circular; it is considered full
     * if "wp" is right behind "rp" and empty if the
     * two are equal.
     */
    down(&dev->sem);
    poll_wait(filp, &dev->inq, wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM;    /* readable */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM;   /* writable */
    up(&dev->sem);
    return mask;
}

This code simply adds the two scullpipe wait
queues to the poll_table, then sets the
appropriate mask bits depending on whether data can be read or
written.
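
Seen from user space, scull_p_poll is what a call to poll(2) ends up
invoking. The following sketch is illustrative only: the device node
name /dev/scullpipe0 is an assumption about how the scull devices
were created on your system, and error handling is kept to a minimum.

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <poll.h>

int main(void)
{
    /* /dev/scullpipe0 is an assumed device node name */
    int fd = open("/dev/scullpipe0", O_RDWR | O_NONBLOCK);
    struct pollfd pfd;
    int n;

    if (fd < 0) {
        perror("open");
        return 1;
    }
    pfd.fd = fd;
    pfd.events = POLLIN | POLLOUT;      /* interested in both directions */

    /* Wait up to five seconds; the kernel ends up calling scull_p_poll */
    n = poll(&pfd, 1, 5000);
    if (n < 0)
        perror("poll");
    else if (n == 0)
        printf("timeout: neither readable nor writable\n");
    else {
        if (pfd.revents & POLLIN)
            printf("device is readable\n");
        if (pfd.revents & POLLOUT)
            printf("device is writable\n");
    }
    close(fd);
    return 0;
}

The revents field is filled in from the mask returned by the
driver's poll method, restricted to the events the caller asked for
plus the error and hang-up bits.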

The poll code as
shown is missing end-of-file support, because
scullpipe does not support an end-of-file
condition. For most real devices, the poll
method should return POLLHUP if no more data is
(or will become) available. If the caller used the
select system call, the file is reported as
readable. Regardless of whether poll or
select is used, the application knows that it
can call read without waiting forever, and the
read method returns 0 to
signal end-of-file.

With real FIFOs, for example, the reader
sees an end-of-file when all the writers close the file, whereas in
scullpipe the reader never sees end-of-file. The
behavior is different because a FIFO is intended to be a
communication channel between two processes, while
scullpipe is a trash can where everyone can put
data as long as there's at least one reader.
Moreover, it makes no sense to reimplement what is already available
in the kernel, so we chose to implement a different behavior in our
example.

Implementing end-of-file in the same way as FIFOs do would mean
checking dev->nwriters, both in
read and in poll, and
reporting end-of-file (as just described) if no process has the
device opened for writing. Unfortunately, though, with this
implementation, if a reader opened the scullpipe
device before the writer, it would see end-of-file without having a
chance to wait for data. The best way to fix this problem would be to
implement blocking within open like real FIFOs
do; this task is left as an exercise for the reader.
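
For the curious, here is a hedged sketch of that FIFO-like variant: a
drop-in replacement for scull_p_poll that assumes the full
scullpipe source, where open and
release maintain dev->nwriters. As noted
above, it still suffers from the reader-before-writer problem.

/* Variant sketch: report end-of-file when no writer has the device open.
 * Assumes dev->nwriters is kept up to date by open and release. */
static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;

    down(&dev->sem);
    poll_wait(filp, &dev->inq, wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM;    /* readable */
    else if (dev->nwriters == 0)
        mask |= POLLHUP;                /* no data and no writers: EOF */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM;   /* writable */
    up(&dev->sem);
    return mask;
}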


6.3.1. Interaction with read and write


The purpose of the poll and
select calls is to determine in advance if an
I/O operation will block. In that respect, they complement
read and write. More
important, poll and select
are useful, because they let the application wait simultaneously for
several data streams, although we are not exploiting this feature in
the scull examples.

A correct implementation of the three calls is essential to make
applications work correctly: although the following rules have more
or less already been stated, we summarize them here.


6.3.1.1 Reading data from the device


  • If there is data in the input
    buffer, the read call should return immediately,
    with no noticeable delay, even if less data is available than the
    application requested, and the driver is sure the remaining data will
    arrive soon. You can always return less data than
    you're asked for if this is convenient for any
    reason (we did it in scull), provided you return
    at least one byte. In this case, poll should
    return POLLIN|POLLRDNORM.


  • If there is no data in the input
    buffer, by default read must block until at
    least one byte is there. If O_NONBLOCK is set, on
    the other hand, read returns immediately with a
    return value of -EAGAIN (although some old
    versions of System V return 0 in this case). In
    these cases, poll must report that the device is
    unreadable until at least one byte arrives. As soon as there is some
    data in the buffer, we fall back to the previous case. (A read
    method following these rules is sketched after this list.)

  • If we are at end-of-file, read should return
    immediately with a return value of 0, independent
    of O_NONBLOCK. poll should
    report POLLHUP in this case.
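
scullpipe's read, shown earlier in this chapter, already follows
these rules; the following condensed sketch shows the pattern,
reusing the scull_pipe field names (sem,
inq, outq, buffer,
end, rp, wp) and
leaving out the bookkeeping a real driver would need.

static ssize_t sketch_read(struct file *filp, char __user *buf,
        size_t count, loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    while (dev->rp == dev->wp) {              /* nothing to read yet */
        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;                   /* second rule: don't block */
        if (wait_event_interruptible(dev->inq, (dev->rp != dev->wp)))
            return -ERESTARTSYS;
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }

    /* First rule: return what is available now, even if it is less
     * than the caller asked for. */
    if (dev->wp > dev->rp)
        count = min(count, (size_t)(dev->wp - dev->rp));
    else                                      /* the write pointer wrapped */
        count = min(count, (size_t)(dev->end - dev->rp));
    if (copy_to_user(buf, dev->rp, count)) {
        up(&dev->sem);
        return -EFAULT;
    }
    dev->rp += count;
    if (dev->rp == dev->end)
        dev->rp = dev->buffer;                /* wrap around */
    up(&dev->sem);

    wake_up_interruptible(&dev->outq);        /* there is room for writers */
    return count;
}

Returning 0 for end-of-file (the third rule) is omitted because, as
noted earlier, scullpipe has no end-of-file condition.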



6.3.1.2 Writing to the device


  • If there is space in the output buffer,
    write should return without delay. It can accept
    less data than the call requested, but it must accept at least one
    byte. In this case, poll reports that the device
    is writable by returning POLLOUT|POLLWRNORM.

  • If the output buffer is full, by default write
    blocks until some space is freed. If O_NONBLOCK is
    set, write returns immediately with a return
    value of -EAGAIN (older System V Unices returned
    0). In these cases, poll
    should report that the file is not writable. If, on the other hand,
    the device is not able to accept any more data,
    write returns -ENOSPC
    ("No space left on device"),
    independently of the setting of O_NONBLOCK. (A corresponding
    write sketch appears after this list.)

  • Never make a write call wait for data
    transmission before returning, even if O_NONBLOCK
    is clear. This is because many applications use
    select to find out whether a
    write will block. If the device is reported as
    writable, the call must not block. If the program using the device
    wants to ensure that the data it enqueues in the output buffer is
    actually transmitted, the driver must provide an
    fsync method. For instance, a removable device
    should have an fsync entry point.
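
A corresponding sketch for the write side, again borrowing the
scullpipe names and its spacefree helper. Note
that it only queues data into the buffer; it never waits for actual
transmission, as the last rule requires.

static ssize_t sketch_write(struct file *filp, const char __user *buf,
        size_t count, loff_t *f_pos)
{
    struct scull_pipe *dev = filp->private_data;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    while (spacefree(dev) == 0) {             /* output buffer is full */
        up(&dev->sem);
        if (filp->f_flags & O_NONBLOCK)
            return -EAGAIN;                   /* report it, don't block */
        if (wait_event_interruptible(dev->outq, spacefree(dev) != 0))
            return -ERESTARTSYS;
        if (down_interruptible(&dev->sem))
            return -ERESTARTSYS;
    }

    /* Accept at least one byte, never fill the buffer completely,
     * and copy only into the contiguous region up to "end" or "rp". */
    count = min(count, (size_t)spacefree(dev));
    if (dev->wp >= dev->rp)
        count = min(count, (size_t)(dev->end - dev->wp));
    else
        count = min(count, (size_t)(dev->rp - dev->wp - 1));
    if (copy_from_user(dev->wp, buf, count)) {
        up(&dev->sem);
        return -EFAULT;
    }
    dev->wp += count;
    if (dev->wp == dev->end)
        dev->wp = dev->buffer;                /* wrap around */
    up(&dev->sem);

    wake_up_interruptible(&dev->inq);         /* data for any readers */
    return count;
}

The spacefree test in the loop condition matches the one the
poll method uses to decide whether to report
POLLOUT, which keeps the two views of "writable" consistent.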


Although this is a good set of general rules, one should also
recognize that each device is unique and that sometimes the rules
must be bent slightly. For example, record-oriented devices (such as
tape drives) cannot execute partial writes.


6.3.1.3 Flushing pending output




We've seen how
the write method by itself
doesn't account for all data output needs. The
fsync function, invoked by the system call of
the same name, fills the gap. This method's
prototype is

 int (*fsync) (struct file *file, struct dentry *dentry, int datasync);

If
some application ever needs to be assured that data has been sent to
the device, the fsync method must be implemented
regardless of whether O_NONBLOCK is set. A call to
fsync should return only when the device has
been completely flushed (i.e., the output buffer is empty), even if
that takes some time. The datasync argument is
used to distinguish between the fsync and
fdatasync system calls; as such, it is only of
interest to filesystem code and can be ignored by drivers.
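
What such a method looks like depends entirely on how the device
buffers its output. As an illustration only (my_dev,
outbuf_empty, and drain_wait are invented names,
not part of any kernel interface), an fsync that
waits for a driver-maintained transmit buffer to drain could be as
simple as:

/* Illustrative sketch: wait until our hypothetical output buffer is empty */
static int my_fsync(struct file *filp, struct dentry *dentry, int datasync)
{
    struct my_dev *dev = filp->private_data;

    /* datasync matters only to filesystems; drivers may ignore it */
    return wait_event_interruptible(dev->drain_wait, outbuf_empty(dev));
}

The write path or the interrupt handler would be expected to wake
dev->drain_wait whenever the buffer empties.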


The
fsync method has no unusual features. The call
isn't time critical, so every device driver can
implement it to the author's taste. Most of the
time, char drivers just have a NULL pointer in
their fops. Block devices, on the other hand,
always implement the method with the general-purpose
block_fsync, which, in turn, flushes all the
blocks of the device, waiting for I/O to complete.


6.3.2. The Underlying Data Structure


The actual implementation of the
poll and select system
calls is reasonably simple, for those who are interested in how it
works; epoll is a bit more complex but is built
on the same mechanism. Whenever a user application calls
poll, select, or
epoll_ctl,[5] the kernel
invokes the poll method of all files referenced
by the system call, passing the same poll_table to
each of them. The poll_table structure is just a
wrapper around a function that builds the actual data structure. That
structure, for poll and
select, is a linked list of memory pages
containing poll_table_entry structures. Each
poll_table_entry holds the
struct file and
wait_queue_head_t pointers passed to
poll_wait, along with an associated wait queue
entry. The call to poll_wait sometimes also adds
the process to the given wait queue. The whole structure must be
maintained by the kernel so that the process can be removed from all
of those queues before poll or
select returns.

[5] This is the function
that sets up the internal data structure for future calls to
epoll_wait.
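
For reference, the pieces just described look approximately like
this in the 2.6 kernels; this is a simplified rendering, and the
authoritative definitions are in <linux/poll.h>:

/* Simplified sketch of the declarations in <linux/poll.h> */
typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *,
                                struct poll_table_struct *);

typedef struct poll_table_struct {
    poll_queue_proc qproc;              /* builds the real wait-queue list */
} poll_table;

struct poll_table_entry {
    struct file *filp;                  /* the file passed to poll_wait */
    wait_queue_head_t *wait_address;    /* the queue passed to poll_wait */
    wait_queue_t wait;                  /* our entry on that wait queue */
};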


If none of the drivers being polled indicates that I/O can occur
without blocking, the poll call simply sleeps
until one of the (perhaps many) wait queues it is on wakes it up.

What's interesting in the implementation of
poll is that the driver's
poll method may be called with a
NULL pointer as a poll_table
argument. This situation can come about for a couple of reasons. If
the application calling poll has provided a
timeout value of 0 (indicating that no wait should
be done), there is no reason to accumulate wait queues, and the
system simply does not do it. The poll_table
pointer is also set to NULL immediately after any
driver being polled indicates that I/O is
possible. Since the kernel knows at that point that no wait will
occur, it does not build up a list of wait queues.
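
Drivers need not test for this case themselves; poll_wait is a thin
inline that quietly does nothing when handed a NULL table. In the
2.6 kernels it looks approximately like this (simplified from
<linux/poll.h>):

static inline void poll_wait(struct file *filp,
        wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && wait_address)
        p->qproc(filp, wait_address, p);    /* no-op if p is NULL */
}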

When the poll call completes, the
poll_table structure is deallocated, and all wait
queue entries previously added to the poll table (if any) are removed
from the table and their wait queues.

We tried to show the data structures involved in polling in Figure 6-1; the figure is a
simplified representation of the real data structures, because it
ignores the multipage nature of a poll table and disregards the file
pointer that is part of each poll_table_entry. The
reader interested in the actual implementation is urged to look in
<linux/poll.h> and
fs/select.c.


Figure 6-1. The data structures behind poll

At this point, it is possible to understand the motivation behind the
new epoll system call. In a typical case, a call
to poll or select involves
only a handful of file descriptors, so the cost of setting up the
data structure is small. There are applications out there, however,
that work with thousands of file descriptors. At that point, setting
up and tearing down this data structure between every I/O operation
becomes prohibitively expensive. The epoll
system call family allows this sort of application to set up the
internal kernel data structure exactly once and to use it
many times.
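
A user-space sketch of that pattern (again assuming a
/dev/scullpipe0 device node): the epoll descriptor and its interest
list are set up once, and only epoll_wait runs in
the loop.

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/epoll.h>

int main(void)
{
    struct epoll_event ev, out;
    char buf[256];
    ssize_t len;
    int fd, epfd, n;

    /* /dev/scullpipe0 is an assumed device node name */
    fd = open("/dev/scullpipe0", O_RDONLY | O_NONBLOCK);
    epfd = epoll_create(1);             /* set up the kernel structure once */
    if (fd < 0 || epfd < 0) {
        perror("setup");
        return 1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        perror("epoll_ctl");            /* the driver's poll runs here */
        return 1;
    }

    for (;;) {                          /* ...and is reused on every pass */
        n = epoll_wait(epfd, &out, 1, -1);
        if (n <= 0)
            break;
        if (out.events & EPOLLIN) {
            len = read(out.data.fd, buf, sizeof(buf));
            if (len <= 0)
                break;                  /* error or end-of-file */
            printf("got %d bytes\n", (int)len);
        }
    }
    close(epfd);
    close(fd);
    return 0;
}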

