Linux Device Drivers (3rd Edition) [Electronic resources]

Jonathan Corbet, Greg Kroah-Hartman, Alessandro Rubini

15.3. Performing Direct I/O


Most I/O operations are buffered through the kernel. The use of
a kernel-space buffer allows a degree of separation between user
space and the actual device; this separation can make programming
easier and can also yield performance benefits in many situations.
There are cases, however, where it can be beneficial to perform I/O
directly to or from a user-space buffer. If the amount of data being
transferred is large, transferring data directly without an extra
copy through kernel space can speed things up.

One example of direct I/O use in the 2.6 kernel is the SCSI tape
driver. Streaming tapes can pass a lot of data through the system,
and tape transfers are usually record-oriented, so there is little
benefit to buffering data in the kernel. So, when the conditions are
right (the user-space buffer is page-aligned, for example), the SCSI
tape driver performs its I/O without copying the data.

That said, it is important to recognize that direct I/O does not
always provide the performance boost that one might expect. The
overhead of setting up direct I/O (which involves faulting in and
pinning down the relevant user pages) can be significant, and the
benefits of buffered I/O are lost. For example, the use of direct I/O
requires that the write system call operate
synchronously; otherwise the application does not know when it can
reuse its I/O buffer. Stopping the application until each write
completes can slow things down, which is why applications that use
direct I/O often use asynchronous I/O operations as well.

The real moral of the story, in any case, is that implementing direct
I/O in a char driver is usually unnecessary and can be hurtful. You
should take that step only if you are sure that the overhead of
buffered I/O is truly slowing things down. Note also that block and
network drivers need not worry about implementing direct I/O at all;
in both cases, higher-level code in the kernel sets up and makes use
of direct I/O when it is indicated, and driver-level code need not
even know that direct I/O is being performed.

The key to implementing direct I/O in the 2.6 kernel is a function
called get_user_pages, which is declared in
<linux/mm.h> with the following prototype:

int get_user_pages(struct task_struct *tsk,
                   struct mm_struct *mm,
                   unsigned long start,
                   int len,
                   int write,
                   int force,
                   struct page **pages,
                   struct vm_area_struct **vmas);

This function has several arguments:

tsk


A pointer to the task performing the I/O; its main purpose is to tell
the kernel who should be charged for any page faults incurred while
setting up the buffer. This argument is almost always passed as
current.


mm


A pointer to the memory management structure describing the address
space to be mapped. The mm_struct structure is the
piece that ties together all of the parts (VMAs) of a
process's virtual address space. For driver use,
this argument should always be current->mm.


start

len


start is the (page-aligned) address of the
user-space buffer, and len is the length of the
buffer in pages.


write

force


If write is nonzero, the pages are mapped for
write access (implying, of course, that user space is performing a
read operation). The force flag tells
get_user_pages to override the protections on
the given pages to provide the requested access; drivers should
always pass 0 here.


pages

vmas


Output parameters. Upon successful completion,
pages contains a list of pointers to the
struct page structures describing the user-space
buffer, and vmas contains pointers to the
associated VMAs. The parameters should, obviously, point to arrays
capable of holding at least len pointers. Either
parameter can be NULL, but you need, at least, the
struct page pointers to
actually operate on the buffer.



get_user_pages is a low-level memory management
function, with a suitably complex interface. It also requires that
the mmap reader/writer semaphore for the address space be obtained in
read mode before the call. As a result, calls to
get_user_pages usually look something like:

down_read(&current->mm->mmap_sem);
result = get_user_pages(current, current->mm, ...);
up_read(&current->mm->mmap_sem);

The return value is the number of pages actually mapped, which could
be fewer than the number requested (but greater than zero).
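
Putting those pieces together, the mapping step in a driver might
look something like the following sketch. The helper name
dev_map_user_buffer, the byte-to-page arithmetic, and the error
handling are our own assumptions; only get_user_pages, the mmap_sem
locking, and page_cache_release (which is described later in this
section) are actual kernel interfaces.

/* Sketch only: pin a (page-aligned) user buffer for direct I/O.
 * "uaddr" is the user address, "count" a length in bytes, and
 * "pages" an array big enough to hold the resulting pointers. */
static int dev_map_user_buffer(unsigned long uaddr, size_t count,
        int write, struct page **pages)
{
    int nr_pages = (count + PAGE_SIZE - 1) >> PAGE_SHIFT;
    int result, i;

    down_read(&current->mm->mmap_sem);
    result = get_user_pages(current, current->mm, uaddr, nr_pages,
            write, 0 /* force */, pages, NULL /* vmas */);
    up_read(&current->mm->mmap_sem);

    if (result != nr_pages) {
        /* Partial (or failed) mapping; release whatever we got. */
        for (i = 0; i < result; i++)
            page_cache_release(pages[i]);
        return (result < 0) ? result : -EFAULT;
    }
    return nr_pages;
}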

Upon successful completion, the caller has a pages
array pointing to the user-space
buffer, which is locked into memory. To
operate on the buffer directly, the kernel-space code must turn each
struct page pointer into a
kernel virtual address with kmap or
kmap_atomic. Usually, however, devices for which
direct I/O is justified are using DMA operations, so your driver will
probably want to create a scatter/gather list from the array of
struct page pointers. We
discuss how to do this in Section 15.4.4.7.
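
If you do need CPU access to the buffer, a minimal sketch of the
kmap-based approach (assuming the pages and nr_pages values obtained
from get_user_pages, and a purely hypothetical process_chunk helper)
could look like:

int i;

for (i = 0; i < nr_pages; i++) {
    void *kaddr = kmap(pages[i]);      /* may sleep; see <linux/highmem.h> */
    process_chunk(kaddr, PAGE_SIZE);   /* operate on this page's worth of data */
    kunmap(pages[i]);
}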

Once your direct I/O operation is complete, you must release the user
pages. Before doing so, however, you must inform the kernel if you
changed the contents of those pages. Otherwise, the kernel may think
that the pages are "clean," meaning
that they match a copy found on the swap device, and free them
without writing them out to backing store. So, if you have changed
the pages (in response to a user-space read request), you must mark
each affected page dirty with a call to:

void SetPageDirty(struct page *page);

(This macro is defined in
<linux/page-flags.h>). Most code that
performs this operation checks first to ensure that the page is not
in the reserved part of the memory map, which is never swapped out.
Therefore, the code usually looks like:

if (! PageReserved(page))
    SetPageDirty(page);

Since user-space memory is not normally marked reserved,
this check should not strictly be necessary, but when you are getting
your hands dirty deep within the memory management subsystem, it is
best to be thorough and careful.

Regardless of whether the pages have been changed, they must be freed
from the page cache, or they stay there forever. The call to use is:

void page_cache_release(struct page *page);

This call should, of course, be made after the
page has been marked dirty, if need be.
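
Putting the dirty marking and the release together, the cleanup path
often ends up looking something like the sketch below. The function
name and the write flag are our own; PageReserved,
SetPageDirty, and page_cache_release are the real interfaces.

/* Sketch only: drop pages pinned with get_user_pages. "write" is
 * nonzero if the device stored data into the pages (that is, the
 * user-space application issued a read). */
static void dev_release_user_pages(struct page **pages, int nr_pages,
        int write)
{
    int i;

    for (i = 0; i < nr_pages; i++) {
        if (write && ! PageReserved(pages[i]))
            SetPageDirty(pages[i]);
        page_cache_release(pages[i]);
    }
}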


15.3.1. Asynchronous I/O


One of the new features added to the 2.6 kernel was the
asynchronous I/O
capability. Asynchronous I/O allows user space to initiate operations
without waiting for their completion; thus, an application can do
other processing while its I/O is in flight. A complex,
high-performance application can also use asynchronous I/O to have
multiple operations going at the same time.

The implementation of
asynchronous I/O is optional, and very few
driver authors bother; most devices do not benefit from this
capability. As we will see in the coming chapters, block and network
drivers are fully asynchronous at all times, so only char drivers are
candidates for explicit asynchronous I/O support. A char device can
benefit from this support if there are good reasons for having more
than one I/O operation outstanding at any given time. One good
example is streaming tape drives, where the drive can stall and slow
down significantly if I/O operations do not arrive quickly enough. An
application trying to get the best performance out of a streaming
drive could use asynchronous I/O to have multiple operations ready to
go at any given time.

For the rare driver author who needs to implement asynchronous I/O,
we present a quick overview of how it works. We cover asynchronous
I/O in this chapter, because its implementation almost always
involves direct I/O operations as well (if you are buffering data in
the kernel, you can usually implement asynchronous behavior without
imposing the added complexity on user space).

Drivers supporting asynchronous I/O should include
<linux/aio.h>. There are three
file_operations methods for the implementation
of asynchronous I/O:

ssize_t (*aio_read) (struct kiocb *iocb, char *buffer,
                     size_t count, loff_t offset);
ssize_t (*aio_write) (struct kiocb *iocb, const char *buffer,
                      size_t count, loff_t offset);
int (*aio_fsync) (struct kiocb *iocb, int datasync);

The aio_fsync operation is only of interest to filesystem code, so we do not
discuss it further here. The other two, aio_read
and aio_write, look very much like the regular
read and write methods but
with a couple of exceptions. One is that the
offset parameter is passed by value; asynchronous
operations never change the file position, so there is no reason to
pass a pointer to it. These methods also take the
iocb ("I/O control
block") parameter, which we get to in a moment.

The purpose of the aio_read and
aio_write methods is to initiate a read or write
operation that may or may not be complete by the time they return. If
it is possible to complete the operation
immediately, the method should do so and return the usual status: the
number of bytes transferred or a negative error code. Thus, if your
driver has a read method called
my_read, the following
aio_read method is entirely correct (though
rather pointless):

static ssize_t my_aio_read(struct kiocb *iocb, char *buffer,
                           size_t count, loff_t offset)
{
    return my_read(iocb->ki_filp, buffer, count, &offset);
}

Note that the struct file
pointer is found in the ki_filp field of the
kiocb structure.

If you support asynchronous I/O, you must be aware of the fact that
the kernel can, on occasion, create "synchronous
IOCBs." These are, essentially, asynchronous
operations that must actually be executed synchronously. One may well
wonder why things are done this way, but it's best
to just do what the kernel asks. Synchronous operations are marked in
the IOCB; your driver should query that status with:

int is_sync_kiocb(struct kiocb *iocb);

If this function returns a nonzero value, your driver must execute
the operation synchronously.
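
In practice, that test usually sits near the top of the asynchronous
methods. A minimal sketch, reusing the hypothetical my_read method
from the example above, might be:

/* Sketch only: honor synchronous IOCBs by completing immediately. */
if (is_sync_kiocb(iocb))
    return my_read(iocb->ki_filp, buffer, count, &offset);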

In the end, however, the point of all this structure is to enable
asynchronous operations. If your driver is able to initiate the
operation (or, simply, to queue it until some future time when it can
be executed), it must do two things: remember everything it needs to
know about the operation, and return -EIOCBQUEUED
to the caller. Remembering the operation information includes
arranging access to the user-space buffer; once you return, you will
not again have the opportunity to access that buffer while running in
the context of the calling process. In general, that means you will
likely have to set up a direct kernel mapping (with
get_user_pages) or a DMA mapping. The
-EIOCBQUEUED error code indicates that the
operation is not yet complete, and its final status will be posted
later.

When "later" comes, your driver
must inform the kernel that the operation has completed. That is done
with a call to aio_complete:

int aio_complete(struct kiocb *iocb, long res, long res2);

Here, iocb is the same IOCB that was initially
passed to you, and res is the usual result status
for the operation. res2 is a second result code
that will be returned to user space; most asynchronous I/O
implementations pass res2 as 0.
Once you call aio_complete, you should not touch
the IOCB or user buffer again.


15.3.1.1 An asynchronous I/O example

The page-oriented scullp driver in the example
source implements asynchronous I/O. The implementation is simple, but
it is enough to show how asynchronous operations should be
structured.

The aio_read and aio_write
methods don't actually do much:

static ssize_t scullp_aio_read(struct kiocb *iocb, char *buf, size_t count,
        loff_t pos)
{
    return scullp_defer_op(0, iocb, buf, count, pos);
}

static ssize_t scullp_aio_write(struct kiocb *iocb, const char *buf,
        size_t count, loff_t pos)
{
    return scullp_defer_op(1, iocb, (char *) buf, count, pos);
}

These methods simply call a common function:

struct async_work {
    struct kiocb *iocb;
    int result;
    struct work_struct work;
};

static int scullp_defer_op(int write, struct kiocb *iocb, char *buf,
        size_t count, loff_t pos)
{
    struct async_work *stuff;
    int result;

    /* Copy now while we can access the buffer */
    if (write)
        result = scullp_write(iocb->ki_filp, buf, count, &pos);
    else
        result = scullp_read(iocb->ki_filp, buf, count, &pos);

    /* If this is a synchronous IOCB, we return our status now. */
    if (is_sync_kiocb(iocb))
        return result;

    /* Otherwise defer the completion for a few milliseconds. */
    stuff = kmalloc (sizeof (*stuff), GFP_KERNEL);
    if (stuff == NULL)
        return result; /* No memory, just complete now */
    stuff->iocb = iocb;
    stuff->result = result;
    INIT_WORK(&stuff->work, scullp_do_deferred_op, stuff);
    schedule_delayed_work(&stuff->work, HZ/100);
    return -EIOCBQUEUED;
}

A more complete implementation would use
get_user_pages to map the user buffer into
kernel space. We chose to keep life simple by just copying over the
data at the outset. Then a call is made to
is_sync_kiocb to see if this operation must be
completed synchronously; if so, the result status is returned, and we
are done. Otherwise we remember the relevant information in a little
structure, arrange for "completion"
via a workqueue, and return -EIOCBQUEUED. At this
point, control returns to user space.

Later on, the workqueue executes our completion function:

static void scullp_do_deferred_op(void *p)
{
    struct async_work *stuff = (struct async_work *) p;
    aio_complete(stuff->iocb, stuff->result, 0);
    kfree(stuff);
}

Here, it is simply a matter of calling
aio_complete with our saved information. A real
driver's asynchronous I/O implementation is somewhat
more complicated, of course, but it follows this sort of structure.
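
As a rough illustration of that more complicated shape, a zero-copy
variant might pin the user pages in the aio method, start the
transfer, and complete the IOCB from the driver's own
transfer-complete handling. In the sketch below, everything other
than kmalloc, kfree, and aio_complete is our own invention, including
dev_map_user_buffer and dev_release_user_pages from the earlier
sketches in this section, DEV_MAX_PAGES, and dev_start_dma.

struct dev_aio_op {
    struct kiocb *iocb;
    int nr_pages;
    struct page *pages[DEV_MAX_PAGES];   /* DEV_MAX_PAGES is our own limit */
};

static ssize_t dev_aio_write(struct kiocb *iocb, const char *buf,
        size_t count, loff_t pos)
{
    struct dev_aio_op *op;
    int nr;

    /* A real method would also handle is_sync_kiocb( ) as shown above. */
    op = kmalloc(sizeof(*op), GFP_KERNEL);
    if (op == NULL)
        return -ENOMEM;
    /* The device only reads these pages, so no write access is needed. */
    nr = dev_map_user_buffer((unsigned long) buf, count, 0, op->pages);
    if (nr < 0) {
        kfree(op);
        return nr;
    }
    op->iocb = iocb;
    op->nr_pages = nr;
    dev_start_dma(op);                   /* queue the actual transfer */
    return -EIOCBQUEUED;
}

/* Called from the driver's transfer-complete handling. */
static void dev_aio_done(struct dev_aio_op *op, long result)
{
    dev_release_user_pages(op->pages, op->nr_pages, 0 /* not dirtied */);
    aio_complete(op->iocb, result, 0);
    kfree(op);
}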

