Linux Device Drivers (3rd Edition)

Jonathan Corbet, Greg Kroah-Hartman, Alessandro Rubini

16.3. Request Processing


The core of every block driver
is its
request function. This function is where the
real work gets done, or at least started; all the rest is
overhead. Consequently, we spend a fair amount of time looking at
request processing in block drivers.

A disk driver's performance can be a critical part
of the performance of the system as a whole. Therefore, the
kernel's block subsystem has been written with
performance very much in mind; it does everything possible to enable
your driver to get the most out of the devices it controls. This is a
good thing, in that it enables blindingly fast I/O. On the other
hand, the block subsystem necessarily exposes a great deal of
complexity in the driver API. It is possible to write a very simple
request function (we will see one shortly), but
if your driver must perform at a high level on complex hardware, it
will be anything but simple.


16.3.1. Introduction to the request Method


The block driver request method has the
following prototype:

void request(request_queue_t *queue);

This function is called whenever the kernel believes it is time for
your driver to process some reads, writes, or other operations on the
device. The request function does not need to
actually complete all of the requests on the queue before it returns;
indeed, it probably does not complete any of them for most real
devices. It must, however, make a start on those requests and ensure
that they are all, eventually, processed by the driver.

Every device has a request queue. This is because actual transfers to
and from a disk can take place far away from the time the kernel
requests them, and because the kernel needs the flexibility to
schedule each transfer at the most propitious moment (grouping
together, for instance, requests that affect sectors close together
on the disk). And the request function, you may
remember, is associated with a request queue when that queue is
created. Let us look back at how sbull makes its
queue:

dev->queue = blk_init_queue(sbull_request, &dev->lock);

Thus, when the queue
is created, the
request function is associated with it. We also
provided a spinlock as part of the queue creation process. Whenever
our request function is called, that lock is
held by the kernel. As a result, the request
function is running in an atomic context; it must follow all of the
usual rules for atomic code discussed in Chapter 5.

The queue lock also prevents the kernel from queuing any other
requests for your device while your request
function holds the lock. Under some conditions, you may want to
consider dropping that lock while the request
function runs. If you do so, however, you must be sure not to access
the request queue, or any other data structure protected by the lock,
while the lock is not held. You must also reacquire the lock before
the request function returns.
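
If you do drop the lock, a pattern like the following can work. This
is only a rough sketch: my_start_io is a hypothetical function that
starts the hardware on one request, blkdev_dequeue_request is
described later in this chapter, and the sketch assumes the queue
lock was taken with interrupts disabled, as the block layer normally
does.

static void my_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        /* Take the request off the queue before dropping the lock, so
           that nothing else touches it while we work unlocked. */
        blkdev_dequeue_request(req);

        spin_unlock_irq(q->queue_lock);  /* the lock passed to blk_init_queue */
        my_start_io(req);                /* hypothetical; runs without the lock */
        spin_lock_irq(q->queue_lock);    /* reacquire before touching the queue */
    }
}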

Finally, the invocation of the request function
is (usually) entirely asynchronous with respect to the actions of any
user-space process. You cannot assume that the kernel is running in
the context of the process that initiated the current request. You do
not know if the I/O buffer provided by the request is in kernel or
user space. So any sort of operation that explicitly accesses user
space is in error and will certainly lead to trouble. As you will
see, everything your driver needs to know about the request is
contained within the structures passed to you via the request queue.


16.3.2. A Simple request Method


The sbull example driver provides a few different methods for
request processing. By default, sbull uses a
method called sbull_request, which is meant to
be an example of the simplest possible request
method. Without further ado, here it is:

static void sbull_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        struct sbull_dev *dev = req->rq_disk->private_data;
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sbull_transfer(dev, req->sector, req->current_nr_sectors,
                req->buffer, rq_data_dir(req));
        end_request(req, 1);
    }
}

This function introduces the struct request structure. We will
examine struct request in great detail later on; for now, suffice it
to say that it represents a block I/O request for us to execute.

The kernel provides the function elv_next_request to obtain the
first incomplete request on the queue; that function
returns NULL when there are no requests to be
processed. Note that elv_next_request does not
remove the request from the queue. If you call it twice with no
intervening operations, it returns the same
request structure both times. In this simple mode
of operation, requests are taken off the queue only when they are
complete.

A block request queue can contain requests that do not actually move
blocks to and from a disk. Such requests can include vendor-specific,
low-level diagnostics operations or instructions relating to
specialized device modes, such as the packet writing mode for
recordable media. Most block drivers do not know how to handle such
requests and simply fail them; sbull works in
this way as well. The call to block_fs_request
tells us whether we are looking at a filesystem requestone
that moves blocks of data. If a request is not a filesystem request,
we pass it to end_request:

void end_request(struct request *req, int succeeded);

When we dispose of nonfilesystem requests, we pass
succeeded as 0 to indicate that
we did not successfully complete the request. Otherwise, we call
sbull_transfer to actually move the data, using
a set of fields provided in the request structure:

sector_t sector;


The index of the beginning sector on our device. Remember that this
sector number, like all such numbers passed between the kernel and
the driver, is expressed in 512-byte sectors. If your hardware uses a
different sector size, you need to scale sector
accordingly. For example, if the hardware uses 2048-byte sectors, you
need to divide the beginning sector number by four before putting it
into a request for the hardware (a brief sketch follows this list of
fields).


unsigned long nr_sectors;


The number of (512-byte) sectors to be transferred.


char *buffer;


A pointer to the buffer to or from which the data should be
transferred. This pointer is a kernel virtual address and can be
dereferenced directly by the driver if need be.


rq_data_dir(struct request *req);


This macro extracts the direction of the transfer from the request; a
zero return value denotes a read from the device, and a nonzero
return value denotes a write to the device.
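
As a side note on the sector scaling mentioned above, here is a
minimal sketch of issuing a transfer for a hypothetical device with
2048-byte hardware sectors; my_dev and my_hw_cmd stand in for the
driver's own structures, and the sketch assumes
blk_queue_hardsect_size (described later in this chapter) has been
set to 2048, so the counts always divide evenly.

static void my_issue(struct my_dev *dev, struct request *req)
{
    unsigned int ratio = 2048 / 512;             /* four kernel sectors per hardware sector */
    sector_t hw_sector = req->sector / ratio;    /* starting sector, in device units */
    unsigned int hw_count = req->current_nr_sectors / ratio;

    my_hw_cmd(dev, hw_sector, hw_count, req->buffer, rq_data_dir(req));
}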



Given this information, the sbull driver can
implement the actual data transfer with a simple
memcpy call; our data is already in memory,
after all. The function that performs this copy operation
(sbull_transfer) also handles the scaling of
sector sizes and ensures that we do not try to copy beyond the end of
our virtual device:

static void sbull_transfer(struct sbull_dev *dev, unsigned long sector,
        unsigned long nsect, char *buffer, int write)
{
    unsigned long offset = sector*KERNEL_SECTOR_SIZE;
    unsigned long nbytes = nsect*KERNEL_SECTOR_SIZE;

    if ((offset + nbytes) > dev->size) {
        printk (KERN_NOTICE "Beyond-end write (%ld %ld)\n", offset, nbytes);
        return;
    }
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
}

With this code, sbull implements a complete,
simple RAM-based disk device. It is not, however, a realistic driver
for many types of devices, for a couple of reasons.

The first of those reasons is that sbull
executes requests synchronously, one at a time. High-performance disk
devices are capable of having numerous requests outstanding at the
same time; the disk's onboard controller can then
choose to execute them in the optimal order (one hopes). As long as
we process only the first request in the queue, we can never have
multiple requests being fulfilled at a given time. Being able to work
with more than one request requires a deeper understanding of request
queues and the request structure; the next few
sections help build that understanding.

There is another issue to consider, however. The best performance is
obtained from disk devices when the system performs large transfers
involving multiple sectors that are located together on the disk. The
highest cost in a disk operation is always the positioning of the
read and write heads; once that is done, the time required to
actually read or write the data is almost insignificant. The
developers who design and implement filesystems and virtual memory
subsystems understand this, so they do their best to locate related
data contiguously on the disk and to transfer as many sectors as
possible in a single request. The block subsystem also helps in this
regard; request queues contain a great deal of logic aimed at finding
adjacent requests and coalescing them into larger operations.

The sbull driver, however, takes all that work
and simply ignores it. Only one buffer is transferred at a time,
meaning that the largest single transfer is almost never going to
exceed the size of a single page. A block driver can do much better
than that, but it requires a deeper understanding of
request structures and the bio
structures from which requests are built.

The next few sections delve more deeply into how the block layer does
its job and the data structures that result from that work.


16.3.3. Request Queues


In the simplest sense,
a block request queue is exactly that: a
queue of block I/O requests. If you look under the hood, a request
queue turns out to be a surprisingly complex data structure.
Fortunately, drivers need not worry about most of that complexity.

Request queues keep track of outstanding block I/O requests. But they
also play a crucial role in the creation of those requests. The
request queue stores parameters that describe what kinds of requests
the device is able to service: their maximum size, how many separate
segments may go into a request, the hardware sector size, alignment
requirements, etc. If your request queue is properly configured, it
should never present you with a request that your device cannot
handle.

Request queues also implement a plug-in interface that allows multiple
I/O schedulers (or elevators) to be used. An I/O
scheduler's job is to present I/O requests to your
driver in a way that maximizes performance. To this end, most I/O
schedulers accumulate a batch of requests, sort them into increasing
(or decreasing) block index order, and present the requests to the
driver in that order. The disk head, when given a sorted list of
requests, works its way from one end of the disk to the other, much
like a full elevator moves in a single direction until all of its
"requests" (people waiting to get
off) have been satisfied. The 2.6 kernel includes a
"deadline scheduler,"
which makes an effort to ensure that every request is satisfied
within a preset maximum time, and an "anticipatory
scheduler," which actually stalls a device briefly
after a read request in anticipation that another, adjacent read will
arrive almost immediately. As of this writing, the default scheduler
is the anticipatory scheduler, which seems to give the best
interactive system performance.

The I/O scheduler is also charged with merging adjacent requests.
When a new I/O request is handed to the scheduler, it searches the
queue for requests involving adjacent sectors; if one is found and if
the resulting request would not be too large, the two requests are
merged.

Request queues have a type of struct request_queue
or request_queue_t. This type, and the many
functions that operate on it, are defined in
<linux/blkdev.h>. If you are interested in
the implementation of request queues, you can find most of the code
in drivers/block/ll_rw_blk.c and
elevator.c.


16.3.3.1 Queue creation and deletion

As we saw in our example code, a request queue is a dynamic data
structure that must be created by the block I/O subsystem. The
function to create and initialize a request queue is:

request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);

The arguments are, of course, the request
function for this queue and a spinlock that controls access to the
queue. This function allocates memory (quite a bit of memory,
actually) and can fail because of this; you should always check the
return value before attempting to use the queue.

As part of the initialization of a request queue, you can set the
field queuedata (which is a void * pointer) to any value you like.
This field is the request queue's equivalent to the private_data
we have seen in other structures.
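
In sbull-style code, that might look like the following right after
queue creation (the out_free error label is just a placeholder for the
driver's own cleanup path):

dev->queue = blk_init_queue(sbull_request, &dev->lock);
if (dev->queue == NULL)
    goto out_free;            /* the allocation can fail; always check */
dev->queue->queuedata = dev;  /* retrieve later with q->queuedata */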

To return a request queue to the system (at module unload time,
generally), call blk_cleanup_queue:

void blk_cleanup_queue(request_queue_t *);

After this call, your driver sees no more requests from the given
queue and should not reference it again.


16.3.3.2 Queueing functions

There is a very small set of functions for the manipulation of
requests on queues, at least as far as drivers are concerned. You
must hold the queue lock before you call these functions.

The function that returns the next request to process is
elv_next_request:

struct request *elv_next_request(request_queue_t *queue);

We have already seen this function in the simple
sbull example. It returns a pointer to the next
request to process (as determined by the I/O scheduler) or
NULL if no more requests remain to be processed.
elv_next_request leaves the request on the queue
but marks it as being active; this mark prevents the I/O scheduler
from attempting to merge other requests with this one once you start
to execute it.

To actually remove a request from a queue, use
blkdev_dequeue_request:

void blkdev_dequeue_request(struct request *req);

If your driver operates on multiple requests from the same queue
simultaneously, it must dequeue them in this manner.

Should you need to put a dequeued request back on the queue for some
reason, you can call:

void elv_requeue_request(request_queue_t *queue, struct request *req);
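
The following rough sketch shows how these calls typically combine in
a request function that dispatches several requests to the hardware
at once; my_device_busy and my_start_request are hypothetical
placeholders for the driver's own logic.

static void my_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        if (my_device_busy(q->queuedata))
            return;                      /* leave remaining requests queued */
        blkdev_dequeue_request(req);     /* we now own this request */
        if (my_start_request(q->queuedata, req) < 0)
            elv_requeue_request(q, req); /* could not start it; put it back */
    }
}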


16.3.3.3 Queue control functions

The block layer exports a
set of functions that can be used
by a driver to control how a request queue operates. These functions
include:

void blk_stop_queue(request_queue_t *queue);

void blk_start_queue(request_queue_t *queue);


If your device has reached a state where it can handle no more
outstanding commands, you can call
blk_stop_queue to tell the block layer. After
this call, your request function will not be
called until you call blk_start_queue. Needless
to say, you should not forget to restart the queue when your device
can handle more requests. The queue lock must be held when calling
either of these functions.


void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);


Function that tells the kernel the highest physical address to which
your device can perform DMA. If a request comes in containing a
reference to memory above the limit, a bounce buffer will be used for
the operation; this is, of course, an expensive way to perform block
I/O and should be avoided whenever possible. You can provide any
reasonable physical address in this argument, or make use of the
predefined symbols BLK_BOUNCE_HIGH (use bounce buffers for
high-memory pages), BLK_BOUNCE_ISA (the driver can DMA only into the
16-MB ISA zone), or BLK_BOUNCE_ANY (the driver can
perform DMA to any address). The default value is
BLK_BOUNCE_HIGH.


void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);

void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);


Functions that set parameters describing the requests that can be
satisfied by this device. blk_queue_max_sectors
can be used to set the maximum size of any request in (512-byte)
sectors; the default is 255.
blk_queue_max_phys_segments and
blk_queue_max_hw_segments both control how many
physical segments (nonadjacent areas in system memory) may be
contained within a single request. Use
blk_queue_max_phys_segments to say how many
segments your driver is prepared to cope with; this may be the size
of a statically allocated scatterlist, for example.
blk_queue_max_hw_segments, in contrast, is the
maximum number of segments that the device itself can handle. Both of
these parameters default to 128. Finally,
blk_queue_max_segment_size tells the kernel how
large any individual segment of a request can be in bytes; the
default is 65,536 bytes.


void blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);


Some devices cannot handle requests that cross a particular size
memory boundary; if your device is one of those, use this function to
tell the kernel about that boundary. For example, if your device has
trouble with requests that cross a 4-MB boundary, pass in a mask of
0x3fffff. The default mask is
0xffffffff.


void blk_queue_dma_alignment(request_queue_t *queue, int mask);


Function that tells the kernel about the memory alignment constraints
your device imposes on DMA transfers. All requests are created with
the given alignment, and the length of the request also matches the
alignment. The default mask is 0x1ff, which causes
all requests to be aligned on 512-byte boundaries.


void blk_queue_hardsect_size(request_queue_t *queue, unsigned short size);


Tells the kernel about your device's hardware sector
size. All requests generated by the kernel are a multiple of this
size and are properly aligned. All communication between the block
layer and the driver continues to be expressed in 512-byte sectors,
however.
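
Pulling these together, a driver's initialization code might
configure its queue along the following lines. This is only a sketch
with made-up limits for a hypothetical device; MY_SG_SIZE stands for
the size of the driver's scatterlist array.

dev->queue = blk_init_queue(my_request, &dev->lock);
if (dev->queue == NULL)
    goto out;

blk_queue_bounce_limit(dev->queue, BLK_BOUNCE_ANY);   /* the device can DMA anywhere */
blk_queue_max_sectors(dev->queue, 128);               /* at most 64 KB per request */
blk_queue_max_phys_segments(dev->queue, MY_SG_SIZE);  /* what our scatterlist can hold */
blk_queue_max_hw_segments(dev->queue, MY_SG_SIZE);    /* what the device can handle */
blk_queue_max_segment_size(dev->queue, 65536);
blk_queue_segment_boundary(dev->queue, 0x3fffff);     /* no segment may cross 4 MB */
blk_queue_hardsect_size(dev->queue, 2048);            /* 2 KB hardware sectors */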




16.3.4. The Anatomy of a Request


In our simple example, we encountered the request
structure. However, we have barely scratched the surface of that
complicated data structure. In this section, we look, in some detail,
at how block I/O requests are represented in the Linux kernel.

Each request structure represents one block I/O
request, although it may have been formed through a merger of several
independent requests at a higher level. The sectors to be transferred
for any particular request may be distributed throughout main memory,
although they always correspond to a set of consecutive sectors on
the block device. The request is represented as a set of segments,
each of which corresponds to one in-memory buffer. The kernel may
join multiple requests that involve adjacent sectors on the disk, but
it never combines read and write operations within a single
request structure. The kernel also makes sure not
to combine requests if the result would violate any of the request
queue limits described in the previous section.

A request structure is implemented, essentially,
as a linked list of bio structures combined with
some housekeeping information to enable the driver to keep track of
its position as it works through the request. The
bio structure is a low-level description of a
portion of a block I/O request; we take a look at it now.


16.3.4.1 The bio structure

When the kernel, in the form of a filesystem, the virtual memory
subsystem, or a system call, decides that a set of blocks must be
transferred to or from a block I/O device, it puts together a
bio structure to describe that operation. That
structure is then handed to the block I/O code, which merges it into
an existing request structure or, if need be,
creates a new one. The bio structure contains
everything that a block driver needs to carry out the request without
reference to the user-space process that caused that request to be
initiated.

The bio structure, which is defined in
<linux/bio.h>, contains a number of fields
that may be of use to driver authors:

sector_t bi_sector;


The first (512-byte) sector to be transferred for this
bio.


unsigned int bi_size;


The size of the data to be transferred, in bytes. Instead, it is
often easier to use bio_sectors(bio), a macro that
gives the size in sectors.


unsigned long bi_flags;


A set of flags describing the bio; the least
significant bit is set if this is a write request (although the macro
bio_data_dir(bio) should be used instead of
looking at the flags directly).


unsigned short bi_phys_segments;

unsigned short bi_hw_segments;


The number of physical segments contained within this BIO and the
number of segments seen by the hardware after DMA mapping is done,
respectively.



The core of a bio, however, is an array called bi_io_vec,
which is made up of the following structure:

struct bio_vec {
    struct page *bv_page;
    unsigned int bv_len;
    unsigned int bv_offset;
};

Figure 16-1 shows how these
structures all tie together. As you can see, by the time a block I/O
request is turned into a bio structure, it has
been broken down into individual pages of physical memory. All a
driver needs to do is to step through this array of structures (there
are bi_vcnt of them), and transfer data within
each page (but only len bytes starting at
offset).


Figure 16-1. The bio structure

Working directly with the bi_io_vec array is
discouraged in the interest of kernel developers being able to change
the bio structure in the future without breaking
things. To that end, a set of macros has been provided to ease the
process of working with the bio structure. The
place to start is with bio_for_each_segment, which
simply loops through every unprocessed entry in the
bi_io_vec array. This macro should be used as
follows:

int segno;
struct bio_vec *bvec;

bio_for_each_segment(bvec, bio, segno) {
    /* Do something with this segment */
}

Within this loop, bvec points to the current
bio_vec entry, and segno is the
current segment number. These values can be used to set up DMA
transfers (an alternative way using
blk_rq_map_sg is described in Section 16.3.5.2). If you need to access the pages directly,
you should first ensure that a proper kernel virtual address exists;
to that end, you can use:

char *__bio_kmap_atomic(struct bio *bio, int i, enum km_type type);
void __bio_kunmap_atomic(char *buffer, enum km_type type);

This low-level function allows you to directly map the buffer found
in a given bio_vec, as indicated by the index
i. An atomic kmap is created; the caller must
provide the appropriate slot to use (as described in
Section 15.1.4).

The block layer also maintains a set of pointers within the
bio structure to keep track of the current

state of request processing.
Several macros exist to provide access to that state:

struct page *bio_page(struct bio *bio);


Returns a pointer to the page structure
representing the page to be transferred next.


int bio_offset(struct bio *bio);


Returns the offset within the page for the data to be transferred.


int bio_cur_sectors(struct bio *bio);


Returns the number of sectors to be transferred out of the current
page.


char *bio_data(struct bio *bio);


Returns a kernel logical address pointing to the data to be
transferred. Note that this address is available only if the page in
question is not located in high memory; calling it in other
situations is a bug. By default, the block subsystem does not pass
high-memory buffers to your driver, but if you have changed that
setting with blk_queue_bounce_limit, you
probably should not be using bio_data.


char *bio_kmap_irq(struct bio *bio, unsigned long *flags);

void bio_kunmap_irq(char *buffer, unsigned long *flags);


bio_kmap_irq returns a kernel virtual address
for any buffer, regardless of whether it resides in high or low
memory. An atomic kmap is used, so your driver cannot sleep while
this mapping is active. Use bio_kunmap_irq to
unmap the buffer. Note that the flags argument is
passed by pointer here. Note also that since an atomic kmap is used,
you cannot map more than one segment at a time.



All of the functions just described access the
"current" bufferthe first
buffer that, as far as the kernel knows, has not been transferred.
Drivers often want to work through several buffers in the
bio before signaling completion on any of them
(with end_that_request_first, to be described
shortly), so these functions are often not useful. Several other
macros exist for working with the internals of the
bio structure (see
<linux/bio.h> for details).
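
For a driver that does complete one buffer at a time, though, the
current-buffer accessors are convenient. Here is a minimal sketch of
copying out the current buffer of a bio, which may live in high
memory; my_dev and my_pio_write are hypothetical.

static void my_write_current(struct my_dev *dev, struct bio *bio, sector_t sector)
{
    unsigned long flags;
    char *buffer = bio_kmap_irq(bio, &flags);   /* works for high-memory pages too */

    my_pio_write(dev, sector, buffer, bio_cur_sectors(bio) * 512);
    bio_kunmap_irq(buffer, &flags);             /* cannot sleep while mapped */
}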


16.3.4.2 Request structure fields

Now that we have an idea of how the bio structure
works, we can get deep into struct
request and see how request processing works. The
fields of this structure include:

sector_t hard_sector;

unsigned long hard_nr_sectors;

unsigned int hard_cur_sectors;


Fields that track the sectors that the driver has yet to complete.
The first sector that has not been transferred
is stored in hard_sector, the total number of
sectors yet to transfer is in hard_nr_sectors, and
the number of sectors remaining in the current bio
is hard_cur_sectors. These fields are intended for
use only within the block subsystem; drivers should not make use of
them.


struct bio *bio;


bio is the linked list of bio
structures for this request. You should not access this field
directly; use rq_for_each_bio (described later)
instead.


char *buffer;


The simple driver example earlier in this chapter used this field to
find the buffer for the transfer. With our deeper understanding, we
can now see that this field is simply the result of calling
bio_data on the current bio.


unsigned short nr_phys_segments;


The number of distinct segments occupied by this request in physical
memory after adjacent pages have been merged.


struct list_head queuelist;


The linked-list structure (as described in Section 11.5)
that links the request
into the request queue. If (and only if) you remove the request from
the queue with blkdev_dequeue_request, you may
use this list head to track the request in an internal list
maintained by your driver.
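
To illustrate that last point, a driver that dequeues requests might
track them on a private list while the hardware works on them. This
is just a sketch; the my_dev structure with its pending list is
hypothetical.

struct my_dev {
    spinlock_t lock;
    struct list_head pending;    /* requests handed to the hardware */
    /* ... */
};

/* Called with the queue lock held, after blkdev_dequeue_request(req). */
static void my_track_request(struct my_dev *dev, struct request *req)
{
    list_add_tail(&req->queuelist, &dev->pending);
}

/* In the completion path, the oldest outstanding request is found again: */
static struct request *my_oldest_pending(struct my_dev *dev)
{
    if (list_empty(&dev->pending))
        return NULL;
    return list_entry(dev->pending.next, struct request, queuelist);
}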



Figure 16-2 shows how the
request structure and its component bio structures fit together. In
the figure, the request has been partially satisfied; the
cbio and buffer fields point to
the first bio that has not yet been transferred.


Figure 16-2. A request queue with a partially processed request

There are many other fields inside the request
structure, but the list in this section should be enough for most
driver writers.


16.3.4.3 Barrier requests

The block layer reorders
requests
before your driver sees them to improve I/O performance. Your driver,
too, can reorder requests if there is a reason to do so. Often, this
reordering happens by passing multiple requests to the drive and
letting the hardware figure out the optimal ordering. There is a
problem with unrestricted reordering of requests, however: some
applications require guarantees that certain operations will complete
before others are started. Relational database managers, for example,
must be absolutely sure that their journaling information has been
flushed to the drive before executing a transaction on the database
contents. Journaling filesystems, which are now in use on most Linux
systems, have very similar ordering constraints. If the wrong
operations are reordered, the result can be severe, undetected data
corruption.

The 2.6 block layer addresses this problem with the concept of a
barrier request. If a request is marked with
the REQ_HARDBARRIER flag, it must be written to the
drive before any following request is initiated. By
"written to the drive," we mean
that the data must actually reside and be persistent on the physical
media. Many drives perform caching of write requests; this caching
improves performance, but it can defeat the purpose of barrier
requests. If a power failure occurs when the critical data is still
sitting in the drive's cache, that data is still
lost even if the drive has reported completion. So a driver that
implements barrier requests must take steps to force the drive to
actually write the data to the media.

If your driver honors barrier requests, the first step is to inform
the block layer of this fact. Barrier handling is another of the
request queue's parameters; it is set with:

void blk_queue_ordered(request_queue_t *queue, int flag);

To indicate that your driver implements barrier requests, set the
flag parameter to a nonzero value.

The actual implementation of barrier requests is simply a matter of
testing for the associated flag in the request
structure. A macro has been provided to perform this test:

int blk_barrier_rq(struct request *req);

If this macro returns a nonzero value, the request is a barrier
request. Depending on how your hardware works, you may have to stop
taking requests from the queue until the barrier request has been
completed. Other drives can understand barrier requests themselves;
in this case, all your driver has to do is to issue the proper
operations for those drives.
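
For a driver that must manage barriers itself, the request-processing
loop might gain logic along these lines. This is only a rough sketch;
my_dev, my_outstanding_commands, my_start_request, and
my_flush_drive_cache are hypothetical placeholders for the driver's
own bookkeeping and for whatever command forces the drive to commit
its cache to the media.

static void my_request(request_queue_t *q)
{
    struct request *req;
    struct my_dev *dev = q->queuedata;

    while ((req = elv_next_request(q)) != NULL) {
        if (blk_barrier_rq(req) && my_outstanding_commands(dev) > 0)
            return;    /* let earlier commands finish; completion restarts us */

        blkdev_dequeue_request(req);
        my_start_request(dev, req);

        if (blk_barrier_rq(req))
            my_flush_drive_cache(dev);   /* force the data onto the media */
    }
}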


16.3.4.4 Nonretryable requests

Block drivers often attempt to retry requests that fail
the first time. This behavior can lead to a more reliable system and
help to avoid data loss. The kernel, however, sometimes marks
requests as not being retryable. Such requests should simply fail as
quickly as possible if they cannot be executed on the first try.

If your driver is considering retrying a failed request, it should
first make a call to:

int blk_noretry_request(struct request *req);

If this macro returns a nonzero value, your driver should simply
abort the request with an error code instead of retrying it.
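
In an error-handling path, that test might look like the following
sketch, where my_fail_request and my_retry_request are hypothetical.

if (blk_noretry_request(req))
    my_fail_request(dev, req);    /* complete immediately with an error */
else
    my_retry_request(dev, req);   /* otherwise it is reasonable to try again */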


16.3.5. Request Completion Functions


There are, as we will see,
several different ways of working
through a request structure. All of them make use
of a couple of common functions, however, which handle the completion
of an I/O request or parts of a request. Both of these functions are
atomic and can be safely called from an atomic context.

When your device has completed transferring some or all of the
sectors in an I/O request, it must inform the block subsystem with:

int end_that_request_first(struct request *req, int success, int count);

This function tells the block code that your driver has finished with
the transfer of count sectors starting where you
last left off. If the I/O was successful, pass
success as 1; otherwise pass
0. Note that you must signal completion in order
from the first sector to the last; if your driver and device somehow
conspire to complete requests out of order, you have to store the
out-of-order completion status until the intervening sectors have
been transferred.

The return value from end_that_request_first is
an indication of whether all sectors in this request have been
transferred or not. A return value of 0 means that
all sectors have been transferred and that the request is complete.
At that point, you must dequeue the request with
blkdev_dequeue_request (if you have not already
done so) and pass it to:

void end_that_request_last(struct request *req);

end_that_request_last informs whoever is waiting
for the request that it has completed and recycles the
request structure; it must be called with the
queue lock held.

In our simple sbull example, we
didn't use any of the above functions. That example,
instead, calls end_request. To show the
effects of this call, here is the entire
end_request function as seen in the 2.6.10
kernel:

void end_request(struct request *req, int uptodate)
{
    if (!end_that_request_first(req, uptodate, req->hard_cur_sectors)) {
        add_disk_randomness(req->rq_disk);
        blkdev_dequeue_request(req);
        end_that_request_last(req);
    }
}

The function add_disk_randomness uses the timing
of block I/O requests to contribute entropy to the
system's random number pool; it should be called
only if the disk's timing is truly random. That is
true for most mechanical devices, but it is not true for a
memory-based virtual device, such as sbull. For
this reason, the more complicated version of
sbull shown in the next section does not call
add_disk_randomness.


16.3.5.1 Working with bios

You now know enough to
write
a block driver that works directly with the bio
structures that make up a request. An example might help, however. If
the sbull driver is loaded with the
request_mode parameter set to
1, it registers a bio-aware
request function instead of the simple function
we saw above. That function looks like this:

static void sbull_full_request(request_queue_t *q)
{
    struct request *req;
    int sectors_xferred;
    struct sbull_dev *dev = q->queuedata;

    while ((req = elv_next_request(q)) != NULL) {
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sectors_xferred = sbull_xfer_request(dev, req);
        if (! end_that_request_first(req, 1, sectors_xferred)) {
            blkdev_dequeue_request(req);
            end_that_request_last(req);
        }
    }
}

This function simply takes each request, passes it to
sbull_xfer_request, then completes it with
end_that_request_first and, if necessary,
end_that_request_last. Thus, this function is
handling the high-level queue and request management parts of the
problem. The job of actually executing a request, however, falls to
sbull_xfer_request:

static int sbull_xfer_request(struct sbull_dev *dev, struct request *req)
{
    struct bio *bio;
    int nsect = 0;

    rq_for_each_bio(bio, req) {
        sbull_xfer_bio(dev, bio);
        nsect += bio->bi_size/KERNEL_SECTOR_SIZE;
    }
    return nsect;
}

Here we introduce another macro:
rq_for_each_bio. As you might expect, this macro
simply steps through each bio structure in the
request, giving us a pointer that we can pass to
sbull_xfer_bio for the transfer. That function
looks like:

static int sbull_xfer_bio(struct sbull_dev *dev, struct bio *bio)
{
    int i;
    struct bio_vec *bvec;
    sector_t sector = bio->bi_sector;

    /* Do each segment independently. */
    bio_for_each_segment(bvec, bio, i) {
        char *buffer = __bio_kmap_atomic(bio, i, KM_USER0);
        sbull_transfer(dev, sector, bio_cur_sectors(bio),
                buffer, bio_data_dir(bio) == WRITE);
        sector += bio_cur_sectors(bio);
        __bio_kunmap_atomic(buffer, KM_USER0);
    }
    return 0; /* Always "succeed" */
}

This function simply steps through each segment in the
bio structure, gets a kernel virtual address to
access the buffer, then calls the same
sbull_transfer function we saw earlier to copy
the data over.

Each device has its own needs, but, as a general rule, the code just
shown should serve as a model for many situations where digging
through the bio structures is needed.


16.3.5.2 Block requests and DMA

If you are working on a high-performance block driver, chances are
you will be using DMA for the actual data transfers. A block driver
can certainly step through the bio structures, as
described above, create a DMA mapping for each one, and pass the
result to the device. There is an easier way, however, if your device
can do scatter/gather I/O. The function:

int blk_rq_map_sg(request_queue_t *queue, struct request *req,
                  struct scatterlist *list);

fills in the given list with the full set of
segments from the given request. Segments that are adjacent in memory
are coalesced prior to insertion into the scatterlist, so you need
not try to detect them yourself. The return value is the number of
entries in the list. The function also passes back, in its third
argument, a scatterlist suitable for passing to
dma_map_sg. (See Section 15.4.4.7
for more information on
dma_map_sg.)

Your driver must allocate the storage for the scatterlist before
calling blk_rq_map_sg. The list must be able to
hold at least as many entries as the request has physical segments;
the struct request field
nr_phys_segments holds that count, which will not
exceed the maximum number of physical segments specified with
blk_queue_max_phys_segments.
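
Putting the pieces together, a DMA-capable driver might prepare a
transfer as in the sketch below. Here dev->sglist is assumed to be a
scatterlist array sized for the queue's physical-segment limit and
allocated at initialization time, dev->dma_dev is the underlying
struct device, and my_program_hardware is a placeholder.

static int my_setup_dma(struct my_dev *dev, struct request *req)
{
    int nents, mapped;
    int dir = rq_data_dir(req) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;

    /* Fill our preallocated scatterlist from the request's segments. */
    nents = blk_rq_map_sg(dev->queue, req, dev->sglist);

    /* Map the list for DMA (see Chapter 15 for dma_map_sg). */
    mapped = dma_map_sg(dev->dma_dev, dev->sglist, nents, dir);
    if (mapped == 0)
        return -EIO;

    my_program_hardware(dev, dev->sglist, mapped, dir);
    return 0;
}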

If you do not want blk_rq_map_sg to coalesce
adjacent segments, you can change the default behavior with a call
such as:

clear_bit(QUEUE_FLAG_CLUSTER, &queue->queue_flags);

Some SCSI disk drivers mark their request queue in this way, since
they do not benefit from the coalescing of requests.


16.3.5.3 Doing without a request queue

Previously, we have discussed the work the kernel does to optimize
the order of requests in the queue; this work involves sorting
requests and, perhaps, even stalling the queue to allow an
anticipated request to arrive. These techniques help the
system's performance when dealing with a real,
spinning disk drive. They are completely wasted, however, with a
device like sbull. Many block-oriented devices,
such as flash memory arrays, readers for media cards used in digital
cameras, and RAM disks have truly random-access performance and do
not benefit from advanced-request queueing logic. Other devices, such
as software RAID arrays or virtual disks created by logical volume
managers, do not have the performance characteristics for which the
block layer's request queues are optimized. For this
kind of device, it would be better to accept requests directly from
the block layer and not bother with the request queue at all.

For these situations, the block layer supports a "no
queue" mode of operation. To make use of this mode,
your driver must provide a "make
request" function, rather than a
request function. The
make_request function has this prototype:

typedef int (make_request_fn) (request_queue_t *q, struct bio *bio);

Note that a request queue is still present, even though it will never
actually hold any requests. The make_request
function takes as its main parameter a bio
structure, which represents one or more buffers to be transferred.
The make_request function can do one of two
things: it can either perform the transfer directly, or it can
redirect the request to another device.

Performing the transfer directly is just a matter of working through
the bio with the accessor methods we described
earlier. Since there is no request structure to
work with, however, your function should signal completion directly
to the creator of the bio structure with a call to
bio_endio:

void bio_endio(struct bio *bio, unsigned int bytes, int error);

Here, bytes is the number of bytes you have
transferred so far. It can be less than the number of bytes
represented by the bio as a whole; in this way,
you can signal partial completion, and update the internal
"current buffer" pointers within
the bio. You should either call
bio_endio again as your device makes further
progress, or signal an error if you are unable to complete the
request. Errors are indicated by providing a nonzero value for the
error parameter; this value is normally an error
code such as -EIO. The
make_request function should return 0,
regardless of whether the I/O is successful.

If sbull is loaded with
request_mode=2, it operates with a
make_request function. Since
sbull already has a function that can transfer a
single bio, the make_request
function is simple:

static int sbull_make_request(request_queue_t *q, struct bio *bio)
{
    struct sbull_dev *dev = q->queuedata;
    int status;

    status = sbull_xfer_bio(dev, bio);
    bio_endio(bio, bio->bi_size, status);
    return 0;
}

Please note that you should never call bio_endio
from a regular request function; that job is
handled by end_that_request_first instead.

Some block drivers, such as those implementing volume managers and
software RAID arrays, really need to redirect the request to another
device that handles the actual I/O. Writing such a driver is beyond
the scope of this book. We note, however, that if the
make_request function returns a nonzero value,
the bio is submitted again. A
"stacking" driver can, therefore,
modify the bi_bdev field to point to a different
device, change the starting sector value, then return; the block
system then passes the bio to the new device.
There is also a bio_split call that can be used
to split a bio into multiple chunks for submission
to more than one device, although if the queue parameters are set up
correctly, splitting a bio in this way should
almost never be necessary.
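
For illustration only, a trivial remapping driver's make_request
function might look like the sketch below; target_bdev and
sector_offset are hypothetical fields recording the underlying device
and where the remapped region begins on it.

static int my_remap_request(request_queue_t *q, struct bio *bio)
{
    struct my_dev *dev = q->queuedata;

    bio->bi_bdev = dev->target_bdev;       /* hand the bio to the real device */
    bio->bi_sector += dev->sector_offset;  /* shift into the right region */
    return 1;   /* nonzero: the block layer submits the bio again */
}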

Either way, you must tell the block subsystem that your driver is
using a custom make_request function. To do so,
you must allocate a request queue with:

request_queue_t *blk_alloc_queue(int flags);

This function differs from blk_init_queue in
that it does not actually set up the queue to hold requests. The
flags argument is a set of allocation flags to be
used in allocating memory for the queue; usually the right value is
GFP_KERNEL. Once you have a queue, pass it and
your make_request function to
blk_queue_make_request:

void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

The sbull code to set up the
make_request function looks like:

dev->queue = blk_alloc_queue(GFP_KERNEL);
if (dev->queue == NULL)
    goto out_vfree;
blk_queue_make_request(dev->queue, sbull_make_request);

For the curious, some time spent digging through
drivers/block/ll_rw_blk.c shows that all
queues have a make_request function. The default
version, generic_make_request, handles the
incorporation of the bio into a
request structure. By providing a
make_request function of its own, a driver is
really just overriding a specific request queue
method and short-circuiting much of the work.

