2012. január 20., péntek

Linux block device drivers: the queue and make_request

I had to write a kernel module recently that implemented a block device. I've carefully read the LDD3 material on every topic I thought might be useful; I concentrated mainly on block devices of course.

Linux block drivers can do their job by providing a function that the kernel will call if it needs to do IO on the device. The first method described in LDD3 was the "request" method. According to this, the driver had to provide a function (wich was called the "request" function after the name of the field in a struct that is registered with the kernel at driver initialization). This function is called whenever a so-called request is available in a so-called request queue. The request does not equal to a single IO request that the kernel received from userspace (like when you do a read() on a disk for example). Such user space originating requests are merged and/or splitted to form suitably sized and appropriately queued chunks of IO to do.



The keyword here is "appropriately sized". When registering a disk with the kernel, the driver has to provide a struct request_queue in wich these requests are put. The queue has a lot of fields that describe what "appropriately sized" means; so the kernel knows about the physical block size, the minimal and maximal IO size et cetera. This makes writing block device comfortable, because the request function will receive whole blocks, properly aligned, and does not have to do the extra job it would have to if it could get small pieces of blocks not aligned to the physical blocks of the underlying device.

This also minimizes the overhead that occurs when a block has to be written to a device whose size is less then the block size the device can work with. If this could happen, the driver would have to read the entire physical block (or, to make things more complicated, two blocks, if the request is misaligned). After reading the physical block, the driver would have to copy the given data to the appropriate location, and then write it back to the drive. This means a read, a copy, and a write. Three not exactly quick operations.

If the driver always receives complete physical blocks, all properly aligned, then it always can solve the problem with a single IO operation. Since smart people discovered DMA and IOMMU-s, it can be done without the CPU, so no one has to copy a single byte, even when the buffer needed to read/written is scattered in main memory.

Since most of the drivers that need to move considerable amount of data work in a similar manner, you can send files over the network between two computers and two hard drives without the CPU ever touching the data; you can do a thing called zero-copy data handling. This way you can spare a few CPU cores under heavy IO load.

The other thing the request queue does is ordering request to form sequences of IO operations that can be handled efficiently by a rotating disk with a head that moves from track to track - between concentric circles on the surface of a rotating disk painted with special magnetic material.

My device is very much NOT what a magnetic rotating disk is like, sequential and random IO has exactly the same performance for reasons I would not like to discuss right now. Since the request queuing  involves extra CPU cycles, but offers absolutely nothing in my case, I've left the whole thing out of the picture, and went on with the second "make_request" method.

The "make_request" method also got it's name from the function pointer field in the same structure. It has to point to a function that is also called when IO work has to be done, just like we saw above. The great difference is that no request ordering (or even stalling, waiting for anticipated requests) are done. Just what I need!

My big problem was that the book did not say explicitly that if request_queue limits are still apply when using this method. It would be logical, since one had to register the queue, doesn't matter how she wanted to serve requests; the programming interface suggested that IO sizing and aligning was still being done.

My first experiments with the make_request method suggested that the request_queue is just a deadweight. I saw 512 byte request hitting the driver even if I explicitly set the queue limits asking for 4K sized blocks. The problem was another piece of missing information that caused me to put a nasty bug into the driver code.

Before I started to read the kernel code instead of the documentation, my code looked like this:


        PDEBUG("allocating queue\n");
        queue = blk_alloc_queue(GFP_KERNEL);

        if (queue == NULL) {
                PDEBUG("Could not allocate request queue\n");
                error = -ENOMEM;
                goto out_vfree;
        }
        
        PDEBUG("setting up queue parameters\n");
        queue->queuedata = new;
        blk_queue_logical_block_size(queue, block_size);
        blk_queue_make_request(queue, bdtun_make_request);

And after waiting weeks for somebody to do the job for me at stackoverflow and countless hours of reading Linux kernel code, I came up with the solution:

        PDEBUG("allocating queue\n");
        queue = blk_alloc_queue(GFP_KERNEL);

        if (queue == NULL) {
                PDEBUG("Could not allocate request queue\n");
                error = -ENOMEM;
                goto out_vfree;
        }
        
        PDEBUG("setting up queue parameters\n");
        queue->queuedata = new;
        blk_queue_make_request(queue, bdtun_make_request);
        blk_queue_logical_block_size(queue, block_size);


No, really. Look harder, there IS a difference. Let me help you: it's the positions of the blk_queue_make_request(queue, bdtun_make_request) line. That's right. It's now before the line setting the logical block size.

That's all folks: blk_queue_make_request resets every property of the request queue! (Code from block/blk_settings.c, linux 3.0)

void blk_queue_make_request(struct request_queue *q, make_request_fn *mfn)
{
        /*
         * set defaults
         */
        q->nr_requests = BLKDEV_MAX_RQ;

        q->make_request_fn = mfn;
        blk_queue_dma_alignment(q, 511);
        blk_queue_congestion_threshold(q);
        q->nr_batching = BLK_BATCH_REQ;

        blk_set_default_limits(&q->limits);
        blk_queue_max_hw_sectors(q, BLK_SAFE_MAX_SECTORS);
        q->limits.discard_zeroes_data = 0;

        /*
         * by default assume old behaviour and bounce for any highmem page
         */
        blk_queue_bounce_limit(q, BLK_BOUNCE_HIGH);
}
EXPORT_SYMBOL(blk_queue_make_request);

After this modification I was receiving properly sized and aligned IO request. This was reinforcing, but not sufficient evidence. I wanted to see the lines that guarantee these properties, and didn't wanted to rely on mere experimenting. After another countless hours of digging through strange pointer magic and funny comments, I found some evidence.

Throughout the kernel I found explicit checks against invalid bios, so it's pretty sure my driver will never get an invalid bio. Examples:

fs/nilfs2/segbuf.c:


static void nilfs_segbuf_prepare_write(struct nilfs_segment_buffer *segbuf,
                                       struct nilfs_write_info *wi)
{
        wi->bio = NULL;
        wi->rest_blocks = segbuf->sb_sum.nblocks;
        wi->max_pages = bio_get_nr_vecs(wi->nilfs->ns_bdev);
        wi->nr_vecs = min(wi->max_pages, wi->rest_blocks);
        wi->start = wi->end = 0;
        wi->blocknr = segbuf->sb_pseg_start;
}

Similar code from fs/direct-io.c:


static inline int dio_new_bio(struct dio *dio, struct dio_submit *sdio,
                sector_t start_sector, struct buffer_head *map_bh)
{
        sector_t sector;
        int ret, nr_pages;

        ret = dio_bio_reap(dio, sdio);
        if (ret)
                goto out;
        sector = start_sector << (sdio->blkbits - 9);
        nr_pages = min(sdio->pages_in_io, bio_get_nr_vecs(map_bh->b_bdev));
        nr_pages = min(nr_pages, BIO_MAX_PAGES);
        BUG_ON(nr_pages <= 0);
        dio_bio_alloc(dio, sdio, map_bh->b_bdev, sector, nr_pages);
        sdio->boundary = 0;
out:
        return ret;
}

Grepping through the linux code I found similar code, kind of proving that everyone tries to allocate and submit correctly sized bios.

I found a few other not very well documented facts, like that block device block size must be a power of two, must be at least 512 bytes, and must be at most PAGE_SIZE, being 4096 on most architectures (code from fs/block_dev.c):

int set_blocksize(struct block_device *bdev, int size)
{
 /* Size must be a power of two, and between 512 and PAGE_SIZE */
 if (size > PAGE_SIZE || size < 512 || !is_power_of_2(size))
  return -EINVAL;

 /* Size cannot be smaller than the size supported by the device */
 if (size < bdev_logical_block_size(bdev))
  return -EINVAL;

 /* Don't change the size if it is same as current */
 if (bdev->bd_block_size != size) {
  sync_blockdev(bdev);
  bdev->bd_block_size = size;
  bdev->bd_inode->i_blkbits = blksize_bits(size);
  kill_bdev(bdev);
 }
 return 0;
}

Fun fact: you can create a block device from a kernel module that has different block size. It might even seem to work, at least for a while. But if you do that, be prepared for awkward user space errors - or more often - kernel panics.