Understanding the Linux Kernel -- Block Device Drivers

Block device handling

Every operation on a block device driver involves a number of kernel components; some of the most important are shown in Figure 14-1. For example, let's assume that a process issues a read() system call on some disk file - we will see that write requests are handled in essentially the same way. The following are the general steps by which the kernel responds to process requests:
[Figure 14-1]

  1. The service routine of the read() system call calls an appropriate VFS function, passing it the file descriptor and offset within the file. The virtual file system sits on top of the block device processing architecture and provides a common file model that is used by all file systems supported by Linux. We have introduced the VFS layer in detail in Chapter 12.
  2. The VFS function determines whether the requested data already exists and, if necessary, determines how to perform the read operation. Sometimes there is no need to access data on disk because the kernel keeps most of the data recently read from or written to the block device in RAM. Chapter 15 introduces the disk cache mechanism, while Chapter 16 details how VFS handles disk operations and interacts with the disk cache and file system.
  3. We now assume that the kernel must read the data from the block device, so it has to determine the physical location of the data. To do this, the kernel relies on the mapping layer, which essentially performs the following two steps:
    a. The kernel determines the block size of the file system where the file is located, and calculates the length of the requested data based on the file block size. Essentially, the file is viewed as split into a number of blocks, so the kernel determines the block number (relative index to the start of the file) where the requested data is located.
    b. Next, the mapping layer calls a file system-specific function, which accesses the file's disk node and then determines the location of the requested data on disk based on the logical block number. In fact, the disk is also considered to be split into many blocks, so the kernel must determine the number of the block (relative index to the beginning of the disk or partition) that holds the requested data. Since a file may be stored in non-contiguous blocks on disk, the data structure stored in the disk index node maps each file block number to a logical block number (Note 1). We will explain the function of the mapping layer in Chapter 16, and introduce some typical disk file systems in Chapter 18.
  4. Now the kernel can issue the read request to the block device. The kernel uses the generic block layer to start the I/O operations that transfer the requested data. Generally speaking, each I/O operation targets only a contiguous set of blocks on disk. Because the requested data does not have to be in adjacent blocks, the generic block layer may start several I/O operations. Each I/O operation is described by a "block I/O" (or "bio") structure, which collects all the information the lower-level components need to satisfy the request. The generic block layer offers an abstract view of all block devices, thus hiding the differences among hardware block devices. Because almost all block devices are disks, the generic block layer also provides some common data structures to describe "disks" and "disk partitions". We will discuss the generic block layer and the bio structure in the "Generic block layer" section of this chapter.
  5. The "I/O scheduler" below the general block layer classifies pending I/O data transfer requests according to predefined kernel policies. The role of the scheduler is to group together adjacent data requests on the physical media. We will introduce the scheduler in the "I/O Scheduler" section later in this chapter.
  6. Finally, the block device driver sends the appropriate commands to the disk controller's hardware interface to perform the actual data transfer. We will introduce the overall organization of a generic block device driver in the "Block device drivers" section later in this chapter.

As you can see, data storage in block devices involves many kernel components; each component manages disk data in chunks of a different length:
    a. The hardware block device controller transfers data in fixed-length chunks called "sectors." Therefore, the I/O scheduler and the block device driver must manage sectors of data.
    b. The virtual file system, the mapping layer, and the file systems group disk data into logical units called "blocks." A block corresponds to the smallest disk storage unit inside a file system.
    c. As we will soon see, block device drivers must also be able to handle "segments" of data: a segment is a memory page, or a portion of a memory page, that contains chunks of data which are physically adjacent on disk.
    d. The disk cache works on "pages" of disk data, each of which fits exactly in a page frame.
The generic block layer glues together all the upper and lower components, so it knows about sectors, blocks, segments, and pages of data. Even though there are many different chunks of data, they usually share the same physical RAM cells.

For example, Figure 14-2 shows the layout of a 4096-byte page. The upper kernel components see the page as made up of four 1024-byte block buffers. The block device driver is transferring the last three blocks of the page, so these three blocks are inserted into a segment covering the last 3072 bytes of the page. The hard disk controller sees the same segment as made up of six 512-byte sectors.
[Figure 14-2]
In this chapter we introduce the lower kernel components that handle block devices: the generic block layer, the I/O scheduler, and the block device drivers. Consequently, we focus our attention on sectors, blocks, and segments.

Sectors

To achieve acceptable performance, hard disks and similar devices transfer several adjacent bytes at once. Each data transfer operation on a block device acts on a group of adjacent bytes called a sector. In the following discussion we assume that the bytes are recorded contiguously on the disk surface, so that they can be accessed with a single seek operation. Although the physical geometry of a disk is usually very complex, the commands received by the hard disk controller treat the disk as a large array of sectors. In most disk devices the sector size is 512 bytes, although some devices use larger sectors (1024 or 2048 bytes).

Note that the sector should be regarded as the basic unit of data transfer; transferring less than a sector is never allowed, although most disk devices can transfer several adjacent sectors at once. In Linux the sector size is conventionally set to 512 bytes; if a block device uses larger sectors, the corresponding low-level block device driver makes the necessary adjustments. Therefore, a chunk of data stored on a block device is identified by its position on the disk, that is, by the index of its first 512-byte sector and by the number of sectors. Sector indexes are stored in 32-bit or 64-bit variables of type sector_t.
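As a minimal illustration of this convention (the structure and helper names below are hypothetical; only sector_t and the 512-byte unit come from the text above):

struct disk_chunk {
	sector_t first_sector;    /* index of the first 512-byte sector */
	unsigned int nr_sectors;  /* number of 512-byte sectors */
};

/* Hypothetical helper: convert a byte range into the sector-based
   representation used by the block layer (512-byte units). */
static inline void byte_range_to_chunk(unsigned long long byte_offset,
                                       unsigned int byte_len,
                                       struct disk_chunk *chunk)
{
	chunk->first_sector = byte_offset >> 9;      /* divide by 512 */
	chunk->nr_sectors = (byte_len + 511) >> 9;   /* round up to a whole sector */
}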

Blocks

Sectors are the basic unit of data transfer by hardware devices, while blocks are the basic unit of data transfer by VFS and file systems. For example, when the kernel accesses the contents of a file, it must first read the block containing the file's disk inode from disk (see the "Inode Object" section in Chapter 12). This block corresponds to one or more adjacent sectors on the disk, and VFS treats it as a single unit of data. In Linux, the block size must be a power of 2 and cannot exceed one page frame. Additionally, it must be an integer multiple of the sector size, since each block must contain an integer number of sectors. Therefore, in the 80×86 architecture, allowed block sizes are 512, 1024, 2048, and 4096 bytes.

The block size of a block device is not unique. When creating a disk file system, the administrator may choose a suitable block size; therefore, several partitions on the same disk may use different block sizes. Furthermore, each read or write operation issued on a block device file is a "raw" access, because it bypasses the disk file system; the kernel performs it using blocks of the largest size (4096 bytes). Each block requires its own block buffer, which is a RAM memory area used by the kernel to store the content of the block. When the kernel reads a block from disk, it fills the corresponding block buffer with the values obtained from the hardware device; similarly, when the kernel writes a block to disk, it uses the actual values of the related block buffer to update the corresponding group of adjacent bytes on the hardware device. The size of a block buffer matches the size of the corresponding block.

Each buffer has a "buffer head" descriptor of type buffer_head associated with it. It contains all the information the kernel needs in order to handle the buffer; thus, before operating on a buffer, the kernel checks its buffer head. We will describe all the fields of the buffer head in detail in Chapter 15; in this chapter, however, we only need a few of them: b_page, b_data, b_blocknr, and b_bdev.

The b_page field stores the page descriptor address of the page frame where the block buffer is located. If the page frame is located in high memory, then the b_data field stores the offset of the block buffer in the page; otherwise, b_data stores the starting linear address of the block buffer itself.
The b_blocknr field stores the logical block number (such as the block index in the disk partition).
Finally, the b_bdev field identifies the block device to which the buffer head refers (see the "Block devices" section later in this chapter).
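For reference, here is a simplified sketch of the buffer head, reduced to the four fields just mentioned (the real buffer_head of the 2.6 kernel contains many more fields, described in Chapter 15):

struct buffer_head {
	struct page *b_page;          /* page descriptor of the page frame holding the buffer */
	char *b_data;                 /* linear address of the buffer, or offset within b_page
	                                 when the page frame is in high memory */
	sector_t b_blocknr;           /* logical block number (e.g., block index in the partition) */
	struct block_device *b_bdev;  /* block device to which the buffer refers */
	/* ... many other fields ... */
};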

Segments

We know that each I/O operation on the disk is to transfer the contents of some adjacent sectors between the disk and some RAM units. In most cases, the disk controller directly uses DMA for data transfer [see the "Direct Memory Access (DMA)" section in Chapter 13]. The block device driver can trigger a data transfer by simply sending some appropriate commands to the disk controller; once the data transfer is completed, the controller will issue an interrupt to notify the block device driver.

A DMA transfer usually moves data from/to adjacent sectors on the disk. This is not strictly a physical constraint: the disk controller could be programmed to transfer data from non-adjacent sectors, but the transfer rate would be poor, because moving the read/write heads across the disk surface is quite slow. Older disk controllers support only "simple" DMA transfers: in this mode, the disk must transfer data to or from contiguous memory cells in RAM. Newer disk controllers, however, also support so-called scatter-gather DMA transfers: in this mode, the disk can transfer data to or from several non-contiguous memory areas.

To initiate a scatter-gather DMA transfer, the block device driver needs to send the disk controller:

  1. Starting disk sector number and total number of sectors to be transferred
  2. A list of memory-area descriptors, where each entry in the list consists of an address and a length.

The disk controller then takes care of the whole data transfer; for instance, in a read operation the controller fetches the data from the adjacent disk sectors and scatters it into the various memory areas. In order to make use of scatter-gather DMA transfers, the block device driver must be able to handle units of data called segments. A segment is a memory page, or a portion of a memory page, that contains the data of some adjacent disk sectors. Thus, a scatter-gather DMA operation may transfer several segments at once.

Note that a block device driver does not need to know about blocks, block sizes, and block buffers. Therefore, even if a segment is seen by the higher layers as a page made up of several block buffers, the block device driver does not care. As we have seen, the generic block layer can merge different segments if their corresponding page frames in RAM are contiguous and their corresponding chunks of data are adjacent on disk. The larger memory area resulting from this merging is called a physical segment. Yet another kind of merging is allowed on architectures that handle the mapping between bus addresses and physical addresses through a dedicated bus circuitry [an IO-MMU; see the section "Direct Memory Access (DMA)" in Chapter 13].

The memory area generated by this merging method is called a hardware segment. Since we focus on the 80×86 architecture, which does not have a dynamic mapping between bus addresses and physical addresses, we assume for the remainder of this chapter that hardware segments always correspond to physical segments.
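To make the list of memory-area descriptors sent to the controller (item 2 in the list above) more concrete, here is a schematic sketch; the structure and field names are hypothetical, because every controller and DMA API defines its own format:

/* Hypothetical descriptor of one memory area in a scatter-gather list. */
struct sg_entry {
	dma_addr_t addr;   /* bus address of the memory area */
	unsigned int len;  /* length of the memory area in bytes */
};

/* Hypothetical scatter-gather read command: the controller reads nr_sectors
   adjacent sectors starting at first_sector and scatters them into the
   nr_entries memory areas described by the sg array. */
struct sg_read_cmd {
	sector_t first_sector;
	unsigned int nr_sectors;
	struct sg_entry *sg;
	unsigned int nr_entries;
};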

The generic block layer

The generic block layer is a kernel component that handles the requests for all block devices in the system. Thanks to the functions it provides, the kernel can easily:

  1. Put data buffers in high memory: the page frame is mapped into the kernel's linear address space only when the CPU must access the data, and is unmapped right after.
  2. Implement, with some additional effort, a "zero-copy" scheme, in which disk data is placed directly in the user-mode address space instead of first being copied into kernel memory; in this case, the page frames used by the kernel for the I/O data transfer are mapped into the user-mode linear address space of the process.
  3. Manage logical volumes, such as those used by LVM (Logical Volume Manager) and RAID (Redundant Array of Inexpensive Disks): several disk partitions, even on different block devices, can be seen as a single partition.
  4. Exploit the advanced features of modern disk controllers, such as large on-board disk caches, enhanced DMA capabilities, scheduling of I/O transfer requests, and more.

The bio structure

The core data structure of the generic block layer is a descriptor called bio, which represents an I/O operation on a block device. Each bio essentially contains an identifier of a disk storage area (the starting sector number and the number of sectors in the area) and one or more segments describing the memory areas involved in the I/O operation. A bio is implemented by the bio data structure, whose fields are shown in Table 14-1.
[Table 14-1]
Each segment in a bio is described by a bio_vec data structure, whose fields are shown in Table 14-2. The bi_io_vec field of the bio points to the first element of an array of bio_vec structures, while the bi_vcnt field stores the current number of elements in the array.
[Table 14-2]
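For reference, here is a simplified sketch of the segment descriptor and of the bio fields mentioned so far (abridged from the 2.6 kernel's include/linux/bio.h; the complete lists are in Tables 14-1 and 14-2):

/* A segment: a portion of a page holding data of adjacent disk sectors. */
struct bio_vec {
	struct page *bv_page;     /* page frame containing the segment */
	unsigned int bv_len;      /* length of the segment in bytes */
	unsigned int bv_offset;   /* offset of the segment within the page */
};

/* Abridged bio descriptor: only the fields discussed in this section. */
struct bio {
	sector_t bi_sector;            /* first sector of the disk storage area */
	struct block_device *bi_bdev;  /* block device the I/O operation refers to */
	unsigned int bi_size;          /* size of the data yet to be transferred */
	unsigned short bi_vcnt;        /* number of segments in the bio_vec array */
	unsigned short bi_idx;         /* current index into the bio_vec array */
	struct bio_vec *bi_io_vec;     /* array of segment descriptors */
	/* ... other fields listed in Table 14-1 ... */
};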
The contents of the bio descriptor are kept up to date while the block I/O operation is in progress. For example, if the block device driver cannot perform the whole data transfer with one scatter-gather DMA operation, the bi_idx field of the bio is updated to point to the first segment still to be transferred. To iterate over the segments of a bio, starting from the current segment at index bi_idx, a device driver can use the bio_for_each_segment macro. When the generic block layer starts a new I/O operation, it allocates a new bio structure by calling the bio_alloc() function.

Usually bio structures are allocated by the slab allocator, but when memory is scarce the kernel falls back on a backup memory pool of bio structures (see the "Memory Pool" section in Chapter 8). The kernel also keeps a memory pool for bio_vec structures; after all, there would be no point in allocating a bio structure if the segment descriptors inside it could not be allocated. bio_alloc() sets the bi_cnt reference counter of the new bio to 1; correspondingly, the bio_put() function decrements bi_cnt, and if its value becomes 0, the bio structure and the related bio_vec structures are freed.
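A minimal usage sketch combining the functions just mentioned; error handling and the actual submission of the bio are omitted:

#include <linux/bio.h>

static void inspect_bio_segments(void)
{
	struct bio *bio;
	struct bio_vec *bvec;
	int i;

	/* Allocate a bio with room for up to 4 segments; bi_cnt is set to 1. */
	bio = bio_alloc(GFP_KERNEL, 4);
	if (!bio)
		return;

	/* ... fill in bi_sector, bi_bdev, bi_io_vec, bi_vcnt, and so on ... */

	/* Iterate over the segments, starting at the current index bi_idx. */
	bio_for_each_segment(bvec, bio, i) {
		/* bvec->bv_page, bvec->bv_len, bvec->bv_offset describe one segment */
	}

	/* Drop the reference; the bio is freed when bi_cnt reaches 0. */
	bio_put(bio);
}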

Representation of disks and disk partitions

A disk is a logical block device handled by the generic block layer. Usually a disk corresponds to a hardware block device such as a hard disk, a floppy disk, or an optical disk. However, a disk can also be a virtual device built upon several physical disk partitions, or a storage area living in some dedicated pages of RAM. In any case, thanks to the services of the generic block layer, the upper kernel components work in the same way on every disk. A disk is represented by the gendisk object, whose fields are shown in Table 14-3.
[Table 14-3]
The flags field stores information about the disk. The most important flag is GENHD_FL_UP: if it is set, the disk is initialized and ready to be used. Another relevant flag is GENHD_FL_REMOVABLE, which is set if the disk is removable, such as a floppy disk or CD-ROM. The fops field of the gendisk object points to a block_device_operations table, which stores a few customized methods for the main operations of the block device (see Table 14-4).
[Table 14-4]
Usually a hard disk is split into several partitions. Each block device file represents either a whole disk or a partition of a disk. For example, the device file /dev/hda, with major number 3 and minor number 0, may represent a primary EIDE disk; the first two partitions of this disk are represented by the device files /dev/hda1 and /dev/hda2, whose major numbers are both 3 and whose minor numbers are 1 and 2, respectively. Generally speaking, the partitions of a disk are distinguished by consecutive minor numbers. If a disk is split into partitions, their partition table is stored in an array of hd_struct structures, whose address is stored in the part field of the gendisk object. The array is indexed by the relative index of the partition within the disk. The fields of the hd_struct descriptor are shown in Table 14-5.
[Table 14-5]
When the kernel discovers a new disk in the system (in the boot phase, when a removable medium is inserted into a drive, or when an external disk is attached at run time), it calls the alloc_disk() function, which allocates and initializes a new gendisk object and, if the new disk is split into several partitions, a suitable array of hd_struct structures. Then the kernel calls the add_disk() function to insert the new gendisk object into the data structures of the generic block layer (see the "Registering and initializing device drivers" section later in this chapter).

Submitting a request

Let us describe the common sequence of steps executed by the kernel when it submits an I/O operation request to the generic block layer. We assume that the requested chunks of data are adjacent on disk and that the kernel has already determined their physical location. The first step consists of calling the bio_alloc() function to allocate a new bio descriptor. The kernel then initializes the bio descriptor by setting a few fields (a minimal sketch follows the list):

  1. Set bi_sector to the starting sector number of the data (if the block device is divided into several partitions, the sector number is relative to the starting position of the partition).
  2. Set bi_size to the size of the data to be transferred (in the 2.6 kernel this field is expressed in bytes).
  3. Set bi_bdev to the address of the block device descriptor (see the "Block Devices" section later in this chapter).
  4. Set bi_io_vec to the starting address of the bio_vec array, in which each element describes a segment (memory buffer) of the I/O operation; also set bi_vcnt to the total number of segments in the bio.
  5. Set bi_rw to the flag of the requested operation. The most important flag indicates the direction of data transfer: READ (0) or WRITE (1).
  6. Set bi_end_io to the address of the completion routine that is executed when the I/O operation on bio completes.
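Putting these steps together, a kernel component reading one segment of contiguous data might proceed roughly as in the following sketch (the names submit_read, foo_end_io, data_page, and data_len are hypothetical; the bio fields are those listed above):

static int foo_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
	/* Completion routine invoked when the I/O operation terminates
	   (signature as in Linux 2.6.11). */
	return 0;
}

static void submit_read(struct block_device *bdev, sector_t start_sector,
                        struct page *data_page, unsigned int data_len)
{
	struct bio *bio = bio_alloc(GFP_KERNEL, 1);
	if (!bio)
		return;

	bio->bi_sector = start_sector;            /* step 1: starting sector */
	bio->bi_size = data_len;                  /* step 2: size in bytes of the requested data */
	bio->bi_bdev = bdev;                      /* step 3: block device descriptor */
	bio->bi_io_vec[0].bv_page = data_page;    /* step 4: the only segment ... */
	bio->bi_io_vec[0].bv_len = data_len;
	bio->bi_io_vec[0].bv_offset = 0;
	bio->bi_vcnt = 1;                         /* ... and the total number of segments */
	bio->bi_rw = READ;                        /* step 5: direction of the transfer */
	bio->bi_end_io = foo_end_io;              /* step 6: completion routine */

	generic_make_request(bio);                /* hand the bio to the generic block layer */
}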

Once the bio descriptor has been properly initialized, the kernel invokes the generic_make_request() function, which is the main entry point of the generic block layer. The function essentially performs the following operations:

  1. Check that bio->bi_sector does not exceed the number of sectors of the block device. If it does, set the BIO_EOF flag in bio->bi_flags, print a kernel error message, call the bio_endio() function, and terminate. bio_endio() updates the bi_size and bi_sector fields of the bio descriptor and then invokes the bi_end_io method of the bio. The implementation of the bi_end_io function essentially depends on the kernel component that triggered the I/O data transfer; we will see some examples of bi_end_io methods in the following chapters.
  2. Get the request queue q associated with the block device (see the "Request queue descriptor" section later in this chapter); its address is stored in the gendisk object referenced by the bd_disk field of the block device descriptor, which in turn is pointed to by bio->bi_bdev.
  3. Call the block_wait_queue_running() function to check whether the I/O scheduler currently in use can be dynamically replaced; if so, let the current process sleep until a new I/O scheduler is started (see the next section "I/O Scheduler ").
  4. Call the blk_partition_remap() function to check whether the block device refers to a disk partition (bio->bi_bdev not equal to bio->bi_bdev->bd_contains; see the "Block devices" section later in this chapter). If so, obtain the hd_struct descriptor of the partition from bio->bi_bdev and perform the following sub-operations:
    a. Update the read_sectors and reads fields, or the write_sectors and writes fields, of the hd_struct descriptor, according to the direction of the data transfer.
    b. Adjust the bio->bi_sector value to convert the starting sector number relative to the partition into the sector number relative to the entire disk.
    c. Set bio->bi_bdev to the block device descriptor of the whole disk (bio->bi_bdev->bd_contains). From now on, the generic block layer, the I/O scheduler, and the device driver forget about the existence of disk partitions and work directly on the whole disk.
  5. Call the q->make_request_fn method to insert the bio request into the request queue q.
  6. return.
    We will discuss a typical implementation of the make_request_fn method in the "Issuing a request to the I/O scheduler" section later in this chapter.

I/O scheduler

Although block device drivers are able to transfer a single sector at a time, the block I/O layer does not perform an individual I/O operation for each sector accessed on disk; this would lead to poor disk performance, because locating the physical position of a sector on the disk surface is quite time-consuming. Instead, whenever possible, the kernel tries to cluster several sectors together and handle them as a whole, thus reducing the average number of head movements. When a kernel component wishes to read or write some disk data, it actually creates a block device request. In essence, the request describes the requested sectors and the kind of operation to be performed on them (read or write). However, the kernel does not satisfy a request as soon as it is issued: the I/O operation is just scheduled, and its execution is deferred. This artificial delay is the key mechanism for boosting the performance of block devices.

When a new data block is requested, the kernel checks whether the new request can be satisfied by slightly extending the previous request that has been waiting (that is, whether the new request can be satisfied without further seek operations). Since disk access is mostly sequential, this simple mechanism is very efficient. Delayed requests complicate handling of block devices.
For instance, suppose a process opens a regular file and, as a consequence, the file system driver has to read the corresponding inode from disk. The block device driver puts the request in a queue, and the process is suspended until the block storing the inode is transferred. However, the block device driver itself must not block, because any other process trying to access the same disk would be blocked as well. To keep block device drivers from being suspended, each I/O operation is processed asynchronously.

In particular, block device drivers are interrupt driven (see the "Monitoring I/O Operations" section in Chapter 13): the generic block layer invokes the I/O scheduler to create a new block device request or to enlarge an already existing one, and then terminates. The block device driver, which is activated at a later time, invokes a so-called strategy routine, which selects a pending request and satisfies it by issuing suitable commands to the disk controller.

When the I/O operation terminates, the disk controller raises an interrupt, and the corresponding handler invokes the strategy routine again, if necessary, to process another request in the queue. Each block device driver maintains its own request queue, which contains the list of pending requests for the device. If the disk controller handles several disks, there is usually one request queue for each physical block device. I/O scheduling is performed separately on each request queue, thus improving disk performance.

Request queue descriptor

The request queue is represented by a large data structure request_queue, whose fields are shown in Table 14-6.
[Table 14-6]

In essence, the request queue is a doubly linked list whose elements are request descriptors (that is, the request data structure; see the next section). The queue_head field in the request queue descriptor stores the head of the linked list (the first pseudo element), and the pointer in the queuelist field in the request descriptor links any request to the previous and next elements of the linked list.
The ordering of the elements in the queue list is specific to each block device driver; the I/O scheduler, however, offers several predefined ways of ordering the elements, which will be discussed later in the "I/O scheduling algorithms" section. The backing_dev_info field is a small object of type backing_dev_info that stores information about the I/O data traffic of the underlying hardware block device. For example, it holds information about read-ahead and about the congestion state of the request queue.

Request descriptor

Each block device's pending request is represented by a request descriptor, which is stored in the request data structure shown in Table 14-7.
[Table 14-7]
Each request consists of one or more bio structures. Initially, the generic block layer creates a request including just one bio. Later, the I/O scheduler may "extend" the request either by adding a new segment to the original bio or by linking another bio structure to the request. This is possible only when the new data is physically adjacent to the data already in the request. The bio field of the request descriptor points to the first bio structure in the request, while the biotail field points to the last one. The rq_for_each_bio macro implements a loop that iterates over all the bio structures in a request.
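The following sketch shows how a driver might walk the bio list of a request with the macro just mentioned; the 2.6 rq_for_each_bio() macro simply follows the bi_next links starting from rq->bio:

static unsigned int count_request_bios(struct request *rq)
{
	struct bio *bio;
	unsigned int nr_bios = 0;

	/* Visit every bio linked into the request, from rq->bio to rq->biotail. */
	rq_for_each_bio(bio, rq)
		nr_bios++;

	return nr_bios;  /* number of bio structures merged into this request */
}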
Several field values ​​in the request descriptor may change dynamically. For example, once all the data blocks referenced in the bio have been transferred, the bio field is immediately updated to point to the next bio in the request list. During this period, new bio may be added to the end of the request list, so the value of biotail may also change. While the disk data block is being transferred, the values ​​of several other fields of the request descriptor are modified by the I/O scheduler or device driver.

For example, nr_sectors stores the number of sectors yet to be transferred for the whole request, while current_nr_sectors stores the number of sectors yet to be transferred in the current bio structure. The flags field stores many flags, listed in Table 14-8; by far the most important is REQ_RW, which determines the direction of the data transfer.
[Table 14-8]

Managing the allocation of request descriptors

Under heavy loads with frequent disk operations, the limited amount of free dynamic memory may become a bottleneck for processes that want to add new requests to the request queue q. To solve this problem, each request_queue descriptor contains a request_list data structure, which includes:

  1. A pointer to the memory pool for the requested descriptor (see the "Memory Pool" section in Chapter 8).
  2. Two counters are used to record the number of request descriptors allocated to READ and WRITE requests.
  3. Two flags used to mark whether an allocation failed for a read or write request.
  4. Two wait queues that store the processes sleeping while waiting for a free read or write request descriptor, respectively.
  5. A wait queue that stores the processes waiting for the request queue to be flushed (emptied).

The blk_get_request() function tries to get a free request descriptor from the memory pool of a given request queue; if memory is scarce and the pool is exhausted, it either suspends the current process or, if the kernel control path cannot block, returns NULL. If the allocation succeeds, the address of the request_list data structure of the request queue is stored in the rl field of the new request descriptor. The blk_put_request() function releases a request descriptor; if its reference counter reaches 0, the descriptor is given back to the memory pool it came from.
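A brief sketch of how the two helpers pair up (the GFP flags decide whether the caller may sleep; with GFP_ATOMIC the function returns NULL instead of blocking):

static void allocate_and_release_request(request_queue_t *q)
{
	struct request *rq;

	/* Get a free request descriptor for a READ; with GFP_KERNEL the
	   caller may be put to sleep until a descriptor becomes available. */
	rq = blk_get_request(q, READ, GFP_KERNEL);
	if (!rq)
		return;

	/* ... fill in the request and hand it to the device driver ... */

	/* Return the descriptor to the memory pool it came from. */
	blk_put_request(rq);
}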

Avoiding request queue congestion

Each request queue has a maximum number of requests it is allowed to handle. The nr_requests field of the request queue descriptor stores the maximum number of requests allowed to be processed in each data transfer direction. By default, a queue has at most 128 pending read requests and 128 pending write requests.

If the number of pending read (write) requests exceeds nr_requests, the queue is marked as full by setting the QUEUE_FLAG_READFULL (QUEUE_FLAG_WRITEFULL) flag in the queue_flags field of the request queue descriptor, and blockable processes that try to add a request for that data transfer direction are put to sleep in the corresponding wait queue of the request_list data structure. A filled-up request queue has a negative impact on system performance, because it forces many processes to sleep while waiting for the completion of I/O data transfers. Therefore, if the number of pending requests in a given direction exceeds the value stored in the nr_congestion_on field of the request queue descriptor (by default, 113), the kernel regards the queue as congested and tries to slow down the rate at which new requests are created.

When the number of pending requests is less than the value of nr_congestion_off (the default value is 111), the congested request queue becomes non-congested. The blk_congestion_wait() function suspends the current process until all request queues become uncongested or the timeout has expired.

Activating the block device driver

As we saw earlier, it is advantageous to delay the activation of a block device driver, so that requests for adjacent blocks can be clustered. The delay is accomplished through a technique known as device plugging and unplugging (Note 2). As long as a block device driver is plugged, it is not activated even if there are pending requests in its queue. The blk_plug_device() function plugs a block device, or more precisely, a request queue serviced by some block device driver.

Essentially, the function receives as its argument the address q of a request queue descriptor. It sets the QUEUE_FLAG_PLUGGED bit in the q->queue_flags field and then restarts the dynamic timer embedded in the q->unplug_timer field. blk_remove_plug() unplugs a request queue q: it clears the QUEUE_FLAG_PLUGGED flag and cancels the q->unplug_timer dynamic timer. This function is explicitly invoked by the kernel when all mergeable requests "in sight" have been added to the queue. Moreover, the I/O scheduler also unplugs a request queue when the number of pending requests in it exceeds the value stored in the unplug_thresh field of the request queue descriptor (4 by default).

If a device stays plugged for a time interval of length q->unplug_delay (usually 3 ms), the dynamic timer activated by blk_plug_device() expires and the blk_unplug_timeout() function is executed. As a consequence, the kblockd_workqueue work queue, operated by the kblockd kernel thread, is awakened (see the "Work Queue" section in Chapter 4). kblockd executes the blk_unplug_work() function, whose address is stored in the q->unplug_work data structure. In turn, this function invokes the q->unplug_fn method of the request queue, which is usually implemented by the generic_unplug_device() function.

The generic_unplug_device() function unplugs the block device: first, it checks whether the request queue is still active; then, it invokes the blk_remove_plug() function; finally, it executes the strategy routine, that is, the request_fn method, to start processing the next request in the queue (see the "Registering and initializing device drivers" section later in this chapter).

I/O scheduling algorithms

When a new request is added to the request queue, the generic block layer invokes the I/O scheduler to determine the exact position of the new element in the queue. The I/O scheduler tries to keep the requests sorted by sector number. If the requests to be processed are taken from the list in that order, the number of head seeks is greatly reduced, because the head moves from the inner track to the outer one (or vice versa) in a linear fashion, rather than jumping randomly from one track to another. This is reminiscent of the algorithm used by elevators to handle requests coming from different floors: the elevator moves in one direction, and when it has reached the last scheduled floor in that direction, it reverses and starts moving in the opposite direction. For this reason, I/O schedulers are also called elevators.

Under heavy load, an I/O scheduling algorithm that strictly follows sector order would not perform well. In that case, the completion time of a data transfer depends mostly on the physical position of the data on disk. Thus, if a device driver were processing requests at the head of the queue (those with the lowest sector numbers), and new requests with low sector numbers kept being added, the requests at the tail of the queue could easily starve. For this reason, real I/O scheduling algorithms are considerably more sophisticated.
Currently, Linux 2.6 offers four different types of I/O schedulers, or elevators: the "Anticipatory" algorithm, the "Deadline" algorithm, the "CFQ (Complete Fairness Queueing)" algorithm, and the "Noop (No Operation)" algorithm. For most block devices, the default elevator used by the kernel can be chosen at boot time with the elevator= kernel parameter, whose value can be one of as, deadline, cfq, and noop. If no boot parameter is given, the kernel uses the "Anticipatory" I/O scheduler by default. In any case, a device driver can replace the default elevator with another one; a device driver can also define its own custom I/O scheduling algorithm, but this is rare. Furthermore, the system administrator can change the I/O scheduler of a specific block device at run time. For example, to change the I/O scheduler used by the master disk of the first IDE channel, the administrator can write the name of an elevator algorithm into the /sys/block/hda/queue/scheduler file of the sysfs special file system (see the "sysfs file system" section in Chapter 13).

The I/O scheduling algorithm used in the request queue is represented by an elevator object of type elevator_t; the address of the object is stored in the elevator field of the request queue descriptor. The elevator object contains several methods that cover all possible operations of the elevator: linking and disconnecting the elevator, adding and merging requests in the queue, deleting requests from the queue, getting the next pending request in the queue, etc. The elevator object also stores the address of a table that contains all the information needed to process the request queue. Furthermore, each request descriptor contains an elevator_private field that points to an additional data structure used by the I/O scheduler to handle the request.

We now briefly describe the four I/O scheduling algorithms, from the simplest to the most sophisticated. Note that designing an I/O scheduler is much like designing a CPU scheduler (see Chapter 7): the heuristics and the constant values used are the result of an extensive amount of testing and benchmarking. Generally speaking, all the algorithms make use of a dispatch queue containing all the requests sorted in the order in which they should be processed by the device driver; the next request to be serviced by the device driver is usually the first element of the dispatch queue.

The dispatch queue is actually the request queue determined by the queue_head field of the request queue descriptor. Almost all algorithms use additional queues to classify and order requests. They allow the device driver to add bio structures to an existing request and, if necessary, merge two "adjacent" requests.

"Noop" algorithm

This is the simplest I/O scheduling algorithm. It has no sorted queue: new requests are usually inserted at the beginning or end of the dispatch queue, and the next request to be processed is always the first request in the queue.

"CFQ" algorithm

The main goal of the "CFQ (Completely Fair Queuing)" algorithm is to ensure fair distribution of disk I/O bandwidth among all processes that trigger I/O requests. To achieve this goal, the algorithm uses many sorting queues, which store requests issued by different processes.

When the algorithm handles a request, the kernel invokes a hash function that converts the thread group identifier of the current process (usually corresponding to its PID; see the "Identifying a Process" section in Chapter 3) into the index of a queue; then the algorithm inserts the new request at the tail of that queue. Therefore, requests coming from the same process are usually inserted into the same queue. To refill the dispatch queue, the algorithm essentially scans the I/O input queues in a round-robin fashion, selects the first non-empty queue, and moves a batch of requests from that queue to the tail of the dispatch queue.

"Deadline" Algorithm

Besides the dispatch queue, the "Deadline" algorithm makes use of four queues. Two sorted queues contain the read requests and the write requests, respectively, ordered by starting sector number. The other two deadline queues contain the same read and write requests, but sorted according to their "deadlines." These queues are introduced to avoid request starvation, which occurs when the elevator policy keeps privileging requests close to the one just serviced, so that some request is ignored for a very long period of time. A request deadline is essentially a timeout that starts ticking when the request is handed to the elevator.

By default, the timeout of read requests is 500 ms, while the timeout of write requests is 5 s; read requests are privileged over write requests because they usually block the processes that issued them. The deadline ensures that the scheduler looks after a request that has been waiting for a long time, even if it sits at the end of the sorted queue. When the algorithm must replenish the dispatch queue, it first determines the data direction of the next request. If there are both read and write requests to be dispatched, the algorithm chooses the "read" direction, unless the "write" direction has been discarded too many times (to avoid starvation of write requests). Next, the algorithm checks the deadline queue associated with the chosen direction: if the deadline of the first request in the queue has elapsed, the algorithm moves that request to the tail of the dispatch queue; it also moves a batch of requests taken from the sorted queue, starting from the request that timed out. The batch is longer if the requests being moved are physically adjacent on disk, shorter otherwise.

Finally, if there are no request timeouts, the algorithm schedules a set of requests after the last request from the sorted queue. When the pointer reaches the end of the sorted queue, the search starts again from the beginning ("one-way algorithm").

"anticipation" algorithm

The "anticipation" algorithm is the most complex I/O scheduling algorithm provided by Linux. Basically, it is an evolution of the "deadline" algorithm, borrowing the basic mechanism of the "deadline" algorithm: two deadline queues and two sorting queues; the I/O scheduler interactively scans and sorts between reads and writes. Queue, but prefers read requests. Scanning is basically continuous unless a request times out. The default timeout for read requests is 125ms, and the default timeout for write requests is 250ms. However, the algorithm also follows some additional heuristic guidelines:
1. In some cases, the algorithm may pick a request behind the current position in the sorted queue, thus forcing a backward seek of the head. This typically happens when the backward seek distance to that request is less than half the forward seek distance to the request after the current position in the sorted queue.
2. The algorithm collects statistics about the patterns of I/O operations triggered by every process in the system. Right after dispatching a read request issued by some process p, the algorithm checks whether the next request in the sorted queue comes from the same process p. If so, the next request is dispatched immediately. Otherwise, the algorithm looks at the statistics collected for process p: if they suggest that process p will likely issue another read request soon, the algorithm stalls for a short period of time (roughly 7 ms by default). Thus, the algorithm anticipates a read request from process p that may be a "close neighbor" on disk of the request just dispatched.

Issuing a request to the I/O scheduler

As we saw in the "Submitting Requests" section earlier in this chapter, the generic_make_request() function calls the make_request_fn method of the request queue descriptor to send a request to the I/O scheduler. Usually this method is implemented by the __make_request() function; this function receives a request_queue type descriptor q and a bio structure descriptor bio as its parameters, and then performs the following operations:

  1. If necessary, call the blk_queue_bounce() function to create a bounce buffer (see later). If a bounce buffer is created, the __make_request() function will operate on the buffer instead of the original bio structure.
  2. Call the elv_queue_empty() function of the I/O scheduler to check whether there are pending requests in the request queue. Note that the dispatch queue could be empty while other queues of the I/O scheduler contain pending requests. If there are no pending requests at all, call the blk_plug_device() function to plug the request queue (see the "Activating the block device driver" section earlier in this chapter), and then jump to step 5.
  3. The request queue contains pending requests. Call the elv_merge() function of the I/O scheduler to check whether the new bio structure can be merged into an existing request. The function may return three possible values:
    3.1. ELEVATOR_NO_MERGE: The existing request cannot contain the bio structure; in this case, jump to step 5.
    3.2. ELEVATOR_BACK_MERGE: The bio structure can be inserted into a request req as the last bio;
    in this case, the function calls the q->back_merge_fn method to check whether the request can be extended. If it cannot, jump to step 5. Otherwise, append the bio descriptor at the tail of the req list and update the relevant fields of req. The function then tries to merge the request with the request that follows it (the new bio might fill a hole between the two requests).
    3.3. ELEVATOR_FRONT_MERGE: The bio structure can be inserted as the first bio of a request req; in this case, the function calls the q->front_merge_fn method to check whether the request can be extended. If not, jump to step 5. Otherwise, insert the bio descriptor into the head of the req list and update the corresponding field value of req. Then, an attempt is made to merge the request with its previous request.
  4. bio has been incorporated into the existing request, jump to step 7 to terminate the function.
  5. The bio must be inserted into a new request. Allocate a new request descriptor. If no free memory is available, suspend the current process until a request descriptor becomes free, unless the BIO_RW_AHEAD flag in bio->bi_rw is set, which means that this I/O operation is a read-ahead (see Chapter 16); in that case, call bio_endio() and terminate: the data transfer is simply not performed. For a description of bio_endio(), see step 1 of generic_make_request() in the "Submitting a request" section earlier in this chapter.
  6. Initialize fields in the request descriptor. The main ones are:
    a. Initialize each field according to the content of the bio descriptor, including the number of sectors, the current bio and the current segment.
    b. Set the REQ_CMD flag in the flags field (a standard read or write operation).
    c. If the page frame of the first bio segment is stored in low-end memory, set the buffer field to the linear address of the buffer.
    d. Set the rq_disk field to the address of bio->bi_bdev->bd_disk.
    e. Insert bio into the request list.
    f. Set the start_time field to the value of jiffies.
  7. All done. Before terminating, however, check whether the BIO_RW_SYNC flag in bio->bi_rw is set. If so, call the generic_unplug_device() function on the request queue to unplug the device (see the "Activating the block device driver" section earlier in this chapter).
  8. The function terminates.

If the request queue was not empty before the __make_request() function was invoked, then it is either already unplugged or will be unplugged soon, because every plugged request queue q with pending requests has a running q->unplug_timer dynamic timer. On the other hand, if the request queue was empty, __make_request() plugs it. Sooner (on exit from __make_request(), if the BIO_RW_SYNC flag of the bio is set) or later (at worst, when the unplug timer expires), the request queue will be unplugged. In any case, the strategy routine of the block device driver will eventually take care of the requests in the dispatch queue (see the "Registering and initializing device drivers" section later in this chapter).

blk_queue_bounce() function

The blk_queue_bounce() function looks at the flags in q->bounce_gfp and at the threshold in q->bounce_pfn to determine whether buffer bouncing might be necessary. This usually happens when some of the buffers in the request are located in high memory and the hardware device cannot reach them.

The old DMA method used by the ISA bus can only handle 24-bit physical addresses. Therefore, the upper limit of the bounce buffer is set to 16 MB, which means that the page frame number is 4096. However, when dealing with older devices, block device drivers generally do not rely on bounce buffers; instead, they prefer to allocate DMA buffers directly in the ZONE_DMA memory zone. If the hardware device cannot handle buffers in high memory, the blk_queue_bounce() function checks whether some buffers in the bio really must be bounced.

If buffer bouncing is needed, the function makes a copy of the bio descriptor, thus creating a bounce bio; then, for each segment whose page frame number is equal to or greater than q->bounce_pfn, it performs the following operations:

  1. Allocate a page frame in the ZONE_NORMAL or ZONE_DMA memory zone, according to the allocation flags.
  2. Update the bv_page field of the corresponding segment descriptor in the bounce bio so that it points to the descriptor of the new page frame.
  3. If bio->bi_rw specifies a write operation, call kmap() to temporarily map the high memory page into the kernel address space, copy the high memory page into the low memory page, and call kunmap() to release the mapping.

The blk_queue_bounce() function then sets the BIO_BOUNCED flag in the bounce bio, initializes a specific bi_end_io method for it, and stores the address of the original bio in the bi_private field of the bounce bio. When the I/O data transfer on the bounce bio terminates, this bi_end_io method copies the data into the high memory area (for read operations only) and releases the bounce bio structure.
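The copy performed for a write operation in step 3 above is essentially the classic kmap()/kunmap() pattern; a simplified sketch (the function and variable names are illustrative):

static void copy_highmem_to_bounce(struct page *high_page,
                                   struct page *bounce_page,
                                   unsigned int offset, unsigned int len)
{
	char *src, *dst;

	/* Temporarily map the high memory page into the kernel address space. */
	src = kmap(high_page);
	dst = page_address(bounce_page);  /* the bounce page lives in low memory */

	memcpy(dst + offset, src + offset, len);

	/* Release the temporary kernel mapping. */
	kunmap(high_page);
}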

Block device drivers

Block device drivers are the lowest-level components in the Linux block subsystem. They get requests from the I/O scheduler and process those requests as required. Of course, block device drivers are part of the device driver model (see the "Device Driver Model" section in Chapter 13). Therefore, each block device driver corresponds to a descriptor of type device_driver; furthermore, each disk handled by a device driver is associated with a device descriptor. However, there is nothing special about these descriptors: the block I/O subsystem must store additional information for each block device in the system.

Block devices

A block device driver may handle several block devices. For example, an IDE device driver can handle several IDE disks, each of which is a separate block device. Furthermore, each disk is usually partitioned, and each partition can be viewed as a logical block device. Obviously, the block device driver must handle all VFS system calls issued on the block device file corresponding to the block device. Each block device is represented by a descriptor of a block_device structure, whose fields are shown in Table 14-9.
[Table 14-9]
All block device descriptors are inserted into a global linked list. The head of the linked list is represented by the variable all_bdevs; the pointer used for linking the linked list is located in the bd_list field of the block device descriptor. If the block device descriptor corresponds to a disk partition, then the bd_contains field points to the block device descriptor associated with the entire disk, and the bd_part field points to the hd_struct partition descriptor (see the "Representation of Disks and Disk Partitions" section earlier in this chapter). Otherwise, if the block device descriptor corresponds to the entire disk, then the bd_contains field points to the block device descriptor itself, and the bd_part_count field is used to record how many times the partition on the disk has been opened. The bd_holder field stores the linear address representing the block device holder. The holder is not a block device driver doing I/O data transfers; rather, it is a kernel component that uses the device and has unique privileges (for example, it can freely use the bd_private field of the block device descriptor).
Typically, the holder of a block device is the file system mounted on it. Another common case occurs when a block device file is opened for exclusive access: the holder is then the corresponding file object. The bd_claim() function sets the bd_holder field to a given address; conversely, the bd_release() function resets it to NULL. Note, however, that the same kernel component can invoke bd_claim() several times; each invocation increases the bd_holders counter. To release the block device, the kernel component must invoke bd_release() a corresponding number of times.
Figure 14-3, which refers to a whole disk, illustrates how the block device descriptor is linked to the other main data structures of the block I/O subsystem.

Accessing a block device

When the kernel receives a request to open a block device file, it must first determine whether the device file is already open. In fact, if it is, the kernel must not create and initialize a new block device descriptor; rather, it should update the already existing one. To complicate matters, block device files having the same major and minor numbers but different pathnames are regarded by the VFS as distinct files, even though they refer to the same block device. Therefore, the kernel cannot determine whether the block device is already in use simply by checking for the existence in the inode cache of an object corresponding to a block device file.
[Figure 14-3]
The relationship between major and minor numbers and the corresponding block device descriptors is maintained through the bdev special file system (see the "Special File Systems" section in Chapter 12). Each block device descriptor corresponds to a bdev special file: the bd_inode field of the block device descriptor points to the corresponding bdev inode, and that inode encodes the major and minor numbers of the block device and the address of the corresponding descriptor. bdget() receives as its parameter the major and minor numbers of a block device: it searches the bdev file system for the related inode; if no such inode exists, it allocates a new inode and a new block device descriptor. In any case, the function returns the address of the block device descriptor corresponding to the given major and minor numbers. Once the block device descriptor has been found, the kernel can determine whether the block device is currently in use by checking the value of the bd_openers field: if it is positive, the block device is already in use (possibly through a different device file). The kernel also keeps a list of the inode objects of the opened block device files. The head of the list is stored in the bd_inodes field of the block device descriptor; the i_devices field of each inode object stores the pointers that link it to the previous and next elements of the list.

Registering and initializing device drivers

Let's now describe the basic steps involved in designing a new driver for a block device. Admittedly, the description will be somewhat sketchy, but it is useful for understanding when and how the main data structures used by the block I/O subsystem are initialized. We omit all the steps that are required of block device drivers but were already covered in Chapter 13. For example, we skip the steps needed to register the driver itself (see the "Device Driver Model" section in Chapter 13). Usually, a block device belongs to a standard bus architecture such as PCI or SCSI, and the kernel offers helper functions which, as a side effect, register the driver in the device driver model.

Custom driver descriptor

First, the device driver requires a custom descriptor foo of type foo_dev_t, which holds the data needed to drive the hardware device. This descriptor stores relevant information about each device, such as the I/O port used to operate the device, the IRQ line from which the device issues an interrupt, the internal status of the device, etc. It also contains some fields required by the block I/O subsystem:

struct foo_dev_t {
[...]
spinlock_t lock;
struct gendisk *gd;
[...]
} foo;

The lock field is a spin lock used to protect the field value in the foo descriptor; its address is usually passed to a kernel helper function to protect data structures of the block I/O subsystem specific to the driver. The gd field is a pointer to the gendisk descriptor that describes the entire block device (disk) handled by this driver.

Reserving the major number

The device driver must reserve a major number for its own use. Traditionally, this is done by invoking the register_blkdev() function:

err = register_blkdev(FOO_MAJOR, "foo");
if (err) goto error_major_is_busy;

This function is similar to the register_chrdev() function presented in the "Assigning Device Numbers" section of Chapter 13: it reserves the major number FOO_MAJOR and associates the device name foo with it. Note that no range of minor numbers is allocated, because there is no equivalent of the register_chrdev_region() function; moreover, no link is established between the reserved major number and the driver's data structures. The only visible effect of register_blkdev() is a new entry in the list of registered major numbers shown by the /proc/devices special file.

Initialize the custom descriptor

All the fields of the foo descriptor must be properly initialized before the driver can be used. To initialize the fields related to the block I/O subsystem, the device driver essentially performs the following operations:

spin_lock_init(&foo.lock);
foo.gd = alloc_disk(16);
if (!foo.gd)
	goto error_no_gendisk;

The driver first initializes the spin lock and then allocates a disk descriptor. As you saw earlier in Figure 14-3, the gendisk structure is the most important data structure in the block I/O subsystem, because it is linked to many other data structures. The alloc_disk() function also allocates the array that stores the disk partition descriptors; its parameter is the number of hd_struct elements in that array. The value 16 means that the driver can handle disks containing up to 15 partitions (partition 0 is not used).

Initialize gendisk descriptor

Next, the driver initializes some fields of the gendisk descriptor:

foo.gd->private_data = &foo;
foo.gd->major = FOO_MAJOR;
foo.gd->first_minor = 0;
foo.gd->minors = 16;
set_capacity(foo.gd, foo_disk_capacity_in_sectors);
strcpy(foo.gd->disk_name, "foo");
foo.gd->fops = &foo_ops;
The address of the foo descriptor is stored in the private_data field of the gendisk structure, so that the low-level driver functions invoked as methods by the block I/O subsystem can quickly find the driver descriptor. This approach improves efficiency, particularly when the driver handles more than one disk at a time. The set_capacity() function initializes the capacity field with the disk size expressed in 512-byte sectors. This value may also be determined by probing the hardware and asking for the disk parameters.
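
For instance, assuming (purely for illustration) a 1 GiB disk, the capacity expressed in 512-byte sectors is 1 GiB / 512 B = 2,097,152 sectors:

set_capacity(foo.gd, 1024ULL * 1024 * 1024 / 512);   /* 2,097,152 sectors of 512 bytes */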

Initialize block device operation table

The fops field of the gendisk descriptor is initialized with the address of a custom table of block device methods (see Table 14-4 earlier in this chapter) (Note 3). The foo_ops table of our device driver likewise contains driver-specific functions. For example, if the hardware device supports removable disks, the generic block layer invokes the media_changed method to check whether the disk has been changed since the block device was last mounted or opened. This check is usually done by sending low-level commands to the hardware controller, so each device driver implements its own media_changed method. Similarly, the ioctl method is invoked only when the generic block layer does not know how to handle an ioctl command; for example, the method is typically invoked when an ioctl() system call asks for the disk geometry, that is, the number of cylinders, tracks, sectors, and heads used by the disk. Thus, each device driver also implements its own ioctl method. A hedged sketch of such a table is shown below.
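
This sketch assumes the 2.6-era struct block_device_operations layout and hypothetical driver functions foo_open(), foo_release(), foo_ioctl(), and foo_media_changed():

static struct block_device_operations foo_ops = {
	.owner         = THIS_MODULE,
	.open          = foo_open,           /* invoked when the device file is opened */
	.release       = foo_release,        /* invoked on the last close */
	.ioctl         = foo_ioctl,          /* device-specific ioctl commands */
	.media_changed = foo_media_changed,  /* removable-media check */
};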

Allocate and initialize request queue

Our intrepid device driver designer will now set up a request queue that will hold requests waiting to be processed. You can easily create a request queue by doing the following:

foo.gd->rq = blk_init_queue(foo_strategy, &foo.lock);
if (!foo.gd->rq)
	goto error_no_request_queue;
blk_queue_hardsect_size(foo.gd->rq, foo_hard_sector_size);
blk_queue_max_sectors(foo.gd->rq, foo_max_sectors);
blk_queue_max_hw_segments(foo.gd->rq, foo_max_hw_segments);
blk_queue_max_phys_segments(foo.gd->rq, foo_max_phys_segments);

The blk_init_queue() function allocates a request queue descriptor and initializes many of its fields with default values. It receives as parameters the address of the device driver's strategy routine (see the next section, "Strategy routine"), which is stored in the foo.gd->rq->request_fn field, and the address of the spin lock in the device descriptor, which is stored in the foo.gd->rq->queue_lock field.
The function also initializes the foo.gd->rq->elevator field, forcing the driver to use the default I/O scheduling algorithm. If the device driver wants to use a different scheduler, it may override the elevator field later. Next, several helper functions set various fields of the request queue descriptor to values characteristic of the device driver (see Table 14-6 for the corresponding fields).

Set interrupt handler

As introduced in the "I/O Interrupt Handling" section of Chapter 4, the device driver needs to register the IRQ line for the device. This can be accomplished by doing the following:

request_irq(foo_irq, foo_interrupt, SA_INTERRUPT|SA_SHIRQ, "foo", &foo);

The foo_interrupt() function is the device's interrupt handler; we discuss some of its features in the "Interrupt Handlers" section later in this chapter.

Register disk

Finally, all the data structures for the device driver are ready: the final step in the initialization phase is to "register" and activate the disk. This can be accomplished simply by performing the following operation:

add_disk(foo.gd);
The add_disk() function receives the address of the gendisk descriptor as its parameter and mainly performs the following operations:

  1. Set the GENHD_FL_UP flag of gd->flags.
  2. Call kobj_map() to establish the connection between the device driver and the device's major number (together with the associated range of minor numbers) (see the "Character Device Drivers" section in Chapter 13; note that in this case the kobject mapping domain is represented by the bdev_map variable).
  3. Register the kobject embedded in the gendisk descriptor in the device driver model as a new device handled by the device driver (for example, /sys/block/foo).
  4. If necessary, scan the partition table on the disk; for each partition found, properly initialize the corresponding hd_struct descriptor in the foo.gd->part array. Also register the partitions in the device driver model (for example, /sys/block/foo/foo1).
  5. Register the kobject embedded in the request queue descriptor in the device driver model (for example, /sys/block/foo/queue).

Once add_disk() returns, the device driver is ready to work. The function that performed the initialization terminates; from now on, the strategy routine and the interrupt handler take care of each request passed to the device driver by the I/O scheduler.
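
For completeness, a hedged sketch of the matching teardown path (for example, in the module's exit function) follows; it simply undoes the steps above, using the 2.6-era API and the foo.gd->rq naming convention adopted earlier in this section:

static void foo_exit(void)
{
	del_gendisk(foo.gd);                  /* unregister and deactivate the disk */
	blk_cleanup_queue(foo.gd->rq);        /* release the request queue */
	put_disk(foo.gd);                     /* drop the reference on the gendisk */
	free_irq(foo_irq, &foo);              /* release the IRQ line */
	unregister_blkdev(FOO_MAJOR, "foo");  /* give back the major number */
}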

Strategy routine

The strategy routine is a function, or a group of functions, of the block device driver that interacts with the hardware block device to satisfy the requests collected in the dispatch queue. The strategy routine is invoked through the request_fn method of the request queue descriptor (for example, the foo_strategy() function introduced in the previous section). The I/O scheduler layer passes to this function the address q of the request queue descriptor.

As mentioned earlier, the strategy routine is usually started after a new request is inserted into an empty request queue. Once activated, the block device driver should handle all requests in the queue until the queue is empty. A naive implementation of the strategy routine would be the following: for each element in the dispatch queue, interact with the block device controller to serve the request, wait until the data transfer completes, remove the served request from the queue, and continue with the next request in the dispatch queue.

Such an implementation is not very efficient. Even assuming that DMA is used to transfer the data, the strategy routine must suspend itself while waiting for the I/O operation to complete; this means that the strategy routine would have to run in a dedicated kernel thread (we don't want to penalize unrelated user processes). Moreover, such a driver could not exploit modern disk controllers that can process several I/O data transfers at once. Therefore, many block device drivers adopt the following strategy:

  1. The strategy routine handles the first request in the queue and sets up the block device controller so that an interrupt is raised when the data transfer completes. Then the strategy routine terminates.
  2. When the disk controller raises the interrupt, the interrupt handler invokes the strategy routine again (often directly, sometimes by activating a work queue). The strategy routine either starts another data transfer for the current request or, when all data blocks of the request have been transferred, removes the request from the dispatch queue and starts processing the next one.

A request is composed of several bio structures, and each bio structure is composed of several segments. Basically, block device drivers use DMA in one of two ways: either the driver sets up a different DMA transfer to serve each segment of each bio of the request, or the driver sets up a single scatter-gather DMA transfer to serve all segments of all bios of the request.

Ultimately, the design of the strategy routine depends on the characteristics of the block device controller. Each physical block device has its own peculiarities (for example, a floppy driver groups the blocks of a disk track and transfers a whole track in a single I/O operation), so it makes little sense to make general assumptions about how a device driver should serve every request.

In our example, the foo_strategy() strategy routine should do the following:

  1. Get the current request from the dispatch queue by calling the elv_next_request() helper function of the I/O scheduler. If the dispatch queue is empty, the strategy routine terminates:
req = elv_next_request(q);
if (!req)
	return;
  2. Execute the blk_fs_request macro to check whether the REQ_CMD flag of the request is set, that is, whether the request contains a standard read or write operation:
if(!blk_fs_request(req))
	goto handle_special_request;
  3. If the block device controller supports scatter-gather DMA, program the disk controller to perform the data transfer for the whole request and to raise an interrupt when the transfer completes. The blk_rq_map_sg() helper function builds a scatter-gather list that can be used immediately to start the transfer.
  4. Otherwise, the device driver must transfer the data segment by segment. In this case, the strategy routine executes the rq_for_each_bio and bio_for_each_segment macros, which walk the list of bios and the list of segments inside each bio, respectively:
rq_for_each_bio(bio, rq)
	bio_for_each_segment(bvec, bio, i) {
		/* transfer the i-th segment bvec */
		local_irq_save(flags);
		addr = kmap_atomic(bvec->bv_page, KM_BIO_SRC_IRQ);
		foo_start_dma_transfer(addr + bvec->bv_offset, bvec->bv_len);
		kunmap_atomic(addr, KM_BIO_SRC_IRQ);
		local_irq_restore(flags);
	}

The kmap_atomic() and kunmap_atomic() functions are necessary if the data to be transferred lies in high memory. The foo_start_dma_transfer() function programs the hardware device so that it starts the DMA data transfer and raises an interrupt when the I/O operation completes.
5. Return.
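
Putting the steps above together, a minimal sketch of the strategy routine for the scatter-gather case might look as follows. It is only a sketch under the assumptions of this section: foo_start_sg_dma_transfer() and FOO_MAX_PHYS_SEGMENTS are hypothetical driver-specific names, and special (non read/write) requests are simply ignored here:

static void foo_strategy(struct request_queue *q)
{
	struct scatterlist sg[FOO_MAX_PHYS_SEGMENTS];
	struct request *req;
	int nseg;

	req = elv_next_request(q);          /* step 1: first request in the dispatch queue */
	if (!req)
		return;
	if (!blk_fs_request(req))           /* step 2: not a standard read/write request */
		return;                     /* (a real driver would handle it specially) */
	nseg = blk_rq_map_sg(q, req, sg);   /* step 3: build the scatter-gather list */
	foo_start_sg_dma_transfer(&foo, sg, nseg);  /* program the controller; an interrupt follows */
}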

Interrupt handler

The block device driver's interrupt handler is activated when the DMA data transfer terminates. It checks whether all data blocks of the request have been transferred. If so, the interrupt handler invokes the strategy routine to process the next request in the dispatch queue. Otherwise, the interrupt handler updates the relevant fields of the request descriptor and invokes the strategy routine to handle the data transfer still to be performed. A typical fragment of the interrupt handler of our foo device driver is as follows:

irqreturn_t foo_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct foo_dev_t *p = (struct foo_dev_t *) dev_id;
	struct request_queue *rq = p->gd->rq;
	struct request *req = elv_next_request(rq);  /* the request currently being served */
	[...]
	if (!end_that_request_first(req, uptodate, nr_sectors)) {
		blkdev_dequeue_request(req);
		end_that_request_last(req);
	}
	rq->request_fn(rq);
	[...]
	return IRQ_HANDLED;
}

The two functions end_that_request_first() and end_that_request_last() share the task of ending a request. The end_that_request_first() function receives as parameters a request descriptor, a flag indicating whether the DMA data transfer completed successfully, and the number of sectors transferred by the DMA (the end_that_request_chunk() function is similar, except that it receives the number of transferred bytes rather than the number of sectors). Essentially, it scans the bio structures of the request and the segments inside each bio, and then updates the fields of the request descriptor as follows:

  1. Modify the bio field so that it points to the first unfinished bio structure of the request.
  2. Modify the bi_idx field of the unfinished bio so that it points to its first unfinished segment.
  3. Modify the bv_offset and bv_len fields of the unfinished segment so that they specify the data still to be transferred.

The function also invokes bio_endio() on every bio whose data transfer has completed. end_that_request_first() returns 0 if all data blocks of the request have been transferred; otherwise it returns 1. If the return value is 1, the interrupt handler invokes the strategy routine again to continue processing the request. Otherwise, the interrupt handler removes the request from the request queue (mainly through blkdev_dequeue_request()), invokes the end_that_request_last() helper function, and then invokes the strategy routine again to process the next request in the dispatch queue. The job of end_that_request_last() is to update some disk usage statistics, remove the request descriptor from the dispatch queue of the rq->elevator I/O scheduler, wake up any process sleeping while waiting for the completion of the request, and release the request descriptor.

Open block device file

We conclude this chapter by describing the operations performed by the VFS when a block device file is opened. The kernel opens a block device file whenever a file system is mounted on a disk or partition, whenever a swap partition is activated, and whenever a user-mode process issues an open() system call on a block device file. In all cases, the kernel essentially performs the same operations: it looks up the block device descriptor (allocating a new one if the block device is not already in use) and sets up the file operation methods for the forthcoming data transfers. We described in the "VFS Processing of Device Files" section of Chapter 13 how the dentry_open() function customizes the methods of the file object when a device file is opened: its f_op field is set to the address of the def_blk_fops table, whose contents are shown in Table 14-10.
[Table 14-10: the default block device file operations (def_blk_fops)]
We only consider the open method, which is called by the dentry_open() function. blkdev_open() receives inode and filp as its parameters, which store the address of the index node and file object respectively; this function essentially performs the following operations:

  1. Execute bd_acquire(inode) to obtain the address of the block device descriptor bdev. This function receives the address of the index node object and performs the following main steps:
    a. Check whether the inode->i_bdev field of the index node object is not NULL; if so, the block device file has already been opened and this field stores the address of the corresponding block device descriptor. In this case, increment the reference counter of the inode->i_bdev->bd_inode index node of the bdev special file system associated with the block device, and return the address of the descriptor inode->i_bdev.
    b. Otherwise, the block device file has not been opened yet. Execute bdget(inode->i_rdev), passing the major and minor device numbers of the block device file (see the "Block Device" section earlier in this chapter), to obtain the address of the block device descriptor. If the descriptor does not already exist, bdget() allocates one; notice, however, that the descriptor may already exist, for instance because the block device has already been accessed through other block device files.
    c. Store the address of the block device descriptor in inode->i_bdev to speed up future opening operations of the same block device file.
    d. Set the inode->i_mapping field to the value of the corresponding field in the bdev index node. This field points to the address space object, which we will introduce in the "address_space object" section of Chapter 15.
    e. Insert the index node into the list of opened index nodes of the block device descriptor, which is anchored at bdev->bd_inodes.
    f. Return the address of descriptor bdev.
  2. Set the filp->f_mapping field to the value of inode->i_mapping (see step 1d above).
  3. Get the address of the gendisk descriptor related to this block device: disk = get_gendisk(bdev->bd_dev,&part); if the opened block device is a partition, the returned index value is stored in the local variable part; otherwise, part is 0. The get_gendisk() function simply calls kobj_lookup() on the kobject map domain bdev_map, passing the device's major and minor numbers (see the "Registering and Initializing Device Drivers" section earlier in this chapter).
  4. If the value of bdev->bd_openers is not equal to 0, it indicates that the block device has been opened. Check the bdev->bd_contains field:
    a. If the value is equal to bdev, the block device is a whole disk: invoke the bdev->bd_disk->fops->open block device method, if defined, and then check the bdev->bd_invalidated field; if necessary, invoke the rescan_partitions() function (see steps 6a and 6c below).
    b. If not equal to bdev, then the block device is a partition: increase the value of the bdev->bd_contains->bd_part_count counter. Then skip to step 8.
  5. Here the block device is accessed for the first time. Initialize bdev->bd_disk to the address disk of the gendisk descriptor.
  6. If the block device is a whole disk (part equals 0), perform the following substeps:
    a. If the disk->fops->open block device method is defined, execute it: the method is a custom function of the block device driver that performs any device-specific last-minute initialization.
    b. Get the sector size (number of bytes) from the hardsect_size field of the disk->queue request queue, and use this value to set the bdev->bd_block_size and bdev->bd_inode->i_blkbits fields appropriately. Also set the bdev->bd_inode->i_size field with the disk size calculated from disk->capacity.
    c. If the bdev->bd_invalidated flag is set, call rescan_partitions() to scan the partition table and update the partition descriptor. This flag is set by the check_disk_change block device method and only applies to removable devices.
  7. Otherwise, if the block device is a partition, perform the following substeps:
    a. Call bdget() again—this time passing the disk->first_minor minor device number—to obtain the block descriptor address of the entire disk.
    b. Repeat steps 3 to 6 for the block device descriptor of the entire disk, and initialize the descriptor if necessary.
    c. Set bdev->bd_contains to the address of the entire disk descriptor.
    d. Increase the value of whole->bd_part_count to account for new open operations on the disk partition.
    e. Set bdev->bd_part with the value in disk->part[part-1]; it is the address of the partition descriptor hd_struct. Similarly, execute kobject_get(&bdev->bd_part->kobj) to increase the value of the partition reference counter.
    f. As in step 6b, set the fields in the index node that represent the partition size and sector size.
  8. Increase the value of bdev->bd_openers counter.
  9. If the block device file is opened exclusively (the O_EXCL flag in filp->f_flags is set), then call bd_claim(bdev,filp) to set the holder of the block device (see the "Block Devices" section earlier in this chapter). In case of error - the block device already has an owner - the block device descriptor is released and an error code -EBUSY is returned.
  10. Return 0 (success) to terminate.

Once the blkdev_open() function terminates, the open() system call proceeds as usual. Every future system call issued on the open file will trigger one of the default block device file operations. As we will see in Chapter 16, each data transfer to or from the block device is ultimately performed by submitting requests to the generic block layer.
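
As an illustration of when blkdev_open() comes into play, a hedged user-space example follows: opening a block device file with open() drives the kernel through exactly the steps listed above, and the subsequent read() goes through the default block device file operations (the /dev/foo0 device node name is an assumption for illustration):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[512];
	int fd = open("/dev/foo0", O_RDONLY);            /* triggers blkdev_open() in the kernel */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (read(fd, buf, sizeof(buf)) != sizeof(buf))   /* served through def_blk_fops */
		perror("read");
	close(fd);
	return 0;
}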


Origin blog.csdn.net/x13262608581/article/details/132353858