A Detailed Explanation of Linux I/O Principles and Zero-Copy Technology

This long article comprehensively analyzes the underlying principles of Linux I/O, covering virtual memory, I/O buffers, user mode and kernel mode, and the different I/O modes; it then examines the drawbacks of the traditional Linux I/O mode, introduces Linux zero-copy technology and its principles, and compares zero-copy with the traditional I/O mode, helping readers understand how the Linux kernel has optimized and improved its I/O subsystem. It aims to be an in-depth and detailed treatment of Linux I/O and zero-copy technology.

Preface

Today's network applications have long since shifted from CPU-intensive to I/O-intensive. Most network servers are based on the C-S (client-server) model, in which clients communicate heavily with servers, and this determines the performance bottleneck of modern network applications: I/O.

The standard I/O interfaces of the traditional Linux operating system are based on data copying: an I/O operation moves data between a buffer in the kernel address space and a buffer defined in the user process's address space. The biggest advantage of this buffering is that it reduces disk I/O: if the requested data is already in the operating system's cache, no physical disk I/O needs to be performed at all. However, the data copying that traditional Linux I/O does during transmission depends heavily on the CPU, which means the CPU itself must perform the copies; this incurs a large system overhead and limits the operating system's ability to move data efficiently.

I/O is the key factor behind a network server's performance bottleneck, and the traditional Linux I/O mechanism causes a large number of data copy operations and a corresponding loss of performance. We therefore urgently need a technique to eliminate this massive data copying, and the answer is zero-copy.

Computer memory

Since we are going to analyze Linux I/O, we first need to understand the various types of memory in a computer.

Memory is one of the core components of a computer. In a completely ideal world, memory would have the following three characteristics at the same time:

  1. Fast enough: memory access should be faster than the CPU can execute an instruction, so that CPU efficiency is not limited by memory
  2. Large enough: the capacity should be able to hold all the data the computer needs
  3. Cheap enough: inexpensive enough for all types of computers

But the reality is often cruel. Our current computer technology cannot meet the above three conditions at the same time, so the memory design of modern computers adopts a hierarchical structure:

From top to bottom, the types of memory in a modern computer are: registers, caches, main memory, and disk. Going down the hierarchy, speed decreases while capacity increases. Registers are the fastest because they are built from the same material as the CPU, so they run at CPU speed and the CPU experiences no delay when accessing them; however, they are expensive and therefore extremely small. A 32-bit CPU typically has 32 registers of 32 bits each, and a 64-bit CPU has 64 registers of 64 bits each; either way, total register capacity is well under 1 KB, and registers must be managed by software.

The second layer is the cache, that is, the CPU caches L1, L2, and L3 we usually talk about. Generally L1 is private to each CPU core and L3 is shared by all cores, while L2 is either private or shared depending on the architecture: for example, Intel's multi-core chips use a shared L2, while AMD's multi-core chips use a private L2.

The third layer is main memory, usually called random access memory (Random Access Memory, RAM). It is the internal memory that exchanges data directly with the CPU; it can be read and written at any time (except during refresh) and is very fast, so it serves as the temporary data store for the operating system and running programs.

Finally there is the disk. Compared with main memory, the cost per bit of disk storage is two orders of magnitude lower, so its capacity is far larger, typically measured in GB or TB; the trade-off is that its access speed is roughly three orders of magnitude slower than main memory. Mechanical hard disks are slow mainly because the actuator arm must keep moving across the platters, waiting for the target sector to rotate under the head before a read or write can be performed, so efficiency is very low.

Main memory is what matters most when the operating system performs I/O: most of the work happens in memory buffers belonging to user processes and the kernel. Therefore, we need to learn some of the principles of main memory first.

Physical memory

The physical memory we keep referring to is the third layer above, RAM main memory. It exists in the computer as memory sticks plugged into the motherboard's memory slots and is used to load programs and data so the CPU can run and use them directly.

Virtual Memory

In the computer field there is a saying as sacred as Moses' Ten Commandments: "Any problem in computer science can be solved by adding another layer of indirection." From memory management and network models to concurrent scheduling and even hardware architecture, this philosophy can be seen shining everywhere, and virtual memory is one of its most perfect embodiments.

Virtual memory is a very important storage abstraction in modern computers. It mainly addresses applications' ever-growing demand for memory: physical memory capacity has grown very quickly, but it still cannot keep up with applications' growing demand for main memory, so a method is needed to bridge the capacity gap between the two.

Computer management of memory access by multiple programs has evolved through static relocation --> dynamic relocation --> swapping --> virtual memory. The most primitive form of multi-programming accessed absolute memory addresses directly, which is almost completely unusable: if every program directly accesses physical memory addresses, then, for example, when two programs concurrently execute the following instructions:

mov cx, 2        ; cx = 2
mov bx, 1000H    ; bx = segment 1000H
mov ds, bx       ; ds = 1000H, so [0] now refers to address 1000:0
mov [0], cx      ; store the value 2 at address 1000:0

...

mov ax, [0]      ; load the value at address 1000:0 into ax
add ax, ax       ; double it; the expected result in ax is 4

This piece of assembly stores the value 2 at address 1000:0, and the later logic reads that value back and doubles it, so the value finally left in the ax register should be 4. But if a second program has stored 3 at the same address from its own cx register, then under concurrent execution the first program may end up with 6 in ax, which is completely wrong; getting dirty data and a wrong result is merely the best case. If some program writes dangerous instructions to a particular address and another program fetches and executes them, the whole system may crash. So, to keep processes from interfering with one another, every user process would have to know in real time which memory addresses every other process is using, which is a nightmare for programmers.

Therefore, operating on absolute memory addresses is completely infeasible, and we can only use relative memory addresses. We know every process has its own process address space starting at 0 and can access memory through relative addresses, but this still suffers from the same problem. For example, suppose two 16 KB programs A and B are both loaded into memory, occupying the address ranges 0 ~ 16384 and 16384 ~ 32768 respectively. A's first instruction is jmp 1024, and at address 1024 there is a mov instruction followed by an add that operates on the result of that mov. Meanwhile, B's first instruction is jmp 1028, and at B's relative address 1028 there should likewise be a mov operating on B's own memory; but because the two programs share the segment register, even though each uses its own relative addresses they still end up operating on absolute memory addresses, so B jumps into A's code and executes the add instruction instead, at which point it crashes with an illegal memory operation.

There is a technique called static relocation that solves this problem, and it works in a very simple, brute-force way: when program B is loaded at address 16384, all of B's relative memory addresses are incremented by 16384, so that when B executes jmp 1028 it actually executes jmp 1028+16384 and jumps to the correct memory address and the correct instruction. But this technique is not general, and it also hurts the performance of loading programs into memory.

Later, a memory abstraction was developed: the address space. Just as the process is an abstraction of the CPU, the address space is an abstraction of memory, and every process is given its own private address space. A private address space, however, raises a new problem: how can the same relative address in different processes point to different physical addresses? The first answer was dynamic relocation, a relatively simple way of mapping address spaces onto physical memory. The basic idea is to equip each CPU with two special hardware registers, a base register and a limit register, which dynamically hold the starting physical address and the extent of the currently running program. For the programs A and B above, when A runs the base and limit registers hold 0 and 16384; when B runs they hold 16384 and 32768. Then, on every memory access, the CPU automatically adds the value of the base register to the address before placing it on the memory bus, producing the real physical memory address, and at the same time checks the address against the limit register; if it overflows, an error is raised and the program is aborted. Dynamic relocation solves the slow program loading caused by static relocation, but it has a new problem of its own: every memory access now requires an addition and a comparison. The comparison can be fast, but the addition, because of carry-propagation delay, is slow unless special circuitry is used.
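The following is a minimal C sketch of the check-and-add that the base and limit registers perform in hardware on every memory access; the struct and function names are purely illustrative, and the limit here is treated as the upper bound of the program's physical range, as in the example above:

#include <stdint.h>
#include <stdio.h>

/* Illustrative model of dynamic relocation with base/limit registers. */
typedef struct {
    uint32_t base;   /* starting physical address of the running program    */
    uint32_t limit;  /* upper bound of the program's physical address range */
} relocation_regs;

/* Translate a relative address to a physical address, failing on overflow,
 * just as the hardware would raise a fault and abort the program.          */
static int translate(const relocation_regs *regs, uint32_t rel_addr,
                     uint32_t *phys_addr)
{
    uint32_t phys = regs->base + rel_addr;   /* the addition done on every access */
    if (phys >= regs->limit)                 /* the comparison against the limit  */
        return -1;                           /* overflow: abort the program       */
    *phys_addr = phys;
    return 0;
}

int main(void)
{
    relocation_regs b = { .base = 16384, .limit = 32768 }; /* program B running */
    uint32_t phys;
    if (translate(&b, 1028, &phys) == 0)
        printf("relative 1028 -> physical %u\n", phys);    /* prints 17412 */
    return 0;
}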

Then came swapping. Simply put, swapping dynamically moves programs back and forth between memory and disk: when a process needs to run, its code and data segments are brought into memory; later the program is packed up again and written back to disk, and so on. Why all this trouble? Because the two relocation techniques above assume memory is large enough to hold the address spaces of all processes that need to run, so that they can run concurrently; in reality memory is always limited, so another class of methods is needed to handle memory overload, the first of which is simple swapping:

First process A is swapped into memory; then processes B and C are started and swapped in as well; next A is swapped out of memory to disk; then a new process D is brought in, using the space freed by A; finally A is swapped back in. Since the memory layout has changed, A's memory addresses must be relocated when it is swapped back in, either by software or at run time by hardware (the base and limit registers); in most cases this is done by hardware.

The other technique for handling memory overload is virtual memory, which is more complex than swapping but also more efficient; it is the most widely used memory abstraction today:

The core principle of virtual memory is: give each program a "contiguous" virtual address space, divide that address space into multiple pages with contiguous address ranges, and map these pages onto physical memory dynamically while the program runs. When the program references an address range that is present in physical memory, the hardware performs the necessary mapping immediately; when it references an address range that is not in physical memory, the operating system loads the missing part into physical memory and re-executes the failed instruction:

The virtual address space is divided into fixed-size units called pages, and the corresponding units in physical memory are called page frames. The two are generally the same size, 4 KB in the figure above, though real systems range from 512 bytes up to 1 GB; this is virtual memory's paging technique. Because the address space is virtual, each process is nominally given 4 GB of it (on a 32-bit architecture), but it is obviously impossible to hand 4 GB of physical memory to every running process, so virtual memory also relies on swapping: only the memory currently in use is allocated and mapped while the process runs, temporarily unused data is written back to disk as a copy and read back into memory when needed, with data exchanged dynamically between disk and memory.

In fact, from one point of view, virtual memory is like a new combination of the base register and the limit register: it allows the entire address space of a process to be mapped onto physical memory in smaller units, without relocating the program's code and data addresses.

All memory addresses generated while a process runs are virtual addresses. If the computer had no virtual memory abstraction, the CPU would put these addresses directly onto the memory address bus and access the physical address of the same value; with virtual memory, the CPU sends these virtual addresses over the address bus to the Memory Management Unit (MMU), which maps the virtual addresses to physical addresses and then accesses physical memory over the memory bus:

A virtual address (for example the 16-bit address 8196 = 0010 0000 0000 0100) is divided into two parts: the virtual page number (the high bits) and the offset (the low bits). The virtual address is converted into a physical address through the page table (page table), which is made up of page table entries storing the page frame number, the modified bit, the referenced bit, the protection bits, the present/absent bit, and so on. Mathematically, the page table is a function whose input is the virtual page number and whose output is the physical page frame number. Once the physical page frame number is obtained, it is copied into the high three bits of the output register, and the 12-bit offset is copied directly into the low 12 bits, forming a 15-bit physical address that can then be placed on the memory bus:
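As a concrete illustration of this arithmetic, here is a small C sketch of the page-number/offset split and page-table lookup for the 16-bit example above with 4 KB pages; the page_table array and its contents are hypothetical, chosen only to reproduce the numbers in the example:

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                        /* 4 KB pages: 12-bit offset */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)  /* low 12 bits = offset      */

/* Hypothetical page table: index = virtual page number, value = frame number. */
static const uint16_t page_table[16] = { 2, 1, 6, 0, 4, 3 /* ... */ };

int main(void)
{
    uint16_t vaddr  = 8196;                      /* 0010 0000 0000 0100        */
    uint16_t vpage  = vaddr >> PAGE_SHIFT;       /* high bits: page 2          */
    uint16_t offset = vaddr & PAGE_MASK;         /* low 12 bits: offset 4      */
    uint16_t frame  = page_table[vpage];         /* page table lookup: frame 6 */
    uint16_t paddr  = (uint16_t)((frame << PAGE_SHIFT) | offset);

    printf("virtual %u -> page %u, offset %u -> physical %u\n",
           vaddr, vpage, offset, paddr);         /* physical 24580             */
    return 0;
}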

When the MMU performs address translation, if the present/absent bit of the page table entry is 0, the page is not mapped to a real physical page frame and a page fault is triggered: the CPU traps into the operating system kernel, and the operating system selects a page via a page replacement algorithm and swaps it out to make room for the new page to be brought in. If the modified bit of the page table entry of the page being evicted is set, i.e. the page has been updated, it is a dirty page and must be written back to disk to update the on-disk copy; if the page is "clean", i.e. unmodified, the newly loaded page can simply overwrite the old one being evicted.

Finally, one more concept needs to be understood: the Translation Lookaside Buffer (TLB), also called the fast table, which is used to speed up virtual address mapping. Because of virtual memory's paging mechanism, the page table is normally kept in a fixed area of memory, so a process going through the MMU needs one more memory access than it would by accessing memory directly, which cuts performance by at least half. Hence an acceleration mechanism is needed: the TLB. The TLB can be understood simply as a cache of the page table, holding the most frequently accessed page table entries, and because it is usually implemented in hardware it is extremely fast. When the MMU receives a virtual address it first queries the hardware TLB for the corresponding page table entry; on a hit, if the access is legal, the corresponding physical page frame number is fetched straight from the TLB and returned; on a miss, the lookup falls through to the in-memory page table, and the latest entry read from the memory page table replaces one of the existing TLB entries so that the next access can hit the cache.

So far we have introduced a number of memory abstraction techniques, including virtual memory. Other aspects of virtual memory, such as multi-level page tables and inverted page tables for large memories, and the page replacement algorithms used to handle page faults, may be covered in a separate article if the opportunity arises, or readers can look up the relevant material themselves; we will not go deeper here.

1. Physical memory and virtual memory

Since the processes of the operating system share CPU and memory resources, a complete memory management mechanism is required to prevent memory leaks between processes. In order to manage memory more effectively and reduce errors, modern operating systems provide an abstract concept of main memory, namely virtual memory (Virtual Memory). Virtual memory provides each process with a consistent, private address space, which gives each process the illusion that it owns its own main memory (each process has a continuous and complete memory space).

1.1. Physical memory

Physical memory (Physical Memory) is defined in contrast to virtual memory (Virtual Memory): physical memory is the memory space provided by the physical RAM modules, whereas virtual memory carves out an area of the hard disk to use as if it were memory. The main function of memory is to provide temporary storage for the operating system and programs while the computer runs; physical memory is, as the name suggests, simply the capacity of the real memory sticks plugged into the motherboard's memory slots.

1.2. Virtual memory

Virtual memory is a memory management technique of computer systems. It makes an application believe it has contiguous memory available (a contiguous, complete address space), when in fact that memory is usually split across multiple physical memory fragments, with some parts temporarily stored on external disk storage and exchanged with physical memory when needed. Most current operating systems use virtual memory, for example the virtual memory of Windows systems and the swap space of Linux systems.

A virtual memory address is closely tied to a user process: generally speaking, the same virtual address in different processes points to different physical addresses, so it is meaningless to talk about virtual memory without a process. The size of the virtual address space each process can use depends on the CPU word size: on a 32-bit system the virtual address space is 2^32 = 4 GB; on a 64-bit system it is 2^64 bytes = 2^34 GB; and the actual physical memory may be far smaller than the virtual memory. Each user process maintains its own page table (Page Table), through which virtual memory is mapped onto physical memory. Below is a schematic diagram of the address mapping between the virtual memory spaces of two processes A and B and the corresponding physical memory:

When a process executes a program, it must first fetch the process's instructions from memory and then execute them, and a virtual address is used when fetching an instruction. This virtual address is determined when the program is linked (the kernel adjusts the address ranges of dynamic libraries when it loads and initializes the process). To obtain the actual data, the CPU must convert the virtual address into a physical address, and for this conversion it uses the process's page table (Page Table), whose contents are maintained by the operating system.

The page table (Page Table) can be thought of simply as a linked list of memory mappings (Memory Mapping) (the real structure is of course far more complex), where each memory mapping maps a block of virtual addresses to a specific address space (physical memory or disk storage space). Each process has its own page table, which has nothing to do with the page tables of other processes.

From the introduction above, we can roughly summarize the process by which a user process requests and accesses physical memory (or disk storage space) as follows:

  1. The user process sends a memory request to the operating system
  2. The system checks whether the process's virtual address space is exhausted; if there is room left, it assigns a virtual address to the process
  3. The system creates a memory mapping (Memory Mapping) for this virtual address and puts it into the process's page table (Page Table)
  4. The system returns the virtual address to the user process, and the user process begins to access it
  5. The CPU looks up the memory mapping (Memory Mapping) for this virtual address in the process's page table (Page Table), but the mapping is not yet associated with physical memory, so a page fault occurs
  6. When the operating system receives the page fault, it allocates real physical memory and associates it with the corresponding memory mapping in the page table; after the fault handling completes, the CPU can access the memory
  7. Of course, a page fault does not occur every time; it happens only when the system decides to delay memory allocation, otherwise the system allocates real physical memory in step 3 above and associates it with the memory mapping right away (a small demonstration of this lazy allocation follows this list)
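To make the lazy allocation in steps 5-7 concrete, here is a small sketch that reserves anonymous memory with mmap() and uses getrusage() to observe that minor page faults occur only when the pages are first touched, not when the virtual range is reserved; the exact fault counts will vary from system to system:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;            /* minor (soft) page fault count */
}

int main(void)
{
    size_t len = 64 * 1024 * 1024;  /* reserve 64 MB of virtual memory */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    long before = minor_faults();
    printf("faults after mmap:  %ld\n", before);    /* no physical pages yet    */

    memset(p, 1, len);                               /* first touch: page faults */
    printf("faults after touch: %ld\n", minor_faults() - before);

    munmap(p, len);
    return 0;
}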

The introduction of virtual memory between user processes and physical memory (disk storage) has the following main advantages:

  • Address space: provides a larger, contiguous address space, which makes programs easier to write and link
  • Process isolation: the virtual addresses of different processes are unrelated, so the actions of one process do not affect other processes
  • Data protection: each block of virtual memory has read/write attributes, so a program's code segment can be protected from modification and data blocks can be made non-executable, which increases system security
  • Memory mapping: with virtual memory, files on disk (executables or dynamic libraries) can be mapped directly into the virtual address space. This lets physical memory allocation be deferred: the file is actually loaded from disk into memory only when it needs to be read, and when memory is tight this memory can be reclaimed, improving physical memory utilization; all of this is transparent to the application
  • Shared memory: for example, a dynamic library needs only one copy in memory, which is then mapped into the virtual address spaces of different processes, so each process feels it owns the file exclusively; memory can also be shared between processes by mapping the same piece of physical memory into different processes' virtual address spaces
  • Physical memory management: the physical address space is managed entirely by the operating system; processes cannot allocate or reclaim it directly, which lets the system make better use of memory and balance memory demands across processes

2. Kernel space and user space

The core of the operating system is the kernel, which is independent of ordinary applications and can access the protected memory space as well as the underlying hardware devices. In order to prevent user processes from directly operating the kernel and ensure kernel security, the operating system divides the virtual memory into two parts, one part is the kernel space (Kernel-space), and the other part is the user space (User-space). In the Linux system, the kernel module runs in the kernel space, and the corresponding process is in the kernel state; while the user program runs in the user space, and the corresponding process is in the user state.

The virtual memory of the kernel and of a user process is split in a 1:3 ratio: the addressable (virtual) space of a Linux x86_32 system is 4 GB (2 to the 32nd power), of which the highest 1 GB (virtual addresses 0xC0000000 to 0xFFFFFFFF) is used by the kernel and called kernel space, while the lower 3 GB (virtual addresses 0x00000000 to 0xBFFFFFFF) is used by the individual user processes and called user space. The figure below shows the memory layout of a process's user space and the kernel space:

2.1. Kernel space

Kernel space always resides in memory, which is reserved for the operating system's kernel. The application program is not allowed to directly read and write in this area or directly call the functions defined by the kernel code. The area on the left side of the figure above is the virtual memory corresponding to the kernel process, which can be divided into process-private and process-shared areas according to access rights.

  • Process-private virtual memory: Each process has a separate kernel stack, page table, task structure, and mem_map structure.
  • Process-shared virtual memory: belongs to the memory area shared by all processes, including physical memory, kernel data, and kernel code areas.

2.2. User space

Every ordinary user process has its own separate user space. A process in user mode cannot access data in kernel space, nor can it call kernel functions directly; therefore, when making a system call, the process must first switch into kernel mode. User space includes the following memory regions:

  • Runtime stack: managed and released automatically by the compiler, it stores function parameters, local variables, return values, and so on. Whenever a function is called, its return information and some call context are pushed onto the top of the stack, and the frame is popped and released when the call returns. The stack grows from high addresses toward low addresses; it is a contiguous region whose maximum size is predefined by the system, and when the requested stack space exceeds this limit an overflow is reported, so the space obtainable from the stack is relatively small.
  • Runtime heap: holds the memory segments dynamically allocated while the process runs, located between the BSS segment and the stack. The developer requests allocation (malloc) and release (free). The heap grows from low addresses toward high addresses and uses a linked storage structure; frequent malloc/free leaves the memory space discontinuous and produces a large amount of fragmentation. When heap space is requested, the library function searches for a sufficiently large free block according to some algorithm, so the heap is far less efficient than the stack.
  • Code segment: stores machine instructions that can be executed by the CPU, and this part of the memory can only be read but not written. Usually the code area is shared, that is, other executable programs can call it. If several processes in the machine run the same program, they can use the same code segment.
  • Uninitialized data segment: store uninitialized global variables, BSS data is initialized to 0 or NULL before the program starts executing.
  • Initialized data segment: stores initialized global variables, including static global variables, static local variables, and constants.
  • Memory-mapped region: memory in the virtual address space, such as dynamic libraries and shared memory, that is mapped onto physical memory; generally this is the virtual memory allocated by the mmap function (see the sample program after this list).
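The short program below is a rough illustration of this layout: it prints the addresses of objects living in several of the regions listed above; the exact addresses differ on every run because of address space layout randomization:

#include <stdio.h>
#include <stdlib.h>

int bss_var;                  /* uninitialized data segment (BSS) */
int data_var = 42;            /* initialized data segment         */

void code_fn(void) {}         /* lives in the code (text) segment */

int main(void)
{
    int  stack_var = 0;                    /* runtime stack */
    int *heap_var  = malloc(sizeof(int));  /* runtime heap  */

    printf("code  : %p\n", (void *)code_fn);
    printf("data  : %p\n", (void *)&data_var);
    printf("bss   : %p\n", (void *)&bss_var);
    printf("heap  : %p\n", (void *)heap_var);
    printf("stack : %p\n", (void *)&stack_var);

    free(heap_var);
    return 0;
}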

3. The internal hierarchy of Linux

Kernel mode can execute arbitrary instructions and use all of the system's resources, whereas user mode can only perform simple operations and cannot use system resources directly: user mode must issue requests to the kernel through the system call interface (System Call). For example, when a user process starts a bash shell, it makes a getpid() system call to the kernel's pid service to obtain the ID of the current user process; and when the user process inspects the host configuration with the cat command, it makes system calls to the kernel's file subsystem.

  • Kernel space has access to all CPU instructions and all memory spaces, I/O spaces, and hardware devices.
  • User space can only access limited resources. If special permissions are required, corresponding resources can be obtained through system calls.
  • User space is allowed to take page faults (and be paged out), while kernel space is not.
  • Kernel space and user space are both defined in terms of linear (virtual) address spaces.
  • On x86 CPUs, user space is the 0 - 3G address range and kernel space is the 3G - 4G range. On x86_64 CPUs, the user space address range is 0x0000000000000000 - 0x00007fffffffffff, and the kernel address space runs from 0xffff880000000000 up to the top of the address space.
  • All kernel processes (threads) share an address space, while user processes have their own address spaces.

With the division into user space and kernel space, the internal hierarchy of Linux can be split into three layers; from bottom to top they are hardware, kernel space, and user space, as shown in the figure below:

User mode and kernel mode

Generally speaking, when we write programs that perform Linux I/O, nine times out of ten we are moving data between user space and kernel space, so we first need to understand the concepts of Linux user mode and kernel mode.

The first is user mode and kernel mode:

Viewed at a high level, the architecture of the Linux operating system is divided into user mode and kernel mode (or user space and the kernel). The kernel is essentially software that controls the computer's hardware resources and provides the environment in which upper-layer applications (processes) run. User mode is the running space of those upper-layer applications (processes), and the execution of an application must rely on the resources provided by the kernel, including but not limited to CPU resources, storage resources, and I/O resources.

Modern operating systems all use virtual memory, so for a 32-bit operating system the addressable (virtual) space is 2^32 B = 4 GB. The core of the operating system is the kernel, which is independent of ordinary applications, can access the protected memory space, and has full permission to access the underlying hardware devices. To prevent user processes from operating on the kernel directly and to keep the kernel safe, the operating system divides the virtual space into two parts, kernel space and user space. On Linux, the highest 1 GB (virtual addresses 0xC0000000 to 0xFFFFFFFF) is reserved for the kernel and called kernel space, while the lower 3 GB (virtual addresses 0x00000000 to 0xBFFFFFFF) is used by the individual processes and called user space.

Because operating system resources are limited, too many operations that access those resources will inevitably consume too much of them, and if such operations are not differentiated, resource access conflicts are likely. Therefore, to reduce conflicts over limited resources, one of the design philosophies of Unix/Linux is to assign different execution levels to different operations: the concept of privilege. Simply put, a program may only do what its privilege level allows, and particularly critical, system-related operations must be performed by the most privileged code. Intel's x86 architecture provides four privilege levels, 0 to 3, with smaller numbers meaning higher privilege. Linux mainly uses levels 0 and 3, corresponding to kernel mode and user mode respectively. A process running in user mode is heavily restricted in what it can do and which resources it can access, while a process running in kernel mode can perform any operation with no restrictions on resource use. Many programs start out in user mode, but during execution some operations need kernel privileges, which involves a switch from user mode to kernel mode. For example, the C library memory allocation function malloc() ultimately uses the sbrk() system call to allocate memory; when malloc calls sbrk(), a switch from user mode to kernel mode occurs. Similarly, printf() invokes the write() system call to output a string, and so on.
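As a small illustration of such a switch, the sketch below produces output once through printf(), which buffers in user space before eventually calling write(), and once through write() directly, which traps straight into the kernel; both paths end in the same user-mode-to-kernel-mode transition:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* printf() buffers in user space and eventually calls write()... */
    printf("via printf -> write()\n");
    fflush(stdout);

    /* ...while write() itself traps directly into the kernel (user mode ->
     * kernel mode), copies the data out of the user buffer, and returns. */
    const char msg[] = "via write() system call\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;
}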

When a user process runs, it spends most of its time in user space; when it needs the operating system's help to complete an operation that user mode lacks the privilege or ability to perform, it switches into kernel mode. So how does a user process switch into kernel mode to use kernel resources? The answers are: 1) system calls (traps), 2) exceptions, and 3) interrupts.

  • System call: an active operation initiated by the user process. A user-mode process issues a system call to request a switch into kernel mode; after trapping into the kernel, the operating system performs the work using kernel resources and returns to the process when it is done.
  • Exception: a passive event whose timing the user process cannot predict. When an exception occurs while a user process is running (for example a faulty instruction), execution switches from the current process to the kernel code that handles the exception, i.e. into kernel mode. Exceptions include the various errors produced by program execution, such as division by zero, buffer overflow, and page faults.
  • Interrupt: when a peripheral device finishes an operation requested by the user, it sends the corresponding interrupt signal to the CPU; the CPU suspends the next instruction it was about to execute and transfers to the handler associated with that interrupt. If the code that was executing was a user-mode program, this transfer is naturally a switch from user mode to kernel mode. Interrupts include I/O interrupts, external signal interrupts, clock interrupts from various timers, and so on. Interrupts are similar to exceptions in that both locate their handlers through the interrupt vector table; the difference is that an interrupt comes from outside the processor and is not caused by any particular instruction, whereas an exception is the result of executing the current instruction.

From the analysis above we can conclude that the internal hierarchy of Linux consists of three parts:

  • user space;
  • kernel space;
  • hardware.

Linux I/O

I/O buffers

In Linux, when a program calls the various file-operation functions, user data (User Data) reaches the disk (Disk) along the path shown in the figure above.

The figure shows the hierarchy of file-operation functions in Linux and where the in-memory cache layers sit; the solid black line in the middle is the boundary between user mode and kernel mode.

read(2)/write(2) are the most basic I/O system calls in Linux; any program that performs I/O will come into contact with them. Between these two system calls and the actual disk reads and writes sits a caching layer called the kernel buffer cache. In Linux the I/O cache can actually be divided into two parts, the Page Cache and the Buffer Cache, which are two sides of the same thing and together form the Linux kernel buffer cache (Kernel Buffer Cache):

  • Reading from disk: the kernel first checks whether the data is already cached in the Page Cache; if so, it is read straight from the memory buffer and returned; if not, the read falls through to the disk, and the data is then cached in the Page Cache so that subsequent reads can hit the cache;
  • Writing to disk: the kernel writes the data directly into the Page Cache, marks the corresponding page dirty, adds it to the dirty list, and returns immediately. The kernel periodically flushes the pages on the dirty list to disk, ensuring that the page cache and the disk eventually become consistent.

The Page Cache regularly evicts old pages and loads new ones using page replacement algorithms such as LRU. In short, the so-called I/O buffer cache is a layer of buffering between the kernel and peripherals such as disks and network cards, introduced to improve read and write performance.
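To make this write-back behavior concrete, the following sketch writes data that initially lands only in the Page Cache as dirty pages, then calls fsync() to force those dirty pages to disk instead of waiting for the kernel's periodic flush; the path /tmp/pagecache-demo is only an example:

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/pagecache-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0)
        return 1;

    const char data[] = "hello page cache\n";

    /* write() copies the data into the Page Cache and marks the page dirty;
     * it normally returns before anything reaches the physical disk.       */
    if (write(fd, data, sizeof(data) - 1) < 0)
        return 1;

    /* fsync() forces the dirty pages (and metadata) of this file to be
     * written back to disk now, rather than by the periodic flusher.       */
    if (fsync(fd) < 0)
        return 1;

    close(fd);
    return 0;
}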

Before Linux supported virtual memory there was no concept of pages, so the Buffer Cache was based on the smallest unit in which the operating system reads and writes the disk, the block, and all disk block operations were accelerated through the Buffer Cache. After Linux introduced the virtual memory mechanism to manage memory, the page became the smallest unit of virtual memory management, and the Page Cache was introduced to cache the contents of Linux files. It serves mainly as a cache of file data on the file system to improve read and write performance, covering ordinary file operations via read()/write() as well as block devices accessed through an mmap() mapping; in other words, the Page Cache is responsible for most of the caching of block device files. The Buffer Cache, which caches the blocks of a block device when the system reads and writes it, is in practice responsible for all I/O access to the disk:

Because the Buffer Cache is a finer-grained cache of device blocks, whereas the Page Cache is a page-granularity cache built on virtual memory, and the Page Cache was still built on top of the Buffer Cache, caching the contents of a file meant keeping two copies of the same data in memory: the same file was effectively cached twice, which is redundant and inefficient. Another problem was that after a write() call the valid data sat in the Buffer Cache rather than the Page Cache, which could make file data accessed through mmap inconsistent. To avoid this, all disk-based file systems had to call the update_vm_cache() function after a write to bring the Page Cache up to date. Because of these design drawbacks, from Linux 2.4 onward the kernel unified the two: the Buffer Cache no longer exists as an independent cache but is merged into the Page Cache:

After the unification, operations can be handled uniformly through the Page Cache: file I/O caching is handed to the Page Cache, while the Buffer Cache still does the actual block-level work when data is flushed to the underlying raw device.

I/O mode

In Linux or other Unix-like operating systems, there are generally three I/O modes:

  1. Program Control I/O
  2. Interrupt-driven I/O
  3. DMA I/O

Let me explain these three I/O modes in detail respectively.

Program Control I/O

This is the simplest I/O mode, also called busy waiting or polling: the user initiates a system call and traps into kernel mode; the kernel translates the system call into a procedure call to the corresponding device driver; the driver then starts the I/O and loops continuously checking whether the device has finished, usually indicated by a return code; when the I/O is done, the driver copies the data to the specified place, returns, and execution switches back to user mode.

For example, to initiate a system call  read():
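Purely as an illustration of what busy waiting looks like inside a driver, here is a schematic C sketch; the device registers and helper functions (read_status_register, read_data_register, DEVICE_READY) are made-up stubs, not real kernel APIs:

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical device registers, stubbed so the sketch compiles. */
#define DEVICE_READY 0x1
static unsigned read_status_register(void) { return DEVICE_READY; }
static unsigned char read_data_register(void) { return 'x'; }

/* Schematic sketch of programmed (polling) I/O inside a driver:
 * the CPU busy-waits on the device status until each unit of data is ready. */
static ssize_t driver_read_polling(char *kernel_buf, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        while ((read_status_register() & DEVICE_READY) == 0)
            ;                                        /* busy wait: burn CPU cycles */
        kernel_buf[i] = (char)read_data_register();  /* copy one unit of data      */
    }
    return (ssize_t)count;   /* the kernel then copies this buffer to user space */
}

int main(void)
{
    char buf[4];
    return driver_read_polling(buf, sizeof(buf)) == sizeof(buf) ? 0 : 1;
}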

Interrupt-driven I/O

The second I/O mode is implemented using interrupts:

The process is as follows:

  1. The user process initiates a read() system call to read a disk file and traps into kernel mode; through the device driver, the CPU writes a notification into the device registers to tell the device controller (here, the disk controller) to read the data;
  2. The disk controller starts the disk read, copying data from the disk into the disk controller's buffer;
  3. After the copy completes, the disk controller sends an interrupt signal to the interrupt controller over the bus. If the interrupt controller is still handling another interrupt, or a higher-priority interrupt arrives at the same time, this signal is ignored for now, and the disk controller keeps asserting it until the interrupt controller accepts it;
  4. After the interrupt controller receives the disk controller's interrupt signal, it puts the disk device's number on the address bus, indicating that the device needing attention for this interrupt is the disk;
  5. The interrupt controller asserts a disk interrupt signal to the CPU;
  6. When the CPU receives the interrupt signal, it stops its current work, pushes the current PC/PSW and other registers onto the stack to save the context, then takes the device number from the address bus, uses it to find the entry address of the interrupt service routine in the interrupt vector, loads that address into the PC register, and runs the disk interrupt service routine, which copies the data from the disk controller's buffer into the kernel buffer in main memory;
  7. Finally, the CPU copies the data from the kernel buffer to the user buffer, the read operation completes, read() returns, and execution switches back to user mode.

DMA I/O

The performance of a concurrent system depends fundamentally on how efficiently CPU resources are scheduled and used. Looking back at the interrupt-driven I/O process above, we can see that the data copying in steps 6 and 7 is done by the CPU itself, i.e. the CPU is fully occupied during those two copy phases and cannot do any other work, so there is clearly room for optimization. The copy in step 7 goes from the kernel buffer to the user buffer, both in main memory, so that step can only be done by the CPU; but the copy in step 6 goes from the disk controller's buffer into main memory, a transfer between two devices, and this step does not have to be done by the CPU: it can be done with the help of DMA to relieve the CPU.

The full name of DMA is Direct Memory Access, that is, direct memory access, which is used to provide high-speed data transmission between peripherals and memory or between memory and memory. The whole process does not require CPU participation, and the data is directly moved and copied quickly through the DMA controller, saving CPU resources for other tasks.

Currently, most computers are equipped with DMA controllers, and DMA technology supports most peripherals and memories. With the help of the DMA mechanism, the computer's I/O process can be more efficient:

The DMA controller contains several registers that the CPU can read and write: a main-memory address register MAR (holding the main-memory address at which data is exchanged), a peripheral address register ADR (holding the device code of the I/O device, or the addressing information of the device's data area), a word count register WC (counting the total amount of data to be transferred), and one or more control registers.

  1. The user process initiates a read() system call to read a disk file and traps into kernel mode; the CPU programs the DMA controller by setting its registers: it writes the kernel buffer address and the disk file address into the MAR and ADR registers respectively, writes the number of bytes to be read into the WC register, and starts the DMA controller;
  2. From the information in the ADR register, the DMA controller knows that the peripheral to be read in this I/O is a certain address on the disk, so it sends a command to the disk controller telling it to read data from the disk into its internal buffer;
  3. The disk controller starts the disk read, copies the data from the disk into its buffer, and verifies the checksum of the data in the buffer; if the data is valid, the DMA transfer can begin;
  4. The DMA controller sends a read request signal to the disk controller over the bus to start the DMA transfer. This signal is no different from the read request the CPU sends to the disk controller in the interrupt-driven I/O case above; the disk controller neither knows nor cares whether the request comes from the CPU or from the DMA controller;
  5. The DMA controller then directs the disk controller to transfer the data to the address held in the MAR register, i.e. the kernel buffer;
  6. When the transfer completes, the disk controller returns an ack to the DMA controller, and the value in the WC register is decremented by the corresponding data length; if WC is not yet 0, steps 4 to 6 are repeated until the count in WC reaches 0;
  7. Having received the ack signal, the DMA controller sends an interrupt signal to the interrupt controller over the bus. If the interrupt controller is still handling another interrupt, or a higher-priority interrupt arrives at the same time, this signal is ignored for now, and the DMA controller keeps asserting it until the interrupt controller accepts it;
  8. After the interrupt controller receives the interrupt signal from the DMA controller, it puts main memory's device number on the address bus, indicating that the device needing attention for this interrupt is main memory;
  9. The interrupt controller asserts a DMA interrupt signal to the CPU;
  10. When the CPU receives the interrupt signal, it stops its current work, pushes the current PC/PSW and other registers onto the stack to save the context, takes the device number from the address bus, uses it to find the entry address of the interrupt service routine in the interrupt vector, loads that address into the PC register, and runs the DMA interrupt service routine, which copies the data from the kernel buffer to the user buffer; the read operation completes, read() returns, and execution switches back to user mode.

Traditional I/O read and write mode

Traditional I/O in Linux is done through the read()/write() system calls: read() reads data from storage (disk, network card, etc.) into the user buffer, and write() writes data from the user buffer out to storage:

#include <unistd.h>

ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
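As a sketch of this traditional pattern, the loop below reads a disk file into a user-space buffer and writes it back out to a socket; sock_fd is assumed to be an already-connected socket descriptor, and every chunk of data crosses the user/kernel boundary twice:

#include <fcntl.h>
#include <unistd.h>

/* Traditional copy loop: disk file -> user buffer -> socket.
 * sock_fd is assumed to be an already-connected socket descriptor. */
int send_file_traditional(const char *path, int sock_fd)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0) {   /* kernel -> user copy */
        if (write(sock_fd, buf, (size_t)n) != n) {   /* user -> kernel copy */
            close(fd);
            return -1;
        }
    }
    close(fd);
    return n < 0 ? -1 : 0;
}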

The underlying transmission process of reading a disk file once and then writing it out to the network card is as follows:

It is clear that a total of 4 context switches between user mode and kernel mode are triggered here, namely the switches when read()/write() are called and when they return, together with 2 DMA copies and 2 CPU copies, for a total of 4 copy operations.

By introducing DMA we have already reduced the number of CPU copies in the Linux I/O process from 4 to 2, but CPU copies remain expensive operations with a large impact on system performance, especially in I/O-heavy scenarios where a great deal of performance is lost to them. We need to optimize further, reducing or even completely eliminating CPU copies.

Zero-copy

What is zero-copy?

Wikipedia explains it as follows:

" Zero-copy" describes computer operations in which the  CPU does not perform the task of copying data from one  memory area to another. This is frequently used to save CPU cycles and memory bandwidth when transmitting a file over a network.

The zero-copy technology means that when the computer performs an operation, the CPU does not need to copy data from a certain memory to another specific area first. This technique is commonly used to save CPU cycles and memory bandwidth when transferring files over a network .

What can zero-copy do?

  • Reduce or even completely avoid the data copy operation between the operating system kernel and the user application address space, thereby reducing the system overhead caused by the context switching between the user mode and the kernel mode.
  • Reduce or even completely avoid data copy operations between operating system kernel buffers.
  • Help user processes bypass the operating system kernel space to directly access the hardware storage interface to operate data.
  • Use DMA instead of the CPU to copy data between the hardware interface and the kernel buffer, freeing the CPU to perform other tasks and improving system performance.

What are the ways to implement zero-copy?

Since the concept of zero-copy was first proposed, implementations have sprung up like bamboo shoots after rain. But to date no single zero-copy technique satisfies every scenario; as the classic saying in computer science goes: "There is no silver bullet"!

On the Linux platform there are likewise many zero-copy techniques, old and new, living in different kernel versions; many have been greatly improved or replaced by newer implementations. By their core ideas, these implementations can be roughly grouped into the following three categories:

  • Reduce or even avoid data copies between user space and kernel space: in some scenarios the user process does not need to access or process the data during transmission, so the transfer between the Linux Page Cache and the user process's buffer can be avoided entirely, letting the copy happen completely inside the kernel, or even avoiding copies inside the kernel through cleverer means. Implementations in this category are generally provided through new system calls, such as mmap(), sendfile(), and splice() in Linux.
  • Direct I/O that bypasses the kernel: allow a user-mode process to exchange data directly with the hardware, with the kernel doing only some management and assistance during the transfer. This is somewhat similar to the first category in that it also tries to avoid transfers between user space and kernel space, but where the first category performs the transfer inside the kernel, this one bypasses the kernel and talks to the hardware directly; the effect is similar, the principle completely different.
  • Optimize the transfer between kernel buffers and user buffers: this approach focuses on optimizing the CPU copies between the user process's buffer and the operating system's page cache. It continues the traditional communication model, but is more flexible.

Reduce or even avoid data copies between user space and kernel space

mmap()

#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
int munmap(void *addr, size_t length);

One simple implementation is to replace read() with another Linux system call, mmap(), during the read-write cycle. mmap() stands for memory map: it maps a buffer in the user process's address space (the user buffer) onto the kernel buffer where the file resides.

The entire process of using mmap() in place of read(), combined with write(), is as follows (a code sketch follows the list):

  1. The user process calls mmap(), switching from user mode into kernel mode, and the kernel buffer is mapped into the user buffer;
  2. The DMA controller copies the data from the disk into the kernel buffer;
  3. mmap() returns, and the context switches from kernel mode back to user mode;
  4. The user process calls write(), attempting to write the file data to the socket buffer in the kernel, and traps into kernel mode again;
  5. The CPU copies the data from the kernel buffer into the socket buffer;
  6. The DMA controller copies the data from the socket buffer to the network card, completing the transfer;
  7. write() returns, and the context switches from kernel mode back to user mode.
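Here is a minimal sketch of that mmap() + write() combination in user code; sock_fd is assumed to be an already-connected socket, and error handling is kept to the bare minimum:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send a file over a socket using mmap() + write() instead of read() + write(). */
int send_file_mmap(const char *path, int sock_fd)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    /* Map the file into the process's address space: no read() copy into a
     * user buffer; the pages are shared with the kernel page cache.         */
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* write() still performs one CPU copy, from the page cache to the
     * socket buffer, entirely inside the kernel.                            */
    ssize_t sent = write(sock_fd, addr, (size_t)st.st_size);

    munmap(addr, (size_t)st.st_size);
    close(fd);
    return sent == st.st_size ? 0 : -1;
}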

This approach has two advantages. First, it saves memory: the user-process side of this region is virtual and does not actually occupy physical memory; it is merely mapped onto the kernel buffer holding the file, so roughly half the memory is saved. Second, it saves one CPU copy: compared with traditional Linux I/O, the data no longer has to be relayed through the user process but is copied directly inside the kernel. With mmap() the copies are therefore 2 DMA copies plus 1 CPU copy, 3 copy operations in total, one CPU copy and about half the memory less than the traditional I/O method; however, because mmap() is itself a system call, the number of switches between user mode and kernel mode is still 4.

Because mmap() saves CPU copies and memory, it is better suited to transferring large files. Although mmap() is fully POSIX-compliant, it is not perfect, because it does not always achieve ideal transfer performance: first, one CPU copy is still required during the transfer; second, memory mapping is an expensive virtual-memory operation, since it must modify the page table and replace the relevant TLB entries with mappings of the file data in the kernel buffer to keep the virtual memory mappings consistent. Nevertheless, because memory mapping is usually applied to a fairly large region, for the same amount of data the mapping overhead is far lower than the CPU copy overhead. In addition, mmap() users may encounter special situations worth noting: for example, during the mmap() --> write() sequence, if another process suddenly truncates the file, the user process touching the now-invalid mapping will be killed by a SIGBUS signal for accessing an illegal address and will produce a core dump. There are two solutions:

  1. Install a signal handler specifically for SIGBUS. If this handler simply returns, write() returns the number of bytes written before the interruption rather than the process being killed by SIGBUS, and errno is set to success. However, this is really just papering over the problem, because a SIGBUS signal means something serious has gone wrong in the system and we are choosing to ignore it; this approach is generally not recommended (a minimal sketch of the handler approach follows this list).
  2. A better solution is the kernel's file lease lock (this is the Linux name; on Windows it is called an opportunistic lock). We ask the kernel for a read/write lease on the file descriptor; when another process tries to truncate the file the current user process is transferring, the kernel sends the user process a real-time signal, the RT_SIGNAL_LEASE signal, telling it that the kernel is breaking the read/write lease it holds on that file. The write() system call is then interrupted before the process can be killed by SIGBUS; write() returns the number of bytes written before the interruption, and errno is set to success. The file lease must be taken before memory-mapping the file and released before the user process exits.
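A minimal sketch of the first option follows; rather than having the SIGBUS handler literally return, this common variant uses sigsetjmp()/siglongjmp() to escape the faulting access so the transfer can fail cleanly, which is illustrative only and not a recommended production pattern:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf jmp_env;

/* SIGBUS handler: jump back to the checkpoint so the process is not killed
 * when a page of a truncated mapping is touched.                           */
static void on_sigbus(int sig)
{
    (void)sig;
    siglongjmp(jmp_env, 1);
}

int main(void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_sigbus;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGBUS, &sa, NULL);

    if (sigsetjmp(jmp_env, 1) == 0) {
        /* ... mmap() the file and write() the mapped region here ...
         * If another process truncates the file and a mapped page is
         * touched, SIGBUS fires and control returns to sigsetjmp().   */
        printf("transfer path\n");
    } else {
        printf("file was truncated during transfer, aborting cleanly\n");
    }
    return 0;
}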

sendfile()

In version 2.1 of the Linux kernel, a new system call, sendfile(), was introduced:

#include <sys/sendfile.h>

ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

Functionally, this system call combines the two system calls mmap() + write() into one, achieving the same effect while simplifying the user interface. Some other Unix-like systems, such as BSD, Solaris, and AIX, have similar implementations; even Windows has a similar API function, TransmitFile.

out_fd and in_fd are the file descriptors for writing and reading respectively. in_fd must refer to a file and must support mmap()-like memory mapping, so it cannot be a socket; out_fd, before Linux kernel 2.6.33, could only be a file descriptor referring to a socket, but since 2.6.33 it can be any type of file descriptor. offset is a pointer to the offset within in_fd from which sendfile() starts reading; after the call returns, this pointer is updated to the position of the last byte read, which tells you how much file data the call consumed. The final parameter, count, is the total number of bytes to transfer in this call.
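A minimal usage sketch: sending a whole file to a socket with sendfile(), where sock_fd is assumed to be an already-connected socket and the loop handles partial transfers:

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send an entire file to an already-connected socket using sendfile(). */
int send_file_zero_copy(const char *path, int sock_fd)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel moves data from the page cache toward the socket
         * without ever copying it into a user-space buffer.            */
        ssize_t n = sendfile(sock_fd, fd, &offset, (size_t)(st.st_size - offset));
        if (n <= 0) {
            close(fd);
            return -1;
        }
    }
    close(fd);
    return 0;
}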

The process of using  sendfile() to complete a data read and write is as follows:

  1. The user process calls sendfile(), switching from user mode into kernel mode;
  2. The DMA controller copies the data from the disk into the kernel buffer;
  3. The CPU copies the data from the kernel buffer into the socket buffer;
  4. The DMA controller copies the data from the socket buffer to the network card, completing the transfer;
  5. sendfile() returns, and the context switches from kernel mode back to user mode.

With sendfile(), the entire data transfer involves 2 DMA copies and 1 CPU copy, the same as mmap() + write(); but because sendfile() is a single system call, it incurs less user/kernel context-switching overhead than the former. At this point the astute reader will ask: "Does sendfile() run into the same file-truncation problem as mmap() + write()?" Unfortunately, yes: sendfile() also suffers from file truncation. The good news is that sendfile() not only has a simpler interface than mmap() + write(), it also handles truncation more elegantly: if the file is truncated while sendfile() is running, the system call is interrupted and, before the user process can be killed, returns the number of bytes transferred before the interruption, with errno set to success. There is no need to install a signal handler in advance (though you may still install one for custom handling), and no need to take a lease lock on the file descriptor beforehand, because the end result is the same.

Another advantage of sendfile() over mmap() is that the data never crosses the boundary between user mode and kernel mode during the transfer, which greatly reduces storage-management overhead. Even so, sendfile() is a technique with a very narrow range of applicability; the scenario it suits best is essentially a static file server. Moreover, according to mailing-list messages from Linus and other kernel maintainers in 2001, the reason sendfile() was implemented on Linux at all was that other operating system platforms already had it and the well-known Apache web server was already using it; it was implemented on Linux for compatibility with Apache, and because the implementation was simple and integrated well with the rest of the kernel, Linus agreed to the proposal.

However, sendfile() has significant problems of its own, which, viewed from different angles, are mainly the following:

  1. First, this interface has never been standardized, so the implementation of sendfile() on Linux differs from that on other Unix-like systems;
  2. Second, because of the asynchronous nature of network transmission, it is hard to implement a counterpart of sendfile() on the receiving side, so the corresponding technique has never been implemented for receiving;
  3. Finally, in terms of performance, sendfile() still requires the CPU to participate in moving disk file data from the kernel buffer (the page cache) into the socket buffer, so it is hard to keep the transferred data from polluting the CPU cache.

In addition, it should be noted that sendfile() was not originally designed for very large files, so if you need to handle them you can use another system call, sendfile64(), which supports addressing and offsets into larger file contents.

sendfile() with DMA Scatter/Gather Copy

The sendfile() technique introduced in the previous section has already reduced the CPU copies in a single read-write transfer to one, but people are always greedy and never satisfied: now we want to remove that last remaining CPU copy as well. Is there a way?

Of course there is! By bringing in support from new hardware we can erase the last remaining CPU copy: in kernel version 2.4, Linux added DMA scatter/gather support and modified sendfile() to work with it. Scatter means a DMA copy no longer requires the data to sit in one contiguous region of memory; it may be stored in discrete pieces. Gather means the DMA controller can use a small amount of meta-information, buffer descriptors containing memory addresses and data lengths, to collect the data scattered around memory and reassemble it into a complete network packet, copying it directly to the network card instead of into the socket buffer and thereby avoiding the last CPU copy:

The data transfer process of sendfile() + DMA gather is as follows:

  1. The user process calls sendfile() and traps from user mode into kernel mode;
  2. The DMA controller uses the scatter function to copy data from the disk into the kernel buffer, stored discretely;
  3. The CPU copies the buffer descriptors containing the memory addresses and data lengths into the socket buffer; the DMA controller can generate the network packet headers and trailers from this information;
  4. Using the memory addresses and lengths in the buffer descriptors, the DMA controller gathers the discrete data from the kernel buffer, assembles the packets, and copies them directly to the NIC, completing the transfer;
  5. sendfile() returns and the context switches from kernel mode back to user mode.

With this scheme we can also remove the last remaining CPU copy (strictly speaking there is still one, but since the CPU now only copies tiny pieces of metadata, its cost is practically negligible). In theory no CPU is involved in the data transfer at all, so the CPU cache is no longer polluted and the CPU no longer has to compute data checksums; it can run other business computations concurrently with the DMA I/O, which can greatly improve system performance.

splice()

Although the sendfile() + DMA Scatter/Gather zero-copy scheme is efficient, it still has two drawbacks:

  1. The scheme requires support from new hardware;
  2. Although since Linux kernel 2.6.33 the output file descriptor of sendfile() can be any type of file descriptor, the input file descriptor must still refer to a regular file.

These two shortcomings limit the applicability of the sendfile() + DMA Scatter/Gather scheme. For this reason, Linux introduced a new system call, splice(), in kernel 2.6.17. It is functionally very similar to sendfile(), but it can move data between any two file descriptors; and at the implementation level, splice() performs one CPU copy fewer than plain sendfile(), which puts it on par with sendfile() + DMA Scatter/Gather: the CPU copy is removed from the data transfer entirely.

The splice() system call is defined as follows:

#define _GNU_SOURCE         // splice() and pipe2() are GNU extensions
#include <fcntl.h>
#include <unistd.h>

int pipe(int pipefd[2]);
int pipe2(int pipefd[2], int flags);

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

fd_in and fd_out are the input and output file descriptors respectively. One of the two must refer to a pipe device; this is an unfriendly restriction. Although the kernel developers promised, when the system call was introduced, that the restriction might be removed by a later refactoring, the promise sank like a stone: some 14 years have passed and there is still no news...

off_in and off_out are offset pointers for fd_in and fd_out, telling the kernel where to read from and write to; len is the number of bytes this call should transfer; and the last parameter, flags, is a bit mask that sets the behaviour of the call, composed by OR-ing together zero or more of the following values:

  • SPLICE_F_MOVE: asks splice() to try to move memory pages rather than copy them. Setting this value does not guarantee that pages will not be copied; whether the kernel moves or copies depends on whether the pages can be moved out of the pipe and whether the pages in the pipe are whole. The initial implementation of this flag was buggy, so it has been a no-op since Linux 2.6.21, but it is kept because it may be re-implemented in a future version.
  • SPLICE_F_NONBLOCK: asks splice() not to block on I/O, i.e. makes the call non-blocking, which can be used for asynchronous data transfer. Note, however, that the two file descriptors involved should also be marked O_NONBLOCK in advance, otherwise the splice() call may still block.
  • SPLICE_F_MORE: tells the kernel that more data will follow in a subsequent splice() call. This flag is very useful when the output is a socket.

splice() is implemented on top of the Linux pipe buffer mechanism, which is why one of its two file descriptors must be a pipe device. A typical usage of splice() is:

#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

// file_fd is an open disk file, socket_fd a connected socket (both assumed to exist)
int pfd[2];

pipe(pfd);

// file -> write end of the pipe, then read end of the pipe -> socket
ssize_t bytes = splice(file_fd, NULL, pfd[1], NULL, 4096, SPLICE_F_MOVE);
assert(bytes != -1);

bytes = splice(pfd[0], NULL, socket_fd, NULL, bytes, SPLICE_F_MOVE | SPLICE_F_MORE);
assert(bytes != -1);

Data transmission process diagram:

The process of splice()-ing a disk file out to the NIC is as follows:

  1. The user process calls pipe(), trapping from user mode into kernel mode; an anonymous unidirectional pipe is created, pipe() returns, and the context switches back from kernel mode to user mode;
  2. The user process calls splice(), trapping from user mode into kernel mode;
  3. The DMA controller copies the data from the disk into the kernel buffer, which is then "copied" into the pipe from its write end; splice() returns and the context switches back from kernel mode to user mode;
  4. The user process calls splice() again, trapping from user mode into kernel mode;
  5. The kernel "copies" the data from the read end of the pipe into the socket buffer, and the DMA controller copies the data from the socket buffer to the NIC;
  6. splice() returns and the context switches from kernel mode back to user mode.

After reading this sequence, readers will surely be confused: wasn't splice() supposed to be an improved version of sendfile()? sendfile() needs only one system call, while splice() needs three, with a pipe stuck in the middle and two "copies" inside kernel space. How is that any improvement?

I had the same reaction when I first learned about splice(), but after digging deeper I gradually saw the trick. Let me explain in detail:

Let's first look at the pipe buffer. A pipe is a channel used for inter-process communication on Linux; it has two ends, a write end and a read end. From the process's point of view, a pipe behaves as a FIFO byte stream backed by a ring queue:

A pipe is essentially a file that lives in memory, built on the Linux VFS. A user process can create an anonymous pipe with the pipe() system call; once created, two VFS file/inode structures point to its write end and read end, and the corresponding two file descriptors are returned, through which the user process reads from and writes to the pipe. The pipe's capacity unit is a page of virtual memory, i.e. 4 KB, and its total size is normally 16 pages; thanks to the ring structure the pipe's pages can be recycled, improving memory utilisation. In Linux the pipe_buffer structure encapsulates a pipe page, and the inode field of the file structure stores a pipe_inode_info structure describing the pipe, which holds much of the metadata needed to read and write it: the head of the ring queue of pages, and the synchronisation machinery for readers and writers such as the mutex and wait queues:

struct pipe_buffer {
 struct page *page; // the physical memory page holding the data
 unsigned int offset, len; // offset and length of the data within the page
 const struct pipe_buf_operations *ops;
 unsigned int flags;
 unsigned long private;
};

struct pipe_inode_info {
 struct mutex mutex;
 wait_queue_head_t wait;
 unsigned int nrbufs, curbuf, buffers;
 unsigned int readers;
 unsigned int writers;
 unsigned int files;
 unsigned int waiting_writers;
 unsigned int r_counter;
 unsigned int w_counter;
 struct page *tmp_page;
 struct fasync_struct *fasync_readers;
 struct fasync_struct *fasync_writers;
 struct pipe_buffer *bufs;
 struct user_struct *user;
};

pipe_buffer stores the page, offset and length of the data in memory, and these three values are what locate the data. Note that the page here is not a page of virtual memory but a physical page frame: a pipe is a channel that crosses processes, so it cannot be described with virtual addresses and must locate its data with physical page frames. Normal pipe reads and writes are performed by pipe_write()/pipe_read(), which transfer data by reading and writing the pipe_buffer entries of the ring queue.

splice() is built on the pipe buffer, but moving data through the pipe with splice() is zero-copy: instead of calling pipe_write()/pipe_read() to actually write data into and read data out of the pipe buffer, it simply assigns the physical page-frame pointer, offset and length of the data already sitting in a memory buffer to the corresponding three fields of the pipe_buffer described above. The "copy" of the data is thus only a copy of metadata such as its memory address.

Inside the Linux kernel, splice() is implemented by the do_splice() function, and writing to and reading from the pipe are handled by do_splice_to() and do_splice_from() respectively. Here we focus on the write-to-pipe path, do_splice_to(); the kernel I have at hand is v4.8.17, so the analysis is based on that version. The read path, do_splice_from(), works on the same principle.

The call chain for splice() writing data into the pipe is: do_splice() --> do_splice_to() --> splice_read()

static long do_splice(struct file *in, loff_t __user *off_in,
        struct file *out, loff_t __user *off_out,
        size_t len, unsigned int flags)
{
...

 // if the output fd is a pipe device, enter the write-to-pipe logic
 if (opipe) {
  if (off_out)
   return -ESPIPE;
  if (off_in) {
   if (!(in->f_mode & FMODE_PREAD))
    return -EINVAL;
   if (copy_from_user(&offset, off_in, sizeof(loff_t)))
    return -EFAULT;
  } else {
   offset = in->f_pos;
  }

  // call do_splice_to() to write the file contents into the pipe
  ret = do_splice_to(in, &offset, opipe, len, flags);

  if (!off_in)
   in->f_pos = offset;
  else if (copy_to_user(off_in, &offset, sizeof(loff_t)))
   ret = -EFAULT;

  return ret;
 }

 return -EINVAL;
}

After entering do_splice_to(), it in turn calls splice_read():

static long do_splice_to(struct file *in, loff_t *ppos,
    struct pipe_inode_info *pipe, size_t len,
    unsigned int flags)
{
 ssize_t (*splice_read)(struct file *, loff_t *,
          struct pipe_inode_info *, size_t, unsigned int);
 int ret;

 if (unlikely(!(in->f_mode & FMODE_READ)))
  return -EBADF;

 ret = rw_verify_area(READ, in, ppos, len);
 if (unlikely(ret < 0))
  return ret;

 if (unlikely(len > MAX_RW_COUNT))
  len = MAX_RW_COUNT;

 // check whether the file's f_op provides a usable, splice-aware splice_read function pointer
 // since this is a splice() call, the kernel will have assigned a usable function to this pointer in advance
 if (in->f_op->splice_read)
  splice_read = in->f_op->splice_read;
 else
  splice_read = default_file_splice_read;

 return splice_read(in, ppos, pipe, len, flags);
}

The in->f_op->splice_read function pointer has different implementations depending on the type of file descriptor: since in here is a file, it is generic_file_splice_read(); for a socket it would be sock_splice_read(), and other types have their own implementations. In short, what gets used here is generic_file_splice_read(), which in turn calls the internal function __generic_file_splice_read() to do the following work:

  1. Look in the page cache to see whether the file content to be read is already cached; if so, use it directly, otherwise (if it is missing or only partially cached) allocate some new memory pages, issue the read, and increase the reference count of the page frames;
  2. Based on these memory pages, initialise a splice_pipe_desc structure, which records the address metadata of the file data: physical page-frame addresses, offsets and data lengths, exactly the three values pipe_buffer needs to locate the data;
  3. Finally, call splice_to_pipe() with the splice_pipe_desc instance as its input parameter.

ssize_t splice_to_pipe(struct pipe_inode_info *pipe, struct splice_pipe_desc *spd)
{
...

 for (;;) {
  if (!pipe->readers) {
   send_sig(SIGPIPE, current, 0);
   if (!ret)
    ret = -EPIPE;
   break;
  }

  if (pipe->nrbufs < pipe->buffers) {
   int newbuf = (pipe->curbuf + pipe->nrbufs) & (pipe->buffers - 1);
   struct pipe_buffer *buf = pipe->bufs + newbuf;

    // "write" the data into the pipe: no real data copy, only memory-address pointers move;
    // assigning the physical page frame, offset and length to the pipe_buffer enqueues the data
   buf->page = spd->pages[page_nr];
   buf->offset = spd->partial[page_nr].offset;
   buf->len = spd->partial[page_nr].len;
   buf->private = spd->partial[page_nr].private;
   buf->ops = spd->ops;
   if (spd->flags & SPLICE_F_GIFT)
    buf->flags |= PIPE_BUF_FLAG_GIFT;

   pipe->nrbufs++;
   page_nr++;
   ret += buf->len;

   if (pipe->files)
    do_wakeup = 1;

   if (!--spd->nr_pages)
    break;
   if (pipe->nrbufs < pipe->buffers)
    continue;

   break;
  }

 ...
}

Here you can see clearly that what splice() calls "writing data into the pipe" does not actually copy any data; it is a sleight of hand that copies only memory-address pointers, never the data itself. So splice() does not really copy data inside the kernel either, which is why the splice() system call also counts as zero-copy.

Another thing to note is that the pipe's capacity, as mentioned above, is 16 memory pages, i.e. 16 * 4KB = 64 KB. It is therefore best not to write more than 64 KB into the pipe in a single call, otherwise splice() will block, unless the pipe was created with pipe2() and the O_NONBLOCK flag to make it non-blocking.
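
As an illustration of this constraint (a sketch under the stated assumptions, not code from the article), the loop below pumps a file to a socket through a pipe at most 64 KB at a time, draining the pipe before the next chunk; file_fd and socket_fd are assumed to be an open regular file and a connected socket.

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Move `total` bytes from file_fd to socket_fd through a freshly created
 * pipe, 64 KB (the default pipe capacity) per round trip. */
ssize_t splice_file_to_socket(int file_fd, int socket_fd, size_t total)
{
    int pfd[2];
    ssize_t sent = 0;

    if (pipe(pfd) < 0)
        return -1;

    while ((size_t)sent < total) {
        /* file -> pipe: only pipe_buffer metadata is set up, no data copy */
        ssize_t in = splice(file_fd, NULL, pfd[1], NULL, 65536, SPLICE_F_MOVE);
        if (in <= 0)
            break;
        /* pipe -> socket: drain exactly what the previous splice produced */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(pfd[0], NULL, socket_fd, NULL, (size_t)left,
                                 SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0) {
                close(pfd[0]);
                close(pfd[1]);
                return sent > 0 ? sent : -1;
            }
            left -= out;
            sent += out;
        }
    }
    close(pfd[0]);
    close(pfd[1]);
    return sent;
}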

Even though splice() avoids the real copy cost by shuffling memory-address pointers, it still needs an extra pipe to complete the transfer, which means two more system calls than sendfile(). Doesn't that just add context-switch overhead? Why not create the pipe inside the kernel, make those two splice() calls there, and expose only a single system call to the user? In fact, because splice() achieves zero copy through the pipe rather than through special hardware, its barrier to entry is lower than that of sendfile() + DMA Scatter/Gather, and the underlying implementation of sendfile() was later replaced with one based on splice().

As for why the splice() API itself still works this way: the kernel developers have long wanted to remove the pipe-based limitation, but for whatever reason the work has been shelved, so the API has not changed and we can only wait for the day the kernel team remembers it and refactors splice() so that it no longer depends on a pipe. Until then, splice() still needs an extra pipe as an intermediate buffer. If your workload suits splice() but is performance-sensitive and you do not want to create and destroy pipe buffers all the time, you can borrow the splice() optimisation used by HAProxy: pre-allocate a pool of pipe buffers, take a pipe from the pool on every splice() call, and put it back when done, recycling pipes to improve performance.
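
A minimal sketch of that pipe-pool idea (illustrative only; it assumes a single-threaded caller, whereas a real implementation such as HAProxy's adds locking, sizing policy and error handling):

#include <unistd.h>

#define POOL_SIZE 64

static int pool[POOL_SIZE][2];   /* cached pipes: [i][0] read end, [i][1] write end */
static int pool_top;             /* number of pipes currently cached */

/* Take a pipe from the pool, or create a new one if the pool is empty. */
static int pipe_get(int pfd[2])
{
    if (pool_top > 0) {
        pool_top--;
        pfd[0] = pool[pool_top][0];
        pfd[1] = pool[pool_top][1];
        return 0;
    }
    return pipe(pfd);
}

/* Return a pipe to the pool, or close it if the pool is already full. */
static void pipe_put(const int pfd[2])
{
    if (pool_top < POOL_SIZE) {
        pool[pool_top][0] = pfd[0];
        pool[pool_top][1] = pfd[1];
        pool_top++;
    } else {
        close(pfd[0]);
        close(pfd[1]);
    }
}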

send() with MSG_ZEROCOPY

In 2017, Linux kernel v4.14 accepted a patch by Google engineer Willem de Bruijn that adds a zero-copy option (MSG_ZEROCOPY) to the generic TCP send path. With this new feature, a user process can pass data from a user buffer through kernel space to a network socket in a zero-copy fashion via send(). It is a step beyond the zero-copy techniques introduced above, because those all require that the user process does not touch the data and simply forwards it to the target file descriptor. The benchmark numbers Willem de Bruijn gives in his paper are: with netperf large-packet sends, performance improves by 39%, while in production the data-send performance improves by 5%~8%. The official documentation notes that this feature usually only brings a noticeable performance gain when sending large packets of around 10 KB. Initially the feature only supported TCP; UDP support arrived in kernel v5.0.

The usage pattern of this function is as follows:

int one = 1;

if (setsockopt(socket_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
        error(1, errno, "setsockopt zerocopy");

ret = send(socket_fd, buffer, sizeof(buffer), MSG_ZEROCOPY);

The first step is to set the SO_ZEROCOPY option on the sending socket, and the second is to pass the MSG_ZEROCOPY flag when calling send(). In theory only one of the two (either the setsockopt() option or the send() flag) should be necessary, but here both must be set. The official explanation is compatibility with an old quirk of the send() API: earlier implementations silently ignored unknown flags, so to avoid breaking programs that might have accidentally passed MSG_ZEROCOPY, the feature was designed as a two-step opt-in. My guess is that there is another motive as well: it gives users a more flexible usage pattern. The feature may only pay off for large packets, yet real workloads are messy, rarely all-large or all-small but a mixture of both, so a user can enable SO_ZEROCOPY once with setsockopt() and then decide per send() call whether to pass MSG_ZEROCOPY for a zero-copy transfer.

Because send() may transmit the data asynchronously, there is one point that needs special attention when using MSG_ZEROCOPY: the buffer must not be reused or freed immediately after send() returns, since the kernel may not have read its contents yet. You therefore have to read the notification messages on the socket's error queue to learn when the data in the buffer has been consumed by the kernel:

pfd.fd = fd;
pfd.events = 0;
if (poll(&pfd, 1, -1) != 1 || (pfd.revents & POLLERR) == 0)
        error(1, errno, "poll");

ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
if (ret == -1)
        error(1, errno, "recvmsg");

read_notification(msg);


uint32_t read_notification(struct msghdr *msg)
{
 struct sock_extended_err *serr;
 struct cmsghdr *cm;

 cm = CMSG_FIRSTHDR(msg);
 if (cm->cmsg_level != SOL_IP &&
  cm->cmsg_type != IP_RECVERR)
   error(1, 0, "cmsg");

 serr = (void *) CMSG_DATA(cm);
 if (serr->ee_errno != 0 ||
  serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
   error(1, 0, "serr");

 return serr->ee_data;
}

This feature is built on top of the virtio-net zero-copy work that Red Hat contributed to the Linux kernel in 2010. The underlying idea, briefly, is that send() passes pointers to the segments of data in the user buffer down to the socket, the user buffer's memory pages are locked with page pinning, and DMA then reads the data directly from the user buffer via those memory-address pointers, achieving zero copy. For the details, read Willem de Bruijn's paper (PDF).
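
Putting the fragments above together, here is a consolidated sketch of the whole MSG_ZEROCOPY flow (for illustration only; it assumes a connected TCP socket sock, a kernel and libc new enough to define SO_ZEROCOPY/MSG_ZEROCOPY, and it waits for a single notification, whereas real code should match the ee_data counters against individual sends):

#include <poll.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

static int send_zerocopy(int sock, const void *buf, size_t len)
{
    int one = 1;
    if (setsockopt(sock, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)) < 0)
        return -1;

    if (send(sock, buf, len, MSG_ZEROCOPY) < 0)
        return -1;

    /* buf must stay untouched until the completion shows up on the error queue */
    struct pollfd pfd = { .fd = sock, .events = 0 };
    if (poll(&pfd, 1, -1) != 1 || !(pfd.revents & POLLERR))
        return -1;

    char control[100];
    struct msghdr msg = { .msg_control = control, .msg_controllen = sizeof(control) };
    if (recvmsg(sock, &msg, MSG_ERRQUEUE) == -1)
        return -1;

    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    if (cm == NULL)
        return -1;
    struct sock_extended_err *serr = (void *)CMSG_DATA(cm);
    if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
        return -1;

    return 0;   /* the kernel is done with buf; it may now be reused */
}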

At present, the main drawbacks of this technique are:

  1. It is only suitable for large sends (around 10 KB); for small packets, the page-pinning and the notification machinery for waiting until the buffer can be released may even cost more than a plain CPU copy;
  2. Because the data may be sent asynchronously, extra poll() and recvmsg() system calls are needed to wait for the buffer-released notification, which complicates the code and causes additional user/kernel context switches;
  3. MSG_ZEROCOPY currently only supports the send side; the receive side is not yet supported.

Comparison of Linux zero-copy methods

Whether we use the traditional I/O copy approach or a zero-copy approach, the 2 DMA copies are always required, because both are performed by hardware. Below, the I/O copy schemes above are compared in terms of CPU copies, DMA copies, system calls and context switches.

Copy method                    CPU copies   DMA copies   System calls    Context switches
Traditional (read + write)     2            2            read / write    4
Memory mapping (mmap + write)  1            2            mmap / write    4
sendfile                       1            2            sendfile        2
sendfile + DMA gather copy     0            2            sendfile        2
splice                         0            2            splice          2

Direct I/O that bypasses the kernel

As you can see, all of the zero-copy approaches above are about reducing or eliminating data copies between user space and kernel space, or within kernel space, and all kinds of tricks are used to avoid those copies. But let's change our perspective: the only reason we have to work so hard to eliminate these copies is that the kernel sits in the middle of every I/O. If we bypass the kernel and do I/O directly, wouldn't all those annoying copies simply disappear? That is direct I/O that bypasses the kernel:

This scheme has two implementation flavours:

  1. Direct user access to hardware
  2. Kernel-controlled access to hardware

Direct user access to hardware

This technique gives user processes the permission to access hardware devices directly, so a user process can read from and write to the device itself; during the transfer the kernel only has to do some virtual-memory configuration work. Such direct I/O, with no data copies and no kernel intervention, is in theory the most efficient data-transfer technique of all, but as said before there is no silver bullet: although it can be extremely fast, its applicability is very narrow, and today it is used only in limited scenarios such as MPI high-performance communication and remote shared memory in cluster computing systems.

This technique actually breaks one of the most important concepts of modern operating systems: hardware abstraction. As mentioned earlier, abstraction is the single most central design idea in computing; it is precisely because of abstraction and layering that each layer can ignore low-level details and focus on its real job, making the system run efficiently. Furthermore, NICs usually carry relatively weak CPUs, for example MIPS-architecture processors with a simple instruction set (no unnecessary features such as floating-point arithmetic), and do not have much memory to host complex software. Therefore this technique is normally used only by special-purpose protocols layered directly on Ethernet, which are far simpler than TCP/IP and are mostly used in LAN environments, where packet loss and corruption are rare and complex acknowledgement and flow-control mechanisms are unnecessary. The technique also requires custom NICs, so it is highly hardware-dependent.

Compared with traditional communication designs, direct hardware access imposes various restrictions on program design: since data transfer between devices is done by DMA, the user-space data buffer's memory pages must be page-pinned (locked) to prevent their physical page frames from being swapped to disk or moved to a new address, which would otherwise cause a page fault when DMA cannot find the page at the expected address. The cost of page locking is not smaller than that of a CPU copy, so to avoid frequent page-locking system calls the application must allocate and register a persistent memory pool used for data buffering.

Direct user access to hardware can deliver extremely high I/O performance, but its application domains and usable scenarios are also extremely limited, such as inter-node communication in clusters or network storage systems. It requires custom hardware and specially designed applications, but correspondingly only minor changes to the operating system kernel, and it can easily be packaged as a kernel module or device driver. Direct hardware access can also cause serious security problems: since the user process has very high privileges to touch the hardware directly, a poorly designed program may exhaust limited hardware resources or access illegal addresses, may indirectly affect other applications using the same device, and because the kernel is bypassed, the kernel can no longer control or manage any of this for you.

Kernel controls access to hardware

Compared with direct user access, kernel-controlled direct access to hardware is safer. The kernel intervenes a little more in the transfer than in the former approach, but only as a broker; it does not take part in the actual data transfer, instead driving the DMA engine to do the buffer transfers on behalf of the user process. This approach is likewise highly hardware-dependent, e.g. NICs that integrate a proprietary network-stack protocol. One advantage is that the I/O interface presented to the user does not change; it can be used just like ordinary read()/write() system calls, with all the dirty work done inside the kernel, so the user-facing interface stays very friendly. Note, however, that if some unforeseen problem makes this technique unusable for a given transfer, the kernel automatically falls back to the most traditional I/O mode, i.e. the worst-performing one.

This technique shares a problem with direct user access to hardware: during DMA transfers the user process's buffer pages must be page-pinned and can only be unpinned after the transfer completes, and the relevant memory addresses held in the CPU cache must be flushed to guarantee data consistency before and after the DMA transfer. These mechanisms may actually make data transfer slower, because the semantics of the read()/write() system calls cannot tell the CPU in advance that a user buffer will participate in a DMA transfer, so the buffer cannot be pre-loaded into the cache the way a kernel buffer can. Since the user buffer's memory pages may be scattered anywhere in physical memory, some poorly implemented DMA controllers have addressing restrictions that prevent them from reaching those memory regions. Some techniques, such as the IOMMU in the AMD64 architecture, can work around these limits by remapping DMA addresses to physical memory addresses, but this in turn raises portability issues because other processor architectures, even Intel's x86-64 variant EM64T, lack such a unit. There may be further restrictions as well, such as alignment requirements for DMA transfers, which prevent access to arbitrary buffer addresses specified by the user process.
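
As a concrete, widely available example of kernel-mediated direct I/O (chosen here only for illustration; the techniques above cover a broader class of hardware-assisted designs), the sketch below opens a file with O_DIRECT so that the kernel DMAs data straight into a user buffer, bypassing the page cache. The alignment requirement mirrors the DMA alignment restrictions just mentioned; "path" is a hypothetical file name.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK 4096   /* O_DIRECT wants block-aligned buffer, offset and length */

/* path is a hypothetical file name supplied by the caller */
ssize_t read_one_block_direct(const char *path, void **out_buf)
{
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    void *buf = NULL;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) {   /* DMA-friendly alignment */
        close(fd);
        return -1;
    }

    ssize_t n = read(fd, buf, BLOCK);   /* the kernel DMAs directly into buf */
    close(fd);

    if (n < 0) {
        free(buf);
        return -1;
    }
    *out_buf = buf;
    return n;
}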

Transfer optimization between kernel buffers and user buffers

So far, the zero-copy techniques we have discussed all work by reducing or even avoiding CPU data copies between user space and kernel space. Some of them are very efficient, but most suffer from narrow applicability: for example sendfile() and splice() are fast, yet they only fit scenarios where the user process does not need to touch the data, such as static file servers or proxies that forward data as-is.

We now know that data can be transferred between hardware devices via DMA, but no such mechanism exists for transfers between user buffers and kernel buffers. On the other hand, the virtual-memory machinery used by modern CPU architectures and operating systems shows that, by remapping pages at different virtual addresses, memory can be virtually copied and shared between a user process and the kernel, even though the granularity of one transfer is relatively coarse: 4 KB or 8 KB.

So if data has to be processed inside the user process (a far more common case than plain forwarding) before being sent out, the transfer between user space and kernel space is unavoidable. Since it cannot be avoided, it can only be optimised, so this section introduces two techniques that optimise data transfer between user space and kernel space:

  1. Dynamic remapping with Copy-on-Write
  2. Buffer Sharing

Dynamic remapping with Copy-on-Write

Earlier we introduced memory mapping as a way to reduce copies between user space and kernel space. In the simple model, the user process reads and writes the shared buffer in a synchronous, blocking fashion, which avoids data races but is not very efficient. One way to improve efficiency is to read and write the shared buffer asynchronously, and then a protection mechanism is needed to avoid data conflicts. Copy-on-Write is exactly such a mechanism.

Copy-on-write (COW) is an optimisation strategy in computer programming. The core idea is: when multiple callers request the same resource (for example memory or on-disk data) at the same time, they all receive the same pointer to the same resource; only when one caller attempts to modify the resource does the system make a real private copy for that caller, while the resource seen by all other callers remains unchanged. The process is transparent to the other callers. The main advantage is that if a caller never modifies the resource, no private copy is ever created, so as long as the callers only read, they can all share a single copy.

As an example, with COW a user process that reads a disk file, processes the data and finally writes it to the NIC works like this: first, memory mapping lets the user buffer and kernel buffer share a region of memory marked read-only, avoiding a data copy; when the data is to be written to the NIC, the user process chooses an asynchronous write, the system call returns immediately, the transfer proceeds asynchronously in the kernel, and the user process can continue other work, reading the shared buffer at any time, which is very efficient. But if the process then tries to write into the shared buffer, a COW event is raised: the writing process copies the data into its own buffer to modify it, and only the pages to be modified need to be copied rather than all of the data, while other processes sharing the memory never need a copy if they never modify it.

COW is built on virtual-memory mapping, so it requires the MMU's hardware support. The MMU records which memory pages are currently marked read-only; when a process tries to write to one of these pages, the MMU raises an exception to the kernel, which, while handling the exception, allocates physical memory for the process, copies the data into that memory, and then re-executes the write that triggered the fault.
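
A tiny demonstration of the copy-on-write semantics just described, using mmap() with MAP_PRIVATE (a sketch for illustration; "path" is a hypothetical, non-empty, readable file). The mapped pages initially share the page cache; the first write to a page faults into the kernel, which copies just that page for this process, leaving the underlying file untouched.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* path: hypothetical file name; the file is assumed to be at least one byte long */
int cow_demo(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* MAP_PRIVATE: writes trigger copy-on-write instead of modifying the file */
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                 /* the mapping stays valid after close()        */
    if (p == MAP_FAILED)
        return -1;

    char first = p[0];         /* read: still sharing the cached page          */
    p[0] = first + 1;          /* write: COW fault, a private copy of the page */

    munmap(p, 4096);
    return 0;
}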

The biggest benefit of COW is that it saves memory and reduces data copying, but the price is extra complexity in the kernel's I/O path. When COW pages have to be copied, it matters where the free pages come from: many operating systems provide a free page pool for such requests. These free pages are typically allocated when a process's stack or heap needs to grow or when there are copy-on-write pages to manage, and the OS usually hands them out using a technique called zero-fill-on-demand: the pages are zero-filled before being handed over, so any old contents are cleared.

Limitations:

COW-style zero copy is best suited to read-mostly workloads, where COW events are rare, because the system overhead of a COW event is much higher than that of a single CPU copy. In practice, to avoid repeated memory mapping, the same buffer can also be reused: instead of releasing the page mapping after using a shared buffer once, it is recycled. However, keeping the mapping alive does not reduce the overhead of page-table round trips and TLB flushes, because every time a COW event occurs the page must be locked or unlocked and its protection flag toggled between read-only and writable.

Buffer Sharing

As shown earlier, the traditional Linux I/O interfaces are copy-based: data has to be copied between kernel space and user-space buffers. Before performing I/O, the user process pre-allocates a memory buffer; with the read() system call, the kernel copies the data read from storage or the NIC into that user buffer, and with write() the data in the user buffer is copied into a kernel buffer.

To implement this traditional I/O model, Linux has to set up and tear down virtual-memory mappings for every I/O operation. The efficiency of this page-remapping machinery is severely limited by the cache architecture, the speed of MMU address translation and the TLB hit rate. If the overhead of virtual-address translation and TLB flushing incurred while handling I/O requests could be avoided, I/O performance could improve substantially. Buffer sharing is a technique aimed at exactly this problem.

The earliest operating system to support buffer sharing was Solaris; Linux later gained gradual support for the technique as well, although it is still incomplete and immature.

Kernel developers have implemented a buffer-sharing framework named fbufs (Fast Buffers), which uses an fbuf buffer as the smallest unit of data transfer. Using this technology requires new operating-system APIs, and all data exchanged between user space and the kernel, or between kernel subsystems, must travel strictly through the fbufs machinery. fbufs allocates a buffer pool for each user process, holding pre-allocated (or lazily allocated on first use) buffers that are mapped into both user memory space and kernel memory space. fbufs creates a buffer with just one virtual-memory mapping operation, effectively eliminating most of the performance cost of maintaining memory consistency.

The traditional Linux I/O interfaces transfer data by copying it between user buffers and kernel buffers, which requires a great deal of copying; moreover, because of virtual memory, the I/O path also has to keep translating virtual addresses to physical addresses through the MMU, replacing cache lines and flushing the TLB, all of which costs performance. With the fbufs framework, all buffers can be cached in the pool and recycled instead of being re-allocated each time, and not only the buffers themselves are cached: the virtual-to-physical address mappings are cached too, avoiding address translation on every use. From the point of view of sending and receiving data, the user process and the I/O subsystems (device drivers, NICs, etc.) pass the buffer itself rather than the data inside it, which can also be understood as passing memory-address pointers, avoiding bulk copies of the data: the user process / I/O subsystem writes data to the kernel by handing over fbufs one by one rather than copying it directly, and reads data from the kernel by receiving fbufs one by one, which reduces the copy overhead of the traditional read()/write() system calls:

  1. The sending user process calls uf_allocate to obtain an fbuf from its own buffer pool, fills it with content, and then calls uf_write to pass a file descriptor pointing to the fbuf to the kernel;
  2. After the I/O subsystem receives the fbuf, it calls uf_allocb to obtain an fbuf from the receiving user process's buffer pool, fills it with the received data, and then passes a file descriptor pointing to that fbuf up to user space;
  3. The receiving user process calls uf_get to receive the fbuf, reads and processes the data, and when finished calls uf_deallocate to put the fbuf back into its own buffer pool.

Defects of fbufs

Implementing shared buffers requires cooperation among the user process, the operating system kernel, and the I/O subsystems (device drivers, file systems, and so on). For example, a badly designed user process can easily modify an fbuf that has already been handed off and thereby corrupt the data, and worse, such problems are hard to debug. Although the design is elegant, its barriers and constraints are no fewer than those of the other techniques introduced earlier: it changes the operating-system API and requires a set of new calls; device drivers have to be modified to cooperate; and because memory is shared, the kernel must very carefully implement data protection and synchronisation for the shared region, and such concurrent synchronisation mechanisms are extremely bug-prone, further increasing kernel complexity, and so on. Techniques of this class are therefore still far from mature and widely deployed; most implementations today remain experimental.

8. Zero-copy implementations in Java NIO

In Java NIO, a Channel corresponds to a buffer in the operating system's kernel space, while a Buffer corresponds to a user buffer in the operating system's user space.

  • A Channel is full-duplex (bidirectional); it may be backed by a read buffer or by a network (socket) buffer.
  • A Buffer is either heap memory (HeapBuffer) or off-heap memory (DirectBuffer), the latter being user-space memory allocated with malloc().

Off-heap memory (DirectBuffer) must be reclaimed manually by the application after use, while data in heap memory (HeapBuffer) may be moved or reclaimed during GC. Therefore, when HeapBuffer is used for I/O, in order to avoid the buffer data becoming invalid because of GC, NIO first copies the HeapBuffer's contents into a temporary DirectBuffer in native memory. This copy goes through sun.misc.Unsafe.copyMemory(), whose implementation is essentially the same as memcpy(). Finally, the memory address of the data inside the temporary DirectBuffer is passed to the I/O call, thereby avoiding access to a Java object when performing the read or write.

8.1. MappedByteBuffer

MappedByteBuffer is NIO's zero-copy implementation based on memory mapping (mmap), and it inherits from ByteBuffer. FileChannel defines a map() method that maps a region of size bytes, starting at position in a file, into a memory image of the file. The abstract map() method is defined in FileChannel as follows:

public abstract MappedByteBuffer map(MapMode mode, long position, long size)
        throws IOException;
  • mode: Limit the access mode of the memory mapping area (MappedByteBuffer) to the memory image file, including read-only (READ_ONLY), readable and writable (READ_WRITE) and copy-on-write (PRIVATE).
  • position: The starting address of the file mapping, corresponding to the first address of the memory mapping area (MappedByteBuffer).
  • size: The byte length of the file mapping, the number of bytes from the position onwards, corresponding to the size of the memory mapping area (MappedByteBuffer).

Compared with ByteBuffer, MappedByteBuffer adds three important methods: force(), load() and isLoaded():

  • force(): for a buffer in READ_WRITE mode, forces changes made to the buffer's content to be written back to the local file.
  • load(): Load the contents of the buffer into physical memory and return a reference to the buffer.
  • isLoaded(): Returns true if the contents of the buffer are in physical memory, false otherwise.

The following is an example of using MappedByteBuffer to read and write files:

private final static String CONTENT = "Zero copy implemented by MappedByteBuffer";
private final static String FILE_NAME = "/mmap.txt";
private final static String CHARSET = "UTF-8";

  • Writing file data: open the file channel fileChannel with read, write and truncate options, map it via fileChannel into a writable memory buffer mappedByteBuffer, write the target data into mappedByteBuffer, and use the force() method to force the buffer's contents back to the local file.
@Test
public void writeToFileByMappedByteBuffer() {
    Path path = Paths.get(getClass().getResource(FILE_NAME).getPath());
    byte[] bytes = CONTENT.getBytes(Charset.forName(CHARSET));
    try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ,
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        MappedByteBuffer mappedByteBuffer = fileChannel.map(READ_WRITE, 0, bytes.length);
        if (mappedByteBuffer != null) {
            mappedByteBuffer.put(bytes);
            mappedByteBuffer.force();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

  • Reading file data: open the file channel fileChannel read-only, map it via fileChannel into a read-only memory buffer mappedByteBuffer, and read the byte array out of mappedByteBuffer to obtain the file data.
@Test
public void readFromFileByMappedByteBuffer() {
    Path path = Paths.get(getClass().getResource(FILE_NAME).getPath());
    int length = CONTENT.getBytes(Charset.forName(CHARSET)).length;
    try (FileChannel fileChannel = FileChannel.open(path, StandardOpenOption.READ)) {
        MappedByteBuffer mappedByteBuffer = fileChannel.map(READ_ONLY, 0, length);
        if (mappedByteBuffer != null) {
            byte[] bytes = new byte[length];
            mappedByteBuffer.get(bytes);
            String content = new String(bytes, StandardCharsets.UTF_8);
            assertEquals(content, "Zero copy implemented by MappedByteBuffer");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}


The underlying implementation of the map() method is introduced below. map() is an abstract method of java.nio.channels.FileChannel and is implemented by the subclass sun.nio.ch.FileChannelImpl.java. The following is the core code related to memory mapping:

public MappedByteBuffer map(MapMode mode, long position, long size) throws IOException {
    int pagePosition = (int)(position % allocationGranularity);
    long mapPosition = position - pagePosition;
    long mapSize = size + pagePosition;
    try {
        addr = map0(imode, mapPosition, mapSize);
    } catch (OutOfMemoryError x) {
        System.gc();
        try {
            Thread.sleep(100);
        } catch (InterruptedException y) {
            Thread.currentThread().interrupt();
        }
        try {
            addr = map0(imode, mapPosition, mapSize);
        } catch (OutOfMemoryError y) {
            throw new IOException("Map failed", y);
        }
    }
    
    int isize = (int)size;
    Unmapper um = new Unmapper(addr, mapSize, isize, mfd);
    if ((!writable) || (imode == MAP_RO)) {
    	return Util.newMappedByteBufferR(isize, addr + pagePosition, mfd, um);
    } else {
    	return Util.newMappedByteBuffer(isize, addr + pagePosition, mfd, um);
    }
}


The map() method allocates a piece of virtual memory for the file as its memory mapping area through the local method map0(), and then returns the starting address of the memory mapping area.

  1. File mapping requires creating an instance of MappedByteBuffer in the Java heap. If the first file mapping results in OOM, manually trigger garbage collection, sleep for 100ms and then try to map, and throw an exception if it fails.
  2. Create a DirectByteBuffer instance through Util's newMappedByteBuffer (readable and writable) method or newMappedByteBufferR (read-only) method, where DirectByteBuffer is a subclass of MappedByteBuffer.

The map() method returns the starting address of the memory mapping area, and the data of the specified memory can be obtained by (starting address + offset). This replaces the read() or write() methods to a certain extent, and the bottom layer directly uses the getByte() and putByte() methods of the sun.misc.Unsafe class to read and write data.

private native long map0(int prot, long position, long mapSize) throws IOException;


The above is the declaration of the native method map0(), which calls into the underlying C implementation through JNI (Java Native Interface). That native function, Java_sun_nio_ch_FileChannelImpl_map0, is located in the source file native/sun/nio/ch/FileChannelImpl.c inside the JDK source tree.

JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_map0(JNIEnv *env, jobject this,
                                     jint prot, jlong off, jlong len)
{
    void *mapAddress = 0;
    jobject fdo = (*env)->GetObjectField(env, this, chan_fd);
    jint fd = fdval(env, fdo);
    int protections = 0;
    int flags = 0;

    if (prot == sun_nio_ch_FileChannelImpl_MAP_RO) {
        protections = PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_RW) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_PV) {
        protections =  PROT_WRITE | PROT_READ;
        flags = MAP_PRIVATE;
    }

    mapAddress = mmap64(
        0,                    /* Let OS decide location */
        len,                  /* Number of bytes to map */
        protections,          /* File permissions */
        flags,                /* Changes are shared */
        fd,                   /* File descriptor of mapped file */
        off);                 /* Offset into file */

    if (mapAddress == MAP_FAILED) {
        if (errno == ENOMEM) {
            JNU_ThrowOutOfMemoryError(env, "Map failed");
            return IOS_THROWN;
        }
        return handle(env, -1, "Map failed");
    }

    return ((jlong) (unsigned long) mapAddress);
}


It can be seen that the map0() function finally issues a memory mapping call to the underlying Linux kernel through the mmap64() function. The prototype of the mmap64() function is as follows:

#include <sys/mman.h>

void *mmap64(void *addr, size_t len, int prot, int flags, int fd, off64_t offset);


The following describes the meaning of each parameter of the mmap64() function and its possible values; a minimal C sketch of the equivalent mmap() call follows the list.

  • addr: The starting address of the file in the memory-mapped area of ​​the user process space. It is a suggested parameter and can usually be set to 0 or NULL. At this time, the kernel determines the real starting address. When flags is MAP_FIXED, addr is a mandatory parameter, that is, an existing address needs to be provided.
  • len: the byte length of the file that needs to be memory mapped
  • prot: Controls access rights of user processes to memory-mapped areas
    • PROT_READ: read permission
    • PROT_WRITE: write permission
    • PROT_EXEC: execute permission
    • PROT_NONE: No permission
  • flags: Controls whether the modification of the memory mapping area is shared by multiple processes
    • MAP_PRIVATE: The modification of the data in the memory mapping area will not be reflected in the real file, and the copy-on-write mechanism is adopted when the data modification occurs
    • MAP_SHARED: Modifications to the memory mapping area will be synchronized to the real file, and the modification is visible to processes sharing the memory mapping area
    • MAP_FIXED: not recommended; in this mode the addr parameter must specify an existing address
  • fd: file descriptor. Each map operation will cause the reference count of the file to increase by 1, and each unmap operation or end of the process will cause the reference count to decrease by 1
  • offset: file offset. The file position for mapping, the backward displacement from the start address of the file
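
A minimal sketch (for illustration, assuming fd is an already-open file descriptor) of calling mmap() directly from C with the parameters just described, mirroring the MAP_RW branch of the JNI code above:

#include <stddef.h>
#include <sys/mman.h>

void *map_file_rw(int fd, size_t len)
{
    void *addr = mmap(NULL,                    /* let the kernel pick the address */
                      len,                     /* number of bytes to map          */
                      PROT_READ | PROT_WRITE,  /* access permissions              */
                      MAP_SHARED,              /* writes go back to the file      */
                      fd,                      /* mapped file descriptor          */
                      0);                      /* offset into the file            */
    return addr == MAP_FAILED ? NULL : addr;
}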

The following summarizes the characteristics and shortcomings of MappedByteBuffer:

  • MappedByteBuffer uses virtual memory outside the heap, so the memory size allocated (map) is not limited by the -Xmx parameter of the JVM, but there is also a size limit.
  • If the file exceeds the Integer.MAX_VALUE byte limit, the content behind the file can be remapped through the position parameter.
  • MappedByteBuffer does have high performance when processing large files, but it also has problems such as memory usage and uncertain file closing. Files opened by it will only be closed after garbage collection, and this time point is uncertain.
  • MappedByteBuffer provides mmap()-based file mapping as well as an unmap() method to release the mapped memory, but unmap() is a private method of FileChannelImpl and cannot be called directly. Therefore, a user program that wants to release the mapped region must, via Java reflection, invoke the clean() method of the sun.misc.Cleaner class.
public static void clean(final Object buffer) throws Exception {
    AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
        try {
            Method getCleanerMethod = buffer.getClass().getMethod("cleaner", new Class[0]);
            getCleanerMethod.setAccessible(true);
            Cleaner cleaner = (Cleaner) getCleanerMethod.invoke(buffer, new Object[0]);
            cleaner.clean();
        } catch(Exception e) {
            e.printStackTrace();
        }
    });
}


8.2. DirectByteBuffer

The object reference of DirectByteBuffer is located in the heap of the Java memory model. The JVM can perform memory allocation and recovery management on the DirectByteBuffer object. Generally, the static method allocateDirect() of DirectByteBuffer is used to create a DirectByteBuffer instance and allocate memory.

public static ByteBuffer allocateDirect(int capacity) {
    return new DirectByteBuffer(capacity);
}


The byte buffer inside DirectByteBuffer is located in the direct memory outside the heap (user mode). It allocates memory through the local method allocateMemory() of Unsafe, and the underlying call is the malloc() function of the operating system.

DirectByteBuffer(int cap) {
    super(-1, 0, cap, cap);
    boolean pa = VM.isDirectMemoryPageAligned();
    int ps = Bits.pageSize();
    long size = Math.max(1L, (long)cap + (pa ? ps : 0));
    Bits.reserveMemory(size, cap);

    long base = 0;
    try {
        base = unsafe.allocateMemory(size);
    } catch (OutOfMemoryError x) {
        Bits.unreserveMemory(size, cap);
        throw x;
    }
    unsafe.setMemory(base, size, (byte) 0);
    if (pa && (base % ps != 0)) {
        address = base + ps - (base & (ps - 1));
    } else {
        address = base;
    }
    cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
    att = null;
}


In addition, a Deallocator thread is created when DirectByteBuffer is initialized, and the direct memory is reclaimed through the Cleaner's freeMemory() method. The bottom layer of freeMemory() is the free() function of the operating system.

private static class Deallocator implements Runnable {
    private static Unsafe unsafe = Unsafe.getUnsafe();

    private long address;
    private long size;
    private int capacity;

    private Deallocator(long address, long size, int capacity) {
        assert (address != 0);
        this.address = address;
        this.size = size;
        this.capacity = capacity;
    }

    public void run() {
        if (address == 0) {
            return;
        }
        unsafe.freeMemory(address);
        address = 0;
        Bits.unreserveMemory(size, capacity);
    }
}


Since DirectByteBuffer is used to allocate local memory of the system, which is not within the control of the JVM, the recovery of direct memory is different from the recovery of heap memory. If direct memory is used improperly, it is easy to cause OutOfMemoryError.

Having said all that, what does DirectByteBuffer have to do with zero copy? As mentioned earlier, when MappedByteBuffer performs memory mapping, its map() method will create a buffer instance through Util.newMappedByteBuffer(). The initialization code is as follows:

static MappedByteBuffer newMappedByteBuffer(int size, long addr, FileDescriptor fd,
                                            Runnable unmapper) {
    MappedByteBuffer dbb;
    if (directByteBufferConstructor == null)
        initDBBConstructor();
    try {
        dbb = (MappedByteBuffer)directByteBufferConstructor.newInstance(
            new Object[] { new Integer(size), new Long(addr), fd, unmapper });
    } catch (InstantiationException | IllegalAccessException | InvocationTargetException e) {
        throw new InternalError(e);
    }
    return dbb;
}

private static void initDBBRConstructor() {
    AccessController.doPrivileged(new PrivilegedAction<Void>() {
        public Void run() {
            try {
                Class<?> cl = Class.forName("java.nio.DirectByteBufferR");
                Constructor<?> ctor = cl.getDeclaredConstructor(
                    new Class<?>[] { int.class, long.class, FileDescriptor.class,
                                    Runnable.class });
                ctor.setAccessible(true);
                directByteBufferRConstructor = ctor;
            } catch (ClassNotFoundException | NoSuchMethodException |
                     IllegalArgumentException | ClassCastException x) {
                throw new InternalError(x);
            }
            return null;
        }});
}


DirectByteBuffer is the specific implementation class of MappedByteBuffer. In fact, the Util.newMappedByteBuffer() method obtains the constructor of DirectByteBuffer through the reflection mechanism, and then creates an instance of DirectByteBuffer, which corresponds to a separate construction method for memory mapping:

protected DirectByteBuffer(int cap, long addr, FileDescriptor fd, Runnable unmapper) {
    super(-1, 0, cap, cap, fd);
    address = addr;
    cleaner = Cleaner.create(this, unmapper);
    att = null;
}


Therefore, besides allowing direct allocation of operating-system memory, DirectByteBuffer itself also supports file memory mapping, which is not elaborated here. What we should note is that, on top of MappedByteBuffer, DirectByteBuffer provides random get() and put() access to the memory-mapped file.

  • Random read operation of memory image file
public byte get() {
    return ((unsafe.getByte(ix(nextGetIndex()))));
}

public byte get(int i) {
    return ((unsafe.getByte(ix(checkIndex(i)))));
}

  • Random write operations for memory-mapped files
public ByteBuffer put(byte x) {
    unsafe.putByte(ix(nextPutIndex()), ((x)));
    return this;
}

public ByteBuffer put(int i, byte x) {
    unsafe.putByte(ix(checkIndex(i)), ((x)));
    return this;
}


Random reads and writes of the memory-mapped file are performed with the help of the ix() method, which computes the pointer address from the base address of the memory-mapped region (address) plus the given offset i; the Unsafe get() and put() methods then read or write the data at that pointer.

private long ix(int i) {
    return address + ((long)i << 0);
}


8.3. FileChannel

FileChannel is a channel for reading, writing, mapping and manipulating files, and it is thread-safe in a concurrent environment. The getChannel() method based on FileInputStream, FileOutputStream or RandomAccessFile can create and open a file channel. FileChannel defines two abstract methods, transferFrom() and transferTo(), which implement data transfer by establishing connections between channels.

  • transferTo(): Write the source data in the file into a destination channel of WritableByteChannel through FileChannel.
public abstract long transferTo(long position, long count, WritableByteChannel target)
        throws IOException;

  • transferFrom(): Read the data in a source channel ReadableByteChannel to the file in the current FileChannel.
public abstract long transferFrom(ReadableByteChannel src, long position, long count)
        throws IOException;


The following is an example of how FileChannel uses the transferTo() and transferFrom() methods for data transfer:

private static final String CONTENT = "Zero copy implemented by FileChannel";
private static final String SOURCE_FILE = "/source.txt";
private static final String TARGET_FILE = "/target.txt";
private static final String CHARSET = "UTF-8";


First create two files, source.txt and target.txt, under the class loading root path, and write initialization data to the source file source.txt.

@Before
public void setup() {
    Path source = Paths.get(getClassPath(SOURCE_FILE));
    byte[] bytes = CONTENT.getBytes(Charset.forName(CHARSET));
    try (FileChannel fromChannel = FileChannel.open(source, StandardOpenOption.READ,
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
        fromChannel.write(ByteBuffer.wrap(bytes));
    } catch (IOException e) {
        e.printStackTrace();
    }
}


For the transferTo() method, the destination channel toChannel can be any one-way byte writing channel WritableByteChannel; and for the transferFrom() method, the source channel fromChannel can be any one-way byte reading channel ReadableByteChannel. Among them, the channels such as FileChannel, SocketChannel and DatagramChannel implement the WritableByteChannel and ReadableByteChannel interfaces, which are bidirectional channels that support reading and writing at the same time. For the convenience of testing, the following is an example of completing channel-to-channel data transmission based on FileChannel.

  • Copy the data in fromChannel to toChannel via transferTo()
@Test
public void transferTo() throws Exception {
    try (FileChannel fromChannel = new RandomAccessFile(
             getClassPath(SOURCE_FILE), "rw").getChannel();
         FileChannel toChannel = new RandomAccessFile(
             getClassPath(TARGET_FILE), "rw").getChannel()) {
        long position = 0L;
        long offset = fromChannel.size();
        fromChannel.transferTo(position, offset, toChannel);
    }
}

  • Copy the data in fromChannel to toChannel via transferFrom()
@Test
public void transferFrom() throws Exception {
    try (FileChannel fromChannel = new RandomAccessFile(
             getClassPath(SOURCE_FILE), "rw").getChannel();
         FileChannel toChannel = new RandomAccessFile(
             getClassPath(TARGET_FILE), "rw").getChannel()) {
        long position = 0L;
        long offset = fromChannel.size();
        toChannel.transferFrom(fromChannel, position, offset);
    }
}


The following introduces the underlying implementation of the transferTo() and transferFrom() methods. These two methods are also abstract methods of java.nio.channels.FileChannel, implemented by the subclass sun.nio.ch.FileChannelImpl.java. Both are ultimately based on sendfile for data transfer; FileChannelImpl.java defines 3 constants that indicate whether the kernel of the current operating system supports sendfile and its related features.

private static volatile boolean transferSupported = true;
private static volatile boolean pipeSupported = true;
private static volatile boolean fileSupported = true;

  • transferSupported: marks whether the current system kernel supports the sendfile() call at all; defaults to true.
  • pipeSupported: marks whether the current system kernel supports sendfile() calls on pipe-based file descriptors (fd); defaults to true.
  • fileSupported: marks whether the current system kernel supports sendfile() calls on file-based file descriptors (fd); defaults to true.

Take the source of transferTo() as an example. FileChannelImpl first executes the transferToDirectly() method, attempting the copy in sendfile's zero-copy mode. If the kernel does not support sendfile, it moves on to the transferToTrustedChannel() method, doing memory mapping in mmap's zero-copy mode; in this case the destination channel must be of type FileChannelImpl or SelChImpl. If both of those steps fail, it falls back to transferToArbitraryChannel(), which completes the read and write in the traditional I/O way: it initialises a temporary DirectBuffer, reads the data of the source FileChannel into the DirectBuffer, and then writes it into the destination WritableByteChannel.

public long transferTo(long position, long count, WritableByteChannel target)
        throws IOException {
    // compute the size of the file
    long sz = size();
    // validate the starting position
    if (position > sz)
        return 0;
    int icount = (int)Math.min(count, Integer.MAX_VALUE);
    // clamp the transfer count to the bytes remaining after position
    if ((sz - position) < icount)
        icount = (int)(sz - position);

    long n;

    if ((n = transferToDirectly(position, icount, target)) >= 0)
        return n;

    if ((n = transferToTrustedChannel(position, icount, target)) >= 0)
        return n;

    return transferToArbitraryChannel(position, icount, target);
}


Next, let's focus on the transferToDirectly() method, which is the essence of how transferTo() implements zero copy through sendfile. As can be seen below, transferToDirectly() first obtains the file descriptor targetFD of the destination channel WritableByteChannel, acquires the position lock if needed, and then executes the transferToDirectlyInternal() method.

private long transferToDirectly(long position, int icount, WritableByteChannel target)
        throws IOException {
    // the code that obtains targetFD from target is omitted here
    if (nd.transferToDirectlyNeedsPositionLock()) {
        synchronized (positionLock) {
            long pos = position();
            try {
                return transferToDirectlyInternal(position, icount,
                        target, targetFD);
            } finally {
                position(pos);
            }
        }
    } else {
        return transferToDirectlyInternal(position, icount, target, targetFD);
    }
}


Finally, transferToDirectlyInternal() calls the native method transferTo0(), which tries to transfer the data by way of sendfile. If the system kernel does not support sendfile at all, as is the case on Windows, it returns UNSUPPORTED and transferSupported is marked as false. If the kernel does not support certain features of sendfile, for example older Linux kernels that do not support the DMA gather copy operation, it returns UNSUPPORTED_CASE and pipeSupported or fileSupported is marked as false.

private long transferToDirectlyInternal(long position, int icount,
                                        WritableByteChannel target,
                                        FileDescriptor targetFD) throws IOException {
    assert !nd.transferToDirectlyNeedsPositionLock() ||
            Thread.holdsLock(positionLock);

    long n = -1;
    int ti = -1;
    try {
        begin();
        ti = threads.add();
        if (!isOpen())
            return -1;
        do {
            n = transferTo0(fd, position, icount, targetFD);
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        if (n == IOStatus.UNSUPPORTED_CASE) {
            if (target instanceof SinkChannelImpl)
                pipeSupported = false;
            if (target instanceof FileChannelImpl)
                fileSupported = false;
            return IOStatus.UNSUPPORTED_CASE;
        }
        if (n == IOStatus.UNSUPPORTED) {
            transferSupported = false;
            return IOStatus.UNSUPPORTED;
        }
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti);
        end(n > -1);
    }
}


The native method transferTo0() calls the underlying C function through JNI (Java Native Interface). This native function, Java_sun_nio_ch_FileChannelImpl_transferTo0, is located in the native/sun/nio/ch/FileChannelImpl.c source file of the JDK source package. The JNI function Java_sun_nio_ch_FileChannelImpl_transferTo0() uses conditional compilation to select a different implementation for each operating system. The following is the wrapper that the JDK provides for transferTo() on the Linux kernel.

#if defined(__linux__) || defined(__solaris__)
#include <sys/sendfile.h>
#elif defined(_AIX)
#include <sys/socket.h>
#elif defined(_ALLBSD_SOURCE)
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define lseek64 lseek
#define mmap64 mmap
#endif

JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_transferTo0(JNIEnv *env, jobject this,
                                            jobject srcFDO,
                                            jlong position, jlong count,
                                            jobject dstFDO)
{
    jint srcFD = fdval(env, srcFDO);
    jint dstFD = fdval(env, dstFDO);

#if defined(__linux__)
    off64_t offset = (off64_t)position;
    jlong n = sendfile64(dstFD, srcFD, &offset, (size_t)count);
    return n;
#elif defined(__solaris__)
    // declarations of sfv, numBytes and result are omitted in this excerpt
    result = sendfilev64(dstFD, &sfv, 1, &numBytes);
    return result;
#elif defined(__APPLE__)
    // declarations of numBytes and result are omitted in this excerpt
    result = sendfile(srcFD, dstFD, position, &numBytes, NULL, 0);
    return result;
#endif
}


On Linux, the transferTo0() function ultimately issues the sendfile64 system call to complete the zero-copy operation; on Solaris and macOS it calls the corresponding sendfilev64 and sendfile variants. The prototype of the sendfile64() function is as follows:

#include <sys/sendfile.h>

ssize_t sendfile64(int out_fd, int in_fd, off64_t *offset, size_t count);


The following briefly introduces the meaning of each parameter of the sendfile64() function:

  • out_fd: the file descriptor to write to
  • in_fd: the file descriptor to read from
  • offset: the position in the file referred to by in_fd at which reading starts; if it is NULL, reading starts from the file's current offset, which is then updated by the call
  • count: the number of bytes to transfer between the file descriptors in_fd and out_fd

Before Linux 2.6.33, out_fd had to refer to a socket, but since Linux 2.6.33 out_fd can be any file. In other words, the sendfile64() function can be used not only for network file transfer but also to perform zero-copy operations between local files.
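To connect this back to the Java-level API: when the destination channel of transferTo() is a SocketChannel, the JDK follows exactly the sendfile path analyzed above, which is the classic file-to-socket scenario supported even by kernels older than 2.6.33. Below is a minimal, hypothetical sketch (the class name, file name, host and port are placeholders) that streams a local file to a socket; since transferTo() may transfer fewer bytes than requested, it is invoked in a loop.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class SendFileDemo {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(
                 Paths.get("data.bin"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                 new InetSocketAddress("localhost", 9000))) {
            long position = 0L;
            long count = file.size();
            // transferTo() may move fewer bytes than requested, so loop until done
            while (position < count) {
                long transferred = file.transferTo(position, count - position, socket);
                if (transferred <= 0) {
                    break;
                }
                position += transferred;
            }
        }
    }
}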

9. Other zero-copy implementations

9.1. Netty Zero Copy

The zero copy in Netty is not the same as the operating-system-level zero copy discussed above. Netty's zero copy is implemented entirely in (Java-level) user mode and is better described as an optimization of data manipulation. The concept is embodied in the following aspects:

  • Netty wraps the transferTo() method of java.nio.channels.FileChannel through the DefaultFileRegion class, so that during file transfer the data in the file buffer can be sent directly to the destination channel (Channel)
  • ByteBuf can wrap a byte array, another ByteBuf, or a ByteBuffer into a ByteBuf object through the wrap operation, thereby avoiding a copy
  • ByteBuf supports the slice operation, so a ByteBuf can be split into multiple ByteBufs that share the same storage region, avoiding memory copies
  • Netty provides the CompositeByteBuf class, which merges multiple ByteBufs into one logical ByteBuf, avoiding copies between the individual ByteBufs

Among them, the first item is an operating-system-level zero-copy operation, while the last three only count as user-level data manipulation optimizations; the ByteBuf-level ones are illustrated by the sketch below.
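The following is a minimal sketch of those ByteBuf-level optimizations, assuming the Netty 4.1 io.netty.buffer API; the class name and the byte contents are hypothetical and only serve to show that wrapping, slicing and composing buffers never copy the underlying bytes.

import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.util.CharsetUtil;

public class NettyZeroCopyDemo {
    public static void main(String[] args) {
        byte[] header = "HEADER".getBytes(CharsetUtil.UTF_8);
        byte[] body = "BODY".getBytes(CharsetUtil.UTF_8);

        // wrap: the ByteBuf shares the backing array, no copy is made
        ByteBuf wrapped = Unpooled.wrappedBuffer(header);

        // slice: both slices share the storage of the original buffer
        ByteBuf firstHalf = wrapped.slice(0, 3);
        ByteBuf secondHalf = wrapped.slice(3, wrapped.readableBytes() - 3);
        System.out.println(firstHalf.toString(CharsetUtil.UTF_8));   // HEA
        System.out.println(secondHalf.toString(CharsetUtil.UTF_8));  // DER

        // composite: a logical view over several buffers, no merging copy
        CompositeByteBuf message = Unpooled.compositeBuffer();
        message.addComponents(true, Unpooled.wrappedBuffer(header), Unpooled.wrappedBuffer(body));
        System.out.println(message.readableBytes());                 // 10
    }
}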

9.2. Comparison between RocketMQ and Kafka

RocketMQ chooses the mmap + write zero-copy method, which is suitable for the persistence and transfer of small blocks of data such as business-level messages; Kafka uses the sendfile zero-copy method, which is suitable for high-throughput persistence and transfer of large blocks of data such as system log messages. It is worth noting, however, that Kafka uses mmap + write for its index files and sendfile only for its data files.

  • RocketMQ: zero-copy method is mmap + write. Advantage: well suited to transferring small-block files and very efficient when called frequently. Shortcoming: cannot make good use of DMA, consumes more CPU than sendfile, memory-safety control is complicated, and JVM crash problems have to be guarded against.
  • Kafka: zero-copy method is sendfile. Advantage: can take advantage of DMA, consumes less CPU, transfers large-block files efficiently, and has no memory-safety issues. Shortcoming: less efficient than mmap for small-block files, and the transfer can only be performed in BIO mode, not NIO mode.

question

Is zero copy always the best choice? No. Zero-copy paths such as mmap and sendfile go through the kernel's PageCache; when a very large file is transferred it benefits little from caching, yet it can evict the hot data of other, smaller files, so forcing such transfers through the PageCache can actually degrade overall performance.

So for file transfer, the right choice depends on the situation:

  1. When transferring large files, use asynchronous I/O + direct I/O, so that large files can be handled without blocking and without polluting the PageCache (see the sketch after this list)
  2. When transferring small files, use zero-copy technology
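As a rough illustration of the first option, the following is a minimal sketch that reads a large file with Java's AsynchronousFileChannel so the calling thread is never blocked; the class name and file name are placeholders, and true direct I/O (O_DIRECT) is not part of the standard java.nio API, so it is omitted here.

import java.nio.ByteBuffer;
import java.nio.channels.AsynchronousFileChannel;
import java.nio.channels.CompletionHandler;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CountDownLatch;

public class AsyncReadDemo {
    public static void main(String[] args) throws Exception {
        CountDownLatch done = new CountDownLatch(1);
        AsynchronousFileChannel channel = AsynchronousFileChannel.open(
                Paths.get("big-file.dat"), StandardOpenOption.READ);
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024 * 1024); // 1 MB chunk
        channel.read(buffer, 0, buffer, new CompletionHandler<Integer, ByteBuffer>() {
            @Override
            public void completed(Integer bytesRead, ByteBuffer attachment) {
                // runs on a pool thread; the caller was never blocked on the read
                System.out.println("read " + bytesRead + " bytes asynchronously");
                done.countDown();
            }

            @Override
            public void failed(Throwable exc, ByteBuffer attachment) {
                exc.printStackTrace();
                done.countDown();
            }
        });
        done.await();   // only so this demo JVM does not exit before the callback fires
        channel.close();
    }
}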

summary

The beginning of this article explained physical memory and virtual memory in the Linux operating system, the concepts of kernel space and user space, and the layered structure inside Linux. On that basis, it analyzed and compared the traditional I/O mode with the zero-copy mode, and then introduced several zero-copy implementations provided by the Linux kernel, including memory-mapped mmap, sendfile, sendfile + DMA gather copy and the splice mechanism, comparing them in terms of system calls and copy counts. Next, starting from the source code, it analyzed how Java NIO implements zero copy, mainly the memory-mapping-based (mmap) MappedByteBuffer and the sendfile-based FileChannel. Finally, it briefly explained Netty's zero-copy mechanism and the difference between the RocketMQ and Kafka message queues in their zero-copy implementations.

Conclusion

This article has mainly explained the underlying principles of Linux I/O, introduced and analyzed the zero-copy technologies in Linux, and presented the ideas behind the Linux kernel's optimization and improvement of its I/O modules.

The zero-copy technology of Linux can be summarized into the following three categories:

  • Reduce or even avoid data copies between user space and kernel space: in some scenarios the user process does not need to access or process the data during transmission, so the transfer between the Linux Page Cache and the user-process buffer can be avoided entirely, keeping the data copy inside the kernel; in some cases even the in-kernel copy can be avoided by cleverer means. This class of implementations is usually provided through new system calls, such as mmap(), sendfile() and splice() in Linux.
  • Direct I/O that bypasses the kernel: the user-mode process is allowed to bypass the kernel and exchange data with the hardware directly, with the kernel only performing management and auxiliary work during the transfer. This method is somewhat similar to the first one in that it also avoids data transfer between user space and kernel space, but the first keeps the transfer inside the kernel, whereas this one talks to the hardware directly, bypassing the kernel; the effect is similar, yet the principle is completely different.
  • Transfer optimization between the kernel buffer and the user buffer: this approach focuses on optimizing the CPU copy between the user process's buffer and the operating system's page cache. It continues the traditional communication pattern but is more flexible.

This article analyzed the underlying principles of Linux system I/O from the perspectives of virtual memory, the I/O buffer, user mode & kernel mode and I/O modes, examined the shortcomings of the traditional Linux I/O mode, and then introduced and analyzed Linux zero-copy technology. By distinguishing and comparing zero copy with the traditional I/O mode, it walks readers through the evolution of Linux I/O; understanding how the Linux kernel optimizes and improves its I/O modules should not only reveal the design principles of the underlying Linux system, but also offer inspiration for optimizing and improving readers' own programs in the future.
