Detailed explanation of zero copy


Preface

Zero copy is used in many places, such as Netty, Kafka, and RabbitMQ. So why is zero copy needed? Let's find out today.

Basic principles of IO

Kernel mode and user mode

In a modern operating system, system resources (the CPU, memory, and disks) may be accessed by multiple applications at the same time. Without protection, applications could conflict with one another, and a malicious application could crash the entire system.

For example, without protection any memory address could be written to; if something were written over the memory of an important program running on the system, the system might crash.

Therefore, to prevent user programs from operating on the kernel directly and to keep the kernel safe, the operating system divides memory into two spaces: user space and kernel space. Correspondingly, a process executing kernel code in kernel space is said to be in kernel mode, and a process executing user code in user space is said to be in user mode.

Kernel code has permission to access both the protected kernel space and the hardware devices, while user programs in user space have no such permission. Application programs (user programs in user space) can neither call functions defined in the kernel directly nor read or write kernel space directly.

System calls

The system call interface is the collection of all system calls implemented by the operating system; it is the programming interface between application programs and the operating system.

For example, we often write programs in an IDE that read and write data on disk. Such a program belongs to user space, and after it starts, the corresponding process runs in user mode. From user space it cannot read data in kernel space; it has to rely on the kernel's read and write routines to actually fetch the data. Since a user-mode program cannot call kernel code directly, the process must first switch from user mode to kernel mode before the kernel's read and write functions can be invoked. This calling process is called a system call.

System call process (a minimal C sketch follows the list):

  • First, the user program executes in user mode.
  • At some point the program needs a system call, for example a disk read operation.
  • A TRAP instruction is then executed; it interrupts the user program and switches from user mode to kernel mode.
  • The required operating system kernel function is called in kernel mode.
  • After the kernel function finishes, an interrupt is triggered, the system call returns, control of the CPU is handed back to the user program, and execution continues in user mode.
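To make this concrete, here is a minimal C sketch (the file name data.txt and the 4 KB buffer size are made up for illustration): open() and read() look like ordinary function calls in user mode, but each one traps into the kernel, runs kernel code, and then returns to user mode.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* open() and read() are system calls: each traps from user mode
       into kernel mode and then returns to user mode. */
    int fd = open("data.txt", O_RDONLY);     /* hypothetical file name */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char buf[4096];                           /* process (user) buffer */
    ssize_t n = read(fd, buf, sizeof(buf));   /* kernel copies data into buf */
    if (n < 0)
        perror("read");
    else
        printf("read %zd bytes back in user mode\n", n);

    close(fd);
    return 0;
}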

Buffers

From the above we know that switching from user mode to kernel mode involves an interrupt, and an interrupt means saving the data and state of the currently running process and then, once the interrupt is over, restoring that process's data, state, and other information.

To reduce the time spent on these interrupts, the simplest idea is to reduce their number. Our predecessors were clever and invented the concept of the buffer for exactly this purpose.

Process buffer and kernel buffer

Buffers are divided into process buffers and kernel buffers.

Note: the arrows in the figure below indicate the direction of data flow.

When a system call is made, for example the read() system call, read() actually only copies data from the kernel buffer to the process buffer. Reading data from the hardware device into the operating system's kernel buffer is done by the operating system's kernel functions, and exactly when the kernel performs that IO with the hardware device is up to the operating system. Correspondingly, the write() system call only copies data from the process buffer to the kernel buffer.
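A rough illustration of why the process buffer helps (a sketch; the file name is hypothetical): the C standard library gives every FILE stream its own process buffer, so even reading a file byte by byte with fgetc() only triggers a read() system call when that buffer needs to be refilled from the kernel.

#include <stdio.h>

int main(void) {
    /* fopen() gives the stream a user-space (process) buffer. */
    FILE *fp = fopen("data.txt", "r");    /* hypothetical file name */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }

    /* Each fgetc() normally reads from the process buffer; a read()
       system call (and the user/kernel switch it implies) only
       happens when the buffer is empty and has to be refilled. */
    long bytes = 0;
    while (fgetc(fp) != EOF)
        bytes++;

    printf("read %ld bytes with only a handful of system calls\n", bytes);
    fclose(fp);
    return 0;
}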

DMA technology

Now that we understand the basic principles of IO, let's take a closer look at the detailed process.

For example, we use the read() system call:

We can see from the figure that when the user process initiates a read() call, the only stage that does not need the CPU is the period after the request has been issued to the disk, while the disk is writing data into its own buffer. In every other stage the CPU has to participate, and it cannot run any other task during that time.

To solve this problem, our predecessors invented DMA (Direct Memory Access). A dedicated DMA controller handles the data transfer between IO devices and memory, freeing the CPU from this tedious, repetitive work. Once the DMA controller has transferred the data into memory, it notifies the CPU to continue processing.

zero copy

read+write

Let’s take a file transfer process as an example:

First we need to read the file from the disk, and then send it out over the network.

Process:

  • Initiate the read() system call
  • Switch from user mode to kernel mode, and then copy the disk data to the kernel buffer through DMA copy
  • Then copy the data in the kernel buffer to the user buffer through CPU copy
  • After the copy is completed, switch back to user mode from kernel mode.
  • Then issue the write() system call
  • Switch from user mode to kernel mode, and then copy the data in the user buffer to the socket buffer through CPU copy
  • Then copy the data in the socket buffer to the network card through DMA copy and send it out.
  • Then switch back to user mode

During this process, a total of four context switches and four data copies occur. Each of them may not take long on a high-performance CPU, but as concurrency increases the accumulated time leads to a significant drop in performance.
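For reference, a minimal C sketch of this traditional read()+write() transfer (the 4 KB chunk size is arbitrary, sock_fd is assumed to be an already connected socket, and short writes are treated as errors to keep the sketch small):

#include <unistd.h>

/* Traditional transfer: read() + write().
   file_fd is the file to send; sock_fd is assumed to be a connected socket. */
ssize_t copy_read_write(int file_fd, int sock_fd) {
    char buf[4096];                /* user buffer: the intermediate stop */
    ssize_t n, total = 0;

    while ((n = read(file_fd, buf, sizeof(buf))) > 0) {   /* kernel buffer -> user buffer (CPU copy) */
        if (write(sock_fd, buf, (size_t)n) != n)          /* user buffer -> socket buffer (CPU copy) */
            return -1;             /* simplified: treat a short write as an error */
        total += n;
    }
    return (n < 0) ? -1 : total;
}

Each iteration of the loop costs two system calls and two CPU copies, on top of the two DMA copies at the disk and the network card.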

The easiest ways one can think of to improve performance are:

  • Reduce the number of context switches
  • Reduce the number of data copies

So how can this be optimized?

From the above we know that the four context switches are caused by the two system calls, so we can consider how to reduce the number of system calls.

We cannot avoid the copies that involve the hardware, but look at the rest of the path: the data goes from the kernel buffer to the user buffer and then to the socket buffer. The user buffer is only an intermediate staging point, yet it costs two extra data copies, and both the kernel buffer and the socket buffer live in kernel space. Could the data be copied directly from the kernel buffer to the socket buffer? That would save one data copy.

mmap+write

The mmap() system call maps the kernel buffer into user space, so the data no longer needs to be copied from the kernel buffer to the user buffer, eliminating one data copy. The write() system call is then used to copy the data from the kernel buffer directly to the socket buffer.

In this scheme, a total of three data copies occur, one fewer than in the traditional approach; but since we still make two system calls, there are still four context switches.
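A sketch of the mmap()+write() variant under the same assumptions (sock_fd is a connected socket; mapping the whole file at once is a simplification, large files are usually handled in chunks):

#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* mmap() + write(): three data copies, still four context switches.
   sock_fd is assumed to be a connected socket. */
ssize_t copy_mmap_write(int file_fd, int sock_fd) {
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    /* Map the file into user space: the process now sees the kernel's
       page-cache pages directly, so no kernel -> user buffer copy. */
    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, file_fd, 0);
    if (addr == MAP_FAILED)
        return -1;

    /* write() copies from the mapped pages to the socket buffer. */
    ssize_t sent = write(sock_fd, addr, (size_t)st.st_size);

    munmap(addr, (size_t)st.st_size);
    return sent;
}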

sendfile()

Linux 2.1 introduced a system call specifically for transferring files: sendfile().

Function definition:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

Parameter meaning:

  • The in_fd parameter is the file descriptor of the content to be read.
  • The out_fd parameter is the file descriptor of the content to be written
  • The offset parameter specifies the position in the input file from which to start reading. If it is NULL, reading starts from the file's current offset.
  • The count parameter specifies the number of bytes transferred between file descriptors in_fd and out_fd.

The sendfile() function copies the data in the kernel buffer directly to the socket buffer, so the separate read() and write() system calls are no longer needed.
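A minimal sketch of using sendfile() (again, sock_fd is assumed to be a connected socket; the loop handles the case where the kernel transfers fewer bytes than requested):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* sendfile(): the file data never enters user space at all.
   sock_fd is assumed to be a connected socket. */
ssize_t copy_sendfile(int file_fd, int sock_fd) {
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    off_t offset = 0;
    size_t remaining = (size_t)st.st_size;

    while (remaining > 0) {
        /* The copy happens entirely inside the kernel; sendfile()
           advances offset by the number of bytes transferred. */
        ssize_t n = sendfile(sock_fd, file_fd, &offset, remaining);
        if (n <= 0)
            return -1;
        remaining -= (size_t)n;
    }
    return (ssize_t)st.st_size;
}

Because the user buffer disappears from the path, there is no longer any need to switch back and forth for a separate read() and write().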

In Java, the FileChannel.transferTo() method calls sendfile() at the bottom layer on Linux.

So, the whole process becomes:

Now the whole process is: two context switches and three data copies.

However, this is not the final zero-copy process.

sendfile()+SG-DMA

Starting with Linux 2.4, sendfile() was optimized further by introducing SG-DMA technology.

If the network card supports SG-DMA (scatter-gather DMA), only the descriptors (the location and length of the data in the kernel buffer) are appended to the socket buffer, and the SG-DMA engine copies the data from the kernel buffer directly to the network card, eliminating one more data copy.

This time the entire process only requires two context switches and two data copies.

Summary

To understand zero copy, you first need to understand the IO process: why system calls exist, why buffers exist, and what DMA is. Then, starting from the more detailed IO process, it becomes much easier to see how our predecessors optimized it step by step.

Zero copy sounds as if no data is copied at all, but in fact the process has only been reduced from the original four context switches and four data copies to two context switches and two data copies. Copying is not eliminated entirely, but the reduction greatly improves the system's IO performance.


Origin blog.csdn.net/weixin_43589025/article/details/124064988