Interview question: How to understand Linux zero-copy technology?

This article explains zero-copy technology in Linux. Cloud computing is a broad field that integrates many technologies, and Linux is one of its foundations, so a solid grasp of Linux helps a great deal when studying cloud computing. The article summarizes several common zero-copy techniques available under Linux.

Why zero copy is needed

The standard I/O interfaces of a traditional Linux system (read, write) are based on data copying: data moves between kernel space and user space through copy_to_user or copy_from_user. The advantage is that the intermediate kernel cache reduces disk I/O. The disadvantages are just as obvious: copying large amounts of data and switching frequently between user mode and kernel mode consume a lot of CPU and seriously hurt data transfer performance. Reported measurements show that in the Linux kernel protocol stack this copying can account for as much as 57.1% of the total time spent processing a packet.

What is zero copy

Zero copy addresses this problem: it relieves the CPU by avoiding copy operations wherever possible. Common zero-copy techniques under Linux fall into two categories: those that remove unnecessary copies in specific scenarios, and those that optimize the copy process as a whole. Seen this way, zero copy does not literally mean "0" copies; it is more of a design idea, and many zero-copy techniques are optimizations built on that idea.

 

Several methods of zero copy

Original data copy operation

Before introducing the methods, let's look at the original data copy path in Linux. As shown in the figure below, if an application needs to read content from a disk file and send it out over the network, the code looks like this:

/* Classic loop: each chunk is copied kernel -> buf (read), then buf -> kernel (write). */
while ((n = read(diskfd, buf, BUF_SIZE)) > 0)
    write(sockfd, buf, n);

Then the whole process needs to go through:

  • 1) read copies the data from the disk file into a kernel buffer, via DMA or a similar mechanism;
  • 2) the data is copied from the kernel buffer to the user-mode buffer;
  • 3) write copies the data from the user-mode buffer into the socket buffer of the kernel protocol stack;
  • 4) the data is copied from the socket buffer to the network card via DMA and sent out.

So the whole process involves at least four data copies. Two of them are performed by DMA between the hardware and memory, without direct CPU involvement. Setting those two aside, two CPU copy operations remain.
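The four steps above can be sketched as a complete function. This is a minimal illustration, not code from the original: the helper name copy_read_write and the 4096-byte buffer size are assumptions chosen for the example.

```c
#include <errno.h>
#include <unistd.h>

#define BUF_SIZE 4096   /* illustrative buffer size (assumption) */

/* Hypothetical helper: copy everything from in_fd to out_fd through a
 * user-space buffer. Every chunk costs two CPU copies: kernel -> buf in
 * read(), and buf -> kernel in write().
 * Returns the total number of bytes copied, or -1 on error. */
ssize_t copy_read_write(int in_fd, int out_fd)
{
    char buf[BUF_SIZE];
    ssize_t n, total = 0;

    while ((n = read(in_fd, buf, sizeof buf)) != 0) {
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry the read */
            return -1;
        }
        /* write() may accept fewer bytes than requested; loop until done. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;           /* interrupted: retry the write */
                return -1;
            }
            off += w;
        }
        total += n;
    }
    return total;
}
```

Calling copy_read_write(diskfd, sockfd) reproduces the loop shown earlier, with short writes and interrupted calls handled.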

Method 1: Direct I/O in user mode

This method lets applications or library functions running in user mode access hardware devices directly, so data is transferred without passing through the kernel. Apart from the necessary virtual memory setup, the kernel does no work during the transfer. Bypassing the kernel in this way can greatly improve performance.

Drawbacks:

  • 1) This method only suits applications that do not need the kernel's buffering. Such applications usually keep their own data cache in the process address space and are called self-caching applications; a database management system is a typical example.
  • 2) This method operates on disk I/O directly. Because disk I/O is far slower than the CPU, the process wastes time waiting on the disk. To solve this, direct I/O needs to be combined with asynchronous I/O.
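For file I/O on Linux, the closest widely available counterpart is probably the O_DIRECT open flag, which bypasses the kernel page cache rather than the kernel itself. The sketch below is an illustration under that assumption and is not from the original: the helper name read_direct and the 4096-byte alignment are made up for the example, and the real alignment requirement depends on the device and filesystem.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define DIO_ALIGN 4096  /* assumed alignment; actual requirement varies */

/* Hypothetical helper: read up to DIO_ALIGN bytes from the start of path
 * while bypassing the page cache, then copy at most cap bytes into out.
 * Returns the number of bytes read, or -1 on error (for example when the
 * filesystem does not support O_DIRECT). */
ssize_t read_direct(const char *path, char *out, size_t cap)
{
    void *buf = NULL;
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    /* O_DIRECT transfers need a suitably aligned buffer and length. */
    if (posix_memalign(&buf, DIO_ALIGN, DIO_ALIGN) != 0) {
        close(fd);
        return -1;
    }

    ssize_t n = read(fd, buf, DIO_ALIGN);
    if (n > 0)
        memcpy(out, buf, (size_t)n < cap ? (size_t)n : cap);

    free(buf);
    close(fd);
    return n;
}
```

A self-caching application would keep the aligned buffer as part of its own cache instead of copying out of it, and would pair reads like this with asynchronous I/O as described in drawback 2.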

Method 2: mmap

This method uses mmap instead of read, which removes one copy operation:

/* Map the file read-only instead of reading it into a user buffer. */
buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, diskfd, 0);
write(sockfd, buf, len);

When the application calls mmap, the data in the disk file is copied to a kernel buffer via DMA, and the operating system shares that buffer with the application, so no copy to user space is needed. When the application then calls write, the operating system copies the data directly from the kernel buffer to the socket buffer, and finally DMA copies it to the network card for transmission.
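A fuller version of the mmap path might look like the following. It is a minimal sketch rather than a definitive implementation; the helper name send_mapped is an assumption, and the whole file is mapped in one call for simplicity.

```c
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: send the whole file behind file_fd to out_fd
 * through a read-only mapping, so no read() into a user buffer happens.
 * Returns the number of bytes written, or -1 on error. */
ssize_t send_mapped(int file_fd, int out_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;
    if (st.st_size == 0)
        return 0;                       /* mmap rejects zero-length mappings */

    size_t len = (size_t)st.st_size;
    char *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, file_fd, 0);
    if (buf == MAP_FAILED)
        return -1;

    size_t off = 0;
    while (off < len) {
        /* The kernel copies straight from the mapped pages to out_fd. */
        ssize_t w = write(out_fd, buf + off, len - off);
        if (w < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry the write */
            munmap(buf, len);
            return -1;
        }
        off += (size_t)w;
    }
    munmap(buf, len);
    return (ssize_t)off;
}
```

Compared with the read/write loop, the only remaining CPU copy is the one from the mapped kernel pages into the socket buffer.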

Drawbacks:

  • 1) mmap hides a trap. If another process truncates the file while it is mapped, the write system call touches an address that is no longer valid and is terminated by a SIGBUS signal. By default SIGBUS kills the process and produces a core dump. For a server, being terminated this way can be costly.

To solve this problem, a file lease is usually used: first take a lease on the file. When another process tries to truncate the file, the kernel sends the lease holder a signal (SIGIO by default; a real-time signal can be selected with fcntl F_SETSIG, referred to here as RT_SIGNAL_LEASE) to say that a process is about to damage the file. The write call is then interrupted before SIGBUS can kill the process; it returns the number of bytes written so far, with errno set to success.

The usual practice is to take the lease before calling mmap and release it after the operation completes.
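A minimal sketch of that lease, map, write, release sequence is shown below. The helper name send_mapped_leased is an assumption, the file descriptor is assumed to be opened O_RDONLY with no other writers (a requirement for read leases), and signal handling is deliberately left out: a real program would install a handler for the lease-break signal, since the default action of SIGIO terminates the process.

```c
#define _GNU_SOURCE             /* exposes F_SETLEASE and F_GETLEASE */
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: take a read lease on file_fd, send its first len
 * bytes to out_fd through a mapping, then release the lease.
 * Returns the result of write() (which may be short), or -1 if the lease
 * or the mapping could not be set up. */
ssize_t send_mapped_leased(int file_fd, int out_fd, size_t len)
{
    ssize_t sent = -1;

    /* Lock: ask the kernel to notify us before anyone truncates the file. */
    if (fcntl(file_fd, F_SETLEASE, F_RDLCK) == -1)
        return -1;

    void *buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, file_fd, 0);
    if (buf != MAP_FAILED) {
        sent = write(out_fd, buf, len);
        munmap(buf, len);
    }

    /* Unlock: drop the lease once the mapping is no longer in use. */
    fcntl(file_fd, F_SETLEASE, F_UNLCK);
    return sent;
}
```

Leases are not available on every filesystem, so the caller should be prepared for the -1 return and fall back to another transfer method.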

Method 3: sendfile

sendfile was introduced in the Linux 2.1 kernel series and also removes one copy.

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

sendfile transfers data entirely in kernel mode, with no user-mode involvement, which naturally avoids the copy into user space. It transfers data from in_fd to out_fd. The file referred to by in_fd must support mmap-like operations (it cannot be a socket). Before Linux 2.6.33, out_fd had to refer to a socket, so data could only flow from a file to a socket and not the other way around; since 2.6.33, out_fd can be any file. sendfile also does not suffer from the truncation problem that affects mmap, because it has its own error handling.

Drawbacks:

  • 1) Only applicable to applications that do not need to process the data in user mode.

Method 4: DMA-assisted sendfile

The regular sendfile still performs one copy in kernel mode. Can that copy be removed as well?

The answer is DMA-assisted sendfile.

With hardware support, the step from the kernel buffer to the socket buffer copies only the buffer descriptors rather than the data. The DMA engine then copies the data directly from the kernel buffer to the protocol engine, which avoids the last CPU copy.

Drawbacks:

  • 1) In addition to the drawback of sendfile above, hardware and driver support is required.
  • 2) Only applicable to copying data from a file to a socket.

Method 5: splice

splice lifts the usage restrictions of sendfile and can move data between any two file descriptors.

#define _GNU_SOURCE         /* See feature_test_macros(7) */
#include <fcntl.h>
ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

But splice has limitations too. It is built on the Linux pipe buffer mechanism, so at least one of its two file descriptors must be a pipe.

splice provides a flow-control mechanism that blocks write requests according to a predefined watermark. Experiments have shown that transferring data from one disk to another this way increases throughput by 30%-70% and cuts CPU load by about half.

Drawbacks:

  • 1) Likewise only applicable to programs that do not need to process the data in user mode.
  • 2) At least one of the two descriptors must be a pipe.

Method 6: Copy-on-write

In some cases a kernel buffer may be shared by several processes. If one process writes to the shared area, the write call itself provides no locking, so the shared data could be corrupted. Linux uses copy-on-write to protect the data.

Copy-on-write means that when several processes share the same data and one of them needs to modify it, the data is first copied into that process's own address space. This does not affect how the other processes use the data. The copy happens only when a process actually needs to write, hence the name. This reduces system overhead to some extent: a process that never modifies the data it accesses never triggers a copy.

Drawbacks:

  • 1) MMU support is required. The MMU must know which pages in the process address space are read-only. When a write to one of these pages is attempted, an exception is raised to the operating system kernel, which allocates new storage to satisfy the write.
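One place where copy-on-write is visible from user space is a private file mapping (MAP_PRIVATE): the first write to a page makes the kernel copy that page for the writing process, and the file itself is not changed. The sketch below illustrates this; the helper name cow_first_byte is an assumption.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map the first len bytes of path privately,
 * overwrite the first byte in the mapping, then read the first byte of
 * the file itself. Returns the byte still stored in the file, or -1 on
 * error. */
int cow_first_byte(const char *path, size_t len, char new_byte)
{
    char on_file = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* This write faults; the kernel copies the page for this process. */
    p[0] = new_byte;
    int private_ok = (p[0] == new_byte);

    /* The file still holds the original byte. */
    ssize_t n = pread(fd, &on_file, 1, 0);

    munmap(p, len);
    close(fd);
    if (n != 1 || !private_ok)
        return -1;
    return (unsigned char)on_file;
}
```

If the process only ever read from the mapping, no page would be copied at all, which is the saving the method relies on.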

Method 7: Buffer sharing

This method rewrites the I/O operations entirely. Because the traditional I/O interfaces are all built on data copying, avoiding the copy means discarding the original interfaces and writing new ones, which makes this the most thorough zero-copy approach. One of the more mature schemes is fbuf (Fast Buffer), first implemented on Solaris.

The idea of fbuf is that each process maintains a buffer pool that is mapped into both the process address space and the kernel address space. The kernel and the user process share this pool, which avoids copying.

Drawbacks:

  • 1) Managing the shared buffer pool requires close cooperation between applications, network software, and device drivers.
  • 2) The rewritten API is still experimental.

High-performance network I/O framework: netmap

Based on the idea of shared memory, netmap is a high-performance framework for sending and receiving raw packets, developed by Luigi Rizzo and others. It consists of a kernel module and user-mode library functions. Its goal is high-performance packet transfer between user mode and the network card without modifying existing operating system software and without special hardware support.

Under the netmap framework the kernel owns a packet pool, so the packets on the send and receive rings do not need to be allocated dynamically. When data arrives at the network card, a packet buffer is taken from the pool, the data is placed in it, and its descriptor is put on the receive ring. The kernel's packet pool is mapped into user space with mmap. A user-mode program obtains the receive and send rings (netmap_ring) through netmap_if and uses them to receive and send packets.

Origin blog.csdn.net/baidu_39322753/article/details/104491598