Understand "zero copy" with just 8 pictures


Preface

Disks are among the slowest pieces of hardware in a computer system; their read and write speeds differ from memory by more than a factor of ten. Many technologies therefore exist to optimize disk I/O, such as zero copy, direct I/O, and asynchronous I/O, all aimed at improving system throughput. In addition, the disk cache area in the operating system kernel effectively reduces the number of disk accesses.

This time, we will use "file transfer" as the entry point to analyze how I/O works and how to optimize the performance of transferring files.


Main text

Why do we need DMA technology?

Before DMA technology, the I/O process was like this:

  • The CPU issues the corresponding instruction to the disk controller, then returns;
  • After the disk controller receives the instruction, it begins preparing the data, places the data into its internal buffer, and then raises an interrupt;
  • After the CPU receives the interrupt signal, it stops its current work, reads the data from the disk controller's buffer into its registers one byte at a time, and then writes the data from the registers into memory. Throughout this data transfer, the CPU cannot perform any other tasks.

In order to facilitate your understanding, I drew a picture:

As you can see, the entire data transfer requires the CPU to personally move the data, and while it is doing so, the CPU cannot do anything else.

Moving a few characters of data this way is fine, but if a gigabit network card or a hard disk needs to transfer large amounts of data and the CPU has to carry every byte, it will simply be too busy.

Once computer scientists realized how serious this problem was, they invented DMA, that is, Direct Memory Access technology.

What is DMA technology? Put simply: when data is transferred between an I/O device and memory, all of the transfer work is handed over to the DMA controller, and the CPU no longer takes part in anything related to moving the data, so it is free to handle other tasks.

So what does the data transfer process look like with a DMA controller? Let's go through it in detail below.

Specific process:

  • The user process calls the read method, sending an I/O request to the operating system to read data into its own memory buffer, and the process enters a blocked state;
  • After the operating system receives the request, it forwards the I/O request to the DMA controller and lets the CPU go on with other tasks;
  • The DMA controller forwards the I/O request to the disk;
  • The disk receives the I/O request and reads data from the disk into the disk controller's buffer. When that buffer is full, the disk raises an interrupt signal to the DMA controller to notify it that the buffer is full;
  • The DMA controller receives the disk's signal and copies the data from the disk controller's buffer into the kernel buffer. The CPU is not occupied at this point and can perform other tasks;
  • When the DMA controller has read enough data, it sends an interrupt signal to the CPU;
  • The CPU receives the DMA signal, knows the data is ready, copies the data from the kernel into user space, and the system call returns;

As you can see, the CPU no longer takes part in the actual moving of data during the transfer; DMA handles the whole process. The CPU is still essential, however, because it has to tell the DMA controller what data to transfer and where to transfer it.

In the early days, DMA existed only on the motherboard. Nowadays, because there are more and more I/O devices with different transfer requirements, each I/O device has its own DMA controller.


How bad is traditional file transfer?

If a server needs to provide file transfer, the simplest approach we can think of is to read the file from the disk and send it to the client over the network protocol.

The way traditional I/O works, data reads and writes are copied back and forth between user space and kernel space, while the data in kernel space is read from or written to the disk through the I/O interface at the operating system level.

The code usually looks like this, requiring two system calls:

read(file, tmp_buf, len);
write(socket, tmp_buf, len);

The code is very simple: only two lines, yet a lot happens behind them.
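To make this concrete, here is a minimal sketch of what the traditional transfer loop might look like in a server. The descriptor names and buffer size are hypothetical and error handling is abbreviated; it only shows where read() and write() sit in the flow.

#include <unistd.h>
#include <sys/types.h>

/* Hedged sketch: copy an already-opened file (file_fd) to an already-connected
 * socket (sock_fd) the traditional way. Each chunk travels
 * disk -> kernel buffer -> user buffer -> socket buffer -> network card. */
ssize_t copy_file_to_socket(int file_fd, int sock_fd)
{
    char tmp_buf[64 * 1024];                                      /* user-space buffer */
    ssize_t total = 0, n;

    while ((n = read(file_fd, tmp_buf, sizeof(tmp_buf))) > 0) {   /* system call #1 */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(sock_fd, tmp_buf + off, n - off);   /* system call #2 */
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}

Note that the two system calls are paid once per chunk, so the switches and copies described below repeat on every iteration of the loop.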

First of all, there are 4 context switches between user mode and kernel mode, because there are two system calls: one read() and one write(). Each system call first switches from user mode to kernel mode, and once the kernel has finished its task, switches back from kernel mode to user mode.

The cost of context switching is not small: a single switch takes tens of nanoseconds to a few microseconds. That sounds short, but in high-concurrency scenarios this kind of overhead accumulates and magnifies, hurting the system's performance.

Secondly, there are 4 data copies, two of which are done by DMA and the other two by the CPU. Let's walk through this process:

  • The first copy moves the data on the disk into the operating system kernel's buffer; this copy is carried out by DMA.
  • The second copy moves the data from the kernel buffer into the user's buffer, so that our application can use it; this copy is done by the CPU.
  • The third copy moves the data that was just placed in the user's buffer into the kernel's socket buffer; this is still handled by the CPU.
  • The fourth copy moves the data from the kernel's socket buffer into the network card's buffer; this is again carried out by DMA.

Looking back at this file transfer process: we only wanted to move one piece of data, yet it ended up being copied 4 times. Excessive data copying unavoidably consumes CPU resources and greatly reduces system performance.

This simple, traditional way of transferring files involves redundant context switches and data copies, which is very bad in a high-concurrency system: it adds a lot of unnecessary overhead and seriously hurts system performance.

Therefore, in order to improve the performance of file transfer, it is necessary to reduce the number of "context switching between user mode and kernel mode" and "memory copy" .


How to optimize the performance of file transfer?

Let's look first at how to reduce the number of "context switches between user mode and kernel mode".

Context switches happen when reading disk data because user space does not have permission to operate the disk or network card; only the kernel, which has the highest privilege, can operate these devices. So whenever a task has to be completed by the kernel, the process must go through the system call functions provided by the operating system.

A system call inevitably causes two context switches: first a switch from user mode to kernel mode, and then, once the kernel has finished the task, a switch back from kernel mode to user mode so the process code can continue.

Therefore, if you want to reduce the number of context switches, you must reduce the number of system calls .

Next, let's look at how to reduce the number of "data copies".

As we saw above, the traditional file transfer method goes through 4 data copies, and among them, "copying from the kernel's read buffer into the user's buffer, and then from the user's buffer into the socket buffer" is not necessary.

Because in a file transfer scenario we do not "reprocess" the data in user space, the data does not actually need to be moved into user space at all, so the user buffer is unnecessary.


How to achieve zero copy?

There are usually two ways to implement zero-copy technology:

  • mmap + write
  • sendfile

Let's talk about how they reduce the number of "context switching" and "data copy".

mmap + write

As we saw earlier, the read() system call copies the data from the kernel buffer into the user's buffer. To cut the cost of this step, we can replace the read() system call with mmap().

buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);  /* map the file instead of read()ing it */
write(sockfd, buf, len);

The mmap() system call "maps" the data in the kernel buffer directly into user space, so the operating system kernel and user space no longer need to copy data between them.

The specific process is as follows:

  • After the application process calls mmap(), DMA copies the disk data into the kernel buffer, and the application process then "shares" this buffer with the operating system kernel;
  • The application process then calls write(), and the operating system copies the data from the kernel buffer directly into the socket buffer. All of this happens in kernel mode, with the CPU doing the copy;
  • Finally, the data in the kernel's socket buffer is copied into the network card's buffer; this copy is carried out by DMA.
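Put together, a rough sketch of the steps above might look like the following. It assumes sockfd is an already-connected socket, the names are hypothetical, and a real server would loop over partial writes.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hedged sketch of mmap + write: map the file, then hand the mapping to write(). */
int send_file_mmap(const char *path, int sockfd)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    /* The mapping is backed by the kernel buffer (page cache),
     * so no read() copy into a separate user buffer is needed. */
    void *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /* The CPU still copies kernel buffer -> socket buffer inside this call. */
    ssize_t sent = write(sockfd, buf, st.st_size);

    munmap(buf, st.st_size);
    close(fd);
    return sent < 0 ? -1 : 0;
}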

We can see that by using mmap() in place of read(), we save one data copy.

But this is still not the ideal zero copy, because the data in the kernel buffer still has to be copied into the socket buffer by the CPU, and there are still 4 context switches, since there are still 2 system calls.

sendfile

Linux kernel version 2.1 introduced a system call dedicated to sending files: sendfile(). Its signature is as follows:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

Its first two parameters are the destination and source file descriptors, the latter two are the offset into the source and the number of bytes to copy, and the return value is the number of bytes actually copied.
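A minimal usage sketch is shown below. The descriptor and function names are hypothetical, and the loop is there because sendfile() may transfer fewer bytes than requested; error handling is abbreviated.

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Hedged sketch: send a whole file to an already-connected socket with sendfile(). */
ssize_t send_file(const char *path, int sockfd)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }

    off_t offset = 0;
    off_t left = st.st_size;
    while (left > 0) {
        /* One system call replaces the read() + write() pair. */
        ssize_t n = sendfile(sockfd, fd, &offset, left);
        if (n <= 0)
            break;
        left -= n;
    }

    close(fd);
    return left == 0 ? st.st_size : -1;
}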

First, it replaces the earlier read() and write() pair, removing one system call and therefore two context switches' worth of overhead.

Second, this system call copies the data in the kernel buffer directly into the socket buffer without ever copying it into user space, so there are only 2 context switches and 3 data copies, as shown below:

But this is still not true zero-copy technology. If the network card supports SG-DMA (Scatter-Gather Direct Memory Access, as opposed to ordinary DMA), we can go one step further and eliminate the CPU copy from the kernel buffer into the socket buffer.

You can use the following command on your Linux system to check whether the network card supports the scatter-gather feature:

$ ethtool -k eth0 | grep scatter-gather
scatter-gather: on

So, starting from Linux kernel version 2.4, when the network card supports SG-DMA technology, the sendfile() system call's process changes slightly; the specific process is as follows:

  • The first step is to copy the data on the disk to the kernel buffer through DMA;
  • In the second step, only the buffer descriptor and data length are passed to the socket buffer, so that the network card's SG-DMA controller can copy the data from the kernel buffer directly into the network card's buffer. The data itself is never copied from the kernel buffer into the socket buffer, which removes one more data copy;

Therefore, during this process, only two data copies were made, as shown in the figure below:

This is the so-called zero-copy technology, because we no longer copy data at the memory level using the CPU; the CPU does not move data at any point in the process, and all transfers are carried out by DMA.

Compared with the traditional file transfer method, zero-copy file transfer cuts the context switches and data copies in half: only 2 context switches and 2 data copies are needed to complete the transfer, and neither copy goes through the CPU; both are carried out by DMA.

Overall, zero-copy technology can therefore at least double file transfer performance.

Projects using zero-copy technology

In fact, the open source project Kafka uses zero-copy technology to greatly increase its I/O throughput, which is one of the reasons Kafka can process massive amounts of data so quickly.

If you trace Kafka's file transfer code, you will find that it ultimately calls the transferTo method of the Java NIO library:

@Override
public long transferFrom(FileChannel fileChannel, long position, long count) throws IOException {
    return fileChannel.transferTo(position, count, socketChannel);
}

If the Linux system supports the sendfile() system call, then transferTo() will actually end up using sendfile().

Someone once wrote a test program comparing traditional file copying with zero-copy file transfer under the same hardware conditions. As the test data in the chart below shows, using zero copy shortened the transfer time by 65%, greatly improving the machine's data transmission throughput.

Data source: https://developer.ibm.com/articles/j-zerocopy/

In addition, Nginx also supports zero-copy technology. It is usually enabled by default, which helps improve the efficiency of file transfer. The configuration that controls whether zero copy is enabled looks like this:

http {
    ...
    sendfile on;
    ...
}

The specific meaning of sendfile configuration:

  • Setting it to on means files are transferred with zero-copy technology (sendfile), so only 2 context switches and 2 data copies are needed.
  • Setting it to off means files are transferred the traditional way (read + write), which requires 4 context switches and 4 data copies.

Of course, to use sendfile, the Linux kernel version must be 2.1 or higher.


What does PageCache do?

Looking back at the file transfer process described earlier, the first step copies the file data from the disk into the "kernel buffer", and this kernel buffer is actually the disk cache, PageCache.

Because zero copy uses PageCache technology, its performance is improved even further. Let's look at how PageCache achieves this.

Reading and writing a disk is much slower than reading and writing memory, so we should find a way to replace "read/write disk" with "read/write memory". To do that, we move data from the disk into memory via DMA, so that reads from disk become reads from memory.

However, memory is much smaller than disk, so it is only ever able to hold a small portion of the data on disk.

So the question is, which disk data should be copied to the memory?

We know that programs exhibit "locality" while running: data that has just been accessed has a high probability of being accessed again soon. So we can use PageCache to cache recently accessed data, and when space runs low, evict the cache entries that have gone unaccessed the longest.

So when reading disk data, the kernel first looks for it in PageCache. If the data is there, it is returned directly; if not, it is read from the disk and then cached in PageCache.

Another point: reading disk data requires finding its location first. For a mechanical disk that means moving the head and waiting for the platter to rotate to the sector holding the data, and only then reading it "sequentially". This physical motion is very time-consuming, and to reduce its impact, PageCache uses "read-ahead".

For example, suppose the read method fetches 32 KB at a time. Even though read initially only asks for bytes 0~32 KB, the kernel also reads the following 32~64 KB into PageCache, so reading 32~64 KB later is very cheap. If the process reads that range before it is evicted from PageCache, the benefit is substantial.

Therefore, PageCache has two main advantages:

  • Caching recently accessed data;
  • Read-ahead;

These two practices will greatly improve the performance of reading and writing disks.
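Both behaviors happen automatically in the kernel, but an application can cooperate with them. The sketch below is not from the original article; it simply uses posix_fadvise() to hint that a file will be read sequentially, which encourages the kernel to read ahead more aggressively (the hint is advisory and may be ignored).

#include <fcntl.h>
#include <unistd.h>

/* Hedged sketch: hint PageCache before a sequential scan of a file. */
int read_sequentially(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* Advisory hint: the whole file (offset 0, length 0) will be read sequentially,
     * so the kernel may use a larger read-ahead window. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[32 * 1024];
    while (read(fd, buf, sizeof(buf)) > 0) {
        /* process the 32 KB chunk; the next chunk is likely
         * already in PageCache thanks to read-ahead */
    }

    close(fd);
    return 0;
}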

However, when transferring large files (GB-scale files), PageCache does not help: the extra DMA copy into it is wasted and performance drops. Even zero copy, which relies on PageCache, loses performance in this case.

This is because when there are many GB-scale files to transfer, every time users access these large files the kernel loads them into PageCache, so PageCache is quickly filled up by them.

In addition, because the file is too large, the probability of some parts of the file data being accessed again is relatively low, which will cause two problems:

  • Because PageCache is occupied by large files for long periods, other "hot" small files may not be able to make full use of PageCache, so disk read and write performance drops;
  • The large-file data in PageCache gains no caching benefit, yet still costs an extra DMA copy into PageCache;

Therefore, PageCache should not be used when transferring large files, which means zero-copy technology should not be used either: large files would monopolize PageCache, "hot" small files could no longer benefit from it, and in a high-concurrency environment this causes serious performance problems.


What method is used to transfer large files?

For the transfer of large files, what method should we use?

Let's take a look at the initial example. When the read method is called to read the file, the process will actually be blocked in the read method call, because it has to wait for the return of the disk data, as shown in the following figure:

Specific process:

  • When the read method is called, the process blocks. The kernel issues an I/O request to the disk; after the disk receives the request it performs the seek, and once the data is ready it raises an I/O interrupt to notify the kernel that the disk data is ready;
  • After the kernel receives the I/O interrupt, it copies the data from the disk controller buffer to PageCache;
  • Finally, the kernel copies the data in PageCache to the user buffer, so the read call returns normally.

For the problem of blocking, asynchronous I/O can be used to solve it, and it works as follows:

It divides the read operation into two parts:

  • In the first half, the kernel initiates a read request to the disk, but it can return without waiting for the data to be in place , so the process can handle other tasks at this time;
  • In the second half, once the kernel has copied the data from the disk into the process buffer, the process receives the kernel's notification and goes on to process the data;

Moreover, we can see that asynchronous I/O does not involve PageCache, so using asynchronous I/O means bypassing PageCache.

I/O that bypasses PageCache is called direct I/O, and I/O that uses PageCache is called cache I/O. Generally, for disks, asynchronous I/O only supports direct I/O.

As mentioned earlier, PageCache should not be used for the transfer of large files, because PageCache may be occupied by large files, so that "hot" small files cannot be used by PageCache.

Therefore, in high-concurrency scenarios, "asynchronous I/O + direct I/O" should be used instead of zero-copy technology for the transmission of large files .

There are two common direct I/O application scenarios:

  • The application implements its own caching of disk data, so there is no need for PageCache to cache it again, avoiding the extra overhead. In the MySQL database, direct I/O can be enabled through a parameter setting; it is not enabled by default;
  • When transferring large files: large files rarely hit the PageCache cache, yet they fill it up so that "hot" files cannot make full use of the cache, which increases the performance overhead. Direct I/O should be used in this case.

In addition, because direct I/O bypasses PageCache, it cannot benefit from these two kernel optimizations:

  • The kernel's I/O scheduling algorithm queues up as many I/O requests as possible in PageCache and then "merges" them into a larger I/O request before sending it to the disk, in order to reduce disk seek operations;
  • The kernel also "reads ahead" subsequent I/O into PageCache, again to reduce disk operations;

Therefore, when transferring large files, use "asynchronous I/O + direct I/O" to read the file without blocking.
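To make this concrete, here is a rough sketch of what "asynchronous I/O + direct I/O" can look like. The original article does not prescribe a specific API, so this example uses io_uring via liburing purely as one possible illustration (link with -luring); the 4 KB alignment, the names, and the single-read structure are assumptions, and error handling is abbreviated.

#define _GNU_SOURCE          /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <liburing.h>

#define BLOCK_ALIGN 4096     /* O_DIRECT needs block-aligned buffer, offset and length */

int read_block_direct_async(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECT);   /* direct I/O: bypass PageCache */
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, BLOCK_ALIGN, BLOCK_ALIGN) != 0) {
        close(fd);
        return -1;
    }

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0) {
        free(buf);
        close(fd);
        return -1;
    }

    /* First half: queue the read and return immediately;
     * the process is free to do other work instead of blocking. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, BLOCK_ALIGN, 0);
    io_uring_submit(&ring);

    /* ... do other work here ... */

    /* Second half: reap the completion once the kernel has
     * DMA'd the data straight into our buffer. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    int res = cqe->res;                         /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    free(buf);
    close(fd);
    return res;
}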

Therefore, when transferring files, we have to use different methods according to the size of the file:

  • When transferring large files, use "Asynchronous I/O + Direct I/O";
  • When transferring small files, use "Zero Copy Technology";

In nginx, we can use the following configuration to use different methods according to the size of the file:

location /video/ { 
    sendfile on; 
    aio on; 
    directio 1024m; 
}

When the file size exceeds the directio value, "asynchronous I/O + direct I/O" is used; otherwise, zero-copy technology is used.


Summary

In early I/O operations, the work of moving data between memory and disk was done by the CPU itself, and while the CPU was doing it, it could not perform other tasks, wasting CPU resources.

DMA technology appeared to solve this problem. Each I/O device has its own DMA controller: the CPU only needs to tell the DMA controller what data to transfer and where to transfer it, and can then confidently go off and do something else. The actual data transfer that follows is completed by the DMA controller, without the CPU's involvement.

The traditional I/O approach reads data from the hard disk and then sends it out through the network card, requiring 4 context switches and 4 data copies. Two of the copies happen between buffers in memory and the corresponding hardware devices and are done by DMA; the other two happen between kernel mode and user mode and are done by the CPU.

Zero-copy technology emerged to improve file transfer performance. It combines the disk-read and network-send operations into a single system call (sendfile), reducing the number of context switches. In addition, all data copying happens inside the kernel, which naturally reduces the number of data copies.

Both Kafka and Nginx implement zero-copy technology, which will greatly improve the performance of file transfer.

Zero-copy technology is built on PageCache. PageCache caches recently accessed data, improving performance when cached data is accessed again. At the same time, to work around the slow seeks of mechanical hard disks, it also assists the I/O scheduling algorithm with I/O merging and read-ahead, which is also why sequential reads perform better than random reads. These advantages further enhance the performance of zero copy.

Note that zero-copy technology does not allow the process to further process the file content along the way, for example compressing the data before sending it.

Also, zero copy should not be used when transferring large files, because PageCache may become occupied by large files, preventing "hot" small files from using PageCache, while the large files themselves get a poor cache hit rate. In that case the "asynchronous I/O + direct I/O" approach should be used instead.

In Nginx, you can configure a file size threshold: files above the threshold are transferred with asynchronous I/O and direct I/O, and files below it with zero copy.


Closing words

Hello everyone, I'm Xiaolin, the tool person who draws illustrations for you. If you found this article helpful, please share it with your friends; that means a lot to Xiaolin. Thank you, and see you next time!


Recommended reading

To understand the "file system" in one go, rely on these 25 pictures

What happened during the operating system when I typed the letter A on the keyboard...

Original article: blog.csdn.net/qq_34827674/article/details/108756999