Zero-copy and mmap

The disk is arguably one of the slowest pieces of hardware in a computer system; its read and write speeds are at least an order of magnitude slower than memory. Many techniques therefore exist to optimize disk I/O, such as zero copy, direct I/O, and asynchronous I/O, all aimed at improving system throughput. In addition, the disk cache inside the operating system kernel effectively reduces the number of disk accesses.

This article uses "file transfer" as an entry point to analyze how I/O works and how to optimize the performance of transferring files.


Why use DMA technology?

Before DMA technology, the I/O process was like this:

  • The CPU issues the corresponding instruction to the disk controller and then goes back to other work;
  • After the disk controller receives the instruction, it starts preparing the data, puts the data into its internal buffer, and then raises an interrupt;
  • After the CPU receives the interrupt signal, it stops what it is doing, reads the data from the disk controller's buffer into its own registers one byte at a time, and then writes the data from the registers into memory. During this period, the CPU cannot perform other tasks.

It can be seen that the entire data transmission process requires the CPU to personally participate in the process of moving data, and in this process, the CPU cannot do other things.

Moving a few characters of data this way is no problem, but when large amounts of data are moved through a gigabit network card or a hard disk, having the CPU do all the moving keeps it far too busy.

After computer scientists realized the seriousness of the problem, they invented DMA, or Direct Memory Access, technology.

What is DMA technology? Simply put, when transferring data between an I/O device and memory, the data transfer work is handed entirely to the DMA controller, and the CPU no longer participates in anything related to moving the data, so it is free to handle other tasks.

So what exactly is the process of using the DMA controller for data transfer? Let's take a look at it in detail.

Specific process:

  • The user process calls the read method, sends an I/O request to the operating system, and requests to read data into its own memory buffer, and the process enters a blocked state;
  • After the operating system receives the request, it further sends the I/O request to DMA, and then lets the CPU perform other tasks;
  • DMA further sends I/O requests to disk;
  • The disk receives the I/O request from the DMA and reads the data from the disk into the disk controller's buffer. When the buffer is full, the disk controller sends an interrupt signal to the DMA to inform it that the buffer is full;
  • DMA receives the signal from the disk and copies the data in the disk controller buffer to the kernel buffer. At this time, the CPU is not occupied, and the CPU can perform other tasks;
  • When the DMA has read enough data, it will send an interrupt signal to the CPU;
  • The CPU receives the DMA signal and knows that the data is ready, so it copies the data from the kernel to the user space, and the system call returns;

It can be seen that during the entire data transfer, the CPU no longer does the copying itself; the whole process is completed by the DMA. The CPU is still essential, though, because it must tell the DMA controller what data to transfer and where to transfer it from and to.

How bad is traditional file transfer?

If the server wants to provide the function of file transfer, the simplest way we can think of is: read the file on the disk, and then send it to the client through the network protocol.

The way traditional I/O works is that data is copied back and forth between user space and kernel space, and data in kernel space is read from or written to the disk through the operating system's I/O interface.
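
A minimal sketch of this traditional path, assuming file_fd is an already-open file and sock_fd is a connected socket (both hypothetical names): read() copies from the kernel buffer into the user buffer, and write() copies from the user buffer into the kernel's socket buffer.

#include <unistd.h>

/* Traditional transfer: every chunk crosses the user/kernel boundary twice. */
ssize_t copy_file_to_socket(int file_fd, int sock_fd) {
    char buf[4096];
    ssize_t n, total = 0;
    while ((n = read(file_fd, buf, sizeof(buf))) > 0) {  /* disk -> kernel -> user */
        if (write(sock_fd, buf, n) != n)                 /* user -> kernel socket buffer */
            return -1;                                   /* (partial writes ignored for brevity) */
        total += n;
    }
    return n < 0 ? -1 : total;
}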

First of all, there are 4 context switches between user mode and kernel mode, because there are two system calls: one read() and one write(). Each system call must first switch from user mode to kernel mode, and after the kernel completes the task, switch back from kernel mode to user mode.

The cost of a context switch is not small. A single switch takes tens of nanoseconds to a few microseconds. Although that seems short, in high-concurrency scenarios this overhead is easily accumulated and amplified, and it ends up affecting system performance.

Secondly, there are 4 data copies, two of which are DMA copies, and the other two are copied by the CPU. Let's talk about the process below:

  • For the first copy, the data on the disk is copied into the operating system kernel's buffer; this copy is carried out by DMA.
  • For the second copy, the data in the kernel buffer is copied into the user's buffer so our application can use it; this copy is carried out by the CPU.
  • For the third copy, the data just placed in the user's buffer is copied into the kernel's socket buffer; this copy is again carried out by the CPU.
  • For the fourth copy, the data in the kernel's socket buffer is copied into the network card's buffer; this copy is carried out by DMA.

Looking back at the file transfer process, we only moved one piece of data, but ended up moving it four times. Excessive data copying will undoubtedly consume CPU resources and greatly reduce system performance.

This simple and traditional file transfer method has redundant context switching and data copying, which is very bad in a high-concurrency system, adding a lot of unnecessary overhead and seriously affecting system performance.

Therefore, in order to improve the performance of file transfer, it is necessary to reduce the number of "context switching between user mode and kernel mode" and "memory copy".

How to optimize the performance of file transfer?

Let's take a look first, how to reduce the number of "context switching between user mode and kernel mode"?

When reading disk data, context switches occur because user space has no permission to operate the disk or the network card; only the kernel, which has the highest privilege, can operate these devices. So whenever an application needs the kernel to complete such a task, it has to go through a system call provided by the operating system.

And each system call inevitably causes two context switches: first a switch from user mode to kernel mode, and then, after the kernel finishes the task, a switch back to user mode so the process code can continue executing.

Therefore, in order to reduce the number of context switches, it is necessary to reduce the number of system calls.

Let's take a look again, how to reduce the number of "data copies"?

As we saw earlier, the traditional file transfer method involves 4 data copies. Among them, "copying from the kernel's read buffer into the user's buffer, and then from the user's buffer into the socket buffer" is not actually necessary.

Because in the application scenario of file transfer, we do not "reprocess" the data in the user space, so the data does not actually need to be moved to the user space, so the user's buffer is unnecessary.

How to achieve zero copy?

There are usually two ways to implement zero-copy technology:

  • mmap + write
  • sendfile

Let's talk about how they reduce the number of "context switching" and "data copying".

mmap + write:

We know earlier that the read() system call will copy the data in the kernel buffer to the user's buffer, so in order to reduce the overhead of this step, we can replace the read() system call function with mmap().

The mmap() system call function will directly "map" the data in the kernel buffer to the user space, so that the operating system kernel and user space do not need to perform any data copy operations.

The specific process is as follows:

  • After the application process calls mmap(), DMA will copy the disk data to the kernel buffer. Then, the application process "shares" this buffer with the operating system kernel;
  • The application process calls write() again, and the operating system directly copies the data in the kernel buffer to the socket buffer. All this happens in the kernel state, and the CPU carries the data;
  • Finally, copy the data in the socket buffer of the kernel to the buffer of the network card, and this process is carried by DMA.

We can know that by using mmap() instead of read(), the process of one data copy can be reduced.
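
A minimal sketch of the mmap() + write() approach, assuming file_fd is an open file of size len and sock_fd is a connected socket (hypothetical names):

#include <sys/mman.h>
#include <unistd.h>

ssize_t mmap_send(int file_fd, int sock_fd, size_t len) {
    /* Map the file's page-cache pages into user space; no read() copy. */
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, file_fd, 0);
    if (p == MAP_FAILED)
        return -1;
    /* write() still copies from the kernel buffer to the socket buffer;
     * a real program would also loop over partial writes. */
    ssize_t n = write(sock_fd, p, len);
    munmap(p, len);
    return n;
}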

But this is not yet the ideal zero copy, because the CPU still has to copy the data from the kernel buffer to the socket buffer, and there are still 4 context switches, since there are still 2 system calls (mmap() and write()).

sendfile:

Linux kernel 2.2 introduced a dedicated system call for sending files, sendfile(). Its prototype is as follows:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

Its first two parameters are the file descriptors of the destination and the source respectively; the latter two parameters are the offset within the source and the number of bytes to copy; the return value is the number of bytes actually copied.

First of all, it can replace the previous two system calls, read() and write(), so that one system call can be reduced, and the overhead of two context switches can also be reduced.

Secondly, this system call can copy the data in the kernel buffer directly into the socket buffer without going through user space, so there are only 2 context switches and 3 data copies.
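
A minimal usage sketch of sendfile(), assuming file_fd is an open file of size len and sock_fd is a connected socket (hypothetical names):

#include <sys/sendfile.h>
#include <sys/types.h>

ssize_t sendfile_loop(int sock_fd, int file_fd, size_t len) {
    off_t offset = 0;
    while ((size_t)offset < len) {
        /* The kernel copies file data toward the socket without a user-space buffer. */
        ssize_t n = sendfile(sock_fd, file_fd, &offset, len - offset);
        if (n <= 0)
            return -1;   /* a real program would also handle EINTR/EAGAIN */
    }
    return offset;
}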

But this is still not true zero-copy technology. If the network card supports SG-DMA (Scatter-Gather Direct Memory Access, as opposed to ordinary DMA), we can go further and eliminate the CPU copy from the kernel buffer to the socket buffer.

You can run the following command on your Linux system to check whether the network card supports the scatter-gather feature:

$ ethtool -k eth0 | grep scatter-gather
scatter-gather: on

Therefore, starting from Linux kernel version 2.4, when the network card supports SG-DMA technology, the process of the sendfile() system call changes somewhat. The specific process is as follows:

  • The first step is to copy the data on the disk to the kernel buffer through DMA;
  • In the second step, the buffer descriptor and the data length are passed to the socket buffer, so that the network card's SG-DMA controller can copy the data directly from the kernel buffer into the network card's buffer. This removes the copy from the kernel buffer to the socket buffer, reducing the number of data copies by one;

Therefore, in this process, only two data copies are performed.

This is the so-called zero-copy (Zero-copy) technology, because we do not copy data at the memory level, that is to say, the CPU does not move data throughout the process, and all data is transferred through DMA.

Compared with the traditional file transfer method, zero-copy file transfer halves the number of context switches and data copies: only 2 context switches and 2 data copies are needed to complete the transfer, and neither copy goes through the CPU; both are carried out by DMA.

Therefore, in general, zero-copy technology can improve the performance of file transfer by at least double.

Projects using zero-copy technology

In fact, Kafka, an open source project, uses "zero-copy" technology, which greatly improves the I/O throughput rate, which is one of the reasons why Kafka processes massive data so fast.

If you trace Kafka's file transfer code, you will find that it ultimately calls the transferTo method:

@Override
public long transferFrom(FileChannel fileChannel, long position, long count) throws IOException {
    return fileChannel.transferTo(position, count, socketChannel);
}

If the Linux system supports the sendfile() system call, then transferTo() will actually end up using sendfile().

In addition, Nginx also supports zero-copy technology. Generally, zero-copy technology is enabled by default, which is conducive to improving the efficiency of file transfer. The configuration of whether to enable zero-copy technology is as follows:

http {
...
    sendfile on;
...
}

The specific meaning of sendfile configuration:

  • Setting it to on means files are transferred using the zero-copy sendfile path, so only 2 context switches and 2 data copies are needed.
  • Setting it to off means files are transferred using the traditional read + write path, which requires 4 context switches and 4 data copies.

Of course, to use sendfile, the Linux kernel version must be 2.2 or higher.


What does PageCache do?

Looking back at the file transfer process mentioned above, the first step is to copy the disk file data into the "kernel buffer", which is actually a disk cache (PageCache).

Since zero copy is built on top of PageCache, PageCache further improves its performance. Let's look at how PageCache does this.

Reading and writing disks is much slower than reading and writing memory, so we should find a way to replace "reading and writing disks" with "reading and writing memory". Therefore, we will move the data in the disk to the memory through DMA, so that the read disk can be replaced with the read memory.

However, the memory space is much smaller than that of the disk, and the memory is destined to only copy a small part of the data on the disk.

The question is, which disk data should be copied to memory?

We all know that running programs exhibit "locality": data that has just been accessed has a high probability of being accessed again in a short period of time. So we can use PageCache to cache recently accessed data, and when space runs out, evict the cached data that has gone unaccessed the longest.

Therefore, when reading disk data, look for it first in PageCache. If the data exists, it can be returned directly; if not, it will be read from disk and then cached in PageCache.

Another point: when reading disk data, the data has to be located first. For a mechanical disk, that means rotating the platter and moving the head to the sector where the data lives before reading it "sequentially", and these physical actions are very time-consuming. To reduce their impact, PageCache uses a "read-ahead" mechanism.

For example, suppose the read method reads only 32 KB at a time. Although the first read asks only for bytes 0-32 KB, the kernel also reads the following 32-64 KB into PageCache, so a later read of 32-64 KB is very cheap. If the process reads that range before it is evicted from PageCache, the benefit is large.
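
A small sketch of how an application can cooperate with this behavior, assuming file_fd is an already-open descriptor (hypothetical name); posix_fadvise() hints do not change correctness, they only advise the kernel's page cache and read-ahead:

#include <fcntl.h>

void hint_sequential(int file_fd) {
    /* We will read the file sequentially; Linux typically responds by
     * enlarging the read-ahead window for this descriptor. */
    posix_fadvise(file_fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* Optionally ask the kernel to start populating PageCache now. */
    posix_fadvise(file_fd, 0, 0, POSIX_FADV_WILLNEED);
}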

Therefore, the advantages of PageCache are mainly two:

  • Cache recently accessed data;
  • read-ahead function;

These two practices will greatly improve the performance of reading and writing to disk.

However, when transferring large files (GB-level files), PageCache stops helping: it wastes an extra data copy and causes performance degradation. Even zero copy, which relies on PageCache, loses performance in this case.

This is because if you have many GB-level files to transfer, whenever users access these large files, the kernel will load them into PageCache, so the PageCache space is quickly filled by these large files.

In addition, due to the large size of the file, the probability of some parts of the file data being accessed again may be relatively low, which will cause two problems:

  • Because PageCache is occupied by large files for a long time, other "hot" small files may not be able to fully use PageCache, so the performance of disk read and write will drop;
  • The large file data in PageCache does not enjoy the benefits of caching, but it takes DMA to copy it to PageCache once more;

Therefore, for the transfer of large files, PageCache should not be used, which also means zero-copy technology should not be used, because PageCache would be occupied by the large files, preventing "hot" small files from using it; in a high-concurrency environment this causes serious performance problems.

How to transfer large files?

So for the transfer of large files, what method should we use?

Let's go back to the initial example first. When calling the read method to read a file, the process actually blocks on the read call, because it has to wait for the disk data to be returned.

Specific process:

  • When the read method is called, it blocks. The kernel initiates an I/O request to the disk; after the disk receives the request, it seeks to the data, and when the disk data is ready it raises an I/O interrupt to notify the kernel that the data is ready;
  • After the kernel receives the I/O interrupt, it copies the data from the disk controller buffer to PageCache;
  • Finally, the kernel copies the data in the PageCache to the user buffer, so the read call returns normally.

For the blocking problem, asynchronous I/O can be used to solve it. It works as follows:

It divides the read operation into two parts:

  • In the first half, the kernel initiates a read request to the disk but can return without waiting for the data to arrive, so the process can handle other tasks in the meantime;
  • In the second half, when the kernel has copied the data from the disk into the process buffer, the process receives a notification from the kernel and then processes the data;

Moreover, we can find that asynchronous I/O does not involve PageCache, so using asynchronous I/O means bypassing PageCache.
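
As an illustration of this two-part programming model only, here is a minimal POSIX AIO sketch; note that glibc's POSIX AIO is implemented with user-space threads rather than true kernel asynchronous I/O, and the file name bigfile.bin is an assumption (compile with -lrt on older glibc).

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("bigfile.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(64 * 1024);
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = 64 * 1024;
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }  /* first half: submit and return */

    while (aio_error(&cb) == EINPROGRESS)
        ;   /* the process is free to do other work here instead of spinning */

    ssize_t n = aio_return(&cb);                              /* second half: collect the result */
    printf("read %zd bytes asynchronously\n", n);
    close(fd);
    free(buf);
    return 0;
}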

I/O that bypasses PageCache is called direct I/O, and I/O that uses PageCache is called cached I/O. Typically, for disks, asynchronous I/O only supports direct I/O.

As mentioned earlier, the transmission of large files should not use PageCache, because the PageCache may be occupied by large files, and the "hot spot" small files cannot use PageCache.

Therefore, in high-concurrency scenarios, for the transmission of large files, "asynchronous I/O + direct I/O" should be used instead of zero-copy technology.

There are two common types of direct I/O application scenarios:

  • If the application has implemented its own cache of disk data, there is no need to cache the data again in PageCache, which avoids extra performance loss. In the MySQL database, direct I/O can be enabled through a parameter setting; it is not enabled by default;
  • When transferring large files, large files rarely hit the PageCache cache, and they fill up PageCache so that "hot" files cannot make full use of it, increasing the performance overhead. Therefore, direct I/O should be used in this case.

In addition, because direct I/O bypasses PageCache, you cannot enjoy the optimization of these two points of the kernel:

  • The kernel's I/O scheduling algorithm will cache as many I/O requests as possible in PageCache, and finally "merge" into a larger I/O request and send it to the disk. This is done to reduce disk addressing operations;
  • The kernel will also "pre-read" subsequent I/O requests and place them in the PageCache, also to reduce disk operations;

Therefore, when transferring large files, use "asynchronous I/O + direct I/O", and you can read files without blocking.
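
A minimal sketch of the direct I/O half of this approach, with the file name bigfile.bin and the 4096-byte alignment as assumptions; O_DIRECT requires both the user buffer and the I/O size to be suitably aligned (often to the device's logical block size):

#define _GNU_SOURCE   /* needed for O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("bigfile.bin", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) {  /* aligned buffer for O_DIRECT */
        perror("posix_memalign");
        return 1;
    }

    ssize_t n = read(fd, buf, 4096);              /* bypasses PageCache entirely */
    printf("read %zd bytes with direct I/O\n", n);

    free(buf);
    close(fd);
    return 0;
}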

Therefore, when transferring files, we need to use different methods according to the size of the file:

  • When transferring large files, use "asynchronous I/O + direct I/O";
  • When transferring small files, use "zero copy technology";

In nginx, we can use the following configuration to use different methods according to the size of the file:

location /video/ { 
    sendfile on; 
    aio on; 
    directio 1024m; 
}

When the file size is greater than the directio value, "asynchronous I/O + direct I/O" is used; otherwise, "zero-copy technology" is used.

Summary

In early I/O operations, data transfer between memory and disk was done entirely by the CPU, and during that time the CPU could not perform other tasks, which wasted CPU resources.

Therefore, to solve this problem, DMA technology appeared. Each I/O device has its own DMA controller. The CPU only needs to tell the DMA controller what data to transfer, where it comes from, and where it should go, and can then move on with confidence; the subsequent data transfer work is completed by the DMA controller, and the CPU does not need to participate in it.

The traditional I/O approach reads data from the hard disk and then sends it out through the network card, which requires 4 context switches and 4 data copies. Two of the copies occur between memory buffers and the corresponding hardware devices and are done by DMA; the other two occur between kernel mode and user mode and are done by the CPU.

To improve file transfer performance, zero-copy technology emerged. It combines the two operations of reading from disk and sending over the network into one system call (sendfile), reducing the number of context switches. In addition, the copying happens entirely inside the kernel, which naturally reduces the number of data copies.

Both Kafka and Nginx implement zero-copy technology, which will greatly improve the performance of file transfers.

Zero-copy technology is built on PageCache. PageCache caches recently accessed data, improving performance when that data is accessed again. At the same time, to mitigate the slow seeking of mechanical hard disks, it works with the I/O scheduling algorithm to merge I/O requests and perform read-ahead; this is also why sequential reads perform better than random reads. These advantages further improve the performance of zero copy.

It should be noted that the zero-copy technology does not allow the process to further process the file content, such as compressing the data before sending it.

In addition, zero copy should not be used when transferring large files, because the "hot" small files may be unable to use PageCache once it is occupied by large files, and the cache hit rate of large files is not high anyway; in that case the "asynchronous I/O + direct I/O" approach should be used.

In Nginx, you can set a file size threshold through configuration, use asynchronous IO and direct IO for large files, and use zero copy for small files.

Why is Kafka so fast?

  • Sequential reads and writes + mmap: using the mmap API provided by the Linux kernel, mmap maps the disk file into memory with read and write support, and operations on that memory are reflected in the disk file. The data is not written to disk in real time, though; the operating system only actually writes it to the hard disk when the program actively calls flush. Kafka also provides a producer-side parameter to control this: flushing after writing to mmap and only then returning to the producer is called synchronous, while returning to the producer immediately after writing to mmap is called asynchronous.
  • Zero copy: using the sendfile API provided by the Linux kernel, sendfile sends data that has been read into kernel space directly to the socket buffer for network transmission.

What is mmap?

mmap is one way to operate devices. "Operating a device", whether an I/O port (lighting up an LED), an LCD controller, or a disk controller, really means reading and writing data at the device's physical address.

However, since the application program cannot directly manipulate the device hardware address, the operating system provides such a mechanism - memory mapping, which maps the device address to the process virtual address, and mmap is the interface for implementing memory mapping.

The advantage of mmap is that it maps device memory into virtual memory, so when the user operates that virtual memory it is equivalent to operating the device directly. This eliminates the copy between user space and kernel space and increases data throughput compared with ordinary read/write I/O operations.
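
A minimal sketch of this idea for a regular file, assuming data.bin already exists and is at least 4096 bytes long (both assumptions): plain memory writes through the mapping end up in the underlying file, with msync() forcing the dirty page back to disk.

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDWR);
    if (fd < 0) return 1;

    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) return 1;

    memcpy(p, "hello", 5);     /* ordinary memory writes, no write() call      */
    msync(p, 4096, MS_SYNC);   /* flush the dirty page back to the file on disk */

    munmap(p, 4096);
    close(fd);
    return 0;
}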

What is memory mapping?

Since mmap is the interface for implementing memory mapping, what exactly is memory mapping?

Each process has an independent address space. Through the page table and the MMU, a virtual address can be translated into a physical address. Each process has its own page table data, which explains why the same virtual address in two different processes can correspond to different physical addresses.

What is a virtual address space?

On 32-bit Linux, each process has a 4 GB virtual address space: 3 GB of user space plus 1 GB of kernel space. Kernel space is shared by all processes, while each process has its own independent user space.

The driver runs in the kernel space, so the driver is oriented to all processes.

There are two ways to switch from user space to kernel space:

(1) System call, that is, soft interrupt

(2) Hardware interrupt

What is inside the virtual address space?

Now that we understand what a virtual address space is, what does the virtual address space contain?

The virtual address space roughly contains the process's code, data, heap, memory-mapping segment, and stack. Memory mapping maps the device address into the memory-mapping segment; exactly which address it is mapped to is allocated by the operating system. The operating system divides the process address space into three kinds of regions:

(1) Unallocated, that is, an address that has not been used by the process

(2) cached, pages cached in ram

(3) Uncached, not cached in ram

The operating system will allocate a virtual address in the unallocated address space to establish a mapping with the device address.
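
A small sketch of this allocation in action: after an mmap() call, the kernel picks an unused virtual address in the memory-mapping segment, and the process's own /proc/self/maps shows where the new region landed (the anonymous mapping here stands in for a device or file mapping).

#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("mapping placed at %p\n\n", p);

    /* Dump this process's mapping table; the new region appears in the
     * memory-mapping segment between the heap and the stack. */
    FILE *f = fopen("/proc/self/maps", "r");
    char line[256];
    while (f && fgets(line, sizeof(line), f))
        fputs(line, stdout);
    if (f) fclose(f);
    return 0;
}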

The page cache is an important disk cache in the Linux kernel, implemented purely through a software mechanism, but its principle is basically the same as that of a hardware cache: part of the data on a large, slow device is kept on a small, fast device, so the fast device serves as a cache for the slow one. When data on the slow device is accessed, it can be fetched directly from the cache without touching the slow device, saving overall access time.

The page cache caches data at page granularity. It keeps the most commonly used and most important disk data in part of physical memory, so that when the system accesses a block device it can get the data directly from main memory rather than from the disk.

In most cases, the kernel uses the page cache when reading from and writing to disk. When the kernel reads a file, it first checks whether the requested data already exists in the page cache; if it does not, a new page is added to the cache and then filled with data read from disk. If physical memory is free enough, the page can remain in the cache for a long time, so other processes that use the same data no longer need to access the disk. Writes work similarly: the data is modified directly in the page cache, but the modified pages (called dirty pages at this point) are not written to disk immediately; the write-back is delayed by a few seconds, in case the process modifies the data in the page cache again.
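
A small sketch of this deferred write-back from the application's point of view, with log.txt as an assumed file name: write() only dirties pages in the page cache, and fsync() forces them to disk instead of waiting for the kernel's delayed flush.

#include <fcntl.h>
#include <unistd.h>

int main(void) {
    int fd = open("log.txt", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return 1;

    write(fd, "hello page cache\n", 17);  /* lands in the page cache as a dirty page */
    fsync(fd);                            /* explicit write-back to disk             */

    close(fd);
    return 0;
}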



Origin blog.csdn.net/m0_60259116/article/details/130735479