Let's walk through the I/O model from beginning to end

What is IO

One-sentence summary: I/O is the input and output of data between memory and external devices such as the hard disk.

I/O is simply the abbreviation of input/output.

What about input and output?

For example, when we type code on the keyboard, that is input; the characters shown on the monitor are output. Together, that is I/O.

The disk I/O that we often care about refers to the input and output between the hard disk and memory.

When reading a local file, the data on the disk must be copied to memory, and when modifying a local file, the modified data needs to be copied to the disk.

Network I/O refers to the input and output between the network card and memory.

When data on the network arrives, the network card needs to copy the data into memory. When sending data to other people on the network, the data needs to be copied from the memory to the network card.

So why must everything go through memory?

Our instructions are ultimately executed by the CPU, and everything goes through memory because the CPU can exchange data with memory far faster than it can interact directly with these external devices.

That is why the interaction always happens with memory. Of course, if there were no memory and the CPU interacted with external devices directly, that would also count as I/O.

To sum up: I/O refers to the interaction (data copy) between memory and external devices.

Well, after clarifying what is I/O, let us reveal the inside story of socket communication~

how to communicate

socket

socket creation

First, the server needs to create a socket. In Linux, everything is a file, so the created socket is also a file, and each file has an integer file descriptor (fd) to refer to this file.

int socket(int domain, int type, int protocol);
  • domain: This parameter is used to select the communication protocol family, such as selecting IPv4 communication or IPv6 communication, etc.
  • type: Select the socket type, optional byte stream socket, datagram socket, etc.
  • protocol: Specifies the protocol used. This protocol can usually be set to 0, because the protocol to be used can be deduced from the first two parameters.

For example, socket(AF_INET, SOCK_STREAM, 0); indicates IPv4 and a byte-stream socket, from which it can be deduced that the protocol used is TCP.

The return value of this method is the fd of the created socket.

bind

Now we have created a socket, but there is no address pointing to this socket yet.

As we all know, a server application needs a known IP and port so that clients can come knocking for service, so we now need to bind an address and port to this socket.

int bind(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

The sockfd parameter is the file descriptor of the socket we created. After calling bind, our socket is one step closer to being reachable.
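To make this concrete, here is a minimal sketch (not from the original text) of creating a socket and binding it; the wildcard address and port 8080 are just illustrative choices:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

// Create an IPv4 TCP socket and bind it to 0.0.0.0:8080 (port chosen only for illustration).
int create_and_bind(void) {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);   // byte-stream socket, protocol deduced as TCP
    if (listen_fd < 0) { perror("socket"); return -1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);          // any local IP
    addr.sin_port = htons(8080);

    if (bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return -1;
    }
    return listen_fd;                                  // the socket's fd, now bound
}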

listen

After socket and bind, the socket is still in the closed state, that is, it is not yet listening for connections. We then call the listen method to put the socket into the passive listening state, so that it can listen for connection requests from clients.

int listen(int sockfd, int backlog);

Pass in the fd of the created socket, and specify the size of the backlog.

When I looked up the information on this backlog, I saw three explanations:

  1. The socket has a queue, which stores completed connections and semi-connections at the same time, and the backlog is the size of the queue.
  2. The socket has two queues, namely the completed connection queue and the semi-connection queue, and the backlog is the sum of the sizes of the two queues.
  3. The socket has two queues, namely the completed connection queue and the semi-connection queue, and the backlog is only the size of the completed connection queue.

Let me explain what a semi-connection (half-open connection) is.

We all know that TCP requires a three-way handshake to establish a connection. When the receiver receives the requester's connection request (SYN), it replies with SYN+ACK; at this point the connection is in the semi-connected state on the receiver's side. When the receiver then receives the requester's ACK, the connection is in the completed state.
So the discussion above is about how connections in these two states are stored.

I checked around and found that BSD-derived implementations use a single queue that holds connections in both states, with the backlog parameter as the size of that queue.

Linux uses two queues to store completed connections and semi-connections separately, and the backlog is only the size of the completed-connection queue.

accept: the server accepts a connection

Now that we have initialized the listening socket, clients will start connecting, and we need to handle these connections that have already been established.

From the above analysis, we can know that the connection after the three-way handshake is completed will be added to the completed connection queue.

At this time we need to take a connection from the completed connection queue to process it, and this fetching action is done by accept.

int accept(int sockfd, struct sockaddr *addr, socklen_t *addrlen);

The int value returned by this method is the file descriptor of the socket that has completed the connection, and then the socket can be operated to communicate.

If the completed connection queue has no connections to take, the thread calling accept will block waiting.
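As a hedged sketch of the server side (reusing the hypothetical create_and_bind above), the typical pattern is to call listen once and then loop around accept; the backlog of 128 is just an illustrative value:

#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

void serve(int listen_fd) {
    listen(listen_fd, 128);                            // enter the passive listening state

    for (;;) {
        struct sockaddr_in peer;
        socklen_t peer_len = sizeof(peer);
        // Blocks here until the completed-connection queue has an entry to take.
        int conn_fd = accept(listen_fd, (struct sockaddr *)&peer, &peer_len);
        if (conn_fd < 0) { perror("accept"); continue; }
        // conn_fd refers to the established connection; read/write happens on it.
        // handle_connection(conn_fd);                  // hypothetical handler
    }
}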

So far, the communication process of the server has come to an end, let's look at the operation of the client.

connect: the client connects

The client also needs to create a socket, that is, call socket(), so we won't go into details here, and we will directly start the connection establishment operation.

The client needs to establish a connection with the server, which kicks off the classic TCP three-way handshake (see the picture drawn above). After the client creates its socket and calls connect, the connection is in the SYN_SENT state; once the server's SYN+ACK is received, the connection becomes ESTABLISHED, which means the three-way handshake is complete.

int connect(int sockfd, const struct sockaddr *addr, socklen_t addrlen);

Calling connect needs to specify the remote address and port to establish a connection, and the communication can start after the three-way handshake is completed.

There is no need to call bind on the client side; a source IP and a random port will be chosen automatically.
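A minimal client-side sketch, assuming the server is reachable at 127.0.0.1:8080 (address and port are purely illustrative):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int connect_to_server(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);          // no bind needed: the kernel picks IP and port

    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(8080);
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);

    // Blocks until the three-way handshake completes (or fails).
    if (connect(fd, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect");
        return -1;
    }
    return fd;
}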

Summary of Connection Establishment Operations

Use a picture to summarize the connection-establishment process. You can see two blocking points here:

  • connect: Need to block and wait for the completion of the three-way handshake.
  • accept: Need to wait for an available completed connection, if the completed connection queue is empty, it will be blocked.

read, write

After the connection is successfully established, we can start sending and receiving messages. Let's take a look.

read is for reading data. From the server's point of view, it waits for the client's request: if the client has not sent anything, calling read blocks and waits, because there is simply no data to read. This should be easy to understand.

write is to write data. Generally speaking, after the server accepts the request from the client, it will perform some logic processing, and then return the result to the client. This write may also be blocked.

Some people may ask: it is understandable that read blocks when there is no data to read, but why does write block too? If there is data, can't it just be sent directly?

Because we are using the TCP protocol, and TCP has to guarantee reliable, ordered delivery of data and provide flow control between the two ends.

So data is not sent out directly. TCP has a send buffer: we first copy the data into TCP's send buffer, and TCP decides when and how to send it, possibly retransmitting along the way.

If we send too fast and the receiver cannot keep up, the receiver tells us through the TCP protocol: stop sending, I'm too busy. The send buffer has a limited size, so if data cannot be sent out while write keeps being called, the buffer fills up; once it is full, nothing more can be written, and write blocks as well.

In summary, both read and write will be blocked.
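To make both blocking points visible, here is a sketch of a trivial echo loop on an already-accepted connection; conn_fd and the buffer size are hypothetical:

#include <stdio.h>
#include <unistd.h>

void echo(int conn_fd) {
    char buf[4096];
    for (;;) {
        // Blocks until the client sends something (or closes the connection).
        ssize_t n = read(conn_fd, buf, sizeof(buf));
        if (n <= 0) break;                             // 0 = peer closed, <0 = error

        // write may block when the TCP send buffer is full (a slow receiver).
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(conn_fd, buf + off, n - off);
            if (w < 0) { perror("write"); close(conn_fd); return; }
            off += w;
        }
    }
    close(conn_fd);
}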

Summary: why does network I/O block? (the I/O model)

Because the accept, connect, read, and write methods involved in connection establishment and communication may all be blocked.

Blocking occupies the thread that is currently executing and prevents it from doing anything else, and the frequent block/wake-up context switches also degrade performance.

Because of blocking, the initial solution was to create more threads. But as the Internet grew, the number of users exploded, the number of connections grew with it, and so did the number of threads needed, which eventually became the C10K problem.

The server can't stand it anymore, what should I do?

Optimize it!

So non-blocking sockets appeared, and then I/O multiplexing, signal-driven I/O, and asynchronous I/O.

In the next article, we will take a good look at these types of I/O models!

Introduction to basic knowledge

In the last article we looked at the inside story of socket communication and saw that network I/O indeed has many blocking points. With blocking I/O, as the number of users grows, the only way to handle more requests is to add threads; but threads consume memory, and too many threads competing leads to frequent context switching and huge overhead.

Therefore, blocking I/O could no longer meet the demand, and people kept optimizing and evolving it, proposing a variety of I/O models.

Under UNIX systems there are five I/O models in total, and today we will go through them one by one!

But before introducing the I/O model, we need to understand the pre-knowledge first.

Kernel space and user space

The following takes a 32-bit system as an example to introduce kernel space (kernel space) and user space (user space).

For a 32-bit operating system, the addressing space (virtual address space, or linear address space) is 4 GB (2 to the 32nd power); that is, the maximum address space of a process is 4 GB.

The core of the operating system is the kernel, which is independent of ordinary applications, can access the protected memory space, and has full permission to access the underlying hardware devices. To keep the kernel safe, modern operating systems generally forbid user processes from operating on the kernel directly. The usual implementation is that the operating system divides the virtual address space into two parts: one part is kernel space and the other is user space.

For the Linux operating system, the highest 1 GB (from virtual address 0xC0000000 to 0xFFFFFFFF) is used by the kernel and is called kernel space, while the lower 3 GB (from virtual address 0x00000000 to 0xBFFFFFFF) is used by each process and is called user space.

We can understand the above paragraph as follows:

In the 4G address space of each process, the highest 1G is the same, that is, the kernel space. Only the remaining 3G is used by the process itself.
In other words, the highest 1G kernel space is shared by all processes!

The following picture describes the allocation of 4G address space for each process (this picture comes from the Internet):
Why is it necessary to distinguish kernel space from user space?

Among all the CPU's instructions, some are very dangerous and will crash the system if misused, such as clearing memory or setting the clock. If every program were allowed to use these instructions, the probability of the system crashing would rise greatly.

Therefore, the CPU divides instructions into privileged instructions and non-privileged instructions. For those dangerous instructions, only the operating system and its related modules are allowed to use them, and ordinary applications can only use those instructions that will not cause disasters. For example, Intel's CPU divides the privilege level into four levels: Ring0~Ring3.

In fact, the Linux system only uses two operating levels, Ring0 and Ring3 (the same is true for the Windows system). When a process runs at the Ring3 level, it is called running in the user mode, and when it runs at the Ring0 level, it is called running in the kernel mode.

Kernel mode, user mode and system call


Our computers may be running a lot of programs at the same time, and these programs are from different companies.

No one knows whether a certain program running on the computer will go crazy and do some strange operations, such as clearing the memory regularly.

Therefore, the CPU divides non-privileged instructions and privileged instructions, and implements permission control. Some dangerous instructions will not be opened to ordinary programs, but only to privileged programs such as the operating system.

You can understand it like this: our code cannot invoke those potentially "dangerous" operations, but the operating system's kernel code can.

These "dangerous" operations refer to: memory allocation and recovery, disk file read and write, network data read and write, and so on.

If we want to perform these operations, we can only call the APIs opened by the operating system, also known as system calls.

This is like going to a government service hall: the sensitive operations are handled for us by the staff (the system calls). The reasoning is the same, and the purpose is to stop us (ordinary programs) from messing things up.

Here are the two previous nouns mentioned:

  • user space
  • kernel space.

The code of our ordinary program runs in the user space, while the code of the operating system runs in the kernel space, and the user space cannot directly access the kernel space. When a process runs in user space, it is in user mode, and when it runs in kernel space, it is in kernel mode .

When a program in user space makes a system call, that is, calls an API provided by the operating system kernel, a context switch into kernel mode occurs; this is often described as trapping into the kernel.

For the previous DOS operating system, there is no concept of kernel space, user space, kernel mode, and user mode. It can be considered that all codes are running in the kernel mode, so the application code written by the user can easily crash the operating system.

For Linux, by distinguishing the design of the kernel space and the user space, the operating system code (the code of the operating system is much more robust than the code of the application program) and the application program code are isolated. Even if an error occurs in a single application program, it will not affect the stability of the operating system, so that other programs can still run normally (Linux is a multitasking system!).
Therefore, the distinction between kernel space and user space is essentially to improve the stability and availability of the operating system.

System calls and state switching

For example, if an application wants to read a file on the disk, it can issue a "system call" to the kernel, telling it: "I want to read a certain file on the disk." This happens through a special instruction: the process moves from user mode into kernel mode (into kernel space); in fact the process is still "outside", it just calls the system's API, which causes kernel code to run on its behalf. In kernel space the CPU can execute any instruction, including reading data from the disk. The concrete flow is that the data is first read into kernel space, then copied into user space, and the CPU switches back from kernel mode to user mode. At that point the application returns from the system call with the data it wanted and can happily carry on.

Simply put, the application outsources high-tech things (reading files from the disk) to the system kernel, and the system kernel does these things professionally and efficiently.

For a process, the process of entering kernel space from user space and finally returning to user space is very complicated. For example, the concept "stack" that we often come into contact with, in fact, the process has a stack in the kernel mode and user mode. When running in user space, the process uses the stack in user space, and when running in kernel space, the process uses the stack in kernel space. Therefore, each process in Linux has two stacks, which are used for user mode and kernel mode respectively.

In a nutshell, there are three ways to get from user mode into kernel mode: system calls, soft interrupts, and hardware interrupts. Each of them involves a lot of operating-system knowledge, so I won't expand on them here.

Why can't the hard disk data be read directly

It seems redundant to copy data from kernel space to user space, why not just let the disk send the data to the buffer of user space?

User space usually cannot access the hard disk directly. The disk is a block-storage hardware device that operates on fixed-size blocks of data, while a user process may request data of arbitrary size or alignment. In the data exchange between the two, the kernel takes care of splitting and reassembling the data, playing the role of the middleman.

how copying works

From the introduction above we know that when an application needs to read a file, the kernel first reads the file content from the disk into a buffer inside the kernel via DMA (put simply, DMA moves data from the hard disk into memory without the CPU taking part; the CPU only has to issue an instruction to the disk and memory hardware, as the appendix explains in detail). The application process then reads the data from the kernel buffer into its own application buffer. In other words, the file is copied twice.
In order to improve I/O efficiency and processing capability, the operating system uses a virtual memory mechanism. Virtual memory means replacing physical (hardware RAM) memory addresses with fake (or virtual) addresses. There are many benefits to doing so, which can be summarized into two categories:

More than one virtual address can point to the same physical memory address.
The virtual memory space can be larger than the actual available hardware memory.
The advantage of this is that the copying between the kernel and the user space is omitted.

So why introduce these knowledge points at the beginning?

Because when the program requests to obtain network data, it needs to go through two copies:

  • The program needs to wait for the data to be copied from the NIC to the kernel space.
  • Because the user program cannot access the kernel space, the kernel has to copy the data to the user space, so that the program in the user space can access the data.

All of this introduction is to help you understand why there are two copies, and that system calls have overhead, so it is best not to make them too frequently.

The difference between the I/O models we are talking about today lies in how this copying is carried out!

Today we will use the read call, that is, to read network data as an example to expand the I/O model.

Let's go!

IO model

When fishing, the fish is in the fish pond at the beginning, and the final sign of our fishing action is that the fish is caught by us from the fish pond and put into the fish basket.

The fish pond inside can be mapped to a disk, the transitional fishhook in the middle can be mapped to kernel space, and the fish basket where the fish are finally placed can be mapped to user space. A complete fishing (IO) operation is the process of transferring (copying) fish (files) from the fish pond (hard disk) to the fish basket (user space).

Two steps: take the bait (the kernel data is ready), put it in the fish basket (copy the data from the kernel state to the user state)

synchronous blocking model

Suppose A is fishing by the river and is very attentive, afraid that the fish will slip away, so A keeps staring at the rod and waits for the fish to bite, concentrating on nothing else; only after the fish bites and is caught and put into the basket does the action end. That is blocking IO: the system call stays blocked until the kernel has the data ready.

When a user thread calls read to obtain network data, the data must first be available: the network card has to receive data from the client first, and then the data has to be copied into the kernel and copied again into user space. Throughout this whole process, the user thread is blocked.

Assuming that no client sends data , the user thread will be blocked and wait until there is data. Even if there is data, the process of two copies has to be blocked and waited.

So this is called the synchronous blocking I/O model.

Its advantage is obvious: simplicity. After calling read, you don't have to care about anything until the data arrives and is ready to be processed.

The disadvantage is also obvious: one thread is tied to one connection and stays occupied the whole time; even if the network card never delivers data, the thread blocks and waits synchronously.

We all know that threads are relatively heavy resources, which is a bit wasteful.

So we don't want it to wait around like this.

So there is synchronous non-blocking I/O.

Synchronous non-blocking I/O

If B is also fishing by the river, B does not want to spend all his time waiting for the fish to bite like A does. Instead, while waiting, he reads books, browses blogs, chats, and so on. But B does not completely ignore the fish: every so often he checks whether a fish has been hooked, and once one has, he wraps things up. This is non-blocking IO.

Non-blocking IO usually requires the programmer to keep trying to read the file descriptor in a loop. This process is called polling, which is quite wasteful for the CPU and is generally only usable in specific scenarios.
From the figure, we can clearly see that synchronous non-blocking I/O is optimized based on synchronous blocking I/O:

When there is no data, you can no longer block and wait stupidly, but directly return an error, telling that there is no ready data yet!

It should be noted here that the user thread will still be blocked in the step of copying from the kernel to the user space.

This model is more flexible than synchronous blocking I/O. For example, if there is no data when calling read, the thread can do other things first, and then continue to call read to see if there is any data.
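A hedged sketch of what this looks like with the standard O_NONBLOCK flag; the naive polling loop deliberately mirrors the description above:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

void read_nonblocking(int conn_fd) {
    int flags = fcntl(conn_fd, F_GETFL, 0);
    fcntl(conn_fd, F_SETFL, flags | O_NONBLOCK);       // switch the fd to non-blocking mode

    char buf[4096];
    for (;;) {
        ssize_t n = read(conn_fd, buf, sizeof(buf));
        if (n >= 0) break;                             // data arrived (or the peer closed)
        if (errno == EAGAIN || errno == EWOULDBLOCK) {
            /* no data ready yet: do other work here, then poll again */
            continue;
        }
        perror("read");                                // a real error
        break;
    }
}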

But if your thread is to fetch data and then process the data, without doing other logic, then this model is a bit problematic.

It means you keep making system calls. If your server has to handle a huge number of connections, you need a huge number of threads calling read over and over; context switches become frequent and the CPU wears itself out doing useless work.

So what to do?

So there is I/O multiplexing.

Multiplexed IO

Suppose D is also fishing by the river, but D is a big spender: he brings a pile of fishing rods and sets them all up, which obviously raises the chance of a fish biting. He only has to keep checking whether any rod has a fish hooked, which improves efficiency. The core idea is that IO multiplexing can wait on the ready state of multiple file descriptors at the same time.
From the picture, it seems to be similar to the synchronous non-blocking I/O above, but it is not the same, and the threading model is different.

Since synchronous non-blocking I/O is too wasteful to call frequently under too many connections, let's hire a specialist.

The job of this commissioner is to manage multiple connections and help check whether data on the connection is ready.

In other words, you can use only one thread to check whether data is ready for multiple connections.

In code, this specialist is select: we register the connections we want monitored with select, and select watches whether any of the connections it manages has data ready; if so, it notifies another thread to read the data. That read is the same as before and still blocks the user thread.

In this way, a small number of threads can be used to monitor multiple connections, reducing the number of threads, reducing memory consumption and reducing the number of context switches, which is very comfortable.
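A minimal select sketch under the assumption that the already-accepted connection fds sit in an array conn_fds (names are illustrative); note how one call watches all of them, and how we still have to scan to find out which fd fired:

#include <stdio.h>
#include <sys/select.h>

void watch_with_select(int *conn_fds, int n_conns) {
    fd_set readfds;
    FD_ZERO(&readfds);
    int max_fd = -1;
    for (int i = 0; i < n_conns; i++) {                // register every fd (copied into the kernel on each call)
        FD_SET(conn_fds[i], &readfds);
        if (conn_fds[i] > max_fd) max_fd = conn_fds[i];
    }

    // Blocks until at least one registered fd has data ready.
    if (select(max_fd + 1, &readfds, NULL, NULL, NULL) < 0) { perror("select"); return; }

    for (int i = 0; i < n_conns; i++) {                // select does not say which fd is ready: check them all
        if (FD_ISSET(conn_fds[i], &readfds)) {
            /* read(conn_fds[i], ...) here; that read still blocks while copying */
        }
    }
}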

Detailed explanation of IO multiplexing

Think of it as collecting homework. Synchronous blocking and non-blocking both collect it one student at a time: synchronous blocking goes in order and waits at each student until the homework is handed over before moving on; non-blocking skips whoever isn't finished and comes back later. select is like having students raise their hands when they are done and then walking down to collect the homework, except you don't know exactly who raised a hand.

select(array)

Advantages

  • You do not need to make a system call for every FD, which avoids frequently switching between user mode and kernel mode.
  • Cross-platform: Linux, Mac, and Windows can all use this function.

Disadvantages

  • The number of file descriptors a single process can monitor is limited, at most 1024.
  • On every call the file descriptor set has to be copied from user mode to kernel mode.
  • select does not tell you which file descriptor is ready, so you have to traverse them all.

poll (linked list)

Advantages

  • Mainly removes select's limit of 1024 file descriptors by using a different data structure, so there is no hard cap; otherwise it is similar to select.

Disadvantages

  • Like select, it still does not tell you which descriptor is ready, so you must traverse them all.
  • It is mainly available on Unix-like platforms such as Linux.
  • The file descriptors still have to be copied from user mode to kernel mode on every call.

epoll (red-black tree)

Advantages

  • There is effectively no limit on the number of file descriptors a single process can monitor; in practice tens of thousands (roughly 30,000 to 60,000), depending on machine memory and similar factors.
  • The file descriptors do not have to be copied from user mode to kernel mode on every call.
  • It tells you directly which file descriptors are ready, with no need to traverse them all.

Disadvantages

  • Only supported on Linux, not cross-platform.

Working modes

  • Level-triggered (the default): if an event is not handled, epoll keeps reporting it.
  • Edge-triggered: the event is reported once; whether or not it is handled, it will not be reported again until new activity occurs.

Presumably by now you have understood what I/O multiplexing is.

The so-called multi-channel refers to multiple connections, and multiplexing refers to the ability to monitor so many connections with one thread.
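For comparison, a hedged epoll sketch of the same idea (Linux only, error handling trimmed). Unlike select, the fds are registered once, and epoll_wait reports exactly which ones are ready:

#include <sys/epoll.h>

void watch_with_epoll(int *conn_fds, int n_conns) {
    int epfd = epoll_create1(0);                       // create the epoll instance

    for (int i = 0; i < n_conns; i++) {                // register each fd once, not on every wait
        struct epoll_event ev;
        ev.events = EPOLLIN;                           // level-triggered readable events (the default)
        ev.data.fd = conn_fds[i];
        epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fds[i], &ev);
    }

    struct epoll_event ready[64];
    int n = epoll_wait(epfd, ready, 64, -1);           // blocks until something is ready
    for (int i = 0; i < n; i++) {
        int fd = ready[i].data.fd;                     // we are told exactly which fd is ready
        /* read(fd, ...) here */
        (void)fd;
    }
}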

Seeing this, think again, what else can be optimized?

Signal driven IO

If C is also fishing by the river, he can install an alarm (such as a bell) on the fishing rod that goes off the moment a fish bites; after receiving the alarm, he walks over and lands the fish. This is the signal-driven IO model: the application process tells the kernel, "when the datagram is ready, send me a signal"; it then catches the SIGIO signal, and its signal handler goes and fetches the datagram.

Although select above spares us per-connection blocking, it still has to keep checking whether any data is ready. Couldn't the kernel simply tell us when data has arrived, instead of us polling?

This function can be realized by signal-driven I/O. The kernel notifies that the data is ready, and then the user thread goes to read (still blocks).
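A rough sketch of how signal-driven I/O is set up on a socket, assuming the Linux-style F_SETOWN/O_ASYNC knobs (most useful for UDP, as explained below):

#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static void on_sigio(int signo) {
    /* the kernel says "something happened on the socket"; fetch the datagram here */
    (void)signo;
}

void enable_sigio(int sock_fd) {
    signal(SIGIO, on_sigio);                           // install the SIGIO handler
    fcntl(sock_fd, F_SETOWN, getpid());                // deliver SIGIO to this process
    int flags = fcntl(sock_fd, F_GETFL, 0);
    fcntl(sock_fd, F_SETFL, flags | O_ASYNC);          // enable signal-driven I/O on this fd
}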

Sounds better than I/O multiplexing? Then why do you rarely hear about signal-driven I/O?
Why does the industry use I/O multiplexing rather than signal-driven I/O?

Because our applications usually use the TCP protocol, and the socket of the TCP protocol can generate seven signal events.

That is to say, a signal is raised not only when data is ready but also for other events, and it is always the same signal, so our application has no way of telling which event produced it.

That's a headache!

So our applications basically do not use signal-driven I/O. However, if your application uses the UDP protocol, it is workable, because UDP does not have so many events.

So signal-driven I/O doesn't look too good to us at this point of view.

Asynchronous I/O

Suppose E also wants to fish but is rather busy, so he hires someone to watch the rod for him and notify him once a fish is hooked, so that he can come and collect it. With asynchronous I/O, the kernel notifies the application only when the data copy has already completed (whereas signal-driven I/O tells the application when it may start copying the data).

In other words, this time we hired a fishing master: not only does he do the fishing, he also texts us once the fish has been landed. We simply ask him to cast the rod, go off to do other things, and come back to deal with an already-caught fish when his message arrives.

Although signal-driven I/O is not very friendly to TCP, its idea points in the right direction: toward asynchrony. But it is not fully asynchronous, because the read that follows still blocks the user thread, so it only counts as semi-asynchronous.

Therefore, we have to think about how to make it fully asynchronous, that is, to save the blocking of the read step.

In fact, the idea is very clear: let the kernel directly copy the data to the user space and then notify the user thread to achieve real non-blocking I/O!

With asynchronous I/O, the user thread calls aio_read and moves on; everything, including the copy of data from kernel space into user space, is completed by the kernel. When the kernel finishes, the previously registered callback is invoked, and at that point the user thread can simply continue working on data that has already been copied into user space.

During the whole process, the user thread does not have any blocking points, which is the real non-blocking I/O.
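A hedged sketch using the POSIX AIO interface from <aio.h> (on Linux the glibc version simulates this with threads, which ties into the discussion below; the busy-wait is only to keep the sketch short):

#include <aio.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

static char buf[4096];

void read_asynchronously(int fd) {
    static struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                                     // returns immediately; the copy is done for us

    /* ... do other work here; there is no blocking point ... */

    while (aio_error(&cb) == EINPROGRESS) { /* still in flight */ }
    ssize_t n = aio_return(&cb);                       // by now the data is already in buf
    printf("read %zd bytes asynchronously\n", n);
}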

Then the question arises again:

Why is I/O multiplexing commonly used rather than asynchronous I/O?
Because Linux's support for asynchronous I/O is insufficient; you can think of it as not fully implemented, so asynchronous I/O cannot really be used.

Some people may object that this is wrong because Tomcat has AIO implementation classes. In fact, these components and class libraries only appear to support AIO (asynchronous I/O); under the hood the implementation is simulated with epoll.

Windows implements real AIO, but our servers are generally deployed on Linux, so the mainstream is still I/O multiplexing.

So far, you must have understood how the five I/O models evolved.

Later, I will talk about several confusing concepts that are often accompanied by network I/O: synchronous, asynchronous, blocking, and non-blocking.

synchronous/asynchronous/blocking/non-blocking/BIO/NIO/AIO

https://mp.weixin.qq.com/s/EVequWGVMWV5Ki2llFzdHg
https://mp.weixin.qq.com/s/DEd0VY3dhR6B0hjQSEtB7Q

Advanced file transfer optimization - DMA, zero copy, large file transfer

What is DMA

The full name of DMA is Direct Memory Access. It is a hardware module used to transfer data from one address space to another, rather like a "copy" operation; but unlike data transfers driven by the CPU, a DMA transfer needs no CPU intervention. The DMA bus does not conflict with the CPU bus, and only when the transfer finishes does DMA raise an interrupt to notify the CPU.

Why do we need dma technology

We know that the CPU has many functions such as data transfer, calculation, and control program transfer. The core of the system operation is the CPU.

The CPU is processing a large number of tasks all the time, but some of them are not that important, such as copying and storing data. If we offload that part of the work and let the CPU handle the more complex computing tasks instead, wouldn't that make better use of CPU resources?

Therefore: transferring data (especially transferring large amounts of data) does not require CPU participation. For example, if you want to copy the data of peripheral A to peripheral B, you only need to provide a data path for the two peripherals, and directly copy the data from A to B without processing by the CPU.

Therefore, when there is a large amount of data transfer or a high rate, using DMA can save a lot of CPU resources. In other words, the use of DMA will not occupy the running time of the CPU;

Taking the DMA of STM32 as an example, as can be seen in the figure below, the bus of DMA is separated from the CPU bus of Cortex-M3, so the operation of DMA will not occupy the bus resources of CPU;

Before DMA technology, the I/O process was like this:

  • The CPU sends the corresponding command to the disk controller and then returns;
  • after the disk controller receives the command, it starts preparing the data, places it in the controller's internal buffer, and then raises an interrupt;
  • after receiving the interrupt signal, the CPU stops what it is doing, reads the data from the disk controller's buffer into its registers one byte at a time, and then writes the data from the registers into memory; during this transfer the CPU cannot do anything else.

To make this easier to understand, I drew a picture:
What is the process of data transmission using the DMA controller? Let's take a look at it in detail.

Specific process:

  • The user process calls read, sending an I/O request to the operating system asking for data to be read into its own buffer, and the process enters a blocked state;
  • after receiving the request, the operating system forwards the I/O request to the DMA controller and lets the CPU do other work;
  • the DMA controller forwards the I/O request to the disk;
  • the disk receives the request and reads the data into the disk controller's buffer; when that buffer is full, it sends an interrupt signal to the DMA controller;
  • the DMA controller receives the signal and copies the data from the disk controller's buffer into the kernel buffer; the CPU is not occupied and can keep doing other work;
  • when the DMA controller has read enough data, it sends an interrupt signal to the CPU;
  • the CPU receives the DMA signal, knows the data is ready, copies the data from the kernel to user space, and the system call returns.

Early DMA existed only on the motherboard; as I/O devices multiplied and their transfer requirements diverged, each I/O device got its own DMA controller.

There are three typical DMA working modes:

  • memory → memory, such as a memory copy
  • peripheral → memory, such as UART or SPI
  • memory → peripheral, such as UART or SPI

How bad is traditional file transfer?

If the server wants to provide the function of file transfer, the simplest way we can think of is: read the file on the disk, and then send it to the client through the network protocol.

The way traditional I/O works is that data reads and writes are copied back and forth from user space to kernel space, and data in kernel space is read or written from disk through the I/O interface at the operating system level.

The code is usually as follows, generally two system calls are required:

read(file, tmp_buf, len);
write(socket, tmp_buf, len);
The code is very simple, although there are only two lines of code, but a lot of things happened in it.
First, there are 4 context switches between user mode and kernel mode, because there are two system calls: one read() and one write(). Each system call has to switch from user mode into kernel mode, and then back from kernel mode into user mode once the kernel finishes the task.

The cost of context switching is not small. A switch takes tens of nanoseconds to several microseconds. Although the time seems short, in high concurrency scenarios, this kind of time is easy to be accumulated and amplified, thus affecting the performance of the system. performance.

Secondly, there are 4 data copies, two of which are DMA copies, and the other two are copied by the CPU. Let's talk about the process below:

  • In the first copy, the data on the disk is copied into the operating system kernel's buffer; this copy is carried out by DMA.
  • In the second copy, the data in the kernel buffer is copied into the user's buffer, so that our application can use it; this copy is done by the CPU.
  • In the third copy, the data just copied into the user's buffer is copied again into the kernel's socket buffer; this is still done by the CPU.
  • In the fourth copy, the data in the kernel's socket buffer is copied into the network card's buffer; this is again carried out by DMA.

Looking back at this file transfer process, we only moved one piece of data, yet it was moved four times. Excessive data copying undoubtedly consumes CPU resources and greatly reduces system performance.

This simple and traditional file transfer method has redundant context switching and data copying, which is very bad in a high-concurrency system, adding a lot of unnecessary overhead and seriously affecting system performance.

Therefore, in order to improve the performance of file transfer, it is necessary to reduce the number of "context switching between user mode and kernel mode" and "memory copy".

How to optimize the performance of file transfer?

Let's take a look first, how to reduce the number of "context switching between user mode and kernel mode"?

Context switching occurs when reading disk data because user space does not have permission to operate disks or network cards; the kernel has the highest privilege, so operating these devices has to be done by the operating system kernel. Whenever we want the kernel to complete some task for us, we go through the system call functions the operating system provides.

And a system call will inevitably have two context switches: first switch from the user state to the kernel state, and then switch back to the user state to be executed by the process code after the kernel finishes executing the task.

Therefore, in order to reduce the number of context switches, it is necessary to reduce the number of system calls.

Let's take a look again, how to reduce the number of "data copies"?

As we learned earlier, the traditional file transfer method involves 4 data copies, and the part "copy from the kernel's read buffer into the user's buffer, then copy from the user's buffer into the socket buffer" is not actually necessary.

Because in the application scenario of file transfer, we do not "reprocess" the data in the user space, so the data does not actually need to be moved to the user space, so the user's buffer is unnecessary.

How to achieve zero copy?

There are usually two ways to implement zero-copy technology:

  • mmap + write
  • sendfile

Below, let's look at how each of them reduces the number of "context switches" and "data copies".

mmap + write
As we know earlier, the read() system call will copy the data in the kernel buffer to the user's buffer, so in order to reduce the overhead of this step, we can replace the read() system call with mmap() function.

buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
write(sockfd, buf, len);

The mmap() system call "maps" the data in the kernel buffer directly into user space, so no further data copy is needed between the operating system kernel and user space.
The specific process is as follows:

  • After the application process calls mmap(), DMA copies the disk data into the kernel buffer, and the application process then "shares" this buffer with the operating system kernel;
  • the application process then calls write(), and the operating system copies the data from the kernel buffer directly into the socket buffer, a kernel-internal move done by the CPU;
  • finally, the data in the kernel's socket buffer is copied into the network card's buffer, and this step is carried out by DMA.

So by using mmap() instead of read(), we save one data copy.

But this is still not ideal zero copy: the CPU still has to copy the data from the kernel buffer into the socket buffer, and there are still 4 context switches, because there are still 2 system calls.

sendfile

Linux kernel version 2.1 introduced a system call dedicated to sending files, sendfile(). Its form is as follows:

#include <sys/socket.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
Its first two parameters are the destination and source file descriptors respectively; the last two are the offset within the source and the number of bytes to copy, and the return value is the number of bytes actually copied.

First of all, it can replace the previous two system calls, read() and write(), so that one system call can be reduced, and the overhead of two context switches can also be reduced.

Secondly, this system call can directly copy the data in the kernel buffer to the socket buffer instead of copying to the user state, so there are only 2 context switches and 3 data copies. As shown in the figure below:
But this is still not true zero-copy technology. If the network card supports SG-DMA (Scatter-Gather Direct Memory Access, different from ordinary DMA), we can go one step further and eliminate the CPU copy from the kernel buffer into the socket buffer.

You can run the following command on your Linux system to check whether the network card supports the scatter-gather feature:

$ ethtool -k eth0 | grep scatter-gather
scatter-gather: on
Therefore, starting from Linux kernel 2.4, when the network card supports SG-DMA technology, the sendfile() system call works a little differently. The specific process is as follows:

  • In the first step, DMA copies the data on the disk into the kernel buffer;
  • in the second step, only the buffer descriptor and data length are passed to the socket buffer, so that the network card's SG-DMA controller can copy the data from the kernel buffer directly into the network card's buffer; this removes the copy from the kernel buffer into the socket buffer, saving one more data copy.

So in this process there are only 2 data copies, as shown in the figure below:
This is the so-called zero-copy technology: we no longer copy data at the memory level through the CPU; throughout the whole process, all data is moved by DMA.

Compared with the traditional method, zero-copy file transfer halves both the context switches and the data copies: only 2 context switches and 2 data copies are needed to transfer a file, and neither copy goes through the CPU; both are carried out by DMA.

Therefore, in general, zero-copy technology can improve the performance of file transfer by at least double.
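A short usage sketch of sendfile, assuming file_fd is an opened file and sock_fd a connected socket (fstat is only used to learn how many bytes to send):

#include <sys/sendfile.h>
#include <sys/stat.h>

void send_whole_file(int sock_fd, int file_fd) {
    struct stat st;
    fstat(file_fd, &st);                               // size of the file to transfer

    off_t offset = 0;
    while (offset < st.st_size) {
        // The copy happens inside the kernel; no user-space buffer is involved.
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0) break;                          // error, or nothing left to send
    }
}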

Projects using zero-copy technology
In fact, Kafka, an open source project, uses "zero-copy" technology, which greatly improves the I/O throughput rate, which is one of the reasons why Kafka processes massive amounts of data so quickly.

If you trace the code of Kafka file transfer, you will find that it finally calls the transferTo method in the Java NIO library:

@Override
public long transferFrom(FileChannel fileChannel, long position, long count) throws IOException {
    return fileChannel.transferTo(position, count, socketChannel);
}

If the Linux system supports the sendfile() system call, then transferTo() will actually end up using the sendfile() system call function.

Someone once wrote a program to benchmark this: under the same hardware conditions, comparing traditional file transfer with zero-copy file transfer (see the test data graph below), zero copy cut the transfer time by 65% and greatly improved data-transfer throughput.
In addition, Nginx also supports zero-copy technology. Generally, zero-copy technology is enabled by default, which is conducive to improving the efficiency of file transfer. The configuration of whether to enable zero-copy technology is as follows:

http {
    ...
    sendfile on;
    ...
}

The specific meaning of sendfile configuration:

  • Setting it to on means using the zero-copy technology, sendfile, to transfer files, so only 2 context switches and 2 data copies are needed.
  • Setting it to off means using the traditional read + write file transfer, which requires 4 context switches and 4 data copies.

Of course, to use sendfile the Linux kernel must be version 2.1 or higher.

What does PageCache do? - small file transfer

Looking back at the file transfer process mentioned above, the first step is to copy the disk file data into the "kernel buffer", which is actually a disk cache (PageCache).

Since zero copy uses PageCache technology, zero copy can further improve performance. Let's see how PageCache does this next.

Reading and writing disks is much slower than reading and writing memory, so we should find a way to replace "reading and writing disks" with "reading and writing memory". Therefore, we will move the data in the disk to the memory through DMA, so that the read disk can be replaced with the read memory.

However, the memory space is much smaller than that of the disk, and the memory is destined to only copy a small part of the data on the disk.

The question is, which disk data should be copied to memory?

We all know that a running program exhibits "locality": data that has just been accessed has a high probability of being accessed again shortly. So we can use PageCache to cache recently accessed data, and when space runs out, evict the cache entries that have gone unaccessed the longest.

Therefore, when reading disk data, look for it first in PageCache. If the data exists, it can be returned directly; if not, it will be read from disk and then cached in PageCache.

Another point: reading disk data requires locating the data first. For a mechanical disk, that means rotating the head to the sector where the data lives and only then reading "sequentially", and these physical actions are very time-consuming. To reduce their impact, PageCache uses a "read-ahead" feature.

For example, suppose read fetches 32 KB at a time. Although the first read only asks for bytes 0-32 KB, the kernel also reads the following 32-64 KB into PageCache, so a later read of 32-64 KB is very cheap; if the process reads those bytes before they are evicted from PageCache, the benefit is large.

Therefore, the advantages of PageCache are mainly two:

  • Cache recently accessed data;
  • read-ahead function;

These two practices will greatly improve the performance of reading and writing to disk.

However, when transferring large files (GB-level files), PageCache does not help; it even wastes the extra DMA copy into the cache, causing performance to drop. Even zero copy, which relies on PageCache, loses performance here.

This is because if you have many GB-level files to transfer, whenever users access these large files, the kernel will load them into PageCache, so the PageCache space is quickly filled by these large files.

In addition, due to the large size of the file, the probability of some parts of the file data being accessed again may be relatively low, which will cause two problems:

  • Because PageCache is occupied by the large files for a long time, other "hot" small files cannot make full use of PageCache, so overall disk read/write performance drops;
  • the large-file data in PageCache gets no caching benefit, yet still costs an extra DMA copy into PageCache.

Therefore PageCache, and hence zero-copy technology, should not be used for transferring large files: large files may monopolize PageCache so that "hot" small files cannot use it, which causes serious performance problems in a high-concurrency environment.

How to transfer large files?

So for the transfer of large files, what method should we use?

Let's look back at the initial example: when the read method is called to read a file, the process actually blocks inside the read call, because it has to wait for the disk data to come back, as shown in the figure below. The specific process is:

  • When read is called, the process blocks. The kernel issues an I/O request to the disk; the disk receives the request and seeks to the data, and when the data is ready it raises an I/O interrupt to tell the kernel the disk data is ready;
  • after receiving the I/O interrupt, the kernel copies the data from the disk controller's buffer into PageCache;
  • finally, the kernel copies the data from PageCache into the user buffer, and the read call returns normally.
The blocking problem can be solved with asynchronous I/O, which works as shown in the figure below: it splits the read operation into two halves:

  • In the first half, the kernel issues a read request to the disk and returns without waiting for the data to be in place, so the process can handle other tasks in the meantime;
  • in the second half, once the kernel has copied the disk data into the process's buffer, the process receives the kernel's notification and goes on to handle the data.

Moreover, note that asynchronous I/O does not involve PageCache at all, so using asynchronous I/O means bypassing PageCache.

I/O that bypasses PageCache is called direct I/O, and I/O that uses PageCache is called cached I/O. Typically, for disks, asynchronous I/O only supports direct I/O.

As mentioned earlier, the transmission of large files should not use PageCache, because the PageCache may be occupied by large files, and the "hot spot" small files cannot use PageCache.

Therefore, in high-concurrency scenarios, for the transmission of large files, "asynchronous I/O + direct I/O" should be used instead of zero-copy technology.

There are two common types of direct I/O application scenarios:

  • If the application has already implemented its own caching of disk data, the data may not need to be cached again in PageCache, which avoids extra overhead. In MySQL, for example, direct I/O can be enabled through a parameter setting; it is off by default.
  • When transferring large files, the large files rarely hit the PageCache cache and instead fill it up, so "hot" small files can no longer make full use of the cache, which increases overhead; direct I/O should be used in this case.
In addition, because direct I/O bypasses PageCache, it cannot enjoy these two kernel optimizations:

  • the kernel's I/O scheduler accumulates as many I/O requests as possible in PageCache and then "merges" them into larger requests before sending them to the disk, in order to reduce disk seeks;
  • the kernel also "reads ahead" subsequent I/O into PageCache, again to reduce disk operations.

Therefore, when transferring large files, "asynchronous I/O + direct I/O" lets the file be read without blocking.

Therefore, when transferring files, we need to use different methods according to the size of the file:

  • When transferring large files, use "asynchronous I/O + direct I/O";
  • when transferring small files, use "zero-copy technology".

In nginx, we can use the following configuration to pick the method according to file size:

location /video/ {
    sendfile on;
    aio on;
    directio 1024m;
}

When a file is larger than the directio value, "asynchronous I/O + direct I/O" is used; otherwise "zero-copy technology" is used.




Summary

Early I/O operations, memory and disk data transmission work are all done by the CPU, and at this time the CPU cannot perform other tasks, which will waste CPU resources.

Therefore, DMA technology appeared to solve this problem. Each I/O device has its own DMA controller; the CPU only needs to tell the DMA controller what data to transfer, where it comes from and where it should go, and can then move on with confidence. The subsequent actual data transfer is completed by the DMA controller, and the CPU does not take part in it.

The traditional way of doing IO is to read data from the hard disk and then send it out through the network card. This requires 4 context switches and 4 data copies: 2 of the copies happen between buffers in memory and the corresponding hardware devices and are done by DMA, and the other 2 happen between kernel mode and user mode and are done by the CPU.

In order to improve the performance of file transfer, zero-copy technology emerged, which combines two operations of disk reading and network sending through a system call (sendfile method), reducing the number of context switches. In addition, copying data occurs in the kernel, which naturally reduces the number of data copies.

Both Kafka and Nginx implement zero-copy technology, which will greatly improve the performance of file transfers.

Zero-copy technology is based on PageCache. PageCache caches recently accessed data, improving the performance of accessing cached data; at the same time, to mitigate the slow seeking of mechanical disks, it also works with the I/O scheduler to merge I/O requests and read ahead, which is also why sequential reads perform better than random reads. These advantages further improve the performance of zero copy.

It should be noted that the zero-copy technology does not allow the process to further process the file content, such as compressing the data before sending it.

In addition, zero copy should not be used when transferring large files, because "hot" small files may be unable to use PageCache when it is occupied by large files, and the cache hit rate of large files is low anyway; in that case, the "asynchronous IO + direct IO" approach is needed.

In Nginx, you can set a file size threshold through configuration, use asynchronous IO and direct IO for large files, and use zero copy for small files.

Data sources and further reading

WeChat public account "Kobayashi coding": reply with "system" or "network" to download the PDF version of the book.

appendix

What are the extended IO models in Java?

In Java, there are three main I/O models: blocking I/O (BIO), non-blocking I/O (NIO), and asynchronous I/O (AIO). The I/O-related APIs that Java provides ultimately rely on the operating system's I/O operations when working with files. For example, on Linux 2.6 and later, both NIO and AIO in Java are implemented on top of epoll, while on Windows, AIO is implemented with IOCP.

Java's BIO, NIO, and AIO can be understood as the Java language's encapsulation of the operating system's I/O models. When using these APIs, programmers do not need operating-system-specific knowledge, nor do they need to write different code for different operating systems; they just use the Java API.

Further reading

Five models of server concurrency (for reference)


See https://zhuanlan.zhihu.com/p/527426524 for details.

Appendix: Advanced I/O

Goals of I/O software

device independence

Now let us turn to the study of I/O software. A very important goal of I/O software design is device independence.

What does that mean? It means that we can write applications that can access any device, without having to specify a particular device in advance.

For example, an application that reads a file from a device should be able to read from a hard disk, a DVD, or a USB drive, without the application being customized for each device. This is what device independence means.

The operating system is the intermediary for this hardware: different hardware accepts different command sequences, so the operating system is needed to translate between them.

A goal closely related to device independence is uniform naming. The name of a device should simply be an integer or a string; it should not depend on the specific device.

In UNIX, all disks can be mounted into the file system, so users do not need to remember the specific name of each device, only the corresponding path; if the path is forgotten, it can be found with commands such as ls.
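As a small illustration of device independence and uniform naming, the sketch below reads a few bytes with the exact same open()/read() calls from an ordinary file and from a device node; the two paths are just examples that exist on most Linux systems.

```c
/* Minimal sketch of device independence: the same open()/read() calls
 * work on an ordinary file and on a device node (paths are examples). */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static void dump_first_bytes(const char *path) {
    char buf[16];
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror(path); return; }
    ssize_t n = read(fd, buf, sizeof(buf));   /* identical call for both */
    printf("%s: read %zd bytes\n", path, n);
    close(fd);
}

int main(void) {
    dump_first_bytes("/etc/hostname");   /* ordinary file */
    dump_first_bytes("/dev/urandom");    /* character device, same API */
    return 0;
}
```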

error handling

Besides device independence, a second important goal of I/O software implementation is error handling.

Normally, errors should be handled as close to the hardware as possible. If the device controller detects a read error, it should do its best to correct the error itself.

If the device controller cannot handle the problem, the device driver should handle it, for example by retrying the read, since many errors are transient. If the device driver cannot handle the error either, it passes the error up to the layer above to deal with. In many cases, the upper layer does not need to know how the lower layer resolved the error.

This is much like how a project manager does not have to report every decision to the boss, and a programmer does not have to tell the project manager how every line of code is written: the details are handled transparently at the lower level.

Synchronous and asynchronous transfers

The third goal of I/O software is to support both synchronous and asynchronous (that is, interrupt-driven) transfers. Let me first explain what synchronous and asynchronous mean here.

In synchronous transmission, data is usually sent in blocks or frames, and the sender and receiver must have synchronized clocks before the transfer starts.

In asynchronous transmission, data is usually sent byte by byte or character by character. Asynchronous transmission does not require a shared clock; instead, a parity bit is added to the data before transmission. The main differences between synchronous and asynchronous transmission are summarized in the figure below.
[figure omitted: comparison of synchronous and asynchronous transmission]
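As a tiny illustration of the parity bit mentioned above, here is a sketch that computes the even-parity bit a sender could append to each byte in asynchronous, character-oriented transmission; it only demonstrates the idea, not a real serial driver.

```c
/* Tiny sketch: compute the even-parity bit a sender could append to each
 * byte in asynchronous, character-oriented transmission. Illustration only. */
#include <stdint.h>
#include <stdio.h>

/* Returns the bit that makes the total number of 1 bits even. */
static uint8_t even_parity_bit(uint8_t byte) {
    uint8_t ones = 0;
    for (int i = 0; i < 8; i++)
        ones += (byte >> i) & 1u;
    return ones & 1u;            /* 1 when the byte has an odd number of 1s */
}

int main(void) {
    uint8_t data = 'A';          /* 0x41 has two 1 bits -> parity bit 0 */
    printf("parity bit for 0x%02X: %u\n", data, even_parity_bit(data));
    return 0;
}
```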
Back to the topic: most physical I/O is asynchronous. The CPU starts the transfer and then goes off to do other work; the device signals completion with an interrupt, and only when the interrupt arrives does the CPU come back to the transfer.

I/O can be divided into two kinds: physical I/O and logical I/O. Physical I/O usually means actually fetching data from a storage device such as a disk; logical I/O means fetching data that is already in memory (a block or buffer).

buffering

The next issue for I/O software is buffering. Normally, data sent from one device does not go directly to its final destination; along the way it passes through a series of checks, validations, and buffers before it arrives.

For example, a packet arriving from the network goes through a series of checks and is first placed in a buffer; buffering like this smooths out differences in data rates and prevents buffers from overflowing.
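Buffering also shows up at the user level inside the C standard library. The sketch below (the output file name is hypothetical) uses setvbuf() to make a stream fully buffered: individual fputc() calls accumulate in the buffer and reach the kernel in a single write() when the buffer fills or fflush() runs.

```c
/* Minimal sketch of user-level buffering with stdio. */
#include <stdio.h>

int main(void) {
    static char buf[8192];
    FILE *fp = fopen("out.txt", "w");
    if (!fp) return 1;

    setvbuf(fp, buf, _IOFBF, sizeof(buf));   /* fully buffered, our buffer */
    for (int i = 0; i < 1000; i++)
        fputc('x', fp);                      /* no system call per character */
    fflush(fp);                              /* one write() pushes data down */
    fclose(fp);
    return 0;
}
```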

shared and exclusive

The final problem that I/O software raises is that of shared versus exclusive devices. Some I/O devices can be shared by many users.

For devices such as disks, having many users at the same time is generally not a problem, but some devices must be dedicated: only a single user may use them at a time, and only when that user is finished can another user take over.

Methods of controlling I/O

Next, let's explore how programs control I/O devices. There are three ways to control an I/O device:

  • Programmed I/O
  • Interrupt-driven I/O
  • DMA-based I/O

Programmed I/O (PIO) means that data transfers are initiated by the CPU under the control of driver software, which reads and writes the device's registers or other device memory directly. The CPU issues a command and then waits for the I/O operation to complete.

Since the CPU is much faster than the I/O module, the problem with programmed I/O is that the CPU must wait a long time for the result, polling or busy-waiting the whole time, which severely degrades the performance of the entire system.
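A bare-metal sketch of programmed (polled) I/O is shown below; the register addresses and status bit are hypothetical placeholders for what a real device's datasheet would specify, and code like this would normally live in a driver or firmware, not a user program.

```c
/* Bare-metal sketch of programmed (polled) I/O.
 * Register addresses and the status bit are hypothetical. */
#include <stdint.h>

#define DEV_STATUS  (*(volatile uint32_t *)0x40001000u)   /* hypothetical */
#define DEV_DATA    (*(volatile uint32_t *)0x40001004u)   /* hypothetical */
#define STATUS_BUSY 0x1u

void pio_write(const uint8_t *buf, int len) {
    for (int i = 0; i < len; i++) {
        while (DEV_STATUS & STATUS_BUSY) {
            /* busy-wait: the CPU does nothing useful during the transfer */
        }
        DEV_DATA = buf[i];       /* the CPU itself moves every byte */
    }
}
```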

Interrupt-driven I/O:
Given the shortcomings of programmed I/O above, an improved approach is to let the CPU do other things while it is waiting for the I/O device. When the I/O device finishes, it generates an interrupt, which suspends the currently running process and saves its state so the completed I/O can be handled.

DMA-based I/O:
DMA stands for direct memory access: the CPU grants the I/O module permission to read or write memory directly, without involving the CPU in every transfer. In other words, the data movement itself does not require the CPU's participation.

This process is managed by a chip called the DMA controller (DMAC). Because the device can transfer data directly to and from memory instead of using the CPU as an intermediary, congestion on the bus is relieved.

DMA increases system concurrency by allowing the CPU to perform tasks while the DMA system transfers data across the system and memory buses.

I/O Hierarchy

I/O software is usually organized into four layers; their general structure is shown in the figure below.

[figure omitted: the four layers of I/O software]

Each layer has a clear function and a clear interface to the layers above and below it. In contrast to the way computer networks are usually presented, let's look at these layers from the bottom up.

Below is another diagram, this one showing all the layers of the I/O software system and their main functions.

[figure omitted: layers of the I/O software system and their main functions]

Reactor and Proactor

https://mp.weixin.qq.com/s/px6-YnPEUCEqYIp_YHhDzg


Origin: blog.csdn.net/S_ZaiJiangHu/article/details/129033241