Operating System - Network System

Other articles

Operating System - Overview
Operating System - Memory Management
Operating System - Processes and Threads
Operating System - Interprocess Communication
Operating System - File System
Operating System - Device Management
Operating System - Networking System
When a machine wants to communicate, it sends the content it wants to express in an agreed-upon format. When another machine receives the message, it parses it according to the same format, so it can accurately and reliably recover what the sender meant. This agreed-upon format is the network protocol.


Why is the network layered?

Let's first set up a relatively simple scenario, which the next few sections will use as a running example.


Let's assume that there are three machines involved. Linux server A and Linux server B are on different network segments, and the intermediate Linux server acts as a router for forwarding.


[Figure: Linux server A and Linux server B on different network segments, connected through an intermediate Linux server acting as a router]
Speaking of network protocols, we also need to briefly introduce two network protocol models: the standard OSI seven-layer model and the industry-standard TCP/IP model. Their correspondence is shown in the following figure:
[Figure: correspondence between the OSI seven-layer model and the TCP/IP model]
Why is the network layered? Because the network environment is extremely complex and is not a centrally controlled system. There are hundreds of millions of servers and devices in the world, each running its own system, yet they can all communicate because the same network protocol stack is divided into layers that can be combined to meet the needs of different servers and devices.

Here we briefly introduce several layers of network protocols.
At which level do we start? Start with the third layer, the network layer, because this layer has the familiar IP addresses. Therefore, this layer is also called the IP layer.
The IP addresses we usually see look like this: 192.168.1.100/24. The slash is preceded by the IP address, which is dot-separated into four parts of 8 bits each, for a total of 32 bits. The 24 after the slash means that in the 32 bits, the first 24 bits are the network number, and the last 8 bits are the host number.
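
To make this concrete, here is a minimal C sketch (the address 192.168.1.100 and the /24 prefix are just the example values above) that splits an IPv4 address into its network number and host number with a bit mask:

#include <stdint.h>
#include <stdio.h>
#include <arpa/inet.h>

int main(void) {
    struct in_addr addr;
    inet_pton(AF_INET, "192.168.1.100", &addr);       /* parse the dotted-decimal address */
    uint32_t ip   = ntohl(addr.s_addr);               /* work in host byte order */
    int prefix    = 24;
    uint32_t mask = prefix ? 0xFFFFFFFFu << (32 - prefix) : 0;
    uint32_t net  = ip & mask;                         /* first 24 bits: network number */
    uint32_t host = ip & ~mask;                        /* last 8 bits: host number */

    struct in_addr net_addr = { .s_addr = htonl(net) };
    char buf[INET_ADDRSTRLEN];
    printf("network: %s/%d\n", inet_ntop(AF_INET, &net_addr, buf, sizeof buf), prefix);
    printf("host number: %u\n", (unsigned)host);
    return 0;
}

For 192.168.1.100/24 this prints the network 192.168.1.0/24 and the host number 100.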


Why divide it like this? Although the whole world forms one big Internet and you can reach websites in the United States, this network is not a single monolith. Your residential community has its own network, your company has its own network, and operators such as China Unicom, China Mobile, and China Telecom have theirs, so the big network is divided into many smaller ones.


So how do we distinguish these networks? That is what the network number is for. A network contains multiple devices whose network numbers are the same but whose host numbers differ. If you don't believe it, check the IP addresses of the phone, TV, and computer in your home.


Every device connected to the network has at least one IP address, which is used to locate the device. Whether it is the computer of the classmate sitting next to you or an e-commerce website far away, you can locate it by IP address. Therefore, an IP address is like a mailing address on the Internet: it has a global positioning function.


Even if you want to reach an address in the United States, you can start from the network around you and, by asking your way hop by hop through multiple networks, finally arrive at the destination address, much like a courier delivering a package. The protocol used for asking the way also lives at the third layer and is called a routing protocol. The device that forwards network packets from one network to another is called a router.


In short, what the third layer does is take a network packet from a source IP address, follow the route chosen by the routing protocol through multiple networks, and forward it across multiple routers until it reaches the destination IP address.

Looking down from the third layer, the second layer is the data link layer, sometimes simply called Layer 2 or the MAC layer. A MAC address is the hardware address of a network card (not strictly unique, but unique with a fairly high probability). Although it is also an address, it has no global positioning capability.


It is like a food delivery courier: he cannot find your home from just the last digits of your phone number, but those digits do provide local positioning, and that positioning works mainly by shouting. When the courier reaches your floor, he yells: "Whoever ends in xxxx, your delivery is here!"
The positioning capability of a MAC address is limited to one network: only IP addresses under the same network number can locate and communicate with each other by MAC address. To obtain a MAC address from an IP address, the ARP protocol sends a broadcast packet on the local network, which is exactly that "shouting".
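
On a Linux machine you can see the IP-to-MAC mappings that ARP has already resolved by dumping the kernel's neighbor table (the interface name eth0 and the addresses shown below are only placeholders):

$ ip neigh show dev eth0
192.168.1.1 lladdr 00:11:22:33:44:55 REACHABLE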


Because the number of machines on a single network is limited, addressing by MAC address has the benefit of simplicity: if the MAC address matches, the packet is received; if not, it is ignored. There is nothing as complex as a routing protocol. The disadvantage, of course, is that a MAC address's reach cannot extend beyond the local network, so in cross-network communication the IP addresses stay the same while the MAC addresses are replaced every time the packet passes through a router.


Look at the earlier figure again. When server A sends network packets to server B, the source IP address is always 192.168.1.100 and the destination IP address is always 192.168.2.100. In network 1, however, the source MAC address is MAC1 and the destination MAC address is the router's MAC2; after the router forwards the packet into network 2, the source MAC address becomes the router's MAC3 and the destination MAC address becomes MAC4.


So what the second layer provides is a mechanism for network packets to locate servers and communicate within the local network.




Going further down, the first layer is the physical layer, which consists of the physical devices: for example, the network cable plugged into the computer, or the Wi-Fi we connect to.


Looking up from the third layer, the fourth layer is the transport layer, home to two well-known protocols: TCP and UDP, with TCP being especially widely used. The IP layer's code is only responsible for getting data from one IP address to another; it does not care about packet loss, reordering, retransmission, or congestion. The logic that handles these problems lives in the transport layer's TCP protocol.
We often say TCP is a reliable transport protocol, and achieving that reliability is hard: layers one through three are unreliable, and if a packet is dropped, it is simply gone. It is the TCP layer that uses sequence numbers, retransmission, and other mechanisms to make the inherently unreliable network look "reliable" to the layers above. The application layer enjoys its quiet life only because the TCP layer carries the burden for it.


Above the transport layer is the application layer. For example, the HTTP we type in the browser and the Servlets written on the Java server side both live at this layer.


Layers 2 to 4 are processed in the Linux kernel, and application layers such as browsers, Nginx, and Tomcat are all in user mode. The processing of network packets in the kernel does not distinguish between applications.


From the fourth layer onwards, it is necessary to distinguish which application the network packet is sent to. In the TCP and UDP protocols of the transport layer, there is a concept of ports, and different applications listen to different ports. For example, the server Nginx listens on 80, Tomcat listens on 8080; another example is the client browser listening on a random port, and the FTP client listening on another random port.


The application layer and the kernel communicate through the Socket system call. People often ask which layer Socket belongs to; in fact, it belongs to none of them. Socket is an operating-system concept, not a concept of the network protocol layering. It simply reflects how the operating system chose to implement the protocol stack: the code for layers two through four lives in the kernel, while the layer-seven processing is left to the application itself. The two sides need to talk across the kernel-mode/user-mode boundary, and the system call that completes this connection is the Socket.
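
As a rough illustration of that boundary, here is a minimal sketch of a TCP server in C (error handling omitted; port 8080 is chosen arbitrarily). Every call in it, from socket() through accept(), read(), and write(), crosses from user mode into the kernel's protocol stack:

#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);      /* ask the kernel for a TCP socket */

    struct sockaddr_in addr = {0};
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = htons(8080);

    bind(fd, (struct sockaddr *)&addr, sizeof addr);
    listen(fd, 128);                               /* the kernel now queues connections for this port */

    int conn = accept(fd, NULL, NULL);             /* blocks until a client connects */
    char buf[4096];
    ssize_t n = read(conn, buf, sizeof buf);       /* pull the request out of the kernel's receive buffer */
    if (n > 0)
        write(conn, buf, n);                       /* echo it back, just to show the data path */

    close(conn);
    close(fd);
    return 0;
}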

send packets

After the network is divided into layers, the transmission of data packets is the process of layer-by-layer encapsulation.


As shown in the figure below, Nginx and Tomcat deployed on Linux server B listen on ports 80 and 8080 respectively through Socket. The kernel's data structures therefore know about them: when a packet arrives for one of these two ports, it is delivered to the corresponding process.


On the client side, Linux server A opens Firefox to connect to Nginx; through Socket, the client is assigned a random port, say 12345. Similarly, it opens Chrome to connect to Tomcat and is assigned another random port, say 12346, through Socket.
[Figure: Nginx and Tomcat on Linux server B listening on ports 80 and 8080; the browsers on Linux server A connecting from random ports 12345 and 12346]
In the client browser, the request is encapsulated as an HTTP message and handed to the kernel through the Socket. In the kernel's network protocol stack, the TCP layer creates the data structures that maintain the connection, sequence numbers, retransmission, and congestion control, adds a TCP header to the HTTP message, and passes it to the IP layer; the IP layer adds an IP header and passes it to the MAC layer; the MAC layer adds a MAC header, and the packet is sent out from the hardware network card.
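
The layering can also be seen in the standard Linux header structures. This small sketch simply prints the size each layer prepends (assuming no IP or TCP options), listed outermost first as they appear on the wire:

#include <stdio.h>
#include <net/ethernet.h>   /* struct ether_header */
#include <netinet/ip.h>     /* struct iphdr */
#include <netinet/tcp.h>    /* struct tcphdr */

int main(void) {
    /* On the wire: [Ethernet header][IP header][TCP header][HTTP data] */
    printf("Ethernet header: %zu bytes\n", sizeof(struct ether_header)); /* 14 */
    printf("IP header:       %zu bytes\n", sizeof(struct iphdr));        /* 20, without options */
    printf("TCP header:      %zu bytes\n", sizeof(struct tcphdr));       /* 20, without options */
    return 0;
}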

The network packet first arrives at the switch of network 1. We call a switch a layer-2 device because it only processes packets up to the second layer: it looks at the packet's MAC header, sees that the destination MAC address is reachable through the port on its right, and sends the packet out of that port.

The packet then reaches the Linux router in the middle. The network card on its left receives the packet, finds that the MAC address matches, and hands it to the IP layer. Based on the information in the IP header, the IP layer searches the routing table to decide where the next hop is and which port the packet should leave from; in this example it is sent out from the port on the right. We call a router a layer-3 device because it only handles packets up to the third layer.

The packet leaving the router's right-hand port travels to the switch of network 2, goes through the same layer-2 processing there, and is forwarded out of the switch's right-hand port.

Finally, the network packet is forwarded to Linux server B. Server B finds that the MAC address matches, removes the MAC header, and hands the packet up; the IP layer finds that the IP address matches, removes the IP header, and hands it up again; the TCP layer, using the sequence number and other information in the TCP header, confirms that this is a valid packet, caches it, and waits for the application layer to read it.

The application layer listens to a certain port through Socket, so when reading, the kernel will send the network packet to the corresponding application according to the port number in the TCP header.

The application layer parses the HTTP header and body; through parsing it learns what the client wants, such as purchasing an item or requesting a web page. When the application finishes handling the HTTP request, it encapsulates the result as an HTTP response and sends it to the kernel through the Socket interface.

The kernel encapsulates the response layer by layer and sends it out from the physical network port. It passes through the switch of network 2, the Linux router, and then the switch of network 1, and finally reaches Linux server A. On server A it is decapsulated layer by layer and delivered, via the Socket interface and the client's random port number, to the browser, which can then render the page.


Even in such a simple environment, the process of sending network packets is so complicated.

zero copy

Why do we need DMA technology?

Before DMA technology, the I/O process was like this:

  • The CPU issues the corresponding command to the disk controller and then returns to other work;
  • After the disk controller receives the command, it starts preparing the data, places it into the disk controller's internal buffer, and then raises an interrupt;
  • After the CPU receives the interrupt signal, it stops what it is doing, reads the data from the disk controller's buffer into its own registers one byte at a time, and then writes the data from the registers into memory. During the entire transfer the CPU cannot perform any other task.

In order to facilitate your understanding, I drew a picture:





You can see that the entire transfer requires the CPU to personally move the data, and it can do nothing else in the meantime.


Moving a few characters this way is no problem, but if the CPU has to move the large volumes of data that a gigabit network card or a hard disk produces, it will certainly be overwhelmed.


After computer scientists realized how serious this problem was, they invented DMA, that is, Direct Memory Access technology.


What is DMA? Simply put, when data is transferred between an I/O device and memory, the transfer work is handed entirely to the DMA controller; the CPU no longer takes part in moving the data and is free to handle other tasks.


So what exactly is the process of using the DMA controller for data transfer? Let's take a look at it in detail.





Specific process:

  • The user process calls the read method, sends an I/O request to the operating system, and requests to read data into its own memory buffer, and the process enters a blocking state;
  • After the operating system receives the request, it further sends the I/O request to DMA, and then lets the CPU perform other tasks;
  • DMA further sends I/O requests to disk;
  • The disk receives the I/O request from the DMA and reads the data from the disk into the disk controller's buffer. When the disk controller's buffer is full, it sends an interrupt signal to the DMA to notify it that the buffer is full;
  • DMA receives the signal from the disk and copies the data in the disk controller buffer to the kernel buffer. At this time, the CPU is not occupied, and the CPU can perform other tasks ;
  • When the DMA has read enough data, it will send an interrupt signal to the CPU;
  • The CPU receives the DMA signal and knows that the data is ready, so it copies the data from the kernel to the user space, and the system call returns;

As you can see, during the whole transfer the CPU no longer moves the data itself; that work is done by DMA. The CPU is still essential, however, because it is the CPU that must tell the DMA controller what data to transfer and where to transfer it from and to.
In the early days, DMA only existed on the motherboard. Nowadays, due to the increasing number of I/O devices and the different requirements for data transmission, each I/O device has its own DMA controller.

How bad is traditional file transfer?

If the server wants to provide the function of file transfer, the easiest way we can think of is to read the file on the disk and send it to the client through the network protocol.
With traditional I/O, data is copied back and forth between user space and kernel space on every read and write, while the data in kernel space is read from or written to the disk through the operating system's I/O interface.
The code is usually as follows, which generally requires two system calls:

read(file, tmp_buf, len);    /* copy the file data into a user-space buffer */
write(socket, tmp_buf, len); /* copy that buffer into the socket's kernel buffer */

The code looks simple, just two lines, but a lot happens behind them.

First, there are 4 context switches between user mode and kernel mode, because there are two system calls, one read() and one write(). Each system call must first switch from user mode to kernel mode and, after the kernel finishes its work, switch back from kernel mode to user mode.


The cost of context switching is not small: a single switch takes tens of nanoseconds to a few microseconds. That may sound short, but in high-concurrency scenarios this time easily accumulates and is amplified, hurting system performance.


Second, 4 data copies also occur: two of them are DMA copies, and the other two are performed by the CPU. Let's walk through the process:

  • The first copy moves the data on the disk into the operating system kernel's buffer; this copy is performed by DMA.
  • The second copy moves the data from the kernel buffer into the user's buffer so our application can use it; this copy is performed by the CPU.
  • The third copy takes the data that was just placed in the user's buffer and copies it into the kernel's socket buffer; this is again done by the CPU.
  • The fourth copy moves the data from the kernel's socket buffer into the network card's buffer; this is once more performed by DMA.


Let's look back at the file transfer process. We only moved one piece of data, but ended up moving it 4 times. Excessive data copying will undoubtedly consume CPU resources and greatly reduce system performance.


This simple and traditional file transfer method has redundant context switching and data copying, which is very bad in a high-concurrency system. It adds a lot of unnecessary overhead, which will seriously affect the system performance.


Therefore, in order to improve the performance of file transfer, it is necessary to reduce the number of "context switching between user mode and kernel mode" and "memory copy" .

How to optimize the performance of file transfer?

Let's take a look first, how to reduce the number of "context switches between user mode and kernel mode"?
When reading disk data, context switches happen because user space has no permission to operate the disk or the network card directly; only the kernel has that privilege. So when a process needs such work done, it must go through the system call functions provided by the operating system kernel.


A system call will inevitably have two context switches: first, switch from user mode to kernel mode, and when the kernel finishes executing the task, switch back to user mode and hand it over to the process code for execution.


Therefore, in order to reduce the number of context switches, it is necessary to reduce the number of system calls .


Let's see, how to reduce the number of "data copies"?
As we saw earlier, traditional file transfer involves 4 data copies, and among them the step of "copying from the kernel's read buffer to the user's buffer, then from the user's buffer to the socket buffer" is not necessary.


In the file-transfer scenario we do not "reprocess" the data in user space, so the data does not actually need to be brought into user space at all, which means the user buffer is unnecessary.

How to achieve zero copy

There are usually two ways to implement zero-copy technology:

  • mmap + write
  • sendfile

Let's talk about how they reduce the number of "context switches" and "data copies".

mmap + write

As we saw earlier, the read() system call copies the data from the kernel buffer into the user buffer. To eliminate the overhead of this step, we can replace the read() system call with mmap().

buf = mmap(file, len);   /* map the kernel buffer holding the file data into user space */
write(sockfd, buf, len); /* the kernel copies from that shared buffer into the socket buffer */

The mmap() system call directly "maps" the data in the kernel buffer into user space, so the kernel and user space no longer need to copy that data between each other.

The specific process is as follows:

  • After the application process calls mmap(), DMA copies the data from the disk into the kernel buffer. The application process then "shares" this buffer with the operating system kernel;
  • When the application process then calls write(), the operating system copies the data from the kernel buffer directly into the socket buffer. All of this happens in kernel mode, with the CPU moving the data;
  • Finally, the data in the kernel's socket buffer is copied into the network card's buffer; this step is performed by DMA.

We can see that by using mmap() instead of read(), one data copy is eliminated.
But this is still not ideal zero-copy: the CPU must still copy the data from the kernel buffer into the socket buffer, and 4 context switches are still needed because there are still 2 system calls.

sendfile

Linux kernel version 2.1 introduced a system call dedicated to sending files, sendfile(). Its form is as follows:

#include <sys/socket.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);	

Its first two parameters are the file descriptors of the destination and source respectively, the last two parameters are the offset of the source and the length of the copied data, and the return value is the length of the actual copied data.
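
As a usage sketch (error handling kept minimal; sockfd is assumed to be an already-connected TCP socket), an entire file can be pushed to the peer like this:

#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Send the whole file at `path` over the connected socket `sockfd`. */
ssize_t send_file_over_socket(int sockfd, const char *path) {
    int in_fd = open(path, O_RDONLY);
    if (in_fd < 0)
        return -1;

    struct stat st;
    fstat(in_fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        /* The kernel moves the data toward the socket without bouncing it
         * through a user-space buffer. sendfile() advances `offset` itself. */
        ssize_t n = sendfile(sockfd, in_fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;
    }

    close(in_fd);
    return offset;   /* number of bytes handed to the socket */
}

Because sendfile() updates the offset on each call, the loop only has to keep calling it until the whole file has been consumed.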


First, it replaces the earlier pair of read() and write() system calls with a single call, which removes one system call and therefore the overhead of two context switches.


Second, this system call copies the data in the kernel buffer directly into the socket buffer, without going through user space, so there are only 2 context switches and 3 data copies, as shown in the figure below:

But this is still not true zero-copy. If the network card supports SG-DMA (Scatter-Gather Direct Memory Access, different from ordinary DMA), we can go one step further and eliminate the CPU copy from the kernel buffer into the socket buffer.


You can run the following command on your Linux system to check whether the network card supports the scatter-gather feature:

$ ethtool -k eth0 | grep scatter-gather
scatter-gather: on

Starting from Linux kernel version 2.4, when the network card supports SG-DMA, the sendfile() system call works a little differently. The process is as follows:

  • The first step is to copy the data on the disk to the kernel buffer through DMA;
  • In the second step, only the buffer descriptor and data length are passed to the socket buffer, so the network card's SG-DMA controller can copy the data from the kernel buffer directly into the network card's buffer. The data no longer has to be copied from the operating system's kernel buffer into the socket buffer, which removes one data copy;

Therefore, in this process, only 2 data copies were performed, as shown in the figure below:

This is what is called zero-copy technology: no data is copied at the memory level by the CPU; all of it is moved by DMA.

Compared with the traditional approach, zero-copy file transfer cuts the context switches and data copies in half: a file transfer needs only 2 context switches and 2 data copies, and neither copy goes through the CPU; both are performed by DMA.

Therefore, in general, zero-copy technology can at least double the performance of file transfer.

What does PageCache do?

Looking back at the file transfer process mentioned above, the first step is to copy the disk file data into the "kernel buffer", which is actually the disk cache (PageCache).


Since zero-copy uses PageCache technology, zero-copy can further improve performance. Let's take a look at how PageCache does this.


Reading and writing the disk is much slower than reading and writing memory, so we want to replace "read/write the disk" with "read/write memory": data is moved from the disk into memory via DMA, and reads are then served from memory instead of from the disk.
However, memory is much smaller than the disk, so it can hold only a small portion of the disk's data.


The question is, which disk data to choose to copy to memory?


We know that running programs exhibit "locality": data that has just been accessed is very likely to be accessed again soon. So PageCache caches the most recently accessed data, and when space runs out it evicts the entries that have gone unaccessed the longest.


Therefore, when reading disk data, the kernel first looks in the PageCache: if the data is there, it is returned directly; if not, it is read from disk and then cached in the PageCache.


Another point: reading disk data requires locating it first. On a mechanical disk this means moving the head to the sector where the data lives and then reading "sequentially" from there. This physical motion is time-consuming, and to reduce its impact, PageCache uses a "read-ahead" feature.


For example, suppose the read method reads 32 KB at a time. Although read initially asks only for bytes 0 to 32 KB, the kernel also loads the following 32 to 64 KB into the PageCache, so a later read of that range is very cheap. If the process reads that range before it is evicted from the PageCache, the benefit is significant.


Therefore, the advantages of PageCache are mainly two:

  • cache recently accessed data;
  • read-ahead function;

These two practices will greatly improve the performance of reading and writing disks.
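
An application can also cooperate with these two mechanisms explicitly through posix_fadvise(). The sketch below (fd, offset, and len are assumed to come from the caller) asks for read-ahead on a range and later tells the kernel the pages will not be needed again:

#include <fcntl.h>

/* Hint the kernel about how a file range will be used. */
void pagecache_hints(int fd, off_t offset, off_t len) {
    /* Ask the kernel to pull this range into the PageCache ahead of time
     * (explicit read-ahead). */
    posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);

    /* ... read and process the data ... */

    /* Tell the kernel the range will not be needed again, so its pages can
     * be dropped instead of crowding out "hot" small files. */
    posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}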


However, when transferring large files (GB-scale), PageCache stops helping: the extra DMA copy into the PageCache is wasted work, and performance degrades even if PageCache-based zero-copy is used.

This is because if you have a lot of GB-level files to transfer, whenever users access these large files, the kernel will load them into the PageCache, so the PageCache space is quickly filled with these large files.


In addition, because the files are so large, many parts of them are unlikely to be accessed again, which causes two problems:

  • Because PageCache is occupied by large files for a long time, other "hot" small files may not be able to fully use PageCache, so the performance of disk read and write will decrease;
  • The large-file data in PageCache gets no benefit from the cache, yet it costs an extra DMA copy into the PageCache;

Therefore, large-file transfers should not use PageCache, and hence should not use zero-copy technology either: otherwise the PageCache gets occupied by large files, "hot" small files can no longer use it, and in a high-concurrency environment this causes serious performance problems.

How to transfer large files?

What method should we use for the transmission of large files?


Let's take a look at the initial example. When calling the read method to read a file, the process will actually block in the read method call because it has to wait for the return of the disk data, as shown in the following figure:

The specific process:

  • When the read method is called, the process blocks. The kernel issues an I/O request to the disk; the disk seeks to the data, and when the data is ready it raises an I/O interrupt to tell the kernel;
  • After the kernel receives the I/O interrupt, it copies the data from the disk controller buffer to the PageCache;
  • Finally, the kernel copies the data in the PageCache to the user buffer, so the read call returns normally.

For the blocking problem, asynchronous I/O can be used to solve it. It works as follows:

It divides the read operation into two halves:

  • In the first half, the kernel initiates a read request to the disk, but it can return without waiting for the data to be in place , so the process can handle other tasks at this time;
  • In the second half, when the kernel copies the data from the disk to the process buffer, the process will receive a notification from the kernel and then process the data;


Moreover, we can find that asynchronous I/O does not involve PageCache, so using asynchronous I/O means bypassing PageCache.


I/O that bypasses PageCache is called direct I/O, and I/O that uses PageCache is called cached I/O. Typically, for disk, asynchronous I/O only supports direct I/O.
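
Here is a minimal sketch of "asynchronous I/O + direct I/O" using POSIX AIO on a file opened with O_DIRECT (the file name bigfile.dat is a placeholder, the 4 KB alignment is assumed, and the completion check is simplified to a polling loop):

#define _GNU_SOURCE          /* for O_DIRECT */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096      /* O_DIRECT requires aligned buffers and sizes */

int main(void) {
    int fd = open("bigfile.dat", O_RDONLY | O_DIRECT);   /* bypass the PageCache */

    void *buf;
    posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE);         /* aligned user buffer */

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = BLOCK_SIZE;
    cb.aio_offset = 0;

    aio_read(&cb);                    /* first half: submit the request and return at once */

    /* ... the process is free to do other work here ... */

    while (aio_error(&cb) == EINPROGRESS)
        ;                             /* second half: wait until the kernel reports completion */

    ssize_t n = aio_return(&cb);      /* bytes actually read into buf */
    (void)n;

    free(buf);
    close(fd);
    return 0;
}

Note that glibc's POSIX AIO is implemented with helper threads; production servers more often use Linux-native AIO (io_submit) or io_uring, but the "submit now, get notified later" structure described above is the same.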


As mentioned earlier, the transfer of large files should not use PageCache, because the "hot" small files may not be able to use PageCache because PageCache is occupied by large files.


Therefore, in high concurrency scenarios, "asynchronous I/O + direct I/O" should be used instead of zero-copy technology for the transmission of large files .


There are two common direct I/O application scenarios:

  • The application already caches disk data itself, so there is no need for PageCache to cache it again, avoiding the extra overhead. In MySQL, for example, direct I/O can be enabled through a configuration parameter and is off by default;
  • When transferring large files, it is difficult for large files to hit the PageCache cache, and it will fill up the PageCache, so that "hot" files cannot make full use of the cache, thus increasing the performance overhead. Therefore, direct I/O should be used at this time.

In addition, since direct I/O bypasses PageCache, you cannot enjoy these two kernel optimizations:

  • The kernel's I/O scheduling algorithm will cache as many I/O requests as possible in the PageCache, and finally " merge " them into a larger I/O request and send it to the disk, in order to reduce the addressing operation of the disk;
  • The kernel will also " pre-read " subsequent I/O requests in the PageCache, also to reduce disk operations;

Therefore, when transferring large files, use "asynchronous I/O + direct I/O" to read files without blocking.


Therefore, when transferring files, we have to use different methods according to the size of the file:

  • When transferring large files, use "asynchronous I/O + direct I/O";
  • When transferring small files, "zero-copy technology" is used;

In Nginx, we can use the following configuration to use different methods according to the size of the file:

location /video/ { 
    sendfile on; 
    aio on; 
    directio 1024m; 
}

When the file size is greater than the directio value, "asynchronous I/O + direct I/O" is used; otherwise "zero-copy technology" is used.


Origin blog.csdn.net/zhouhengzhe/article/details/123320542