Research summary on some small problems in network programming

Foreword

I originally planned to go on and summarize a fifth concurrency model, but something felt off. For one thing, I have only just gotten into the Reactor concurrency model and do not understand it very deeply yet, so the write-up always felt lacking; for another, the sample code I had written was not very good and not very consistent in style. So I decided to first summarize some of the more scattered knowledge points I ran into earlier, then study the implementation of the network library in [1] and imitate it with a simple network library of my own, and only after that come back to summarize the Reactor part.

Here are a few small questions:
(1) The "thundering herd" problem.
(2) reuseport in socket network programming.
(3) The principles of select, poll, and epoll.
(4) What the kernel does versus what user space does in the socket API.
(5) ...

Let's explore them one by one.


1. About the "Thundering Herd" Problem

  The thundering herd problem has come up in earlier posts. It happens when multiple processes or threads block waiting on a common resource: when the resource becomes available, they are all woken up together, which wastes resources and degrades performance (the main waste is that after the operating system wakes everyone up, most processes cannot actually get the resource and simply go back to the wait queue, so the context switches were for nothing).

The thundering herd is a broad phenomenon that can show up in many multi-threaded and multi-process programming situations. In network programming there are two common cases: the accept thundering herd, and the thundering herd in IO multiplexing (select, poll, epoll, etc.). The following article explains it very well, so I won't belabor it. ^_^


Detailed Explanation of the Linux Thundering Herd Effect (the most detailed one)

Here is a small question of mine: in the underlying implementation of the IO-multiplexing thundering herd, how does the kernel wake up only "part of" the blocked processes? The explanation in [1] is that the event has already been handled by the time some processes would be woken, so the kernel does not wake the rest. Hmm, that makes sense, but according to my current understanding of how the operating system "wakes up" processes, when a condition is satisfied, all processes blocked on that condition are moved to the ready queue (assuming they are blocked on only that one condition).
  I really hope someone who knows can explain this to me.

2. About reuseport in socket network programming

For the "accept shock group" phenomenon described above, the Linux kernel introduced another solution - REUSEPORT after version 3.9.

The "accept" shock group is that multiple processes or threads are blocked on the same listening socket (listenFd) at the same time. When a connection arrives, the kernel will wake up the blocked processes or threads. After the linux version 2.6 kernel, the kernel will wake up one of the processes or threads. In essence, this is a solution that the modern kernel helps us solve, avoiding the consumption caused by locking.

The reuseport discussed here is the kernel solving the accept thundering herd for us in another form (it may have other uses, which I won't go into here). If a socket has the reuseport option set, it can be bound to the same port as other sockets (which must also set the reuseport option), more precisely to the same addr+port. More concretely, the kernel distributes client requests (destined for that addr+port) across these sockets in a balanced way, as shown in the diagram below.

[Figure: the kernel distributing incoming connections across multiple sockets bound to the same addr+port]
The figure above is taken from [2]; see [2] for a detailed description.
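As a sketch of what "setting the reuseport option" looks like in code, here is my own minimal example (assuming Linux 3.9 or later; the port number and helper name are made up): each worker process creates its own socket, turns on SO_REUSEPORT before bind(), and binds to the same addr+port, after which the kernel load-balances incoming connections among the workers.

```c
/* Minimal SO_REUSEPORT sketch: run one copy of this per worker process.
 * Requires Linux >= 3.9. Port 8888 is just an example. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int make_reuseport_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    /* Every socket that wants to share the port must set this before bind(). */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return -1;
    }
    listen(fd, 128);
    return fd;  /* each process accept()s only the connections routed to it */
}

int main(void) {
    int listenFd = make_reuseport_listener(8888);
    for (;;) {
        int connFd = accept(listenFd, NULL, NULL);
        if (connFd >= 0) close(connFd);   /* ... serve and close ... */
    }
}
```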


Hmm... that covers the basics, but digging a little deeper there are still a few things I don't quite understand:
(1) Since the kernel has already solved the accept thundering herd, why does it also need reuseport?
(2) What is the difference between these two kernel solutions to the thundering herd problem?

Let's study these slowly later.

3. Research on the principles of select, poll, and epoll

The IO multiplexing mechanism can be regarded as a rather "beautiful" mechanism, because it opened up a new way of thinking about concurrent programming.

Long, long ago, most server-side implementations were iterative servers: the next request could only be served after the previous one finished.

Then came multi-process concurrent servers, but they have two basic problems: one is the relatively high overhead (creating, destroying, and switching processes), and the other is communication, since inter-process communication is a bit troublesome.

After the thread model appeared, some concurrent servers began to use threads instead of processes, with each thread handling one service (a client connection or some other form of IO). The multi-threading model has its own inherent problems: one is that the number of concurrent connections is limited by the maximum number of threads in the system (the kernel generally imposes an upper limit), and the other is that because threads share the process address space, communication (shared data) runs into mutual exclusion and synchronization issues (both the programming difficulty and the cost of locking and unlocking; once you fall into a deadlock it is easy to crash, ^_^|||).

Later, some clever person creatively came up with the IO multiplexing mechanism: within a single process or thread, you can block on multiple IOs at the same time (multiple client connections or other forms of IO requests), and when one or more IO events arrive (readable, writable, or exceptional), the call returns and notifies the upper-layer user. This greatly improves server-side concurrency.

Building on IO multiplexing, the most popular IO concurrency framework is the Reactor model.


The origin and basic idea of IO multiplexing are covered above. The following mainly introduces the basic implementation principles of select, poll, and epoll on Linux.

3.1 The basic principle of select

In my rather coarse, high-level understanding, the implementation of select mainly consists of the following steps:

= "select to enter

(1), the user provides the fd_set concerned (including events of interest to each fd: readable, writable, abnormal, etc.) (2), the kernel
copies the fd_set from the user space to the kernel space (please refer to 【4】)
(3), traverse fd_set, and hang the current process (current) in the waiting queue of different fds.
(4) If no fd event of interest occurs, the process is blocked (the schdule_timeout timeout and signal arrival are not considered here). When an event of interest arrives, the blocked process is woken up.
(5) The awakened process re-traverses the events in fd_set, collects the prepared fd, and finally copies it to the user space (the returned thing is the entire file descriptor array, which is not all ready IO, which requires user code self-judgment)

= "select return

The above is my rough understanding; much of the wording may not be precise, so please bear with me.
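To make steps (1) to (5) concrete, here is a minimal user-side select() sketch of my own; sockFd is assumed to be an already-connected socket whose value is below FD_SETSIZE.

```c
/* Minimal select() sketch: wait until sockFd is readable, then read from it.
 * sockFd is assumed to be an already-connected socket with fd < FD_SETSIZE. */
#include <sys/select.h>
#include <unistd.h>

int wait_and_read(int sockFd) {
    fd_set readSet;
    FD_ZERO(&readSet);
    FD_SET(sockFd, &readSet);            /* step (1): the fd_set we care about */

    /* steps (2)-(4) happen inside the kernel during this call */
    int ready = select(sockFd + 1, &readSet, NULL, NULL, NULL);
    if (ready < 0) return -1;

    /* step (5): the whole set comes back, so we must test each fd ourselves */
    if (FD_ISSET(sockFd, &readSet)) {
        char buf[1024];
        ssize_t n = read(sockFd, buf, sizeof(buf));
        return (int)n;
    }
    return 0;
}
```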


For a more detailed discussion, please refer to [3].

To help with understanding, I also "borrowed" a picture from [5] ^_^|||.

[Figure: the internal workflow of select]



Here are some disadvantages of the select mechanism:

  • Because fds are stored in a bitmap, and because of kernel configuration, select supports at most 1024 fds.
  • Every call to select, both entering the kernel and returning from it, has to copy the fd_set between user space and kernel space; if the number of fds is large, the overhead is not small.
  • Inside select there is a big loop ( for(;;) ), and after every wake-up the whole fd_set has to be traversed again (a linear scan, complexity O(N)), which is also a considerable overhead.
  • In essence, select returns the entire file descriptor set to user space, so user code has to traverse it to find out which descriptors actually have IO events.


3.2 The basic principle of poll

The basic principle of poll is similar to select; the difference is that poll describes fds with a list of pollfd structures (select uses a bitmap), so the number of file descriptors it supports is no longer capped at 1024.
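For comparison, here is a minimal poll() sketch of my own (for brevity it assumes the caller passes at most 64 fds; that bound is only an illustration, not a limit of poll itself):

```c
/* Minimal poll() sketch: monitor several fds for readability.
 * The pollfd array is sized by the caller, so there is no fixed 1024 limit. */
#include <poll.h>

int wait_readable(const int *fds, int count) {
    struct pollfd pfds[64];               /* illustration only: assume count <= 64 */
    if (count > 64) count = 64;
    for (int i = 0; i < count; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;          /* interested in readability */
        pfds[i].revents = 0;
    }
    int ready = poll(pfds, count, -1);    /* block until at least one fd is ready */
    if (ready < 0) return -1;
    for (int i = 0; i < count; i++)       /* still a linear scan in user space */
        if (pfds[i].revents & POLLIN)
            return pfds[i].fd;            /* return the first readable fd */
    return -1;
}
```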


3.3 The basic principle of epoll


Epoll basically solves the shortcomings of select listed above. Its basic principle can be briefly explained by the following picture (⁄(⁄⁄•⁄ω⁄•⁄)⁄ also "borrowed" from [5]).

[Figure: epoll's red-black tree of registered events and its ready list]

Please refer to [3], [5] and [6] for a detailed introduction. The following mainly summarizes how epoll solves each of select's shortcomings.

(1) select has a limit on the number of supported fds.
  epoll stores the IO it is interested in on a red-black tree (the part circled in red in the figure above), so in theory there is no limit on the number of fds (it may be bounded by hardware conditions such as memory, which are not considered here).


(2) Every call to select, entering and leaving the kernel, requires copying the fd_set.
   The epoll mechanism has a separate epoll_ctl system call; an fd is copied into the kernel only when it is first inserted (essentially the event of interest is inserted and wrapped into an epitem inside the kernel), and subsequent epoll_wait calls (the blocking call of the epoll mechanism) do not copy fds from user space into the kernel again.
   At the same time, the kernel and user space also share a piece of memory used to hold the ready fd set brought back when epoll_wait returns, which avoids the cost of copying data from kernel space to user space.


(3), about the "embarrassment" of re-scanning linearly every time an event of interest arrives (the process is woken up) in select.
  There are two important data structures in the epoll mechanism, one is a red-black tree (as circled in red in the above figure), which is used to store events of interest registered by users. The other is the ready queue (the part circled in blue in the figure above), which stores the ready IO in the event of interest, (epoll_wait() reads data from this queue.)

  Then why does epoll not need each linear scan? Because when an IO event arrives, its registered callback function (registered in epoll_ctl()) will add the event to the ready queue, so that epoll_wait() does not need to be re-selected like the do_select() function inside select. Iterate over all file descriptors. So the complexity of O(1) is reached.

(4) The issue that select ultimately returns the whole file descriptor set.
   In the implementation of epoll, thanks to the ready queue data structure (not the run queue of ready processes), epoll_wait() essentially waits for data in the ready queue and then pulls that data up to user space (in fact, via the shared memory), so everything handed back is a "genuine" ready IO, and user code no longer has to check each fd itself.
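Putting the three system calls together, here is a minimal epoll sketch of my own (level-triggered, which is the default mode): epoll_create1() builds the instance, epoll_ctl() registers each fd once (the only copy into the kernel), and epoll_wait() returns only the ready events.

```c
/* Minimal epoll sketch (level-triggered): register once, wait repeatedly. */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int run_epoll(int listenFd) {
    int epFd = epoll_create1(0);
    if (epFd < 0) return -1;

    struct epoll_event ev;
    ev.events = EPOLLIN;                  /* event of interest: readable */
    ev.data.fd = listenFd;
    /* copied into the kernel (the epitem on the red-black tree) exactly once */
    epoll_ctl(epFd, EPOLL_CTL_ADD, listenFd, &ev);

    struct epoll_event events[64];
    for (;;) {
        /* only ready events come back; no rescan of every registered fd */
        int n = epoll_wait(epFd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listenFd) {
                int connFd = accept(listenFd, NULL, NULL);
                if (connFd >= 0) {
                    ev.events = EPOLLIN;
                    ev.data.fd = connFd;
                    epoll_ctl(epFd, EPOLL_CTL_ADD, connFd, &ev);
                }
            } else {
                /* ... read from events[i].data.fd, serve, maybe close ... */
            }
        }
    }
}
```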


That's it for the summary of IO multiplexing principles for now. I originally planned to read the source code seriously, but at the time it rather drained me; it is a little difficult. Although I have skimmed it in general, I still don't understand many details, and naturally there is some essence of the source code I haven't grasped yet. For example, some people say the most important thing in the epoll mechanism is its callback mechanism, which is similar to event handling? I'll have to work through that slowly later. ^_^

4. Discussion on the work of the kernel and the work of the user space in the socket api.

This last part mainly talks about what the kernel does versus what user space does.

Generally speaking, those of us who are not kernel (operating system) developers are mostly application developers; we just sit at different layers (framework and system work leans more toward the lower layers of the application, while Java web business development leans more toward the upper layers). We all eventually call the various services the kernel provides (which can be loosely understood as system calls), so in essence the kernel and the application together make up our whole "application" (this can also be understood in terms of user space and kernel space).

When we call a system API, such as the socket API we so often meet in network programming, the kernel does a lot of work for us underneath. For example, during connect/accept, the network protocol stack in the kernel automatically completes the three-way handshake and the rest of the connection setup. When writing (or reading), the upper-layer user merely writes the data to be transmitted into the kernel buffer (to be precise, even that step is done for us by the kernel; we simply invoke the write system call), and then the kernel transmits the data to the remote end according to the TCP protocol.
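As a small illustration of that last point (my own sketch, not from the article): on a non-blocking socket, write() only copies bytes into the kernel send buffer; a failed write with EAGAIN just means that buffer is full for now, while the actual TCP transmission is carried out by the kernel later.

```c
/* Sketch: write() only hands data to the kernel send buffer.
 * sockFd is assumed to be a connected, non-blocking socket. */
#include <errno.h>
#include <unistd.h>

ssize_t send_some(int sockFd, const char *data, size_t len) {
    ssize_t n = write(sockFd, data, len); /* copy into the kernel send buffer */
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0;                         /* kernel send buffer is full; try again later */
    return n;                             /* the kernel transmits these bytes itself */
}
```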

Imagine a scenario with two server processes A and B that provide service using the same concurrency model. Process A has 10 client connections and process B has 100. Here is the question: generally, under normal circumstances (don't be picky, this is just the common case: A and B run in the same environment and their CPUs are comparably idle, ...), within a given period of time, which of A and B occupies more kernel time? In other words, which one does the kernel spend longer serving (blocking on connections, blocking on reads and writes, and so on)?

You can figure it out with your toes: it must be B. Within a given period of time, the kernel, as a hardworking "nurse", has to feed B more milk. But note one small point: if A and B are allocated the same time slice, then since the kernel takes up more of B's time, B has less of its slice left for user-mode work; likewise the kernel takes up less of A's time, so A has more of its slice for user-mode tasks. (Of course, in general, process B should still benefit somewhat from asynchronous interrupts, since a process's time slice is not charged while asynchronous interrupts are handled.)

The above is summed up from my own understanding; if anything is wrong, I hope you will correct me. ^_^|||.


References

[1] Detailed Explanation of the Linux Thundering Herd Effect (the most detailed one)
[2] Must-know socket and TCP connection process
[3] Research and summary of the select/poll/epoll principle
[4] epoll principle analysis #2: select & poll
[5] The principle and difference of select, poll, and epoll
[6] epoll principle analysis #3: epoll
