C10K to C10M: truly understanding high concurrency

1. What limits the number of TCP connections a machine can hold?
The open-file-descriptor limit: the default is 1024 per process, which can be raised.
The port-number limit: the port is a 16-bit field in the TCP header, so there are at most 65535 ports, of which those above 1024 are generally usable. A client can therefore open at most roughly 60,000 outbound connections to a single destination; a server, which distinguishes connections by the full (source IP, source port, destination IP, destination port) tuple, can hold far more: multiply by the number of client IPs.
In the end, though, the number of TCP connections is still bounded by memory and other operating-system resources, but reaching hundreds of thousands on one machine is no problem today.
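As a concrete illustration, here is a minimal sketch using the standard POSIX `getrlimit`/`setrlimit` calls to inspect and raise the per-process descriptor limit, the same knob that `ulimit -n` adjusts from the shell (raising the soft limit above the hard limit requires privileges):

```c
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur, (unsigned long long)rl.rlim_max);

    rl.rlim_cur = rl.rlim_max;   /* raise the soft limit up to the hard limit */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit");
    return 0;
}
```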

C10K problem
In the early days of the web, 100 simultaneous users was a lot. In the Web 1.0 era the audience grew, but users mostly just downloaded HTML pages to browse. In the Web 2.0 era, users grew geometrically and interactions became complex, so a site's concurrent TCP connections could easily exceed ten thousand. The original servers were built on the process/thread-per-connection model (typically Apache's multi-process/multi-threaded model): each new TCP connection gets its own process (or thread). A process is one of the most expensive resources the operating system manages, and a machine cannot create very many of them. At C10K that means 10,000 processes, which a single machine cannot bear.

Symptom: performance scales nonlinearly with the number of connections and with machine performance. A new server with 2x the hardware often cannot handle 2x the concurrent throughput, because under the wrong strategy the cost of many operations grows linearly with the current number of connections n: per-task resource consumption is O(n), so total cost is O(n^2). The more connections there are, the longer each individual connection takes.

**Essence:** an operating-system problem: the synchronous blocking I/O model. Too many processes and threads are created, data is copied repeatedly (disk file -> page cache -> user buffer -> kernel socket buffer -> NIC), and process/thread context switching burns enormous amounts of CPU, until the operating system is overwhelmed and CPU resources are drained. That is the essence of the C10K problem!
(The cost of finding a ready connection also grows with the number of connections, because select/poll must linearly scan the descriptors; epoll is needed to make readiness discovery independent of the total connection count.)

Solution: non-blocking I/O + an I/O multiplexing mechanism.
Assigning a process to every connection is unrealistic; instead, wait until data has actually arrived before assigning a thread to process it, avoiding meaningless context switches.

select uses an fd_set to check whether events have occurred. Its disadvantages: the fixed fd_set size limit (FD_SETSIZE, typically 1024), the overhead of copying the set into the kernel and checking descriptors one by one, and the overhead of re-initializing the set before every call.
poll mainly solves the first and third problems of select: it passes the events of interest to the kernel through a pollfd array, eliminating the hard upper limit on file descriptors, and uses separate fields (events/revents) for requested and returned events, avoiding repeated initialization.
epoll goes further: when epoll_wait returns, it hands the application only the file descriptors whose state has changed (i.e., that probably have data ready), instead of the whole set.
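A minimal sketch of what such an epoll-based event loop looks like (level-triggered, error handling trimmed; `listen_fd` is assumed to be an already-listening socket):

```c
#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64

void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Blocks until at least one fd is ready; only ready fds are returned,
           so per-wakeup cost is independent of the total connection count. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(epfd, EPOLL_CTL_ADD, conn, &cev);
            } else {
                /* read and process events[i].data.fd ... */
            }
        }
    }
}
```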
Since epoll, kqueue, and IOCP each have their own interface characteristics, porting programs between them is very difficult, so these interfaces are wrapped in encapsulation layers that make them easy to use and portable; the libevent library is one such wrapper.
(In fact, these only solve the readiness-discovery and switching problem; the data-copying problem remains, and zero-copy techniques such as sendfile can do better there, as sketched below.)
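For that copying problem, here is a minimal sketch of `sendfile(2)`: the file's pages move from the page cache to the socket inside the kernel, never crossing into user space (the helper name `send_whole_file` is mine):

```c
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

ssize_t send_whole_file(int sock_fd, const char *path) {
    int file_fd = open(path, O_RDONLY);
    if (file_fd < 0)
        return -1;

    struct stat st;
    fstat(file_fd, &st);

    off_t offset = 0;
    /* Kernel copies page cache -> socket directly; no user-space buffer. */
    ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size);
    close(file_fd);
    return sent;   /* may be short; a real server loops until offset == st.st_size */
}
```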

To sum up: don't make the CPU do irrelevant work, such as blocking while waiting for data to arrive, process context switching, process creation and destruction, or data copying.
Make the most of the CPU.

The C10M problem is not unsolvable.
The OS kernel is not the solution to the C10M problem; on the contrary, the OS kernel is the problem.
As of now, an x86 server with 40 Gbps NICs, 32 cores, and 256 GB of RAM costs a few thousand dollars on Newegg. Hardware like that can handle more than 10 million concurrent connections. If it can't, it's because you chose the wrong software, not because of the underlying hardware.

The core of the problem: don't let the OS kernel do the heavy lifting. Offload packet processing, memory management, and processor scheduling from the kernel and let the application handle them.

For example, Intel's DPDK designs a fast path for the data plane.

1. The Linux kernel has been a time-sharing system from the start; its scheduler (CFS) puts fairness first. Our actual workloads may need something quite different, so it is best to let the application do its own scheduling, relieving the kernel of scheduling work and eliminating meaningless scheduling.
2. Core binding: without it, tasks may migrate between CPU cores, which, especially on today's NUMA architectures, causes context switches and cache-hit-rate problems (see the sketch after this list).
3. Let packets interact with the application layer directly. The Linux protocol stack is complex and cumbersome; packets that traverse it suffer a huge performance drop, and it consumes a lot of memory. DPDK, for example, runs its NIC driver at the application layer, so data never passes through the kernel. (Of course, packets addressed to the host itself must still go through the kernel.)
4. Enable the huge-page mechanism to reduce address-translation (TLB) overhead (4 KB pages become 2 MB pages).
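To make point 2 concrete, a minimal sketch of core binding on Linux with `sched_setaffinity` (the helper name `pin_to_core` is mine; pinning each worker thread to its own core keeps its caches warm and avoids cross-NUMA migration):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```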

IO Model Exploration
A typical I/O operation involves 4 copies.
1. On the server, the read path receives the request: data is copied from the disk/NIC into kernel buffers (page cache or socket buffer), then copied into user space for processing. On the write path, after processing, data is copied into the kernel socket buffer, then copied to the NIC for sending.

When designing a server-side concurrency model, there are two key questions:
1) How does the server manage connections and obtain input data? (blocking, non-blocking polling, I/O multiplexing, signals)
2) How does the server process requests? (single-threaded, multi-threaded, thread pool)

An input operation usually consists of two distinct phases:
1) waiting for data to be ready;
2) copying data from the kernel to the process.

Blocking I/O: wait until the data arrives, suspending the thread in the meantime. This generally requires one thread per connection and is rarely used at scale. Its advantage is that a blocked thread consumes no CPU.

Non-blocking I/O: if there is no data, the call does not block but returns immediately, and the application keeps polling; latency is good, but constantly asking the kernel burns a lot of CPU time and system resource utilization is low, so general-purpose web servers do not use this I/O model on its own.
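A minimal sketch of how a socket is switched into this mode with `fcntl`; once set, `read()`/`accept()` return `-1` with `errno == EAGAIN` (or `EWOULDBLOCK`) instead of suspending the thread:

```c
#include <fcntl.h>

int set_nonblocking(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);   /* fetch current file status flags */
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
```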

I/O multiplexing: the CPU no longer busy-polls; the application is notified as soon as data is ready. There is no polling cost, and one descriptor set can monitor a large number of sockets, saving resources.

Signal-driven I/O model: when the data is ready, the process receives a SIGIO signal, and the I/O function can be invoked from the signal handler (or the handler can flag the main loop) to process the data.
Advantage: the thread is not blocked while waiting for data, which improves resource utilization, whereas I/O multiplexing blocks when there is no data.
Signal-driven I/O is useful for UDP sockets, where the signal simply means a datagram has arrived or an asynchronous error has occurred.
For TCP, however, signal-driven I/O is almost useless: too many conditions trigger the notification (TCP has connection state: connection established, connection closed, close request initiated, close completed, data arrived, data sent, and so on), and distinguishing them one by one is expensive, while each signal also costs an interrupt-style context switch. This wipes out all the advantages over the previous methods.
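A minimal sketch of setting up signal-driven I/O on a UDP socket with `F_SETOWN` + `O_ASYNC` (the handler only sets a flag, since doing real work inside a signal handler is unsafe; `sock_fd` is assumed to be an already-bound UDP socket):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t data_ready = 0;

static void on_sigio(int sig) {
    (void)sig;
    data_ready = 1;   /* do the recvfrom() in the main loop, not here */
}

void enable_sigio(int sock_fd) {
    struct sigaction sa = { 0 };
    sa.sa_handler = on_sigio;
    sigaction(SIGIO, &sa, NULL);

    fcntl(sock_fd, F_SETOWN, getpid());        /* deliver SIGIO to this process */
    int flags = fcntl(sock_fd, F_GETFL, 0);
    fcntl(sock_fd, F_SETFL, flags | O_ASYNC);  /* turn on signal-driven mode */
}
```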

All of the above are synchronous I/O models.
Asynchronous I/O models.
The "synchronous" refers to the second phase: after the data arrives, we must take part in copying it from kernel space to user space ourselves.
If what we ultimately want is to read the data, why must we first issue a select request to ask about its status and only then issue the real read request? Couldn't there be a once-and-for-all way: I send a single request telling the kernel I want the data, then stop worrying about it, and the kernel does all the rest for me?
That would be the most efficient and least blocking approach.
But it requires the operating system to do a great deal, and such support has not been stable. For a long time only Windows IOCP did this well; Linux later added AIO, but it is unreliable, so I/O multiplexing remains the mainstream technique.
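For completeness, a minimal sketch of the POSIX AIO interface (`<aio.h>`, link with `-lrt`). Note that glibc implements it with background threads, which is part of why Linux AIO has the reputation described above; the helper names are mine:

```c
#include <aio.h>
#include <errno.h>
#include <string.h>

/* Submit a read; returns immediately while the I/O proceeds in the background. */
int submit_async_read(int fd, char *buf, size_t len, struct aiocb *cb) {
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = 0;
    return aio_read(cb);
}

/* Later: check for completion without blocking. */
int async_read_done(struct aiocb *cb, ssize_t *nread) {
    if (aio_error(cb) == EINPROGRESS)
        return 0;                 /* still running */
    *nread = aio_return(cb);      /* bytes read, or -1 on error */
    return 1;
}
```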

The above covered how to handle connections and obtain input.
Now: how should requests be processed?
1. One thread per connection. This has many problems: under high concurrency, memory usage is huge; threads block when there is no data; and frequent thread creation and destruction also costs CPU.

2. Reactor pattern
The classic combination is I/O multiplexing + a thread pool. Variants: single-threaded reactor (Redis uses this), multi-threaded reactor, and the master-slave reactor pattern (reading I/O events also takes time, so it is distributed across multiple sub-reactors).
Here we must ask whether the bottleneck is the I/O event-dispatch part or the thread-pool processing part. Generally it is the thread pool: the number of worker threads is usually the core count, or slightly more (say 4 or 8 extra), while we commonly have 100, 1,000, or 10,000 concurrent clients. The thread pool stays very busy; even if the reactor sheds work by dispatching fewer ready I/O events, the thread pool's queue just grows longer, and waiting requests may even overflow it.
(Here is the key point of our analysis: accepting connections and monitoring/dispatching I/O are not the bottleneck; the bottleneck is that requests cannot be processed fast enough, and the biggest drag is that the message queue is shared by multiple threads fetching tasks, so it must be locked, which greatly reduces throughput. **Speeding up how worker threads fetch messages from the queue is therefore critical.**)
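A minimal sketch of the shared task queue behind a typical thread pool makes that contention point visible: every worker must take the same mutex to fetch work (the type and function names here are illustrative):

```c
#include <pthread.h>
#include <stddef.h>

typedef struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
} task_t;

typedef struct {
    task_t *head, *tail;
    pthread_mutex_t lock;      /* the single lock every worker contends on */
    pthread_cond_t  not_empty;
} task_queue_t;

void queue_push(task_queue_t *q, task_t *t) {
    pthread_mutex_lock(&q->lock);
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

task_t *queue_pop(task_queue_t *q) {   /* called by every worker thread */
    pthread_mutex_lock(&q->lock);
    while (q->head == NULL)
        pthread_cond_wait(&q->not_empty, &q->lock);
    task_t *t = q->head;
    q->head = t->next;
    if (q->head == NULL) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return t;
}
```

Common mitigations are per-worker queues with work stealing, or lock-free multi-producer queues, both of which shrink or remove this critical section.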

The Proactor model
We saw that the reactor is a non-blocking synchronous model: waiting for I/O readiness does not block, but the thread still has to wait while data is copied from the kernel to user space. Ideally that copy, too, would be handed off to the operating system asynchronously.
Disadvantages of using asynchronous I/O: programming complexity (on Linux it appeared in kernel 2.6 and is still imperfect today), and memory usage: the buffer must stay allocated for the whole duration of the read or write, which may last an indeterminate time, and every concurrent operation needs its own buffer. In the Reactor model, by contrast, no buffer needs to be held before a socket is actually ready to read or write.

What is high concurrency?
It usually refers to the number of requests the system can handle per unit of time. Simply put: QPS (queries per second).

Note: high concurrency is really about squeezing CPU resources effectively. For compute-intensive tasks such as encryption and decryption, it is meaningless to talk about high concurrency, because the CPU is already fully used; just add machines.
What we discuss here are I/O-intensive workloads, including network and database I/O.

Controlling the variables
(Why do we focus only on the service layer and not the other layers? Because for high concurrency every layer can be a bottleneck; here we mainly study the techniques that improve concurrency at the service layer.)
[Figure: overall layered architecture, from client through load balancing, service layer, and cache layer to the persistence layer]
To achieve high concurrency, the load balancer, service layer, cache layer, and persistence layer must all be highly available and high-performing. Even at step 5 we can optimize by compressing static files, pushing them with HTTP/2, and serving them from a CDN. Whole books could be written on optimizing each layer.
This article mainly discusses the service layer (the part circled in red in the figure) and no longer considers the effects of databases and caches.

[Figure: the step-by-step evolution of server concurrency models]
This progression squeezes CPU resources step by step: block as little as possible, and spend as little CPU as possible on everything else (context switching, data copying, and so on).
A coroutine switch does not cross from user mode into kernel mode; it is essentially just swapping a few pointers and takes only a moment. (How these coroutines differ from Rust's and Go's coroutines is a topic for another time.)
Another advantage of coroutines is that synchronously written code achieves asynchronous performance.
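To see why a coroutine switch is so cheap, here is a minimal sketch using POSIX `ucontext`: `swapcontext` just saves and restores registers and a stack pointer. (One caveat: glibc's `swapcontext` does issue a syscall to save the signal mask; production coroutine libraries avoid even that, but the control-flow idea is the same.)

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;

static void coroutine_body(void) {
    printf("in coroutine\n");
    swapcontext(&co_ctx, &main_ctx);   /* yield back to main */
    printf("coroutine resumed\n");
}

int main(void) {
    static char stack[64 * 1024];      /* the coroutine's private stack */

    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = stack;
    co_ctx.uc_stack.ss_size = sizeof(stack);
    co_ctx.uc_link          = &main_ctx;   /* where to go when the coroutine returns */
    makecontext(&co_ctx, coroutine_body, 0);

    swapcontext(&main_ctx, &co_ctx);   /* enter the coroutine */
    printf("back in main\n");
    swapcontext(&main_ctx, &co_ctx);   /* resume it */
    printf("coroutine finished\n");
    return 0;
}
```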

In any case, when testing, push CPU utilization as close to full as possible without blocking, and make sure the CPU is doing the right work rather than context switching, thread creation and destruction, data copying, address translation, cross-core/NUMA access, interrupt handling, and the like.
C10M is the same idea.

Finding the problem is often harder than solving it. Once we truly understand high concurrency, we see that high concurrency and high performance are not limited by the programming language, only by your thinking.

Origin: blog.csdn.net/weixin_53344209/article/details/130816632