Computing high performance [single server high performance]

From the architect's perspective, special attention must be paid to the design of high-performance architecture, which mainly focuses on two aspects:

  • Try to maximize the performance of a single server
  • If a single server cannot meet the performance requirements, design a server cluster solution

In addition to the above two points, whether the final system achieves high performance also depends on the specific implementation and coding. But architecture design is the foundation of high performance: if the architecture is not designed for high performance, there is limited room for improvement at the implementation and coding level. Vividly speaking, the architecture design determines the upper limit of system performance, and the implementation details determine the lower limit.

Single server high performance

One of the keys to the high performance of a single server is the network programming model adopted by the server, whose design has two key points:

  • How the server manages connections
  • How the server handles requests

And these two points are ultimately related to the I/O model and process model of the operating system:

  • I/O model: blocking, non-blocking, synchronous, asynchronous
  • Process model: single process, multi process, multi thread, single thread

PPC

PPC is the abbreviation of Process per Connection: every time a new connection arrives, a new process is created to handle that connection's requests. This is the model adopted by traditional UNIX network servers. The basic flow is as follows (a minimal code sketch follows the list):
(Figure: PPC model flow)

  • The parent process accepts the connection, then forks a child process
  • The child process handles the connection's requests, then closes the connection
  • After forking the child, the parent process calls close. This does not actually close the connection; it only decrements the reference count of the connection's file descriptor. Only after the child process also calls close does the reference count drop to 0, at which point the operating system actually closes the connection
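
A minimal PPC sketch in C, assuming POSIX sockets; the port number and the echo-style "business processing" are illustrative, and error handling is mostly omitted for brevity:

```c
/* Minimal PPC (Process per Connection) sketch: fork one child per connection. */
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void) {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);             /* example port */
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 128);
    signal(SIGCHLD, SIG_IGN);                /* let the kernel reap children */

    for (;;) {
        int connfd = accept(listenfd, NULL, NULL);
        if (connfd < 0) continue;
        if (fork() == 0) {                   /* child: handle this connection */
            close(listenfd);                 /* the child does not accept */
            char buf[1024];
            ssize_t n = read(connfd, buf, sizeof(buf));
            if (n > 0) write(connfd, buf, n);  /* echo as "business processing" */
            close(connfd);                   /* refcount drops to 0 here */
            exit(0);
        }
        close(connfd);  /* parent: only decrements the descriptor refcount */
    }
}
```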

The PPC model is simple to implement and is well suited to situations where the number of connections is not large, such as a database server. For ordinary business servers, before the rise of the Internet, server traffic and concurrency were not that large, and this model actually worked quite well. After the rise of the Internet, server concurrency and traffic grew dramatically from dozens to tens of thousands of connections, and the disadvantages of this model became prominent, mainly in the following aspects:

  • Fork is costly: from the operating system's perspective, creating a process is expensive. Many kernel resources must be allocated, and the memory image must be copied from the parent to the child. Even though modern operating systems use copy-on-write when duplicating the memory image, the overall cost of creating a process remains high
  • Communication between parent and child processes is complicated: during fork, file descriptors can be passed from parent to child via the copied memory image, but once the fork completes, parent-child communication is troublesome and requires IPC (Interprocess Communication) schemes. For example, if a child process needs to tell the parent how many requests it handled before closing, so that the parent can keep global statistics, the two must communicate through an IPC mechanism
  • A growing number of processes puts pressure on the operating system: if each connection lives long and new connections keep arriving, the number of processes keeps increasing, and the frequency of process scheduling and context switching rises with it, putting ever greater pressure on the system. Therefore, in general, the maximum number of concurrent connections a PPC solution can handle is only a few hundred

prefork

Different solutions emerged to address the shortcomings of the PPC model. In PPC, a new process is forked to handle each connection only when it arrives; because fork is expensive, users may perceive slow access. The prefork model solves this problem: processes are created in advance (pre-fork) when the system starts, before it begins accepting user requests. When a new connection arrives, the fork operation is eliminated, so access is faster and the user experience is better. The basic diagram of prefork:
(Figure: prefork model flow)

  • The key to implementing prefork is that multiple child processes accept on the same socket. When a new connection arrives, the operating system guarantees that only one process's accept succeeds. But there is a problem here: the "thundering herd" phenomenon. Although only one child process can accept successfully, all child processes blocked on accept are woken up, which leads to unnecessary process scheduling and context switches. The operating system can solve this: the Linux kernel resolved the accept thundering-herd problem as of version 2.6 (a code sketch follows this list)
  • Like PPC, the prefork model still suffers from complicated parent-child communication and a limited number of concurrent connections, so it does not have many practical applications today
  • The Apache server provides the MPM prefork mode, which is recommended for sites that require reliability or compatibility with old software. By default, it supports a maximum of 256 concurrent connections
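
A hedged prefork sketch in C, assuming POSIX; listenfd is set up (socket/bind/listen) as in the PPC example, and the pool size is illustrative:

```c
/* Prefork sketch: create the children first, then all of them accept on the
   shared listening socket; the kernel lets exactly one accept succeed. */
#include <unistd.h>
#include <sys/socket.h>

#define NUM_CHILDREN 8   /* illustrative pool size */

void serve(int listenfd) {            /* called after bind()/listen() */
    for (int i = 0; i < NUM_CHILDREN; i++) {
        if (fork() == 0) {            /* child: loop forever on accept */
            for (;;) {
                int connfd = accept(listenfd, NULL, NULL);
                if (connfd < 0) continue;
                /* read -> business processing -> write, as in PPC */
                close(connfd);
            }
        }
    }
    for (;;) pause();                 /* parent just keeps the pool alive */
}
```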

TPC

TPC is the abbreviation of Thread per Connection: every time a new connection arrives, a new thread is created to handle that connection's requests. Compared with processes, threads are lighter: the cost of creating a thread is much lower than that of creating a process, and because multiple threads share the process's memory space, thread communication is simpler than inter-process communication. TPC therefore solves, or at least mitigates, PPC's problems of expensive fork and complicated parent-child communication.

The basic process of TPC is as follows:

  • The parent process accepts the connection (accept in the figure)
  • The parent process creates a child thread (pthread in the figure)
  • The child thread handles the read and write requests of the connection (the child thread read, business processing, write in the figure)
  • The child thread closes the connection (close in the child thread in the figure)

(Figure: TPC model flow)
It is not difficult to see that, compared with PPC, the main process does not need to close the connection. The reason is that the child thread shares the main process's address space, so the connection's file descriptor is not duplicated, and only a single close is required.
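
A minimal TPC sketch in C, assuming POSIX threads; listenfd is set up (socket/bind/listen) as in the PPC example, and the echo step stands in for business processing:

```c
/* TPC (Thread per Connection) sketch: one new thread per connection. */
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/socket.h>

static void *handle_conn(void *arg) {
    int connfd = (int)(intptr_t)arg;
    char buf[1024];
    ssize_t n = read(connfd, buf, sizeof(buf));
    if (n > 0) write(connfd, buf, n);   /* echo as "business processing" */
    close(connfd);                      /* the fd is shared, so one close only */
    return NULL;
}

void serve(int listenfd) {
    for (;;) {
        int connfd = accept(listenfd, NULL, NULL);
        if (connfd < 0) continue;
        pthread_t tid;
        pthread_create(&tid, NULL, handle_conn, (void *)(intptr_t)connfd);
        pthread_detach(tid);            /* no join: the thread cleans itself up */
    }
}
```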

Although TPC solves the problems of high fork cost and complicated process communication, it also introduces new problems:

  • First, although creating a thread is cheaper than creating a process, it is not free; under high concurrency (such as tens of thousands of connections per second), performance problems remain
  • Second, inter-process communication is no longer needed, but mutual exclusion and sharing between threads introduce complexity and can easily lead to deadlocks
  • Finally, threads affect each other: when one thread misbehaves (for example, through an out-of-bounds memory access), it may bring down the entire process
  • Besides these new problems, TPC still pays the cost of CPU thread scheduling and switching. The TPC solution is therefore essentially similar to PPC, and in scenarios of several hundred concurrent connections, PPC is adopted more often, because it carries no deadlock risk, the processes do not affect each other, and stability is higher

prethread

  • In the TPC model, a new thread is created to handle each connection as it arrives. Although creating a thread is lighter than creating a process, it still has a cost, and the prethread model exists to remove it
  • Similar to prefork, the prethread model creates threads in advance and then starts accepting user requests. When a new connection arrives, the thread-creation step is skipped, so access feels faster and the experience is better
  • Because data sharing and communication between threads are more convenient, prethread can actually be implemented more flexibly than prefork. Common implementations are as follows (a sketch of the second follows this list):
    • The main process accepts, then hands the connection to a thread for processing
    • All child threads try to accept; in the end only one thread's accept succeeds
      (Figure: prethread model flow)
      The MPM worker mode of the Apache server is essentially a prethread scheme, with a slight improvement: the Apache server creates multiple processes, and each process creates multiple threads. This is mainly for stability; even if a thread in one child process misbehaves and brings that child process down, other child processes continue to provide service, so the whole server does not go down.
      In theory, prethread can support more concurrent connections than prefork. Apache's MPM worker mode supports 16 × 25 = 400 concurrent processing threads by default.
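
A hedged prethread sketch in C (the "all threads accept" variant), assuming POSIX threads; listenfd is set up as in the earlier examples, and the thread count is illustrative:

```c
/* Prethread sketch: threads are created up front; all block on accept,
   and the kernel lets exactly one accept succeed per connection. */
#include <pthread.h>
#include <unistd.h>
#include <sys/socket.h>

#define NUM_THREADS 8

static void *acceptor(void *arg) {
    int listenfd = *(int *)arg;
    for (;;) {
        int connfd = accept(listenfd, NULL, NULL);  /* one thread wins */
        if (connfd < 0) continue;
        /* read -> business processing -> write */
        close(connfd);
    }
    return NULL;
}

void serve(int listenfd) {
    pthread_t tids[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tids[i], NULL, acceptor, &listenfd);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tids[i], NULL);
}
```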

Reactor

The main problem of the PPC solution is that a process is created for each connection and destroyed when the connection ends, which is a great waste. A natural idea for solving this is resource reuse: instead of creating a process per connection, create a process pool and assign connections to processes, so that one process can handle the business of multiple connections.

  • Once the resource-pool approach is introduced, a new question arises: how can one process efficiently handle the business of multiple connections? When a process handles a single connection, it can follow the flow "read -> business processing -> write"; if the connection has no data to read, the process blocks on the read. Blocking is fine in the one-connection-per-process scenario, but if a process handles multiple connections and blocks on the read of one of them, it cannot process the others even when they have readable data. Clearly this cannot achieve high performance
  • The simplest fix is to make the read non-blocking and have the process continuously poll all its connections. This solves the blocking problem but is not elegant: polling consumes CPU, and if a process handles tens of thousands of connections, polling is very inefficient

To solve this better, the natural idea is to process a connection only when it has data on it. This is the origin of I/O multiplexing. The term "multiplexing" is more common in the communications industry, for example time-division multiplexing (GSM), code-division multiplexing (CDMA), and frequency-division multiplexing (GSM), where it means "the process and technology of transmitting multiple signals or data streams over one channel." Carrying that meaning directly into computing causes confusion: on the surface, a "channel" in communications resembles a "connection" in computing, and a "data stream" in communications resembles data in computing, so copying the communications definition leads one to understand multiplexing as "transmitting multiple streams of data over one connection," which is far from what I/O multiplexing actually is. In computer networking, "multiplex" refers to multiple connections, and "multiplexing" means that multiple connections share the same blocking object. What that blocking object is depends on the implementation: on Linux, with select it is the fd_set that select uses; with epoll it is the file descriptor created by epoll_create.

In summary, I/O multiplexing has the following two key implementation points (a code sketch follows the list):

  • When multiple connections share a blocking object, the process only needs to wait on one blocking object instead of polling all connections
  • When a connection has new data that can be processed, the operating system will notify the process, and the process returns from the blocked state to start business processing
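
A minimal select()-based sketch of these two points in C, assuming POSIX; the fixed-size client array is illustrative:

```c
/* select() multiplexing sketch: one process waits on one blocking object
   (the fd_set) for many connections instead of polling each one. */
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CLIENTS 1024

void event_loop(int listenfd) {       /* listenfd already bound + listening */
    int clients[MAX_CLIENTS];
    int nclients = 0;
    for (;;) {
        fd_set readfds;
        FD_ZERO(&readfds);
        FD_SET(listenfd, &readfds);
        int maxfd = listenfd;
        for (int i = 0; i < nclients; i++) {
            FD_SET(clients[i], &readfds);
            if (clients[i] > maxfd) maxfd = clients[i];
        }
        /* Block here until some connection has an event. */
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) <= 0) continue;
        if (FD_ISSET(listenfd, &readfds) && nclients < MAX_CLIENTS) {
            int c = accept(listenfd, NULL, NULL);
            if (c >= 0) clients[nclients++] = c;
        }
        for (int i = 0; i < nclients; i++) {
            if (!FD_ISSET(clients[i], &readfds)) continue;
            char buf[1024];
            ssize_t n = read(clients[i], buf, sizeof(buf));
            if (n <= 0) {             /* peer closed: drop this connection */
                close(clients[i]);
                clients[i--] = clients[--nclients];
            } else {
                write(clients[i], buf, n);   /* echo as "business processing" */
            }
        }
    }
}
```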

I/O multiplexing combined with a thread pool neatly solves the problems of the PPC and TPC models, and this combination was given a very good name: Reactor. The word here means "event reaction," or colloquially, "when an event comes, I react accordingly." The Reactor pattern is also called the Dispatcher pattern (many open source systems have a class with this name, which actually implements the Reactor pattern), which is closer to the pattern's own meaning: I/O multiplexing monitors events uniformly, and received events are dispatched to a process or thread.
The core components of the Reactor model are the Reactor and the processing resource pool (a process pool or thread pool): the Reactor is responsible for monitoring and dispatching events, and the resource pool is responsible for handling them. At first glance Reactor seems simple to implement, but combined with different business scenarios, its concrete implementation is flexible and variable, mainly in the following two respects:

  • The number of Reactor can be changed: it can be one Reactor or multiple Reactor
  • The number of resource pools can be changed: taking the process as an example, it can be a single process or multiple processes (threads are similar)

Combining these two factors yields four theoretical combinations, but because the "multi-Reactor single-process" scheme is more complicated than "single-Reactor single-process" without any performance advantage, it remains purely theoretical and has no practical application. The Reactor pattern therefore has the following three typical implementation schemes:

  • Single Reactor single process / single thread
  • Single Reactor multi-threaded
  • Multi-Reactor multi-process/thread.

Whether a scheme uses processes or threads depends mostly on the programming language and platform: Java generally uses threads (for example, Netty), while C uses either processes or threads (for example, Nginx uses processes and Memcache uses threads).

Single Reactor single process/thread

The schematic diagram of the single-Reactor single-process/thread scheme is as follows (taking the process as an example):
(Figure: single Reactor single process model)
Select, accept, read, and send are standard network programming APIs; dispatch and "business processing" are operations the developer must implement.
The scheme is described in detail as follows (a code sketch follows the list):

  • The Reactor object monitors connection events through select, and distributes them through dispatch after receiving the event
  • If it is a connection establishment event, it is handled by Acceptor. Acceptor accepts the connection through accept and creates a Handler to handle various subsequent events after the connection
  • If it is not a connection establishment event, Reactor will call the Handler corresponding to the connection (the Handler created in step 2) to respond
  • Handler will complete the complete business process of read->business processing->send
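
A minimal sketch of this loop in C using epoll on Linux; listenfd is set up as in earlier examples, the echo step stands in for "business processing," and error handling is mostly omitted:

```c
/* Single-Reactor single-process sketch using epoll (Linux). */
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void reactor_loop(int listenfd) {
    int epfd = epoll_create1(0);                  /* the Reactor's blocking object */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listenfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);
    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1); /* "select" step: wait for events */
        for (int i = 0; i < n; i++) {             /* "dispatch" step */
            int fd = events[i].data.fd;
            if (fd == listenfd) {                 /* Acceptor: new connection */
                int connfd = accept(listenfd, NULL, NULL);
                if (connfd < 0) continue;
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = connfd };
                epoll_ctl(epfd, EPOLL_CTL_ADD, connfd, &cev);
            } else {                              /* Handler: read -> process -> send */
                char buf[1024];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } else {
                    write(fd, buf, r);            /* echo as "business processing" */
                }
            }
        }
    }
}
```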

The advantage of the single-Reactor single-process model is that it is very simple: there is no inter-process communication and no process competition; everything is done within one process. But its shortcomings are also obvious:

  • There is only one process, so a multi-core CPU cannot be exploited; the only way to use multiple cores is to deploy multiple instances of the system, but that adds operational complexity: instead of maintaining one system, you must maintain several on one machine
  • While the Handler is processing the business of one connection, the process cannot handle events on other connections, which easily leads to a performance bottleneck

Therefore, the single-Reactor single-process scheme has few practical application scenarios and is only suitable for cases where business processing is very fast. The best-known open source software that uses single Reactor single process is Redis.

Systems written in C generally use single Reactor single process, because there is no need to create threads within the process; systems written in Java generally use single Reactor single thread, because the Java virtual machine is itself a process containing many threads, of which the business thread is only one.

Single Reactor multi-threaded

To avoid the shortcomings of the single-Reactor single-process/thread scheme, the obvious step is to introduce multiple processes or threads, which leads to the second scheme: single Reactor multi-threaded. The schematic diagram of the single-Reactor multi-thread scheme is as follows:
(Figure: single Reactor multi-thread model)
The scheme is described in detail as follows:

  • In the main thread, the Reactor object monitors connection events through select, and distributes them through dispatch after receiving the event
  • If it is a connection establishment event, it is handled by Acceptor. Acceptor accepts the connection through accept and creates a Handler to handle various subsequent events after the connection
  • If it is not a connection establishment event, Reactor will call the Handler corresponding to the connection (the Handler created in step 2) to respond
  • The Handler is only responsible for responding to events and does not perform business processing; after the Handler reads the data through read, it will be sent to the Processor for business processing
  • The Processor completes the real business processing in an independent child thread, then sends the result to the main thread's Handler; after receiving it, the Handler returns the response to the client through send

The single-reactor multi-thread solution can make full use of the processing power of multi-core and multi-CPU, but it also has the following problems:

  • Multi-threaded data sharing and access are more complicated. For example, after a child thread completes business processing, the result must be passed back to the main thread's Reactor for sending, which requires mutual exclusion and protection of shared data (see the sketch after this list). Taking Java's NIO as an example, Selector is thread-safe, but the key set returned by Selector.selectedKeys() is not, so the processing of selected keys must be single-threaded or protected by synchronization
  • The Reactor, responsible for monitoring and responding to all events, runs only in the main thread, and becomes a performance bottleneck under instantaneous high concurrency
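
To illustrate that shared-data handoff, here is a hedged sketch in C of a mutex-protected task queue between the Reactor thread and worker threads; the names (enqueue_task, worker) are illustrative, not from any particular framework:

```c
/* Single-Reactor multi-thread handoff sketch: the Reactor thread enqueues
   work, worker threads dequeue and process it under a shared lock. */
#include <pthread.h>
#include <stdlib.h>

typedef struct task {
    int connfd;            /* connection the request arrived on */
    struct task *next;
} task_t;

static task_t *head = NULL, *tail = NULL;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  qcond = PTHREAD_COND_INITIALIZER;

/* Called from the Reactor thread after read(): hand the request to workers. */
void enqueue_task(int connfd) {
    task_t *t = malloc(sizeof(*t));
    t->connfd = connfd;
    t->next = NULL;
    pthread_mutex_lock(&qlock);          /* mutual exclusion on shared queue */
    if (tail) tail->next = t; else head = t;
    tail = t;
    pthread_cond_signal(&qcond);
    pthread_mutex_unlock(&qlock);
}

/* Worker thread: dequeue, do the business processing, pass the result back. */
void *worker(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (head == NULL)
            pthread_cond_wait(&qcond, &qlock);
        task_t *t = head;
        head = t->next;
        if (head == NULL) tail = NULL;
        pthread_mutex_unlock(&qlock);
        /* ... business processing for t->connfd; the response would then be
           handed back to the Reactor thread (e.g. via another queue plus a
           wakeup pipe) so that the Handler can send() it ... */
        free(t);
    }
    return NULL;
}
```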

The "single Reactor multi-thread" scheme is mentioned here, but the "single Reactor multi-process" scheme is not mentioned. The main reason is that if multiple processes are used, after the child process completes the business processing, the result is returned to the parent process and the parent process is notified to send it to Which client is very troublesome, because the parent process only monitors events on each connection through Reactor and then distributes it. The child process is not a connection when communicating with the parent process. If you want to simulate the communication between the parent process and the child process as a connection, and add Reactor to monitor, it is more complicated. When using multithreading, because multithreading shares data, it is very convenient to communicate between threads. Although additional consideration should be given to the synchronization problem when sharing data between threads, the complexity is lower than the complexity of inter-process communication mentioned above. a lot of

Multi-Reactor Multi-Process/Thread

To solve the problems of single Reactor multi-threading, the most intuitive method is to turn the single Reactor into multiple Reactors, which leads to the third scheme: multi-Reactor multi-process/thread.
The schematic diagram of the multi-Reactor multi-process/thread scheme is as follows (taking processes as an example):
(Figure: multi-Reactor multi-process model)
The scheme is described in detail as follows:

  • The mainReactor object in the parent process monitors connection-establishment events through select; after receiving an event, it accepts the new connection through the Acceptor and assigns it to a child process
  • The subReactor of the child process adds the connection allocated by the mainReactor to the connection queue for monitoring, and creates a Handler to handle various events of the connection
  • When a new event occurs, subReactor will call the corresponding Handler (that is, the Handler created in step 2) to respond
  • Handler completes the complete business process of read -> business processing -> send

The multi-reactor multi-process/thread solution seems to be more complicated than the single-reactor multi-thread, but the actual implementation is simpler. The main reasons are as follows:

  • The responsibilities of the parent process and the child process are very clear, the parent process is only responsible for receiving new connections, and the child process is responsible for completing subsequent business processing
  • The interaction between the parent process and the child process is very simple, the parent process only needs to pass the new connection to the child process, and the child process does not need to return data
  • The child processes are independent of each other and need no synchronization or sharing (this is limited to the network-model operations such as select, read, and send; "business processing" may still need synchronization and sharing)

At present, the well-known open source system that uses multi-Reactor multi-process is Nginx; well-known systems that use multi-Reactor multi-thread include Memcache and Netty.

Nginx adopts the multi-Reactor multi-process model, but its scheme differs from the standard one: the main process only creates the listening port and does not create a mainReactor to accept connections; instead, each child process's Reactor accepts connections, with a lock ensuring that only one child process is accepting at a time. After a child process accepts a new connection, it handles the connection in its own Reactor and does not hand it off to other child processes.
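
As a hedged sketch of the multi-Reactor multi-process layout, the following assumes Linux's SO_REUSEPORT option (kernel 3.9+) in place of Nginx's accept lock: each child process owns its own listening socket and its own epoll-based Reactor; the port and child count are illustrative:

```c
/* Multi-Reactor multi-process sketch via SO_REUSEPORT (Linux 3.9+). */
#include <stdlib.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <sys/epoll.h>

static void child_reactor(void) {
    int listenfd = socket(AF_INET, SOCK_STREAM, 0);
    int on = 1;
    setsockopt(listenfd, SOL_SOCKET, SO_REUSEPORT, &on, sizeof(on));
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);                 /* example port */
    bind(listenfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(listenfd, 128);

    int epfd = epoll_create1(0);                 /* this child's own Reactor */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listenfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listenfd, &ev);
    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            /* accept on listenfd, or read/process/send on a connection,
               exactly as in the single-Reactor loop shown earlier */
            (void)events[i];
        }
    }
}

int main(void) {
    for (int i = 0; i < 4; i++)                  /* e.g. one child per core */
        if (fork() == 0) { child_reactor(); exit(0); }
    for (;;) pause();                            /* parent just supervises */
}
```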

Proactor

Reactor is a non-blocking synchronous network model, because the real read and send operations are performed synchronously by the user process. "Synchronous" here means the user process performs I/O operations such as read and send itself. If the I/O operations are made asynchronous, performance can be improved further; this is the asynchronous network model, Proactor.

The usual Chinese translation of Proactor, "proactive device," is hard to understand; the related word is "proactive," meaning "active," so it is easier to grasp as "active device." Reactor can be understood as "when an event comes, I notify you and you handle it," while Proactor can be understood as "when an event comes, I handle it, and I notify you when it is done." Here "I" is the operating system kernel, and the "events" are I/O events such as new connections, data ready to read, and data ready to write.

Proactor model diagram:
(Figure: Proactor model)

The scheme is described in detail as follows:

  • Proactor Initiator is responsible for creating Proactor and Handler, and registering both Proactor and Handler to the kernel through the Asynchronous Operation Processor
  • The Asynchronous Operation Processor is responsible for processing registration requests and completing I/O operations
  • Asynchronous Operation Processor notifies Proactor after completing I/O operations
  • Proactor calls back different Handlers for business processing according to different event types
  • The Handler completes the business processing; it can also register new Handlers with the kernel

Theoretically, Proactor is more efficient than Reactor: asynchronous I/O can make full use of the DMA feature and allow I/O operations and computation to overlap. However, true asynchronous I/O requires a great deal of work from the operating system. At present, true asynchronous I/O is implemented through IOCP under Windows, while AIO under Linux is imperfect, so Reactor remains the main model for high-concurrency network programming under Linux. Even though Boost.Asio claims to implement the Proactor model, it actually uses IOCP on Windows, while on Linux it simulates the asynchronous model with the Reactor pattern (using epoll).
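
For illustration, a minimal POSIX AIO sketch on Linux (link with -lrt). Since glibc emulates POSIX AIO with user-space threads, which echoes the point above about Linux AIO being imperfect, this example reads a file rather than a socket, and the polling loop stands in for a real completion notification (signal or callback):

```c
/* Proactor-style flow with POSIX AIO: submit the read, compute while it is
   in flight, then collect the completed result. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("/etc/hostname", O_RDONLY);   /* example file */
    if (fd < 0) return 1;

    char buf[256];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    aio_read(&cb);                    /* submit: the kernel/library does the I/O */

    /* The caller is free to do other work here while the read is in flight. */
    while (aio_error(&cb) == EINPROGRESS)
        ;                             /* real code would use a signal/callback */

    ssize_t n = aio_return(&cb);      /* completion: collect the result */
    if (n > 0) fwrite(buf, 1, (size_t)n, stdout);
    close(fd);
    return 0;
}
```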
