Learning architecture from scratch: high-performance computing

1. Overview

        High performance is what every programmer pursues. Whether building a system or writing a piece of code, we all hope to achieve high performance, and high performance is also the most complicated part of the job. Disk, operating system, CPU, memory, cache, network, programming language, database, architecture, and more can all affect a system's performance. A single inappropriate debug log line or a poorly chosen index can drop a server from 30,000 TPS to 8,000 TPS; a tcp_nodelay parameter can stretch the response time from 2 ms to 40 ms. Achieving high performance is therefore a complex and challenging matter, and every stage of the software development process affects whether high performance can ultimately be achieved.

Where are the design points for high-performance architecture design?

(1) Try to improve the performance of a single server, squeezing as much as possible out of one machine.

(2) If a single server cannot support the required performance, design a server cluster solution.

(3) Pay attention to the concrete implementation and coding: architectural design determines the upper limit of system performance, while implementation details determine the lower limit.

High performance architecture design map:



2. Single-server high performance

What are the key points for high performance on a single server?

(1) How the server manages connections (I/O model: blocking, non-blocking, synchronous, asynchronous).

(2) How the server handles requests (single process, multi-process, multi-thread).

What are the high-performance modes for a single server?



  • PPC: every time there is a new connection, a new process is created to handle that connection's requests; the improved version, prefork, creates processes in advance.

  • TPC: every time there is a new connection, a new thread is created to handle that connection's requests; the improved version, prethread, creates threads in advance.

  • Reactor: also known as the Dispatcher pattern; a synchronous, non-blocking network model in which I/O multiplexing monitors events uniformly and, after an event is received, dispatches it to a process or thread.

  • Proactor: an asynchronous network model in which completed asynchronous I/O events are dispatched to the corresponding event handlers.



2.1 PPC

        PPC is the abbreviation of Process Per Connection. Every time there is a new connection, a new process is created to specifically handle the connection request. This is the model used by traditional UNIX network servers. The basic flow chart is:





        Description: the parent process accepts the connection (accept in the figure), the parent process forks a child process (fork in the figure), the child process handles the connection's read and write requests (read, business processing, write in the child process in the figure), and the child process closes the connection (close in the child process in the figure).

        Note: after the parent process forks the child process, it calls close on the connection directly. This looks like it closes the connection, but in fact it only decrements the reference count of the connection's file descriptor by one. The connection is really closed only after the child process also calls close; the operating system does not actually close the connection until the reference count of the corresponding file descriptor drops to 0.
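        To make the flow concrete, here is a minimal PPC sketch in Python using os.fork (Unix only); the port, the SIGCHLD handling, and the echo "business processing" are illustrative assumptions rather than part of the original description.

```python
import os
import signal
import socket

HOST, PORT = "0.0.0.0", 9000            # illustrative values

# Auto-reap exited children so they do not linger as zombies (Unix only).
signal.signal(signal.SIGCHLD, signal.SIG_IGN)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind((HOST, PORT))
listener.listen(128)

while True:
    conn, addr = listener.accept()      # parent: accept
    pid = os.fork()                     # parent: fork one child per connection
    if pid == 0:                        # child process
        listener.close()                # the child does not need the listening socket
        data = conn.recv(4096)          # read
        conn.sendall(data)              # "business processing" + write (plain echo here)
        conn.close()                    # child close: refcount finally drops to 0
        os._exit(0)
    else:                               # parent process
        conn.close()                    # only decrements the connection fd's refcount
```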

        The PPC mode is simple to implement and is better suited to servers that do not have many connections, such as database servers. For ordinary business servers, before the rise of the Internet, traffic and concurrency were not that high, and this model actually worked quite well; the world's first web server, CERN httpd, adopted this model (see https://en.wikipedia.org/wiki/CERN_httpd for details). After the rise of the Internet, server concurrency and traffic grew dramatically from dozens of connections to tens of thousands, and the drawbacks of this model became prominent, mainly in the following aspects:

        Fork is expensive: from the operating system's perspective, creating a process is very costly. It has to allocate many kernel resources and copy the parent process's memory image to the child. Even though modern operating systems use copy-on-write (COW) when copying the memory image, the overall cost of creating a process remains high.

        Communication between parent and child processes is complicated: when the parent process forks the child process, the file descriptor can be passed from parent to child through the copied memory image, but after the fork completes, communication between parent and child is more troublesome and requires an IPC (Inter-Process Communication) mechanism. For example, if the child process needs to tell the parent process how many requests it handled before closing, so that the parent can keep global statistics, the two processes must exchange this information through an IPC scheme.

        The number of supported concurrent connections is limited: if each connection lives for a long time and new connections keep coming in, the number of processes keeps growing, and the operating system spends more and more time on process scheduling and context switching. Under normal circumstances, a PPC solution can therefore handle at most a few hundred concurrent connections.

        Because fork is so expensive, an improved version, prefork, emerged. The system creates processes in advance when it starts and then begins accepting user requests; when a new connection comes in, the fork can be skipped, so users get faster responses and a better experience. The basic diagram of prefork is:





        The key to the implementation of prefork is that multiple child processes accept on the same socket; when a new connection arrives, the operating system guarantees that only one process ultimately accepts it successfully. This, however, introduces the "thundering herd" phenomenon.

        Thundering herd: although only one child process can accept successfully, all child processes blocked on accept are woken up, which causes unnecessary process scheduling and context switches (the Linux kernel has solved the accept thundering-herd problem since version 2.6).
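        A minimal prefork sketch under the same assumptions: the listening socket is created first, a fixed number of child processes are forked in advance, and each child blocks in accept on the shared socket, exactly the situation in which the thundering-herd wakeups described above can occur.

```python
import os
import socket

HOST, PORT, NUM_CHILDREN = "0.0.0.0", 9000, 4   # illustrative values

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind((HOST, PORT))
listener.listen(128)

def child_loop(sock):
    while True:
        conn, _ = sock.accept()   # all children block here; the kernel lets one win
        data = conn.recv(4096)
        conn.sendall(data)        # trivial "business processing": echo
        conn.close()

for _ in range(NUM_CHILDREN):     # processes are created in advance, before any request
    if os.fork() == 0:            # child
        child_loop(listener)
        os._exit(0)

os.wait()                         # parent just waits; the children do all the work
```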

        The prefork mode, like PPC, still suffers from complicated parent-child process communication and a limited number of concurrent connections, so it has few practical applications today.

        The Apache server provides the MPM prefork mode, which is recommended for sites that require reliability or compatibility with old software; by default it supports at most 256 concurrent connections.

2.2 TPC

        TPC is the abbreviation of Thread Per Connection, which means that every time there is a new connection, a new thread is created to specifically process the connection request.

        Compared with processes, threads are more lightweight, and the cost of creating threads is much less than that of processes; at the same time, multi-threads share the process memory space, and thread communication is simpler than process communication.

        TPC essentially solves, or at least mitigates, PPC's problems of expensive fork and complicated parent-child communication.





        Description: the parent process accepts the connection (accept in the figure), the parent process creates a child thread (pthread in the figure), the child thread handles the connection's read and write requests (read, business processing, write in the child thread in the figure), and the child thread closes the connection (close in the child thread in the figure).

        Note: compared with PPC, the main process does not need to close the connection. The reason is that the child thread shares the address space of the main process, so the connection's file descriptor is not duplicated and only needs to be closed once.
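        A minimal TPC sketch under the same assumptions (illustrative port, echo business logic): the main thread accepts, and each connection gets its own thread; only the handling thread closes the connection, since the descriptor is shared rather than copied.

```python
import socket
import threading

HOST, PORT = "0.0.0.0", 9000      # illustrative values

def handle(conn):
    data = conn.recv(4096)        # read
    conn.sendall(data)            # "business processing" + write (plain echo here)
    conn.close()                  # closed once only: the fd is shared, not copied

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind((HOST, PORT))
listener.listen(128)

while True:
    conn, _ = listener.accept()   # main thread: accept
    threading.Thread(target=handle, args=(conn,), daemon=True).start()  # thread per connection
```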

        Although TPC solves the problems of high fork cost and complex process communication, it also introduces new problems, specifically as follows:

(1) Although creating a thread is cheaper than creating a process, it is not free; there are still performance problems under high concurrency;

(2) There is no need for inter-process communication, but mutual exclusion and sharing between threads introduce new complexity and can accidentally lead to deadlocks;

(3) Threads can affect one another: when an exception occurs in one thread, it may cause the whole process to exit;

        In addition to introducing new problems, TPC also has problems with CPU thread scheduling and switching costs.

        In essence, the TPC scheme is similar to the PPC scheme. In scenarios with a few hundred concurrent connections, the PPC scheme is actually used more often, because PPC carries no deadlock risk and its processes do not affect one another, so its stability is higher.

        To address the cost of creating threads in TPC, the prethread approach was derived by analogy with prefork. Prethread creates threads in advance and then starts accepting user requests; when a new connection comes in, thread creation can be skipped, so users get faster responses and a better experience. MySQL adopts the prethread approach.

        Since data sharing and communication between multiple threads are more convenient, the implementation of prethread is actually more flexible than that of prefork. The common implementation methods are as follows:

(1) The main process accepts and then hands the connection to a thread for processing (sketched after the Apache MPM worker note below).

(2) The child threads all try to accept, but in the end only one thread accepts successfully. The basic diagram of the solution is as follows:





        The MPM worker mode of the Apache server is essentially a prethread solution, with a slight improvement: Apache first creates multiple processes and then creates multiple threads within each process. This is done mainly for stability; even if a thread in one child process misbehaves and causes that whole child process to exit, the other child processes continue to provide service, so the entire server does not go down.
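        A minimal sketch of prethread implementation method (1) above, using a pre-created pool of worker threads; the port, pool size, and echo logic are illustrative assumptions (and CPython's ThreadPoolExecutor actually spins threads up lazily, up to max_workers).

```python
import socket
from concurrent.futures import ThreadPoolExecutor

HOST, PORT, NUM_THREADS = "0.0.0.0", 9000, 8    # illustrative values

def handle(conn):
    data = conn.recv(4096)
    conn.sendall(data)            # trivial "business processing": echo
    conn.close()

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind((HOST, PORT))
listener.listen(128)

pool = ThreadPoolExecutor(max_workers=NUM_THREADS)   # worker threads managed by the pool
while True:
    conn, _ = listener.accept()   # the main thread accepts...
    pool.submit(handle, conn)     # ...and hands the connection to a pooled thread
```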

        High concurrency needs to be divided according to two conditions: the number of connections and the number of requests.

  1. Massive connections (tens of thousands) and massive requests: e.g., flash sales, Double Eleven, 12306;

  2. Constant connections (dozens to hundreds) and massive requests: e.g., middleware;

  3. Massive connections and constant requests (QPS over a thousand): e.g., portal websites;

  4. Constant connections and constant requests (QPS in the dozens or hundreds): e.g., internal operation and management systems.

        PPC and TPC should be more suitable for systems with relatively large throughput, long connections and a small number of connections.

        BIO: blocking I/O; PPC and TPC belong to this type.
        NIO: multiplexed I/O; Reactor is built on this technique.
        AIO: asynchronous I/O; Proactor is built on this technique.

        I/O multiplexing is a mechanism through which a process/thread can monitor multiple connections. Once a connection is ready (usually read-ready or write-ready), it can notify the program to perform corresponding read and write operations.

        I/O multiplexing waits for any of multiple file descriptors to become ready at the same time and is provided in the form of system calls. If no file descriptor is ready, the system call blocks; otherwise it returns and the user can perform the subsequent operations.

        When multiple connections share a blocking object, the process only needs to wait on one blocking object without polling all connections. Common implementation methods include select, epoll, kqueue, etc.

        epoll is an I/O multiplexing technique on Linux that can handle millions of socket handles very efficiently. A file descriptor is registered in advance with epoll_ctl(); once that descriptor becomes ready, the kernel uses a callback-like mechanism to activate it quickly, and the process is notified when it calls epoll_wait() (unlike select, epoll does not traverse all file descriptors but relies on this callback mechanism).

        After being notified, select/poll scans the socket list to see which sockets are readable; ordinary socket polling, by contrast, means repeatedly calling the read operation.

The difference between epoll and select:

(1) The number of handles select can watch is limited. The linux/posix_types.h header contains the line #define __FD_SETSIZE 1024, which means select can monitor at most 1024 fds at the same time; epoll has no such limit and is bounded only by the maximum number of open file handles.

(2) The biggest advantage of epoll is that its efficiency does not drop as the number of FDs grows. select polls over an array-like data structure, while epoll maintains a ready queue and only needs to check whether that queue is empty. epoll operates only on "active" sockets (in the kernel implementation, epoll attaches a callback to each fd), and only active sockets invoke the callback that adds their handle to the queue; idle handles do not.

        If there are not a large number of idle connections or dead connections, epoll will not be much more efficient than select/poll.

(3) Use of mmap to accelerate message passing between the kernel and user space. Whether select/poll or epoll is used, the kernel must notify user space of FD events, so avoiding unnecessary memory copies matters; epoll does this by having the kernel and user space mmap the same memory region.
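        A minimal Linux-only sketch of the epoll workflow just described (register with epoll_ctl, wait with epoll_wait), using Python's select.epoll wrapper; the port and echo handling are illustrative assumptions.

```python
import select
import socket

HOST, PORT = "0.0.0.0", 9000                   # illustrative values

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind((HOST, PORT))
listener.listen(128)
listener.setblocking(False)

ep = select.epoll()                            # Linux only
ep.register(listener.fileno(), select.EPOLLIN) # like epoll_ctl(EPOLL_CTL_ADD, ...)
conns = {}

while True:
    for fd, events in ep.poll():               # like epoll_wait(): only ready fds come back
        if fd == listener.fileno():
            conn, _ = listener.accept()
            conn.setblocking(False)
            ep.register(conn.fileno(), select.EPOLLIN)
            conns[conn.fileno()] = conn
        elif events & select.EPOLLIN:
            conn = conns[fd]
            data = conn.recv(4096)
            if data:
                conn.sendall(data)             # echo; a real server would buffer writes
            else:                              # peer closed the connection
                ep.unregister(fd)
                conns.pop(fd).close()
        else:                                  # EPOLLHUP / EPOLLERR: clean up
            ep.unregister(fd)
            conns.pop(fd).close()
```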

2.3 Reactor

        I/O multiplexing combined with a thread pool neatly solves the problems of PPC and TPC, and this combination has a rather cool name: Reactor. The Reactor pattern is also called the Dispatcher pattern (you will see classes with this name in many open source systems; they are actually implementing the Reactor pattern), a name that is closer to the meaning of the pattern itself: I/O multiplexing monitors events uniformly and, after receiving an event, dispatches it to a process or thread.

        The core components of the Reactor pattern include Reactor and processing resource pools (process pools or thread pools), where Reactor is responsible for listening and distributing events, and the processing resource pool is responsible for processing events.

Combined with different business scenarios, the specific implementation plan of the model can be flexible and changeable, such as:

  • Single Reactor single process (thread);

  • Single Reactor multi-thread;

  • Multi-Reactor multi-process (thread);

  • Multi-Reactor single process/thread (meaningless);

        Whether processes or threads are chosen in the schemes above depends more on the programming language and platform: Java generally uses threads (for example, Netty), while C can use either processes or threads; for example, Nginx uses processes and Memcache uses threads.

2.3.1 Single Reactor single process (thread)





You can see that there are three objects in the process: Reactor, Acceptor, and Handler:

  • The role of the Reactor object is to listen and distribute events;

  • The role of the Acceptor object is to obtain the connection;

  • The role of the Handler object is to handle business;

        Select, accept, read, and send are standard network programming APIs, and dispatch and business processing are operations that need to be completed.

Process description:

        The Reactor object monitors connection events through select, and distributes them through dispatch after receiving the events;

        If it is a connection establishment event, it is handled by the Acceptor. The Acceptor accepts the connection through accept and creates a Handler to handle various subsequent events of the connection;

        If it is not a connection establishment event, Reactor will call the Handler corresponding to the connection (Handler created in step 2) to respond, and the Handler will complete the complete business process of read->business processing->send;
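        A minimal single-Reactor, single-thread sketch of the flow above, using Python's selectors module (which selects epoll on Linux); the Reactor loop, Acceptor, and Handler mirror the three objects in the figure, while the port and the uppercase-echo "business processing" are illustrative assumptions.

```python
import selectors
import socket

sel = selectors.DefaultSelector()    # the Reactor's multiplexer (epoll on Linux)

class Handler:                       # handles one established connection
    def __init__(self, conn):
        self.conn = conn
        sel.register(conn, selectors.EVENT_READ, self)

    def handle(self):
        data = self.conn.recv(4096)           # read
        if data:
            self.conn.sendall(data.upper())   # "business processing" + send
        else:
            sel.unregister(self.conn)
            self.conn.close()

class Acceptor:                      # handles connection-establishment events
    def __init__(self, host, port):
        self.listener = socket.socket()
        self.listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.listener.bind((host, port))
        self.listener.listen(128)
        self.listener.setblocking(False)
        sel.register(self.listener, selectors.EVENT_READ, self)

    def handle(self):
        conn, _ = self.listener.accept()      # accept
        conn.setblocking(False)
        Handler(conn)                         # create a Handler for subsequent events

def reactor_loop():                  # the Reactor: monitor events and dispatch them
    while True:
        for key, _ in sel.select():           # select: wait for events
            key.data.handle()                 # dispatch to the Acceptor or a Handler

if __name__ == "__main__":
    Acceptor("0.0.0.0", 9000)                 # illustrative port
    reactor_loop()
```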

        Advantages: The mode is very simple, there is no inter-process communication, and there is no process competition;

        Disadvantages: There is only one process and the performance of multi-core CPU cannot be exerted; when the Handler is processing the business on a certain connection, the entire process cannot handle the events of other connections, which can easily lead to performance bottlenecks.

        Applicable scenarios: Only applicable to scenarios where business processing is very fast. Currently, the more famous open source software that uses a single Reactor and a single process is Redis.

        Redis 6.0 also introduced multi-threading, but the basic design is still single-process, single-threaded processing: only reading, protocol parsing, and writing are multi-threaded, while command execution remains single-threaded.

2.3.2 Single Reactor multi-threading





Process description:

        The Reactor object in the main thread monitors connection events through select, and distributes them through dispatch after receiving the events;

        If it is an event of connection establishment, it will be processed by the Acceptor, which accepts the connection through accept and creates a Handler to handle various subsequent events of the connection;

        If it is not a connection establishment event, Reactor will call the Handler corresponding to the connection (the Handler created in step 2) to respond;

The Handler is only responsible for responding to events and does not perform business processing; after the Handler reads the data through read, it will be sent to the Processor for business processing;

        The Processor completes the real business processing in a separate child thread and then sends the result back to the Handler in the main thread; after the Handler receives the response, it returns the result to the client through send;
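        A condensed single-Reactor, multi-thread sketch under the same assumptions: the Handler only reads and sends, and a Processor thread pool does the business processing. For brevity the result is sent from the worker's completion callback; a production implementation would hand the write back to the Reactor thread and buffer it.

```python
import selectors
import socket
from concurrent.futures import ThreadPoolExecutor

sel = selectors.DefaultSelector()             # the Reactor runs in the main thread
pool = ThreadPoolExecutor(max_workers=4)      # Processor pool for business processing

def business_processing(data: bytes) -> bytes:
    return data.upper()                       # placeholder for real (possibly slow) work

class Handler:
    def __init__(self, conn):
        self.conn = conn
        sel.register(conn, selectors.EVENT_READ, self)

    def handle(self):
        data = self.conn.recv(4096)           # read in the Reactor (main) thread
        if not data:
            sel.unregister(self.conn)
            self.conn.close()
            return
        future = pool.submit(business_processing, data)   # hand off to a Processor thread
        # Small responses assumed; a real implementation would re-register the socket
        # for EVENT_WRITE and let the Reactor thread perform the send.
        future.add_done_callback(lambda f: self.conn.sendall(f.result()))

class Acceptor:
    def __init__(self, host, port):
        self.listener = socket.socket()
        self.listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.listener.bind((host, port))
        self.listener.listen(128)
        self.listener.setblocking(False)
        sel.register(self.listener, selectors.EVENT_READ, self)

    def handle(self):
        conn, _ = self.listener.accept()
        conn.setblocking(False)
        Handler(conn)

if __name__ == "__main__":
    Acceptor("0.0.0.0", 9001)                 # illustrative port
    while True:
        for key, _ in sel.select():           # the single Reactor dispatches all events
            key.data.handle()
```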

        Although it has overcome the shortcomings of the single Reactor single process/thread solution and can make full use of the processing power of multi-core CPUs, it also has the following problems:

(1) Multi-threaded data sharing and access are relatively complex and involve mutual exclusion and protection mechanisms for shared data;

(2) Reactor is responsible for monitoring and responding to all events, and only runs in the main thread. It will become a performance bottleneck when concurrency is high in an instant;

Why there is no single Reactor multi-process solution:

        If multiple processes were used instead, then after a child process finished the business processing it would have to return the result to the parent process and tell the parent which client the result should be sent to, which is very troublesome: the parent process only listens for events on each connection through the Reactor and then distributes them, and the child process does not share those connections with the parent. Treating the parent-child communication itself as a connection and adding it to the Reactor's monitoring is possible but complicated. With multi-threading, data is shared between threads, so communication between threads is very convenient; although synchronization of shared data needs extra care, this complexity is much lower than that of inter-process communication.

2.3.3 Multi-Reactor multi-process/thread

In order to solve the problem of single Reactor and multi-threading, the most intuitive way is to change the single Reactor to multiple Reactors:





Process description:

(1) The mainReactor object in the parent process monitors the connection establishment event through select. After receiving the event, it receives it through the Acceptor and assigns the new connection to a child process;

(2) The subReactor of the subprocess adds the connection allocated by the mainReactor to the connection queue for monitoring, and creates a Handler to handle various events of the connection;

(3) When a new event occurs, subReactor will call the Handler corresponding to the connection (that is, the Handler created in step 2) to respond. The Handler completes the complete business process of read→business processing→send;

        The multi-Reactor multi-process/thread solution looks more complicated than the single-Reactor multi-thread solution, but it is actually simpler to implement. The main reasons are:

(1) The responsibilities of the parent process and the child process are very clear. The parent process is only responsible for receiving new connections, and the child process is responsible for completing subsequent business processing.

(2) The interaction between the parent process and the child process is very simple. The parent process only needs to pass the new connection to the child process, and the child process does not need to return data.

(3) Sub-processes are independent of each other and do not need to be synchronized and shared (this is limited to select, read, send, etc. related to the network model, which do not need to be synchronized and shared. "Business processing" may still need to be synchronized and shared)

        The current famous open source system Nginx uses multi-Reactor and multi-process. Implementations of multi-Reactor and multi-threading include Memcache and Netty.

        Nginx adopts a multi-Reactor, multi-process mode, but its solution differs from the standard one: the main process only creates the listening port and does not create a mainReactor to accept connections. Instead, the Reactor in each child process accepts connections, with a lock ensuring that only one child process accepts at a time; after a child process accepts a new connection, it places the connection in its own Reactor for processing and never hands it to another child process.
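        A rough sketch of the Nginx-style variant just described: workers are forked after the listening port is created, and each worker accepts connections itself inside its own event loop. The accept lock is omitted for brevity (losers of the accept race simply retry), and the port, worker count, and echo logic are illustrative assumptions.

```python
import os
import selectors
import socket

HOST, PORT, NUM_WORKERS = "0.0.0.0", 9002, 4   # illustrative values

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind((HOST, PORT))
listener.listen(128)
listener.setblocking(False)

def worker_loop(sock):
    sel = selectors.DefaultSelector()          # each worker has its own Reactor
    sel.register(sock, selectors.EVENT_READ, "accept")
    while True:
        for key, _ in sel.select():
            if key.data == "accept":
                try:
                    conn, _ = sock.accept()    # no accept lock here; real Nginx serializes this
                except BlockingIOError:        # another worker won the race
                    continue
                conn.setblocking(False)
                sel.register(conn, selectors.EVENT_READ, "read")
            else:
                conn = key.fileobj
                data = conn.recv(4096)
                if data:
                    conn.sendall(data)         # echo as placeholder business processing
                else:
                    sel.unregister(conn)
                    conn.close()

for _ in range(NUM_WORKERS):                   # workers forked after the port is created
    if os.fork() == 0:
        worker_loop(listener)
        os._exit(0)

os.wait()
```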

2.4 Proactor

        Reactor is a synchronous, non-blocking network model, because the actual read and send operations still have to be performed synchronously by the user process. If the I/O operations themselves are made asynchronous, performance can be improved further; this is the asynchronous network model, Proactor.





Process description:

(1) Proactor Initiator is responsible for creating Proactor and Handler, and registering Proactor and Handler to the kernel through Asynchronous Operation Processor;

(2) Asynchronous Operation Processor is responsible for processing registration requests and completing I/O operations. Asynchronous Operation Processor notifies Proactor after completing I/O operations;

(3) Proactor calls back different Handlers according to different event types for business processing.

(4) The Handler completes the business processing; the Handler can also register new asynchronous operations (new Handlers) with the kernel;
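        A rough illustration of the Proactor programming style using Python's asyncio: the application awaits completed reads and writes instead of reacting to readiness. On Windows, asyncio's default event loop is a Proactor backed by IOCP, while on Linux the same completion-style API is emulated on top of epoll; the port and echo logic are illustrative assumptions.

```python
import asyncio

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter):
    # When this coroutine resumes, the read has already *completed*:
    # the data is in our buffer; we did not poll for readiness ourselves.
    data = await reader.read(4096)
    while data:
        writer.write(data.upper())        # business processing + asynchronous write
        await writer.drain()              # await completion of the write
        data = await reader.read(4096)
    writer.close()
    await writer.wait_closed()

async def main():
    server = await asyncio.start_server(handle, "0.0.0.0", 9003)   # illustrative port
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    # On Windows asyncio uses an IOCP-backed Proactor event loop by default;
    # on Linux the same completion-style API runs on top of epoll readiness events.
    asyncio.run(main())
```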

        Theoretically, Proactor is more efficient than Reactor. Asynchronous I/O can make full use of DMA features to overlap I/O operations and calculations. However, to achieve true asynchronous I/O, the operating system needs to do a lot of work.

        DMA (Direct Memory Access) is an important feature of all modern computers; it allows hardware devices of different speeds to communicate without imposing a heavy interrupt load on the CPU. At present, true asynchronous I/O is implemented on Windows through IOCP, while AIO on Linux is still imperfect.

        Disadvantages: the Proactor implementation logic is complex, and it depends on the operating system's support for asynchronous operations; few operating systems implement truly asynchronous operations today. Windows provides a complete set of asynchronous socket programming interfaces, IOCP, which is asynchronous I/O implemented at the operating-system level, asynchronous I/O in the true sense, so high-performance network programs on Windows can use the more efficient Proactor approach. Linux's support for asynchronous I/O is not yet very good or complete.

Applicable scenarios: Event-driven programs that asynchronously receive and process multiple service requests simultaneously.

2.5 The difference between Reactor and Proactor

  • Reactor is a synchronous, non-blocking network model that senses ready read/write events. Each time an event is sensed (for example, a read-ready event), the application process must actively call read to complete the data transfer, that is, actively copy the data from the socket receive buffer into application memory; this step is synchronous, and only after the data has been read can the application process it.

  • Proactor is an asynchronous network model that senses completed read/write events. When initiating an asynchronous read or write, the application passes in the address of the data buffer (where the result will be stored) and other information, and the system kernel then completes the read or write for us. The whole transfer is done by the operating system; the application does not need to actively call read/write as in Reactor. When the operating system finishes the work, it notifies the application, which can process the data directly.

        Therefore, Reactor can be understood as "when an event comes, the operating system notifies the application process and lets the application process handle it", while Proactor can be understood as "when an event comes, the operating system processes it, and then notifies the application process after processing". The "events" here are I/O events such as new connections, data to be read, and data to be written. The "processing" here includes reading from the driver to the kernel and reading from the kernel to user space.

        To give an example in real life, the Reactor model means that the courier is downstairs and calls you to tell you that the courier has arrived in your community. You need to go downstairs to pick up the courier yourself. In Proactor mode, the courier delivers the package directly to your door and then notifies you.

        Both Reactor and Proactor are network programming models based on "event dispatching"; the difference is that the Reactor model dispatches on "to-be-completed" I/O events, while the Proactor model dispatches on "completed" I/O events.

3. Cluster high performance

        Although the performance of computer hardware has developed rapidly, it still pales in comparison to the development speed of business. Especially after entering the Internet era, the development speed of business far exceeds the development speed of hardware. The development of these businesses cannot be supported by the performance of a single machine, and a machine cluster must be used to achieve high performance.

        If one person can't do it, find a few more people to do it.

        Improving performance through a large number of machines is not just as simple as adding machines. It is a complex task to cooperate with multiple machines to achieve high performance.

        The design of cluster high-performance architecture mainly involves two aspects: task allocation and task decomposition.

3.1 Task allocation

        Task distribution means that each machine can handle complete business tasks, and different tasks are assigned to different machines for execution. As follows, one server becomes two servers:





(1) A task allocator needs to be added. This allocator may be a hardware network device (for example, F5 or a switch), a software network device (for example, LVS), load-balancing software (for example, Nginx or HAProxy), or a self-developed system (a gateway). Choosing a suitable task allocator is itself a complex matter that requires weighing performance, cost, maintainability, availability, and other factors.

(2) There is connection and interaction between the task allocator and the real business server. It is necessary to select an appropriate connection method and manage the connection. For example, connection establishment, connection detection, how to deal with connection interruption, etc.

(3) The task allocator needs to add an allocation algorithm. For example, whether to use a polling algorithm, distribute according to weight, or distribute according to load. If distributed according to the load of the server, the business server must also be able to report its status to the task allocator.

        Task distribution is actually what is often called load balancing. Work tasks are balanced and distributed to multiple operating units through load technology. Load balancing is built on the network structure and is an important means to provide system high availability, processing capabilities and relieve network pressure.

        Different task allocation algorithms have different goals. Some are based on load considerations, some are based on performance (throughput, response time) considerations, and some are based on business considerations.

3.1.1 Load balancing classification

It can be divided into two categories according to the implementation location:

(1) Server load: hardware load balancing and software load balancing;

(2) Client load. (Ribbon, Dubbo, Thrift in spring-cloud)

It can be divided into three categories according to different implementation methods:

(1) DNS load balancing;

(2) Hardware load balancing;

(3) Software load balancing.

3.1.2 The difference between server load balancing and client load balancing

For server-side load balancing (whether implemented in software or hardware), the server list sits behind the load-balancing device, which removes faulty nodes through heartbeat detection so that every server node in the list remains reachable. Both software and hardware load balancing can be built on an architecture similar to the following.





        In client load balancing, each client node maintains the list of services it needs to call, and this list comes from the service registry. Compared with server-side load balancing, client load balancing is a fairly niche concept; it is what the Ribbon component of the spring-cloud framework implements. With spring-cloud, the same service is often started multiple times; when a request arrives, the way Ribbon uses a policy to decide which instance should serve this request is client load balancing. In spring-cloud, client load balancing is transparent to developers: just add the @LoadBalanced annotation. The core difference between client and server load balancing is who maintains the service list: in client load balancing the list is maintained by the client, while in server load balancing it is maintained by a separate intermediate service.

3.1.3 DNS load balancing

DNS is the simplest and most common load balancing method, and is generally used to achieve geographical level balancing.

Advantages:

1. Simple and low cost: the load-balancing work is handled by the DNS server; there is no need to develop or maintain load-balancing equipment yourself.

2. Nearby access improves speed: during DNS resolution, the server address closest to the user can be returned based on the request's source IP, which speeds up access and improves performance.

Disadvantages:

1. Updates are not timely: DNS caches last a relatively long time, so after the DNS configuration is modified, many users will keep accessing the old IP address because of caching; those accesses fail, the purpose of load balancing is not achieved, and users' normal use of the service is affected.

2. Poor scalability: control of DNS load balancing lies with the domain name provider, so it is impossible to add customized functions or extensions for your own business.

3. The distribution strategy is simple: DNS load balancing supports few algorithms, cannot distinguish between servers (it cannot judge load from the state of the system or service), and cannot sense the state of the back-end servers.

3.1.4 Hardware load balancing

        Hardware load balancing implements the load balancing function through separate hardware devices. This type of device is similar to routers and switches and can be understood as a basic network device used for load balancing. Mainly include F5 & A10, etc.

Advantages:

1. Powerful functionality: full support for load balancing at every layer, comprehensive load-balancing algorithms, and global load balancing.

2. Strong performance: by comparison, software load balancing that supports 100,000 concurrent connections is already impressive, while hardware load balancing can support more than 1 million.

3. High stability: commercial hardware load balancers have been through rigorous testing and large-scale use and are highly stable.

4. Security protection: besides load balancing, hardware devices also provide security functions such as firewalling and anti-DDoS protection.

Disadvantages:

1. Expensive.

2. Poor expansion capability.

3.1.5 Software load balancing

        Software load balancing implements the load-balancing function through load-balancing software; the common choices are Nginx and LVS. Nginx is software layer-7 load balancing, while LVS is layer-4 load balancing in the Linux kernel; the difference between layer 4 and layer 7 lies in the protocols handled and the flexibility offered.

Advantages:

1. Simple: both deployment and maintenance are relatively simple.
2. Cheap: just buy a Linux server and install the software.
3. Flexible: layer-4 or layer-7 load balancing can be chosen according to the business, and it is easy to extend for the business, for example by implementing customized functions through Nginx plugins.


Disadvantages:
1. Average performance: a single Nginx instance can support roughly 50,000 concurrent connections.
2. The functionality is not as powerful as hardware load balancing.
3. It generally does not provide security functions such as firewalling or anti-DDoS protection.



3.1.6 Principles of use

  • DNS load balancing is used to achieve geographical level load balancing;

  • Hardware load balancing is used to achieve cluster-level load balancing;

  • Software load balancing is used to achieve machine-level load balancing

3.1.7 Common load balancing algorithms

  • Static: tasks are allocated with fixed probabilities, regardless of server state, e.g., round robin and weighted round robin.

  • Dynamic: tasks are allocated based on the servers' real-time load, e.g., least connections and weighted least connections.

  • Random: a server is picked at random for each request; this method is rarely used on its own.

  • Round robin: the default load-balancing implementation; requests are queued as they arrive and handed out in turn.

  • Weighted round robin: servers are graded by capacity, higher weights go to well-provisioned, lightly loaded servers, and the pressure on each server is balanced.

  • Address hash: the server is chosen by taking the hash of the client's address modulo the number of servers.

  • Least connections: even if requests are balanced, the pressure may not be; the least-connections method allocates each request to the server under the least pressure right now, judged by metrics such as the request backlog (a sketch of several of these algorithms follows this list).
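        A minimal sketch of three of the algorithms listed above (round robin, weighted round robin, and least connections); the back-end addresses, weights, and the simplification of weighted round robin to weight-proportional random selection are illustrative assumptions.

```python
import itertools
import random

SERVERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]            # illustrative back ends
WEIGHTS = {"10.0.0.1": 5, "10.0.0.2": 3, "10.0.0.3": 1}   # illustrative weights
ACTIVE = {s: 0 for s in SERVERS}                          # current connection counts

_rr = itertools.cycle(SERVERS)

def round_robin():
    """Hand out servers in turn, ignoring their state."""
    return next(_rr)

def weighted_round_robin():
    """Approximate weighted round robin as weight-proportional random choice."""
    return random.choices(SERVERS, weights=[WEIGHTS[s] for s in SERVERS], k=1)[0]

def least_connections():
    """Pick the server with the fewest active connections right now."""
    return min(ACTIVE, key=ACTIVE.get)

# Usage: the caller tracks connection lifetimes for least_connections().
server = least_connections()
ACTIVE[server] += 1        # connection opened
# ... proxy the request to `server` ...
ACTIVE[server] -= 1        # connection closed
```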

3.1.8 Load balancing technology

  • DNS-based load balancing;

  • Reverse proxy;

  • NAT-based (Network Address Translation) load balancing;

  • Ribbon client load balancing: Spring Cloud Ribbon is a core project of the Spring Cloud Netflix sub-project, a client-side load-balancing tool based on HTTP and TCP that mainly provides load balancing for service calls and API gateway forwarding. Spring Cloud Ribbon is a utility framework that does not need to be deployed or run independently; it is integrated into other projects. Through Spring Cloud Ribbon's encapsulation, implementing client load balancing in a Spring Cloud microservice architecture is very simple.

        Spring Cloud Alibaba integrates Ribbon by default. When a Spring Cloud application integrates Spring Cloud Alibaba Nacos Discovery, Ribbon's Nacos auto-configuration is triggered automatically (controlled by the switch ribbon.nacos.enabled, which defaults to true): the ServerList maintenance mechanism is overridden by NacosServerList, so the list of service instances is maintained by Nacos's service governance. Nacos still uses Spring Cloud Ribbon's default load-balancing implementation and only extends the mechanism that maintains the service instance list.

3.1.9 Lowest load first class

        The load balancing system allocates according to server load. "Load" here is not necessarily the usual "CPU load"; it is the current pressure on the system, which can be measured by CPU load, the number of connections, I/O utilization, NIC throughput, and so on. Lowest load first:

  • LVS, a 4-layer network load balancing device, can judge the status of the server based on the "number of connections". The greater the number of server connections, the greater the pressure on the server.

  • Nginx, a 7-layer network load system, can judge the server status based on the "number of HTTP requests"

  • If you develop your own load balancing system, you can choose indicators to measure system pressure based on business characteristics. If it is CPU-intensive, the system pressure can be measured by "CPU load"; if it is I/O-intensive, the system pressure can be measured by "I/O load".

Advantages:

  • The algorithm with lowest load priority solves the problem of being unable to sense the status of the server in the polling algorithm.

Disadvantages:

  • The least-connections-first algorithm requires the load-balancing system to track the connections currently established with each server. It only applies when every connection request received by the load balancer is forwarded to a server for processing; if the load balancer and the servers maintain a fixed connection pool between them, the algorithm is not suitable.

  • The lowest-CPU-load-first algorithm requires the load-balancing system to collect each server's CPU load in some way, and to decide whether the 1-minute or the 15-minute load average is the standard. Neither is inherently better or worse; the optimal interval differs from business to business. Too short an interval causes frequent fluctuations; too long an interval may make the system respond slowly when a peak arrives.

3.1.10 Best Performance Class

        The lowest-load-first algorithms allocate from the server's perspective, whereas the best-performance-first algorithms allocate from the client's perspective: tasks are preferentially assigned to the server that currently processes them fastest, thereby giving the client the fastest response.

Disadvantages:

  • The load balancing system needs to collect and analyze the response time of each task of each server. In a scenario where a large number of tasks are processed, this collection and statistics itself will also consume a lot of performance.

  • In order to reduce this statistical consumption, sampling can be used to collect statistics, and an appropriate sampling rate is required.

  • Whether it is all statistics or sampling statistics, you need to choose an appropriate period.

3.1.11 Hash class

        In some scenarios, it is hoped that a specific request should always be executed on one server. In this case, a consistent hash algorithm needs to be used to achieve this.

        Example: User data is cached in the service, so it is best to keep the same server every time the user visits.

        The load-balancing algorithms of the Redis and Memcache caches, Nginx, and Dubbo all use consistent hashing.

There are two ways of Hash:

  • Source address Hash

  • ID Hash

Specific algorithm introduction:  https://blog.csdn.net/u011436427/article/details/123344374 
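        A compact consistent-hash ring sketch with virtual nodes, showing how a source address or an ID is always mapped to the same server; the node names, virtual-node count, and MD5 hashing are illustrative assumptions, and the linked article covers the algorithm in more detail.

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        self._ring = []                         # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes):             # virtual nodes smooth the distribution
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get(self, key: str) -> str:
        """Map a request key (source IP, user ID, ...) clockwise onto a server."""
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["cache-1", "cache-2", "cache-3"])   # illustrative nodes
print(ring.get("192.168.1.15"))   # source-address hash
print(ring.get("user:10086"))     # ID hash
```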

3.1.12 Task Distributor Cluster

        As the system scale continues to expand, the task allocator will also face load pressure, so the task allocator also needs to use clusters to expand capacity and improve availability.



This architecture is more complex than the architecture of two business servers, mainly reflected in:

  • The task allocator changes from one to many (task allocator 1 to task allocator M in the figure). The added complexity is that different users must be directed to different task allocators (the dotted "user assignment" lines in the figure); common methods include DNS round robin, smart DNS, CDN (Content Delivery Network), and GSLB (Global Server Load Balancing) devices.

  • The connections between task allocators and business servers change from a simple "1-to-many" topology (one allocator connected to many servers) to a "many-to-many" mesh (many allocators connected to many servers).

  • The number of machines grows from 3 to 30 (task allocators are generally fewer than business servers; here we assume 25 business servers and 5 task allocators), which greatly increases the complexity of state management and fault handling.

        The two examples above use business processing as the example, but "task" actually covers a wide range: it can be a complete business flow or a single specific task. "Storage", "computation", and "caching", for instance, can each be treated as a task, so storage systems, computation systems, and cache systems can all be built with the task-allocation approach. In addition, a "task allocator" does not have to be a physical machine or an independently running program; it can also be an algorithm embedded in another program, as in Memcache's cluster architecture.

3.2 Task decomposition

        Task allocation breaks through the processing-performance bottleneck of a single machine: more machines can be added to satisfy the business's performance needs. However, if the business itself becomes more and more complex, expanding performance through task allocation alone yields smaller and smaller returns.

        For example, when the business is simple, expanding from 1 machine to 10 can raise performance by roughly 8 times (part of the theoretical 10x is lost to the overhead of operating a cluster). But if the business becomes very complex, expanding from 1 machine to 10 may improve performance only 5 times. The main reason is that as the business grows more complex, the processing performance of a single machine keeps dropping. To keep improving performance, a second approach is needed: task decomposition.

        Continuing with the architecture from "task allocation" above: if the "business server" becomes more and more complex, it can be split into more components. Take WeChat's backend architecture as an example.



        As can be seen from the above architecture diagram, the WeChat backend architecture logically splits each sub-business, including: access, registration and login, messaging, LBS, shake, drift bottle, and other businesses (chat, video, Moments, etc.).

        Through task decomposition, a unified but complex business system can be split into small, simple business systems that cooperate with one another. From a business point of view, task decomposition neither reduces functionality nor reduces the amount of code (in fact the amount of code may increase, because in-process calls become calls across server interfaces), so why can task decomposition improve performance?

There are several main factors:

(1) Simple systems are easier to achieve high performance

        The simpler a system's functionality, the fewer the points that affect performance, and the easier targeted optimization becomes. When the system is very complex, it is first of all harder to find the key performance point, because there are too many points to consider and verify; and even after it is found with great effort, it is not easy to change, because improving key point A may inadvertently degrade the performance of point B, so that the whole system's performance not only fails to improve but may even decline.

(2) Can be expanded for a single task

        When each logical task is decomposed into an independent subsystem, the performance bottlenecks of the whole system are easier to find, and once found, only the bottlenecked subsystem needs to be optimized or scaled, without changing the whole system, so the risk is much smaller. Taking WeChat's backend as an example, if the number of users grows too fast and the registration-and-login subsystem becomes the bottleneck, only that subsystem needs to be optimized (through code optimization or simply adding machines); other subsystems such as the message logic and LBS logic need not change at all.

        Since decomposing a unified system into multiple subsystems can improve performance, isn't it better to divide it as finely as possible? For example, the WeChat backend above currently has 7 logical subsystems. If these 7 logical subsystems are further subdivided into 100 logical subsystems, will the performance be higher?

        In fact it is not: performance will not improve and may even drop. The main reason is that if the system is split too finely, completing a piece of business requires ever more calls between systems, and these calls travel over the network, whose performance is far lower than that of in-process function calls. A simple diagram illustrates this.



        As can be seen from the figure, when the system is split into 2 subsystems, a user access requires 1 inter-system request and 1 response; when it is split into 4 subsystems, the number of inter-system requests rises from 1 to 3; and if it is split into 100 subsystems, completing one user access requires 99 inter-system requests.

        To simplify the description, abstract the simplest possible model: assume the subsystems communicate over an IP network, that one request plus its response ideally takes 1 ms on the network, and that the business processing itself takes 50 ms; also assume that splitting has no effect on the performance of an individual business request. Then when the system is split into 2 subsystems, processing one user access takes 51 ms, while when it is split into 100 subsystems, the same access takes 149 ms (50 ms + 99 × 1 ms).
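        The arithmetic above follows from a simple chain model; here is a tiny sketch of that model, with the 1 ms and 50 ms figures taken from the assumptions in the paragraph.

```python
T_NET_MS = 1        # assumed cost of one inter-system request + response
T_BUSINESS_MS = 50  # assumed total business processing time

def access_latency(num_subsystems: int) -> int:
    """Chain model: one network hop between each pair of adjacent subsystems."""
    return T_BUSINESS_MS + (num_subsystems - 1) * T_NET_MS

print(access_latency(2))    # 51 ms
print(access_latency(100))  # 149 ms
```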

        Although system splitting may improve business processing performance to some extent, the improvement is limited. It is impossible for business processing to take 50 ms before splitting and only 1 ms afterwards, because what ultimately determines business processing performance is the business logic itself; as long as the business logic does not change fundamentally, its theoretical performance has an upper limit. System splitting can bring performance closer to that limit but cannot break through it. The performance benefit of task decomposition therefore has a bound: it is not the case that the finer the decomposition, the better. For architecture design, grasping the right granularity is the key.


Origin blog.csdn.net/weichao9999/article/details/129893923