How microservices are governed

Microservice remote calls may run into the following problems:

- The registration center goes down;
- A node of service provider B goes down;
- The network between service consumer A and the registration center is disconnected;
- The network between service provider B and the registration center is disconnected;
- The network between service consumer A and service provider B is disconnected;
- Some nodes of service provider B slow down;
- Service provider B fails for a short period of time.

Commonly used service governance methods:

 

Node management

Service call failures generally have two kinds of causes. One is a problem with the service provider itself, such as the server going down or the process exiting unexpectedly; the other is a network problem between any two of the service provider, the registration center, and the service consumer.

 

Whether the problem lies with the service provider itself or with the network, there are two node management approaches.

 

1. Active removal by the registration center

 

This mechanism requires the service provider to report a heartbeat to the registration center at regular intervals. The registration center compares the current time with the time of the last heartbeat reported by each provider node; if the gap exceeds a certain threshold, it considers the provider faulty, removes the node from the service list, and pushes the latest list of available service nodes to the service consumers.
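
Below is a minimal sketch of this idea, assuming a simple in-memory registry; the names (HeartbeatRegistry, TIMEOUT_MS, pushUpdatedListToConsumers) are illustrative and not tied to any particular registry product.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch: a registry that evicts providers whose heartbeat is stale.
public class HeartbeatRegistry {
    private static final long TIMEOUT_MS = 30_000;           // assumed timeout threshold
    private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

    // Called by a service provider on every heartbeat report.
    public void reportHeartbeat(String providerAddress) {
        lastHeartbeat.put(providerAddress, System.currentTimeMillis());
    }

    // Called periodically (e.g. by a scheduled task) to remove stale providers.
    public void evictStaleProviders() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() > TIMEOUT_MS) {
                it.remove();                                   // drop the node from the service list
                pushUpdatedListToConsumers();                  // notify consumers of the new list
            }
        }
    }

    private void pushUpdatedListToConsumers() {
        // In a real registry this would push the latest available node list to subscribers.
    }
}
```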

 

2. Removal by the service consumer

 

Although active removal by the registration center can handle abnormal provider nodes, if the network between the registration center and the service providers fails, the worst case is that the registration center removes all service nodes, leaving the service consumer with no nodes to call even though the providers themselves are actually healthy. It is therefore more reasonable to perform liveness detection on the service consumer side: when a call to a provider node fails, the consumer removes that node from the list of available provider nodes it keeps in memory.
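
A minimal sketch of consumer-side removal follows, assuming the consumer holds the node list in memory and is told about failures by its RPC layer; ConsumerNodeList and its methods are made-up names for illustration.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative sketch: the consumer drops a node from its in-memory list when a call to it fails.
public class ConsumerNodeList {
    private final List<String> availableNodes = new CopyOnWriteArrayList<>();

    // Refresh the list whenever the registry pushes a new one.
    public void onNodesPushed(List<String> nodes) {
        availableNodes.clear();
        availableNodes.addAll(nodes);
    }

    // Survival detection on the consumer side: a failed call removes the node locally.
    public void onCallFailed(String node) {
        availableNodes.remove(node);
    }

    public List<String> snapshot() {
        return List.copyOf(availableNodes);
    }
}
```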

 

Load balancing

In general, a service provider is not a single node but a cluster, and for large-scale services the number of provider nodes may reach hundreds or even thousands. Because machines are purchased in different batches, the configuration of different nodes can vary considerably: newly purchased machines tend to have more CPU and memory and, under the same request volume, perform better than older ones. For the service consumer, it would be ideal if, when picking a node from the service list, the better-configured new machines carried more traffic so that their performance is fully utilized. This requires some adjustment of the load balancing algorithm.

 

The commonly used load balancing algorithms mainly include the following types.

 

1. Random algorithm

 

As the name implies, a node is selected at random from the available service nodes. In general, random selection is uniform: no matter how well or poorly a back-end node is configured, each node ends up receiving roughly the same number of invocations.

 

2. Round-robin algorithm

 

The available service nodes are polled in turn according to fixed weights. If all nodes have the same weight, each receives a similar call volume; nodes with better hardware can be given a higher weight so that they receive a larger share of the calls, making full use of their performance advantage and improving the average performance of the overall service.
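
As a sketch of the weighted variant, the following uses the well-known smooth weighted round-robin selection; the node addresses and the 3:1 weights in main are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of smooth weighted round-robin selection.
public class WeightedRoundRobin {
    private final Map<String, Integer> weights;               // configured weight per node
    private final Map<String, Integer> current = new LinkedHashMap<>();

    public WeightedRoundRobin(Map<String, Integer> weights) {
        this.weights = weights;
        weights.keySet().forEach(n -> current.put(n, 0));
    }

    public synchronized String select() {
        int total = weights.values().stream().mapToInt(Integer::intValue).sum();
        String best = null;
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            e.setValue(e.getValue() + weights.get(e.getKey())); // grow each node by its weight
            if (best == null || e.getValue() > current.get(best)) {
                best = e.getKey();
            }
        }
        current.put(best, current.get(best) - total);           // penalise the chosen node
        return best;
    }

    public static void main(String[] args) {
        WeightedRoundRobin lb = new WeightedRoundRobin(
                Map.of("new-machine:8080", 3, "old-machine:8080", 1));
        for (int i = 0; i < 8; i++) System.out.println(lb.select()); // ~3x more calls to the new machine
    }
}
```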

 

3. Least-active call algorithm

 

This algorithm dynamically maintains, in the service consumer's memory, a count of in-flight calls to each service node: when a call is issued to a node, its count is incremented by 1, and when the call returns, it is decremented by 1. Each time a node must be selected, the counts are sorted and the node with the fewest in-flight calls is chosen, i.e. the node currently handling the fewest calls, which in theory also offers the best performance.
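
A minimal sketch of least-active selection, assuming the RPC layer calls begin() and end() around every invocation; the class and method names are illustrative.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: pick the node with the fewest in-flight calls.
public class LeastActiveBalancer {
    private final Map<String, AtomicInteger> active = new ConcurrentHashMap<>();

    public String select(List<String> nodes) {
        String best = null;
        int min = Integer.MAX_VALUE;
        for (String node : nodes) {
            int count = active.computeIfAbsent(node, n -> new AtomicInteger()).get();
            if (count < min) {                 // node with the fewest in-flight calls wins
                min = count;
                best = node;
            }
        }
        return best;
    }

    public void begin(String node) {           // +1 when a call is issued
        active.computeIfAbsent(node, n -> new AtomicInteger()).incrementAndGet();
    }

    public void end(String node) {             // -1 when the call returns
        active.computeIfAbsent(node, n -> new AtomicInteger()).decrementAndGet();
    }
}
```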

 

4. Consistent hash algorithm

 

Requests with the same parameters are always sent to the same service node. When a node fails, the requests that would have gone to it are spread over the other nodes via the virtual-node mechanism, without causing drastic changes to the overall mapping.
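
The sketch below builds a hash ring with virtual nodes, roughly as described; the choice of 100 virtual nodes per physical node and the use of MD5 as the hash function are arbitrary assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch of a consistent-hash ring with virtual nodes.
public class ConsistentHashRing {
    private static final int VIRTUAL_NODES = 100;              // assumed replica count per node
    private final TreeMap<Long, String> ring = new TreeMap<>();

    public ConsistentHashRing(List<String> nodes) {
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                ring.put(hash(node + "#" + i), node);           // place virtual nodes on the ring
            }
        }
    }

    // Requests with the same key always map to the same node.
    public String select(String requestKey) {
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(requestKey));
        return e != null ? e.getValue() : ring.firstEntry().getValue(); // wrap around the ring
    }

    private long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(key.getBytes(StandardCharsets.UTF_8));
            return ((long) (d[3] & 0xFF) << 24) | ((d[2] & 0xFF) << 16)
                    | ((d[1] & 0xFF) << 8) | (d[0] & 0xFF);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```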

 

These algorithms increase in implementation difficulty in the order listed, so which load balancing algorithm to use depends on the actual scenario. If the back-end nodes have identical configurations and show no performance difference under the same call volume, a random or round-robin algorithm is appropriate; if the nodes differ noticeably in configuration and performance, the least-active call algorithm is the better choice.

 

Service routing

For the service consumer, which node it picks from the in-memory list of available nodes is determined not only by the load balancing algorithm but also by routing rules.

 

So-called routing rules limit the range of selectable service nodes through rules such as conditional expressions or regular expressions.

 

Why do we need to formulate routing rules? There are two main reasons. 

 

1. The business needs gray release

 

For example, a service provider has changed some functionality but wants to expose it to only a subset of users first, and then decide whether to roll it out fully based on their feedback. In this case, a rule such as grayscale by tail number can restrict access to the newly released nodes to a certain percentage of users.
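
A toy sketch of tail-number grayscale routing; the assumption that tail digits 0 and 1 (roughly 20% of users) go to the gray nodes, and the TailNumberGrayRouter name, are made up for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: route a fraction of users, selected by user-ID tail number,
// to the newly released (gray) nodes.
public class TailNumberGrayRouter {
    public List<String> route(long userId, List<String> allNodes, List<String> grayNodes) {
        long tail = userId % 10;
        if (tail <= 1) {                       // tail 0 or 1 -> about 20% of users see the new version
            return grayNodes;
        }
        return allNodes.stream()               // everyone else stays on the old nodes
                .filter(n -> !grayNodes.contains(n))
                .collect(Collectors.toList());
    }
}
```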

 

2. Nearby access when deployed in multiple IDCs

 

As far as I know, most Internet companies of medium or larger business scale deploy their business in more than one IDC for high availability. This raises a problem: calls between different IDCs have to cross IDC boundaries, so the latency is relatively high, especially when the IDCs are far apart; for example, the dedicated-line latency between Beijing and Guangzhou is generally around 30 ms, which is unacceptable for latency-sensitive services. A service call should therefore select nodes within the same IDC as much as possible, reducing network overhead and improving performance. This is generally controlled through IP-segment rules: when selecting service nodes, nodes in the same IP segment are preferred.
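
A minimal sketch of IP-segment-based nearby access, assuming for illustration that nodes in the same IDC share the consumer's IP prefix up to the last octet; the SameIdcRouter name and the fallback behaviour are assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: prefer provider nodes in the same IP segment as the consumer;
// fall back to all nodes if none match.
public class SameIdcRouter {
    public List<String> route(String consumerIp, List<String> nodes) {
        String segment = consumerIp.substring(0, consumerIp.lastIndexOf('.'));   // e.g. "10.1.2"
        List<String> sameSegment = nodes.stream()
                .filter(n -> n.startsWith(segment + "."))
                .collect(Collectors.toList());
        return sameSegment.isEmpty() ? nodes : sameSegment;                       // cross IDCs only if needed
    }
}
```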

 

So how are routing rules configured? Based on my project experience, there are generally two configuration approaches.

 

1. Static configuration

 

The routing rules for service calls are stored locally on the service consumer and do not change while calls are being made. To change them, you must modify the consumer's local configuration and redeploy for the change to take effect.

 

2. Dynamic configuration

 

In this approach, the routing rules are stored in the registration center, and the service consumer periodically requests them from the registration center to stay in sync. To change a consumer's routing configuration, you modify the configuration in the registration center; the consumer picks up the change in its next synchronization cycle, achieving a dynamic update.
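
A minimal sketch of this periodic pull, assuming a 30-second sync cycle and a Supplier standing in for the actual registry call; both are assumptions, as is the RoutingRuleSyncer name.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative sketch: the consumer polls the registry on a fixed cycle and
// atomically swaps in the latest routing rules.
public class RoutingRuleSyncer {
    private final AtomicReference<List<String>> rules = new AtomicReference<>(List.of());
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public RoutingRuleSyncer(Supplier<List<String>> fetchFromRegistry) {
        scheduler.scheduleAtFixedRate(
                () -> rules.set(fetchFromRegistry.get()),   // pull and replace on every cycle
                0, 30, TimeUnit.SECONDS);
    }

    public List<String> currentRules() {
        return rules.get();                                  // always the most recently synced rules
    }
}
```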

 

Service fault tolerance

A service call does not always succeed; it may fail because the provider node itself is faulty, the process exits abnormally, or the network between the consumer and the provider breaks. For failed calls there must be some means of automatic recovery to ensure the call ultimately succeeds.

 

Commonly used methods mainly include the following.

 

FailOver: automatic switching on failure. After the service consumer detects a failed or timed-out call, it automatically selects the next node from the list of available nodes and retries the call; the number of retries can also be configured. This strategy requires the called operation to be idempotent, meaning that no matter how many times the same call is made, the result is the same; it is generally suitable for calls that are read requests.
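
A minimal sketch of FailOver, assuming at most two retries and a caller-supplied function that performs the actual remote call; the FailOverInvoker name and MAX_RETRIES value are illustrative.

```java
import java.util.List;
import java.util.function.Function;

// Illustrative sketch of FailOver: on failure, try the next node, up to a retry limit.
// Only safe for idempotent (typically read) calls, as noted above.
public class FailOverInvoker {
    private static final int MAX_RETRIES = 2;   // assumed retry count

    public <R> R invoke(List<String> nodes, Function<String, R> call) {
        RuntimeException last = null;
        for (int i = 0; i <= MAX_RETRIES && i < nodes.size(); i++) {
            try {
                return call.apply(nodes.get(i));   // attempt the call on the i-th node
            } catch (RuntimeException e) {
                last = e;                          // remember the failure, switch to the next node
            }
        }
        throw last != null ? last : new IllegalStateException("no available nodes");
    }
}
```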

 

FailBack: failure notification. After a call fails or times out, the consumer does not retry; instead, it decides the follow-up strategy based on the details of the failure. For non-idempotent calls, for example, a failed call cannot simply be retried: the consumer should first query the server to check whether the call actually took effect. If it did, it must not be retried; if it did not, the call can be issued again.

 

FailCache: failure caching. After a call fails or times out, the consumer does not retry immediately but tries again after some time has passed. For example, if the back-end service is having problems for a while, retrying immediately may make the problem worse and hinder its recovery; it works better to retry after the back-end node has had time to recover.

 

FailFast: fail fast. After a single failed call, the consumer does not retry at all. In practice, non-core business calls generally adopt the fail-fast strategy: when a call fails, a failure log is recorded and the error is returned.

 

As these descriptions show, the fault-tolerance strategies suit different scenarios: for idempotent calls, choose FailOver or FailCache; for non-idempotent calls, choose FailBack or FailFast.

Origin blog.csdn.net/weixin_57763462/article/details/131732805