In-depth analysis of online application node traffic isolation technology

Why traffic isolation is required

This work began with a problem reported by an EDAS customer: the CPU metrics of one Pod in their production environment were abnormal. To diagnose the problem, the customer wanted to preserve the Pod's state rather than rebuild it, but traffic kept reaching the abnormal Pod during diagnosis and degraded the quality of service. The customer asked whether the traffic flowing into the abnormal node could be removed to create an isolated diagnostic environment. If the abnormality can be repaired, the isolation is lifted once the repair is complete and the node resumes normal operation.

Besides isolating all inbound traffic for diagnosis, some online drills need to isolate specific traffic to achieve the intended drill effect. The first option we considered for this kind of problem was full-link traffic control. Full-link traffic control on EDAS can already steer traffic without restarting application nodes, but it only governs the traffic of the microservice framework, so it cannot meet the requirement of isolating all traffic or arbitrary specific traffic.

To this end, we conducted in-depth research and implemented a set of out-of-the-box traffic isolation tools that can dynamically isolate specific traffic and restore it at any time after isolation to meet traffic isolation requirements in various scenarios.

Which traffic to isolate

The purpose of traffic isolation is to block the traffic flowing into an application node. We therefore first need to clarify what the inbound traffic of a microservice application node consists of.

The traffic flowing into microservice application nodes can be roughly divided into two categories: service traffic and event traffic. Taking a common microservice application as an example, its traffic composition is shown in the figure below.

24082ee518c1faf6eb097877fff577e8.png


Service traffic arises when the nodes of a microservice application act together as one network entity that exposes a set of services, which other systems, services, or users call. A node does not decide by itself whether service traffic reaches it; a service registration and discovery mechanism maintains the logical traffic path. Each node is registered as an endpoint of the service. A caller addresses the logical address of the service, and after forwarding and address translation the request is routed to a physical node behind a service endpoint. One way to isolate service traffic is to break the connections that carry the service calls, but this inevitably degrades the quality of service. A more elegant solution, which keeps the service as a whole working normally, is to remove the mapping between the service and the node, so that routing directs traffic to other nodes and bypasses the isolated one. Service traffic mainly covers K8s Service, plus services built with microservice frameworks such as Spring Cloud and Dubbo and published through registries such as Nacos.

Event traffic is generated by the event-driven architecture inside the application and consists of events or messages that middleware delivers to application nodes. This communication is usually asynchronous; examples are messages from the RocketMQ message queue and job triggers from the SchedulerX scheduling framework. Middleware and application nodes usually communicate in client-server mode, so message or event traffic sent by the middleware can be isolated by breaking the connection between them.

Service traffic isolation

K8s Service

For applications that expose services through a K8s Service, the mapping between the Service and the application Pods is maintained by the Endpoints object. Its subsets field lists the endpoints of the Service; each endpoint is the network address (IP address and port) of a Pod instance that actually serves requests. The Endpoints controller watches Pod changes through the API Server and keeps the endpoint list in sync. Isolating K8s Service traffic therefore means breaking the reference from the Endpoints object to the Pod by removing the Pod's network address from the endpoint list. The Endpoints object must also be watched through the Informer mechanism, so that the expected state is kept when the object changes later or the controller reconciles it.
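
The two steps described above, removing the address and re-checking after every change, can be sketched in plain Java. This is only an illustrative model: EndpointsModel, Subset, withoutPod and needsReapply are hypothetical names, the data structure is a simplified stand-in for the Endpoints object, and no Kubernetes client library is used.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified, hypothetical model of an Endpoints object; not the Kubernetes client API. */
final class EndpointsModel {

    /** One subset: the Pod IPs that back the Service plus the ports they serve. */
    static final class Subset {
        final List<String> addresses;
        final List<Integer> ports;

        Subset(List<String> addresses, List<Integer> ports) {
            this.addresses = addresses;
            this.ports = ports;
        }
    }

    /** Returns a copy of the subsets without the isolated Pod IP; subsets left empty are dropped. */
    static List<Subset> withoutPod(List<Subset> subsets, String isolatedIp) {
        List<Subset> result = new ArrayList<>();
        for (Subset s : subsets) {
            List<String> kept = new ArrayList<>(s.addresses);
            kept.removeIf(isolatedIp::equals);
            if (!kept.isEmpty()) {
                result.add(new Subset(kept, s.ports));
            }
        }
        return result;
    }

    /** The check a watch callback would run after each update: has the isolated IP reappeared? */
    static boolean needsReapply(List<Subset> observed, String isolatedIp) {
        for (Subset s : observed) {
            if (s.addresses.contains(isolatedIp)) {
                return true;
            }
        }
        return false;
    }
}
```

In a real tool, withoutPod would be applied to the live object and needsReapply would run in the Informer's update handler, re-applying the removal whenever the controller adds the address back.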

bf5d929cd2cc0215a4478496acc3c232.png

Dubbo

For applications that expose services through a registry, the registry is responsible for managing service nodes. As long as the registration exists and the application node is alive, the registry keeps dispatching traffic to that node. The operation that removes the registration is called service deregistration. After an application node deregisters a service, the registry no longer directs traffic to it, which achieves traffic isolation.

To deregister Dubbo microservices dynamically, we first need to understand how Dubbo service registration works at the source-code level. Taking Dubbo 2.7.0 as an example, the general structure of its service registration module is as follows:

  1. A Dubbo application contains an AbstractRegistryFactory singleton, which is responsible for initializing the Registry container. Its class attribute REGISTRIES maintains the mapping between the microservice list and the registry instances.

  2. AbstractRegistry implements the Registry interface and, as a template, provides the common public methods such as service registration (register) and service deregistration (unregister). It also maintains the list of registered service URLs.

  3. FailbackRegistry extends AbstractRegistry and adds a failure retry mechanism. It also declares the abstract methods doRegister and doUnregister; when register/unregister is executed, doRegister/doUnregister is called.

  4. The concrete registry (such as NacosRegistry or RedisRegistry) implements the actual service registration (doRegister) and service deregistration (doUnregister) logic.
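
The layering in this list can be illustrated with a stripped-down sketch. It is not Dubbo source code: every class name starting with Sketch is hypothetical, service URLs are plain strings, the retry logic is omitted, and an in-memory set stands in for the remote registry server.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Stand-in for Dubbo's Registry interface; service URLs are plain strings here. */
interface SketchRegistry {
    void register(String url);
    void unregister(String url);
}

/** Template layer (role of AbstractRegistry): keeps the list of registered URLs. */
abstract class SketchAbstractRegistry implements SketchRegistry {
    private final Set<String> registered = new LinkedHashSet<>();

    @Override public void register(String url) { registered.add(url); }
    @Override public void unregister(String url) { registered.remove(url); }
    public Set<String> getRegistered() { return registered; }
}

/** Role of FailbackRegistry: delegates to doRegister/doUnregister (retry logic omitted). */
abstract class SketchFailbackRegistry extends SketchAbstractRegistry {
    @Override public void register(String url) {
        super.register(url);
        doRegister(url);
    }

    @Override public void unregister(String url) {
        super.unregister(url);
        doUnregister(url);
    }

    protected abstract void doRegister(String url);
    protected abstract void doUnregister(String url);
}

/** Role of a concrete registry such as NacosRegistry; "remote" stands in for the server. */
class SketchInMemoryRegistry extends SketchFailbackRegistry {
    final Set<String> remote = new LinkedHashSet<>();

    @Override protected void doRegister(String url) { remote.add(url); }
    @Override protected void doUnregister(String url) { remote.remove(url); }
}
```

Because register and unregister are defined on the abstract layers, a caller that only holds a SketchRegistry reference can take a service offline and bring it back without knowing which concrete registry is behind it.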

The source code shows that Dubbo's service registration module already has built-in methods for dynamically deregistering and re-registering services. Dubbo microservice isolation can therefore be achieved by actively triggering the unregister method of the registry object. Likewise, to restore the service node, the register method is triggered to update the service mapping in the registry.

Having settled on triggering the deregistration method of the registry object, we need to solve two problems: how to obtain the object and how to trigger the method. In a Java environment, Agent technology is the natural way to intervene in process behavior. However, a conventional agent based on bytecode instrumentation cannot be enabled at an arbitrary time, because it depends on the execution path of the application code: the agent code runs only when execution reaches an instrumentation point, where it obtains the object from the context and calls the relevant method through reflection. Instrumentation points related to the registry are easy to find during program startup, when the registry is initialized and services are registered. While the program is serving requests, it rarely initiates registry operations, so it is hard to find an instrumentation point that provides the expected context. If the agent is attached dynamically at the moment traffic needs to be isolated, no instrumentation point on the execution path can supply the registry context, and the agent code never takes effect.

Therefore, we need an out-of-the-box agent that can actively obtain objects and invoke their methods. This is where JVMTI comes in. JVMTI (JVM Tool Interface) is a native programming interface provided by the virtual machine that lets developers build agents to inspect the internal state of the JVM and even control the execution of JVM applications. JVMTI can locate specific classes and objects in the Java heap, whose methods can then be invoked through reflection, which meets our needs.

Because JVMTI is a native programming interface of the JVM, a JVMTI agent has to be written in C/C++ and is compiled into a dynamic link library (a .so or .dll file). The Java runtime environment interacts with JVMTI through JNI (Java Native Interface). The agent is attached dynamically to the target JVM through the Attach API.

e229aac11293bdc03a371eb97fa9bc9e.png


With a JVMTI agent, implementing control logic inside a Java application becomes relatively easy. To isolate Dubbo service traffic, the agent first obtains the static attribute REGISTRIES of the AbstractRegistryFactory class, which holds the services the application has currently registered and the corresponding Registry instances. For a given microservice, calling the unregister/register method of its Registry removes and restores the service dynamically. Because this solution works at the abstraction level and does not rely on a specific Registry implementation class, it is compatible with all registries.
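
As a rough illustration of the Java-side logic, the sketch below reads a static REGISTRIES map through reflection and calls unregister/register on the registries it finds. It assumes the code already runs inside the target JVM. DemoRegistry and DemoRegistryFactory are hypothetical stand-ins for Dubbo's Registry and AbstractRegistryFactory, DubboIsolationSketch is an invented helper, and service URLs are plain strings rather than Dubbo URL objects.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Stand-in for a Dubbo Registry; the real one works with URL objects, not strings. */
class DemoRegistry {
    private final Set<String> registered = new HashSet<>();

    public void register(String url) { registered.add(url); }
    public void unregister(String url) { registered.remove(url); }
    public Set<String> getRegistered() { return registered; }
}

/** Stand-in for AbstractRegistryFactory with its static REGISTRIES map. */
class DemoRegistryFactory {
    private static final Map<String, DemoRegistry> REGISTRIES = new HashMap<>();

    static DemoRegistry getRegistry(String key) {
        return REGISTRIES.computeIfAbsent(key, k -> new DemoRegistry());
    }
}

/** Hypothetical helper showing what the agent does once it is inside the JVM. */
class DubboIsolationSketch {
    private final Map<DemoRegistry, List<String>> removed = new HashMap<>();

    @SuppressWarnings("unchecked")
    private static Map<String, DemoRegistry> registries() {
        try {
            Field f = DemoRegistryFactory.class.getDeclaredField("REGISTRIES");
            f.setAccessible(true);
            return (Map<String, DemoRegistry>) f.get(null); // static field: no instance needed
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot read REGISTRIES", e);
        }
    }

    /** Unregisters every URL of the given service and remembers it for restore(). */
    void isolate(String serviceName) {
        for (DemoRegistry registry : registries().values()) {
            // Copy first: unregister() mutates the set being iterated.
            for (String url : new ArrayList<>(registry.getRegistered())) {
                if (url.contains(serviceName)) {
                    registry.unregister(url);
                    removed.computeIfAbsent(registry, r -> new ArrayList<>()).add(url);
                }
            }
        }
    }

    /** Re-registers everything that isolate() removed. */
    void restore() {
        removed.forEach((registry, urls) -> urls.forEach(registry::register));
        removed.clear();
    }
}
```

Remembering the removed URLs keeps restore() symmetric with isolate(): only what the tool took offline is registered again.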

d48f7dff8bc430cd4508f00df9c92fa1.png

Spring Cloud

Spring Cloud service traffic is isolated in much the same way as Dubbo. Once the Spring Cloud service registration principle is understood, we locate the service registration/deregistration methods and then intervene in the application's registration/deregistration behavior through JVMTI.

The service registration principle of Spring Cloud is relatively simple. When the Spring container starts, AbstractAutoServiceRegistration listens for the startup event and calls the register method of ServiceRegistry to register the Registration (the service instance data) with the registry. For example, the Nacos service registration class NacosServiceRegistry implements the ServiceRegistry interface and performs registration and deregistration in the registry through its implementations of the register/deregister methods.

// Service registration class
public abstract class AbstractAutoServiceRegistration<R extends Registration>...{
    // Registry instance
    private final ServiceRegistry<R> serviceRegistry;

    // Register the service
    protected void register() {
        this.serviceRegistry.register(getRegistration());
    }

    // Deregister the service
    protected void deregister() {
        this.serviceRegistry.deregister(getRegistration());
    }
}

To isolate Spring Cloud service traffic, first obtain the AbstractAutoServiceRegistration instance, then call its deregister/register method to deregister the service from the registry and re-register it. This method likewise does not depend on the implementation class of a specific registry and is compatible with all registries.
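
Because register and deregister are protected, code outside the class hierarchy has to call them through reflection. A minimal sketch follows, assuming the registration instance has already been located. DemoAutoServiceRegistration and DemoNacosAutoServiceRegistration are hypothetical stand-ins for the Spring Cloud classes, and RegistrationInvoker is an invented helper.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

/** Stand-in for AbstractAutoServiceRegistration; records calls instead of talking to a registry. */
abstract class DemoAutoServiceRegistration {
    final List<String> calls = new ArrayList<>();

    protected void register() { calls.add("register"); }
    protected void deregister() { calls.add("deregister"); }
}

/** Stand-in for a concrete subclass such as a Nacos auto-registration bean. */
class DemoNacosAutoServiceRegistration extends DemoAutoServiceRegistration {}

/** Hypothetical helper: invokes a no-arg method declared anywhere up the class hierarchy. */
final class RegistrationInvoker {
    static void invoke(Object target, String methodName) {
        for (Class<?> c = target.getClass(); c != null; c = c.getSuperclass()) {
            try {
                Method m = c.getDeclaredMethod(methodName);
                m.setAccessible(true); // the target methods are protected
                m.invoke(target);
                return;
            } catch (NoSuchMethodException e) {
                // Not declared on this class; continue with the superclass.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("failed to invoke " + methodName, e);
            }
        }
        throw new IllegalArgumentException("no method " + methodName + " on " + target.getClass());
    }
}
```

Calling RegistrationInvoker.invoke(bean, "deregister") isolates the node and RegistrationInvoker.invoke(bean, "register") restores it. Walking up the superclass chain matters because the methods are declared on the abstract parent, not on the concrete bean class.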

4e3ec83e1a3d513de1dd88d02483156e.png

Event traffic isolation

Application nodes and middleware usually communicate in the client-server mode, and RocketMQ and SchedulerX use Netty as the underlying network framework to complete the communication between the client and the server. Here, we take RocketMQ as an example to illustrate how to implement traffic isolation for similar event-driven middleware.

The main implementation class of RocketMQ client is NettyRemotingClient. As shown in the figure below, the attribute channelTables in the NettyRemotingClient class stores the Channel used to transmit data, and lockChannelTables is the lock used to control the update of channelTables. At the same time, several invoke methods are responsible for handling the communication process.

outside_default.png


The communication flow is shown in the figure below. The client first tries to get a Channel from channelTables. If none is available, it reconnects to the server and creates one. To keep threads synchronized, the lockChannelTables lock must be acquired before the new Channel is added to channelTables. If the lock stays occupied for the whole timeout window, a connection exception is thrown.

5c49fda0b9fa4de6a781e7c3bc5b3f86.png


Based on this analysis, we can prevent new Channels from being established by holding the lockChannelTables lock and then closing the existing Channels; until the lock is released, the client cannot establish a connection to the server. To resume traffic, simply release the lock, and the client automatically rebuilds the Channel and resumes communication. Because this control is applied at the network client layer, it is independent of the application's message model and works for both synchronous and asynchronous messages; it is also independent of the client's role and applies to both consumers and producers.
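
The lock-holding idea can be tried out on a toy model. The sketch below is not RocketMQ code: ToyRemotingClient only borrows the field names channelTables and lockChannelTables, channels are plain strings, and ChannelIsolator is a hypothetical helper. It assumes the lock is a ReentrantLock, which can only be released by the thread that acquired it, so the isolator parks a dedicated thread while isolation is active.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of a remoting client; field names follow the article, the rest is hypothetical. */
class ToyRemotingClient {
    final Map<String, String> channelTables = new ConcurrentHashMap<>();
    final ReentrantLock lockChannelTables = new ReentrantLock();

    /** Returns an existing channel or creates one; fails if the lock is not free in time. */
    String getOrCreateChannel(String addr, long timeoutMillis) {
        String existing = channelTables.get(addr);
        if (existing != null) {
            return existing;
        }
        boolean locked;
        try {
            locked = lockChannelTables.tryLock(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            locked = false;
        }
        if (!locked) {
            throw new IllegalStateException("connect to " + addr + " failed: lock timeout");
        }
        try {
            return channelTables.computeIfAbsent(addr, a -> "channel-to-" + a);
        } finally {
            lockChannelTables.unlock();
        }
    }
}

/** Holds the lock on a dedicated thread until restore() is called. */
class ChannelIsolator {
    private final CountDownLatch release = new CountDownLatch(1);
    private final Thread holder;

    ChannelIsolator(ToyRemotingClient client) {
        CountDownLatch acquired = new CountDownLatch(1);
        holder = new Thread(() -> {
            client.lockChannelTables.lock();
            try {
                client.channelTables.clear(); // stands in for closing the existing channels
                acquired.countDown();
                release.await();              // park until restore() is called
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                client.lockChannelTables.unlock();
            }
        }, "channel-isolator");
        holder.setDaemon(true);
        holder.start();
        awaitQuietly(acquired);               // isolation is in effect once this returns
    }

    /** Releases the lock so the client can rebuild its channels. */
    void restore() {
        release.countDown();
        try {
            holder.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

While a ChannelIsolator exists, every getOrCreateChannel call times out with an exception; after restore() the next call creates a fresh channel, mirroring the automatic reconnection described above.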

f01df9d7ae068e7bb011dbbb03b115a5.png

Origin blog.csdn.net/g6U8W7p06dCO99fQ3/article/details/131355920