Supporting gRPC long connections: an in-depth look at the Nacos 2.0 architecture design and new model

Author | Yang Yi (Xi Weng), Nacos PMC
Source | Alibaba Cloud Native official account

Introduction to Nacos

1.jpeg

Nacos originated from Alibaba's Colorful Stone project in 2008, which completed the split into microservices and the construction of the business middle platform. With the rise of cloud computing and open source, by 2018 we had come to appreciate the influence of open source software on the industry, so we decided to open-source Nacos: to share Alibaba's ten years of experience in service discovery and configuration management, promote the development of the microservice industry, and accelerate the digital transformation of enterprises.

Currently, Nacos supports the mainstream microservice development languages, service frameworks, and configuration management frameworks, such as Dubbo and SCA (Spring Cloud Alibaba), and also integrates with cloud-native components such as CoreDNS and Sentinel.

Client support covers mainstream languages such as Java, Go, and Python, as well as C# and C++, whose official versions have just been released. Thanks to all community contributors for their support.

2.jpeg

In the more than two years since Nacos was open-sourced, 34 versions have been released, including several important milestone versions. Nacos is still very young and has a lot of room for improvement; community members and experts from all fields are welcome to help build it.

3.jpeg

Nacos 1.X architecture and problems

Next, let's look at the Nacos 1.X architecture and analyze its more important problems. First, the architecture diagram.

Nacos 1.X architecture layers

4.jpeg

Nacos 1.X is roughly divided into 5 layers, namely access, communication, function, synchronization and persistence.

The access layer is the layer users interact with most directly. It mainly consists of the Nacos client, the frameworks that depend on the client (Dubbo and SCA), and the console used for operations. The client and console perform service and configuration operations by sending requests through the HTTP OpenAPI.

The communication layer is mainly based on HTTP's short-connection request model; part of the push function communicates over UDP.

The function layer currently covers service discovery and configuration management. It is the business layer that actually manages services and configurations.

The synchronization layer synchronizes data with Distro in AP mode and Raft in CP mode, plus Notify, the simplest peer-level notification. Each has a different use:

  • Distro: the synchronization mode for non-persistent services.
  • Raft: the synchronization mode for persistent services, and for configuration operations when Derby is used as the configuration store.
  • Notify: when MySQL is used as the configuration store, notifies other nodes to update their caches and initiate configuration pushes.

The persistence layer uses MySQL, Derby, and the local file system. Configuration, user, and permission information is stored in MySQL or Derby; persistent service information and the metadata of services and instances are stored in the local file system.

Service model under Nacos 1.X architecture

Let's walk through a service discovery process to get to know the Nacos 1.X architecture and the service discovery model built on it.

5.jpeg

To register a service, the Nacos client sends an HTTP registration request through the OpenAPI. The request carries the service information and the instance information. This step is usually performed by a microservice framework such as SCA or Dubbo.

When the server receives the request, it first reads and validates the data in the Controller: whether the IP is legal, whether the service name is correct, and so on. If validation passes and the service is being registered for the first time, Nacos creates a Service object on the server and stores the registered instance information in it; if the server already has the Service object, the newly registered instance information is stored directly in the existing object. A Service object is uniquely identified by the combination namespace + group + service name.
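The validate-then-store step above can be sketched as follows. This is a minimal illustration, not Nacos's actual code: the class and method names (`ServiceRegistry`, `register`) and the sample values are assumptions made for the example, and only the idea of keying a Service by namespace + group + service name comes from the text.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Instance:
    ip: str
    port: int


@dataclass
class Service:
    namespace: str
    group: str
    name: str
    instances: set = field(default_factory=set)


class ServiceRegistry:
    """Hypothetical in-memory registry keyed by (namespace, group, service name)."""

    def __init__(self):
        self._services = {}

    def register(self, namespace, group, name, instance):
        # Validation step: reject an illegal IP or an empty service name.
        ipaddress.ip_address(instance.ip)  # raises ValueError for a malformed IP
        if not name:
            raise ValueError("service name must not be empty")

        key = (namespace, group, name)
        service = self._services.get(key)
        if service is None:
            # First registration: create the Service object.
            service = Service(namespace, group, name)
            self._services[key] = service
        # In both cases the instance is stored in the (single) Service object.
        service.instances.add(instance)
        return service


registry = ServiceRegistry()
first = registry.register("public", "DEFAULT_GROUP", "order-service", Instance("10.0.0.1", 8080))
second = registry.register("public", "DEFAULT_GROUP", "order-service", Instance("10.0.0.2", 8080))
```

Both calls return the same `Service` object, which now holds two instances; registering under a different group would create a separate object.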

When the instance is stored in the Service, two events are triggered. One is used for data synchronization: depending on whether the service is ephemeral, the Nacos server uses the Distro or Raft protocol to synchronize the data and notify the other Nacos nodes that the service has changed. The other notifies the subscribers of that service on this Nacos node: based on the subscriber information, the latest service list is pushed to the subscribing clients over UDP. This completes one service registration.

In addition, all information of services defined as persistent is written to the file system through the Raft protocol to guarantee that it is persisted.

Finally, when other Nacos nodes change a Service through synchronization, they also trigger the event that notifies subscribers, so clients subscribed to the service on those nodes receive the push as well.

Problems with the Nacos 1.X architecture

Having roughly introduced the Nacos 1.X architecture and service discovery model, let's analyze some of the more important problems the 1.X architecture faces.

In one sentence: too many heartbeats, too many invalid queries, slow detection of changes through heartbeat renewal, high connection overhead, and serious waste of resources.

6.jpeg

  • Too many heartbeats keep TPS high

Instances stay registered through heartbeat renewal. As the number of services grows, especially with interface-level services such as Dubbo, the volume of heartbeats and of polling for configuration metadata becomes large, which keeps the cluster TPS high and consumes a lot of system resources.

  • Service changes are detected through heartbeat expiry, so the delay is long

An instance is removed, and subscribers are notified, only after its heartbeat renewal times out. The default timeout is 15s, so the delay is long and timeliness is poor. If the timeout is shortened, network jitter triggers change pushes frequently, which costs both the client and the server more.

  • UDP push is unreliable, which keeps QPS high

Because UDP is unreliable, the client has to run a reconciliation query at regular intervals to make sure its cached service list is correct. As the number of subscribing clients grows, the cluster QPS becomes very high, yet most service lists do not actually change often, so most of these queries are wasted.

  • The HTTP short-connection model leaves too many connections in the TIME_WAIT state

In the HTTP short-connection model, every client request creates and destroys a TCP connection. A closed TCP connection stays in the TIME_WAIT state and takes some time to be fully released. When TPS and QPS are high, the server and client may accumulate a large number of TIME_WAIT connections, which leads to connect time out errors or Cannot assign requested address errors.

  • The 30-second long polling of the configuration module causes frequent GC

The configuration module uses blocking HTTP short connections to simulate long-connection communication. Because this is not a true long-connection model, the request and data context must be switched every 30 seconds, and each switch wastes memory, which leads to frequent GC on the server.
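To make the heartbeat-expiry delay from the list above concrete, here is a small sketch of timeout-based removal. It is only an illustration under stated assumptions: `HeartbeatTable`, `beat`, and `sweep` are hypothetical names, time is passed in explicitly so the example is deterministic, and the 15-second value is the default quoted in the text.

```python
HEARTBEAT_TIMEOUT_SECONDS = 15.0  # default timeout quoted in the article


class HeartbeatTable:
    """Hypothetical table of last-heartbeat times, swept periodically by the server."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT_SECONDS):
        self.timeout = timeout
        self._last_seen = {}

    def beat(self, instance_id, now):
        # Each heartbeat renews the instance's lease.
        self._last_seen[instance_id] = now

    def sweep(self, now):
        # Remove (and return) every instance whose last heartbeat is too old.
        expired = [i for i, seen in self._last_seen.items() if now - seen >= self.timeout]
        for instance_id in expired:
            del self._last_seen[instance_id]
        return expired


table = HeartbeatTable()
table.beat("10.0.0.1:8080", now=0.0)          # last heartbeat, then the instance crashes
removed_at_14s = table.sweep(now=14.0)        # still listed: subscribers see a dead instance
removed_at_15s = table.sweep(now=15.0)        # only now is it removed and a push possible
```

The server cannot tell a crashed instance from a healthy one until the full timeout has elapsed, which is the delay that connection-based detection avoids.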

Nacos 2.0 architecture and new model

Nacos 2.0 architecture level

Nacos 2.X builds on the 1.X architecture by adding support for a long-connection model, while retaining core-function support for old clients and the OpenAPI.

7.jpeg

The communication layer currently implements long-connection RPC calls and push through gRPC and RSocket.

On the server side, a new connection layer is added. It converts the different request types from different kinds of clients into functional data structures with the same semantics, so the business processing logic can be reused. Future features such as flow control and load balancing will also be handled in the connection layer.
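The idea of converting different request types into one shared structure can be sketched like this. It is a simplified illustration, not Nacos's actual classes: `RegisterInstanceCommand`, the two adapter functions, and the payload field names are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterInstanceCommand:
    """Hypothetical protocol-independent structure consumed by the business logic."""
    namespace: str
    group: str
    service: str
    ip: str
    port: int


def from_http_params(params):
    # Adapter for an HTTP-style request: every value arrives as a string.
    return RegisterInstanceCommand(
        namespace=params.get("namespaceId", "public"),
        group=params.get("groupName", "DEFAULT_GROUP"),
        service=params["serviceName"],
        ip=params["ip"],
        port=int(params["port"]),
    )


def from_rpc_payload(payload):
    # Adapter for a long-connection RPC request: a nested, typed payload.
    instance = payload["instance"]
    return RegisterInstanceCommand(
        namespace=payload["namespace"],
        group=payload["group"],
        service=payload["service"],
        ip=instance["ip"],
        port=instance["port"],
    )


def handle_register(command, registry):
    # Single business handler, reused no matter which protocol the request came in on.
    key = (command.namespace, command.group, command.service)
    registry.setdefault(key, set()).add((command.ip, command.port))


registry = {}
http_cmd = from_http_params({"serviceName": "order-service", "ip": "10.0.0.1", "port": "8080"})
rpc_cmd = from_rpc_payload({
    "namespace": "public", "group": "DEFAULT_GROUP", "service": "order-service",
    "instance": {"ip": "10.0.0.1", "port": 8080},
})
handle_register(http_cmd, registry)
handle_register(rpc_cmd, registry)
```

Because both adapters produce the same command, the handler stores a single instance; cross-cutting concerns such as flow control could sit in front of the handler without knowing the protocol.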

The other architectural layers remain largely unchanged.

Nacos 2.0 new service model

Although the architecture layers of Nacos 2.0 have not changed much, the details of the model have changed considerably. Let's again use the service registration process to understand the changes in the Nacos 2.0 service model.

8.jpeg

Because communication now uses RPC, all requests from a client (registration or subscription alike) go through the same connection to the same server node. With HTTP, by contrast, each request could land on a different Nacos node. As a result, service discovery data has changed from stateless data to stateful data bound to the connection. To adapt to this change the data model has to change, so a new data structure is abstracted that groups everything the same client publishes and subscribes to over its connection. It is tentatively named Client. This Client does not mean the client program itself but the data associated with that client; one connection corresponds to one Client.

When a client publishes a service, all the service and subscriber information published by that client is updated in the Client object corresponding to its connection, and the event mechanism then triggers an update of the index information. This index maps client connections to services, which makes it easy to quickly aggregate the service-dimension data that needs to be pushed.

After the index is updated, a push event is triggered. All Client objects related to the service are aggregated using the newly generated index. Once the data has been aggregated, the client connections are filtered down to those of the service's subscribers, and the data is pushed back over those connections. This completes the main path for publishing a change.
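The publish, index, aggregate, and push path described above can be sketched as follows. This is a minimal model under stated assumptions: `ClientManager` and its methods are hypothetical names, pushes are recorded in a list instead of being sent over a real connection, and cleaning up on `disconnect` is this example's reading of "data bound to the connection" rather than a documented detail.

```python
from collections import defaultdict


class Client:
    """Data published and subscribed over one connection (not the client program itself)."""

    def __init__(self, connection_id):
        self.connection_id = connection_id
        self.published = {}     # service name -> (ip, port)
        self.subscribed = set()


class ClientManager:
    def __init__(self):
        self.clients = {}                       # connection id -> Client
        self.publishers = defaultdict(set)      # index: service -> publishing connections
        self.subscribers = defaultdict(set)     # index: service -> subscribing connections
        self.pushed = []                        # recorded pushes: (connection, service, instances)

    def connect(self, connection_id):
        self.clients[connection_id] = Client(connection_id)

    def subscribe(self, connection_id, service):
        self.clients[connection_id].subscribed.add(service)
        self.subscribers[service].add(connection_id)

    def publish(self, connection_id, service, instance):
        self.clients[connection_id].published[service] = instance  # update the Client object
        self.publishers[service].add(connection_id)                # then update the index
        self._push(service)                                        # then trigger the push

    def disconnect(self, connection_id):
        # Assumption: closing the connection removes everything bound to it.
        client = self.clients.pop(connection_id)
        for service in client.subscribed:
            self.subscribers[service].discard(connection_id)
        for service in client.published:
            self.publishers[service].discard(connection_id)
            self._push(service)

    def _push(self, service):
        # Aggregate service-dimension data from every Client found through the index.
        instances = sorted(self.clients[c].published[service] for c in self.publishers[service])
        for subscriber in sorted(self.subscribers[service]):
            self.pushed.append((subscriber, service, instances))


manager = ClientManager()
for connection_id in ("conn-a", "conn-b", "conn-c"):
    manager.connect(connection_id)
manager.subscribe("conn-c", "order-service")
manager.publish("conn-a", "order-service", ("10.0.0.1", 8080))
manager.publish("conn-b", "order-service", ("10.0.0.2", 8080))
manager.disconnect("conn-a")
```

After these calls `manager.pushed` holds three pushes to `conn-c`: one instance, then two, then only the instance from `conn-b` once `conn-a` has disconnected.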

Now back to data synchronization. The object actually updated when a client publishes a service has changed from the Service to the Client object, so the content to be synchronized is now the Client object as well; communication between servers also changes to RPC. Only a Client object updated directly by a client triggers synchronization; a Client object updated through synchronization does not trigger synchronization again.
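The rule that only client-originated updates trigger synchronization can be shown with a small sketch. The `Node` class and its method names are hypothetical, and direct method calls stand in for the server-to-server RPC.

```python
class Node:
    """Hypothetical server node holding Client data per connection id."""

    def __init__(self, name):
        self.name = name
        self.peers = []
        self.clients = {}     # connection id -> data published over that connection
        self.sync_sent = 0    # number of sync messages this node has sent

    def on_client_update(self, connection_id, data):
        # Update made by a client connected directly to this node: apply and synchronize.
        self.clients[connection_id] = data
        for peer in self.peers:
            self.sync_sent += 1
            peer.on_sync(connection_id, data)

    def on_sync(self, connection_id, data):
        # Update received from a peer: apply it, but do NOT forward it again.
        self.clients[connection_id] = data


a, b, c = Node("a"), Node("b"), Node("c")
a.peers, b.peers, c.peers = [b, c], [a, c], [a, b]
a.on_client_update("conn-1", {"order-service": ("10.0.0.1", 8080)})
```

Node `a` sends exactly two sync messages and the other nodes send none, so the update reaches every node without echoing back and forth.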

Finally, metadata. Metadata consists of attributes split out of the 1.X Service and Instance objects, such as the service's metadata labels and an instance's online/offline status, weight, and metadata labels. These can be modified individually through the OpenAPI and take effect when the data is aggregated. Metadata is separated from the basic data because basic data, such as IP, port, and service name, should not be modified once published, and the values at publication time prevail, whereas other data, such as online/offline status and weight, is usually adjusted dynamically at runtime. Splitting them into two separate processing workflows is therefore more reasonable.
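One way to picture the split is an immutable record for the published basic data and a separately stored, mutable metadata record that is merged in at aggregation time. The sketch below assumes that layout; `InstanceBase`, `InstanceMetadata`, and `aggregate` are illustrative names, not Nacos's real types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class InstanceBase:
    """Basic data: fixed once published."""
    ip: str
    port: int
    service: str


@dataclass
class InstanceMetadata:
    """Metadata: adjusted at runtime, stored separately from the basic data."""
    enabled: bool = True
    weight: float = 1.0
    labels: dict = field(default_factory=dict)


def aggregate(instances, metadata_store):
    # Merge basic data with whatever metadata is current at aggregation time.
    view = []
    for inst in instances:
        meta = metadata_store.get((inst.ip, inst.port), InstanceMetadata())
        view.append({
            "ip": inst.ip,
            "port": inst.port,
            "enabled": meta.enabled,
            "weight": meta.weight,
            "labels": dict(meta.labels),
        })
    return view


published = [InstanceBase("10.0.0.1", 8080, "order-service")]
metadata_store = {}

before = aggregate(published, metadata_store)

# An operator lowers the weight and takes the instance offline; the published data is untouched.
metadata_store[("10.0.0.1", 8080)] = InstanceMetadata(enabled=False, weight=0.5)
after = aggregate(published, metadata_store)
```

The metadata change shows up in the next aggregated view without re-publishing the instance, which is the practical benefit of keeping the two workflows apart.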

Advantages and disadvantages of Nacos 2.0 architecture

Having briefly introduced the Nacos 2.0 architecture and how the new model works, let's analyze the advantages and disadvantages of these changes.

9.jpeg

Advantages

  • The client no longer needs to send instance heartbeats periodically; a single keepalive message is enough to maintain the connection. Repeated TPS can be greatly reduced.

  • A dropped TCP connection is detected quickly, which improves response speed.

  • Streaming push over a long connection is more reliable than UDP, and the NIO mechanism gives higher throughput. Because push is reliable, the interval at which clients reconcile service lists can be lengthened, or the reconciliation requests removed altogether. Repeated invalid QPS can be greatly reduced.

  • Long connections avoid the overhead of frequently establishing connections and greatly alleviate the TIME_WAIT problem.

  • A true long connection solves the GC problem of the configuration module.

  • Finer-grained synchronization content reduces the communication pressure between server nodes.

Disadvantages

There is no silver bullet; the new architecture also introduces some new problems:

  • The internal structure is more complex: connection state has to be managed, and so does load balancing across connections.

  • Data that used to be stateless is now stateful data bound to the connection, and the processing path is longer.

  • RPC protocols are less observable than HTTP. Even though gRPC is built on HTTP/2 streams, it is still not as intuitive as using HTTP directly.

Nacos 2.X planning

Next, a brief look at the plans for Nacos 2.X, which fall into three areas: documentation, quality, and roadmap.

In documentation and quality, Nacos 1.X did not do very well. The documentation is thin and covers only basic usage; it is out of step with the released versions and not updated promptly; and there is no description of the technical internals, which makes it hard to contribute. Code quality and test quality are not very high either, even though code style is checked with Checkstyle and community collaborative review has been opened up. That is far from enough. Nacos 2.X will gradually update and refine the official website documentation, explain technical details through e-books, and present technical proposals on GitHub to encourage discussion and contribution. It will also involve a large amount of code refactoring and unit test (UT) and integration test (IT) governance. In the future, the benchmark will also be open-sourced so that open source users can run stress tests more easily.

10.jpeg

As for the roadmap, Nacos 2.X will substantially refactor the project, complete an initial plug-in architecture, and address some shortcomings of the 2.0 architecture, such as load balancing and observability.

About the Author

Yang Yi, alias Xi Weng, is a Nacos PMC member mainly involved in the service discovery module and in kernel refactoring and upgrades. He is also an Apache ShardingSphere PMC member; the modules he has led or contributed to include routing, distributed transactions, data synchronization, and elastic scaling.

Origin blog.51cto.com/13778063/2577154