Nacos 2.0 is officially released, greatly improving performance

The Nacos project originated from Alibaba's internal "Colorful Stone" project and has been incubated inside the company since 2008. In recent years, influenced by projects such as Eureka and Consul, Nacos has become more and more popular.

At present, Nacos supports mainstream microservice development languages as well as mainstream service frameworks and configuration management frameworks, such as Dubbo, Spring Cloud, and Spring Cloud Alibaba (SCA), along with cloud-native components such as CoreDNS and Sentinel.

On the client side, mainstream languages such as Java, Go, and Python are currently supported, and the official version released recently also adds support for C# and C++.

Recently, Nacos has been updated frequently, and the Nacos 2.X version has made its debut, adding support for a long-connection model on top of the 1.X architecture. The communication layer now implements long-connection RPC calls and push capabilities through gRPC; the benefit of using long connections is that the frequent polling and heartbeats of 1.X, which caused JVM Full GC, are greatly reduced.

Problems with the Nacos 1.X architecture

Next, let's take a look at some of the more important problems faced by the Nacos 1.X architecture.

In one sentence: too many heartbeats, too many invalid queries, slow perception of service changes through heartbeat renewal, high connection overhead, and serious resource consumption.

  1. A large number of heartbeats keeps TPS persistently high

Through heartbeat renewal, as the scale of services grows, especially with many interface-level services such as Dubbo, the volume of heartbeat and configuration-metadata polling requests becomes very large, driving cluster TPS high and consuming a lot of system resources (see the sketch after this list).

  2. Service changes perceived through heartbeat renewal are delayed

An instance is removed and its subscribers notified only after its heartbeat renewal times out. The default timeout is 15 s, so the delay is long and timeliness is poor. If the timeout is shortened, network jitter will frequently trigger change pushes, placing a heavier load on both client and server.

  3. UDP push is unreliable, resulting in high QPS

Because UDP is unreliable, the client must run reconciliation queries at regular intervals to ensure that its cached service list is correct. As the number of subscribing clients grows, cluster QPS becomes very high, yet most service lists do not actually change frequently, so many of these queries are invalid and waste resources.

  4. The HTTP short-connection model leaves too many connections in the TIME_WAIT state

With the HTTP short-connection model, every client request creates and then destroys a TCP connection. A closed TCP connection enters the TIME_WAIT state and takes some time to be fully released. When TPS and QPS are high, both the server and the client can accumulate a large number of TIME_WAIT connections, which leads to connect timeout errors or "Cannot assign requested address" problems.

  5. Frequent GC caused by the configuration module's 30-second long polling

The configuration module uses an HTTP short-connection blocking model to simulate long-connection communication. Because it is not a real long-connection model, the request and data context must be switched every 30 seconds, and each switch wastes memory, which leads to frequent GC on the server side.
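
As an illustration of problems 1 and 2 above, here is a minimal sketch using the official Nacos Java client against a hypothetical local server (the address and service name are placeholders, not from the original article). With the 1.X client, each registered ephemeral instance gets its own scheduled heartbeat task (every 5 seconds by default), so heartbeat traffic grows linearly with the number of instances.

import java.util.Properties;

import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingService;

public class HeartbeatScaleDemo {
    public static void main(String[] args) throws NacosException {
        Properties props = new Properties();
        // Hypothetical server address for illustration only.
        props.put("serverAddr", "127.0.0.1:8848");
        NamingService naming = NacosFactory.createNamingService(props);

        // With a 1.X client, every registered ephemeral instance triggers a
        // recurring heartbeat task (default interval 5 s); registering many
        // interface-level services multiplies that heartbeat TPS accordingly.
        naming.registerInstance("demo.service", "10.0.0.1", 8080);
    }
}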

Advantages and disadvantages of the Nacos 2.X architecture

Having briefly introduced the Nacos 2.X architecture and how the new model works, let's now analyze the advantages and disadvantages of these changes.

Advantages

  1. The client no longer needs to send instance heartbeats regularly; a keepalive message is enough to maintain the connection, which greatly reduces redundant TPS.

  2. TCP connection disconnection can be quickly detected, which improves the response speed.

  3. Streaming push over the long connection is more reliable than UDP; the NIO-based mechanism has higher throughput, and because pushes are reliable, the client can lengthen the interval at which it reconciles its service list, or even drop those requests entirely, greatly reducing redundant, invalid QPS (see the sketch after this list).

  4. Long connections avoid the overhead of frequently setting up connections and can greatly alleviate the TIME_WAIT problem.

  5. The real long connection solves the configuration module's GC problem.

  6. More fine-grained synchronization content reduces the communication pressure between service nodes.
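
As a rough sketch of what advantage 3 looks like from the client side (the service name and server address below are made up), the Java client's subscribe API delivers changes through a listener; with the 2.X client these notifications arrive over the persistent gRPC stream instead of UDP.

import java.util.Properties;

import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.exception.NacosException;
import com.alibaba.nacos.api.naming.NamingService;
import com.alibaba.nacos.api.naming.listener.Event;
import com.alibaba.nacos.api.naming.listener.EventListener;
import com.alibaba.nacos.api.naming.listener.NamingEvent;

public class PushSubscribeDemo {
    public static void main(String[] args) throws NacosException {
        Properties props = new Properties();
        // Hypothetical server address for illustration only.
        props.put("serverAddr", "127.0.0.1:8848");
        NamingService naming = NacosFactory.createNamingService(props);

        // The listener fires whenever the instance list of the service changes;
        // with 2.X the change is pushed over the long connection, so the client
        // no longer has to rely on UDP plus frequent reconciliation queries.
        naming.subscribe("demo.service", new EventListener() {
            @Override
            public void onEvent(Event event) {
                if (event instanceof NamingEvent) {
                    System.out.println("instances changed: "
                            + ((NamingEvent) event).getInstances());
                }
            }
        });
    }
}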

Disadvantages

There is no silver bullet; the new architecture also introduces some new problems:

  1. The internal structure becomes more complex: connection state must be managed, and connection load balancing needs to be handled.

  2. Data goes from being stateless to being stateful and bound to a connection, and the processing chain becomes longer.

  3. The RPC protocol is less observable than HTTP. Even though gRPC is implemented on top of HTTP/2 streams, it is still not as intuitive as using the HTTP protocol directly.

Performance improvement

The Nacos 2.X service discovery performance tests all focus on key functions. Stress testing a 3-node cluster shows the performance, load, and capacity of the interfaces, and compares them with Nacos 1.X in the same or similar scenarios to measure the improvement.

  • During the stress test, service and instance capacity reached the million level, and the cluster kept running stably and met expectations. (This scenario does not account for the push traffic caused by frequent changes; it only measures capacity with the instances online. A realistic scenario that includes pushes will be covered in the next round of stress test reports.)

  • TPS for registering/deregistering instances reached more than 26,000, at least 2 times higher than Nacos 1.X overall; the interfaces meet expectations.

  • TPS for querying instances reached more than 30,000, roughly 3 times higher than Nacos 1.X overall; the interfaces meet expectations.

Compatibility

Configuration Center

  • Fully compatible with all API methods of the 1.X client

  • Fully implements all API methods of the 2.X client

  • Fully compatible with all configuration-center-related OpenAPIs

Service discovery

Due to major changes in the service discovery data model, the following functions are temporarily not supported:

  • Viewing the current cluster leader (will be deprecated)

  • Batch updating of instance metadata (Beta, not supported)

  • Batch deletion of instance metadata (Beta, not supported)

Console

  • Fully compatible with related pages and functions of the configuration center

  • Fully compatible with access control related pages and functions

  • Fully compatible with namespace-related pages and functions

  • Fully compatible with cluster management related pages and functions

  • Fully compatible with service discovery related pages and functions

Spring Cloud Alibaba adaptation

Since Spring Cloud Alibaba 2.2.5 currently ships with nacos-client 1.4.1 built in, you can try the Nacos 2.0 long-connection feature ahead of time by explicitly specifying the nacos-client version:

<dependency>
    <groupId>com.alibaba.cloud</groupId>
    <artifactId>spring-cloud-starter-alibaba-nacos-discovery</artifactId>
    <version>2.2.5.RELEASE</version>
    <exclusions>
        <exclusion>
            <groupId>com.alibaba.nacos</groupId>
            <artifactId>nacos-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.alibaba.cloud</groupId>
    <artifactId>spring-cloud-starter-alibaba-nacos-config</artifactId>
    <version>2.2.5.RELEASE</version>
    <exclusions>
        <exclusion>
            <groupId>com.alibaba.nacos</groupId>
            <artifactId>nacos-client</artifactId>
        </exclusion>
    </exclusions>
</dependency>
<dependency>
    <groupId>com.alibaba.nacos</groupId>
    <artifactId>nacos-client</artifactId>
    <version>2.0.0</version>
</dependency>
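
With nacos-client 2.0.0 on the classpath, application code keeps using the same ConfigService/NamingService APIs. As a minimal sketch (the dataId, group, and server address are placeholders), a configuration listener looks like this; under 2.0 it is driven by the long connection rather than the 30-second long polling described earlier.

import java.util.Properties;
import java.util.concurrent.Executor;

import com.alibaba.nacos.api.NacosFactory;
import com.alibaba.nacos.api.config.ConfigService;
import com.alibaba.nacos.api.config.listener.Listener;
import com.alibaba.nacos.api.exception.NacosException;

public class ConfigListenerDemo {
    public static void main(String[] args) throws NacosException {
        Properties props = new Properties();
        // Hypothetical server address for illustration only.
        props.put("serverAddr", "127.0.0.1:8848");
        ConfigService configService = NacosFactory.createConfigService(props);

        // The listener API is unchanged from 1.X; only the transport differs.
        configService.addListener("demo-dataId", "DEFAULT_GROUP", new Listener() {
            @Override
            public Executor getExecutor() {
                // Returning null uses the client's default notification thread.
                return null;
            }

            @Override
            public void receiveConfigInfo(String configInfo) {
                System.out.println("config changed: " + configInfo);
            }
        });
    }
}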
