Nacos Architecture and Principles - Design Principles of Registration Center




Preface

In today's network architecture each host has its own IP address, so service discovery essentially boils down to obtaining, by some means, the IP addresses where a service is deployed.

The DNS protocol was the earliest protocol for translating a network name into an IP address. In early architectures,
DNS + LVS + Nginx could satisfy the discovery needs of virtually all RESTful services; the service IP list was usually configured directly in Nginx or LVS.

Later, with the rise of RPC services, instances came online and went offline far more frequently, and people began looking for a registry product that could handle dynamic registration and deregistration and push IP-list changes to consumers.

Individual developers and small to mid-sized companies usually turn to open-source products first.

  • ZooKeeper is the classic registry product (even though that was never its original positioning). For a long time it was the only option Chinese developers thought of when RPC service registries came up, which is closely tied to Dubbo's popularity in China.

  • Consul and Eureka both appeared in 2014. Consul's design covers many of the functions needed for distributed service governance: service registration, health checking, configuration management, service mesh, and so on. Eureka, riding the popularity of microservices and its deep integration with the Spring Cloud ecosystem, also gained a large user base.

  • Nacos draws on Alibaba's experience running services at massive scale in production, and aims to give users a new choice in the market for service registration and configuration management.


Hierarchical service model (the service-cluster-instance three-layer model)

The core data model of a registry is generally divided into three layers:

  • Service level: stores the service name and service-level attributes such as permission rules.
  • Cluster level: the instances of a large-scale service may be divided into clusters, each with its own configuration, so a cluster layer sits between the service and its instances.
  • Instance level: stores each instance's network address, health status, weight, and other attributes. Instance-level data supports features such as instance filtering and traffic distribution.

ZooKeeper stores data in a rather abstract tree of KV nodes. It can hold data with arbitrary semantics, but it does not map well onto the data model that service discovery needs.

The data models of Eureka and Consul go down to the instance level, which covers most scenarios but falls short for large-scale, multi-environment deployments.

The service-cluster-instance three-layer model proposed by Nacos is designed to cover the data storage and management needs of services across all of these scenarios; this is the central idea of its data model.

Therefore, the core of the registry's data model design lies in:

    1. Service level: store basic service information
    2. Cluster level: partition the instances of large-scale services
    3. Instance level: store instance details for instance management and flow control

A good design can meet all data storage and management needs from small-scale to large-scale services.
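The three-layer model above can be sketched as plain data classes. This is a hypothetical illustration of the structure, not Nacos' actual (Java) classes; field names like `protect_threshold` and `health_check_type` are assumptions chosen for clarity:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """Instance level: network address, health, weight, and metadata."""
    ip: str
    port: int
    weight: float = 1.0
    healthy: bool = True
    metadata: dict = field(default_factory=dict)

@dataclass
class Cluster:
    """Cluster level: a partition of a service's instances with its own config."""
    name: str                       # e.g. "DEFAULT", "gray"
    instances: list = field(default_factory=list)
    health_check_type: str = "TCP"  # cluster-level configuration

@dataclass
class Service:
    """Service level: name plus service-level attributes."""
    name: str
    protect_threshold: float = 0.0
    clusters: dict = field(default_factory=dict)  # cluster name -> Cluster

    def healthy_instances(self, cluster: str = "DEFAULT"):
        # Instance-level data enables filtering such as this.
        c = self.clusters.get(cluster)
        return [i for i in c.instances if i.healthy] if c else []
```

The point of the hierarchy is that each layer holds exactly the attributes that vary at its granularity, so small deployments can ignore the cluster layer (a single default cluster) while large ones partition freely.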



Data consistency

    1. Data-consistency protocols fall into two broad camps: leader-based single-point-write consistency (CP) and peer-to-peer multi-write consistency (AP).
    2. No single protocol covers every scenario. When service registration carries no heartbeat, CP is the only choice, because lost data cannot be compensated for by heartbeats. When there is a heartbeat, the single-point write bottleneck of CP becomes a poor fit and AP works better, as in Eureka's renew mechanism.
    3. ZooKeeper uses the ZAB protocol to guarantee strong consistency, but its cross-datacenter disaster-recovery capability is weak, so it is unsuitable for very large scale. For services that, like Dubbo's, use ephemeral nodes with heartbeat renewal, a renew mechanism in the style of Eureka is a better fit.
    4. Nacos supports the coexistence of AP and CP to cover both kinds of scenarios. Version 1.0 restructured the read/write and synchronization logic, isolating business logic from the underlying synchronization layer: business reads and writes are abstracted as Nacos data types and synchronized through a consistency service, with a proxy and routing rules forwarding each operation to either AP or CP.
    5. Nacos' CP implementation is a simplified Raft, which guarantees quorum (majority) consistency and may lose a small amount of data. The AP implementation is the self-developed Distro protocol, which draws on ConfigServer and Eureka and needs no third-party storage; Distro mainly optimizes logic and performance.
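Point 4 above, forwarding each write to the AP or CP engine, can be sketched as a minimal dispatcher. All class and method names here are hypothetical illustrations; in real Nacos the choice is keyed on whether the instance is ephemeral, and the engines are full Distro/Raft implementations:

```python
class APEngine:
    """Distro-style: every node accepts writes locally, then syncs peers asynchronously."""
    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store[key] = value        # local write succeeds immediately
        return "accepted-locally"       # async replication to peers elided


class CPEngine:
    """Raft-style: only the leader accepts writes, committed after a majority ack."""
    def __init__(self, is_leader):
        self.is_leader = is_leader
        self.store = {}

    def put(self, key, value):
        if not self.is_leader:
            raise RuntimeError("forward to leader")
        self.store[key] = value
        return "committed-by-quorum"    # majority-ack round elided


class ConsistencyDelegate:
    """Routes ephemeral data to the AP engine, persistent data to the CP engine."""
    def __init__(self):
        self.ap, self.cp = APEngine(), CPEngine(is_leader=True)

    def put(self, key, value, ephemeral: bool):
        return (self.ap if ephemeral else self.cp).put(key, value)
```

The business layer only ever talks to the delegate; the AP/CP distinction stays an implementation detail, which is exactly the isolation the 1.0 refactor aimed for.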

Therefore, a good registry design requires:

    1. Support for both CP and AP protocols to meet the needs of different scenarios.
    2. Good scalability, avoiding single-point performance bottlenecks.
    3. Strong disaster-recovery capability, supporting large-scale deployment.
    4. A flexible read/write mechanism, so the optimal path can be chosen according to each protocol's characteristics.
    5. A self-developed consensus algorithm, which can be tailored more closely to product requirements.



Load balancing

  • Strictly speaking, load balancing is not a registry function: the registry mainly provides service discovery, and service consumers choose providers based on what it returns.
  • Eureka, ZooKeeper, and Consul do not themselves provide load balancing; Eureka's is implemented by Ribbon and Consul's by Fabio.
  • Alibaba takes the opposite approach: service consumers should not need to care about load balancing, only about accessing the service efficiently and correctly, while service providers pay close attention to traffic allocation, since a skewed distribution can overwhelm instances.
  • Server-side load balancing gives providers stronger flow control but cannot satisfy each consumer's own strategy needs. Client-side load balancing offers more customization, but misconfiguration can create hotspots or make services unreachable.
  • Ribbon balances load in two steps: 1) filter out unqualified providers; 2) apply a balancing strategy to pick one of the remaining providers. Ribbon ships a variety of strategies plus extension interfaces.
  • Label-based load balancing is very flexible and can split traffic at any ratio or weight, but the labels must be stored and managed somewhere, such as in the registry or a third-party CMDB.
  • Nacos has supported label-based load balancing since version 0.7, currently implementing same-label-first access. The label expressions Nacos supports are still limited but will be extended. Nacos also defines a Selector as its load-balancing abstraction.
  • There is no single ideal load-balancing implementation. Nacos tries to combine server-side and client-side load balancing, offering extensibility and options while remaining easy to use: it aims to provide a range of strategies and, where they fall short, lets users extend them.
  • When evaluating a registry, check whether it provides the load-balancing strategies you need with simple usage, and if not, whether those strategies can be easily added through extension.
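Ribbon's two steps combined with Nacos-style same-label-first filtering can be sketched in a few lines. This is a hypothetical illustration, not actual Ribbon or Nacos code; the instance/label dictionary shapes are assumptions:

```python
import random

def choose(instances, consumer_labels=None):
    """Two-step client-side load balancing:
    1) filter: drop unhealthy instances, then prefer same-label providers;
    2) select: weighted random among the survivors."""
    candidates = [i for i in instances if i["healthy"]]
    if consumer_labels:
        same = [i for i in candidates
                if all(i.get("labels", {}).get(k) == v
                       for k, v in consumer_labels.items())]
        if same:                 # same-label first; fall back to all if none match
            candidates = same
    if not candidates:
        return None
    # Weighted random selection over the remaining candidates.
    total = sum(i["weight"] for i in candidates)
    r = random.uniform(0, total)
    for i in candidates:
        r -= i["weight"]
        if r <= 0:
            return i
    return candidates[-1]
```

Because filtering and selection are separate steps, either can be swapped out independently, which is the extension surface Ribbon exposes through its strategy interfaces.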

Server-side load balancing

Client-side load balancing


Health checks

  • ZooKeeper and Eureka both implement a TTL mechanism: a client that fails to send a heartbeat within a given window is removed. Eureka additionally lets a service define a health-check method that inspects its own state when it registers; Dubbo and Spring Cloud treat this as the default behavior.
  • Nacos also supports a TTL mechanism, though it differs from ConfigServer's. Nacos ephemeral instances keep themselves alive with heartbeats: the default heartbeat period is 5 seconds, an instance with no heartbeat for 15 seconds is marked unhealthy, and after 30 seconds it is removed.
  • Some services cannot report heartbeats but can expose a probe interface, and these services still have a strong need for service discovery and load balancing.
  • Common server-side health checks are TCP port probes and HTTP status-code probes, which cover most scenarios. Special scenarios need special probes, such as a MySQL command that checks whether a node is the primary database.
  • Client-side health checking centers on the client heartbeat and the server's mechanism for removing unhealthy clients. Server-side health checking centers on how the server probes clients, how sensitive the probes are, and how client health status is set.
  • Server-side probing is more complex: it must execute the probe, judge the returned result, and manage retries and thread pools. Client-side probing only needs heartbeats to refresh the TTL.
  • Server-side health checks cannot simply remove unhealthy instances and must keep probe tasks alive for every registered instance, whereas a client can be removed as soon as its heartbeat lapses, reducing server pressure.
  • Nacos supports both client-side and server-side health checks, and the same service can switch between modes. Diversified health-check methods let many kinds of services use Nacos load balancing.
  • Nacos' next step is a user-extension mechanism for health checks, letting users pass in business-semantic probes for Nacos to execute, making health checking customizable.
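The ephemeral-instance TTL described above (5-second heartbeat, unhealthy at 15 seconds, removed at 30 seconds) reduces to a small state machine. The sketch below uses hypothetical names, not Nacos' actual classes; only the thresholds come from the text:

```python
HEARTBEAT_PERIOD = 5.0   # seconds between client heartbeats (default)
UNHEALTHY_AFTER = 15.0   # seconds without a heartbeat -> marked unhealthy
REMOVE_AFTER = 30.0      # seconds without a heartbeat -> deregistered

class EphemeralInstance:
    """Tracks the last heartbeat and derives health status from elapsed time."""
    def __init__(self, now):
        self.last_beat = now

    def beat(self, now):
        # A heartbeat refreshes the TTL and restores the instance to healthy.
        self.last_beat = now

    def status(self, now):
        silent = now - self.last_beat
        if silent >= REMOVE_AFTER:
            return "removed"
        if silent >= UNHEALTHY_AFTER:
            return "unhealthy"
        return "healthy"
```

Note the asymmetry the section describes: the server only compares timestamps, so all probing cost stays on the client side.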


In short, by supporting multiple health-check modes, Nacos lets more kinds of services use its load balancing, and it continues to extend the available check methods to improve customizability.
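A server-side TCP port probe, one of the common check methods mentioned above, can be sketched as follows. This is a minimal illustration, not Nacos' actual checker:

```python
import socket

def tcp_probe(ip, port, timeout=2.0):
    """Server-side TCP health check: passes if the port accepts a connection."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or unreachable -> unhealthy.
        return False
```

A real checker wraps calls like this in a scheduled task per registered instance, with retries and a thread pool, which is exactly why the section calls server-side probing the more complex mode.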


Performance and capacity

  • There are many factors that affect performance, such as consensus protocol, machine configuration, cluster size, data volume, data structure and logic design, etc.

  • In service discovery scenarios, read and write performance is critical, but higher performance is not always better: pushing it further may mean sacrificing other qualities.

  • ZooKeeper's write performance can reach on the order of 10,000 TPS thanks to its careful design, but only under preconditions: pure KV writes with no aggregation, health checking, or similar logic. Moreover, its consensus protocol (ZAB, a Paxos variant) limits cluster size, and a 3-5 node ensemble cannot serve large-scale service subscription and query load.

  • Capacity evaluation must consider not only the current service scale but also the expected growth over the next 3-5 years. Alibaba's internal middleware supports millions of instances, a capacity challenge as large as any Internet company's. Capacity is not just the total instance count; it also covers the number of instances per service, the number of subscribers, and QPS.

  • Capacity was a decisive factor in Nacos superseding ZooKeeper and Eureka inside Alibaba.

  • ZooKeeper can store millions of nodes, but that does not represent usable capacity: with a huge number of instances, performance becomes unstable during mass online/offline churn, and flaws in its push mechanism drive up client resource usage and degrade performance.

  • Eureka becomes unavailable at around 5,000 service instances, and high concurrent thread counts can crash it. That said, roughly 1,000 instances is within reach of most registries, and there have been no widespread reports of Eureka capacity or performance problems in China.

  • The open-source version of Nacos can register 10 million service instances and 100,000 services; actual numbers will vary with machine specs, network conditions, and JVM parameters.

  • Nacos 1.0.0 stress-test results:

    • Capacity: 10 million registered instances, 100,000 services
    • Concurrency: read and write QPS up to 50,000
    • Scalability: performance is expected to scale linearly with the number of machines
    • Latency: 99th percentile within 10 ms, up to 100 ms in extreme scenarios

In short, stress testing shows Nacos offers high capacity and performance with good scalability, which helps it meet users' capacity and performance needs. Still, you should test and evaluate against your own scenarios before adopting it.

For a complete test report, please refer to Nacos official website:

https://nacos.io/en-us/docs/nacos-naming-benchmark.html

https://nacos.io/en-us/docs/nacos-config-benchmark.html


Ease of use

  • Ease of use covers simple APIs and clients, complete and readable documentation, and a polished console; for an open-source product it also includes community activity.
  • ZooKeeper scores poorly on ease of use:
    • The client is complex to use, with no API or data model tailored to service discovery
    • Multi-language support is weak, and there is no convenient console for operations and maintenance
  • Eureka and Nacos are much better than ZooKeeper:
    • Both offer service-discovery clients and Spring Cloud starters, enabling low-cost, near-transparent service registration and discovery
    • Standard HTTP interfaces support multiple languages and platforms
  • With Eureka 2.0's development officially halted, further investment in its ease of use is unlikely. Nacos keeps building:
    • Stronger console capabilities, including console login and permission control, monitoring, and Metrics exposure
    • Continuously improved documentation, multi-language SDKs, and more

Cluster scalability

Nacos supports two modes:

  • AP mode, in the style of Eureka: supports ephemeral instances, can replace ZooKeeper and Eureka, and supports datacenter-level disaster recovery
  • CP mode: supports persistent instances, but does not support dual-datacenter disaster recovery


  • Neither ZooKeeper nor Eureka has an official multi-datacenter solution. Drawing on Alibaba's internal experience, Nacos provides the Nacos-Sync component to synchronize full data across datacenters. Nacos-Sync can synchronize not only between Nacos clusters but also with Eureka, ZooKeeper, Kubernetes, and Consul.

In short, Nacos offers an ephemeral-instance AP mode and a persistent-instance CP mode to fit different scenarios. The AP mode can replace ZooKeeper and Eureka and supports datacenter disaster recovery; the CP mode does not. Nacos also provides multi-site active-active and multi-datacenter solutions, relying on Nacos-Sync for cross-datacenter synchronization, areas where ZooKeeper and Eureka are weak. The final choice still depends on your own requirements and operational maturity.


User extensibility

  • In framework design, extensibility is an important principle: Spring and similar frameworks let users plug in custom logic through contract interfaces and dynamic class loading.
  • In server design, user extension points must be treated with caution, since they can affect availability and complicate troubleshooting. Even a well-designed SPI can introduce stability and operations risks and needs careful consideration.

In short, frameworks emphasize extensibility, while servers must weigh it against stability. In open source, accepting extensions as direct contributions is a good model, but ZooKeeper and Eureka do not support user extension. Runtime extension is better still: it decouples cleanly and should be considered whenever a feature is designed, and a product that supports it needs a robust SPI. Nacos has opened, or plans to open, extension points for CMDB, health checks, and load balancing so that diverse requirements can be met in a decoupled way.
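A runtime extension point of the kind described here can be sketched as a tiny plugin registry. The names are hypothetical; Nacos' real SPI is Java-based and loads implementations via the classpath:

```python
class ExtensionRegistry:
    """Minimal SPI-style registry: plugins register factories by name at
    runtime, and the server looks them up without being recompiled."""
    def __init__(self):
        self._factories = {}

    def register(self, name, factory):
        self._factories[name] = factory

    def create(self, name, *args, **kwargs):
        if name not in self._factories:
            raise KeyError(f"no extension registered under {name!r}")
        return self._factories[name](*args, **kwargs)


# Example: a user-supplied health checker plugged in at runtime.
registry = ExtensionRegistry()
registry.register("always-up", lambda: (lambda instance: True))
```

The stability concern raised above lives exactly at the `create` call: whatever the user's factory returns runs inside the server process, so a robust SPI must also sandbox, time-limit, or otherwise contain it.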


Origin: blog.csdn.net/yangshangwei/article/details/131137722