Nacos Architecture and Principles: Communication Channel




Nacos long connections

1. Current situation and background

In Nacos 1.x, the Config and Naming modules each implement their own push channel, following their own design model.


The data push channels of the configuration and service modules are not unified, and HTTP short connections put heavy pressure on performance. Going forward, Nacos needs a long-connection channel that supports both configuration and service discovery, and the push channel should be rebuilt on a standard communication model.



2. Scenario analysis

1. Configuration

Scenario analysis for configuration connections


Between SDK and Server

  • The client SDK needs to know the list of server nodes and select one to connect to according to some strategy; when the underlying connection drops, it must switch to another server and reconnect.

  • Over the currently available long connection, the client makes RPC-style calls for configuration operations such as query, publish, delete, listen, and cancel listen.

  • Sensing configuration changes: the server must push a change notification to every client currently listening. When the network is unstable and a client fails to receive the notification, the server must support re-push and raise an alarm.

  • Sensing client disconnection: the server deregisters the connection and clears the context associated with it, such as the listener information.
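
The last two points (push to current listeners, clear the context on disconnect) can be sketched as a small in-memory registry. This is an illustrative Python sketch, not Nacos code (Nacos itself is written in Java); the class name `ConnectionRegistry` and its methods are hypothetical.

```python
from collections import defaultdict


class ConnectionRegistry:
    """Tracks which config keys each connection listens to (hypothetical model)."""

    def __init__(self):
        self._listens = defaultdict(set)    # connection_id -> config keys
        self._listeners = defaultdict(set)  # config key -> connection_ids

    def add_listener(self, connection_id, config_key):
        """Record that a connection wants change notifications for a key."""
        self._listens[connection_id].add(config_key)
        self._listeners[config_key].add(connection_id)

    def listeners_of(self, config_key):
        """Connections that must be notified when config_key changes."""
        return set(self._listeners.get(config_key, ()))

    def on_disconnect(self, connection_id):
        """Deregister the connection and clear every context entry tied to it."""
        for key in self._listens.pop(connection_id, set()):
            self._listeners[key].discard(connection_id)
            if not self._listeners[key]:
                del self._listeners[key]
```

Keeping both directions of the mapping makes the disconnect cleanup proportional to the number of keys that one connection listened to, rather than to the total number of keys.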


Communication between servers

  • Each server needs to obtain the list of all servers in the cluster and create an independent long connection to each one. When a connection drops it must reconnect, and when the server list changes it must create long connections to new nodes and destroy the long connections to offline nodes.

  • Data must be synchronized between servers, including configuration changes, current connection counts, system load, and load adjustment information.
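
The reaction to a changed server list amounts to a set difference between the peers a node is connected to and the peers it should be connected to. A minimal Python sketch, assuming peers are identified by address strings; the function name `reconcile_peers` and the sample addresses are hypothetical.

```python
def reconcile_peers(connected, members, self_address):
    """Return (to_connect, to_close) for a new cluster member list.

    connected:    addresses this server currently holds long connections to
    members:      the latest full member list, including this server
    self_address: this server's own address (never connected to itself)
    """
    wanted = set(members) - {self_address}
    current = set(connected)
    to_connect = sorted(wanted - current)  # new nodes: create a connection
    to_close = sorted(current - wanted)    # offline nodes: destroy the connection
    return to_connect, to_close
```

For example, a server at `10.0.0.1:8848` connected to `.2` and `.3` that receives the member list `.1, .2, .4` would open a connection to `.4` and close the one to `.3`.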


2. Service

Between SDK and Server

  • The client SDK needs to know the list of server nodes and select one to connect to according to some strategy; when the underlying connection drops, it must switch to another server and reconnect.
  • Over the currently available long connection, the client makes RPC-style calls for service discovery operations such as query, register, deregister, subscribe, and unsubscribe.
  • Sensing service changes: when service data changes, the server must push the new data to the client. A push ack is required so that the server can collect metrics and decide whether to re-push.
  • Sensing client disconnection: the server deregisters the connection and clears the context associated with it, such as the services registered and subscribed over that client connection.

Communication between servers

  • Each server needs to detect whether its peers are alive over the long connection, and to report service status over it (synchronous RPC capability).
  • AP Distro data synchronization between servers requires asynchronous RPC with ack capability.

3. Core requirements for long connections


1. Functional requirements

Client

  • Real-time awareness of the connection life cycle, including connection establishment and disconnection events.
  • Calls from the client to the server support three modes: synchronous blocking, asynchronous Future, and asynchronous callback.
  • Automatic switching of the underlying connection.
  • Switching connections in response to a connection reset message from the server.
  • Server addressing / service discovery.
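
The three call modes can be layered on a single asynchronous primitive. The sketch below is illustrative Python built on the standard `concurrent.futures` module; the class `RpcClient` and the `transport` callable are hypothetical stand-ins for a real long-connection transport.

```python
from concurrent.futures import ThreadPoolExecutor


class RpcClient:
    """Wraps a blocking transport function in three calling styles (sketch)."""

    def __init__(self, transport, workers=4):
        self._transport = transport  # hypothetical callable: request -> response
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def request(self, req, timeout=3.0):
        """Synchronous blocking call: wait for the response or time out."""
        return self.request_future(req).result(timeout=timeout)

    def request_future(self, req):
        """Asynchronous call that returns a Future for the response."""
        return self._pool.submit(self._transport, req)

    def request_callback(self, req, on_success, on_error):
        """Asynchronous call that reports the outcome through callbacks."""
        def done(fut):
            err = fut.exception()
            if err is not None:
                on_error(err)
            else:
                on_success(fut.result())

        fut = self.request_future(req)
        fut.add_done_callback(done)
        return fut

    def close(self):
        """Wait for in-flight calls to finish and release the worker threads."""
        self._pool.shutdown(wait=True)
```

Building the blocking and callback modes on top of the Future mode keeps timeout and error handling in one place.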

Server

  • Real-time awareness of the connection life cycle, including connection establishment and disconnection events.
  • The server actively pushes data to the client. To make the push reliable, the client must return an ack, and the server must retry on failure.
  • The server can actively push load adjustment instructions to clients.
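
Reliable push with ack and retry can be sketched as a bounded retry loop. This is an illustrative Python sketch under stated assumptions: `send` is a hypothetical transport call that returns True when the client acknowledged the push, and the default of three attempts is arbitrary.

```python
import time


class PushFailed(Exception):
    """Raised when no ack arrives within the allowed number of attempts."""


def push_with_ack(send, payload, max_attempts=3, delay_seconds=0.0):
    """Push payload until the client acks or attempts run out.

    Returns the number of attempts used. A caller would typically raise an
    alarm or record a metric when PushFailed is raised.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            if send(payload):
                return attempt
        except (ConnectionError, TimeoutError) as err:
            last_error = err  # treat transport errors like a missing ack
        if attempt < max_attempts and delay_seconds > 0:
            time.sleep(delay_seconds)
    raise PushFailed(f"no ack after {max_attempts} attempts") from last_error
```

The returned attempt count is the kind of value the server could feed into push metrics.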


2. Performance requirements

The channel must support millions of long connections, along with the corresponding volume of requests and pushes, and it must remain stable at that scale.


3. Load balancing

Common load balancing strategies: random, hash, round robin, weighted, least connections, fastest response, etc.

  • Similarities and differences between load balancing for short and long connections: with short connections, connections are established and destroyed quickly, so the "random, hash, round robin, weighted" strategies keep the cluster roughly balanced, and a server restart does not disturb the overall balance. "Least connections" and "fastest response" are stateful algorithms, and delayed data can easily cause an accumulation effect. With long connections, a connection stays up once it is established unless something goes wrong, and the client only selects a new server node when the connection breaks. As a result, connections end up unbalanced after server nodes are released and restarted. The "random, round robin, weighted" strategies can be used when the client reconnects and switches, while "least connections" and "fastest response" suffer from the same delay-driven accumulation effect as with short connections. The main difference between the two is that, once the overall set of long connections is stable, the server needs a rebalance mechanism that redistributes connections from the cluster's perspective and moves the cluster toward a new stable state.

  • Client random + server flexible adjustment: the core strategy is a two-way adjustment, with random selection on the client and flexible runtime adjustment on the server.


Client random selection

The client obtains the server list at startup and selects a node at random. The logic is simple, and the overall distribution stays roughly uniform.
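
A minimal Python sketch of random node selection, under the assumption that a reconnecting client prefers a different node from the one that just failed. The function `pick_server` and its `exclude` parameter are hypothetical.

```python
import random


def pick_server(servers, exclude=None, rng=random):
    """Pick a random server, avoiding `exclude` when another choice exists."""
    candidates = [s for s in servers if s != exclude] or list(servers)
    if not candidates:
        raise ValueError("server list is empty")
    return rng.choice(candidates)
```

At startup the client would call `pick_server(servers)`; after a disconnect it could call `pick_server(servers, exclude=current)` so that it does not immediately return to the failed node.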

Server-side flexible adjustment

(Currently implemented version) Manual control scheme

  • A cluster-level system load console provides views of connection counts, load, and so on (each node additionally reports its connection count, load, CPU, and other information, and the reports are synchronized across the cluster). Operators can manually adjust the number of connections on each server node, manually trigger a rebalance, and manually shave peaks and fill valleys.
  • The console shows cluster-wide figures: the total number of nodes, the total number of long connections, the average per node, and system load information.
  • For each node it shows the address, the number of long connections, and the positive or negative difference from the average.
  • Nodes above the average can be regulated: set an upper limit on the connection count (temporary or persistent), and specify the server node that connections should switch to.
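
The figures shown by such a console follow from simple arithmetic over the per-node connection counts. The Python sketch below is illustrative only; the function names `load_view` and `connections_to_shed` and the sample numbers are hypothetical, not part of the Nacos console.

```python
def load_view(connections):
    """Summarize cluster load from a dict of node address -> connection count."""
    if not connections:
        raise ValueError("no nodes reported")
    total = sum(connections.values())
    average = total / len(connections)
    return {
        "nodes": len(connections),
        "total": total,
        "average": average,
        # positive: above the average, negative: below it
        "diff": {addr: count - average for addr, count in connections.items()},
    }


def connections_to_shed(connections, limit):
    """Connections each node would have to reset to fall back to `limit`."""
    return {addr: count - limit for addr, count in connections.items() if count > limit}
```

With counts of 120, 60, and 90 the average is 90, the differences are +30, -30, and 0, and an upper limit of 100 would require the first node to shed 20 connections.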

 (Future final-state version) Automatic control scheme

  • Based on each server's connection count and load, the system automatically calculates a reasonable number of connections per node, automatically triggers a rebalance, and automatically shaves peaks and fills valleys. This takes longer to implement and depends heavily on the accuracy of the algorithm.

4. Connection life cycle

Heartbeat keep-alive mechanism


What we need

 Low-cost, fast detection: when a server becomes unavailable, the client needs to switch to a new server node as soon as possible to reduce downtime, and it must be able to detect the underlying connection switch and reset its context. When a client disconnects, the server needs to remove the context associated with that client connection, including configuration listeners and service subscriptions, and take offline the instances registered over that connection.

  • The client restarts normally: the client actively closes the connection, and the server perceives it in real time
  • The server restarts normally: the server actively closes the connection, and the client perceives it in real time

 Jitter tolerance:

  • Temporary network unavailability: the client needs to tolerate short-term network jitter, with a retry mechanism so that brief glitches do not cause cluster-wide flapping. Once a threshold is exceeded, the client must switch servers automatically, while avoiding request storms.

 Network disconnection drill:

  • During a network outage, the client retries at a reasonable frequency and can reconnect and recover quickly once the outage ends.
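
One common way to meet the jitter-tolerance and outage requirements is a failure threshold combined with capped exponential backoff and random jitter. The source does not say which policy Nacos uses, so the Python sketch below is an assumption; the threshold of 3, the 0.5 s base, and the 30 s cap are hypothetical values.

```python
import random


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random):
    """Seconds to wait before retry number `attempt` (1-based).

    The ceiling doubles per attempt up to `cap`; the random jitter keeps many
    clients from retrying at the same instant, which limits request storms.
    """
    ceiling = min(cap, base * (2 ** (attempt - 1)))
    return rng.uniform(ceiling / 2, ceiling)


def next_action(consecutive_failures, switch_threshold=3):
    """Tolerate short jitter on the current server, then switch to another."""
    if consecutive_failures < switch_threshold:
        return "retry_same_server"
    return "switch_server"
```

Because the delay is capped, a client keeps probing at a bounded interval during a long outage and reconnects within roughly one cap period after the network recovers.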

5. Security

Supports basic authentication and data encryption capabilities.

6. Low-cost multi-language implementation

On the client side, as many languages as possible should be supported. At a minimum, a single Java server connection channel must be accessible to clients written in multiple mainstream languages, and the cost of implementing each language must be taken into account. A thin SDK should be considered to reduce the cost of multi-language implementations.

Comparison of long-connection options



Consistency Model Based on Long Connections

1. Configuration consistency model

Consistency between SDK and server


Consistency between servers



Servers receive and process synchronization messages from each other with a lightweight implementation, and a monitoring alarm is raised when retries fail.

Network disconnection: if the outage lasts too long and the retry task queue fills up, there is no eviction strategy.


2. Service Consistency Model

Consistency between SDK and server



Consistency between servers




Core Model Component Design



Origin blog.csdn.net/yangshangwei/article/details/131119342