Exploring Consul: An Introduction to Serf's Internal Communication Protocol

This article introduces Serf, the core library underlying Consul: what it is, what problems it solves, and how its internal communication works.


Consul is a tool for service discovery and configuration. It provides a set of high-level features such as service discovery, health checking, and key/value storage. It uses a set of strongly consistent servers to manage the data center. The gossip protocol it uses internally is provided by the Serf library: Consul takes Serf's membership management and failure detection mechanisms and builds its higher-level features on top of them.


Serf is a tool for cluster membership management, failure detection, and orchestration. It is distributed, fault-tolerant, and highly available. Serf runs on all major platforms: Linux, macOS, and Windows. It is lightweight, using only 5 to 10 MB of resident memory, and communicates using UDP messages.


Serf uses the gossip protocol to solve the following three problems (a short Go sketch follows the list):

  • Cluster membership management: Serf maintains a list of cluster members and can run custom handler scripts when membership changes. For example, Serf can maintain the list of web servers behind a load balancer and notify the load balancer when a node comes online or goes offline.

  • Failure detection and recovery: Serf can automatically detect failed nodes within a few seconds, notify the rest of the cluster, and execute handler scripts for these events. It attempts to recover failed nodes by periodically trying to reconnect to them.

  • Custom event propagation: Serf can broadcast custom events to the cluster, which can be used to trigger deployments, propagate configuration, and so on. In the face of offline nodes or network partitions, Serf adopts a best-effort delivery strategy.
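To make the first and third points concrete, here is a minimal sketch using Serf's Go library (github.com/hashicorp/serf/serf) to watch membership changes and user events. The node name is hypothetical and error handling is abbreviated:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/serf/serf"
)

func main() {
	// Channel on which Serf delivers membership changes and user events.
	eventCh := make(chan serf.Event, 16)

	conf := serf.DefaultConfig()
	conf.NodeName = "web-1" // hypothetical node name
	conf.EventCh = eventCh

	s, err := serf.Create(conf)
	if err != nil {
		panic(err)
	}
	defer s.Leave()

	// React to membership changes, e.g. to update a load balancer's
	// backend list, or to custom events broadcast by other nodes.
	for e := range eventCh {
		switch ev := e.(type) {
		case serf.MemberEvent:
			for _, m := range ev.Members {
				fmt.Printf("%s: %s (%s)\n", ev.EventType(), m.Name, m.Addr)
			}
		case serf.UserEvent:
			fmt.Printf("user event %q payload=%s\n", ev.Name, ev.Payload)
		}
	}
}
```

The serf agent CLI exposes the same events to shell scripts through its event handlers, which is how the load balancer example above is usually wired up.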


Serf can also be used on its own for service discovery and orchestration, but it is built on an eventually consistent gossip model with no centralized servers, and it provides none of Consul's higher-level features. Serf provides membership at the node level, while Consul focuses on service-level abstractions. Consul also uses a strongly consistent catalog, while Serf is only eventually consistent.


Consul also provides a key/value store and support for multiple data centers. It uses multiple gossip pools, preserving Serf's LAN performance while still using it over the WAN to connect multiple data centers.


Consul is relatively fixed in how it is used, while Serf is a more flexible and versatile tool. Consul is a CP (consistency and partition tolerance) system that emphasizes consistency over availability, whereas Serf is an AP (availability and partition tolerance) system that sacrifices consistency for availability. This means Consul stops functioning if its central servers cannot form a quorum, while Serf keeps working in almost all situations.


Gossip protocol

Serf uses a gossip protocol to broadcast messages to the cluster. The protocol is based on SWIM, the "Scalable Weakly-consistent Infection-style Process Group Membership Protocol", with some minor modifications, mainly to improve propagation and convergence speed.


SWIM protocol overview

Serf starts by either joining an existing cluster or creating a new one. If it creates a new cluster, it waits for other nodes to join. To join a cluster, a new node must be given the address of at least one existing member. The new member performs a complete state synchronization with that member over TCP, and then begins gossiping its presence to the cluster.
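In the Go library this is a single call; a minimal sketch, assuming a hypothetical existing member at 10.0.0.1:7946 (Serf's default port):

```go
package main

import (
	"fmt"

	"github.com/hashicorp/serf/serf"
)

func main() {
	s, err := serf.Create(serf.DefaultConfig())
	if err != nil {
		panic(err)
	}
	// The address of one existing member is enough; 10.0.0.1:7946 is
	// hypothetical. Passing true skips replaying old user events.
	joined, err := s.Join([]string{"10.0.0.1:7946"}, true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("full TCP state sync done; contacted %d members\n", joined)
}
```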


Gossip is done over UDP, with a configurable but fixed fanout and interval. This ensures that network usage is constant with respect to the number of nodes. A complete state exchange with a random node is also done periodically over TCP, but far less often than gossip messages. This increases the likelihood that the member list converges correctly, because the complete state is exchanged and merged. The interval between full state exchanges is configurable, and the exchange can also be disabled entirely.
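These parameters live on the embedded memberlist configuration in the Go library; the sketch below sets values close to the LAN defaults (the exact defaults come from hashicorp/memberlist and may differ between versions):

```go
package main

import (
	"time"

	"github.com/hashicorp/serf/serf"
)

func main() {
	conf := serf.DefaultConfig()
	// UDP gossip: fixed but configurable fanout and interval, so network
	// usage stays constant relative to cluster size.
	conf.MemberlistConfig.GossipNodes = 3 // fanout
	conf.MemberlistConfig.GossipInterval = 200 * time.Millisecond
	// Periodic full TCP state exchange with a random node; setting this
	// to 0 disables full state sync entirely.
	conf.MemberlistConfig.PushPullInterval = 30 * time.Second

	if _, err := serf.Create(conf); err != nil {
		panic(err)
	}
}
```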


Failure detection is done by periodically probing random nodes at a configurable interval. If the probed node fails to ack within a reasonable time (usually a few multiples of the RTT), an indirect probe is attempted: a configurable number of random nodes are asked to probe the same node, to prevent a network problem local to our own node from causing a false failure detection.
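The sketch below outlines one probe round; probe and indirectProbe are hypothetical stand-ins for SWIM's ping and ping-req messages, not Serf's actual code:

```go
package main

import (
	"math/rand"
	"time"
)

type Node string

// probe and indirectProbe are hypothetical stand-ins for SWIM's UDP ping
// and ping-req messages; a real implementation performs network I/O.
func probe(target Node, timeout time.Duration) bool              { return false }
func indirectProbe(via, target Node, timeout time.Duration) bool { return false }

// probeRound sketches one SWIM failure-detection round against target.
func probeRound(target Node, peers []Node, k int, timeout time.Duration) bool {
	// Direct probe: expect an ack within roughly a few RTTs.
	if probe(target, timeout) {
		return true
	}
	// Indirect probes: ask k random peers to probe the target for us, so
	// that a local network problem does not cause a false failure.
	rand.Shuffle(len(peers), func(i, j int) { peers[i], peers[j] = peers[j], peers[i] })
	if k > len(peers) {
		k = len(peers)
	}
	for _, p := range peers[:k] {
		if indirectProbe(p, target, timeout) {
			return true
		}
	}
	// No ack at all: mark target "suspicious" and gossip that state.
	return false
}

func main() {
	alive := probeRound("node-3", []Node{"node-1", "node-2"}, 2, time.Second)
	_ = alive
}
```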


If the indirect probes also fail within a reasonable time, the node is marked "suspicious" and this state is gossiped to the cluster. A suspicious node is still considered a member of the cluster. If the suspected member does not dispute the suspicion within a configurable period of time, the node is finally considered dead, and this state is in turn gossiped to the cluster.


SWIM protocol changes

As mentioned earlier, the gossip protocol is based on SWIM but includes some minor changes, mainly to improve propagation and convergence speed.

  • Serf periodically performs a full state sync over TCP, while SWIM only propagates changes through gossip. Both are eventually consistent, but Serf converges faster and recovers gracefully from network partitions.

  • Serf has a dedicated gossip layer separate from the failure detection protocol, while SWIM only piggybacks gossip messages on top of probe/ack messages. Serf uses both piggybacking and dedicated gossip messages. This allows a higher gossip rate (for example, once per 200ms) alongside a slower failure detection rate (for example, once per second), resulting in faster overall convergence and data propagation.

  • Serf keeps the state of offline nodes for a period of time, so that when a full state sync is requested, the requester also receives information about offline nodes. SWIM does not perform full syncs, so it deletes an offline node and its state immediately upon learning the node is offline. This change again helps the cluster converge faster.


Lifeguard mechanism

SWIM assumes the local node is healthy, but if the local node is starved of CPU or network resources, it may process messages too slowly and falsely flag healthy peers as failed. These false alarms then waste the whole cluster's CPU and network resources diagnosing failures that may not really exist. Serf 0.8 added the Lifeguard mechanism to solve this problem.


The first extension introduces a "nack" message to probe queries. If the probing node notices it is failing to receive "nack" messages, it realizes it may itself be degraded and slows down its failure detector. As nack messages begin to arrive again, the failure detector rate is restored.


The second change introduces a dynamically adjusted suspicion timeout before a node is declared failed. The probing node starts with a very long suspicion timeout; as other members of the cluster confirm a node is suspect, the timer accelerates. During normal operation, detection time is effectively the same as in previous versions of Serf. However, if a node is degraded and confirmations do not arrive, the long timeout gives the suspected node time to refute its suspect status.
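In hashicorp/memberlist, the library underneath Serf, the suspicion timer shrinks roughly logarithmically as independent confirmations arrive; the sketch below models that decay (the real implementation's constants and edge cases differ):

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// suspicionTimeout sketches the logarithmic decay applied to the
// suspicion timer: n is the number of independent confirmations seen so
// far, k the number expected before reaching the minimum timeout.
func suspicionTimeout(minT, maxT time.Duration, n, k int) time.Duration {
	frac := math.Log(float64(n)+1) / math.Log(float64(k)+1)
	timeout := maxT - time.Duration(frac*float64(maxT-minT))
	if timeout < minT {
		timeout = minT
	}
	return timeout
}

func main() {
	// With no confirmations the timeout stays long; each confirmation
	// accelerates the declaration of failure: 30s, ~15.8s, ~7.7s, 3s.
	for n := 0; n <= 3; n++ {
		fmt.Println(n, suspicionTimeout(3*time.Second, 30*time.Second, n, 3))
	}
}
```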

Together, these two mechanisms make Serf's handling of degraded nodes in a cluster much more robust, while keeping failure detection performance unchanged.


Serf-specific messages

On top of the SWIM-based gossip layer, Serf sends some custom message types. Serf makes heavy use of Lamport clocks to maintain a notion of message ordering, despite being only eventually consistent. Every message sent by Serf includes a Lamport clock time. When a node gracefully leaves the cluster, Serf sends a leave intent through the gossip layer. Because the underlying gossip layer makes no distinction between a node that leaves the cluster and a node that is detected as failed, the leave intent lets the higher-level Serf layer distinguish graceful departures from node failures.
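Serf's Lamport clock is small; the sketch below mirrors its shape (based on serf/lamport.go in the source): Increment before sending a message, Witness on receipt so the local clock always moves past anything already observed:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// LamportTime is a logical timestamp. This mirrors the shape of Serf's
// own clock, reduced to a minimal sketch.
type LamportTime uint64

type LamportClock struct{ counter uint64 }

// Time returns the current logical time.
func (l *LamportClock) Time() LamportTime {
	return LamportTime(atomic.LoadUint64(&l.counter))
}

// Increment advances the clock before a message is sent.
func (l *LamportClock) Increment() LamportTime {
	return LamportTime(atomic.AddUint64(&l.counter, 1))
}

// Witness fast-forwards the clock past a timestamp observed on a received
// message, so later local events sort after everything already seen.
func (l *LamportClock) Witness(v LamportTime) {
	for {
		cur := atomic.LoadUint64(&l.counter)
		if uint64(v) < cur {
			return
		}
		if atomic.CompareAndSwapUint64(&l.counter, cur, uint64(v)+1) {
			return
		}
	}
}

func main() {
	var c LamportClock
	c.Witness(41)                        // a received message was stamped 41
	fmt.Println(c.Time(), c.Increment()) // 42, then 43 for the next send
}
```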


When a node joins the cluster, Serf sends a join intent. The purpose of this intent is simply to attach a Lamport clock time to the join, so that it can be ordered correctly if the node later leaves or fails. For custom events and queries, Serf sends either a user event or a user query message. These messages contain the Lamport time, the event name, and the event payload. Because user events are sent over the UDP gossip layer, the payload and the entire message framing must fit within a single UDP packet.
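Broadcasting a user event from the Go library is a single call; a fragment, reusing the Serf instance s from the join sketch earlier, with an illustrative event name and payload:

```go
// coalesce=false delivers every event instead of collapsing duplicates
// with the same name that arrive close together.
if err := s.UserEvent("deploy", []byte("v1.2.3"), false); err != nil {
	// e.g. Serf returns an error if the payload would exceed the user
	// event size limit for a single UDP packet.
	panic(err)
}
```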


References:

Serf internals documentation: https://www.serf.io/docs/internals/index.html

 


