Principles of etcd service discovery

What is etcd

etcd is a key-value storage system written in Go that supports service registration and discovery. The official definition is a reliable distributed key-value store, used mainly for shared configuration and service discovery.

  • Simple: easy to install and configure, with an HTTP API provided for interaction
  • Key-value storage: data is stored in hierarchically organized directories, much like a standard file system
  • Watch for changes: specific keys or directories can be watched, with reactions triggered when their values change
  • Fast: according to the official benchmarks, a single instance supports 2,000+ read operations per second
  • Reliable: the Raft algorithm is used to provide availability and consistency for the distributed data

What is service discovery

Service discovery is one of the most common problems in distributed systems: how do processes or services in the same distributed cluster find each other and establish connections? Essentially, service discovery means learning whether any process in the cluster is listening on a TCP or UDP port, and being able to look it up and connect to it by name. Service discovery revolves around three requirements:

  • A strongly consistent and highly available service storage catalog.
  • A mechanism for registering services and monitoring the health status of services.
  • A mechanism for finding and connecting services.

How does etcd implement service discovery

When a client starts, it fetches the servers' basic information from etcd, stores it in a local map, and uses that local information to connect to a server. When a server connects to etcd, it requests a lease; etcd uses the lease to detect the server's liveness and thereby manage servers dynamically. When a client issues a request, etcd creates a Watcher to monitor in real time whether the server information the client depends on has changed, and notifies the client when it does.

In etcd, the Raft algorithm provides data consistency, leases (Lease) provide dynamic detection of servers, and watches (Watch) provide notification of server changes. Combining these three, etcd implements service discovery.
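The register/watch loop described above can be sketched as a toy in-memory registry. This is an illustrative simulation only, not etcd's API; the `registry`, `register`, and `watch` names are invented for the example.

```go
package main

import "fmt"

// registry is a toy stand-in for etcd's role in service discovery:
// servers register under a key, clients cache the table locally and
// subscribe to change notifications.
type registry struct {
	services map[string]string // key -> server address
	watchers []chan string     // each watcher receives the changed key
}

func newRegistry() *registry {
	return &registry{services: make(map[string]string)}
}

// register stores a server's info and notifies every watcher.
func (r *registry) register(key, addr string) {
	r.services[key] = addr
	for _, w := range r.watchers {
		w <- key
	}
}

// watch returns a channel on which changed keys are delivered.
func (r *registry) watch() <-chan string {
	ch := make(chan string, 16) // buffered so register never blocks in this toy
	r.watchers = append(r.watchers, ch)
	return ch
}

func main() {
	r := newRegistry()
	events := r.watch()
	r.register("/services/user/1", "10.0.0.1:8080")
	fmt.Println(<-events) // the changed key
	fmt.Println(r.services["/services/user/1"])
}
```

The real system adds what the toy omits: Raft replicates the map across nodes, and leases expire entries whose owners stop heartbeating.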

Raft algorithm

Raft is a consensus algorithm designed for high availability and strong consistency. A Raft cluster contains multiple servers, and the cluster's time is divided into terms (Term). Each term begins with the election of a Leader node, which is then responsible for replicating and managing the cluster's log. If the current Leader fails, the cluster enters a new term and elects a new Leader. After the election, the cluster runs normally until the Leader fails again and a new election is triggered.

In Raft, at any moment each node is in exactly one of three states: Leader, Follower, or Candidate. When an etcd cluster starts its nodes, they follow the Raft protocol and all initialize as Followers. Every term therefore begins with a Leader election.

If a Follower's election timer expires without receiving a heartbeat, it assumes the Leader has failed and triggers a new round of election: it becomes a Candidate and sends out vote requests. Each node has only one vote per term.

To avoid votes being split among multiple candidates, each candidate's election timeout in Raft is randomized between 150 ms and 300 ms.

Raft voting mechanism

When a candidate node requests votes from other nodes, it first votes for itself, increments its own TermId, and sends a RequestVote RPC to the other nodes. The RequestVote RPC carries its TermId and Index (TermId: the term number; Index: the log slot subscript).

  • When another node receives the request, it does not vote if the TermId of its own last log entry is greater than the requested TermId
  • If the TermId of its last log entry equals the requested TermId, but its own index is greater than the requested index, it also does not vote
  • If, after a period of time, a candidate has received no votes and has heard nothing from any other leader, it runs again
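The vote-granting rule in the first two bullets can be written as a single function. This is a minimal sketch of the log-freshness check; a real node additionally tracks votedFor to enforce one vote per term:

```go
package main

import "fmt"

// grantVote applies the log-freshness check a node performs on a
// RequestVote RPC: refuse any candidate whose log is older than our own.
func grantVote(myLastTerm, myLastIndex, candTerm, candIndex int) bool {
	if myLastTerm > candTerm {
		return false // our log ends in a newer term
	}
	if myLastTerm == candTerm && myLastIndex > candIndex {
		return false // same term, but our log is longer
	}
	return true
}

func main() {
	fmt.Println(grantVote(2, 5, 3, 1)) // true: candidate's term is newer
	fmt.Println(grantVote(3, 5, 3, 4)) // false: same term, shorter log
}
```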

For example, take three nodes A, B, and C, where A and B both initiate vote requests. A's election message reaches C first, so C votes for A; when B's message arrives, C has already used its single vote for the term. A now has two votes and B only one. Having won, A sends heartbeat messages to B and C. When B sees that A's termId is higher than its own, it knows a Leader already exists and converts to a Follower.

How Raft replicates logs

A client sends a data-change request X. The Leader writes X to its own log and sends a log-append request for X to all Followers; each Follower writes X to its log and replies to the Leader. When at least (N-1)/2 Followers respond successfully (which, together with the Leader itself, forms a majority), the Leader commits the entry and sends a commit request to all nodes.
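The commit threshold is just majority arithmetic. A small sketch, with `followersNeeded` an illustrative name:

```go
package main

import "fmt"

// followersNeeded returns how many follower acknowledgements the leader
// must collect before committing: the leader's own copy plus n/2 acks
// is strictly more than half of an n-node cluster.
func followersNeeded(n int) int {
	return n / 2
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("cluster of %d: commit after %d follower acks\n", n, followersNeeded(n))
	}
}
```

For a 5-node cluster this gives (5-1)/2 = 2 follower acks, matching the formula in the text.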

How Raft repairs a node's data after it recovers from downtime

When the Leader sends log entries to a Follower, it includes the term number and index of the preceding entry. The Follower accepts the new entries only if it has an entry with that same term and index; otherwise the Leader decrements the index and retries until it finds a position where the two logs agree. The Follower then deletes all of its log entries after that index and appends the entries sent by the Leader. Once the append succeeds, the Follower's log is identical to the Leader's. An entry is committed, and success returned to the client, only after a majority of Followers have accepted it, which guarantees that Follower and Leader data end up exactly the same.
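The back-up-and-retry repair can be sketched as a follower-side consistency check. This is a simplified model (0-based indexes, no commit tracking), with `appendEntries` and `entry` as illustrative names:

```go
package main

import "fmt"

type entry struct{ term int }

// appendEntries models the follower's consistency check: the entry at
// prevIndex must carry prevTerm; on a match, any conflicting suffix is
// discarded and the leader's entries are appended. On a mismatch the
// leader will retry with a smaller prevIndex.
func appendEntries(log []entry, prevIndex, prevTerm int, entries []entry) ([]entry, bool) {
	if prevIndex >= 0 && (prevIndex >= len(log) || log[prevIndex].term != prevTerm) {
		return log, false // no match: leader must back up and retry
	}
	log = append(log[:prevIndex+1], entries...) // drop conflicts, append leader's entries
	return log, true
}

func main() {
	follower := []entry{{1}, {1}, {2}} // diverged from the leader at index 2
	leaderTail := []entry{{3}, {3}}

	// Leader first tries prevIndex=2 with term 3: mismatch, so it backs up.
	_, ok := appendEntries(follower, 2, 3, leaderTail)
	fmt.Println(ok) // false

	// At prevIndex=1 (term 1) the logs agree; the conflict is overwritten.
	repaired, ok := appendEntries(follower, 1, 1, leaderTail)
	fmt.Println(ok, len(repaired)) // true 4
}
```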
For a visual walkthrough, see the well-known Raft animation online.

etcd service registration and monitoring

After each server starts, it sends a registration request to etcd along with its basic information, which etcd stores as a KV pair: the key is the user's actual key, and the value holds the corresponding version information. A keyIndex records all version information for a key. Each deletion closes off a generation, and each generation stores all of the revision numbers from the key's creation to its deletion within that life cycle.

Service Registration

  • Open a write transaction.
  • Using the current revision (rev), look up the keyIndex for an existing record of this key, mainly to obtain its created revision and version (ver).
  • Generate the new KeyValue record.
  • Update the keyIndex record.
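The revision bookkeeping in these steps can be sketched as follows. Field and type names here (`store`, `keyMeta`) are illustrative simplifications of etcd's mvcc structures, not its real types:

```go
package main

import "fmt"

// keyMeta holds the per-key version info the put path maintains.
type keyMeta struct {
	created int64 // revision at which the current generation was created
	modRev  int64 // revision of the last modification
	version int64 // number of puts in this generation
}

// store sketches the registration put path: bump the store revision,
// consult the key's index for created/version, then record the write.
type store struct {
	currentRev int64
	keyIndex   map[string]*keyMeta
}

func (s *store) put(key string) keyMeta {
	s.currentRev++ // each write transaction gets a new main revision
	m, ok := s.keyIndex[key]
	if !ok {
		m = &keyMeta{created: s.currentRev}
		s.keyIndex[key] = m
	}
	m.version++
	m.modRev = s.currentRev
	return *m
}

func main() {
	s := &store{keyIndex: map[string]*keyMeta{}}
	fmt.Printf("%+v\n", s.put("/services/user/1"))
	fmt.Printf("%+v\n", s.put("/services/user/1"))
}
```

A second put to the same key keeps its created revision but bumps version and modRev, which is exactly the lookup the write transaction performs against the keyIndex.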

In etcd's source, this corresponds to the put path of the mvcc package.

Health check

During registration, a heartbeat period (ttl) and a lease are initialized. The server must send packets to etcd within each heartbeat period to show it is working normally. If etcd receives no heartbeat within the period, it considers the server abnormal and deletes the server's information. If heartbeats are normal but the server's lease has expired, the server must apply for a new lease; otherwise etcd deletes all information associated with that lease.

In etcd, the corresponding keyValue information is not deleted from disk; it is only marked as deleted.

  • First, delete generates an ibytes buffer and adds a mark to it indicating that this revision is a delete.
  • It then generates a KeyValue that contains only the Key information.
  • Finally, it sets the Tombstone flag to end the current life cycle, creates a new generation, and updates the kvindex.
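The tombstone-and-generation behavior can be sketched in a few lines. This is a simplified model of the keyIndex life cycle, with `generation` and `keyIndex` reduced to the minimum needed to show it:

```go
package main

import "fmt"

// A generation is one create-to-delete life cycle of a key; a tombstone
// revision closes it rather than erasing anything.
type generation struct {
	revs      []int64
	tombstone bool
}

type keyIndex struct{ gens []*generation }

// put appends a revision, opening a fresh generation if the last one
// was closed by a tombstone.
func (ki *keyIndex) put(rev int64) {
	if len(ki.gens) == 0 || ki.gens[len(ki.gens)-1].tombstone {
		ki.gens = append(ki.gens, &generation{})
	}
	g := ki.gens[len(ki.gens)-1]
	g.revs = append(g.revs, rev)
}

// delete records the delete revision as a tombstone, ending the
// current generation; the data itself stays on disk.
func (ki *keyIndex) delete(rev int64) {
	g := ki.gens[len(ki.gens)-1]
	g.revs = append(g.revs, rev)
	g.tombstone = true
}

func main() {
	ki := &keyIndex{}
	ki.put(1)
	ki.put(2)
	ki.delete(3) // closes generation 1
	ki.put(4)    // starts generation 2
	fmt.Println(len(ki.gens), ki.gens[0].tombstone) // 2 true
}
```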

etcd lease mechanism

When etcd starts and the lessor module is created, it launches two resident goroutines: RevokeExpiredLease periodically checks for expired leases and initiates their revocation, and CheckpointScheduledLease periodically triggers updates to leases' remaining time-to-live.
Creating a lease calls the LeaseGrant function. In this function, if no leaseId is specified, reqIDGen is called to obtain an ID, and finally the Grant function completes the creation.

The Grant function first checks whether the id and ttl are valid, creates a Lease structure, and checks the leaseMap for an existing lease with the same ID. It then persists the lease data to disk and returns the unique LeaseID to the client.

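A minimal sketch of that Grant path, assuming simplified types (`lessor`, `lease`, a counter standing in for etcd's reqIDGen, and no disk persistence):

```go
package main

import (
	"errors"
	"fmt"
)

type lease struct {
	id  int64
	ttl int64 // seconds
}

// lessor models the Grant flow: validate the TTL, allocate an ID when
// the caller did not supply one, reject duplicates, record the lease.
type lessor struct {
	nextID   int64
	leaseMap map[int64]*lease
}

func (le *lessor) grant(id, ttl int64) (*lease, error) {
	if ttl <= 0 {
		return nil, errors.New("invalid TTL")
	}
	if id == 0 {
		le.nextID++ // stand-in for etcd's request ID generator
		id = le.nextID
	}
	if _, ok := le.leaseMap[id]; ok {
		return nil, errors.New("lease already exists")
	}
	l := &lease{id: id, ttl: ttl}
	le.leaseMap[id] = l // the real lessor also persists this to disk
	return l, nil
}

func main() {
	le := &lessor{leaseMap: map[int64]*lease{}}
	l, _ := le.grant(0, 10)
	fmt.Println(l.id, l.ttl)
}
```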

After the lease is created, it must be associated with a key, which is really just a put operation: etcdctl put kv --lease xxx

  • First obtain the oldLease, since this key may already be associated with another lease ID; the lease ID is carried when constructing the mvccpb.KeyValue.
  • If the oldLease is valid, Detach must first be called to break the old association.
  • Finally, Attach is called to associate the lease with the key.
  • In practice, the lease-to-key association is stored in a map.
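The Detach-then-Attach dance over a map can be sketched directly. Names (`lessor`, `itemMap`, `put`) are illustrative, not etcd's exact signatures:

```go
package main

import "fmt"

// lessor keeps, per lease ID, the set of keys bound to that lease.
type lessor struct {
	itemMap map[int64]map[string]struct{}
}

func (le *lessor) attach(id int64, key string) {
	if le.itemMap[id] == nil {
		le.itemMap[id] = map[string]struct{}{}
	}
	le.itemMap[id][key] = struct{}{}
}

func (le *lessor) detach(id int64, key string) {
	delete(le.itemMap[id], key)
}

// put re-binds key from oldLease to newLease, mirroring the steps above:
// detach from the old lease if there was one, then attach to the new one.
func (le *lessor) put(key string, oldLease, newLease int64) {
	if oldLease != 0 {
		le.detach(oldLease, key)
	}
	le.attach(newLease, key)
}

func main() {
	le := &lessor{itemMap: map[int64]map[string]struct{}{}}
	le.put("/services/user/1", 0, 7)
	le.put("/services/user/1", 7, 8) // re-associate with a new lease
	fmt.Println(len(le.itemMap[7]), len(le.itemMap[8])) // 0 1
}
```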

etcd stores leases in a min-heap ordered by expiry. To evict expired leases, an asynchronous goroutine periodically removes them from the top of the heap and executes the revokeExpiredLease task, deleting each lease together with its associated list of keys.

checkpoint mechanism

Lease deletion essentially works by the main loop periodically polling for expired leases; once it has their IDs, the Leader initiates a revoke operation that tells the whole cluster to delete the leases and their associated data. When the Leader fails and a Follower campaigns to become the new Leader, the min-heap must be rebuilt. However, if Leader switches happen frequently and the interval between switches is shorter than a lease's TTL, that lease may never be deleted and keys can accumulate in large numbers. The checkpoint mechanism addresses this:

  • On one hand, when etcd starts, the Leader node runs an asynchronous background task that periodically batches leases' remaining TTLs to the Follower nodes via the Raft log. When a Follower receives a CheckPoint request, it updates the remaining-TTL information in its in-memory LeaseMap.
  • On the other hand, when the Leader receives a KeepAlive request, it also resets the lease's remaining TTL through the checkpoint mechanism and synchronizes it to the Followers, so that after a renewal the remaining TTL stays consistent across the cluster's nodes.

Watch lookup

etcd saves the watcher requests sent by each client. A watcher monitors one key or a group of keys and, when a change occurs, sends the changed content through a channel. A watcherGroup manages multiple watchers and can quickly find, by key, the watcher or watchers monitoring it. etcd runs a thread that continuously iterates over all watch requests, and each watch object tracks the revision up to which it has pushed events for the keys it monitors.
etcd scans the store backward from that revision ID and, for each KV it traverses, checks whether its key is one the watch request cares about; if so, the event is sent to the client.

When etcd starts, it registers a WatchServer to process watch requests. It creates a serverWatchStream structure and opens two goroutines: sendLoop sends watch messages, and recvLoop receives requests.

Accepting watch requests (the recvLoop goroutine)

recvLoop reads requests from gRPCStream and processes them by type:

  • WatchRequest_CreateRequest: register the watch by calling watchStream.Watch, which returns a watchId, and send that watchId back over ctrlStream.
  • WatchRequest_CancelRequest: call watchableStore.Cancel to cancel the subscription and clear its state.

Sending watch events (the sendLoop goroutine)

After a watcher is created, a put operation proceeds, after the Raft module, to WatchableKV: the modified KV is saved in the changes array, the modification is converted into an event, and the watchableStore.notify function matches the watchers in the synced watcherGroup that are listening on this key, sending the event to each matched watcher's channel.

When the sendLoop goroutine sees a message on the channel, it reads it and immediately pushes it to the client. At this point, the push of a new modification event is complete.

Watch event delivery mechanism

If the channel is full, etcd does not discard the event, in order to keep Watch events highly reliable. Instead, it removes the watcher from the synced watcherGroup and saves the watcher together with its event list into a watcherBatch structure called victim; an asynchronous retry mechanism then guarantees delivery of the events.
If the network is abnormal, etcd likewise places the Watch events into the victim structure to guarantee their delivery. WatchableKV opens two goroutines: syncWatchersLoop is responsible for moving unsynchronized watchers into the synchronized state, and syncVictimsLoop traverses the victims, trying to write all of their events to the channel. If a write fails, the events are put back into the victim structure; if the smallest version number a watcher is listening from is less than the server's current version number, the watcher is placed in unsynced, otherwise in synced.
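The full-channel fallback is a non-blocking channel send with a retry list. A minimal sketch, with `watcher`, `notify`, and `retryVictims` as invented names for the behavior just described:

```go
package main

import "fmt"

type event struct{ key string }

// watcher models etcd's delivery rule: try the channel, and if it is
// full, park the event in a victim list instead of dropping it.
type watcher struct {
	ch      chan event
	victims []event
}

func (w *watcher) notify(ev event) {
	select {
	case w.ch <- ev:
		// delivered immediately
	default:
		w.victims = append(w.victims, ev) // channel full: keep for retry
	}
}

// retryVictims re-attempts delivery, keeping whatever still does not fit;
// in etcd an asynchronous loop (syncVictimsLoop) does this periodically.
func (w *watcher) retryVictims() {
	rest := w.victims[:0]
	for _, ev := range w.victims {
		select {
		case w.ch <- ev:
		default:
			rest = append(rest, ev)
		}
	}
	w.victims = rest
}

func main() {
	w := &watcher{ch: make(chan event, 1)}
	w.notify(event{"a"})
	w.notify(event{"b"})        // channel is full: becomes a victim
	fmt.Println(len(w.victims)) // 1
	<-w.ch                      // client drains the channel
	w.retryVictims()
	fmt.Println(len(w.victims)) // 0
}
```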


notify mechanism

Suppose tens of thousands of watchers are monitoring key changes: when the server receives a write request, how does etcd quickly find the watchers for the changed key? etcd uses a map to record the watchers monitoring a single key. But because the Watch feature can monitor not only a single key but also a key range or key prefix, etcd uses both a map and an interval tree to store watch registrations.
When an event is generated, etcd first looks in the map for watchers on that exact key, then finds in the interval tree all intervals that intersect the key and collects the watchers stored in those intervals' values.
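The two-part lookup can be sketched with a map plus a linear scan over intervals (etcd uses an interval tree for the second part; a slice is enough to show the matching rule). Type names here are illustrative:

```go
package main

import "fmt"

// rangeWatcher watches every key in the half-open interval [start, end);
// a prefix watch on "/svc/" is the interval ["/svc/", "/svc0").
type rangeWatcher struct {
	start, end string
	id         int
}

type watcherGroup struct {
	keyWatchers map[string][]int // exact-key watcher ids
	ranges      []rangeWatcher   // ranged/prefix watchers
}

// match returns the ids of every watcher interested in key: first the
// exact-key map, then every interval that contains the key.
func (wg *watcherGroup) match(key string) []int {
	ids := append([]int(nil), wg.keyWatchers[key]...)
	for _, r := range wg.ranges {
		if key >= r.start && key < r.end {
			ids = append(ids, r.id)
		}
	}
	return ids
}

func main() {
	wg := &watcherGroup{
		keyWatchers: map[string][]int{"/svc/a": {1}},
		ranges:      []rangeWatcher{{start: "/svc/", end: "/svc0", id: 2}}, // "/svc/" prefix
	}
	fmt.Println(wg.match("/svc/a")) // [1 2]
	fmt.Println(wg.match("/other")) // []
}
```

The interval tree replaces the linear scan so the lookup stays fast even with tens of thousands of ranged watchers.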

The above summary has been compiled into a PPT, which can be downloaded if needed.
Link: https://pan.baidu.com/s/1Aza697_JiwgEmGG5eZw65A Password: 6685

Origin: blog.csdn.net/qq_42708024/article/details/114986793