Cloud Native (K8s) Series: Principles and Practice of etcd, a Distributed and Reliable Key-Value Store

Overview

Definition

etcd official website: https://etcd.io (latest version at the time of writing: 3.5.7)

etcd official documentation: v3.5 docs (https://etcd.io/docs/v3.5/)

etcd source code: GitHub - etcd-io/etcd: Distributed reliable key-value store for the most critical data of a distributed system (https://github.com/etcd-io/etcd)

etcd is a strongly consistent, reliable distributed key-value store developed in the Go language (like Docker and Kubernetes). It provides reliable distributed key-value storage, configuration sharing, and service discovery, and it handles Leader election gracefully even during network partitions that split-brain the cluster. Officially, etcd is a CNCF project; it can be said that etcd has become the storage cornerstone of cloud-native and distributed systems.

Application scenarios

Data in distributed systems is divided into control data and application data. etcd's usage scenarios default to control data; for application data, it is recommended only when the data volume is small but updates and reads are frequent. Application scenarios include the following categories:

  • Configuration management of key-value storage

  • Service registration and discovery

  • Message publishing and subscription

  • Load balancing

  • Distributed notification and coordination

  • Distributed lock, distributed queue

  • Cluster monitoring and leader election

If you need a distributed repository to store configuration information, with fast reads and writes, high availability, simple deployment, and an HTTP interface, then the cloud-native project etcd is a good choice.
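As a quick illustration, here is a minimal sketch of using etcd as a configuration store through the official Go clientv3 library (the endpoint, key, and value below are placeholders; a local etcd on 127.0.0.1:2379 is assumed):

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to a single local etcd node (placeholder endpoint).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Write a configuration entry, then read it back.
	if _, err := cli.Put(ctx, "/config/db_host", "192.168.3.111"); err != nil {
		panic(err)
	}
	resp, err := cli.Get(ctx, "/config/db_host")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
}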

Characteristics

  • Simple interface: use standard HTTP tools (such as curl) to read and write values (see the sketch after this list).

  • KV storage: Stores data in hierarchically organized directories, just like in a standard file system.

  • Listen for changes: Observe changes to a specific key or directory and react to changes in values.

  • Reliable: Distributed functions are implemented through the Raft protocol.

  • Security: Optional SSL client certificate authentication, optional TTL for key expiration.

  • Fast: Benchmarked at 10,000 writes/second.
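To see the "simple interface" point in practice, the following is a hedged Go sketch of reading a key over the v3 HTTP/JSON gateway (grpc-gateway), assuming etcd v3.4+ where the gateway lives at /v3/kv and keys are base64-encoded:

package main

import (
	"encoding/base64"
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Base64-encode the key, as the v3 JSON gateway requires.
	key := base64.StdEncoding.EncodeToString([]byte("/key1"))
	body := strings.NewReader(fmt.Sprintf(`{"key": %q}`, key))

	// POST a range (read) request to a local etcd gateway (placeholder endpoint).
	resp, err := http.Post("http://127.0.0.1:2379/v3/kv/range", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out)) // JSON response; kvs[].value is base64-encoded
}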

Why use etcd

Zookeeper can implement most of the functions implemented by etcd, so why use etcd? In comparison, Zookeeper has the following disadvantages:

  • Complex: Zookeeper is complicated to deploy and maintain, and administrators must master a range of knowledge and skills; its consensus algorithm (ZAB, in the Paxos family) is famously complex and hard to understand; in addition, using Zookeeper requires installing a client, and official interfaces are provided only for Java and C.

  • Written in Java: Java itself leans toward heavyweight applications and introduces a large number of dependencies, whereas operations teams generally want machine clusters to be as simple and low-maintenance as possible.

  • Slow development: The "Apache Way" peculiar to Apache Foundation projects has drawn controversy in the open-source community, partly because the foundation's large structure and loose management slow project development.

As a rising star, etcd has the following advantages compared to Zookeeper:

  • Simple: easy to write and deploy thanks to Go; easy to use with HTTP as the interface; uses the Raft algorithm to ensure strong consistency, which is easy for users to understand.

  • Data persistence: etcd defaults to persisting data as soon as it is updated.

  • Security: etcd supports SSL client security authentication.

As a young project, etcd is iterating and developing rapidly, which cuts both ways. The upside is unlimited future possibilities; the downside is that rapid version churn means its reliability has not yet been proven by long-term use in large projects. But since well-known projects such as CoreOS, Kubernetes, and Cloud Foundry all use etcd in production environments, etcd is generally worth trying.

Terminology

  • Alarm: The etcd server issues an alert when the cluster requires operator intervention to maintain reliability.

  • Authentication: Authentication and management of user access rights to etcd resources.

  • Client: The client connects to the etcd cluster to issue service requests, such as reading key-value pairs, writing data, or watching for updates.

  • Cluster: A cluster consists of several members; the node in each member follows the Raft consensus protocol to replicate logs. The cluster receives proposals from members, commits them, and applies them to the local store.

  • Compaction: Compaction discards all etcd event history and superseded keys prior to a given revision. It is used to reclaim storage space in etcd's backend database.

  • Election: As part of the consensus protocol, the etcd cluster holds elections among its members to choose a leader.

  • Endpoint: A URL pointing to an etcd service or resource.

  • Key: User-defined identifier used to store and retrieve user-defined values in etcd.

  • Key range: A set of keys: either a single key, a lexical interval (all x such that a < x <= b), or all keys greater than a given key.

  • Keyspace: The collection of all keys in the etcd cluster.

  • Lease: A short-term renewable contract; when it expires, the keys attached to it are deleted.

  • Member: A logical etcd server that participates in serving an etcd cluster.

  • Modification Revision (mod revision): The revision of the last write (modification) to a given key.

  • Peer: Peer is another member of the same cluster.

  • Proposal: A proposal is a request that needs to go through the Raft protocol (such as a write request, a configuration change request).

  • Quorum: The number of active consensus members required to modify the cluster state; etcd requires a majority, i.e. more than half of the members, to form a quorum.

  • Revision: A 64-bit cluster-wide counter that starts at 1 and increments each time the keyspace is modified.

  • Role: A unit of permissions over a set of key ranges, which can be granted to a set of users for access control.

  • Snapshot: A point-in-time backup of the etcd cluster status.

  • Store: The physical storage backing the cluster keyspace.

  • Transaction: A set of operations performed atomically. All keys modified in a transaction share the same modification revision.

  • Key Version: The number of write operations on the key since it was created, starting from 1. A key that does not exist or has been deleted has a version number of 0.

  • Watcher: The client opens a watcher to observe updates for a given key range.
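To make the watcher term concrete, here is a minimal Go clientv3 sketch that watches a key range (the endpoint and prefix are placeholders):

package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Open a watcher on a key range (here: everything under /config/).
	wch := cli.Watch(context.Background(), "/config/", clientv3.WithPrefix())
	for wresp := range wch {
		for _, ev := range wresp.Events {
			// ev.Type is PUT or DELETE; each event carries the affected key-value.
			fmt.Printf("%s %q -> %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}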

Architecture

According to the layered model, etcd can be divided into the Client layer, API network layer, Raft algorithm layer, functional logic layer, and storage layer. The functions of each layer are as follows:

  • Client layer: The Client layer includes the client libraries for the two major API versions, client v2 and v3, which provide a simple, easy-to-use API. It also supports load balancing across nodes and automatic failover, which greatly reduces the complexity of using etcd in business code and improves development efficiency and service availability.

  • API network layer: The API network layer covers both the protocol clients use to reach the server and the protocol server nodes use among themselves. On the client side, the API is split into two major versions: the v2 API uses HTTP/1.x, while the v3 API uses gRPC; v3 additionally supports HTTP/1.x through the etcd grpc-gateway component, making service calls convenient from any language. Between servers, nodes communicate over HTTP to implement data replication, Leader election, and other Raft-driven functions. In etcd v3, client-server communication uses gRPC over HTTP/2; compared with the HTTP/1.x of etcd v2, HTTP/2 is binary rather than text-based, supports multiplexing instead of ordered blocking requests, compresses data to reduce packet size, and supports server push, among other features. The HTTP/2-based gRPC protocol therefore offers low latency and high performance, effectively solving the HTTP/1.x performance problems of etcd v2.

  • Raft algorithm layer: The Raft algorithm layer implements core algorithm features such as Leader election, log replication, and ReadIndex. It is used to ensure data consistency among multiple etcd nodes and improve service availability. It is the cornerstone and highlight of etcd.

  • Functional logic layer: The layer implementing etcd's core features, such as the KVServer module, MVCC module, Auth authentication module, Lease module, and Compactor compression module. The MVCC module in turn consists of the treeIndex module (an in-memory B-tree index) and the boltdb module (an embedded key-value persistence store); see the conceptual sketch after this list. treeIndex uses a B-tree to store the mapping between user keys and version numbers: a B-tree is chosen because etcd supports range queries, for which hash tables are unsuitable, and from a performance perspective a B-tree is shallower than a binary tree and thus more efficient. boltdb is a B+ tree based key-value library that supports transactions and provides simple APIs such as Get/Put for etcd to operate on.

  • Storage layer: The storage layer includes the write-ahead log (WAL) module, snapshot (Snapshot) module, and boltdb module. Among them, WAL can ensure that data will not be lost after etcd crashes, while boltdb saves cluster metadata and data written by users.
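To visualize the treeIndex/boltdb split described above, here is a simplified conceptual sketch in Go; the field names are illustrative, not etcd's actual source:

package sketch

// revision identifies one modification, mirroring etcd's (main, sub) pair:
// main is the cluster-wide transaction counter, sub orders writes inside
// one transaction.
type revision struct {
	main int64
	sub  int64
}

// keyIndex is what treeIndex keeps in its in-memory B-tree per user key:
// only the key and its revision history. The actual values live in boltdb,
// keyed by revision.
type keyIndex struct {
	key      []byte
	modified revision   // revision of the most recent write
	revs     []revision // full history, consulted for --rev reads
}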

Principles

etcd typically serves far more reads than writes; in our actual business scenarios, reads generally account for more than 2/3 of requests.

  • Read request: The client selects an etcd node through load balancing and issues a read request. The API layer exposes the Range RPC method; the etcd server's gRPC interceptor handles the read request and dispatches it for processing.

  • Write request: The client selects an etcd node through load balancing and initiates the request. The etcd server's interceptor intercepts the gRPC write request; after validation and monitoring, KVServer submits a proposal to the Raft module whose content is the write command. The proposal is forwarded over the network, and once a majority of cluster nodes have persisted it, the state change is executed by the MVCC module, which applies the proposal content.

Read operation

When the etcd client tool (etcdctl) executes a read command, it parses the request parameters and creates a clientv3 library object, selects an etcd server node from the endpoint list using the round-robin load-balancing algorithm, and sends the request to the etcd server's KVServer API module over the HTTP/2-based gRPC protocol. An interceptor on the server performs validation and monitoring, and then the Range interface of the KVServer module is called to fetch the data. Core steps of a read operation:

  • Linearizable read (the ReadIndex module)

  • MVCC (the treeIndex and boltdb modules)

Linearizable reads are defined in contrast to serial (serializable) reads. In cluster mode there are multiple etcd nodes, and different nodes may be mutually inconsistent. A serial read returns the node's local state directly without interacting with other cluster members; it is fast and cheap, but may return stale data.

A linearizable read requires agreement among cluster members, which adds overhead and responds relatively slowly, but it guarantees data consistency. etcd's default read mode is linearizable.
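In the Go clientv3 library this trade-off is a single option: Get is linearizable by default, and WithSerializable opts into a serial read. A minimal sketch, assuming a connected cli client as in the earlier examples:

package etcdreads

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// readModes contrasts etcd's two read modes; cli is a connected client as in
// the earlier examples.
func readModes(ctx context.Context, cli *clientv3.Client) error {
	// Default linearizable read: confirmed via ReadIndex, strongly consistent.
	if _, err := cli.Get(ctx, "/key1"); err != nil {
		return err
	}
	// Serializable (serial) read: answered from the local state machine,
	// low latency but possibly stale.
	_, err := cli.Get(ctx, "/key1", clientv3.WithSerializable())
	return err
}

etcdctl exposes the same choice through its --consistency flag (l for linearizable, s for serializable).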

A query request in etcd, whether it queries a single key, a set of keys, or a count, ultimately calls the Range keys method at the bottom layer.

The process is as follows:

  • In treeIndex, use the B-tree to quickly look up the index item (keyIndex) for the requested key; the index item contains the revision.

  • Using the revision obtained, binary-search the backend's cache buffer; on a hit, return directly.

  • On a cache miss, look the data up in boltdb (via boltdb's own index) and return the key-value pair.
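From the client side, prefix scans, limits, and historical reads all map onto this same Range path. A hedged sketch (the keys and revision number are placeholders):

package etcdreads

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// rangeQueries demonstrates options that all map onto the Range RPC; cli is
// a connected client as in the earlier examples.
func rangeQueries(ctx context.Context, cli *clientv3.Client) error {
	// Prefix scan with a limit, like: etcdctl get --prefix /key --limit 2
	resp, err := cli.Get(ctx, "/key", clientv3.WithPrefix(), clientv3.WithLimit(2))
	if err != nil {
		return err
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s = %s\n", kv.Key, kv.Value)
	}

	// Historical read at revision 5 (placeholder): treeIndex resolves the
	// key's version at that revision, then boltdb is consulted for the value.
	_, err = cli.Get(ctx, "/key1", clientv3.WithRev(5))
	return err
}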

ReadTx and BatchTx are two interfaces the Backend structure uses for read and write requests; a readTx and a batchTx are created by default. readTx implements the ReadTx interface and handles read-only requests, while batchTx implements the BatchTx interface and handles read-write requests.

For the upper key-value store, the returned revision is used to fetch, from boltdb, the data corresponding to the current key's revision in the real stored data. Internally, boltdb stores data in buckets, analogous to tables in MySQL: user key data is stored in the key bucket, while etcd MVCC metadata is stored in the meta bucket.

Core module functions:

  • KVServer serial read: state-machine data is returned directly, without interacting with the cluster through the Raft protocol. It has low latency and high throughput and suits scenarios with relaxed consistency requirements.

  • KVServer linearizable read: etcd's default read mode. It is slightly worse than a serial read in latency and throughput and suits scenarios requiring strong consistency. On receiving a linearizable read request, the node first obtains the cluster's latest committed log index from the Leader.

  • When the Leader receives the ReadIndex request, to guard against abnormal scenarios such as split brain, it sends a heartbeat confirmation to the Follower nodes; only after more than half of the nodes confirm its leadership does it return the committed index to the requesting node. That node waits until its state machine's applied index is greater than or equal to the Leader's committed index, then notifies the read request that the data has caught up with the Leader and the state machine can be read.

  • The MVCC (multi-version concurrency control) module was created to solve two problems of etcd v2: it did not keep historical versions of keys and did not support multi-key transactions. etcd keeps multiple historical versions of a key as follows: every modification generates a new version number (revision); the revision serves as the key, and the value is a structure composed of the user key-value pair and related metadata.

  • treeIndex is implemented on top of a btree library and stores only the user key and its version-number information; the actual key and value data are stored in boltdb. Compared with etcd v2's fully in-memory storage, etcd v3 has much lower memory requirements.

  • Not every request must fetch data from boltdb. For consistency and performance, etcd first binary-searches an in-memory read-transaction buffer before touching boltdb; if the requested key hits the buffer, the result is returned directly.

  • Only if the buffer misses does the request really query the boltdb module for data.
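The wait-for-apply step above can be summarized in a few lines. A conceptual sketch, not etcd's actual source; etcd waits on notification channels rather than polling:

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// linearizableRead blocks until the local applied index has caught up with
// the committed index the Leader confirmed via ReadIndex, then reads from
// the local state machine.
func linearizableRead(applied *atomic.Uint64, readIndex uint64, read func() string) string {
	for applied.Load() < readIndex {
		time.Sleep(time.Millisecond) // sketch only; etcd waits on channels
	}
	return read()
}

func main() {
	var applied atomic.Uint64
	go func() { // simulate the apply loop catching up to index 5
		for i := 0; i < 5; i++ {
			time.Sleep(10 * time.Millisecond)
			applied.Add(1)
		}
	}()
	fmt.Println(linearizableRead(&applied, 5, func() string { return "value from state machine" }))
}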

Write operation

  • The client selects an etcd node through the load balancing algorithm and initiates a gRPC call.

  • etcd Server receives the client request.

  • After gRPC interception, the Quota module verifies whether the etcd db file size exceeds the quota.

  • The KVServer module hands the request to its raftNode, which is responsible for interacting with the etcd Raft module and initiating a proposal; here the proposal's command is put foo bar, i.e. update foo to bar using the put method.

  • After the proposal is forwarded, more than half of the nodes persist it successfully.

  • The MVCC module updates the state machine.
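From the client's perspective the whole pipeline above is hidden behind a single call. A hedged sketch issuing a plain put and an atomic compare-and-swap transaction (keys, values, and the revision guard are placeholders); both become proposals that traverse the Raft flow just described:

package etcdwrites

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// writeExamples issues a put and a transactional compare-and-swap; cli is a
// connected client as in the earlier examples.
func writeExamples(ctx context.Context, cli *clientv3.Client) error {
	// Plain write: becomes a "put foo bar" proposal in the Raft flow above.
	if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
		return err
	}

	// Transaction: all ops commit atomically and share one mod revision.
	// Only update foo if it has not been modified since revision 5 (placeholder).
	txnResp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("foo"), "=", 5)).
		Then(clientv3.OpPut("foo", "bar2")).
		Else(clientv3.OpGet("foo")).
		Commit()
	if err != nil {
		return err
	}
	_ = txnResp.Succeeded // true if the If-guard held
	return nil
}

Because all operations inside Then commit together, they share a single modification revision, matching the Transaction term defined earlier.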

The write operation involves the following core module functions:

Quota module

  • The client initiates a gRPC call to an etcd node. Unlike a read request, a write request must additionally pass through the db quota (Quota) module.

  • When the etcd server receives a write request such as put/txn, it first checks whether the current etcd db size plus the size of the requested key-value exceeds the quota (quota-backend-bytes). If it does, the server raises an alarm (Alarm) request of type NOSPACE, synchronizes it to the other nodes through the Raft log to announce that the db is out of space, and persists the alarm in the db.

  • A quota of 0 means using etcd's default size of 2GB, which can be tuned to the business's needs; the etcd community recommends not exceeding 8GB. A value less than 0 disables the quota feature, but this lets the db size grow out of control and degrades performance, so it is not recommended.
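On the command line the quota is set with --quota-backend-bytes; when embedding etcd in a Go program, the same knob is exposed by the embed package. A hedged sketch (the data directory is a placeholder):

package main

import (
	"log"

	"go.etcd.io/etcd/server/v3/embed"
)

func main() {
	cfg := embed.NewConfig()
	cfg.Dir = "/tmp/etcd-quota-demo"               // placeholder data dir
	cfg.QuotaBackendBytes = 8 * 1024 * 1024 * 1024 // 8GB, the community-recommended ceiling

	e, err := embed.StartEtcd(cfg)
	if err != nil {
		log.Fatal(err)
	}
	defer e.Close()

	<-e.Server.ReadyNotify() // wait until the embedded server is up
	log.Println("embedded etcd started with an 8GB backend quota")
}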

KVServer module

  • etcd implements data replication between nodes based on the Raft algorithm, so it needs to package the content of the put write request into a proposal message and submit it to the Raft module.

WAL module

  • After the Raft module receives the proposal, if the current node is a Follower it forwards the proposal to the Leader; only the Leader can process write requests. After receiving the proposal, the Leader outputs, via the Raft module, both the messages to be forwarded to the Follower nodes and the log entries to be persisted; the log entries encapsulate the proposal content.

Apply module

  • If etcd crashes while applying the content of a put proposal, then on restart and recovery the Raft log entries are parsed from the WAL, appended to the Raft log storage, and the committed log proposals are replayed to the Apply module for execution.

  • etcd is an MVCC database that generates a new version number on every update. Without idempotency protection, after an abnormal failure the same command could be executed once on some nodes and multiple times on others; the consistency of the nodes' states could then not be guaranteed and the data would become chaotic, which is a serious failure.

  • The index field of a Raft log entry is globally monotonically increasing, and each log entry index corresponds to one proposal. The index of the currently executed log entry is also recorded in the db, which lets the Apply module skip proposals that were already applied.

MVCC module

  • After the Apply module determines that the proposal has not been executed, it calls the MVCC module to execute the proposal content. MVCC consists of two main parts: the in-memory index module treeIndex, which stores the historical version-number information of keys, and the boltdb module, which persists the key-value data.
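This version bookkeeping is visible from the client. A sketch that writes the same key twice and reads back its MVCC metadata, assuming a connected cli client as in the earlier examples:

package etcdmvcc

import (
	"context"
	"fmt"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// showVersions writes a key twice and prints its MVCC metadata.
func showVersions(ctx context.Context, cli *clientv3.Client) error {
	if _, err := cli.Put(ctx, "/mvcc/demo", "v1"); err != nil {
		return err
	}
	if _, err := cli.Put(ctx, "/mvcc/demo", "v2"); err != nil {
		return err
	}
	resp, err := cli.Get(ctx, "/mvcc/demo")
	if err != nil {
		return err
	}
	kv := resp.Kvs[0]
	// CreateRevision: revision of the first write; ModRevision: revision of
	// the last write; Version: write count since creation (2 here, assuming
	// the key did not exist before).
	fmt.Println(kv.CreateRevision, kv.ModRevision, kv.Version)
	return nil
}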

Log replication

Each log entry is identified by a monotonically increasing index. The Leader maintains the log-replication progress of all Follower nodes: after appending a log entry it broadcasts it to all Followers, and each Follower, once it has processed the entry, reports its current highest replicated log index back to the Leader. The Leader then computes the largest index replicated by more than half of the nodes, marks it as the committed position, and announces it to the Followers in heartbeats. Only log entries up to the committed position are applied to the storage state machine.
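The "majority position" computation reduces to a sort over the replication progress. A conceptual sketch, not etcd's actual source:

package main

import (
	"fmt"
	"sort"
)

// commitIndex computes the largest log index replicated on a majority of
// members, given each member's highest replicated index (matchIndex).
func commitIndex(matchIndex []uint64) uint64 {
	sorted := append([]uint64(nil), matchIndex...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	// With n members sorted ascending, a majority (n/2+1) of members have
	// replicated at least sorted[(n-1)/2].
	return sorted[(len(sorted)-1)/2]
}

func main() {
	// 3-member cluster: leader at index 9, followers at 9 and 7.
	fmt.Println(commitIndex([]uint64{9, 9, 7})) // prints 9
}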

Deployment

Quick single-instance deployment

Install, run, and test a single-member etcd cluster locally. For deployment details, see the etcd single-node deployment section of the earlier article "Cloud Native API Gateway Full Life Cycle Management Apache APISIX Exploration and Practice". After the single node is deployed, verify it by reading and writing keys and viewing the version information.

Multi-instance cluster deployment

Starting an etcd cluster statically requires every member to know all the other members in advance; in practice, however, the IP addresses of cluster members may not be known ahead of time, in which case the cluster can be bootstrapped through a discovery service. In a production environment, etcd is normally deployed as a cluster for high availability of the whole system, avoiding single points of failure. There are three mechanisms for bootstrapping an etcd cluster:

  • static

  • etcd dynamic discovery

  • DNS discovery

Static

When the cluster members, their addresses, and the cluster size are known before deployment, an offline bootstrap configuration can be used by setting the initial-cluster flag. Execute the following commands on each node respectively:

etcd --name infra1 --initial-advertise-peer-urls http://192.168.3.111:2380 \
  --listen-peer-urls http://192.168.3.111:2380 \
  --listen-client-urls http://192.168.3.111:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.3.111:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra1=http://192.168.3.111:2380,infra2=http://192.168.3.112:2380,infra3=http://192.168.3.113:2380 \
  --initial-cluster-state new
  
etcd --name infra2 --initial-advertise-peer-urls http://192.168.3.112:2380 \
  --listen-peer-urls http://192.168.3.112:2380 \
  --listen-client-urls http://192.168.3.112:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.3.112:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra1=http://192.168.3.111:2380,infra2=http://192.168.3.112:2380,infra3=http://192.168.3.113:2380 \
  --initial-cluster-state new
  
etcd --name infra3 --initial-advertise-peer-urls http://192.168.3.113:2380 \
  --listen-peer-urls http://192.168.3.113:2380 \
  --listen-client-urls http://192.168.3.113:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.3.113:2379 \
  --initial-cluster-token etcd-cluster-1 \
  --initial-cluster infra1=http://192.168.3.111:2380,infra2=http://192.168.3.112:2380,infra3=http://192.168.3.113:2380 \
  --initial-cluster-state new  

etcd can also be started in the background via nohup ... &. To obtain the cluster's member information:

etcdctl --endpoints=192.168.3.111:2379 member list

etcd dynamic discovery

# Create the log directory
mkdir -p /var/log/etcd
# Create the data directories
mkdir -p /data/etcd
mkdir -p /home/commons/data/etcd

A discovery URL uniquely identifies an etcd cluster. To bootstrap a new cluster, each etcd instance shares a new discovery URL rather than reusing an existing one. If no self-hosted cluster is available for discovery, the public discovery service hosted at discovery.etcd.io can be used. To create a private discovery URL using the "new" endpoint, run:

# Generate a discovery URL via curl
curl https://discovery.etcd.io/new?size=3
https://discovery.etcd.io/d45c453e99404bcb4b0b30b0ff924200
# Use the returned URL, either as an environment variable or a flag
ETCD_DISCOVERY=https://discovery.etcd.io/d45c453e99404bcb4b0b30b0ff924200
--discovery https://discovery.etcd.io/d45c453e99404bcb4b0b30b0ff924200

Execute the following commands on each node respectively:

etcd --name myetcd1 --data-dir /home/commons/data --initial-advertise-peer-urls http://192.168.5.111:2380 \
  --listen-peer-urls http://192.168.5.111:2380 \
  --listen-client-urls http://192.168.5.111:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.5.111:2379 \
  --discovery https://discovery.etcd.io/d45c453e99404bcb4b0b30b0ff924200
etcd --name myetcd2 --data-dir /home/commons/data --initial-advertise-peer-urls http://192.168.5.112:2380 \
  --listen-peer-urls http://192.168.5.112:2380 \
  --listen-client-urls http://192.168.5.112:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.5.112:2379 \
  --discovery https://discovery.etcd.io/d45c453e99404bcb4b0b30b0ff924200
etcd --name myetcd3 --data-dir /home/commons/data --initial-advertise-peer-urls http://192.168.5.113:2380 \
  --listen-peer-urls http://192.168.5.113:2380 \
  --listen-client-urls http://192.168.5.113:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://192.168.5.113:2379 \
  --discovery https://discovery.etcd.io/d45c453e99404bcb4b0b30b0ff924200

Common commands

# Write key-value pairs
etcdctl put /key1 value1
etcdctl put /key2 value2
etcdctl put /key3 value3
# Range query (the interval is left-closed, right-open)
etcdctl get /key1 /key3
# Read a key's value in hexadecimal format
etcdctl get /key1 --hex
# Print only the value
etcdctl get /key1 --print-value-only
# Prefix match, limiting the number of returned entries
etcdctl get --prefix /key --limit 2
# Read keys in lexical order, greater than or equal to the given key
etcdctl get --from-key /key1
# Watch a key to receive change notifications
etcdctl watch /key1
# Modify the key again
etcdctl put /key1 value111
# Read at a specific revision
etcdctl get /key1 --rev=5
# Delete a key
etcdctl del /key3
# Lease: e.g. grant a lease with a 60-second TTL
etcdctl lease grant 60
lease 5ef786eee44b831d granted with TTL(60s)
# Write a key attached to the lease
etcdctl put --lease=5ef786eee44b831d /key4 value4
# Revoke the lease
etcdctl lease revoke 5ef786eee44b831d
# Create a role, grant and revoke permissions
etcdctl role add testrole
etcdctl role list
etcdctl role grant-permission testrole read /permission
etcdctl role revoke-permission testrole /permission
etcdctl role del testrole
# Create a user and grant it a role
etcdctl user add testuser
etcdctl user list
etcdctl user passwd testuser
etcdctl user get testuser
etcdctl user grant-role testuser testrole
etcdctl user del testuser
# Create a second test account
etcdctl role add testrole2
etcdctl role grant-permission testrole2 readwrite /permission
etcdctl user add testuser2
etcdctl user grant-role testuser2 testrole2

# 1. Add the root role
etcdctl role add root
# 2. Add the root user
etcdctl user add root
# 3. Grant the root role to the root user
etcdctl user grant-role root root
# 4. Enable auth
etcdctl auth enable
etcdctl put /permission all2 --user=testuser2
etcdctl get /permission --user=testuser2
etcdctl get /permission --user=testuser
etcdctl put /permission allhello --user=testuser

# Pass the password directly
etcdctl --user='testuser2' --password='123456' put /permission all2
# Enable auth against the whole cluster (two equivalent credential syntaxes)
etcdctl --endpoints http://192.168.3.111:2379,http://192.168.3.112:2379,http://192.168.3.113:2379 --user=root --password=123456 auth enable
etcdctl --endpoints http://192.168.3.111:2379,http://192.168.3.112:2379,http://192.168.3.113:2379 --user=root:123456 auth enable
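The lease commands above have direct clientv3 equivalents in Go. A hedged sketch that grants a 60-second lease, attaches a key to it, and keeps it alive (the key and value are placeholders):

package etcdlease

import (
	"context"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// leaseDemo mirrors `etcdctl lease grant 60` plus `put --lease=...`; cli is
// a connected client as in the earlier examples.
func leaseDemo(ctx context.Context, cli *clientv3.Client) error {
	// Grant a lease with a 60-second TTL.
	lresp, err := cli.Grant(ctx, 60)
	if err != nil {
		return err
	}
	// Attach a key to the lease; it is deleted when the lease expires.
	if _, err := cli.Put(ctx, "/key4", "value4", clientv3.WithLease(lresp.ID)); err != nil {
		return err
	}
	// Keep the lease alive in the background (like a session heartbeat).
	ka, err := cli.KeepAlive(context.Background(), lresp.ID)
	if err != nil {
		return err
	}
	go func() {
		for range ka { // drain keep-alive responses
		}
	}()
	return nil
}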

Article reprinted from: itxiaoshen

Original link: https://www.cnblogs.com/itxiaoshen/p/17245913.html
