[Repost] Getting Started with K8s from Scratch | etcd Performance Optimization Practices


https://www.kubernetes.org.cn/6295.html

 

Author | Chen Xingyu (Yumu), Technical Expert, Alibaba Cloud Basic Technology Middle Platform

This article is compiled from Lecture 17 of the "CNCF x Alibaba Cloud Native Technology Open Course".

Introduction: etcd is the component that container cloud platforms use to store key metadata. Alibaba has been using etcd for three years; it once again played a key role during this year's Double 11 and withstood the pressure of that event. Starting from the performance background of etcd, the author walks us through the performance optimizations on the etcd server side and the best practices for using the etcd client, hoping to help everyone run a stable and efficient etcd cluster.

1. A Brief Introduction to etcd

etcd was born at CoreOS, is written in Go, and is a distributed key/value storage engine. We can use etcd as the metadata store of a distributed system, keeping the system's important metadata in it. etcd is also widely used by major companies.

The following figure shows the basic architecture of etcd:

[Figure: basic architecture of an etcd cluster]

As shown above, the cluster has three nodes: one leader and two followers. The nodes replicate data to each other through the Raft protocol, and each node persists data in boltdb. If a node goes down, the remaining nodes automatically elect a new leader, keeping the cluster highly available. A client can complete its requests by connecting to any node.
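To make the client's view of this architecture concrete, below is a minimal Go sketch using the official clientv3 package (the import path assumes the v3.5 client module; older releases use go.etcd.io/etcd/clientv3). The endpoint addresses and key are placeholders, and error handling is reduced to panics for brevity.

    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        // Connect to any member of the cluster; writes are replicated via Raft.
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"}, // placeholders
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()

        // Put goes through the leader and is committed by the Raft quorum.
        if _, err := cli.Put(ctx, "/demo/key", "hello"); err != nil {
            panic(err)
        }

        // Get can be issued against whichever member the client is connected to.
        resp, err := cli.Get(ctx, "/demo/key")
        if err != nil {
            panic(err)
        }
        for _, kv := range resp.Kvs {
            fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
        }
    }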

2. Understanding etcd Performance

First, let's look at a diagram:

[Figure: simplified architecture of a standard etcd cluster and its layers]

The figure above is a simplified diagram of a standard etcd cluster architecture. The cluster can be divided into several core parts, for example the Raft layer (blue) and the Storage layer (red). The Storage layer is further divided into the treeIndex layer and the underlying boltdb persistent key/value storage layer. Each of these layers can introduce performance loss.

Let's look at the Raft layer first. Raft needs to synchronize data over the network, so the RTT and bandwidth of the network I/O between nodes affect etcd's performance. In addition, WAL writes are limited by disk I/O write speed.

Looking at the Storage layer, disk I/O fdatasync latency affects etcd's performance, and blocking on the index layer's lock does as well. In addition, contention on the boltdb transaction (Tx) lock and the performance of boltdb itself also have a large impact on etcd's performance.
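Since WAL fsync latency is such a common bottleneck, it is worth measuring the disk that will host etcd. The snippet below is only a rough, hand-rolled probe (a dedicated tool such as fio gives more rigorous numbers): it appends small blocks to a file and times each Sync call. The file path and block size are arbitrary choices for illustration.

    package main

    import (
        "fmt"
        "os"
        "time"
    )

    func main() {
        // Put the test file on the same disk that will hold the etcd WAL (placeholder path).
        f, err := os.OpenFile("/var/lib/etcd-disk-test.bin", os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o600)
        if err != nil {
            panic(err)
        }
        defer os.Remove(f.Name())
        defer f.Close()

        block := make([]byte, 8*1024) // 8 KB, roughly a small batch of WAL entries
        var worst time.Duration
        for i := 0; i < 500; i++ {
            if _, err := f.Write(block); err != nil {
                panic(err)
            }
            start := time.Now()
            if err := f.Sync(); err != nil { // flush to stable storage, like the WAL's fdatasync
                panic(err)
            }
            if d := time.Since(start); d > worst {
                worst = d
            }
        }
        fmt.Println("worst observed fsync latency:", worst)
    }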

From other perspectives, the kernel parameters of the host running etcd and the latency of the gRPC API layer also affect etcd's performance.

3. etcd Performance Optimization: Server Side

The following is a detailed look at performance optimization on the etcd server side.

etcd server performance optimization: hardware and deployment

On the hardware side, the server needs enough CPU and memory to run etcd. Second, as a database that relies heavily on disk I/O, etcd needs SSDs with very good I/O latency and throughput. etcd is a distributed key/value store, so network conditions also matter to it. Finally, deploy etcd as independently as possible so that other programs on the host do not interfere with its performance.

See also: etcd's official recommended hardware configuration requirements.

etcd server performance optimization: software

The etcd software is divided into many layers; below is a brief introduction to the optimizations layer by layer. Readers who want to dig deeper can look at the corresponding GitHub PRs for the specific code changes.

  • The first is an optimization of etcd's in-memory index layer: the use of the internal lock is optimized to reduce waiting time. The original implementation traversed the BTree while holding an internal lock with fairly coarse granularity, which significantly hurt etcd's performance. The new optimization narrows the scope of this lock and reduces latency.


  • Next is an optimization for large-scale lease usage: the algorithm for revoking and expiring leases was optimized, replacing the original traversal of the lease list, with O(n) time complexity, by an O(log n) approach, which solves the problems of using leases at large scale (a simplified sketch of the idea follows this list).


  • Finally, an optimization of how the boltdb backend is used: the backend batch size limit and batch interval were made configurable, so they can be tuned for different hardware and workloads. These parameters used to be fixed, conservative values.
  • Another item is the fully concurrent reads feature contributed by Google engineers: the use of the boltdb Tx read/write lock was optimized to improve read performance.
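As a rough illustration of the lease optimization in the list above, here is a simplified Go sketch (not etcd's actual code): instead of scanning every lease on each expiry check, leases are kept in a min-heap ordered by expiry time, so a check only pops the leases that are already due, at O(log n) per lease.

    package main

    import (
        "container/heap"
        "fmt"
        "time"
    )

    // leaseItem is a simplified stand-in for an etcd lease.
    type leaseItem struct {
        id     int64
        expiry time.Time
    }

    // leaseHeap is a min-heap ordered by expiry time.
    type leaseHeap []leaseItem

    func (h leaseHeap) Len() int            { return len(h) }
    func (h leaseHeap) Less(i, j int) bool  { return h[i].expiry.Before(h[j].expiry) }
    func (h leaseHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
    func (h *leaseHeap) Push(x interface{}) { *h = append(*h, x.(leaseItem)) }
    func (h *leaseHeap) Pop() interface{} {
        old := *h
        n := len(old)
        item := old[n-1]
        *h = old[:n-1]
        return item
    }

    // expired pops only the leases whose expiry has passed, instead of
    // scanning the whole lease list on every check.
    func expired(h *leaseHeap, now time.Time) []leaseItem {
        var out []leaseItem
        for h.Len() > 0 && !(*h)[0].expiry.After(now) {
            out = append(out, heap.Pop(h).(leaseItem))
        }
        return out
    }

    func main() {
        h := &leaseHeap{}
        heap.Init(h)
        now := time.Now()
        heap.Push(h, leaseItem{id: 1, expiry: now.Add(-time.Second)}) // already expired
        heap.Push(h, leaseItem{id: 2, expiry: now.Add(time.Minute)})
        fmt.Println("expired leases:", expired(h, now))
    }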

A new segregated-hashmap-based algorithm for allocating and recycling the freelist of etcd's internal storage

There are many other performance optimizations as well. Here we focus on one contributed by Alibaba, which greatly improves the performance of etcd's internal storage. Its name: a new segregated-hashmap-based algorithm for allocating and recycling the freelist of etcd's internal storage.

[Figure: single-node etcd architecture with boltdb as the persistent store]

The figure above shows the architecture of a single etcd node. Internally, boltdb is used as the persistent store for all keys/values, so boltdb's performance is critical to etcd's performance. Inside Alibaba, we use etcd extensively for storing metadata, and during that use we discovered performance problems in boltdb, which we share here.

[Figure: boltdb page allocation and the freelist, showing in-use and idle pages]

The figure above illustrates the core algorithm etcd's internal storage uses to allocate and recycle pages. Some background first: etcd stores data in pages with a default size of 4 KB. In the figure, the numbers are page IDs, red pages are in use, and white pages are idle.

When a user deletes data, etcd does not immediately return the storage space to the system; instead it keeps the pages internally and maintains a pool of them to speed up later allocations. This page pool is called the freelist. In the figure, page IDs 43, 45, 46, 50, and 53 are in use, while page IDs 42, 44, 47, 48, 49, 51, and 52 are idle.

When newly stored data needs an allocation of 3 consecutive pages, the old algorithm scans the freelist from the head and eventually returns a starting page ID of 47. You can see that this plain linear scan of the internal freelist degrades rapidly when the amount of data is large or internal fragmentation is severe.
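To make the old behavior concrete, here is a simplified Go sketch of such a linear scan (illustration only, not boltdb's actual freelist code), using the idle page IDs from the example figure:

    package main

    import "fmt"

    // allocateLinear scans a sorted list of free page IDs for n consecutive pages
    // and returns the starting ID, or 0 if none exists: O(len(free)) per allocation.
    func allocateLinear(free []uint64, n int) uint64 {
        if n <= 0 || len(free) == 0 {
            return 0
        }
        if n == 1 {
            return free[0]
        }
        run := 1
        for i := 1; i < len(free); i++ {
            if free[i] == free[i-1]+1 {
                run++
            } else {
                run = 1
            }
            if run == n {
                return free[i-n+1]
            }
        }
        return 0
    }

    func main() {
        free := []uint64{42, 44, 47, 48, 49, 51, 52} // idle pages from the example figure
        fmt.Println(allocateLinear(free, 3))         // prints 47
    }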

To address this problem, we designed and implemented a new freelist allocation and recycling algorithm based on a segregated hashmap. The algorithm uses the size of a run of consecutive pages as the hashmap key, and the value is the set of starting page IDs of runs of that size. When new pages are needed, an O(1) lookup in this hashmap is enough to obtain the starting page ID.

Looking at the example above again, when 3 consecutive pages are needed, the starting page ID 47 can be found immediately by querying this hashmap.

We also use the hashmap to optimize releasing pages. For example, when page IDs 45 and 46 in the figure above are released, they can be merged with their neighbors both backward and forward into one large run of consecutive pages: a run starting at page ID 44 with a size of 6.
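The following simplified Go sketch shows the idea behind the segregated hashmap (the real etcd/boltdb implementation keeps more bookkeeping): free runs are indexed by their size, so allocation is an O(1) lookup, and a freed run is merged with its neighbors before being put back. The page IDs reproduce the example figure.

    package main

    import "fmt"

    // freelist indexes runs of consecutive free pages by their length, so an
    // allocation of a given size is a hashmap lookup instead of a linear scan.
    type freelist struct {
        bySize   map[int]map[uint64]bool // run size -> set of starting page IDs
        sizeAt   map[uint64]int          // starting page ID -> run size
        startFor map[uint64]uint64       // ending page ID -> starting page ID
    }

    func newFreelist() *freelist {
        return &freelist{
            bySize:   map[int]map[uint64]bool{},
            sizeAt:   map[uint64]int{},
            startFor: map[uint64]uint64{},
        }
    }

    func (f *freelist) insert(start uint64, size int) {
        if f.bySize[size] == nil {
            f.bySize[size] = map[uint64]bool{}
        }
        f.bySize[size][start] = true
        f.sizeAt[start] = size
        f.startFor[start+uint64(size)-1] = start
    }

    func (f *freelist) remove(start uint64, size int) {
        delete(f.bySize[size], start)
        delete(f.sizeAt, start)
        delete(f.startFor, start+uint64(size)-1)
    }

    // allocate returns the starting ID of a free run of exactly n pages in O(1),
    // or 0 if none is cached (the real algorithm can also split larger runs).
    func (f *freelist) allocate(n int) uint64 {
        for start := range f.bySize[n] {
            f.remove(start, n)
            return start
        }
        return 0
    }

    // free returns a run to the pool, merging it with adjacent free runs.
    func (f *freelist) free(start uint64, size int) {
        // Merge backward: a run ending right before `start`.
        if prev, ok := f.startFor[start-1]; ok {
            psize := f.sizeAt[prev]
            f.remove(prev, psize)
            start, size = prev, size+psize
        }
        // Merge forward: a run starting right after this one.
        if next := start + uint64(size); f.sizeAt[next] > 0 {
            nsize := f.sizeAt[next]
            f.remove(next, nsize)
            size += nsize
        }
        f.insert(start, size)
    }

    func main() {
        f := newFreelist()
        // Idle pages from the example figure: 42, 44, the run 47-49, and the run 51-52.
        f.insert(42, 1)
        f.insert(44, 1)
        f.insert(47, 3)
        f.insert(51, 2)

        // Releasing pages 45 and 46 merges backward with 44 and forward with 47-49,
        // producing one run that starts at 44 with size 6.
        f.free(45, 2)

        fmt.Println(f.allocate(6)) // O(1) lookup, prints 44
        fmt.Println(f.allocate(3)) // no exact 3-page run cached any more, prints 0
    }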

In summary: the new algorithm reduces the time complexity of allocation from O(n) to O(1) and of recycling from O(nlogn) to O(1), so etcd's internal storage no longer limits its read/write performance. In real-world scenarios, performance improves by dozens of times, and the recommended storage size for a single cluster can grow from 2GB to 100GB. The optimization is in use inside Alibaba and has been contributed to the open source community.

One more note: the software optimizations mentioned here will be released in new versions of etcd, so keep an eye out for them.

4. etcd Performance Optimization: Client Side

Next, let's look at best practices for using the etcd client efficiently.

First, let's review the APIs that the etcd server provides to clients: Put, Get, Watch, Transactions, Leases, and many other operations.

[Figure: overview of the etcd client APIs (Put, Get, Watch, Transactions, Leases)]
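As a quick illustration of two of these APIs, here is a minimal Go sketch with the clientv3 package showing a prefix Watch and a compare-and-set style Transaction; the endpoint and key names are placeholders.

    package main

    import (
        "context"
        "fmt"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        // Watch: change events under a prefix are delivered on a channel.
        go func() {
            for wresp := range cli.Watch(context.Background(), "/demo/", clientv3.WithPrefix()) {
                for _, ev := range wresp.Events {
                    fmt.Printf("event: %s %s -> %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
                }
            }
        }()

        ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
        defer cancel()

        // Transaction: update the key only if it still holds the expected value.
        if _, err := cli.Txn(ctx).
            If(clientv3.Compare(clientv3.Value("/demo/cfg"), "=", "v1")).
            Then(clientv3.OpPut("/demo/cfg", "v2")).
            Else(clientv3.OpPut("/demo/cfg", "v1")).
            Commit(); err != nil {
            panic(err)
        }

        time.Sleep(time.Second) // give the watcher a moment to print the event
    }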

For the client operations above, we have summarized several best practices:

  1. For Put operations, avoid large values; trim them as much as possible (for example, CRDs in K8s);
  2. etcd itself is suited to storing infrequently changing key/value metadata, so the client should avoid creating keys/values that change frequently. For example, the new node heartbeat mechanism in K8s follows this practice;
  3. Finally, avoid creating a large number of leases, and reuse them wherever possible. For example, in K8s event data management, events whose TTLs expire at similar times share a lease instead of each creating a new one. A sketch of this pattern follows this list.
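The lease-reuse practice from the list above can look like the following simplified Go sketch: one lease is granted and shared by several keys with the same TTL, rather than granting a lease per key. The endpoint, TTL, and key names are placeholders.

    package main

    import (
        "context"
        "time"

        clientv3 "go.etcd.io/etcd/client/v3"
    )

    func main() {
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
        defer cancel()

        // Grant one lease and reuse it for every key that shares the same TTL,
        // instead of granting a separate lease per key.
        lease, err := cli.Grant(ctx, 600) // 600-second TTL
        if err != nil {
            panic(err)
        }

        events := []string{"/events/pod-a", "/events/pod-b", "/events/pod-c"}
        for _, key := range events {
            // Keep the value small; large values put pressure on the Raft log and boltdb.
            if _, err := cli.Put(ctx, key, "ok", clientv3.WithLease(lease.ID)); err != nil {
                panic(err)
            }
        }

        // Optionally refresh the shared lease instead of re-granting a new one.
        if _, err := cli.KeepAliveOnce(context.Background(), lease.ID); err != nil {
            panic(err)
        }
    }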

Finally, remember one thing: following these client best practices will help your etcd cluster run stably and efficiently.

Summary of this section

That brings this section to a close. To summarize:

  • First, we reviewed the performance background of etcd and, from the underlying principles, identified where performance bottlenecks can arise;
  • Then we analyzed the server-side performance optimizations of etcd, covering hardware, deployment, and the core internal software algorithms;
  • Finally, we looked at best practices for the etcd client.

Finally, I hope that after reading this article you will have gained something that helps you run a stable and efficient etcd cluster.


Source: www.cnblogs.com/jinanxiaolaohu/p/12504054.html