Kubernetes performance testing practice

Overview

With the development of container technology, container services have become mainstream in the industry. The key to successfully deploying and operating containers in production, however, is container orchestration. There are various container orchestration tools on the market, such as Docker's native Swarm, Mesos, and Kubernetes. Among them, Kubernetes, originally developed by Google, has become the first choice for container orchestration thanks to the participation of major industry players and the full support of the open source community.

Simply put, Kubernetes is a container cluster management system that provides resource scheduling, deployment and operation, rolling upgrades, and scaling up and down for containerized applications. Container cluster management brings convenience to the business, but as the business keeps growing, the number of applications may grow explosively. In that case, whether Kubernetes can scale up quickly, and whether its management capability remains stable at large scale, becomes a real challenge.

Therefore, drawing on recent community research on Kubernetes performance testing and our own day-to-day Kubernetes performance testing practice, this article discusses some key points of Kubernetes performance testing.

Testing purposes

In the Kubernetes architecture, all communication, whether from external clients or between internal components of the cluster, goes through the Kubernetes apiserver, so the responsiveness of the API largely determines the performance of the cluster.

Second, external clients mainly care about how long it takes to create a container service, so pod startup time is another factor that affects cluster performance.

Currently, the performance standards commonly used in the industry are:

  • API responsiveness: 99% of API calls have a response time of less than 1s.
  • Pod startup time: 99% of pods (with images already pulled locally) start within 5s.

"Pod startup time" includes the creation of ReplicationController, RC to create pods in turn, scheduler to schedule pods, Kubernetes to set up the network for the pod, start the container, wait for the container to successfully respond to the health check, and finally wait for the container to report its status back to the API server, and finally The API server reports the status of the pod to listening clients.

In addition, network throughput and image size (when images still need to be pulled) affect the overall performance of Kubernetes.
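The two percentile targets above can be checked mechanically once latency samples have been collected. The following is a minimal sketch (the function names and the nearest-rank percentile method are our own choices, not from any Kubernetes tooling):

```python
import math

def percentile(samples, p):
    """Nearest-rank p-th percentile of a list of latency samples (seconds)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]

def meets_slo(api_latencies, pod_startup_times):
    """Check the two community targets: p99 API latency < 1s, p99 pod startup < 5s."""
    return (percentile(api_latencies, 99) < 1.0
            and percentile(pod_startup_times, 99) < 5.0)
```

In a real test the sample lists would come from apiserver metrics and pod event timestamps; here they are plain numbers so the check itself is easy to verify.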

Test points

1. Key points of the community's Kubernetes performance testing

  1. When the cluster resource utilization is X% (50%, 90%, 99%, etc. at different scales), measure the time required to create a new pod. This scenario requires pre-filling the cluster to the target utilization, then creating pods at different concurrency gradients on top of that baseline, measuring pod creation latency, and evaluating cluster performance from the results. When testing a new Kubernetes version, the test usually starts from the stable water level (nodes, pods, etc.) of the old version and increases the gradient from there.
  2. The increased container startup latency when cluster usage exceeds 90% (the system slows down abnormally) appears related to how linearly etcd performance scales and to the cost of building its internal data model. One tuning approach is to check whether a newer etcd version has fixed the problem.
  3. During testing, a threshold point of the cluster should be identified: below this point the cluster performs well, and above it performance is no longer optimal.
  4. The component load will consume the resources of the master node, and the instability and performance problems caused by resource consumption will cause the cluster to be unavailable. Therefore, you should always pay attention to the resource situation during the testing process.
  5. The format in which the client submits resource objects matters: the API server spends significant time encoding and decoding JSON objects, so the serialization format is also a potential optimization point.
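Point 1 above (creating pods at increasing concurrency gradients on top of a pre-filled cluster) can be sketched as a small load harness. The `create_pod` callable is injected and hypothetical: against a real cluster it would wrap an apiserver call, while here any stub works, which is also what makes the harness testable offline:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_gradient(create_pod, gradients=(5, 10, 20), pods_per_step=20):
    """Create pods_per_step pods at each concurrency gradient and record
    per-pod creation latency. create_pod(name) is injected so the same
    harness can drive a real apiserver or a stub during a dry run."""
    results = {}
    for concurrency in gradients:
        latencies = []

        def timed_create(name):
            start = time.perf_counter()
            create_pod(name)
            latencies.append(time.perf_counter() - start)

        names = ["load-%d-%d" % (concurrency, i) for i in range(pods_per_step)]
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(timed_create, names))
        results[concurrency] = latencies
    return results
```

Each gradient's latency list can then be fed to a percentile check to see where the cluster stops meeting its targets.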

2. Key points of the NetEase Cloud Container Service Kubernetes cluster performance test

Cluster as a whole

  1. At different water levels (0%, 50%, 90%), measure the performance of core operations such as creating, scaling up, and scaling down pods/deployments (rs and other resources) on clusters of different sizes. The cluster can be filled to the expected water level by pre-creating a batch of deployments (with the replica count set to 3 by default), i.e., laying the baseline.
  2. The effect of different water levels on system performance, identifying both the safe water level and the limit water level.
  3. Whether a container has a data disk mounted affects container creation performance. For example, mounting a data disk adds the time the kubelet spends mounting the disk, which increases pod startup time.
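The "laying the baseline" step in point 1 reduces to simple arithmetic: how many 3-replica deployments are needed to bring the cluster to a target pod water level. A minimal sketch (the function name and the per-node pod limit parameter are our own; 110 pods per node is the common kubelet default):

```python
import math

def deployments_to_fill(node_count, pods_per_node, target_percent,
                        replicas=3, existing_pods=0):
    """How many deployments to pre-create to raise the cluster to
    target_percent of its pod capacity, with `replicas` pods per
    deployment (3 by default, as in the text above)."""
    capacity = node_count * pods_per_node
    target_pods = capacity * target_percent // 100
    missing = max(0, target_pods - existing_pods)
    return math.ceil(missing / replicas)
```

For example, filling a 10-node cluster (110 pods/node) to the 50% water level requires 184 three-replica deployments.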

When testing Kubernetes cluster performance, focus on the stability of the system during long-running stress tests at different water levels and concurrency levels, including:

  • System performance and its trend over longer time horizons
  • System resource usage and its trend over longer time horizons
  • TPS, response time, and error rate of each service component
  • Internal performance data, such as the number of calls between internal modules, their latency, and their error rates
  • Resource usage of each module
  • Whether any server component's process unexpectedly exits or restarts during long runs
  • Whether there are unknown errors in the server logs
  • Whether the system logs report errors
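The "trend over longer time horizons" items are about spotting gradual degradation rather than point-in-time failures. One simple way to do that is a rolling window comparing the newer half of the samples against the older half; the class below is our own sketch, not part of any monitoring stack:

```python
from collections import deque

class TrendWindow:
    """Rolling window over periodic samples (TPS, error rate, memory, ...)
    taken during a long stress run, used to spot gradual degradation."""
    def __init__(self, size=60):
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)

    def drifting(self, threshold=0.2):
        """True if the mean of the newer half of the window differs from
        the older half by more than `threshold` (relative)."""
        n = len(self.samples)
        if n < 4:
            return False
        half = n // 2
        window = list(self.samples)
        old_mean = sum(window[:half]) / half
        new_mean = sum(window[n - half:]) / half
        if old_mean == 0:
            return new_mean != 0
        return abs(new_mean - old_mean) / abs(old_mean) > threshold
```

In practice the same idea is usually expressed as a rate-of-change alert in a monitoring system; the sketch only shows the comparison logic.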

apiserver

  1. Pay attention to API response time. A request may return as soon as the data is written to etcd; depending on the situation, also verify that the corresponding asynchronous operations actually complete.
  2. Pay attention to the performance impact of the storage device behind the apiserver's cache, e.g., the disk IO of the master node.
  3. The influence of flow control (rate limiting) on system stability and performance.
  4. Error response codes in the apiserver log.
  5. The time for the apiserver to restart and recover. Consider whether this downtime is acceptable to users, and whether requests or resource usage behave abnormally after the restart.
  6. Pay attention to the response time and resource usage of the apiserver in the stress test situation.
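Restart-and-recovery time (point 5, and the same measurement recurs for the scheduler, controller, and kubelet below) is just the time from restart until a health probe succeeds again. A minimal sketch, with the probe, clock, and sleep all injectable so the logic can run offline; against a real apiserver the probe would be an HTTP GET on /healthz:

```python
import time

def time_to_recover(healthy, timeout=300.0, interval=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll an injected health probe after a restart and return how long
    recovery took. Raises if the component never comes back."""
    start = clock()
    while clock() - start < timeout:
        if healthy():
            return clock() - start
        sleep(interval)
    raise TimeoutError("component did not recover within %.0fs" % timeout)
```

The polling interval bounds the measurement error, so for sub-second recovery targets a smaller interval is needed.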

scheduler

  1. Stress test scheduler processing capacity
  • Create a large number of pods concurrently, and measure the latency for each pod to be scheduled (from pod creation to binding to a host)
  • Keep increasing the number of newly created pods to raise the load on the scheduler
  • Track the scheduler's average latency, maximum latency, and maximum QPS (throughput) at different orders of magnitude of pods

  2. The time for the scheduler to restart and recover (from the start of the restart until the system is stable again). Consider whether this is acceptable to users, and whether requests or resource usage behave abnormally after the restart.

  3. Pay attention to error messages in the scheduler log.
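The per-pod scheduling latency in point 1 is the difference between two timestamps that can be extracted from pod events or status conditions. A small sketch over plain dicts (the field names 'created' and 'bound' are our own stand-ins for those timestamps):

```python
def scheduling_latencies(pods):
    """Per-pod scheduling latency: time from pod creation to binding to a
    host. `pods` is a list of dicts with 'created' and 'bound' timestamps
    (seconds); 'bound' is None for pods still pending."""
    return [p["bound"] - p["created"]
            for p in pods if p.get("bound") is not None]

def summarize(latencies):
    """Average and maximum scheduling latency, the figures asked for above."""
    return {"avg": sum(latencies) / len(latencies), "max": max(latencies)}
```

Pending pods are excluded from the latency figures but their count is itself a useful signal of scheduler saturation.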

controller

  1. Stress testing the deployment controller's processing capability
  • Create a large number of RCs concurrently (Deployments since 1.3; single replica), and measure the time for each deployment to be observed by the controller and for the corresponding rs to be created
  • Observe the time taken by the rs controller to create the corresponding pods
  • Measure the time to scale up and scale down (including scaling to 0 replicas)
  • Keep increasing the number of new deployments, and measure the controller's average latency, maximum latency, maximum QPS (throughput), and load when processing deployments at different orders of magnitude

  2. The time for the controller to restart and recover (from the restart until the system recovers). Consider whether this is acceptable to users, and whether requests or resource usage behave abnormally after the restart.

  3. Pay attention to error messages in the controller log.

kubelet

  1. The impact of node heartbeat on system performance.
  2. The time for the kubelet to restart and recover (from the start of the restart until the system is stable again). Consider whether this is acceptable to users, and whether requests or resource usage behave abnormally after the restart.
  3. Watch for error messages in the kubelet log.

etcd

  1. Pay attention to etcd's write performance
  • Maximum write concurrency
  • Write performance bottlenecks, mainly the cost of the periodic persistent snapshot operations

   2. The impact of etcd's storage device on performance, e.g., the disk IO of the volume etcd writes to.

   3. The impact of the number of watchers on the performance of the Kubernetes system.
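Measuring maximum write concurrency (point 1) amounts to driving writes at a fixed concurrency and recording the achieved throughput. A minimal sketch with an injected `put` callable: against a real cluster it could wrap an etcd v3 client's put call, while here it is a stand-in so the harness logic itself can be verified:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def write_qps(put, keys, concurrency=8, clock=time.perf_counter):
    """Drive `put(key)` at a fixed concurrency and report the achieved
    write QPS. Sweep `concurrency` upward until QPS stops improving to
    find the write bottleneck."""
    start = clock()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(put, keys))
    elapsed = clock() - start
    return len(keys) / elapsed if elapsed > 0 else float("inf")
```

For serious etcd benchmarking the project's own `benchmark` tool is the better choice; the sketch only illustrates the shape of the measurement.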

 

Author: Zhang Wenjuan, NetEase Test Engineer
