Rancher's chief architect explains Fleet: How can it manage a million clusters?

About the Author

Darren Shepherd is co-founder and chief architect of Rancher Labs. Before joining Rancher, Darren was a senior chief engineer at Citrix, where he worked on CloudStack, OpenStack, and Docker, and built next-generation infrastructure orchestration technology. Before Citrix, Darren worked at GoDaddy, where he designed and led the team that implemented its public and private IaaS clouds.

 
This article is reproduced from Rancher Labs.
 
In early 2020, Rancher open sourced Fleet, a project for managing Kubernetes clusters at massive scale, which provides centralized GitOps-style management for large numbers of Kubernetes clusters. Fleet's most ambitious goal is to be able to manage 1 million clusters distributed across different geographic locations. When we designed Fleet's architecture, we wanted to use a standard Kubernetes controller architecture, which means relying on Kubernetes itself to scale to far more objects than it normally handles. In this article, I will introduce Fleet's architecture, the methods we used to test it at scale, and our findings.
 
 

Why 1 million clusters?

 
 
With the explosive growth of K3s (it currently has more than 15,000 GitHub stars), edge Kubernetes has also begun to develop rapidly. Some companies have adopted an edge deployment model in which each device is a single-node cluster, while others use 3-node clusters to provide high availability (HA). The key point is that we need to deal with a large number of small clusters, rather than one large cluster with many nodes. Nowadays, engineers everywhere are running Linux and want to use Kubernetes to manage their workloads. Although most K3s edge deployments have fewer than 10,000 nodes, reaching 1 million is not out of the question. Fleet is designed to meet your scaling requirements.
 
 

Fleet architecture

 
 

The key parts of the Fleet architecture are as follows:

 

  • Fleet uses a two-stage pull model

  • Fleet is a set of Kubernetes controllers driven by standard Kubernetes API interactions

  • The Fleet agent does not need to stay connected at all times

  • The Fleet agent is itself another set of Kubernetes controllers

 

To deploy from git, the Fleet Manager first clones and stores the contents of the git repository. The Fleet Manager then decides which clusters need to be updated with that content and creates deployment records for the agents to read. When they are able to, the agents check in, read the deployment records assigned to their cluster, deploy the new resources, and report status back.
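Below is a rough sketch of that check-in loop, written in Go under a simple polling model. The names here (DeploymentRecord, fetchDeploymentRecords, applyResources, reportStatus, the cluster name) are hypothetical placeholders for illustration, not Fleet's actual API; the point is only that the agent pulls its work on its own schedule rather than holding a persistent connection to the manager.

```go
// Illustrative sketch of the second stage of Fleet's two-stage pull:
// the agent periodically checks in, reads its deployment records,
// applies them, and reports status back to the manager.
package main

import (
	"fmt"
	"time"
)

// DeploymentRecord stands in for the record the Fleet Manager writes for a
// cluster after deciding which git content that cluster should receive.
type DeploymentRecord struct {
	Name    string
	Content string // rendered manifests from the git repo
}

func fetchDeploymentRecords(cluster string) []DeploymentRecord {
	// Hypothetical: read the records the manager created for this cluster.
	return nil
}

func applyResources(r DeploymentRecord) error {
	// Hypothetical: apply the resources to the local (downstream) cluster.
	return nil
}

func reportStatus(cluster, name string, err error) {
	// Hypothetical: write the deployment status back to the Fleet Manager.
	fmt.Println("status for", cluster, name, err)
}

func main() {
	const cluster = "edge-cluster-01" // illustrative cluster name

	// The agent does not need a persistent connection; it checks in on
	// its own schedule and pulls whatever work is waiting for it.
	for {
		for _, rec := range fetchDeploymentRecords(cluster) {
			reportStatus(cluster, rec.Name, applyResources(rec))
		}
		time.Sleep(15 * time.Second)
	}
}
```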
 
 
 

Scale testing methodology

 
 

We used two methods to simulate 1 million clusters. First, we deployed a set of large VMs (m5ad.24xlarge, 384 GiB RAM). Each VM ran 10 K3s clusters using k3d, and each of those 10 clusters ran 750 Fleet agents, each agent representing a downstream cluster. In total, each VM simulated 7,500 clusters. On average, it took about 40 minutes to deploy a VM, register all of its clusters with Fleet, and reach a steady state. Over two days, we kept launching VMs this way until we reached 100,000 clusters. In those first 100,000 clusters we found most of the scaling problems, and once they were solved, scaling became quite predictable. At this rate, simulating the remaining 900,000 clusters would have taken a very long time and considerable money.
 

Then we adopted a second method: running a simulator that performs all the API calls 1 million clusters would make, without needing downstream Kubernetes clusters or actually deploying Kubernetes resources. Instead, the simulator makes the API calls to register new clusters, discover new deployments, and report their success status. Using this method, we went from 0 to 1 million simulated clusters in a day.

 

The Fleet Manager is a controller running on a Kubernetes cluster, which in this test ran on 3 large VMs (m5ad.24xlarge, 384 GiB RAM) and one RDS instance (db.m5.24xlarge). We actually used K3s to run the Fleet Manager cluster, because Kine is already integrated into it; I will explain later what Kine is and why we used it. Although K3s targets small-scale clusters, it may be the easiest Kubernetes distribution to run at large scale, and we used it for its simplicity and ease of use. It is worth noting that we could not run Fleet at this scale on a hosted provider such as EKS; I will explain why later.
 
 

Finding 1: Adjust service account and rate limits

 
 

The first problem we encountered was completely unexpected. When a Fleet agent registers with the Fleet Manager, it uses a temporary cluster registration token. That token is then used to create a new identity and credentials for the cluster/agent. Both the cluster registration token and the agent's credentials are backed by service accounts. The rate at which we could register clusters was limited by the rate at which the controller-manager creates tokens for service accounts. After some investigation, we found that we could change the controller-manager's default settings to increase the rate at which service account tokens are created (--concurrent-serviceaccount-token-syncs=100) and the total number of API requests per second (--kube-api-qps=10000).
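To illustrate the bottleneck, here is a minimal client-go sketch (not Fleet's registration code) of creating a service-account-backed identity: the ServiceAccount itself is created instantly, but the caller then has to wait for the controller-manager's token controller to attach a token Secret, which is exactly the step that --concurrent-serviceaccount-token-syncs speeds up. The fleet-system namespace is an illustrative assumption, and the example relies on the pre-Kubernetes-1.24 behavior of auto-created token Secrets that was current when this testing was done.

```go
// Sketch: why registration throughput depends on the controller-manager.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Creating the service account for one registering cluster/agent is fast.
	sa, err := client.CoreV1().ServiceAccounts("fleet-system").Create(ctx, &corev1.ServiceAccount{
		ObjectMeta: metav1.ObjectMeta{GenerateName: "cluster-"},
	}, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}

	// Registration only completes once the token controller has attached a
	// token Secret; this wait is what limits the cluster registration rate.
	for {
		sa, err = client.CoreV1().ServiceAccounts("fleet-system").Get(ctx, sa.Name, metav1.GetOptions{})
		if err == nil && len(sa.Secrets) > 0 {
			fmt.Println("token ready:", sa.Secrets[0].Name)
			return
		}
		time.Sleep(500 * time.Millisecond)
	}
}
```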
 
 

Finding 2: etcd cannot run at this scale

 
 

Fleet is written as a Kubernetes controller. Therefore, scaling Fleet to 1 million clusters means managing tens of millions of objects in Kubernetes. As we know, etcd cannot manage that much data. etcd's key space is limited to 8GB, with a default of 2GB, and it includes both current values and previous values that have not yet been garbage collected. In Fleet, a simple cluster object takes approximately 6KB, so for 1 million clusters we need at least 6GB. But a cluster generally translates into about 10 Kubernetes objects, plus one object per deployment, so in practice we are more likely to need 10 times that amount of space for 1 million clusters.

 

To get around etcd's limitations, we used Kine, which makes it possible to run any Kubernetes distribution with a traditional RDBMS as the datastore. For this scale test, we ran an RDS db.m5.24xlarge instance. We did not try to size the database properly; we simply started with the largest m5 instance. At the end of the test, we had approximately 20 million objects in Kine. This also means that running Fleet at this scale cannot be done on a hosted provider such as EKS, because EKS is backed by etcd and will not provide sufficiently scalable data storage for our needs.

 

The test did not appear to push the database very hard. Admittedly, we used a very large database, but we clearly still had plenty of room for vertical scaling. Inserts and lookups of individual records continued at an acceptable speed. We did notice that randomly listing a large number of objects (up to 10,000) could take 30 seconds to a minute, but in general such queries completed in under 1 second, or under 5 seconds in a fairly rough test. Because these very slow queries occur during cache reloads, they have little impact on the overall system, as discussed later. Although the slow queries did not significantly affect Fleet, we still need to investigate why they happen.
 
 

Finding 3: Increase the size of the watch cache

 
 
When a controller loads its cache, it first lists all objects and then starts a watch from the resource version of that list. If there is a very high rate of change and listing the objects takes a long time, you can easily end up in a situation where the list has completed but the watch cannot be started, because that revision is no longer in the API server's watch cache or has already been compacted in etcd. As a workaround, we set the watch cache to a very high value (--default-watch-cache-size=10000000). In theory, we expected to run into Kine's compaction instead, but we did not, and that requires further investigation. Generally speaking, Kine compacts much less frequently; in this case, we suspect we were adding records faster than Kine could compact them. This is not a big problem, though: we do not expect to sustain such a high rate of change, it was only that high because we were registering clusters so quickly.
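This is the standard client-go list-then-watch pattern. The sketch below (using ConfigMaps purely as an example resource, not Fleet's own types) shows where the failure occurs: the watch is started from the ResourceVersion returned by the list, and if that revision has already fallen out of the API server's watch cache or been compacted, the watch call fails and the cache has to be rebuilt from scratch.

```go
// Sketch: list, then watch from the list's revision.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Step 1: list everything (this can take minutes for millions of objects).
	list, err := client.CoreV1().ConfigMaps("").List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}

	// Step 2: watch from the list's revision. If the cluster changed a lot
	// while the list was running, that revision may no longer be available
	// and the watch fails, forcing the whole list to be repeated.
	w, err := client.CoreV1().ConfigMaps("").Watch(ctx, metav1.ListOptions{
		ResourceVersion: list.ResourceVersion,
	})
	if err != nil {
		panic(err) // typically "too old resource version"
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		fmt.Println("event:", event.Type)
	}
}
```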
 
 

Finding 4: Slow cache loading

 
 
The standard implementation of a Kubernetes controller caches all the objects it works on in memory. For Fleet, this means we need to load millions of objects to build the cache. The default page size for listing objects is 500, so it takes 2,000 API requests to load 1 million objects. If we assume we can make a list call, process the objects, and open the next page every second, loading the cache takes about 30 minutes. Unfortunately, if any of those 2,000 API requests fails, the process starts over. We tried increasing the page size to 10,000 objects, but the overall load time did not improve significantly. Once we started listing 10,000 objects at a time, we ran into a new problem: Kine would randomly take more than 1 minute to return all the objects, and the Kubernetes API server would then cancel the request, causing the entire load operation to fail and restart. We worked around this by increasing the API request timeout (--request-timeout=30m), but that is not an acceptable solution. Keeping the page size small makes each request faster, but a larger number of requests increases the chance that one fails and forces the whole operation to restart.
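The following client-go sketch (using Secrets purely as an example resource, not Fleet's own types) shows the paginated listing described above: objects are fetched page by page via Limit and Continue, and a failure on any page loses the continue token and forces the listing to start again from the beginning.

```go
// Sketch: priming a controller cache with a paginated list.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	total := 0
	opts := metav1.ListOptions{Limit: 500} // the default page size discussed above
	for {
		page, err := client.CoreV1().Secrets("").List(ctx, opts)
		if err != nil {
			// In a controller, a failure here loses the continue token and
			// the entire listing has to start again from the first page.
			panic(err)
		}
		total += len(page.Items)

		if page.Continue == "" {
			break // last page reached
		}
		opts.Continue = page.Continue
	}
	fmt.Println("objects loaded into cache:", total)
}
```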
 

Restarting the Fleet controller would take 45 minutes, and this restart time also applies to kube-apiserver and kube-controller-manager, which means you need to be very careful. This is also where we found that running K3s is not as good as running a traditional distribution such as RKE: K3s combines the api-server and controller-manager into the same process, which makes restarting either of them slower and more error-prone than it should be. When we simulated a catastrophic failure requiring a full restart of all services, including Kubernetes, it took several hours to get back online.

 
The time required to load the cache, and the probability of failure while doing so, are the biggest problems we have found so far in scaling Fleet. Going forward, this is the primary problem we have to solve.
 
 

Conclusion

 
 

Through this testing, we proved that Fleet's architecture can scale to 1 million clusters. Just as importantly, it showed that Kubernetes can be used as a platform for managing far more data than usual. Fleet itself has no direct relationship with containers; it can be seen as a simple application that manages data in Kubernetes. These findings open up the possibility of writing code against Kubernetes as a general-purpose orchestration platform. And when you consider how easily you can bundle a set of controllers with K3s, Kubernetes becomes a good self-contained application server.

 
In terms of scaling, the time required to reload the cache is worrying, but it is definitely manageable. We will continue to improve in this area to make running 1 million clusters not only feasible, but simple. Because at Rancher Labs, we like simplicity.

Originally published at blog.51cto.com/12462495/2590611