What are the principles for building a Kubernetes Operator?

The automation of data services on Kubernetes (K8s for short) is becoming more and more popular, and running stateful workloads on K8s generally means using Operators. At the same time, K8s development and operations have grown complicated, and application patterns and extension mechanisms like the Operator are attracting more and more developers and operators.
But engineers often struggle with the complexity of writing K8s Operators, and that complexity reaches end users. According to the "2021 K8s Data Report", the quality of K8s Operators is holding companies back from expanding their use of K8s further.

Anynines CEO Julian Fischer has been building automation tools for nearly a decade, and he understands the complexities of dealing with state management on cloud-native platforms and distributed infrastructure such as K8s.

Julian first shared the method he follows when building an Operator. He calls it the operating model, and it is divided into four levels:

Level 1: Understand what the sysop or DBA does

Level 2: Containerization, YAML + kubectl

Level 3: Writing Operators

Level 4: Operator Lifecycle Management

Through his talk you can learn about the common pitfalls in data service automation, how to avoid them, and how to write a better K8s Operator from both a technical and a methodological perspective.

Data Service Automation

The talk jumps a bit between the general topic of data service automation and K8s specifically. Generally speaking, when you talk about data service automation, the first thing to do is clarify the scope, that is, what you really mean by data service automation. For us there is a standing mission: fully automate the entire lifecycle of various data services so that they run at scale, on cloud-native platforms, across infrastructures.
This isn't a marketing gimmick; it's an example of how to do scope analysis for data service automation. For example, if you want to automate multiple data services, you want to see some synergy, such as being able to bring data services not built with the Operator SDK into the same automation framework. The context of the task therefore has a big impact.

Consider a simple K8s cluster where a small team runs its applications against, say, a Postgres database; Postgres has always been my favorite example. In the simplest setup, one K8s cluster has one Operator and one service instance, and the application connects to that database. That is a different story from the one we want to tell today: Postgres databases represented as StatefulSets, provisioned on demand, where the Operator lets you create many instances. Things get complicated because you have more and more data service instances to deal with, and the challenge grows further if you then introduce more data services, for example adding RabbitMQ, MongoDB, or any other database to the Operator's portfolio.

Now consider the organizations we work with, which sometimes have hundreds or thousands of employees, thousands or even ten thousand developers, an incredible number of engineers, and many K8s clusters at the same time. In our experience, dozens or hundreds of K8s clusters are a real test. In virtual-machine-based data service automation, for example, such organizations typically run thousands of virtual machines hosting thousands of service instances, depending on how those are clustered; you can assume a service instance corresponds to roughly three pods forming one small cluster. At that scale, automation requirements change frequently, and scale has a great impact on the design.

If you solve a simple task like making and distributing "sausage", you can imagine how the technology stack has to adapt once you want to serve users at a larger scale. Data service automation is much the same. So if we think about those large scenarios with many service instances, each data service instance matters to some user, and the automation therefore has to meet a certain standard. If it doesn't reach that level, the automation won't be used, and neither organizational nor technical adoption will happen.

Data Services on K8s

So how do you build data services on K8s? First, how do you implement an Operator? The easiest way, as the community knows, is to use CRDs (custom resource definitions) to teach the K8s API new data structures. For example, to describe your Postgres instances you define a PostgresInstance resource (plural, because we provision on demand), and a controller is responsible for managing the instances. The controller converts the object you specify into a running system according to its spec. So what the Operator basically does is convert the specification of a primary object, say a PostgresInstance pinned to Postgres 12.2, into secondary resources. As far as I know, the Operator SDK is the mainstream tool here: it generates CRDs and provides template code for controllers. Those are the two things that come to mind when discussing K8s data service automation; KUDO also exists.
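As a rough sketch of what that primary object can look like in code (the PostgresInstance type and its fields are hypothetical, not taken from the talk), this is the Operator SDK / kubebuilder style of defining Go types from which the CRD is then generated:

```go
// A hypothetical primary resource for on-demand Postgres instances, defined
// in the style the Operator SDK / kubebuilder generates. The CRD itself is
// produced from these Go types.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// PostgresInstanceSpec is the desired state the user declares.
type PostgresInstanceSpec struct {
	// Version of Postgres to run, e.g. "12.2".
	Version string `json:"version"`
	// Replicas is the number of pods in the instance (1 = single node).
	Replicas int32 `json:"replicas,omitempty"`
}

// PostgresInstanceStatus is the observed state the controller reports back.
type PostgresInstanceStatus struct {
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// PostgresInstance is the primary object the Operator reconciles.
type PostgresInstance struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   PostgresInstanceSpec   `json:"spec,omitempty"`
	Status PostgresInstanceStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// PostgresInstanceList is the plural list form served by the API.
type PostgresInstanceList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []PostgresInstance `json:"items"`
}
```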
In the development phase, one of the challenges when writing an Operator is how to approach the work systematically. There is a simple model, which we call the operating model and which is presented here for the first time; it is divided into four levels and helps you work through data service automation.

The model gives you some structure and keeps your focus on the task. Suppose we want to automate Postgres. At the first level, the first thing you need to know is what the sysadmin or DBA does. In particular, how does that affect application developers? What exactly do they want?

For example, what does the average application developer expect from Postgres? Do they need clustered instances with automatic failover? If so, do they prefer synchronous or asynchronous replication? Which failover and cluster manager do you want to use, and how should monitoring be wired up, say with Prometheus?

Basically, you have to figure out the configuration files and do the basic setup for Postgres; that is the first level of the operating model. Just assume you have a virtual machine and can do whatever you want: install packages, configure the database, and so on. Once you've done that, you know what the configuration files should look like, and everything the Operator will later have to do.

Then you can think about containerization: take an existing container image, or build your own, and assemble it into K8s specifications for a StatefulSet and a Service with your own templates. That is the YAML part of the second level. At the end of the second level, whichever image you chose, you have K8s specifications you can use with kubectl to create your own service instances manually.

Once you can do that, you can essentially create your Postgres instance by hand, say with three replicas and synchronous streaming replication. Knowing how to do it manually, you can much more easily implement the Operator at the third level: write the code that creates the headless Service, the specific StatefulSet, and the Secrets holding the credentials, as in the sketch below.
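Here is a minimal sketch, with illustrative names and labels, of the secondary resources such an Operator would assemble once you know what the manual YAML looked like:

```go
// Sketch: the secondary resources a Postgres Operator might derive from one
// PostgresInstance. Names, labels, and the image tag are illustrative only.
package controller

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func desiredResources(name string, replicas int32) (*corev1.Service, *appsv1.StatefulSet) {
	labels := map[string]string{"app": name}

	// Headless Service: gives each pod a stable DNS name, which the
	// replication setup relies on.
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: corev1.ServiceSpec{
			ClusterIP: corev1.ClusterIPNone,
			Selector:  labels,
			Ports:     []corev1.ServicePort{{Name: "postgres", Port: 5432}},
		},
	}

	// StatefulSet: stable pod identity; per-replica volume claims would be
	// added via VolumeClaimTemplates in a fuller version.
	sts := &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: name},
		Spec: appsv1.StatefulSetSpec{
			ServiceName: name,
			Replicas:    &replicas,
			Selector:    &metav1.LabelSelector{MatchLabels: labels},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{Labels: labels},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{{
						Name:  "postgres",
						Image: "postgres:12.2", // the image chosen or built at level 2
					}},
				},
			},
		},
	}
	return svc, sts
}
```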

Now remind yourself that we're talking about an environment that might contain thousands of data service instances and multiple data services across many K8s clusters. In that case we also have to accept that Operator lifecycle management is itself an important part of the toolchain: we need automation to manage the lifecycle of the Operator, too.

Whether that is the Operator Lifecycle Manager or some other technology is irrelevant at this point; what matters is recognizing that it is part of your overall data service automation challenge. Now, back to the K8s Operator and the custom resource definition mentioned earlier: a CRD is a YAML structure that describes a new data type which can be passed to the K8s API; the API server then serves endpoints for it and stores each specification persistently in etcd. The formatting on the slide isn't great, but it shows what a custom resource for such a resource definition looks like; with it, we teach K8s how to create such objects.
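Since the slide doesn't reproduce here, the following sketch shows the programmatic equivalent of applying such a custom resource, using the hypothetical types from above (the module path is invented):

```go
// Sketch: creating one PostgresInstance, the programmatic equivalent of
// `kubectl apply`-ing a YAML manifest of this kind.
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "example.com/postgres-operator/api/v1alpha1" // illustrative path
)

func createInstance(ctx context.Context, c client.Client) error {
	pg := &v1alpha1.PostgresInstance{
		ObjectMeta: metav1.ObjectMeta{Name: "orders-db", Namespace: "team-a"},
		Spec: v1alpha1.PostgresInstanceSpec{
			Version:  "12.2",
			Replicas: 3,
		},
	}
	// The API server validates the object against the CRD's schema and
	// stores the spec persistently in etcd; the controller takes it from there.
	return c.Create(ctx, pg)
}
```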

However, your CRD alone doesn't do anything, because you need a controller: code that observes events, such as the creation of an object. The controller can then check whether that particular service instance already exists and work out the secondary resources that are needed, for example a Secret with the access credentials and the StatefulSet to be created. So, as I said before, a K8s controller basically converts primary resources into combinations of secondary resources. In our examples so far these resources have been internal to K8s, but that doesn't have to be the case; we'll come back to that later.
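A skeleton of such a controller in controller-runtime style might look like the sketch below; it reuses the hypothetical desiredResources helper from earlier and leaves out status updates and finer error handling:

```go
// Sketch of a reconcile loop: observe the primary resource, then create any
// missing secondary resources. Simplified on purpose.
package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	v1alpha1 "example.com/postgres-operator/api/v1alpha1" // illustrative path
)

type PostgresInstanceReconciler struct {
	client.Client
}

func (r *PostgresInstanceReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// 1. Fetch the primary resource whose event triggered this call.
	var pg v1alpha1.PostgresInstance
	if err := r.Get(ctx, req.NamespacedName, &pg); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// 2. Does the secondary StatefulSet already exist?
	var sts appsv1.StatefulSet
	err := r.Get(ctx, types.NamespacedName{Namespace: pg.Namespace, Name: pg.Name}, &sts)
	if errors.IsNotFound(err) {
		// 3. No: derive it from the primary object's spec and create it.
		//    The headless Service and the access Secret would be
		//    reconciled the same way; omitted for brevity.
		_, desired := desiredResources(pg.Name, pg.Spec.Replicas)
		desired.Namespace = pg.Namespace
		return ctrl.Result{}, r.Create(ctx, desired)
	}
	return ctrl.Result{}, err
}
```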

If you want to start writing Operators, the Operator SDK makes recommendations on Operator maturity, grading Operators into five capability levels (from Basic Install through Seamless Upgrades, Full Lifecycle, and Deep Insights up to Auto Pilot). I'm really not sure everyone understands the differences between these levels, but if you're starting now, they're definitely a good starting point and teach you to ask the right questions, which are also in the documentation. If you do build Operators, you need some core functionality, such as applying updates and patches, plus backup and restore. These are usually must-haves: without them users may reject the solution, or have no recourse when something goes wrong. You'll have to build them sooner or later, so doing it early helps. And remember that there are many common traps, bugs rooted in the usual problems of programming distributed systems, and success depends on how many of them you rule out.
Take, for example, the problems enterprises run into with Git. Overall, in my experience, the biggest problem in data service automation is most likely that people underestimate the complexity and effort required to automate data services. That shows up as insufficient coverage of basic lifecycle operations and as neglected quality attributes such as resilience and observability. You therefore need to understand the acceptance threshold: what must the automation do to be accepted by its target audience? That depends a lot on the audience itself, but I can share a few things I've learned that matter to our large clients; I can't cover everything, since time is short.

Accepting configuration updates is important, because application developers consume databases through the automation, and if an application has special requirements you need to be able to adjust the database configuration slightly. This is a real need, so interview your target audience and find out whether the configuration options they need are exposed and documented by the automation; you have to be good at adapting the automation to specific needs. If you know there are many developers in the organization, the usual cloud-native expectations apply as well, such as good observability and transparent use of the infrastructure; with K8s you get that to some extent. But consider backups: when you need to store a backup somewhere, you usually write it to object storage, and this is where people tend to hard-code the assumption that an S3 API exists. Instead, you should choose some abstraction that hides the underlying object store.
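One way to avoid baking in that S3 assumption is a narrow interface of your own, as in this sketch with invented names:

```go
// Sketch: a small interface that hides which object store backups land in,
// so the backup code carries no S3-only assumption. Names are invented.
package backup

import (
	"context"
	"io"
)

// ObjectStore is the only surface the backup code sees.
type ObjectStore interface {
	Put(ctx context.Context, key string, body io.Reader) error
	Get(ctx context.Context, key string) (io.ReadCloser, error)
}

// UploadBackup streams a finished backup to whatever store was injected:
// an S3 bucket, Azure Blob, GCS, or an in-cluster MinIO, all behind ObjectStore.
func UploadBackup(ctx context.Context, store ObjectStore, instance string, dump io.Reader) error {
	return store.Put(ctx, "backups/"+instance+"/latest.tar.gz", dump)
}
```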

Horizontal scalability of service instances

For example, a service instance might be a single pod running a standalone Postgres, or a clustered Postgres with asynchronous streaming replication. Once you scale horizontally, from one replica to three, you introduce a lot of complexity into the automation; Postgres is not trivial to automate, which is exactly why people like to use it as an example. You then need a cluster manager for failure detection, plus leader election and primary-promotion logic, to help you out.

Also, if your data centers have multiple availability zones, distribute your pods across them so that no single K8s node, or zone, becomes a single point of failure; wherever a K8s cluster spans availability zones, you should use them, practically without exception (a sketch of this follows below). Bear in mind that a StatefulSet's pods get rescheduled many times over a lifecycle: planned maintenance, failovers, upgrades, or vertical scaling that makes pods bigger and forces data to move.
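A minimal sketch of that zone spreading, using the standard pod anti-affinity mechanism (the label set is illustrative and would come from the pod template built earlier):

```go
// Sketch: spread a StatefulSet's pods across availability zones so that no
// two replicas share a zone. This would be set on the pod template's
// Affinity field in the StatefulSet built earlier.
package controller

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func zoneAntiAffinity(labels map[string]string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{{
				LabelSelector: &metav1.LabelSelector{MatchLabels: labels},
				// Use "kubernetes.io/hostname" instead to merely keep
				// replicas off the same node.
				TopologyKey: "topology.kubernetes.io/zone",
			}},
		},
	}
}
```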

We'll keep coming back to backup and recovery, which is obviously very important, because application developers shouldn't have to wait for a platform operator's manual intervention, usually the last resort, to restore an application. It's all about on-demand self-service: application developers can create service instances, then modify and reconfigure them, and if a service instance misbehaves or data gets accidentally deleted, they need to be able to restore the data as the application requires, preventing lasting data loss.

A less obvious requirement is offering more than just the latest version of a data service. Assuming the latest Postgres is good, active users will naturally want it. But in an organization some applications are in long-term maintenance and won't adopt a new version immediately, so application developers need to be able to choose the data service version, and the Operator must manage, by version number, which versions the automation enables or disables. You must have a policy for this: offering too many versions creates a lot of support work for the team, though good documentation can reduce that support burden.

Security is also very important, usually meaning encryption at rest and in transit. Data on disk should be encrypted so that it cannot be read and misused, and the same goes for data sent from a client to a data service instance; the ports a StatefulSet exposes should all be encrypted.

Origin blog.csdn.net/java_cjkl/article/details/130521351