The 5-step upgrade path for cloud native DevOps


Author |
Edited by Zhang Yu | Yachun
Source | Alibaba Cloud Native Official Account

What is cloud native DevOps

Video: https://v.qq.com/x/page/u3220cutt7v.html

Let's first use the short video above and the two pictures below to understand what cloud-native DevOps is and how it differs from traditional DevOps.

[Figure 1: a street food stall]

The picture above shows a street food stall. The chef works very hard, cutting, frying, and preparing all kinds of food, and selling it himself. From purchasing raw materials to cooking, selling, and after-sales service, everything is done by one or two people. This is a very typical DevOps scenario: the team handles everything end to end. When the chef is highly skilled and good at selling, this achieves high efficiency and low waste. The problem is that it is hard to scale, because the processes are non-standard and depend heavily on the chef's personal ability.

[Figure 2: a Nanjing food stall restaurant]

Now look at the picture of the Nanjing food stall above. Although "food stall" is in its name, it is clearly not the kind of stall described above. Walk into any Nanjing food stall and you will find that its chefs can focus on providing customers with better dishes: developing and testing new dishes, then trialing and promoting them with small batches of users. Whether the number of customers grows or shrinks, the restaurant can adapt quickly, and opening new branches is fast too. This is what we can think of as cloud-native DevOps.

So what exactly is cloud-native DevOps? In our view, cloud-native DevOps makes full use of cloud-native infrastructure; it is built on a microservice/serverless architecture and open standards, independent of language and framework, with continuous delivery and intelligent self-O&M capabilities. The result is higher service quality and lower development and operation costs than traditional DevOps, allowing R&D to focus on rapid business iteration.

[Figure 3: the two principles, two foundations, and two capabilities of cloud-native DevOps]

As shown in the figure above, cloud-native DevOps rests on two principles: compliance with open standards, and independence from language and framework; two foundations: a microservice/serverless architecture, and serverless infrastructure (BaaS/FaaS); and two capabilities: intelligent self-O&M and continuous delivery.

  • Two principles: comply with open standards and stay independent of any specific language or framework. Compared with binding to one language or framework, this gives greater flexibility, better vitality when technology is upgraded or iterated, and a healthier ecosystem.

  • Two foundations: a microservice/serverless architecture makes DevOps feasible; serverless infrastructure is resource-oriented and on-demand, providing better elasticity.

  • On top of these two principles and two foundations, two capabilities are built: continuous delivery and intelligent self-O&M.

Alibaba cloud native DevOps upgrade case

Let's first look at a cloud-native DevOps transformation by a team at Alibaba. Background: an overseas e-commerce team at Alibaba faced many challenges in overseas markets, such as a large number of sites, high site-construction costs, rapidly changing requirements, slow delivery, and high O&M costs. How could it smoothly upgrade to cloud-native DevOps, solve these problems, and improve business delivery efficiency? This is what we set out to do.

1. Architecture upgrade: sink service governance into a sidecar and mesh

[Figure 4: a rich container holding both application code and service governance code]

The first step is an architecture upgrade. We sank the service governance code out of the application and into a sidecar, and used a service mesh to carry capabilities such as environment-based routing. As shown in the figure above, each green dot represents application code and each orange dot represents service governance code; these were packaged into the container as internal (second-party) libraries. As the service governance system grew, more and more things accumulated inside the container, such as log collection, monitoring instrumentation, and O&M hooks. We call this kind of container a "rich container". The problem is obvious: even a simple upgrade or adjustment to log collection required rebuilding and redeploying the whole application, although it had nothing to do with the application itself. And because concerns were not separated, a bug in log collection could affect the application itself.

[Figure 5: application code and governance code split into separate containers, with a mesh sidecar]

To let the application focus on itself, the first thing we did was move all service governance code out of the application container into a sidecar, so that the governance code and the application code live in two separate containers. At the same time, we handed some of the original service governance tasks, such as test routing and distributed tracing, over to the mesh sidecar. The application is thus slimmed down and only needs to care about its own code.

The benefit is that the business can focus on business-related application code without depending on service governance.

This first step is smooth, because we can migrate service governance into the sidecar gradually, without bearing the cost of a one-shot migration.
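As a sketch, the separation described above can be expressed in a Kubernetes Pod spec with the application and the governance agent as separate containers in one Pod; the names, images, and ports here are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: order-app              # hypothetical application name
  labels:
    app: order
spec:
  containers:
    - name: app                # slimmed application container: business code only
      image: registry.example.com/order-app:1.0.0
      ports:
        - containerPort: 8080
    - name: mesh-sidecar       # service governance: routing, tracing, log collection
      image: registry.example.com/mesh-proxy:1.0.0
      ports:
        - containerPort: 15001
```

Because both containers live in the same Pod, the sidecar can handle the application's traffic and be upgraded independently: rolling out a new sidecar image no longer requires rebuilding the application.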

2. Architecture upgrade: from build decoupling and release decoupling to O&M decoupling

In the second step, we decoupled at three levels: build, release, and O&M.

Anyone familiar with microservice and serverless architectures knows that only when a business can be independently developed, tested, released, and operated can it move faster and better, because this minimizes coupling with others. But we also know that as services become more complex and applications keep evolving, an application accumulates more and more business code. In the application in the figure below, for instance, some code serves a specific business: in a payment application, some code serves Hema's specific needs, some serves Tmall's, and some is general (platform) code that serves all business scenarios.

[Figure 6: one application containing business-specific code and general platform code]

Obviously, from the perspective of development efficiency, it is best if each business party can change its own business code, reducing communication costs and improving R&D efficiency. But this creates a new problem: even when a change to one business does not touch the general logic, a full regression of all the businesses in the application is still required. If other businesses change during the same period, they must be integrated and released together; if many businesses change, everyone queues up for integration. The cost of integration testing, communication, and coordination becomes very high.

Our goal is for each business to be independently developed, released, and operated. To get there smoothly, the first thing to do is decouple them at build time. For example, a relatively independent business is built as a separate container image and placed, via orchestration, into an init container of the Pod; when the Pod starts, that business code is mounted into the storage space of the main application container.
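A minimal sketch of this build-time decoupling, assuming a hypothetical business plugin image and file paths: the init container copies the independently built business code into a shared volume, which the main container mounts.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-app
spec:
  volumes:
    - name: biz-code
      emptyDir: {}             # shared storage between init and main containers
  initContainers:
    - name: hema-biz           # independently built business image (hypothetical)
      image: registry.example.com/payment-hema-biz:2.3.0
      command: ["sh", "-c", "cp -r /biz/* /mnt/biz/"]
      volumeMounts:
        - name: biz-code
          mountPath: /mnt/biz
  containers:
    - name: app                # main application container loads the mounted code
      image: registry.example.com/payment-app:1.8.0
      volumeMounts:
        - name: biz-code
          mountPath: /home/admin/biz
```

The business image can now be built and versioned on its own; only the Pod orchestration ties it to the main application at startup.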

At this point, though, release and O&M of the application are still coupled, and we need to separate them as well.

We know that the affinity between application components can be roughly divided into three categories:

  • Super-close: in the same process, communicating through function calls.

  • Close: in different containers of the same Pod, communicating via IPC.

  • Loose: in the same network, communicating via RPC.

Based on each business's characteristics, we can gradually split business code out into RPC or IPC services, so that they can be released and operated independently.
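For the second category, containers in the same Kubernetes Pod share the IPC namespace (and can also talk over localhost), so a business component split out of the main process can still communicate locally while being released on its own image cadence; the names here are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payment-with-biz
spec:
  shareProcessNamespace: true   # optional: containers can also see each other's processes
  containers:
    - name: app                 # general platform code
      image: registry.example.com/payment-app:1.8.0
    - name: tmall-biz           # business code split into its own container
      image: registry.example.com/payment-tmall-biz:1.0.0
```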

With this, we completed build decoupling, release decoupling, and O&M decoupling for the application container.

3. IaC & GitOps

[Figure 7: IaC repository and GitOps flow]

For the third step, let's look at the state of development and O&M. In many R&D scenarios, a thorny problem is that different environments and businesses have many configurations of their own. During release and O&M, people often have to pick and modify the right configuration by hand, even though this configuration, like the application code, is really part of the release itself; maintaining it through a traditional console is very costly.

In a cloud-native context, we believe IaC (Infrastructure as Code) and GitOps are better choices. Besides a code repository, each application also has an IaC repository, which contains the application's image version and all related configuration. Whether a code change needs to be released or a configuration changes, everything is pushed to the IaC repository as a code push. The GitOps engine automatically detects the IaC change, translates it into configuration that conforms to the OAM specification, and applies the change to the corresponding environment based on the OAM model. Both developers and operators can learn from the IaC code history exactly what has changed in the system, and every release is complete.
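What the IaC repository holds can be sketched as a KubeVela-style OAM application manifest, which the GitOps engine watches and applies; the component and trait names here are illustrative:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: order-service
spec:
  components:
    - name: order-service
      type: webservice
      properties:
        image: registry.example.com/order-app:1.0.1   # bumping this line is a release
        port: 8080
      traits:
        - type: scaler
          properties:
            replicas: 3
```

Every change, whether an image version or a replica count, lands as a Git commit, so the repository history is a complete, auditable record of what is running in each environment.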

4. BaaSization of resources

[Figure 8: declarative resource definitions in IaC]

The last step is the BaaSization of resources.

Think about how resources are normally used in an application. We usually go to the corresponding console to submit a resource application, describing the specifications and requirements we need; after approval, we get the resource's connection string and authentication information and add them to the application's configuration. Any later change means going back to the console, coordinating the operation with a code release and its approval. And the O&M and monitoring of such resources generally happen in yet another separate console.

As the number of resource types grows, the O&M cost becomes very high, especially when building a new site.

Following the principle of describing resources declaratively and using them on demand, we define these resources in IaC to simplify resource usage for all applications. Every resource is described declaratively, enabling intelligent management and on-demand use. At the same time, all of our resources use common cloud resources and standard protocols, which greatly reduces migration costs. In this way, we gradually migrated the business teams onto cloud-native infrastructure.

Therefore, the two key points of resource BaaSization are:

  • Describe resource requirements declaratively; manage them intelligently and use them on demand.

  • Use common cloud resources and align with standard protocols.
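A declarative resource description can be sketched in the same IaC style; this "database claim" manifest is entirely hypothetical (a real setup might use Crossplane or a cloud provider's service operator) but shows the idea:

```yaml
# Hypothetical declarative resource claim; the apiVersion, kind, and
# fields are illustrative, not a real product API.
apiVersion: example.dev/v1
kind: DatabaseClaim
metadata:
  name: order-db
spec:
  engine: mysql               # describe what is needed, not how to create it
  version: "8.0"
  storageGB: 100
  writeConnectionSecretToRef:
    name: order-db-conn       # credentials delivered to the app as a Secret
```

The application only declares its requirement; provisioning, credential delivery, and later changes all flow through the same IaC review process instead of a console.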

Cloud Efficiency drives the efficient adoption of cloud-native DevOps

What we shared above is Alibaba's internal practice, which relies on Alibaba's internal R&D collaboration platform, Aone. The public cloud version of Aone is Alibaba Cloud's Cloud Efficiency (Yunxiao). So how do we implement cloud-native DevOps through Cloud Efficiency?

[Figure 9: the scope of cloud-native DevOps adoption: methods, architecture, collaboration, and engineering]

As the previous case shows, adopting cloud-native DevOps is a systematic project involving methods, architecture, collaboration, and engineering. Within it, the implementation of cloud-native DevOps falls under the category of lean delivery.

[Figure 10: Cloud Efficiency cloud-native DevOps solution diagram]

The picture above is a diagram of the Cloud Efficiency cloud-native DevOps solution.

Here, we divide users into two roles:

  • Technical lead or architect.

  • Engineers, including development, testing, O&M, and so on.

As a technical lead or architect, one needs to define and govern the enterprise's R&D behavior as a whole. Broadly, the R&D process must be operable, observable, governable, and changeable.

First, he defines the company's R&D collaboration model, such as whether to adopt agile development or lean Kanban. Second, he needs command of the overall product architecture: which cloud products to use, and how those cloud products are coordinated and managed. Third, he decides the team's R&D mode: how to collaborate, how to control quality, and so on. Fourth, he determines the release strategy: whether to use grayscale (canary) release or blue-green deployment, what the grayscale policy is, and so on. Finally, there is the service monitoring strategy: which monitoring platforms the services connect to, how service health is probed, global monitoring configuration, and so on.
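The release strategies mentioned above map onto standard Kubernetes mechanisms; for example, a zero-downtime rolling-update policy is a few lines in a Deployment (a simplified sketch with illustrative names):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-app
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2              # at most 2 extra Pods during rollout
      maxUnavailable: 0        # zero downtime: never drop below desired capacity
  selector:
    matchLabels:
      app: order
  template:
    metadata:
      labels:
        app: order
    spec:
      containers:
        - name: app
          image: registry.example.com/order-app:1.0.1
```

Grayscale/canary and blue-green strategies are usually layered on top, for example with a second Deployment plus mesh or Service routing rules.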

Front-line development, test, and O&M engineers care about the smoothness and efficiency of the workflow. Once a requirement or task arrives on the Cloud Efficiency project collaboration platform, the work can proceed through Cloud Efficiency: coding, committing, building, integrating, releasing, and testing, then deploying to the staging and production environments, with the R&D mode and release strategy configured by the administrator actually taking effect. Each stage is triggered and flows automatically, with no need for human coordination and hand-offs.

The data generated throughout the R&D process forms an organic whole that yields a wealth of insights and drives continuous improvement. When a team hits a bottleneck or gets stuck in its R&D process, it can also get professional diagnosis and R&D guidance from the Cloud Efficiency expert team.

To sum up, the Cloud Efficiency cloud-native DevOps solution is guided by the ALPD methodology, builds on expert-recommended best practices, and integrates them deeply into a complete DevOps toolchain to help enterprises move into cloud-native DevOps step by step.

Next, let's look at a concrete case.

An Internet company has an R&D team of about 30 people and no full-time O&M staff. Its products include more than 20 microservices and dozens of front-end applications (web, mini-programs, apps, etc.). Its business is growing very fast. Facing rapidly growing customers and ever-increasing requirements, the original script-based Jenkins + ECS deployment approach could no longer keep up, especially with the need for zero-downtime deployment upgrades. The team therefore turned to Cloud Efficiency for help and eventually migrated fully to Cloud Efficiency cloud-native DevOps.

This R&D team faces three major pain points:

  • A large number of customers and many urgent requirements.

  • No full-time O&M, and cloud-native technologies such as Kubernetes have a steep learning curve.

  • Complex IT infrastructure; releases are time-consuming and labor-intensive.

To address these problems, Cloud Efficiency starts from three aspects: basic capabilities, release capabilities, and O&M capabilities.

First, introduce Alibaba Cloud ACK on top of the existing ECS resources to upgrade the infrastructure, and containerize the applications. For service governance and application architecture, the full Spring Cloud stack is simplified to Spring Boot, with service discovery and governance handled by standard Kubernetes capabilities.
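Replacing the Spring Cloud registry with Kubernetes-native discovery can be as simple as a Service: consumers resolve a stable DNS name instead of querying a registry (the names here are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: order-service          # resolvable in-cluster as order-service.<namespace>.svc
spec:
  selector:
    app: order                 # routes to Pods carrying this label
  ports:
    - port: 80
      targetPort: 8080
```

A plain Spring Boot client then simply calls http://order-service/..., and Kubernetes handles registration, health checking, and load balancing.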

Second, automated container deployment is achieved through the Cloud Efficiency pipeline; a grayscale deployment strategy supports canary rollout, automatic scaling, and automatic restart on failure. The pipeline also enables zero-downtime releases and fast rollback at any time, saving machine costs while compensating for the company's lack of full-time O&M staff.

Third, a standard R&D mode built on Cloud Efficiency's automated pipeline and branch protection, including code review, code scanning, and test quality gates, improves feedback efficiency and release quality.
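Such a pipeline with quality gates can be sketched in generic CI-style YAML; this is illustrative pseudo-configuration, not Cloud Efficiency's actual syntax:

```yaml
# Illustrative pipeline definition; stage names and keys are hypothetical.
stages:
  - name: build
    steps:
      - run: mvn -B package             # compile and run unit tests
  - name: quality-gate
    steps:
      - run: code-scan --fail-on high   # static-analysis gate
      - run: mvn verify -Pintegration   # integration-test gate
  - name: deploy-canary
    steps:
      - run: kubectl apply -f deploy/canary.yaml
    approval: manual                    # human check before full rollout
  - name: deploy-production
    steps:
      - run: kubectl apply -f deploy/production.yaml
```

The point is that the gates are part of the pipeline definition itself, so no release can bypass review, scanning, or tests.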

The figure below shows the architecture of the overall solution.

[Figure 11: overall solution architecture]

Cloud native DevOps upgrade path

We divide the adoption of cloud-native DevOps into five stages.

[Figure 12: the five stages of cloud-native DevOps adoption]

Stage 1: fully manual delivery and O&M. This is the initial stage: the application architecture has not been service-oriented, no cloud infrastructure is used (or only IaaS), there is no continuous integration or test automation, and deployment, release, and O&M are all manual. Few companies, we believe, remain at this stage.

Stage 2: tool-assisted delivery and O&M. The first task is to service-orient the application architecture, using microservices to improve service quality; the second is to introduce some R&D tools, such as GitLab and Jenkins, as isolated "island" tools that each solve a particular problem. Continuous integration of single modules begins, but there are generally no automated quality gates, and releases are merely assisted by automation tools.

Stage 3: limited continuous delivery and automated O&M. Basic capabilities are further improved and the infrastructure is containerized on CaaS. A complete toolchain, for example a DevOps platform like Cloud Efficiency, is introduced to connect R&D data end to end. Continuous deployment becomes achievable, though it still needs some manual intervention. By now automated testing is mainstream, services as a whole are observable, and O&M is service-oriented and declarative.

Stage 4: continuous delivery and manually assisted self-O&M. Developers focus further on business development. The application architecture begins to adopt serverless architectures on a large scale, and unattended continuous deployment is achieved; release grayscale and rollback are automated as far as possible, with intervention only when needed. Observability is upgraded from the application level to the business level, and partial self-O&M is achieved with manual assistance.

Stage 5: full-link continuous delivery and self-O&M. This is the ultimate goal. At this stage, all applications and infrastructure adopt serverless architectures, and end-to-end unattended continuous delivery is achieved, with release rollback and grayscale also automated; the infrastructure and services are fully self-operating. Developers truly only need to care about business development and iteration.

However, the devil is in the details: there are still many problems to solve in real adoption. With a tool platform such as Cloud Efficiency and the expert guidance of ALPD, we can avoid detours and reach the goal faster.


Origin: blog.51cto.com/13778063/2595690