Xu Haihong, Senior Technical Expert of Alibaba Cloud Elastic Computing: Cloud Automation O&M Maturity Model

On March 22, 2023, the launch event "The Newly Upgraded Alibaba Cloud ECS CloudOps 2.0 Is Coming" was broadcast live, accompanied by version 2.0 of the white paper on automated operation and maintenance. During the broadcast, Xu Haihong, Senior Technical Expert of Alibaba Cloud Elastic Computing, gave a talk titled "Automated Operation and Maintenance Maturity Model on the Cloud". This article is compiled from his talk.

Alongside the upgraded CloudOps (automated operation and maintenance on the cloud) suite, Alibaba Cloud has launched a supporting intelligent product, ECS Maturity Assessment and Insight (ECS Insight). Drawing on customers' resource management needs and product practice on the cloud, it analyzes and locates potential operation and maintenance risks across six dimensions: basic capabilities, cost management, automation, reliability, elasticity, and security. It then recommends corresponding solutions and best practices to help enterprise users cut costs, increase efficiency, and improve business continuity.

The elastic computing team has long aimed to make customers' operation and maintenance on the cloud more efficient through continuous experience optimization and tooling. Over the past year, drawing on customer visits and delivery practice on the cloud, we have compiled white papers, including the one on automated operation and maintenance, and built related tools.

In December 2021, the elastic computing team began to promote CloudOps, that is, automated operation and maintenance on the cloud. By then it was widely recognized that when workloads move from offline (on-premises) environments to the cloud, both the way resources are used and what customers focus on change.

DevOps on the cloud is not simply a matter of moving offline DevOps practice to the cloud. In the 2021 Puppet report, based on survey responses, 65% of mid-stage companies said they had already started using cloud resources.

However, according to the survey results, only 20% of enterprises take full advantage of the cloud's own features when running their business. In version 1.0, based on how resource delivery and operation and maintenance differ on the cloud, we proposed best practices and corresponding tools from six perspectives: reducing costs, speeding up delivery, increasing automation, improving elasticity, enhancing system reliability, and improving business security.

Over the same period, the Ops ecosystem and its trends have also changed. Many ideas that already existed early on have drawn renewed attention for various reasons.

These are enhanced versions of Ops along different dimensions, each with its own focus. Some apply to vertical business domains, and others emphasize putting operation and maintenance practices into effect.

Among them, FinOps combines Finance and DevOps. It focuses on improving the utilization and performance of cloud resources, and it requires close collaboration between business, finance, and engineering teams to make costs visible through data and thereby optimize them.

According to Flexera's 2022 assessment, about 32% of annual cloud spending is wasted on idle or underutilized resources. Over the past 12 months, the size of teams participating in FinOps has grown by 75%.

In recent years, breakthroughs in artificial intelligence and machine learning have brought AIOps back into focus. It concerns how to apply these technologies to operation and maintenance scenarios to reduce costs and increase efficiency. According to relevant assessment reports, the global AIOps market is expected to reach 11.25 billion US dollars in 2025.

In addition, DevSecOps combines Security and DevOps. It is a practice that treats security as a shared responsibility throughout the entire IT lifecycle.

Finally, MLOps applies DevOps methodology and tools to the field of machine learning (ML). According to a MarketsandMarkets report, the global MLOps market is expected to reach $490 million by 2025.

The concept of DevSecOps was first proposed by IT security experts and practitioners in 2012. In the following years, Gartner and the RSAC conferences gradually strengthened the related concepts and practices, especially the idea of shifting security left, which holds that security should be built into the entire DevOps life cycle. Today, integrating factors such as risk management and compliance governance into the DevSecOps framework has become an industry trend.

Whatever the type of Ops, it ultimately revolves around resources. Resources include infrastructure, application teams, data, business processes, and so on. Typical participants include cloud integrators, and on the cloud the cloud platform is the most important member of the integrator role. There are other roles as well: traditional development, operation and maintenance, and business operations personnel, along with experts in business domains, finance, and security.

From the cloud platform's perspective, our first task is to improve the user experience and capability richness of the basic products. This is the foundation of CloudOps, because it prevents problems at the root. Take ECS as an example. Over the past year we started from work orders, analyzed customer problems step by step, and solved them in the product itself. Judging from the March results, the number of work orders dropped considerably year-on-year, which confirms that the product's own experience is the most basic part of CloudOps.

In addition, the cloud platform abstracts away some characteristics of resources, so some Ops practices change accordingly on the cloud. We therefore need to combine customers' resource operation and maintenance requirements with the way resources are used on the cloud, and build Ops best practices on the cloud through diversified product capabilities. This is another area that requires continuous investment.

Finally, from the perspective of business roles, we have always believed that business teams, including development and operation and maintenance roles, are not only important participants in CloudOps but also the biggest contributors to its best practices. Many users have very rich resource management practices.

Over the past period, our product and R&D teams have visited many customers to understand their scenarios and existing ways of working, and these findings guide our follow-up work.

Based on this information, we divide resource management practice into three parts: problem detection, problem solving, and problem prevention. For problem detection, the question is how to establish best practice specifications and data-based diagnostic capabilities. The most critical step is to establish a best practice specification. Data-based diagnostics built on that specification can then help everyone find problems.

With specifications and diagnostic capabilities in place, problems can be solved and prevented. This leads to the white paper and the insight tool introduced next.

We gathered this information by observing industry trends, staying in continuous communication with customers, visiting them to understand their application scenarios, and building our own product capabilities. After sorting it out, we launched the CloudOps white paper.

There are two points to emphasize here:

First, the maturity model. We divide users' use of the cloud into several levels. At the start, an enterprise has just begun to use resources and to pay attention to features such as automation, elasticity, security, and compliance on the cloud. In practice, it begins to consciously try out and use the products, and it simply enables the related functions with default configurations. As usage deepens, it gradually reaches later stages such as intermediate, advanced, standardized, and intelligent.

Second, classification. We split CloudOps into several areas. The first is automation capability, which refers to how we use tools and systems to reduce or even completely replace manual operations and thereby improve operation and maintenance efficiency. Other typical categories include elasticity, reliability, security and compliance, and cost and resource quantification management.
[Figure: overall picture of Alibaba Cloud elastic computing CloudOps products]

The figure above shows the overall picture of Alibaba Cloud's elastic computing CloudOps products. The bottom layer is the basic IaaS capabilities. As mentioned earlier, this layer is the foundation of the entire CloudOps, and the elastic computing team is committed to continuously improving these basic capabilities and their experience.

Above the basic products is the CloudOps product matrix. As in the CloudOps white paper mentioned above, we divide it into five dimensions: cost management, automation services, reliability services, elastic services, and security compliance services.

Elastic services are the ones most familiar to everyone. Taking the most typical tool, elastic scaling, as an example, customers can automatically scale resources out or in according to business load. In Elastic Resource Guarantee, we provide resource usage methods for different scenarios. Customers can read examples and manage resources through capacity packages and capacity reservations.
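To make "scaling resources according to business load" concrete, here is a minimal sketch of a generic target-tracking calculation. It is not the algorithm used by Alibaba Cloud's elastic scaling service; the `ScalingPolicy` class, its field names, and the use of average CPU as the load signal are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical target-tracking policy; names are illustrative only."""
    target_cpu_percent: float  # average CPU utilization the group aims for
    min_instances: int         # never scale in below this count
    max_instances: int         # never scale out above this count


def desired_capacity(current_instances: int, avg_cpu_percent: float,
                     policy: ScalingPolicy) -> int:
    """Return the instance count that would bring average CPU near the target."""
    if current_instances <= 0:
        return policy.min_instances
    # Scale the group in proportion to observed load versus target load.
    raw = math.ceil(current_instances * avg_cpu_percent / policy.target_cpu_percent)
    # Clamp the result to the configured bounds.
    return max(policy.min_instances, min(policy.max_instances, raw))


policy = ScalingPolicy(target_cpu_percent=50.0, min_instances=2, max_instances=10)
print(desired_capacity(4, 80.0, policy))  # load above target: scale out
print(desired_capacity(4, 10.0, policy))  # load below target: scale in
```

The clamp is the part that matters most in practice: the lower bound keeps a baseline of capacity for reliability, and the upper bound caps cost when load spikes.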

[Figure: ECS maturity assessment and insight on the console]

ECS usage maturity assessment and insight is a tool available on the console. It implements the best practices and related normative standards described in the white paper.

As shown in the figure above, in the first part the tool diagnoses the current maturity of each dimension based on the resource usage of the logged-in user, for example the status of automation capabilities, basic capabilities, elastic capabilities, and security capabilities.

In the second part, you can see the score for each dimension, including items that scored and items that lost points. For example, the stability dimension currently has ten evaluation items, and a user might score on seven of them and lose points on three.

For these three lost items, we provide further detail and corresponding practical solutions so that users can improve and optimize on that basis. For example, if the system finds that the user has not used snapshots to back up data in the last seven days, the user can act on that finding. Of course, CloudOps is a continuous process. Both the white paper and the insight tool summarize best practices that we have developed with customers, and we will incorporate more new content in the future. Thank you.
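The scoring idea described above (evaluation items per dimension, each either scored or lost, with a recommendation attached to lost items) can be sketched roughly as follows. This is an illustration under stated assumptions, not how ECS Insight is implemented: the `Instance` and `CheckItem` types, the deletion protection item, and the sample data are all hypothetical. Only the seven-day snapshot rule comes from the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional, Tuple


@dataclass
class Instance:
    """Hypothetical view of one instance's state (not an ECS API type)."""
    instance_id: str
    last_snapshot_at: Optional[datetime]
    deletion_protection: bool  # illustrative attribute assumed for this sketch


@dataclass
class CheckItem:
    """One evaluation item: a rule plus the advice shown when it loses points."""
    name: str
    recommendation: str
    passes: Callable[[List[Instance], datetime], bool]


def snapshot_in_last_7_days(instances: List[Instance], now: datetime) -> bool:
    """True only if every instance has a snapshot taken within the last 7 days."""
    cutoff = now - timedelta(days=7)
    return all(
        inst.last_snapshot_at is not None and inst.last_snapshot_at >= cutoff
        for inst in instances
    )


def deletion_protection_on(instances: List[Instance], now: datetime) -> bool:
    """True only if every instance has deletion protection enabled."""
    return all(inst.deletion_protection for inst in instances)


def evaluate(items: List[CheckItem], instances: List[Instance],
             now: datetime) -> Tuple[List[CheckItem], List[CheckItem]]:
    """Split one dimension's items into (scored, lost)."""
    scored, lost = [], []
    for item in items:
        (scored if item.passes(instances, now) else lost).append(item)
    return scored, lost


ITEMS = [
    CheckItem("snapshot backup in last 7 days",
              "create a snapshot or enable an automatic snapshot policy",
              snapshot_in_last_7_days),
    CheckItem("deletion protection enabled",
              "enable deletion protection on production instances",
              deletion_protection_on),
]

now = datetime(2023, 3, 22)  # fixed date so the example is reproducible
instances = [
    Instance("i-001", datetime(2023, 3, 20), True),
    Instance("i-002", datetime(2023, 3, 1), True),  # snapshot is 21 days old
]

scored, lost = evaluate(ITEMS, instances, now)
print(f"{len(scored)} of {len(ITEMS)} items scored")
for item in lost:
    print(f"lost: {item.name} -> {item.recommendation}")
```

Keeping each rule as a small function paired with its recommendation mirrors the structure in the talk: the specification defines what is checked, and the diagnostic data decides whether an item scores or loses points.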



Origin blog.csdn.net/bjchenxu/article/details/129956668