Alibaba Cloud Elastic Computing Senior Product Expert Ma Xiaoting: ECS Maturity Assessment and Insights

On March 22, 2023, the launch event "Newly Upgraded: Alibaba Cloud ECS CloudOps 2.0 Is Coming" was broadcast live, together with version 2.0 of the automated operation and maintenance white paper. Ma Xiaoting, Senior Product Expert at Alibaba Cloud Elastic Computing, gave a talk titled "New Product Introduction: ECS Maturity Assessment and Insight (ECS Insight)" during the broadcast. This article is compiled from her talk.

ECS Usage Maturity Assessment and Insight is referred to as ECS Insight for short. As the name implies, it analyzes and evaluates how a user uses ECS and then gives optimization suggestions based on that evaluation.

ECS Insight is based on multi-dimensional ECS usage data. It helps users analyze and locate potential operation and maintenance risks across six dimensions (basic capabilities, cost management, automation, reliability, elasticity, and security) and recommends corresponding solutions and best practices, helping enterprise users reduce costs, increase efficiency, and improve business continuity across the board.

This is a data-driven product. Its purpose is to help ECS users continuously uncover business risks on ECS, apply the best practices of enterprise cloud operation and maintenance, optimize continuously, and ultimately run a stable and sustainable business on the cloud. Because the name "ECS Usage Maturity Assessment and Insight" is relatively long, the rest of this article uses "ECS Insight".

In the CloudOps White Paper 2.0, we gave a clear definition of CloudOps: CloudOps = DevOps x Cloud. We found that 95% of enterprises have started to use DevOps for software development and delivery, but fewer than 20% really use the characteristics and advantages of the cloud itself to make their DevOps practice more efficient. For example, the cloud is naturally highly elastic and offers standardized self-service capabilities. At the same time, with the spread of concepts such as FinOps and DevSecOps, business security and cost have also become parts of DevOps adoption that cannot be ignored.

Against this background, we proposed the concept of CloudOps and its five dimensions: Cost insight (Cost), Automation capability (Automation), Reliability capability (Reliability), Elasticity capability (Elasticity), and Security capability (Security). Together, the five dimensions are referred to as CARES.

This also means that if users adopt DevOps to shorten the development cycle and improve business efficiency, while also wanting the business to run stably, securely, reliably, and at low cost, they can start from these five aspects and keep improving. This matches our own starting point: we hope users can raise their CloudOps maturity.

Next, let's look at the relationship between CloudOps and ECS Insight. The above figure shows the three parts.

The bottom layer is the basic capabilities of the IaaS layer. It includes basic capabilities on the platform side, such as the various computing forms and services such as images, and atomic capabilities on the user side, including resource group management and personalized configuration management of the guest OS. These are capabilities that every IaaS service must provide.

The middle layer is the CloudOps product capabilities provided by Alibaba Cloud. For each of the five CARES dimensions defined by CloudOps, Alibaba Cloud provides corresponding automation and self-service tools to help users keep improving their maturity in that area. The higher the maturity of each dimension, the better the business performs in that area, and the more stable, reliable, efficient, secure, and cost-effective the business is as a whole.

For example, in terms of cost management, Alibaba Cloud currently provides a variety of resource payment methods, including yearly and monthly subscription, pay-as-you-go, reserved instances, and savings plans, to meet the needs of different scenarios. For long-term, stable business, we recommend purchasing by yearly or monthly subscription to enjoy long-term discounts.

For temporary testing needs, we recommend pay-as-you-go. Although the hourly unit price of pay-as-you-go is slightly higher, it is very flexible and resources can be released at any time. If the business has temporary needs in different time periods and the overall demand is not small, we recommend purchasing savings plans to offset the charges. In this way, you keep the flexibility of creating or releasing resources whenever you need to, while the savings plan offsets charges by the hour and reduces the overall cost of use.

With so many payment methods available, which combination should we choose at each stage so that it meets the load requirements of different business scenarios, reduces the overall cost of use, and keeps cost-performance consistently high? Answering this requires continuous analysis and operation by users.
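The trade-off between a fixed subscription and pay-as-you-go comes down to simple arithmetic. The sketch below is only an illustration: the prices are made up rather than Alibaba Cloud list prices, the function names are hypothetical, and savings plans and reserved instances are left out for brevity.

```python
HOURS_PER_MONTH = 730  # approximate number of hours in a month


def pay_as_you_go_cost(hours_used: float, hourly_price: float) -> float:
    """Cost when billed only for the hours the instance actually runs."""
    return hours_used * hourly_price


def break_even_hours(monthly_price: float, hourly_price: float) -> float:
    """Hours per month above which a monthly subscription becomes cheaper."""
    return monthly_price / hourly_price


def cheaper_method(hours_used: float, monthly_price: float, hourly_price: float) -> str:
    """Return the cheaper of the two payment methods for the given usage."""
    if pay_as_you_go_cost(hours_used, hourly_price) < monthly_price:
        return "pay-as-you-go"
    return "subscription"


if __name__ == "__main__":
    # Hypothetical prices: 0.10 per hour pay-as-you-go, 50 per month subscription.
    for hours in (100, 400, HOURS_PER_MONTH):
        print(hours, cheaper_method(hours, monthly_price=50.0, hourly_price=0.10))
```

With these assumed prices, an instance that runs only a fraction of the month is cheaper on pay-as-you-go, while one that runs around the clock is cheaper on subscription. A real comparison would also need to account for discounts and commitment terms.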

So how should this be done? To address these problems, we launched a practical implementation of CloudOps: ECS usage maturity assessment and insight. Based on the user's usage data in each of the five CARES dimensions defined by CloudOps, it analyzes how the user is doing in that dimension and then puts forward corresponding optimization suggestions, helping users keep fixing their weak points and keep the business efficient, available, stable, and orderly. On the whole, ECS Insight is an implementation guide for CloudOps.

ECS Insight in detail

Next, I will introduce the product ECS Insight in detail. First, a quick overview of how ECS Insight works.

ECS Insight analyzes the usage of all ECS instances and associated resources under the user's account, including the distribution of ECS instances, the usage of snapshots, the usage data of ECS instances, cloud disks, and bandwidth in each dimension, the cost distribution of ECS, and so on. By combining this with the cloud operation and maintenance best practices that Alibaba Cloud has accumulated from serving tens of thousands of companies, it produces two results for users.

The first is the user's current maturity in each CloudOps dimension. Each dimension is scored out of 100 using a deduction system: if an item does not follow the best practices recommended on the cloud, the corresponding points are deducted. Users can view the scoring items of each dimension, the points each item is worth, and whether the points were earned. The evaluation results are updated at T+1, that is, with a one-day delay. The data sources behind this analysis are rich: they include not only ECS operation logs and cloud monitoring data but also users' resource management and control behavior, covering all key indicators of how users use ECS.
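The deduction system can be pictured with a small sketch. Everything here is a hypothetical illustration of the idea rather than ECS Insight's actual scoring rules: the item names, their point values, and the `CheckItem` structure are assumptions.

```python
from dataclasses import dataclass


@dataclass
class CheckItem:
    name: str
    points: int   # points this item is worth (hypothetical weight)
    passed: bool  # True if the account follows the recommended practice


def dimension_score(items: list[CheckItem]) -> int:
    """Start from 100 and deduct the points of every item that failed."""
    deducted = sum(item.points for item in items if not item.passed)
    return max(0, 100 - deducted)


def failed_items(items: list[CheckItem]) -> list[str]:
    """Names of the items that lost points, for follow-up optimization."""
    return [item.name for item in items if not item.passed]


if __name__ == "__main__":
    # Hypothetical check items for one dimension.
    reliability = [
        CheckItem("instances spread across availability zones", 20, False),
        CheckItem("automatic snapshot policy enabled", 30, True),
        CheckItem("monitoring alerts configured", 10, False),
    ]
    print(dimension_score(reliability), failed_items(reliability))
```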

In ECS Insight, in addition to the five CARES dimensions defined by CloudOps, we also added a dimension for the basic capabilities of ECS. We found that for enterprise users running ECS on the cloud at a certain scale, the instance specifications, availability zones, geographical distribution, and resource usage of ECS all affect the continuity of the entire business. So we added this dimension as a supplement.

The second result is that, for items that lost points, ECS Insight clearly identifies the resources at risk and provides best practice guidelines for the corresponding optimization. These best practices come from medium and large enterprises across various industries and are the product of our years of exploration and growth, so they are of great reference value.

Now that we understand how ECS Insight works, we can take a quick look at its product page. Currently, the product is still in the testing phase. Once their application is approved, users can see the ECS maturity assessment report for their current account in the ECS console.

This report can be divided into three parts, as shown in the figure above.

The first part is a radar chart on the left that shows the overall result of the ECS usage maturity assessment. It gives a comprehensive score for the user's current use of ECS across six dimensions: the basic capabilities of ECS plus the five CloudOps dimensions. You can see the total score and the score of each dimension.

The second part, displayed at the top of the page, is the score details of each dimension together with its total score, including how many scoring items the dimension contains, how many earned points, and how many did not. The final score maps only loosely to a maturity level (for example, a score of 80 or above indicates an advanced level, and a score of 79 indicates an intermediate level), but a higher score does mean that the business has fewer risks in that dimension. At present, the scoring items for each dimension are not yet complete, and there is still room for improvement in how points are distributed. We will continue to optimize them, and we welcome your feedback and suggestions.

The third part is the scoring item details at the bottom of the page. Users can always see which items earned points and which lost them. For each item that lost points, we explain why the points were lost and suggest how to optimize. For very specific scoring items, we also list detailed information on the resources at risk, including resource IDs, availability zones, IP information, and so on, so that users can quickly locate problematic resources and act in time.

Next, let's take a look at the product capabilities of each dimension to give you a more direct sense of how to improve the maturity of each one.

First look at the basic capabilities of ECS

Although the basic capabilities of ECS are not part of CloudOps maturity, they are closely related to the characteristics of the public cloud itself and directly affect the continuity of services on the cloud. So we added this dimension.

As we all know, cloud servers on public clouds are organized into instance families and instance types, such as general-purpose, compute-optimized, and memory-optimized instances. As chips, hardware, and servers evolve, the number of instance families keeps growing. Alibaba Cloud currently provides more than 300 instance types. The figure above shows the latest instance families that Alibaba Cloud provides for different scenarios, and it is updated almost every year. Some older instances, such as classic network instances, not only offer poor cost-performance but also lack support for some new features and face many restrictions. Therefore, we recommend that users follow the evolution of instance types and keep updating the specifications of their underlying resources, which both improves cost-performance and ensures business stability, killing two birds with one stone.

In addition, as the scale of resources grows, the number of people using them gradually increases, and different users have different permissions on different resources. Once resources reach a certain scale, if we do not group resources and delegate permissions by business unit, resources become slow to find, and users with excessive permissions can cause serious consequences such as misoperation.

To address these pain points, the basic capabilities dimension evaluates whether the distribution and usage of ECS and related resources are reasonable from four angles: computing, storage, network, and account management. It promptly discovers and identifies potential risks to the business in terms of performance and availability, and it provides corresponding optimization suggestions as guidelines for the continuous operation of business on the cloud.

Generally speaking, the maturity assessment of ECS basic capabilities checks whether the most basic distribution and usage of resources on the cloud are reasonable, so as to avoid the common risks of depending on a single resource.

The second part is cost insight capability

As mentioned earlier, ECS instances not only come in many specifications but also offer a rich set of payment methods, including yearly and monthly subscription, pay-as-you-go, preemptible instances, reserved instances, and savings plans. The previous slide shows the different payment methods and the business scenarios each one suits. How do you choose the most cost-effective payment method for a given business pattern? This is a real test of everyone's arithmetic.

At the same time, if an enterprise has several teams, they will often share cloud resources. If resource usage is not accurately measured and allocated to users or teams, a lot of resources are wasted, and in the end the enterprise's cloud spending far exceeds expectations. This runs counter to the original intention of adopting FinOps. On the other hand, a one-size-fits-all approach to cost control will inevitably hurt the normal development of some businesses. Accurately identifying resources according to their actual usage and optimizing them in a targeted manner is therefore essential to achieving both cost optimization and business development.

Faced with these problems, the cost insight capability provides analysis and recommendations from three aspects.

First, we help users identify idle or low-usage resources. We recommend that users use self-service capabilities on the cloud, such as elastic configuration changes and not being billed for stopped instances, to avoid obvious waste.

Secondly, we recommend that users use entitlement products such as reserved instances and savings plans to offset the charges of temporary pay-as-you-go resources, and so reduce the cost of that part of their usage.

Finally, we recommend that users use tags, financial units, budget management, and other tools for end-to-end cost management and analysis, continuously optimize spending, and ultimately put FinOps into practice.
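The first of these three aspects, spotting idle or low-usage resources, can be sketched as a simple filter over utilization data. This is a minimal illustration under stated assumptions: the `InstanceUsage` structure, the metric names, and the threshold values are all hypothetical and are not taken from ECS Insight.

```python
from dataclasses import dataclass


@dataclass
class InstanceUsage:
    instance_id: str
    avg_cpu_percent: float     # average CPU utilization over the period
    avg_memory_percent: float  # average memory utilization over the period


def find_low_usage(instances, cpu_threshold=5.0, memory_threshold=20.0):
    """Return the IDs of instances whose average CPU and memory utilization
    are both below the given thresholds (hypothetical default values)."""
    return [
        inst.instance_id
        for inst in instances
        if inst.avg_cpu_percent < cpu_threshold
        and inst.avg_memory_percent < memory_threshold
    ]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    fleet = [
        InstanceUsage("i-web-01", 45.0, 60.0),
        InstanceUsage("i-test-02", 1.2, 8.0),
        InstanceUsage("i-batch-03", 3.0, 55.0),
    ]
    print(find_low_usage(fleet))
```

Instances flagged this way are only candidates for downsizing, stopping, or release; whether to act on them still depends on the business context.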

Overall, the maturity assessment of the cost insight capability is to guide users to make better use of flexible payment methods and cost management tools on the cloud. On the basis of avoiding unnecessary cost waste, end-to-end cost management is carried out.

The third part is automation capability

Many people have long misunderstood DevOps, thinking that DevOps is simply automation. In fact, automation is only one means of practicing DevOps, although a very important one. Why is automation so important?

Because of limits in technical capability or business development stage, many companies currently fall seriously short in automation. Many rely on sheer manpower, which not only makes response cycles long but is also prone to mistakes. We have also observed that some users complete basic operation and maintenance work through scripts. However, most of these scripts are maintained by individuals on their own, so they are difficult to reuse or turn into standards.

The figure above shows the direction of evolution and the current state of the automation field. European and American companies have a higher degree of automation in IT management, mainly because of their high labor costs. Automation in enterprises in China is at a lower level, and a large number of users rely on UI consoles, terminal tools, or scripts.

Faced with these problems, the maturity assessment of automation capabilities provides analysis and recommendations from three levels.

The most basic level is completing basic resource management and control operations through the console or OpenAPI. Most users already have this capability.

The intermediate level means that users can use automation tools to put into practice the DevOps ideas of Infrastructure as Code and Operations as Code, and improve efficiency in high-frequency management scenarios such as CI/CD.

On Alibaba Cloud, users can use tools such as Resource Orchestration Service, Cloud Assistant, and Operation Orchestration Service to release and deploy applications. This involves multiple steps, such as requesting and delivering resources, packaging and distributing the application, and releasing it in grayscale.

If every step can be automated, the release cycle of the entire application can be shortened from the previous 3 to 5 days to one hour. To reach a more advanced level, you need to combine multiple automation services and tools, establish a standardized operation and maintenance process and a unified configuration management platform, and finally achieve standardized and unified operation and maintenance.

Overall, the maturity of the automation capability reflects the level of automation of the current user in ECS management and operation and maintenance. At the same time, it also provides corresponding paths and tools for users to improve their automation level. With the help of these automated tools, users can more efficiently solve the pain points of daily operation and maintenance.

The fourth part is reliability capability

When it comes to reliability, the first thing everyone thinks of is the stability of the underlying infrastructure, such as the SLA. But there is a point that everyone overlooks here: as long as the stability of the underlying infrastructure is not 100%, it is not completely reliable. Making the availability of a business depend on the stability of a single instance is very inadvisable. To solve the problem at its root, the application itself should be built to be highly available.

At the same time, within the same enterprise, different business teams have different stability requirements. For example, some big data computing clusters for offline business may require that the business not be interrupted between 12:00 p.m. and 7:00 p.m., while for some online service businesses the peak hours may be 9:00 a.m. to 10:00 p.m. The cost of coordinating multiple departments to respond to underlying changes without affecting business availability is actually very high. And once a problem occurs, automated auxiliary tools are needed to help staff troubleshoot and locate it quickly.

The figure above shows the capabilities that support ECS reliability. The reliability of ECS comes mainly from two parts: the stability of the underlying infrastructure and the stability within ECS. The stability of the infrastructure depends on the regions of the public cloud, the distribution of availability zones, and the stability of individual physical servers. Therefore, to achieve primary reliability, we need to spread services across different physical machines and different availability zones as much as possible, so as to avoid the risk of large-scale failures.
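Whether a deployment is spread out enough can be checked with a few lines of code. The sketch below is an illustration only: the mapping from instance ID to zone is made-up sample data, the zone names are placeholders, and the 60% threshold is an assumption rather than a rule taken from ECS Insight.

```python
from collections import Counter


def zone_shares(instance_zones: dict[str, str]) -> dict[str, float]:
    """Map each availability zone to the share of instances deployed in it."""
    counts = Counter(instance_zones.values())
    total = len(instance_zones)
    return {zone: count / total for zone, count in counts.items()}


def overconcentrated_zones(instance_zones: dict[str, str], max_share: float = 0.6) -> list[str]:
    """Zones that hold more than max_share of all instances (hypothetical threshold)."""
    return sorted(
        zone for zone, share in zone_shares(instance_zones).items() if share > max_share
    )


if __name__ == "__main__":
    # Made-up placement: three of four instances sit in the same zone.
    placement = {
        "i-app-01": "zone-a",
        "i-app-02": "zone-a",
        "i-app-03": "zone-a",
        "i-app-04": "zone-b",
    }
    print(zone_shares(placement), overconcentrated_zones(placement))
```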

For the stability in ECS, you need to rely on the guarantee of high-availability architecture. We need to periodically back up data and monitor instance performance fluctuations in real time. When the performance of the instance changes, we need to quickly and automatically complete the business switching to improve the high availability of the business itself and data.

Advanced reliability is inseparable from real-time monitoring in more dimensions, fault drills, fault injection, and other tools. It is more of a systems engineering effort: tools and capabilities are only auxiliary means, and what matters more is the collaboration of multiple teams.

Overall, in terms of reliability maturity, ECS Insight evaluates from four dimensions: instance stability, data reliability, performance reliability, and observability. We recommend that users first achieve primary and intermediate reliability. The current measurement of these four dimensions can basically help users achieve primary, intermediate, and some advanced reliability. As for more advanced reliability, it needs to cooperate with continuous drills to achieve.

The fifth part is elasticity capability

Elasticity is one of the most basic advantages of the cloud. Pay-as-you-go is the essence of elasticity and one of the defining characteristics of the cloud. In an offline IDC, by contrast, a temporary large-scale elastic demand not only has a long delivery cycle, but inaccurate estimates may also leave resources under-prepared, which ultimately hurts the business. For businesses with peaks and valleys, expanding capacity in advance over-allocates resources, which requires a high initial investment and wastes a lot of resources. Expanding manually is slow and may not happen in time, which damages the business and ultimately affects user experience.

Therefore, how to use the flexible and elastic capabilities of the cloud to meet business needs while avoiding waste of resources and costs is crucial. The elastic capabilities of ECS Insight provide us with guidance from the following three dimensions.

The most basic method is to purchase or release ECS instances in batches through the console or OpenAPI, which meets temporary elastic demand in a semi-manual way. For well-defined elastic requirements, we recommend using elastic scaling to scale resources out and in horizontally and automatically as the business fluctuates. This improves the availability of the business while reducing the cost of use.
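To make "scaling out and in as the business fluctuates" concrete, here is a sketch of a generic target-tracking rule. It is not the algorithm used by Alibaba Cloud's elastic scaling service; the function name, the choice of CPU as the metric, and all parameter values are assumptions for illustration.

```python
import math


def desired_capacity(
    current_instances: int,
    current_cpu_percent: float,
    target_cpu_percent: float,
    min_instances: int,
    max_instances: int,
) -> int:
    """Generic target-tracking rule: resize the group so that the average CPU
    utilization moves toward the target, within the allowed bounds."""
    if current_instances <= 0:
        return min_instances
    raw = math.ceil(current_instances * current_cpu_percent / target_cpu_percent)
    return max(min_instances, min(max_instances, raw))


if __name__ == "__main__":
    # 4 instances at 90% CPU with a 60% target: scale out.
    print(desired_capacity(4, 90, 60, min_instances=2, max_instances=10))
    # 4 instances at 15% CPU: scale in, but never below the minimum.
    print(desired_capacity(4, 15, 60, min_instances=2, max_instances=10))
```

The minimum and maximum bounds matter as much as the rule itself: the minimum protects availability during quiet periods, and the maximum caps cost during spikes.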

On this basis, if users have more complex business requirements, they can improve business elasticity, flexibility, and resilience with the help of elastic scaling lifecycle hooks, elasticity strength evaluation, and instance specification patterns, and finally achieve fully automatic, adaptive elastic resource management that ensures the continuity of online business.

How well elasticity is used is one of the most direct indicators of whether a user is using the cloud well. The maturity assessment of elasticity capability reflects how deeply users use the cloud. It can be said that users who make good use of elasticity have, to a certain extent, mastered half of the cloud.

The last part is the security capability

Security issues are difficult to prove and difficult to falsify. The effect of security protection is not easy to see directly, so many companies take their chances. But once security protection is not in place, the consequences are very serious, ranging from temporary unavailability of the business to loss of core data and huge financial losses. We have observed that many enterprise customers seriously lack security awareness, including awareness of the need to protect business-critical data, with the result that important data is deleted after an instance is attacked and cannot be retrieved.

Building security capabilities on the cloud follows a shared responsibility model, which requires cloud vendors and users to work together. Cloud vendors are responsible for the security of the underlying infrastructure, including cloud server images and the underlying software and hardware services that support cloud servers and images. This also includes the security of servers, network devices, and storage devices in every region and availability zone, as well as the security of the virtualization system. Users are responsible for the security of the operating system on the ECS cloud server, the application data in the operating system, and the application's business architecture, including environment variable configuration, software applications, data security, security compliance, and so on. If users take no security measures themselves and rely entirely on the security of the underlying infrastructure, it is equivalent to running with no protection at all.

In addition to insufficient security awareness, users also face high barriers to entry in security practices, including clearly formulating security specifications, timely scanning and discovering security issues that do not comply with security specifications, and so on. In this dimension, ECS Insight provides users with a clear improvement path from the three dimensions of access security, data security, and application security.

Access security focuses on resource access rights and access auditing issues, including setting a more secure instance login method, providing login auditing for instance access, preventing unauthorized access, and so on.

Data security is a problem faced by many users. Unlike in an offline computer room, once data on the cloud is deleted, it cannot be retrieved. Therefore, regularly backing up important data and encrypting highly sensitive data can greatly improve data security.
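A regular-backup check of this kind can be sketched as follows. This is a minimal illustration with assumptions stated up front: the input is a hypothetical mapping from disk ID to the time of its latest snapshot (or `None` if it has none), and the seven-day limit is an arbitrary example value, not an ECS Insight rule.

```python
from datetime import datetime, timedelta


def disks_missing_recent_backup(last_snapshot_times, now, max_age=timedelta(days=7)):
    """Return the IDs of disks that have no snapshot at all, or whose latest
    snapshot is older than max_age (hypothetical default of seven days)."""
    return sorted(
        disk_id
        for disk_id, taken_at in last_snapshot_times.items()
        if taken_at is None or now - taken_at > max_age
    )


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    now = datetime(2023, 3, 22)
    snapshots = {
        "d-orders": datetime(2023, 3, 21),  # backed up yesterday
        "d-logs": datetime(2023, 3, 1),     # last backup three weeks ago
        "d-scratch": None,                  # never backed up
    }
    print(disks_missing_recent_backup(snapshots, now))
```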

Application security is the ultimate goal of continuous business operation. On top of access security and data security, application security requires continuously improving the security of the application's own code, backed by protection capabilities such as WAF and anti-DDoS.

Overall, security is no small matter, and business security needs to be jointly created by cloud vendors and users. When building business security systematically, we need to comprehensively consider access security, data security, and application security.

Summary and Outlook

To sum up, ECS Insight follows directly from CloudOps. It comprehensively analyzes and evaluates the user's use of ECS across the five CARES dimensions defined by CloudOps. Combined with the best practices of cloud vendors, it identifies what can be optimized in each dimension and provides corresponding suggestions to help users keep optimizing. Currently, the capability assessment in each dimension and its accuracy are not yet perfect. Therefore, in the coming year, ECS Insight will continue to improve in two directions.

On the one hand, we will continue to optimize and improve the accuracy of CloudOps CARES' five-dimensional scoring, so that the scoring of each dimension can more accurately reflect the actual situation of users. The improvement of this capability is inseparable from the collection of more ECS indicators and usage data, as well as the trust and support of users in Alibaba Cloud.

On the other hand, we will continue to improve the self-service capabilities of CloudOps, provide users with more comprehensive, smarter, and more automated support for DevOps practices on the cloud, and help users make full use of the cloud's own advantages to deliver their business with high quality and run it safely and stably.

Click "Read the original text" at the end of the article to watch the live broadcast. Follow the Cloud Evangelist official account and reply with the keyword "CloudOps" to read and download the "CloudOps White Paper 2.0 for Automated Operation and Maintenance on the Cloud".

Origin blog.csdn.net/bjchenxu/article/details/129985894