Huawei Cloud's cloud-native FinOps solution unlocks the full value of cloud native

Huawei Cloud's cloud-native FinOps solution helps users manage cloud spending with precision: through visual cost insights and cost optimization, it improves resource utilization per unit of cost and supports cost-reduction and efficiency goals.

Enterprise cloud adoption today: the move to the cloud keeps deepening, but cloud spending is significantly wasted

According to Flexera's 2024 survey, more than 70% of enterprises now use cloud services heavily, up from 65% the year before. More and more enterprises are deploying their services on the cloud, and as they consume the services cloud vendors provide, they pay for them. Surveys show that, on average, about 30% of cloud spending is considered wasted. How to save cloud costs has therefore become a top concern for companies using the cloud in recent years.

Enterprise cloud-native adoption is gradually deepening, but cost management still faces challenges

Cloud-native technology has become the mainstream path for many enterprises' digital transformation. The resource sharing, resource isolation, and elastic scheduling capabilities provided by Kubernetes can help enterprises improve resource utilization and reduce IT costs. However, the CNCF's 2021 "FinOps for Kubernetes" survey shows that after migrating to Kubernetes, 68% of respondents said their compute resource costs increased, and 36% said costs rose by more than 20%. The reasons behind this are worth examining.

Challenges faced by cost management in the cloud native era

There are four contradictions in cost management in the cloud native era:

  1. Business unit vs. billing unit: The billing cycle of cloud services (such as ECS) is relatively long, often monthly or yearly, while the life cycle of cloud-native containers is short. Containers scale elastically and restart after failures frequently, which can leave a relatively high proportion of resources idle.
  2. Capacity planning vs. resource supply: Capacity planning is generally static, prepared in advance according to budget or plan, while resource supply is driven by the business. Scenarios such as sudden traffic peaks and rapid scale-out pose great challenges to capacity planning.
  3. Unified governance vs. multi-cloud deployment: Many enterprises now use more than one cloud, and different cloud vendors expose different billing interfaces and formats, which makes unified multi-cloud cost management difficult.
  4. Cost model vs. cloud-native architecture: Cloud vendors' cost models are relatively simple and generally bill by physical resources; for example, ECS is billed at the price of the whole machine. Cloud-native architecture is application-centric, with resources requested at the granularity of CPU and memory. This makes cost visualization and cost analysis in cloud-native scenarios more difficult.

To sum up, cloud native cost governance faces three major challenges:

Cost insight: How can cost visualization be achieved in cloud-native scenarios, and how can cost issues and resource waste be located quickly?

Cost optimization: There are many ways to optimize cloud-native costs. How can the appropriate methods be chosen to maximize the benefit?

Cost operation: How can companies build a sustainable cost governance system and culture?

Huawei Cloud's cloud-native FinOps solution

FinOps is a discipline that combines financial management principles with cloud engineering and operations to give organizations a better understanding of their cloud spend. It also helps them make informed decisions about how to allocate and manage cloud costs. The goal of FinOps is not to save money, but to maximize revenue or business value through the cloud. It helps organizations control cloud spending while maintaining the levels of performance, reliability and security required to support their business operations.

The FinOps Foundation defines three phases for FinOps: inform, optimize, and operate. Because teams and business units progress at different rates, a company may be in multiple phases at the same time.

Inform (cost insight): Inform is the first phase of the FinOps framework. This phase is designed to give all stakeholders the information they need to make informed, cost-effective decisions about cloud usage.

Cost optimization: The focus of cost optimization is to find ways to save costs. Where can your organization right-size resources based on current usage and benefit from discounts?

Cost Operations: Cost operations is the last stage of the FinOps framework. During this phase, the organization continuously evaluates performance against business goals and then looks for ways to improve FinOps practices. With optimization in place, organizations can leverage automation to enforce policies and control costs by continuously adjusting cloud resources without impacting performance.

Huawei Cloud's cloud-native FinOps solution refers to industry FinOps standards and best practices to provide users with multi-dimensional visualization of cloud-native costs and multiple cost optimization management methods to help customers maximize revenue or business value.

Cloud Native FinOps - Cost Insights

Huawei Cloud’s cloud-native FinOps cost insights provide the following key features:

1. Tag-based resource cost attribution

Supports associating cluster tags with ECS, EVS, and other resources, making it easy to aggregate and calculate cluster costs.

2. Accurate cost calculation based on CBC bills

Calculates cost allocation from real CBC bills and divides department costs accurately.

3. Flexible cost allocation strategy

Supports cost visualization and cost allocation policies across multiple dimensions, such as cluster, namespace, node pool, application, and custom dimensions.

4. Support long-term cost data storage and retrieval

Supports cost analysis for up to 2 years, and supports monthly, quarterly, and annual reports and exports.

5. Fast workload discovery for highly elastic scenarios

For highly elastic application scenarios, minute-level workload discovery and billing are supported, so that no cost is missed.

How cloud-native cost insight is implemented:

1. Cluster physical resource cost VS cluster logical resource cost

The cost of a cluster can be calculated from two perspectives:

  • Cluster physical resource cost: includes resource costs directly or indirectly associated with the cluster, such as cluster management fees, ECS costs, and EVS costs. The physical resource cost of a cluster can be seen directly in the cloud bill.
  • Cluster logical resource cost: from the perspective of Kubernetes resources, the cluster cost consists of the workload costs plus the cost of idle cluster resources and shared overhead.

It is not difficult to see that the cost of cluster physical resources = the cost of cluster logical resources.
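Written out as a single identity, the two perspectives reconcile as:

cluster physical resource cost = cluster logical resource cost = Σ(workload costs) + idle resource cost + shared overhead cost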

2. Unit resource (CPU/memory, etc.) cost calculation

When the physical resource cost of the cluster is known, the key to cloud-native FinOps cost insight is deriving the logical resource cost (for example, at the Pod or workload level). The core problem is calculating the unit resource cost. Cloud virtual machines are generally sold at a whole-machine price, not per unit of CPU or memory, whereas container workloads request resources in units of CPU and memory. Therefore, the cost per unit resource must be derived before the cost occupied by a container workload can be calculated.

Generally, cloud vendors have an estimate of the unit price of CPU and memory; alternatively, the unit resource cost can be derived from the cost ratio between CPU and memory.
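As an illustration, here is a minimal sketch (not Huawei Cloud's actual pricing logic) of deriving unit CPU and memory prices from a whole-machine price, assuming a fixed CPU-to-memory cost ratio; the node size, price, and ratio below are hypothetical.

```python
# A minimal sketch of splitting a VM's whole-machine price into per-vCPU and
# per-GiB unit prices, given an assumed CPU-to-memory cost ratio.

def unit_resource_prices(machine_price_per_hour: float,
                         vcpus: int,
                         mem_gib: float,
                         cpu_mem_cost_ratio: float = 3.0):
    """Split the machine price into (per-vCPU-hour, per-GiB-hour) prices.

    cpu_mem_cost_ratio is an assumption: one vCPU is treated as costing
    cpu_mem_cost_ratio times as much as one GiB of memory.
    """
    # Express the machine in "memory-equivalent units":
    # each vCPU counts as cpu_mem_cost_ratio GiB-equivalents.
    total_units = vcpus * cpu_mem_cost_ratio + mem_gib
    price_per_gib_hour = machine_price_per_hour / total_units
    price_per_vcpu_hour = price_per_gib_hour * cpu_mem_cost_ratio
    return price_per_vcpu_hour, price_per_gib_hour


# Hypothetical 8 vCPU / 16 GiB node billed at 1.2 currency units per hour.
cpu_price, mem_price = unit_resource_prices(1.2, 8, 16)
print(f"per vCPU-hour: {cpu_price:.4f}, per GiB-hour: {mem_price:.4f}")
```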

3. Cloud native resource cost calculation

From the figure below, we can see that a Pod's resource usage fluctuates over time: at some points it is below the resource request (Request), and at others it exceeds the Request. When calculating Pod cost, we periodically sample the Pod's actual usage and its Request, and use the larger of the two in the cost calculation. The reason is that once a Request is assigned to a Pod, Kubernetes reserves that resource and it cannot be preempted by other Pods, so every Pod must pay for its requested portion. Likewise, if actual usage exceeds the Request, the Pod must also pay for the excess.

Based on the above principles, we can calculate the cost of a Pod:
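A simplified sketch of this calculation, assuming per-unit prices such as those derived above and a fixed sampling interval (both assumptions for illustration, not CCE's exact implementation):

```python
# A minimal sketch of the Pod cost calculation described above: at each
# sampling point, bill the larger of the actual usage and the request.
# p_cpu and p_mem are assumed per-vCPU-hour and per-GiB-hour unit prices;
# the samples below are hypothetical.

SAMPLE_INTERVAL_HOURS = 1 / 60  # one usage sample per minute

def pod_cost(samples, cpu_request, mem_request, p_cpu, p_mem):
    """samples: iterable of (cpu_usage_vcpu, mem_usage_gib) tuples."""
    cost = 0.0
    for cpu_usage, mem_usage in samples:
        billable_cpu = max(cpu_usage, cpu_request)  # the Request is always paid for
        billable_mem = max(mem_usage, mem_request)  # usage above the Request is also paid
        cost += (billable_cpu * p_cpu + billable_mem * p_mem) * SAMPLE_INTERVAL_HOURS
    return cost


# Hypothetical Pod: 0.5 vCPU / 1 GiB requested, two one-minute samples.
print(pod_cost([(0.3, 0.8), (0.7, 1.2)], 0.5, 1.0, p_cpu=0.05, p_mem=0.01))
```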

By accumulating the costs of all Pods in a namespace, we obtain the cost at the namespace dimension:
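In the same simplified model, aggregation to the namespace level is just a sum over the namespace's Pods:

```python
# Namespace cost in the simplified model: the sum of the costs of its Pods.
# Higher-level views (node pool, department, cluster) aggregate further and
# additionally apportion idle and system overhead costs.
def namespace_cost(pod_costs):
    """pod_costs: iterable of per-Pod costs within one namespace."""
    return sum(pod_costs)
```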

Based on the above calculation logic, the cloud-native cost management feature of Huawei Cloud CCE enables cluster cost visualization in multiple dimensions, such as:

Cluster cost visualization

Namespace cost visualization

Node pool cost visualization

Workload cost visualization

4. Department cost allocation and cost analysis reports

Many companies allocate a cluster to different departments at the namespace granularity. How, then, can each department's costs be analyzed visually?

As can be seen from the above figure, a department's cost includes not only the cost of the namespaces it owns, but also a share of the shared costs, which include system namespace costs and idle resource costs.
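As a sketch of one possible allocation policy (an assumption for illustration, not necessarily the policy CCE applies), shared costs can be apportioned to departments in proportion to the cost of the namespaces each department owns:

```python
# A minimal sketch of department cost allocation: shared cost (system
# namespaces + idle resources) is split in proportion to each department's
# own namespace costs.

def allocate_department_costs(namespace_costs, dept_namespaces, shared_cost):
    """namespace_costs: {namespace: cost}; dept_namespaces: {dept: [namespaces]}."""
    total_ns_cost = sum(namespace_costs.values())
    dept_costs = {}
    for dept, namespaces in dept_namespaces.items():
        own_cost = sum(namespace_costs[ns] for ns in namespaces)
        share = own_cost / total_ns_cost if total_ns_cost else 0.0
        dept_costs[dept] = own_cost + shared_cost * share
    return dept_costs


# Hypothetical example: two departments sharing 100 units of shared cost.
print(allocate_department_costs(
    {"payments": 300.0, "search": 100.0},
    {"dept-a": ["payments"], "dept-b": ["search"]},
    shared_cost=100.0,
))
```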

Huawei Cloud CCE cloud native cost management supports department-based cost allocation policy configuration, as shown in the following figure:

At the same time, based on the department's cost allocation strategy, Huawei Cloud CCE cloud native cost management provides monthly/quarterly/annual reporting functions, supporting report query and export for up to 2 years.

Cloud Native FinOps - Cost Optimization

How to improve resource utilization in cloud native scenarios?

According to Gartner statistics, average enterprise CPU utilization is below 15%. There are many reasons for low resource utilization. Typical scenarios include:

Unreasonable resource allocation: Some users do not understand their services' resource usage and request resources blindly, usually over-requesting.

Business peaks and troughs: Microservices show obvious daily peak-and-trough patterns. To ensure service performance and stability, users request resources for the peaks.

Resource fragmentation: Different business departments maintain independent resource pools that cannot be shared, which easily leads to resource fragmentation.

Containerization can improve resource utilization to a certain extent, but some problems cannot be solved effectively by containerization alone:

Over-requesting of resources: Without an effective resource recommendation and monitoring mechanism, the common practice is to over-request and pad estimates layer by layer, resulting in resource waste.

Unified resource pooling: The native Kubernetes scheduler lacks higher-order scheduling capabilities such as job groups and queues, making it difficult to bring big data storage and computing workloads into the pool and benefit from container elasticity.

Application performance: Simply increasing deployment density cannot guarantee service quality.

To improve cluster resource utilization, CCE's cloud-native FinOps solution provides a variety of optimization methods, such as intelligent application resource specification recommendation, cloud-native co-location, and dynamic overselling.

5. Intelligent application resource specification recommendation

To ensure application performance and reliability, and lacking adequate visualization tools, users tend to request excessive resources for applications. To solve this problem, CCE cloud-native cost management provides an intelligent application resource specification recommendation feature: based on the application's historical profiling data and machine learning algorithms, it recommends optimal resource request values for the application.
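CCE's feature is driven by historical profiling data and machine-learning models; as a rough illustration of the idea only, a much simpler stand-in could recommend a request from a high percentile of recent usage plus some headroom:

```python
# A simplified stand-in for request recommendation (not CCE's algorithm):
# take a high percentile of historical usage and add a safety headroom.

import math

def recommend_request(usage_samples, percentile=0.95, headroom=1.15):
    """usage_samples: historical CPU (or memory) usage values for a container."""
    ordered = sorted(usage_samples)
    idx = min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1)
    return ordered[idx] * headroom


# Hypothetical usage: recommend a CPU request from a week of minute-level samples.
cpu_samples = [0.21, 0.18, 0.35, 0.40, 0.22, 0.95, 0.30]
print(f"recommended CPU request: {recommend_request(cpu_samples):.2f} vCPU")
```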

6. Huawei Cloud native co-location solution

Huawei Cloud CCE's cloud-native co-location solution is built on the Volcano scheduler, supports one-click deployment, and provides capabilities such as mixed deployment of high- and low-priority container services, dynamic overselling, and service QoS guarantees. Key capabilities include:

  • Container business priority and resource isolation
  • Fusion scheduling:
    1. Application SLO awareness: intelligent co-scheduling of multiple service types, application topology awareness, time-sharing multiplexing, overselling, etc.
    2. Resource-aware scheduling: CPU NUMA topology awareness, IO awareness, network-aware scheduling, and software-hardware collaboration to improve application performance.
    3. Cluster resource planning: rich policies such as queues, fairness, priority, reservation, and preemption to serve high- and low-priority workloads uniformly.
  • Node QoS management: multi-dimensional resource isolation, interference detection, and eviction mechanisms.

The following focuses on the dynamic overselling feature: how to reuse idle node resources and improve resource utilization.

The core principle of dynamic overselling is to treat the difference between the resources requested on a node and the resources actually used as schedulable capacity that the scheduler can re-allocate, but only to low-priority tasks.
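A minimal sketch of that idea (illustrative only, not the actual Volcano/CCE implementation): the reclaimable gap on a node, minus a safety buffer, is what gets advertised as oversold capacity for low-priority jobs.

```python
# A minimal sketch of dynamic overselling: the gap between what Pods on a
# node have requested and what they actually use is re-advertised as
# schedulable capacity for low-priority jobs only.

def oversold_capacity(node_requested_cpu, node_actual_cpu_usage,
                      safety_margin=0.9):
    """Return CPU that can be lent to low-priority jobs on this node."""
    reclaimable = max(node_requested_cpu - node_actual_cpu_usage, 0.0)
    return reclaimable * safety_margin  # keep a buffer for usage spikes


# Hypothetical node: 28 vCPUs requested, only 9 vCPUs actually in use.
print(f"oversold CPU for low-priority jobs: {oversold_capacity(28, 9):.1f} vCPU")
```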

The overselling feature behaves as follows:

  • Low-priority jobs preferentially use oversold resources.
  • When high-priority jobs are placed on nodes with oversold resources, they can only use the non-oversold resources.
  • Within the same scheduling cycle, high-priority jobs are scheduled before low-priority jobs.

Both cloud-native co-location and overselling can improve resource utilization. So how can resource utilization be raised while still ensuring application performance and service quality?

The CPU isolation capability provided by Huawei Cloud HCE 2.0 OS, combined with CPU fast preemption, SMT control, and load balancing for suppressed offline-task instructions, ensures QoS for online services while allowing suppressed offline tasks to resume as quickly as possible.

In a lab comparison between a simulated online/offline co-location scenario (CPU utilization above 70%) and a scenario where the online service runs alone (CPU utilization 30%), the degradation of online-service performance (latency and throughput) under co-location was kept within 5% of standalone performance. The impact of co-location on performance can therefore be considered negligible.

Let's take a look at a customer case. This customer used Huawei Cloud's cloud-native co-location solution to optimize resource allocation and ultimately achieved a 35% increase in resource utilization.

This customer’s main pain points include:

  • Application interference: Big data workloads and online services such as voice and recommendation compete for resources (CPU, memory, network), affecting the service quality of high-priority tasks.
  • Unreasonable application resource configuration: To ensure successful scheduling, requests are set very low and do not reflect actual load requirements, causing resource conflicts.
  • Applications pinned to cores: Some applications are pinned to specific cores, so overall resource utilization is low.

Based on customer pain points, we provide customers with the following solutions:

  • Switched the node OS from CentOS to Huawei Cloud HCE OS;
  • Switched the scheduler from the default scheduler to the Volcano scheduler;
  • Configured scheduling priority, isolation, and other policies according to the customer's business attributes.

Through Huawei Cloud's cloud-native co-location solution, the customer ultimately achieved a 35% increase in resource utilization.

7. CCE Autopilot: Pay-as-you-go and flexible specifications help customers save costs

CCE's newly launched Autopilot cluster supports pay-as-you-go billing based on the application's actual usage. Compared with a standard CCE cluster, an Autopilot cluster fully manages node operations and maintenance, so you do not need to plan and purchase node resources in advance, enabling refined cost management.

Here we look at two customer scenarios:

  • For Internet entertainment and social networking businesses, traffic during the Spring Festival holiday is several times the normal level; dedicated tracking and O&M assurance is required, and resources must be reserved in advance, which is costly.
  • Online ride-hailing platforms show typical morning and evening peak patterns; the traditional approach requires customers to manually purchase and reserve resources in advance, resulting in low resource utilization.

With Autopilot, refined cost management can be achieved, ultimately reducing overall costs and maximizing revenue.

