Intensive operation and maintenance resource cost control methods - FinOps thinking and practice of Internet companies

Reducing costs and increasing efficiency has become an important direction for major Internet companies, and IT costs account for the majority of their costs. As IT resources keep getting more expensive, many companies have realized the importance of controlling and optimizing these costs.

How to effectively reduce costs? How to do a good job of cost insight management and control? How to master the technical method of resource cost optimization?

In the 5th issue of UGeek, we invited Zhang Guanshi, a well-known figure in the SRE field, author of "SRE Principles and Practice" and former head of business operation and maintenance at Huya, to the live broadcast room. In his talk, "Intensive Operation and Maintenance Resource Cost Control Methods - FinOps Thinking and Practice of Internet Enterprises", Mr. Zhang explained his thinking on FinOps cost optimization and the methods he has explored, from a professional perspective.

Next, let's review the exciting content of this issue with Lu Xiao U.

"Live review"

1. Understand the cost of IT resources and the necessity of FinOps

In FinOps there is a saying: what is saved is profit.

Let's look at a case first.

 

The picture above comes from a brokerage firm's analysis report on the IT resources of a live broadcast platform. In the red box on the left, the server and bandwidth cost per paying user was 125 yuan in 2019 and 2020 and dropped to 81 yuan in 2022. The report's reading is that the platform's server and bandwidth cost per paying user has been falling by more than 20% year over year.

The statistics on the right show income per user of 69 yuan and bandwidth cost per user of 6 yuan, so bandwidth cost is only about 8% of income per user. Seen from the market's side: if the industry is developing quickly and revenue is growing fast, this cost can be ignored. If business growth stagnates, the same cost can become a sensitive topic. That is why cost optimization and control will matter more and more.

>> What share of revenue is reasonable for IT expenditure?

The figure above shows server and bandwidth costs as a proportion of total revenue. In 2019, servers and bandwidth accounted for 13%, which is actually quite high. The ideal is to reach 5%.

Think about your own company: what is its ratio of IT cost to revenue? Is it 5%?

5% can be used as a reference value, and a company that reaches it is in a relatively ideal state. It is not an absolute target, though: it depends on each company's specific situation and market share. If cost optimization is done well enough, the impact on the financial report will be very significant.

As mentioned above, what is saved is profit.
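
The ratio check described above can be sketched in a few lines. This is a minimal illustration that uses the per-user figures quoted from the report (6 yuan of bandwidth cost against 69 yuan of income) and the 5% reference value; the function name and the constant are assumptions made for this example, not something shown in the talk.

```python
def it_cost_ratio(it_cost: float, revenue: float) -> float:
    """Return IT cost as a fraction of revenue."""
    if revenue <= 0:
        raise ValueError("revenue must be positive")
    return it_cost / revenue


# The 5% reference value from the talk; a guide, not a hard rule.
REFERENCE_RATIO = 0.05

# Per-user figures quoted in the report: 6 yuan cost, 69 yuan income.
ratio = it_cost_ratio(6, 69)
print(f"bandwidth cost share of income: {ratio:.1%}")
print("above reference" if ratio > REFERENCE_RATIO else "within reference")
```

The same function works on company-level totals: pass total IT expenditure and total revenue for the period.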

>> What is FinOps?

There is a saying on the Internet that FinOps should be called operation and maintenance economics, or cloud computing economics. Compared with cloud cost optimization, the concept of FinOps is clearly much broader.

First of all, FinOps does not focus only on cloud costs; it also covers the cost of self-built IDCs and other IT resources. The goal is to maximize the productivity of limited resources, which includes understanding how resources are supplied and how to get the most out of them. For example, what is the relationship between the cloud and the supply chain, and how should internal and external teams coordinate? Looking at the Internet industry as a whole, we have to pay attention to a wider range of economic issues. That is why I think it may be more appropriate to call it operation and maintenance economics or cloud computing economics.

>> Waste is ubiquitous

Waste is a common phenomenon. According to Flexera's 2022 State of the Cloud Report, the companies surveyed believe that 32% of their cloud spending is wasted. In China, data analysis and research on a public cloud's customers shows that the utilization of online business resources is generally only 10%-15%.

Reasons for waste may be:

  • Oversized requests: large specifications are requested to cope with possible traffic spikes, then sit idle most of the time and are never released. This comes from traditional practice, where there is neither the habit nor the technology to scale down.
  • Organizational disconnect: applicants and users are different people. Users do not know the cost, and applicants do not know the business. This spans R&D, resource management, and operation and maintenance.
  • Many resource types at large scale: hundreds of types, several vendors, and varied product parameters make it hard for O&M and R&D to keep track of everything.
  • Efficiency versus cost: R&D applies today to launch tomorrow, there is no careful evaluation, and after a hasty launch nobody pays attention.

>> Driving force behind FinOps

I believe most businesses have relatively large room for savings, perhaps 10-30%. If you spend tens of millions of yuan a month, that is millions of yuan saved every month, and the saving is continuous.

When promoting FinOps, there are two issues that need to be considered clearly:

Question 1: There is a big gap between teams, and their goals at each stage are inconsistent. Where does the driving force for cost reduction come from?

Question 2: How do you reduce costs in an enterprise with no obvious waste?

Once these questions are clear, how do we promote FinOps?

  • A large amount to save: this makes the effort worth investing in. When business growth slows or the business even shrinks, reducing costs and increasing efficiency is an inevitable choice.
  • Incentives and rewards: a certain percentage of the savings goes to the optimization team, which makes a serious investment worthwhile for the engineering team.
  • Better supply: strengthen IT resource management, provide more high-quality, low-cost resources, and support innovation with a lower cost of trial and error.

2. Moving to the cloud or off the cloud

Everyone is talking about cloud cost optimization, but FinOps is not limited to the cloud. It also includes the IDC.

The pros and cons of using the cloud vs building your own

The matrix above shows that going to the cloud has many benefits: resource types are rich, delivery is fast, and cloud products start from a high technical baseline, so you can use them with confidence. Maintenance cost is low, and you pay on demand. The biggest disadvantages are that the cloud is expensive and that poorly controlled usage can cause even more waste.

Why is going to the cloud so expensive? Because the price includes many hidden costs: equipment, construction, R&D, sales, support, operation and maintenance, idle capacity, and so on.

Building your own also has benefits, such as low cost and easy control. Many traditional enterprises, including banks, securities firms, and state-owned enterprises, have not moved to the cloud and still run everything in their own IDCs.

>> Hybrid cloud is mainstream

Because both going to the cloud and self-building have advantages and disadvantages, hybrid cloud has become the mainstream and suits most enterprise architectures.

>> How to make good use of the cloud: self-built plus cloud, getting the best of both

The right side of the figure above shows the levels of IDC self-build, that is, the part the enterprise itself is responsible for:

  • The first level starts from civil works: constructing the building, then the power, cabinets, and servers. This is complete self-build.
  • The second level is to rent a building, that is, rent the room and design everything yourself from the cabinets up, including water, electricity, and network.
  • The third level is to own only the network; everything from the cabinets down is provided by the operator or the data center service provider.
  • The fourth level is to rent servers: even the servers are not your own. You give the operator your requirements and they configure the machines. Many game companies do this.

There are two paths to self-build:

  • One is cloud first, then partly off the cloud: go to the cloud in the early stage and move part of the workload off it once the business is stable, keeping a certain ratio between cloud and self-built.
  • The other is self-built for the steady load and cloud for the elastic load: build your own IDC first, put incremental demand on the cloud, and eventually reach a balance.

>> How to make good use of the cloud?

  • Reasonable planning and fine-grained control

Based on the characteristics of the business and its cloud scenarios, plan the budget and scale of cloud resources in advance, then evaluate and control each application against the plan. For specific cloud resources, restrict the available specifications and usage scenarios to prevent over-application.

  • Multi-dimensional insight to identify what to optimize

Track cloud costs in detail along multiple dimensions, and analyze them continuously to identify the parts that can be optimized. Give the demand side better cost insight and decision-making ability by showing cost distribution, consumption trends, and cost structure from perspectives such as time, business system, department, and cloud resource.

  • Vendor-neutral multi-cloud integration

Analyze the commercial prices and discounts of multiple cloud vendors and platforms, and recommend the most cost-effective cloud services and resources for the business.

  • Reasonable optimization, avoid waste

Identify unused or underutilized resources and propose optimization strategies that cut unnecessary cost. Show the estimated before-and-after comparison, including cost reduction and utilization, and then optimize according to the strategy.
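
Identifying underutilized resources can start from something as simple as a threshold on average utilization. The sketch below is an illustration only: the `Resource` fields, the 15% threshold (borrowed from the 10%-15% utilization range mentioned earlier), the fleet data, and the assumption that a flagged resource can be halved in size are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    avg_cpu_util: float  # average CPU utilization, 0.0 to 1.0
    monthly_cost: float  # in yuan


def find_underutilized(resources, threshold=0.15):
    """Return resources whose average utilization is below the threshold."""
    return [r for r in resources if r.avg_cpu_util < threshold]


def estimated_saving(resources, downsize_factor=0.5):
    """Estimate the monthly saving if each listed resource shrinks by the factor."""
    return sum(r.monthly_cost * downsize_factor for r in resources)


# Hypothetical fleet data for illustration.
fleet = [
    Resource("web-01", 0.08, 1200.0),
    Resource("web-02", 0.45, 1200.0),
    Resource("batch-01", 0.12, 3000.0),
]
idle = find_underutilized(fleet)
print([r.name for r in idle], estimated_saving(idle))
```

In practice the utilization figures would come from monitoring data, and the before-and-after estimate would use real price lists rather than a fixed factor.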

3. Resource process and responsibilities

How should the teams work together? FinOps is about production relationships and involves many teams: business R&D, procurement, operation and maintenance, cloud product management, external cloud integration, business architecture, finance, internal audit, legal, and even the boss. With so many teams, how should the work be carried out?

>> Resource application and delivery process

  1. Raise requirements: when an R&D engineer wants to use resources, how does he or she apply for them?
  2. Technology assessment: how is it done, and by whom? If someone wants product XX of cloud XX, ask a few questions: why this product, could another cloud product do, and could existing products meet the need?
  3. Commercial follow-up/evaluation: follow up with vendors and negotiate prices, discounts, vouchers, test quotas, and so on.
  4. Delivery: deliver via the cloud console or a unified platform, configure product and business parameters, and deliver on demand.

>> Challenges for resource managers/architects:

  • There has to be an evaluation process. Hundreds of R&D engineers come asking for all kinds of products, for example "I want GPU product XX of cloud XX".
  • How to evaluate: can there be multiple suppliers, is the price negotiable, and is there an alternative plan?
  • You need to know more than R&D does and communicate more deeply with vendors.
  • R&D generally does not know the price.
  • How does this product on this cloud compare with the equivalent on other clouds: features, architecture differences, price differences, and fit with the existing business architecture?

>> Build resource management and cost insight capabilities

1. Resource cost collection and cost labeling

  • Cost tracking: report cloud and IDC resource usage into CMDB modules or cost tags, and classify resources by resource group, tag, business, department, project, and other dimensions.
  • Cost collection: as soon as a cloud resource is activated, attach it to a CMDB module or a cost tag.
  • Cost apportionment: resources that cannot be attached to a single owner, such as shared SLB bandwidth, are apportioned, which requires agreeing on the ratio. For medium-sized businesses, apportioning by PCU or DAU is sufficient.
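
Apportioning a shared cost by a usage metric such as DAU is a proportional split. The sketch below assumes a hypothetical 10,000 yuan shared SLB bill and made-up DAU figures for three businesses; the function and business names are illustrative.

```python
def apportion(shared_cost, usage_by_business):
    """Split a shared cost across businesses in proportion to a usage metric."""
    total = sum(usage_by_business.values())
    if total <= 0:
        raise ValueError("total usage must be positive")
    return {
        business: shared_cost * usage / total
        for business, usage in usage_by_business.items()
    }


# Hypothetical numbers: a 10,000 yuan shared SLB bill split by DAU.
dau = {"live": 600_000, "short_video": 300_000, "community": 100_000}
shares = apportion(10_000, dau)
print(shares)
```

The same function works with PCU or any other agreed metric; what matters is that the teams agree on the ratio before the bill is split.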

2. Establish a multi-dimensional resource usage perspective

  • Business level dimension: cost insights for each business
  • Allocation relationship dimension: Allocation rules and results of a product or public resource
  • Resource level dimension: Observation and insight by product and resource
  • Bill attribute dimension: multi-dimensional statistics measured by bill
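
One way to picture these perspectives is a single set of tagged bill lines summed along different dimensions. The sketch below is an assumption-laden illustration: the record fields (`business`, `department`, `product`, `cost`) and the sample data are hypothetical, and a real bill export will have its own schema.

```python
from collections import defaultdict


def cost_by(records, dimension):
    """Sum the 'cost' field of bill records grouped by one dimension."""
    totals = defaultdict(float)
    for record in records:
        # Lines without the tag are collected under "untagged" so they stay visible.
        totals[record.get(dimension, "untagged")] += record["cost"]
    return dict(totals)


# Hypothetical bill lines; the field names are assumptions.
bill = [
    {"business": "live", "department": "video", "product": "ecs", "cost": 500.0},
    {"business": "live", "department": "video", "product": "cdn", "cost": 800.0},
    {"business": "shop", "department": "commerce", "product": "ecs", "cost": 300.0},
    {"product": "slb", "cost": 100.0},  # shared resource with no business tag yet
]
for dim in ("business", "product"):
    print(dim, cost_by(bill, dim))
```

Keeping untagged lines in their own bucket makes the remaining apportionment work explicit instead of silently dropping cost.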

4. Saving cost by building your own GPU servers

Let's take GPU servers as an example of saving cost. The background: there are many inference scenarios running on hundreds of cards, all T4 cards on the cloud. The cost is not low and demand keeps growing, so we wanted a cost reduction solution for GPUs. Looking into it, the T4 is an old card from 2018 and newer cards are available, but most of the industry still uses the T4 and the benefits of the new cards were not clear. We did not know which card to use, and our contacts with vendors, old and new, were weak, so we had to investigate whether a switch was feasible and whether it would actually save money. That meant external research (server vendors, card vendors, the industry, business teams) covering card selection, server model design, technical preparation, and procurement, plus internal research on operability and business fit. The result: most of the cards were successfully replaced, saving 70%, and that on top of an already low discounted price.

>> Positioning of the GPU card

If you have not worked with GPU cards, you may not know the lineup well, and it takes some investigation to find out.

The current new cards of the NVIDIA Ampere architecture are mainly the A100 (80G), A30, A16, A10, A2, and A40:

A10: It can replace the T4 card, the supply situation is uncertain, and the order has been stopped

A30: A GPU card with intermediate performance between A100 and V100 for light training and inference scenarios

A40: Mainly replace the previous RTX6000 and RTX8000, for graphics rendering scenarios

A2: For edge scenarios, for inference and rendering, the physical specifications are the same as T4 cards, 70% of the performance of T4 cards, the price is much lower, and the cost performance is better than T4 cards

A16: Mainly replaces M10, for virtualization scenarios

A100 (80G): It mainly replaces the previous V100 and A100 (40G), targeting high-performance computing scenarios

Single-card comparison, for most inference scenarios:

Performance: the A30 delivers about 1.3 to 1.7 times the performance of the A10, with similar power consumption.

Price: the A30 costs about 1.4 times as much as the A10.

Supply: the A30 is the main card being pushed, and the supply of the A10 is hard to predict.
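
The comparison above can be turned into a rough performance-per-price figure. This sketch assumes the 1.4 price ratio means the A30 costs 1.4 times the A10, and it takes the 1.3 to 1.7 performance range at face value; the function name is illustrative.

```python
def perf_per_price(perf_ratio, price_ratio):
    """Relative performance per unit price of the new card versus the old one."""
    return perf_ratio / price_ratio


PRICE_RATIO = 1.4        # A30 price relative to A10, as quoted above
PERF_RANGE = (1.3, 1.7)  # A30 performance relative to A10, as quoted above

low, high = (perf_per_price(p, PRICE_RATIO) for p in PERF_RANGE)
print(f"A30 vs A10 performance per yuan: {low:.2f}x to {high:.2f}x")
```

The lower end of the range falls below 1 and the upper end above it, which is why testing with your own business models matters before committing to a card.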

>> Picking a card: the official upgrade path

The T4 can be upgraded to the A10, A16, or A30. The A16 is a graphics card and is not meant for inference. The policy around the A10 is uncertain and changes often. The catch is that relatively few A30s are on the market.

The picture above shows the parameters of the cards. Pay attention to power consumption and computing power. The new-generation card has a video decoder but no encoder, which matters a great deal for audio and video workloads. Power consumption has more than doubled, which has to be counted in the server cost. The comparison of computing power also deserves attention.

After understanding the parameters, you can run some common business tests, that is, run inference with some commonly used models. Business tests in general scenarios look at many indicators, such as memory, GPU usage, and the speed and business performance of various programs, and then use them to calculate the expected benefit.

Consider the problem comprehensively and do not limit how FinOps is applied, because it is not just cloud optimization. Open up your thinking and look at it from the perspective of all business resources, the whole supply chain, or even the whole economy, and ask how to support business development in the most efficient way.

5. CPU computing power cost reduction ideas and small cases

CPU optimization, or computing power optimization in general, has been discussed a lot in the industry. The topics covered here are CPU generation differences, billing methods, HTTPS bandwidth, and storage back-to-source bandwidth.

>> General method for computing power optimization

The general method is to scale in and scale down: reduce capacity to cut quantity, reduce specifications to cut waste, and squeeze utilization higher. Beyond that: upgrade to new-generation models, move part of the workload to the cloud, use edge data centers, and use computing power in remote regions.

Advanced features: elastic scaling (tidal scheduling, passive elasticity, predictive elasticity, one-click scale-out and scale-in) and mixed deployment to raise reuse (online/offline mixed deployment, staggered-peak mixed deployment, container granularity, cross-service mixed deployment).

>> Is the specification you choose reasonable?

Why ask this question? Because many companies use cloud hosts, or cloud resources in general, as if they were physical machines. A host applied for two years ago may have been running ever since without any problems. But think about it: its CPU generation may already have been old two years ago. Has the specification ever been upgraded to the latest? For example, I compared two hosts on the cloud and found that the latest generation was actually cheaper, at 88% of the price of the older one, while its performance was 1.7 times as high. Think about the benefits.

So when applying for a host, don't just look at the number of cores or the frequency. A CPU generation upgrade also brings relatively large performance benefits.
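
To make the benefit concrete, the figures in the example can be read as "the newer host costs about 88% of the older one and delivers about 1.7 times the performance". That reading is an interpretation of the quoted numbers, and the function name below is illustrative.

```python
def cost_per_performance(price_ratio, perf_ratio):
    """Relative cost of one unit of performance on the new generation."""
    return price_ratio / perf_ratio


# Assumed reading of the example: new price is 88% of the old price,
# new performance is 1.7 times the old performance.
relative = cost_per_performance(0.88, 1.7)
print(f"cost per unit of performance: {relative:.0%} of the old generation")
print(f"implied saving for the same workload: {1 - relative:.0%}")
```

Under that reading, the same amount of work costs roughly half as much on the newer generation, before any migration effort is counted.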

>> Computing power cost optimization: choose the cloud machine generation

  • CPU Model Tips
  • Intel® Xeon® Platinum 8369B (Ice Lake)
  • Intel® Xeon® Platinum 8269CY (Cascade Lake)
  • Intel® Xeon® Platinum 8163 (Skylake)
  • Intel® Xeon® E5-2682 v4 (Broadwell)

>> Computing power cost optimization: choose the billing method

Different billing methods lead to very different costs.

A few points here. For hosts of the same model and specification, the cost of other billing methods differs considerably from yearly or monthly subscription. If your scheduling capability can keep up, you can make good use of these billing methods to cut costs.

For example, in one business, one department uses a batch of resources during the day and another department uses a different batch at night. The two have different requirements at different times, and the yearly and monthly subscription resources sit wasted for nearly half the time. There are two ways to solve this:

  • Method 1: use a savings plan together with pay-as-you-go. The total cost dropped by 42% after switching.
  • Method 2: use technology to achieve mixed deployment, so that both departments share the same batch of resources.
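
The trade-off behind Method 1 can be sketched by comparing a fixed subscription with paying only for the hours used. The prices below are invented for illustration and are not vendor prices, so the result does not reproduce the 42% figure from the case; the function names are also assumptions.

```python
def subscription_cost(monthly_price, instances):
    """A subscription is paid in full regardless of hours used."""
    return monthly_price * instances


def on_demand_cost(hourly_price, instances, hours_used):
    """Pay-as-you-go is paid only for the hours actually used."""
    return hourly_price * instances * hours_used


def break_even_hours(monthly_price, hourly_price):
    """Hours per month above which the subscription becomes cheaper."""
    return monthly_price / hourly_price


# Hypothetical prices: 720 yuan per month vs 1.5 yuan per hour, per instance.
MONTHLY_PRICE = 720.0
HOURLY_PRICE = 1.5

sub = subscription_cost(MONTHLY_PRICE, instances=10)
od = on_demand_cost(HOURLY_PRICE, instances=10, hours_used=12 * 30)
print(f"subscription: {sub}, pay-as-you-go: {od}, saving: {1 - od / sub:.0%}")
print(f"break-even: {break_even_hours(MONTHLY_PRICE, HOURLY_PRICE)} hours per month")
```

A workload that runs only half of each day sits below the break-even point in this example, which is the situation the day and night departments describe.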

6. FinOps cost control method

Let's talk about the general method of cost control.

>> General method for cost optimization

The simplest way to think about cost optimization is that cost equals quantity multiplied by unit price. So there are two ways to optimize cost: reduce the quantity used, or reduce the unit price.

The first lever is reducing usage through operational optimization, such as cutting quantity and raising utilization. The second lever is lower resource cost: either have the business side negotiate a lower price, or find cheaper resources. A further step is structural change, which means using different resources to meet the same business needs.

Reduce usage and improve resource efficiency

  1. Efficiency utilization: scale in or downgrade to raise utilization, reduce redundancy, and make full use of computing power, network bandwidth, storage space, and so on.
  2. Time utilization: mixed deployment of workloads, making full use of resources in the time dimension.
  3. Elasticity: apply for resources only when business capacity requires them.
  4. New technologies: such as P2P and AI image quality.
  5. Architecture improvements that reduce usage: route traffic over the intranet, add caching to cut bandwidth, and reduce computation.

Reduce the unit price and use lower-cost resources

  • For the same resource, use a cheaper option: a similar product on a different cloud, a different specification on the same cloud, or self-built IDC servers. Example: low-frequency storage on OSS.
  • For the same resource, use a cheaper metering and billing method: 95 billing (95th percentile) or traffic billing; monthly subscription, pay-as-you-go, preemptible instances, RI, SP, resource packs, and so on.
  • Negotiate lower prices: keep the architecture migratable, take advantage of multi-cloud, and maintain bargaining power.
  • Use different (more cost-effective) resources
  • Upgrade the CPU, GPU, network card, disk, and so on. Case: upgrading GPU models so that one card replaces several.
  • Upgrade server components so that one server replaces several.
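
Among the billing methods listed above, "95 billing" commonly refers to 95th-percentile bandwidth billing: bandwidth is sampled at regular intervals, the highest 5% of samples are discarded, and the bill is based on the highest remaining sample. The sketch below shows that idea under simplifying assumptions; sampling intervals and rounding rules differ between vendors, and the function name is illustrative.

```python
def billable_bandwidth_95(samples_mbps):
    """Return the 95th-percentile sample: sort, drop the top 5%, take the max."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # ceil(n * 0.95) computed with integers to avoid float rounding surprises.
    keep = (len(ordered) * 95 + 99) // 100
    return ordered[keep - 1]


# Hypothetical samples: steady 10 Mbps with one short burst to 1000 Mbps.
samples = [10] * 19 + [1000]
print(billable_bandwidth_95(samples))
```

Short bursts that fall inside the discarded 5% do not raise the bill, which is why this method suits traffic with brief peaks.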

>> Resource cost optimization ideas

Here are some suggestions on how to optimize and reduce costs, for your reference.

  • First, from the business perspective, optimize the entire link: from CDN to LB to computing power, storage, middleware, and database. Look for the product whose usage is unreasonable and can be optimized.
  • Second, from a vertical, specialist perspective, do global optimization per technical component, such as dedicated optimization of all computing power, all storage, all bandwidth, and so on.
  • Third, use multi-cloud to improve your bargaining power with different cloud vendors.
  • Fourth, make good use of the cloud vendors' strongest products, the ones that are hard to build yourself, such as BGP LB, large-scale CDN, and AI.

>> Thinking about FinOps construction

  • Organizational construction: build a FinOps team, which can be virtual, to strengthen expertise in resources, costs, and business architecture.
  • Process specification: make the business/application the main unit of management, covering resource, cost, and other control processes, with reasonable evaluation of usage, budgets, and quotas.
  • Cost insight: study the cost structure of each resource and each business in depth and from several angles: business, resource, cost, usage, and utilization.
  • Cost optimization: optimize the efficiency of each kind of resource, look at the whole link together with the business, and research and document optimization solutions that can be reproduced.

One, Q&A

Q1: How to reduce cloud costs?

Zhang Guanshi: The simplest way to think about cost optimization is that cost equals quantity multiplied by unit price, so either reduce the quantity or lower the unit price. How to do it depends on the specific situation of the enterprise. The simplest way is the one mentioned above: reduce the number of resources or reduce their specifications, which raises overall resource utilization fairly easily.

The second way is also relatively simple: from the business perspective, use lower-cost options, such as different billing methods, different resources, and resources of different tiers. It can also be done through scheduling.

Q2: How is 5% calculated? How is the cost calculated in the case? Such as the server bandwidth cost of a single user.

Zhang Guanshi: The case data was derived by a brokerage firm from the company's financial reports; it is not our internal data. For your own company, the inputs should be available: revenue can be obtained from public reports or internally, and IT expenditure should be known, so the ratio to compare against 5% is relatively easy to calculate. How to calculate per user? There are many ways, and it mainly depends on which one suits the company's business. Some calculate by PCU, that is, the cost per concurrent user. Others calculate by MAU, the cost per active user, or by order, the cost per order. The idea is to combine cost with business volume.

Q3: Resource utilization is sometimes hard to measure, especially once the business has grown and different departments consume cloud resources in different ways. If the boss wants to see detailed statistics, what is a good way to handle it?

Zhang Guanshi: Measuring utilization is an important task, because it is the basis for cost insight and is closely tied to observability. Every company, I believe, already summarizes the utilization of a single server from monitoring data. Measuring utilization for a whole platform, or at some aggregated level, requires an algorithm. That algorithm can be debated in many ways, but as long as the company agrees on one, it works. There is also a simpler way: divide utilization into tiers, decide which servers count as high-utilization and which as idle, and report the proportion in each tier. This is a simple management method.
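
The tiering approach from this answer can be sketched as follows. The tier boundaries (5%, 30%, 60%), the tier names, and the sample utilization values are assumptions chosen for illustration; each company would agree on its own.

```python
from collections import Counter

TIERS = ("idle", "low", "medium", "high")


def tier(utilization):
    """Map an average utilization (0.0 to 1.0) to a coarse tier."""
    if utilization < 0.05:
        return "idle"
    if utilization < 0.30:
        return "low"
    if utilization < 0.60:
        return "medium"
    return "high"


def tier_shares(utilizations):
    """Return the proportion of servers that falls into each tier."""
    if not utilizations:
        raise ValueError("need at least one server")
    counts = Counter(tier(u) for u in utilizations)
    return {name: counts[name] / len(utilizations) for name in TIERS}


# Hypothetical per-server average CPU utilization from monitoring.
fleet_util = [0.02, 0.10, 0.12, 0.40, 0.75, 0.03, 0.25, 0.65]
print(tier_shares(fleet_util))
```

A report like this avoids debating a single fleet-wide utilization number and still shows the boss where the idle capacity sits.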

Q4: How to charge for private cloud? How to amortize private cloud costs? Is there a good implementation plan for private cloud expense?

Zhang Guanshi: As I understand it, this is a question about apportioning private cloud costs. Whether you run a private cloud or simply want to do FinOps, cost allocation is an important task. The basic principle is that all of your costs should be allocated to the individual businesses. So how do you apportion them? There are several ways, and they apply to the public cloud as well as the private cloud. The question is how to allocate costs and how to link them to each business team. One way is tagging: just as on the public cloud, attach a business tag to each resource and aggregate by tag to collect the costs. Another way is to attach resources through the CMDB. Because the CMDB corresponds to our businesses, you attach all resources to it, and attaching and collecting are then handled in one unified way.


Origin blog.csdn.net/EasyOps_DevOps/article/details/131602192