How to improve efficiency and save costs in the data center?

The previous section discussed how to ensure the data quality of the data platform and make the data "accurate". Besides "fast" and "accurate", a data center also cannot do without "cheap". As the data scale grows, costs keep climbing, and if they are not kept under control, the company's profit will be eaten up before you can extract any application value from the data.

Whether refined cost management can be achieved often determines the success or failure of a data center project.

Figure: growth trend of data construction resources for an e-commerce business (1 CU = 1 vCPU + 4 GB memory)

Looking at the growth trend of this e-commerce platform's big data resource consumption: in 2019 the annual resource scale reached 25,000 CU, with an annual machine budget of 3500W yuan (1W = 10,000 yuan, so about 35 million). That is clearly not a small expense for a startup.

One day, Li Haoliang, the head of the data team, was called into the CEO's office and asked:

  • What business is this 3500W spent on?
  • What cost optimization measures have you taken and how effective are they?

Li Haoliang was puzzled. He thought to himself: the team's costs are accounted for by machine, not by data application, and in the data center the underlying data is shared across applications, so without that kind of accounting, how could he possibly know how much money each data product or report costs?

But these questions matter a great deal to the CEO. Resources are limited, and he must make sure they are invested at the key nodes of the company's strategic goals. For example, the e-commerce team's core KPI this year is to increase the spending of each registered member on the platform. From the boss's point of view, resources must go to KPI-related work, such as data-driven targeted marketing to registered members that raises their spending on the platform.

Has something similar happened to your team? The data department is a cost center of the enterprise. To demonstrate its value, it must:

  • Support the business well and gain business recognition
  • Streamline costs and save money for the company

So today the focus is on saving money: the refined cost management of the data platform.

1 Cost traps

When you first build a data center, you tend to focus on onboarding new business, integrating data, and mining data value, while ignoring cost control. That is how you fall into the traps that lead to explosive cost growth. It is therefore worth understanding these pitfalls thoroughly and avoiding them in day-to-day development.

Here are 8 traps:

  • Traps 1~3 are widespread but easily overlooked
  • Traps 4~8 involve some data development skills; just pay attention to them while developing

"Knowing what it is, but more importantly knowing why it is", can discover the essence of the problem and deeply grasp the method to solve the problem.

1.1 Getting data online is easy; taking it offline is hard

Take the table usage statistics of one data center project: half of the tables had not been accessed in the last 30 days, yet they accounted for 26% of total storage. Considered separately, the tasks that produce these tables consume 5,000 CPU cores at peak, which is about 125 servers (assuming 40 allocatable cores per server) and nearly 500W yuan per year. Do I really have that much useless data? I often compare data to the photos on a phone: we keep taking pictures but are too lazy to delete any, and eventually the phone runs out of storage.
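The arithmetic above (5,000 cores at peak, 125 servers, roughly 500W a year) can be reproduced in a few lines; the per-server annual price below is an assumed figure for illustration only.

```python
# Rough cost estimate for the output tasks of the unused tables (illustrative).
peak_cpu_cores = 5000           # CPU cores these tasks consume at peak
cores_per_server = 40           # allocatable cores per server, as in the text
yuan_per_server_year = 40_000   # assumed annual cost per server (~4W yuan)

servers = peak_cpu_cores / cores_per_server            # 125 servers
annual_cost_w = servers * yuan_per_server_year / 10_000
print(f"{servers:.0f} servers, ~{annual_cost_w:.0f}W yuan per year")   # ~500W
```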

The data cannot be cleaned up in time, and data developers have their own difficulties. For any given table, they don't know:

  • which other tasks still reference it
  • who else is still querying it

Naturally, no one dares to stop the table's processing tasks, which is exactly why getting data online is easy but taking it offline is hard.
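This is exactly where the metadata center can help. A minimal sketch of the check a developer would want before stopping a table's output task, assuming hypothetical metadata tables meta.lineage (input_table, downstream_task) and meta.access_log (table_name, user, access_time):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

table = "dwd.order_detail"  # hypothetical table being considered for offline

# Which downstream tasks still reference this table? (meta.lineage is assumed)
downstream = spark.table("meta.lineage").where(F.col("input_table") == table)

# Who has queried it in the last 30 days? (meta.access_log is assumed)
recent_users = (
    spark.table("meta.access_log")
    .where((F.col("table_name") == table) &
           (F.col("access_time") >= F.date_sub(F.current_date(), 30)))
    .select("user").distinct()
)

if downstream.count() == 0 and recent_users.count() == 0:
    print(f"{table}: no downstream tasks, no access in 30 days -> offline candidate")
```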

1.2 Low-value data applications consume a lot of resources

The data seems to be accessed every day, but how much value does it actually produce, and is the ROI worth it?

Consider a wide table (a table with many columns, typically found in the aggregation layer downstream in the data center). Including its upstream processing chain, producing this wide table costs about 6,000 yuan per day, roughly 200W a year. When we traced its usage, it turned out that only one person used it each day, and that person was an operations intern. Clearly, input and output were hugely mismatched.

In effect, the data department pays close attention to the value new data products bring to the business, but forgets to ask whether existing products and reports still have value, so low-value applications keep consuming large amounts of resources.

1.3 The "chimney" (siloed) development model

Siloed development not only lowers R&D efficiency, it also wastes resources by processing the same data repeatedly. Consider a 500 TB table: the tasks that produce it consume 300 CPU cores at peak, equivalent to about 7 servers (at 40 allocatable cores per server); adding the disk storage cost (at 0.7 yuan/TB per day), the annual cost comes to roughly 40W.

Every time this table is reused instead of being rebuilt, roughly 40W is saved. So model reuse also saves money.

1.4 Data skew

Data skew degrades task performance and wastes a lot of resources. So what exactly is data skew?

Figure: data partitioning in a single-stage Spark task

You must have heard of the bucket effect: how much water a bucket holds depends on its shortest stave. The same effect exists in a distributed parallel computing framework. The Spark engine, for example, splits massive data into partitions, assigns them to tasks running on different machines, and computes them in parallel, scaling computing power horizontally.

But the running time of the whole job is determined by its longest-running task. Because the data volume of each partition differs, the resources each task needs also differ, yet tasks in the same stage cannot be given different resource allocations. So the total resource consumption of a job = max{resources consumed by a single task} × number of tasks, which means tasks that process little data are allocated far more than they need, and resources are wasted.

Let's take an example of an e-commerce scenario.

Suppose you need to compute transaction amounts at merchant granularity, that is, a group by merchant over the order flow table. On the platform, order volumes vary greatly across merchants: some merchants have huge transaction volumes, while others have relatively few orders.


We use Spark SQL to complete the calculation process.
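A minimal sketch of that aggregation, assuming an order flow table dwd.order_flow with columns merchant_id and amount (the names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Group-by over the order flow table. Merchants with huge order volumes make
# some shuffle partitions far larger than others, which is the data skew below.
merchant_gmv = spark.sql("""
    SELECT merchant_id, SUM(amount) AS total_amount
    FROM dwd.order_flow
    GROUP BY merchant_id
""")
merchant_gmv.write.mode("overwrite").saveAsTable("dws.merchant_gmv_daily")
```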

Figure: schematic diagram of data skew

In the figure above, task A reads the data of one partition on the left, aggregates it by merchant, and then outputs it to tasks B, C, and D of the next stage.

You can see that after the shuffle, the input data volumes of tasks B, C, and D differ greatly: B processes far more data than C and D and naturally needs more memory. Suppose a single Executor for B needs 16 GB; since B, C, and D cannot be given different memory sizes, C and D are also set to 16 GB, even though 4 GB would be enough for their data volume. The result is wasted resources on the C and D tasks.

1.5 No lifecycle is set for the data

In Lecture 06, I emphasized that raw data and detailed data generally keep their complete history. For the aggregation layer, data mart layer, and application layer, however, storage cost has to be considered, so it is recommended to manage data by lifecycle, usually keeping only the snapshots or partitions of the last few days. A large table with no lifecycle set wastes storage resources.

1.6 Unreasonable scheduling periods

Figure: daily resource consumption curve of big data tasks, showing clear peaks and troughs

As the figure shows, the resource consumption of big data tasks has an obvious peak-and-trough pattern: the peak generally runs from midnight to 9:00 the next morning, and the trough from 9:00 in the morning until midnight.

Although tasks have this clear peak-and-trough pattern, server resources are not elastic, so servers sit relatively idle in the trough and are overloaded at the peak, and the cluster's overall resource allocation is determined by peak-period consumption. Migrating tasks that do not have to run during the peak to the off-peak period therefore also saves resources.

1.7 Unreasonable task parameter configuration

Unreasonable task parameters often waste resources. In Spark, for example, the Executor memory may be set too large or too many CPU cores requested; or dynamic resource allocation is not enabled, so Executors that have finished their tasks are never released and keep occupying resources. The waste is especially obvious when there is data skew.
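As a sketch of the kind of settings involved (the values are illustrative and must be tuned per workload, not a recommended configuration):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("right_sized_job")
    # Avoid oversized executors: moderate memory and cores per executor.
    .config("spark.executor.memory", "6g")
    .config("spark.executor.cores", "3")
    # Release idle executors instead of letting them keep occupying resources.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "50")
    # Dynamic allocation needs an external shuffle service (or shuffle tracking).
    .config("spark.shuffle.service.enabled", "true")
    .getOrCreate()
)
```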

1.8 Data is not compressed

For high availability, Hadoop HDFS stores 3 replicas of data by default, so the physical storage consumed by big data is large. This is especially true for large tables in the raw data layer and detailed data layer, which can easily exceed 500 TB: with 3 replicas that is 1.5 PB of physical storage (500 × 3), or about 16 physical servers (assuming 12 × 8 TB of allocatable storage per server). Without compression, storage costs run very high.

In addition, the intermediate results produced while Hive or Spark jobs run should also be compressed, which reduces network traffic and improves shuffle performance (the shuffle is the transfer of data between nodes during Hive or Spark computation).
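A sketch of the Spark-side settings for compressing intermediate (shuffle) data; these are standard Spark configuration keys, shown with illustrative values:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Compress shuffle (intermediate) data to cut disk and network I/O.
    .config("spark.shuffle.compress", "true")        # on by default, shown explicitly
    .config("spark.shuffle.spill.compress", "true")
    .config("spark.io.compression.codec", "lz4")     # codec for shuffle/spill blocks
    .getOrCreate()
)
# For Hive-on-MapReduce jobs, the analogous switches are
# hive.exec.compress.intermediate and mapreduce.map.output.compress.
```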

So those are the 8 typical cost traps. You may now ask: teacher, I have already fallen into them, what should I do? Don't worry, let's look at how to carry out refined cost management.

2 How to achieve refined cost management?

Cost governance should follow four steps: global asset inventory, problem discovery, governance optimization, and effect evaluation.

2.1 Global asset inventory

Conduct a comprehensive inventory of all data in the data center, and establish a full-link data asset view based on the data lineage provided by the metadata center.


Full link data asset view:

  • The downstream end connects to the data applications (for example, the financial analysis report)
  • The upstream starting point is the raw data that has just entered the data center
  • Data nodes are linked to each other by tasks

Then calculate the cost and value of the terminal data in the full-link data asset view (terminal data means the most downstream tables of the processing chain, such as Table A and Table G in the figure).

Why start from the terminal data? Because valuing intermediate data would also require accounting for how its downstream tables are used, which is hard to compute cleanly, so we start from the end. This also matches the order in which tables are taken offline: when data has low value and high cost, taking it offline likewise starts from the terminal data.

How should data costs be calculated?

To calculate the cost of the financial analysis report in the figure above: its upstream chain involves 3 tasks (a, b, c) and 6 tables (A, B, C, D, E, F).

The cost of this report = the computing-resource cost of the 3 processing tasks + the storage-resource cost of the 6 tables.

If a table is reused by multiple downstream applications, its storage cost and the cost of its output task must be apportioned across those applications.
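A minimal sketch of this accounting with hypothetical numbers; splitting a shared table's cost evenly across the applications that use it is one simple apportioning rule among several:

```python
# Cost of a data application = compute cost of upstream tasks
#                            + storage cost of upstream tables,
# with shared tables apportioned across the applications that reuse them.
# All figures below are hypothetical (yuan per day).

task_cost = {"a": 120.0, "b": 80.0, "c": 200.0}
table_storage_cost = {"A": 30.0, "B": 15.0, "C": 10.0, "D": 25.0, "E": 5.0, "F": 20.0}
shared_by = {"A": 1, "B": 2, "C": 1, "D": 1, "E": 3, "F": 1}  # apps sharing each table

report_tasks = ["a", "b", "c"]
report_tables = ["A", "B", "C", "D", "E", "F"]

compute_part = sum(task_cost[t] for t in report_tasks)               # tasks not shared here
storage_part = sum(table_storage_cost[t] / shared_by[t] for t in report_tables)
print(f"financial analysis report cost ≈ {compute_part + storage_part:.1f} yuan/day")
```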

How should the value be calculated?

If the terminal data is an application-layer table that feeds a data report, then its value depends mainly on the report's scope of use and frequency of use.

The scope of use is usually evaluated by weekly active users, weighted by the management level of each user: for the boss, his weight alone may be equivalent to 1,000 ordinary employees. The rationale is that the higher the management level, the greater the impact of the business decisions made with the report, and hence the greater its value. Frequency of use is generally measured by how many times a single user views the report each week: the more views, the more valuable the report.
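A minimal sketch of such a weighted value score; the level weights and the way reach and frequency are combined are assumptions for illustration:

```python
# Value score of a report = level-weighted weekly active users * average views per user.
level_weight = {"staff": 1, "manager": 10, "director": 100, "boss": 1000}  # assumed weights

weekly_viewers = [  # (user, level, views this week) -- illustrative data
    ("alice", "staff", 5),
    ("bob", "manager", 3),
    ("ceo", "boss", 1),
]

weighted_reach = sum(level_weight[level] for _, level, _ in weekly_viewers)
avg_frequency = sum(views for _, _, views in weekly_viewers) / len(weekly_viewers)
value_score = weighted_reach * avg_frequency
print(f"reach={weighted_reach}, frequency={avg_frequency:.1f}, value={value_score:.0f}")
```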

If the terminal data feeds not a report but a data application for a specific scenario (such as the supply chain analysis and decision system I mentioned earlier, which mainly serves the supply chain department), then the value of such a product is measured mainly by its coverage of the target population and its direct business value output. What is a direct business value output? In the supply chain decision system, it is the share of purchase orders automatically generated by the system among all purchase orders.

The terminal data may also be a data-mart-layer table used mainly for analysts' exploratory queries. The value of such a table depends on which analysts use it and how often; when evaluating its scope of use, analysts are likewise weighted by level.

2.2 Finding problems

The global inventory provides the data needed to find problems. Focus on three types:

  • Terminal data that keeps incurring cost but is no longer used (generally meaning no access within the last 30 days)

    Unused tables that keep consuming cost correspond to trap 1

  • Data applications with very low value but high cost, together with all the data on their upstream chains

    Low-value, high-cost data applications correspond to trap 2

  • Data with high resource consumption during peak periods

    High-consumption data corresponds to traps 4~8

Trap 3 was already addressed in Lecture 06 on model design.

2.3 Governance optimization

Develop appropriate strategies for these three types of problems.

The first category is handled by taking tables offline. Be cautious when doing so, and follow the execution flow for taking data offline:

After terminal data is deleted, the upstream data of the original terminal data becomes the new terminal data, and the cycle from problem discovery to governance optimization must be repeated until no remaining terminal data matches the offline policy.

For the second type of problem, evaluate at the application granularity whether the application is still necessary. For reports, you can adopt a policy of automatically retiring applications that have not been accessed within 30 days: first take the report offline, then take its upstream tables offline. If a table is still referenced by other applications, it cannot be taken offline. The offline steps are the same as above.

The third type of problem concerns high-consumption data, which splits into high-consumption output tasks and high-consumption data storage. For high-consumption output tasks, the first thing to check is data skew. How do you judge? Look at the shuffle data volumes in the MR or Spark logs: if one task handles a very large amount of data while the others handle little, you can conclude there is data skew.

Figure: Spark task shuffle records

Figure: MR reduce task records
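A minimal sketch of that check: given per-task shuffle read sizes copied from records like those above (or pulled from the Spark UI / history server), flag the stage if one task is far larger than the median. The 10x threshold is an assumption.

```python
from statistics import median

# Per-task shuffle read sizes (MB) for one stage -- illustrative values.
shuffle_read_mb = [110, 95, 102, 98, 4800, 105, 99, 101]

med = median(shuffle_read_mb)
worst = max(shuffle_read_mb)
skew_ratio = worst / med if med else float("inf")

if skew_ratio > 10:  # heuristic threshold
    print(f"likely data skew: max task reads {worst} MB vs median {med} MB "
          f"({skew_ratio:.0f}x)")
```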

How should data skew be handled?

There are some applicable solutions for different scenarios:

  • When a large table is joined with a small table and an uneven key distribution causes skew, a map join can be used
  • A more general approach is to process the hotspot keys separately, process the remaining keys normally, and then combine the results (see the sketch after this list)
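A sketch of both ideas in PySpark, with assumed table, column, and hot-key names; F.broadcast() is Spark's equivalent of a map join.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
orders = spark.table("dwd.order_flow")     # large table, skewed on merchant_id (assumed)
merchants = spark.table("dim.merchant")    # small dimension table (assumed)

# 1) Large table joins small table: broadcast (map join) the small side,
#    so the large table is not shuffled by the skewed key.
joined = orders.join(F.broadcast(merchants), "merchant_id")

# 2) General hotspot handling: aggregate hot keys and the long tail separately,
#    then union the results.
hot_keys = ["m_001", "m_002"]              # known hotspot merchants (assumed)
hot = (orders.where(F.col("merchant_id").isin(hot_keys))
             .groupBy("merchant_id").agg(F.sum("amount").alias("total_amount")))
rest = (orders.where(~F.col("merchant_id").isin(hot_keys))
              .groupBy("merchant_id").agg(F.sum("amount").alias("total_amount")))
result = hot.unionByName(rest)
```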


Besides data skew, you should also check the task's configuration parameters. Taking the Spark execution engine as an example:

  • Is the number of Executors set too high?
  • Are executor-cores and executor-memory set too high while utilization stays low?

In practice, executor-memory is generally set to 4G-8G and executor-cores to 2-4; these are the settings that have shown the best utilization.

Also consider whether a task really needs to run during the peak period. Based on cluster load, migrate tasks to the off-peak period wherever possible to "cut the peak and fill the valley".

That covers high-consumption output tasks.

For data with large storage consumption, first consider compression; it is especially recommended for the raw data layer and the detailed data layer.

Compression methods:

  • For small files, where splitting is not a concern, gzip is a good fit
  • For large files, lzo is recommended: it supports splitting and offers a fairly stable compression ratio while keeping compression efficient
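A sketch of enabling storage compression when writing a raw-layer table; the table name and path are assumptions, and lzo additionally requires the hadoop-lzo codec to be installed, so gzip is shown here:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("ods.order_flow_raw")   # table name is an assumption

# Columnar format plus compression for the raw/detailed layers.
# gzip: high ratio, fine when splitting individual files is not a concern;
# lzo/snappy: faster and splittable (lzo needs the extra codec installed).
(df.write
   .mode("overwrite")
   .option("compression", "gzip")
   .parquet("/warehouse/ods/order_flow_raw_gzip"))
```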

Also consider whether a lifecycle has been set:

  • The ODS raw data layer and the DWD detailed data layer are suited to a permanent retention strategy
  • For some product and user dimension tables, a 3-5 year retention strategy can be considered

Overall, the lower-level tables are kept long term; the focus should be on tables at and above the aggregation layer, where a 7-day or 1-month retention strategy can generally be set according to the importance of the data, as in the sketch below.
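A sketch of a simple lifecycle job that keeps only the last 7 daily partitions of an aggregation-layer table; the table name and partition column dt are assumptions:

```python
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

table = "dws.merchant_gmv_daily"   # assumed partitioned Hive table, partition column dt
retention_days = 7
cutoff = (date.today() - timedelta(days=retention_days)).strftime("%Y-%m-%d")

# Enumerate partitions and drop those older than the cutoff.
for row in spark.sql(f"SHOW PARTITIONS {table}").collect():
    dt_value = row.partition.split("=", 1)[1]    # e.g. "dt=2024-05-01" -> "2024-05-01"
    if dt_value < cutoff:
        spark.sql(f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{dt_value}')")
```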

2.4 Governance effect evaluation

Governance results must be quantified: how much money was saved.

Measuring the number of servers directly does not truly reflect the governance effect, because business growth must also be factored in. Instead, look at the tasks and data that were governed:

  • How many tasks and how much data were taken offline
  • How many resources those tasks consumed per day
  • How much storage space that data occupied

Convert these resources into cost and you can calculate how much money was saved. For example, in the opening case, task A runs for 3 hours and consumes a total of 5,384,503 CPU*s and 37,007,892 GB*s per run. Assume 1 CU (1 CPU, 4 GB memory) costs 1,300 yuan a year, or about 3.5 yuan per day (1300/365).

Whether a task is optimized or taken offline, only the peak period is counted, because optimizing off-peak usage does not actually save resources.

The peak period is 8 hours a day, so 1 CU costs about 0.00012153 yuan per second during the peak (3.5 / (8 × 3600)). The cost of the job is then max{5,384,503 × 0.00012153, 37,007,892 / 4 × 0.00012153} = max{654, 1124} = 1,124 yuan. Taking the task offline saves 1,124 yuan; multiply the storage space occupied by table A by the cost per GB, and you get the additional saving from taking table A offline.
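The same calculation in a few lines, using the rounded figures from the text:

```python
# Reproduce the example: cost of task A attributed to the 8-hour peak window.
cu_yuan_per_day = 3.5                      # ~= 1300 / 365, rounded as in the text
peak_seconds = 8 * 3600
yuan_per_cu_second = cu_yuan_per_day / peak_seconds    # ~= 0.00012153

cpu_seconds = 5_384_503                    # CPU*s consumed by task A per run
gb_seconds = 37_007_892                    # GB*s consumed per run (1 CU = 4 GB memory)

cpu_cost = cpu_seconds * yuan_per_cu_second            # ~= 654
mem_cost = (gb_seconds / 4) * yuan_per_cu_second       # ~= 1124
print(f"task A cost = max({cpu_cost:.0f}, {mem_cost:.0f}) = {max(cpu_cost, mem_cost):.0f} yuan")
```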

3 Cost management center

Cost governance is not a once-and-for-all job; it must be sustained, continually finding problems and then optimizing. The prerequisite of such a long-term mechanism is lowering the threshold of cost governance. Let's take a look at EasyCost, NetEase's cost governance platform.

The system provides data diagnosis: offline policies can be set based on access time, access frequency, and associated applications, and it supports one-click grayscale offlining, which greatly improves governance efficiency.

By building these capabilities into a product, governance efficiency is raised through the product itself, and the governance mechanism can be sustained over the long term.

Summary

Through a data center:

  • You can reap the dividends of big data as an asset center
  • Or you can fall into the cost abyss and end up paying for the unchecked growth of big data costs

This article started from common cost traps, analyzed the likely causes of cost waste, then introduced the method of refined cost management, and finally emphasized:

  • Taking useless data offline should start from the end of the full-link data asset view and work upstream layer by layer along the processing chain.
  • The value of an application-layer table should be measured by the value of its data application, and applications with low value output should be taken offline at the application granularity.
  • To optimize high-consumption tasks, you normally only need to look at tasks in the cluster's peak period, because the project's overall resource consumption depends only on peak consumption. If you use public cloud resources with differentiated peak/off-peak billing, though, the off-peak period deserves attention as well.

FAQ

In the data mart layer of the data center there are large wide tables with hundreds of fields and perhaps dozens of upstream tables, so computing such a table is very expensive. The fields of the table are accessed with very different frequencies. How should such a wide table be optimized?

  1. Vertical splitting: split the wide table by field access frequency, putting frequently accessed fields in one table and rarely accessed fields in another. This reduces the number of fields scanned per query and improves query efficiency.

  2. Horizontal splitting: split the wide table by rows (for example by time or by business partition), keeping each split within an acceptable size, so that a single query scans less data.

  3. Indexing: build indexes on the frequently accessed fields of the wide table to speed up queries.

  4. Caching: cache frequently queried data in memory to reduce query time.

  5. Data compression: compress the cold data in the wide table to reduce storage space.

Choose an appropriate optimization method for your actual situation to improve query efficiency; as an example, a vertical split (option 1) might look like the sketch below.
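A minimal sketch of a vertical split in PySpark, with hypothetical table and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Split a wide mart-layer table: frequently queried ("hot") columns go into a
# slim table, the rest into a "cold" table sharing the same key.
wide = spark.table("dm.trade_wide")                       # hypothetical wide table

hot_cols = ["order_id", "merchant_id", "amount", "pay_time"]
cold_cols = ["order_id"] + [c for c in wide.columns if c not in hot_cols]

wide.select(*hot_cols).write.mode("overwrite").saveAsTable("dm.trade_hot")
wide.select(*cold_cols).write.mode("overwrite").saveAsTable("dm.trade_cold")
```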

