Sohu Smart Media's Road to Cost Reduction and Efficiency Improvement with Tencent Cloud Big Data EMR

In 2022, Sohu Smart Media completed its elastic computing migration project, moving its entire big data business to Tencent Cloud. Since the migration, service performance, cost control, and operation and maintenance efficiency have all improved markedly, and the expected cost reduction and efficiency goals have been met.

This article introduces Sohu Smart Media's big data business, the work and experience involved in migrating its basic systems, historical data, and business systems to Tencent Cloud Big Data EMR, and the key technical transformations made along the way.

Author of this article:

Zhai Dongbo, Senior Development Engineer, Sohu Smart Media R&D Center

Qi Laijun, Senior Development Engineer, Sohu Smart Media R&D Center

Big Data Business Overview

1.1 Big Data Business Classification 


Figure 1 - Big Data Business Classification Diagram

Smart Media's big data business is classified along two orthogonal dimensions: the data operation dimension and the data timeliness dimension.

1. By data operation, the business is divided into data production and analytical applications.

Data production can be understood as ETL in the broad sense, covering operations such as data cleaning, format conversion, and data association. Its main characteristics are, first, large-scale computation: large input and output volumes and long running times; and second, high fault tolerance: computing engines such as MapReduce and Spark can tolerate and retry individual task failures without affecting the execution of the whole job or application.

Analytical applications build on the output of data production and perform higher-level operations, mainly multi-dimensional analysis such as roll-up, drill-down, and slicing. Their main characteristic is a small result set: large data input, small data output, but very high requirements on query latency, generally at the second or even sub-second level.

2. By data timeliness, the business is divided into offline data and real-time data.

Offline data requires day-level or hour-level timeliness and is managed and operated with layered modeling.

Real-time data generally requires minute-level or second-level timeliness; in some IoT scenarios, latency even has to be kept at the sub-second level. Because real-time analysis accounts for a relatively small share of overall data analysis, it is usually managed with a "chimney" (siloed) development approach, building separate real-time tasks for different production and analysis requirements. In practice, a layered architecture similar to the offline one is also used: real-time data is modeled in layers, and the output of the lower-level models is written to Kafka in real time to feed the higher-level models. With the emergence of open source databases such as StarRocks that support high-throughput real-time writes, real-time applications have also begun to adopt a "post-calculation" mode: raw data is written into the database in real time as ODS tables through ETL, and complex ad hoc SQL queries involving joins and aggregations are then written against these ODS tables to meet different analysis requirements.

Combining the two dimensions above, the data business falls into four scenarios:

1) Offline analysis: mainly traditional data warehouse workloads, generally T+1 analysis scenarios; the main technologies used in practice include Impala, Presto, and StarRocks;

2) Real-time analysis: targets real-time data warehouse scenarios; the main technology used is StarRocks;

3) Offline ETL: targets batch processing scenarios; the main technologies used include Hive, Spark, and MapReduce;

4) Real-time ETL: targets stream processing scenarios; the main technologies used include StarRocks, Flink, and Spark Streaming. Since StarRocks Routine Load supports data cleaning, format conversion, and similar operations, some of Smart Media's real-time business scenarios use Routine Load directly for real-time data ETL (a minimal sketch of such a job follows this list).
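To illustrate the Routine Load approach mentioned above, here is a minimal sketch of such a job; the database, table, topic, and column names are hypothetical, and the exact properties depend on the StarRocks version in use:

```sql
-- Hypothetical Routine Load job: continuously consume click events from Kafka,
-- convert the raw unix timestamp and drop records without a user_id while loading.
CREATE ROUTINE LOAD demo_db.load_click_events ON ods_click_events
COLUMNS (user_id, page_id, ts, event_time = from_unixtime(ts)),
WHERE user_id != ''
PROPERTIES
(
    "format" = "json",
    "jsonpaths" = "[\"$.user_id\", \"$.page_id\", \"$.ts\"]",
    "desired_concurrent_number" = "3"
)
FROM KAFKA
(
    "kafka_broker_list" = "kafka-broker-1:9092,kafka-broker-2:9092",
    "kafka_topic" = "click_events",
    "property.kafka_default_offsets" = "OFFSET_BEGINNING"
);
```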

1.2 Big Data Business Architecture


Figure 2 - Big Data Business Architecture Diagram

Based on the business classification above, the figure shows the overall architecture of Smart Media's big data business, which is divided into three layers: the data source layer, the data computing and storage layer, and the data application layer.

1. The data source layer includes two categories: business data, which is directly operated on by business systems and mainly stored in databases such as MySQL, Oracle, and MongoDB; and log data, which represents business system events and is either collected from client-side tracking points or printed by servers.

2. Data computing and storage is the core layer of the whole architecture. The left side of the figure is the offline data business, which adopts a layered model and is generally organized into data warehouse layers such as STG/ODS/DWD/DWS/ADS; the right side is the real-time data business, which uses three development approaches, from left to right: chimney development, layered modeling, and post-calculation.

1) Chimney development builds separate real-time tasks for different businesses, and complex processing such as joins is performed inside the ETL task. Because join handling in stream processing is complicated and prone to accuracy issues, the share of chimney development is gradually decreasing;

2) Layered modeling applies the offline modeling approach to stratify real-time data and reduce chimney development. It is mainly used in scenarios such as real-time data updates and real-time recommendation models for online business;

3) Post-calculation generates ODS tables at the source layer through real-time ETL; analysis requirements are then expressed as SQL written directly against the ODS tables, deferring operations such as joins, windowing, and aggregation to query time (see the sketch after this list);

3. The data application layer mainly serves the department's internal BI systems, using Impala, Presto, and StarRocks to meet fast query needs in different scenarios.
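As a hedged illustration of the post-calculation approach described above, the query below joins two real-time ODS tables and aggregates them entirely at query time; all table and column names are hypothetical:

```sql
-- Hypothetical post-calculation query: exposure and click ODS tables are written
-- as-is by real-time ETL, and the join/aggregation happens only when the query runs.
SELECT
    e.dt,
    e.channel,
    COUNT(e.user_id)                          AS exposures,
    COUNT(c.user_id)                          AS clicks,
    COUNT(c.user_id) * 1.0 / COUNT(e.user_id) AS ctr
FROM ods_exposure_rt AS e
LEFT JOIN ods_click_rt AS c
       ON c.exposure_id = e.exposure_id
WHERE e.dt = '2023-05-01'
GROUP BY e.dt, e.channel
ORDER BY ctr DESC;
```

Window functions can be applied the same way, so operations that would otherwise live inside a streaming job are deferred to the analysis query.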

1.3 Off-Cloud Big Data Technical Architecture


Figure 3 - Off-Cloud Big Data Technical Architecture Diagram

Based on the business architecture described above, the off-cloud (IDC) big data technical architecture is shown in the figure above.

For offline data, log data is synchronized mainly with Flume, and business data with Sqoop plus self-developed Sqoop plugins for MongoDB, Elasticsearch, Redis, and other sources. Data is stored on HDFS, and batch processing uses Hive and Spark under YARN resource management. A self-developed offline data management platform applies layered modeling to build the offline data warehouse, manages the DAG of ODS/DWD/DWS/ADS processing tasks, and provides functions such as backfill, data quality, and metadata management. Offline analysis is mainly based on MPP query engines such as Impala, Presto, and StarRocks.

For real-time data, Flume is mainly used for log data synchronization, Canal and other CDC products are used for business data synchronization; Kafka and StarRocks are used for data storage, Flink and Spark Streaming are used for real-time ETL, and StarRocks is mainly used for real-time analysis. 

That is an overview of the off-cloud big data technical architecture. In summary, the off-cloud setup can be characterized as follows:

1) Basic systems: off the cloud, the Hadoop platform is provided by Sohu and maintained by a dedicated operations team; StarRocks is built and operated by our own team.

2) Historical data: the current accumulated historical data volume is at the 10PB level.

3) Business systems: self-built BI systems for reporting, OLAP, and ad hoc queries, plus the self-developed offline data management platform that manages the layered offline modeling tasks. The migration of the entire big data business to the cloud revolves around these three parts.

Over the years, the off-cloud architecture also ran into several pain points: data center expansion cycles are long, so computing resources cannot be added quickly when the business needs them; resources have to be reserved at all times regardless of whether computing tasks are running; and cost allocation is difficult when multiple business departments share computing resources. These pain points were resolved satisfactorily during the migration to the cloud.

The Road to Cloud Cost Reduction and Efficiency Increase

2.1 On-Cloud Big Data Technical Architecture


Figure 4 - On-Cloud Big Data Technical Architecture Diagram

To ensure a rapid migration, the big data components were moved to Tencent Cloud EMR as a like-for-like relocation. EMR optimizes the open source components at the kernel level while remaining fully compatible with them, which avoids business incompatibilities caused by component version differences and minimizes the workload, difficulty, and risk of the migration.

The main tasks of migrating big data to Tencent Cloud EMR are as follows: 

1. Basic system:

1) Off the cloud, Hadoop runs on CDH 5.XX; on the cloud we chose EMR 2.6. In practice, the component features of the two Hadoop versions are essentially compatible;

2) EMR offers a standalone StarRocks cluster type, which can fully replace the off-cloud StarRocks deployment;

3) For Flink we use Tencent's Oceanus, which offers fast Flink SQL development along with stronger task management and a more stable runtime environment. Because Oceanus is fully containerized, it achieves finer-grained resource management than traditional YARN-based scheduling; in ETL scenarios an operator can run on as little as 0.25 CPU, which greatly reduces computing costs (a hedged Flink SQL sketch appears after this list);

2. Historical data:

As a cloud-native big data platform, EMR natively supports a storage-compute separation architecture and can use object storage directly as the file system: components such as Hive, Spark, Impala, and Presto can operate directly on data in COS/OFS. Considering cost, we therefore decided to migrate the historical data in HDFS to OFS, the metadata-acceleration bucket type of object storage. This addresses the high cost of massive historical data without changing how the data is operated on at all, and makes it feasible to retain more historical data for analysis or machine learning training.

3. Business system:

Migrating the BI systems was relatively simple: once the data and basic systems were in place, it was enough to point the database connection settings at the new Impala, Presto, and StarRocks services. Migrating the offline data management platform involved much more work: the thousands of accumulated offline data tasks and their DAGs all had to run successfully on the cloud platform.
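To give a feel for the Flink SQL style of development on Oceanus mentioned in the basic-system list above, the following is a minimal sketch of a real-time ETL job; the topics, tables, fields, and connection settings are assumptions, not the actual production job:

```sql
-- Hypothetical Flink SQL ETL: read raw events from Kafka, filter and normalize them,
-- and write the cleaned stream to a downstream topic consumed by higher-level models.
CREATE TABLE raw_events (
    user_id     STRING,
    page_id     STRING,
    event_type  STRING,
    event_time  TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic' = 'raw_events',
    'properties.bootstrap.servers' = 'kafka-broker-1:9092',
    'properties.group.id' = 'flink_etl_demo',
    'scan.startup.mode' = 'latest-offset',
    'format' = 'json'
);

CREATE TABLE dwd_events (
    user_id     STRING,
    page_id     STRING,
    event_type  STRING,
    event_time  TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_events',
    'properties.bootstrap.servers' = 'kafka-broker-1:9092',
    'format' = 'json'
);

INSERT INTO dwd_events
SELECT user_id, page_id, LOWER(event_type), event_time
FROM raw_events
WHERE user_id IS NOT NULL AND page_id <> '';
```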

2.2 Main Migration Work

2.2.1 Basic system migration

The migration of the basic system mainly includes the following aspects:

1. Cluster planning and construction:

Based on the big data business scenarios and data processing flows, two EMR clusters were planned: one for offline data processing and one for the Spark Streaming tasks of the real-time data business. Two clusters were built mainly because offline processing has pronounced peak-and-valley resource usage and can take advantage of EMR's elastic scaling, whereas Spark Streaming tasks are all long-running and their resource consumption is very steady. Since EMR can build a cluster within minutes, cluster-level resource isolation is more effective than queue division within a single cluster, and it also makes subsequent cost allocation easier.

For StarRocks, only two coarse-grained clusters had been built off the cloud because of machine and maintenance cost constraints. On the cloud, StarRocks clusters are no longer limited by hardware and are much cheaper to create and maintain, so we created multiple fine-grained clusters split by business line to reduce interference between businesses.

Flink tasks run directly on Oceanus, Tencent's stream computing platform, which wraps Flink with a SQL API and connectors for common data sources, and adds many enhancements to the community kernel and CDC, making it far more convenient than running Flink ourselves in a Hadoop cluster. Oceanus also controls task resources down to 0.25 CU, whereas open source Flink allocates at least one slot per CPU, so the resource utilization of stream computing tasks improves considerably.

2. EMR offline cluster configuration and deployment optimization:

1) Dynamic elastic scaling policy: at first we used load-based scaling. During testing we found that users' submitted jobs often do not explicitly specify resource requirements, producing spikes in the resource utilization monitoring. If the load threshold is too sensitive, scaling is triggered repeatedly; if it is not sensitive enough, scaling responds too late. After discussing with Tencent Cloud architects, and observing that most offline tasks run in the early hours of the morning within a clear time window, we switched to time-based scaling, which simply and quickly met the business need for time-shared resources;

2) YARN scheduling: the off-cloud Hadoop cluster is a fixed, large resource pool whose cost is shared evenly by all users, so it uses fair scheduling. On the cloud we want to push resource utilization as high as possible: compared with the IDC's resident queues of over 10,000 cores, the usual resident capacity on EMR is only a few thousand cores. If fair scheduling were kept, a large number of tasks could request resources from the ResourceManager at the same time and none of them would get enough. On the advice of Tencent Cloud architects, we switched to capacity scheduling, so that resources are allocated preferentially to tasks already running at the front of the queue, ensuring they finish on time;

3) Hive configuration: based on tuning experience with the off-cloud Hive cluster and further exploration on EMR, many parameters were adjusted, such as JVM heap size, MR task memory, log level, and the number of session connections;

4) Impala/Presto: EMR supports deploying the ad hoc query engines on dedicated Task nodes to avoid resource contention from co-locating them with the NodeManager. Large or highly concurrent queries then put no pressure on the Master node, and the query engine can even be scaled out quickly on its own.

3. Cluster operation and maintenance:

Tencent Cloud EMR is a semi-managed PaaS product that is more flexible and customizable than an off-cloud Hadoop cluster. Even customers without deep operations experience can easily take part in operations using EMR's web-based operation and maintenance tools, tuning configurations to business needs for better performance and scalability.

Off the cloud, the company has a dedicated Hadoop operations team, and the business team shares its labor cost every month. After migrating to the cloud, Tencent Cloud provides a professional operations team with free 7x24 technical support, so problems are handled promptly, operations efficiency has improved noticeably, and the business runs stably.

On this basis, we want to push operations further toward automation and intelligence. Sohu and Tencent have jointly built an alarm-driven operations approach, with monitoring configured from multiple angles: EMR hardware/software monitoring alarms, Tencent Cloud back-end inspection alarms, and Sohu business monitoring alarms. The union of the three covers as many potential EMR failure scenarios as possible. Proactive operations reduce faults before they occur and significantly improve troubleshooting capability and operations efficiency.


Figure 5 - Composition of the complete monitoring and alarm system

2.2.2 Historical data migration

The migration of historical data mainly includes the following aspects:

1. Data Warehouse

To save storage costs, we migrated the data warehouse's historical data from the off-cloud HDFS to object storage, solving a series of problems along the way:

1) The off-cloud Hadoop cluster is shared by multiple business departments, so Kerberos authentication is enabled. On the cloud, isolation is already enforced at the network, security group, and cluster level, and the business side can only submit code through the scheduling system, so to simplify management the cloud cluster does not enable Kerberos; the DistCp data migration tasks are launched from the off-cloud Hadoop cluster;

2) COS-DistCp requires object storage dependency packages in the Hadoop cluster. To avoid changing the off-cloud production cluster, that part of the migration runs on the cloud EMR cluster: data is first copied to HDFS on the cloud with DistCp, then synchronized from the cloud HDFS to object storage with COS-DistCp; after the migration completes, the temporary HDFS data on the cloud is cleared directly with the SkipTrash option;

3) Bandwidth between the off-cloud data center and the cloud is limited, so we had to keep watching bandwidth consumption while copying data, and used the bandwidth and m parameters of Hadoop DistCp to control the migration tasks' bandwidth and Map concurrency;

4) Data verification: the Hadoop DistCp command cannot verify consistency between HDFS and object storage, so Tencent Cloud's COS-DistCp tool is used for verification after the data migration;

5) File times: the -pt parameter is used to preserve the file time attributes from the off-cloud HDFS when migrating to object storage, so that archiving can later be done based on those time attributes.

2. Hive metadata migration

Using the metadata management module of the offline data management platform, we obtained all off-cloud databases and tables, retrieved each table's DDL via SHOW CREATE TABLE, kept the table locations consistent with the off-cloud paths, and then created the tables in batch through the offline data management platform on the cloud (a hedged sketch follows).
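The per-table step looks roughly like the following; the database, table, columns, and location are hypothetical placeholders, and partition recovery via MSCK REPAIR is just one option (the platform may register partitions itself):

```sql
-- On the off-cloud Hive: capture each table's DDL.
SHOW CREATE TABLE dw.dwd_user_events;

-- On the cloud Hive: replay the captured DDL with the same location layout, then
-- attach the partitions so the data copied by DistCp becomes queryable.
CREATE EXTERNAL TABLE IF NOT EXISTS dw.dwd_user_events (
    user_id    STRING,
    page_id    STRING,
    event_time TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS PARQUET
LOCATION '/user/hive/warehouse/dw.db/dwd_user_events';

MSCK REPAIR TABLE dw.dwd_user_events;
```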

3. Raw Log Migration

The raw log data stored in the off-cloud HDFS was migrated to COS. Based on the business's usage patterns, data older than a month, which is rarely used, is placed in deep archive storage, and raw logs older than a week, which are used infrequently, are placed in infrequent-access storage; COS's deep archive and infrequent-access tiers further reduce storage costs.

4. StarRocks Migration

There are currently three main ways to migrate data from the off-cloud StarRocks to StarRocks on the cloud (hedged examples of the first two methods follow the list):

1) Export the off-cloud StarRocks data to HDFS with EXPORT, then import it into the cloud StarRocks with Broker Load. This method suits large-scale migration of tables without special field types.

2) Create external tables in the cloud StarRocks that point to the off-cloud StarRocks, then import the data with INSERT INTO ... SELECT. This method suits tables with HLL and Bitmap fields, but for large tables the import is relatively slow.

3) For data from the legacy Apache Doris system, the EXPORT method cannot be used because the StarRocks and Apache Doris data formats are incompatible; instead, query results are dumped to local files through the MySQL client and then imported into the cloud StarRocks via Stream Load.
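The first two methods look roughly like this; the cluster addresses, broker name, label, and table names are assumptions, and the exact EXPORT/Broker Load properties depend on the StarRocks version in use:

```sql
-- Method 1 (hypothetical): export an off-cloud table to HDFS, then load it on the cloud.
EXPORT TABLE dw_db.user_profile
TO "hdfs://transfer-namenode:8020/tmp/export/user_profile/"
WITH BROKER "broker_name" ("username" = "hdfs_user", "password" = "***");

-- On the cloud cluster:
LOAD LABEL dw_db.load_user_profile_demo
(
    DATA INFILE("hdfs://transfer-namenode:8020/tmp/export/user_profile/*")
    INTO TABLE user_profile
)
WITH BROKER "broker_name" ("username" = "hdfs_user", "password" = "***");

-- Method 2 (hypothetical): an external table on the cloud cluster that points at the
-- off-cloud StarRocks FE (created with ENGINE = olap and host/port/user/password/
-- database/table properties), suitable for tables with HLL or Bitmap columns:
INSERT INTO dw_db.user_profile SELECT * FROM dw_db.user_profile_offcloud_ext;
```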

2.2.3 Business system migration


Figure 6 - Schematic diagram of distributed deployment of business systems

The business system migration work mainly concerns the offline data management platform:

1. The service processes are deployed on Router nodes. Compared with the off-cloud setup, machine resources are more abundant, and Router nodes can be scaled out as needed;

2. Data task and table metadata live in MySQL and can be easily synchronized to the cloud with tools such as DTS;

3. Data task migration: with the support of the Tencent Cloud big data team, thousands of data tasks were run and verified with tooling, mainly to validate the Hive and Spark SQL statements in those tasks. SQL on and off the cloud is essentially compatible, and only a handful of compatibility issues were found across thousands of tasks. Testing also found that EMR's Hive CLI and Beeline consumed a lot of CPU at the start of execution, and the relevant JARs were replaced to address this.

Finally, through testing, parallel (double) running, and traffic switching, the entire data task DAG was gradually migrated to the cloud.

2.3 Transformation of key technologies

2.3.1 Storage-Compute Separation


For any enterprise moving from a traditional off-cloud IDC to cloud services, choosing how to store data is arguably the biggest question in the migration. Tencent Cloud offers three options:

1. EMR's native HDFS on local disks, which gives higher local throughput and lower storage cost.

2. EMR's native HDFS on cloud disks, which is more flexible and allows disk capacity to be expanded at any time.

3. Object storage (COS), which provides high data availability while storing only a single copy, saving storage cost; however, because reads change from local to network access, some read/write throughput is sacrificed.

Before choosing a plan, let's first look at the prices of these resources:


Table 1 - Resource Price List

1) Keeping everything on HDFS makes the EMR cost too high.

Taking 1 PB of data as an example: storing it on HDFS with D3 instances as DataNodes and 3 replicas (the annual disk failure rate is about 3 in 1,000, so fewer than 3 replicas risks data loss) requires at least 70 nodes, costing roughly 457,600 per month. OFS standard storage costs roughly 123,700 per month, and the archiving feature can reduce that further, so the gap is more than fivefold. In addition, heavy use of HDFS makes the cluster hard to scale elastically: only a couple of DataNodes can be decommissioned at a time, and decommissioning triggers rebalancing and other issues that affect normal business.

2) Using object storage (OFS) alone for complete storage-compute separation

Each object storage bucket has a network bandwidth cap on the order of tens of Gb/s, which hurts task efficiency when a large number of concurrent tasks run. With DataNodes, each machine has 10 Gb/s of bandwidth, so dozens of machines add up to hundreds of Gb/s.

3) Storage-compute separation combining Hadoop and OFS

After thorough discussions with Tencent Cloud about Sohu Smart Media's data architecture and business logic, we drew on architectures popular in the industry and jointly designed a storage-compute separation scheme that combines Hadoop with object storage. In Sohu Smart Media's data, the time sensitivity of offline data is very pronounced: the more recent the data, the more frequently it is used, and the large batch of offline tasks at night mainly processes that day's data. We therefore split data into hot and cold at a one-week boundary: cold data older than a week is settled into object storage (OFS) to reduce storage cost, while hot data from the most recent week stays on HDFS to keep data tasks efficient. Because Hadoop no longer stores the bulk of the data, HDFS DataNode resources can be compressed to a minimum, and YARN computing resources are deployed on D3 and SA3 nodes, with the SA3 nodes scaling elastically by time or by computing demand, which greatly reduces computing cost while preserving performance. The data on HDFS includes not only the output of the regular daily offline tasks but also historical data produced by backfills, which can pile up quickly in a short time, so the migration of cold data to OFS has to be timely, efficient, and reliable, without affecting the cluster.


Figure 7 - High-availability architecture for storage-compute-separated data migration

As shown in the figure above, the migration function is built into the offline data management platform mentioned earlier and implemented on the Quartz distributed task scheduling framework. A Distributer Job runs in Quartz, and Quartz's high-availability architecture guarantees that it keeps running. The Distributer Job reads Hive metadata in real time and checks whether the Location of each date partition older than one week already points to OFS; if not, the table information is placed into the task scheduling queue. To limit the load that data migration places on the system, only 10 Worker Jobs run migration tasks at any time; whenever a Worker Job finishes, the Distributer creates a new Worker Job in Quartz, and Quartz spreads these Worker Jobs evenly across the nodes. Inside each Worker Job, the data is migrated from HDFS to OFS, the Hive and Impala metadata is updated, and finally the data on HDFS is deleted (a hedged sketch of the metadata update follows).
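The per-partition metadata update performed by a Worker Job looks roughly like the following; the table, partition, and bucket names are hypothetical, and the actual paths and URI scheme depend on how the OFS bucket is configured:

```sql
-- Hypothetical example: after a cold partition's files have been copied to OFS,
-- point the Hive partition at the new location so queries read from object storage.
ALTER TABLE dw.dwd_user_events
    PARTITION (dt = '2022-05-01')
    SET LOCATION 'ofs://warehouse-bucket/user/hive/warehouse/dw.db/dwd_user_events/dt=2022-05-01';

-- Let Impala pick up the changed metadata so it stops reading the old HDFS path.
REFRESH dw.dwd_user_events PARTITION (dt = '2022-05-01');
```

Only after both steps succeed is the original HDFS partition directory deleted.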

In practice, storage-compute separation has had the following effects:

1) HDFS storage usage across the cluster is kept at around 65%, as shown in the figure below, greatly reducing the HDFS storage overhead that business data growth would otherwise cause.


Figure 8 - Trend chart of HDFS storage capacity of Tencent Cloud EMR in the past 7 days

2) The offline EMR cluster scales elastically by time: two-thirds of the total resources are spun up at midnight each day and released after 6 a.m., and during that window vCore utilization stays mostly above 90%, as shown in the figure below. For the rest of the day only one-third of the total resources is retained, with utilization around 60% and some short-lived peaks at certain points.


Figure 9 - YARN Vcores trend chart of Tencent Cloud EMR cluster in the past 7 days 

2.3.2 Cost Management 

On cost, Tencent Cloud EMR currently only reports cost at the whole-cluster level and cannot show the cost of a single task, whereas we want fine-grained cost management: collecting the resource usage of each person's tasks and then aggregating usage by individual and by team. To meet this need, we collect the relevant data ourselves and analyze it with StarRocks. Data collection has two parts:

One part is collected from YARN. The Cluster Applications API provided by YARN returns fields such as ID, AllocatedMB, and AllocatedVCores, as shown in the figure below; a scheduled task polls it every 5 seconds and writes the data into Kafka.


Figure 10 - Parameters provided by YARN's Cluster Applications API

The other part comes from the offline data management platform. Since every YARN Application is submitted from the platform, it corresponds to a Job in the platform. As shown in the figure below, the platform collects the log output of clients such as Hive and Spark, extracts the Application ID, and writes the Application ID together with the associated Job ID into Kafka. Note that for a Hive task, one Job ID may be associated with multiple Application IDs.


Figure 11 - Schematic diagram of interaction between offline data platform and EMR YARN

In StarRocks, two Routine Load jobs consume the data from Kafka, and a MySQL external table provides information such as the platform Job's ID and author. Following the "post-calculation" idea and relying on StarRocks's computing power, the three data sources are joined and aggregated for all kinds of analysis, for example to find which user consumed the most vCore resources within a given period (a hedged sketch follows).
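Below is a hedged sketch of this analysis; all database, table, and column names are assumptions based on the description above, and the two fact tables are assumed to be fed by the Routine Load jobs:

```sql
-- Hypothetical MySQL external table exposing the platform's job metadata to StarRocks.
CREATE EXTERNAL TABLE cost_db.platform_jobs (
    job_id BIGINT,
    author VARCHAR(64),
    team   VARCHAR(64)
)
ENGINE = mysql
PROPERTIES (
    "host" = "mysql-host",
    "port" = "3306",
    "user" = "reader",
    "password" = "***",
    "database" = "data_platform",
    "table" = "jobs"
);

-- Join the YARN application samples, the application-to-job mapping, and the job
-- metadata to rank users by approximate vCore usage over a time window. Since the
-- YARN metrics are 5-second samples of AllocatedVCores, SUM(samples) * 5 is used
-- here as a rough vcore-seconds figure.
SELECT j.author,
       SUM(m.allocated_vcores) * 5 AS approx_vcore_seconds
FROM cost_db.yarn_app_metrics AS m   -- fed by one Routine Load job
JOIN cost_db.app_job_mapping  AS a   -- fed by the other Routine Load job
  ON a.application_id = m.application_id
JOIN cost_db.platform_jobs    AS j
  ON j.job_id = a.job_id
WHERE m.collect_time BETWEEN '2023-05-01 00:00:00' AND '2023-05-07 23:59:59'
GROUP BY j.author
ORDER BY approx_vcore_seconds DESC
LIMIT 10;
```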

2.4 The effect of migrating to the cloud

1. In terms of cost reduction:

1) Compared with the total cost of the off-cloud Hadoop setup, cost control of big data on the cloud has achieved satisfactory results;

2) Expense management is clearer: costs can be attributed to teams, individuals, and even single tasks, and unnecessary computing demand can be curbed;

3) There are more cost control levers, such as EMR elastic scaling and COS/OFS archiving, which continuously improve cost control.

2. Efficiency enhancement:

1) A single data task runs faster on the cloud than off the cloud; the off-cloud Hadoop cluster is shared by multiple departments, its utilization stays high, and tasks interfere with each other heavily;

2) The daily baseline on-time rate has improved significantly, and data tasks finish before the workday starts, improving the efficiency of data use;

3) Resource expansion requests are met nimbly: when there is a large computing demand, resources can be expanded quickly to satisfy it;

4) Technical support is stronger, with Tencent Cloud mobilizing resources from many sides to solve all kinds of problems.

Subsequent planning

We will keep moving forward on the road of cost reduction and efficiency improvement, continuing to build together with Tencent Cloud.

1. In terms of cost reduction:

1) Enable OFS archiving and deep archiving, and develop supporting recovery functions to reduce the ever-increasing data storage costs;

2) Try the containerized version of EMR, scaling computing resources with load to achieve full elasticity;

3) Try managed PaaS/SaaS products to reduce operation and maintenance costs.

2. Efficiency enhancement:

1) Replace Impala/Presto with StarRocks as the unified entry point for interactive analysis; with vectorized execution and a CBO, StarRocks has clear performance advantages over Impala/Presto;

2) Use multiple OFS buckets to increase aggregate bandwidth and improve data access efficiency;

3) Explore new industry technologies such as data lake products to improve overall data development efficiency.
