NetEase Media's low-cost quasi-real-time computing practice based on Arctic

In NetEase Media's big data business there are many quasi-real-time computing scenarios, where the business side generally requires minute-level data freshness. In these scenarios, a traditional offline data warehouse cannot meet the freshness requirement, while a full-link real-time computing solution brings high resource consumption.


1 Project background

Take the real-time data warehouse of media push as an example. News pushes are highly uncertain in region, timing, and frequency, and occasional traffic spikes are common, especially when breaking social news occurs. Handling this with a full-link real-time computing solution means reserving a large resource buffer to absorb those spikes.

Because of the uncertainty in push timing, push business metrics are generally not incremental but cumulative indicators accumulated from the start of the day up to the present, with calculation windows typically ranging from fifteen minutes to half an hour. Statistical dimensions include sending type, content category, sending votes, push vendor, app-launch method, user activity, A/B experiment, and so on, and the workload is characterized by large traffic fluctuations and a wide variety of data calibers.

img

With the previous full-link Flink real-time computing solution, we mainly ran into the following problems:

(1) High resource cost

To cope with traffic spikes, real-time tasks have to be allocated generous reserved resources, and multiple aggregation tasks consume the same upstream data, which causes read amplification. Push-related real-time jobs account for more than 18% of all real-time tasks, and their resource usage accounts for nearly 25% of total real-time resource usage.

(2) Reduced task stability caused by large state

Window computation in the push scenario combines high traffic with large state; maintaining that state not only adds resource overhead but also easily hurts task stability.

(3) Difficult to recover data promptly when a task fails

When a real-time task fails, backfilling data through the real-time link is slow and complicated, while correcting it through an offline pipeline doubles the labor and storage cost.

2 Project idea and plan

2.1 Project ideas

Through our research on data lakes, we hoped to take advantage of real-time ingestion into the lake while using offline resources such as Spark to do the computation, meeting the needs of real-time computing scenarios at a lower cost. We chose the push business scenario as the pilot to explore and implement the solution, and then plan to gradually promote it to more similar business scenarios.

Based on our research into open source data lake solutions, we noticed NetEase Shufan's open source Arctic, a data lake solution built on Apache Iceberg. Arctic supports and serves mixed streaming and batch scenarios well, and its open, stacked architecture on top of existing formats lets us transition smoothly and upgrade from Hive to the data lake. Since the media offline data warehouse is already connected to several of the related components, the cost of migrating existing business to Arctic is low, so we decided to try Arctic to address the pain points in the push business scenario.

Arctic is a streaming lakehouse system open sourced by NetEase Shufan that adds more real-time capabilities on top of Iceberg and Hive. With Arctic, users can get better-optimized CDC, streaming updates, OLAP, and other capabilities on engines such as Flink, Spark, Trino, and Impala.

img

To carry out the data lake transformation of the push business scenario, all we need is the Flink Connector provided by Arctic to quickly ingest push detail data into the lake in real time.

What we need to pay attention to here is that the data output must meet the business's minute-level requirement. Data output latency consists of two parts:

  • Data ready latency: depends on the Commit interval of the Flink real-time task, generally at the minute level;
  • Data computation time: depends on the compute engine and the business logic.

Data output latency = data ready latency + data computation time

2.2 Solutions

2.2.1 Real-time data ingestion into the lake

Arctic is compatible with existing storage media (such as HDFS) and table structures (such as Hive and Iceberg), and provides transparent streaming and batch table services on top of them. Its storage structure is mainly divided into two parts, the Basestore and the Changestore:

(1) The Basestore stores the table's stock data. It is usually written first by an engine such as Spark or Flink, and is then updated with data converted from the Changestore through the automatic structure optimization process.

(2) The Changestore stores the most recent change data of the table. It is usually written in real time by Apache Flink tasks and consumed by downstream Flink tasks for near-real-time streaming. It can also be batch-queried directly, or combined with the data in the Basestore through Merge-On-Read (hereinafter MOR) to provide batch query capability with minute-level latency.
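To make the structure concrete, here is a minimal sketch of how a primary-key Arctic table for push detail data might be declared through Flink SQL, assuming an Arctic catalog has already been registered in the Flink SQL environment; the catalog, database, table, and field names are illustrative, not our actual production schema.

```sql
-- Minimal sketch (hypothetical identifiers): a primary-key Arctic table.
-- Streaming writes land in the Changestore; the self-optimizing process
-- later merges them into the Basestore.
CREATE TABLE IF NOT EXISTS arctic_catalog.push_db.push_detail (
    event_id    STRING,
    user_id     STRING,
    event_type  STRING,        -- send / arrive / click / show
    event_time  TIMESTAMP(3),
    PRIMARY KEY (event_id) NOT ENFORCED
);
```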

img

Arctic tables support streaming writes of real-time data. To keep data fresh, the writer has to commit frequently, which produces a large number of small files. A backlog of small files hurts query performance on one hand and puts pressure on the file system on the other. To address this, Arctic supports primary-key-based row-level updates and provides an Optimizer for data updates and automatic structure optimization, helping users solve common data lake problems such as small files, read amplification, and write amplification.

Taking the media push data warehouse as an example, detail data such as push send, arrival, click, and display events must be written into Arctic in real time by Flink jobs. Since the upstream has already done the ETL cleaning, at this stage the upstream data can be written into the Changestore simply through Flink SQL. The Changestore contains insert files that store inserted data and equality delete files that store deleted data; an update is split into a pre-update image and a post-update image, stored in the delete files and insert files respectively.
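The ingestion job itself can then be as simple as a Flink SQL INSERT from the cleaned upstream stream into the Arctic table; this is only a sketch with a hypothetical source table.

```sql
-- Streaming write into the Arctic table (and hence its Changestore);
-- `ods_push_events` is a hypothetical, already-cleaned upstream source table.
INSERT INTO arctic_catalog.push_db.push_detail
SELECT event_id, user_id, event_type, event_time
FROM ods_push_events;
```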

Specifically, for tables with a primary key, insert/update_after messages are written to the Changestore's insert files, and delete/update_before messages are written to Arctic's delete files. During optimization, the delete files are first read into memory to form a delete map whose key is the record's primary key and whose value is its record_lsn. The insert files in the Basestore and Changestore are then read, and rows with the same primary key are compared by record_lsn: if an insert record's record_lsn is smaller than the record_lsn for the same primary key in the delete map, the record is considered deleted and is not added to the base; otherwise the data is written into a new file, thus achieving row-level updates.
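For intuition, the merge semantics described above can also be expressed as a plain relational query; this is only a conceptual sketch with hypothetical table and column names, not Arctic's actual implementation.

```sql
-- Conceptual sketch of the merge: an insert row survives unless a delete
-- record with the same primary key and a larger record_lsn exists.
SELECT i.*
FROM insert_records i          -- rows from Basestore + Changestore insert files
LEFT JOIN delete_records d     -- rows from Changestore equality delete files
       ON  i.pk = d.pk
       AND i.record_lsn < d.record_lsn
WHERE d.pk IS NULL;
```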

2.2.2 Lake water level awareness

Traditional offline computing needs a trigger mechanism for scheduling, which is generally handled by the job scheduling system according to the dependencies between tasks: when all upstream tasks succeed, the downstream task is automatically launched. In a real-time ingestion scenario, however, downstream tasks have no way of knowing whether the data is ready. Taking the push scenario as an example, the indicators to be produced are mainly various statistics accumulated over the current day at a specified time granularity. If the downstream cannot perceive the current water level of the lake, it either has to reserve a redundant buffer time to make sure the data is ready, or it risks missing data, given how volatile the traffic is in the push scenario.

The media big data team and the Arctic team drew on Flink's Watermark mechanism and on a solution discussed in the Iceberg community: the Watermark information is written into the metadata files of the Iceberg table, and Arctic then exposes it through a message queue or an API so that downstream tasks can actively perceive it, reducing startup delay as much as possible. The specific plan is as follows:

(1) Arctic table water level awareness

At present only the Flink-write scenario is considered; the business defines the event time and Watermark in the Flink source. The ArcticSinkConnector contains two operators: a multi-parallelism ArcticWriter responsible for writing files, and a single-parallelism ArcticFileCommitter responsible for committing them. When a checkpoint is executed, the ArcticFileCommitter operator takes the minimum Watermark after Watermark alignment and creates an AMS Transaction similar to an Iceberg transaction; within this transaction, in addition to doing AppendFiles to Iceberg, it reports the TransactionID and Watermark to AMS through AMS's Thrift interface.
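On the write side this just means declaring event time and a Watermark in the Flink source as usual; a minimal sketch with a hypothetical Kafka source follows.

```sql
-- Hypothetical Kafka source with event time and a 1-minute watermark.
-- At checkpoint time the ArcticFileCommitter takes the aligned minimum
-- watermark and reports it to AMS together with the TransactionID.
CREATE TABLE ods_push_events (
    event_id    STRING,
    user_id     STRING,
    event_type  STRING,
    event_time  TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '1' MINUTE
) WITH (
    'connector' = 'kafka',
    'topic' = '...',                          -- placeholders, not real values
    'properties.bootstrap.servers' = '...',
    'format' = 'json'
);
```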

img

(2) Hive table water level awareness

The data visible in the Hive table is the data after Optimize. Optimize is scheduled by AMS and executed asynchronously by Flink tasks, which read, write, and merge files and report metrics back to AMS; AMS then commits the result of that Optimize run. AMS therefore naturally knows which Transaction each Optimize has advanced to, and since AMS also stores the Watermark corresponding to each Transaction, it knows how far the Hive table's water level has advanced.

2.2.3 Data Lake Query

Arctic provides connector support for compute engines such as Spark, Flink, Trino, and Impala. Through the Arctic data source, each engine can read committed files in real time, with the Commit interval generally at the minute level according to business requirements. Below, the push business is used as an example to introduce the query solutions and their costs in several scenarios:

(1) Arctic + Trino/Impala satisfies second-level OLAP queries

In OLAP scenarios, users generally care more about query time and are relatively insensitive to data readiness. For Arctic tables with small or medium data volumes, or for relatively simple queries, OLAP queries through Trino or Impala are an efficient option that can basically achieve second-level MOR query latency. In terms of cost, a Trino or Impala cluster needs to be built; if the team is already running one, it can be reused according to its load.
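For example, an ad-hoc day-to-date count by send type can be issued directly from Trino against the Arctic table and answered via MOR; this assumes an Arctic catalog named `arctic` has been configured in Trino, and all identifiers are illustrative.

```sql
-- Second-level OLAP query over freshly committed data, merged on read.
SELECT send_type,
       count(*)                AS send_cnt,
       count(DISTINCT user_id) AS send_uv
FROM arctic.push_db.push_detail
WHERE event_type = 'send'
  AND event_time >= CAST(current_date AS timestamp)
GROUP BY send_type;
```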

img

Arctic released its own benchmark results at its open source launch event. In the scenario of continuous streaming CDC ingestion from a database, comparing the OLAP benchmark performance of the various data lake formats, Arctic with Optimize enabled performs better overall than Hudi, mainly because Arctic has an efficient internal file index, the Arctic Tree, which enables finer-grained and more precise merging in MOR scenarios. For the detailed comparison report, see https://arctic.netease.com/ch/benchmark/.

img

(2) Arctic + Spark satisfies minute-level pre-aggregation queries

For scenarios that feed downstream data reports, the results generally need to be pre-computed and persisted, which is highly sensitive to both data readiness and computation time; the query logic is relatively complex, our Trino/Impala cluster is relatively small, and queries there fail easily, so stability suffers. In this scenario we use Spark, the engine with our largest cluster deployment, which reuses offline computing resources without introducing new resource cost.
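A single pre-aggregation job in this scenario boils down to a Spark SQL query like the following sketch, which recomputes the cumulative day-to-date indicators per dimension and persists them (here into a hypothetical result table, which would then be synced to the relational database for display); all identifiers are illustrative.

```sql
-- Minute-level scheduled Spark SQL job: cumulative indicators since 00:00,
-- grouped by a couple of example dimensions (hypothetical schema).
INSERT OVERWRITE TABLE rpt_push_realtime_stats
SELECT send_type,
       content_category,
       count(IF(event_type = 'send',  1, NULL))                 AS send_cnt,
       count(IF(event_type = 'click', 1, NULL))                 AS click_cnt,
       count(DISTINCT IF(event_type = 'click', user_id, NULL))  AS click_uv
FROM arctic_catalog.push_db.push_detail
WHERE event_time >= date_trunc('DAY', current_timestamp())
GROUP BY send_type, content_category;
```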

In terms of data readiness, a low, minute-level ready latency can be achieved through the Arctic table water level awareness solution.

In terms of computation, Arctic provides some read optimizations for the Spark Connector. By configuring the two parameters read.split.planning-parallelism and read.split.planning-parallelism-factor on the Arctic table, users can adjust the number of Arctic combine tasks and thereby control the parallelism of the computing job. Since Spark's offline computing resources are relatively flexible and plentiful, we can ensure the business's computation finishes within 2 to 3 minutes by tuning this parallelism.
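As a sketch, the two parameters mentioned above could be set as properties on the Arctic table roughly as follows; the values are purely illustrative and the exact way of configuring them should follow the Arctic documentation.

```sql
-- Tune how Arctic plans combine tasks for this table, which in turn
-- controls the read parallelism of the Spark pre-aggregation job.
ALTER TABLE arctic_catalog.push_db.push_detail SET TBLPROPERTIES (
    'read.split.planning-parallelism'        = '8',   -- illustrative value
    'read.split.planning-parallelism-factor' = '2'    -- illustrative value
);
```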

img

(3) Hive + Spark satisfies the scheduling of traditional offline data warehouse production links

Arctic supports using a Hive table as the Basestore. During Full Optimize, files are written into the Hive data directory so that the content visible to native Hive reads is updated, reducing cost through a stream-batch unified storage architecture. The traditional offline data warehouse production link can therefore use the corresponding Hive table directly as part of the offline pipeline. Compared with the Arctic table, this loses MOR in terms of timeliness, but with the Hive table water level awareness solution the ready latency is still acceptable to the business, so it can satisfy the scheduling of traditional offline data warehouse production links.
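Because Full Optimize keeps writing merged files into the Hive data directory, downstream offline jobs can keep reading the table as an ordinary Hive table with no code changes; a trivial sketch with hypothetical identifiers:

```sql
-- Traditional offline link: read the Hive Basestore directly, without MOR.
SELECT send_type,
       count(*) AS send_cnt
FROM push_db.push_detail          -- read as a native Hive table
WHERE dt = '${bizdate}'           -- partition value filled in by the scheduler
GROUP BY send_type;
```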

img

3 Project influence and output value

3.1 Project influence

Through exploring and implementing the Arctic + X solution in the media business, we have provided a new option for the media's quasi-real-time computing scenarios. This approach not only relieves the real-time resource pressure and the development and maintenance burden brought by a full-link Flink real-time computing solution, but also makes better reuse of existing storage and computing resources such as HDFS and Spark, achieving cost reduction and efficiency improvement.

In addition, Arctic has also been adopted in several other business units such as NetEase Cloud Music and Youdao. For example, the Music common technology team uses it to store cold ES data, reducing their ES storage cost, and the R&D team of Youdao Premium Courses is also actively exploring Arctic as a solution for some of its business scenarios.

Arctic is now open source on GitHub and has received continuous attention from the open source community and external users; during its construction and development it has also received many high-quality PRs from external contributors.

3.2 Project output value

With the above solution, we use Flink to write push ETL detail data into Arctic in real time, then configure minute-level scheduled tasks on the scheduling platform to compute across different dimension combinations, write the cumulative indicators into a relational database, and finally display the data continuously, achieving the minute-level data freshness required by the business side. Compared with the original full-link Flink real-time computing solution, the new solution:

(1) Fully reuses idle offline computing power, reducing real-time computing resource costs

The solution uses offline computing resources that would otherwise sit idle and introduces essentially no new resource expenditure. Offline computing workloads inevitably peak in the early morning hours, while news pushes and breaking news mostly happen outside those hours; by reusing the offline cluster under the premise of meeting quasi-real-time timeliness, its overall utilization is improved. In addition, this solution freed up about 2.4 TB of real-time computing memory resources.

(2) Reduces task maintenance costs and improves task stability

The Arctic + Spark water-level-triggered scheduling scheme removes the maintenance cost of more than 17 real-time tasks and avoids the stability problems caused by the large state of Flink real-time computing jobs. Scheduling the computation offline with Spark lets us make full use of the offline resource pool to adjust parallelism, effectively improving robustness when facing traffic spikes from breaking news.

(3) Improves the ability to repair abnormal data and reduces the time spent on data repair

With Arctic's stream-batch unified data lake storage architecture, abnormal data can be repaired flexibly when corrections are needed, reducing the cost of corrections; whereas backfilling through the real-time link or correcting through an extra offline pipeline would require redoing the state accumulation or a complex ETL process.

4 Future planning and outlook of the project

At present there are still some scenarios that Arctic does not support well. The media big data team and the Arctic team will continue to explore and implement solutions in the following areas:

(1) The push detail data currently goes through a join of multiple upstream data streams before entering the lake, which also suffers from the large-state problem. Arctic currently only supports row-level updates; if partial column updates on primary-key tables were supported, businesses could implement multi-stream joins directly at ingestion time at a lower cost.

(2) Further improve the water level definition and awareness scheme for Arctic tables and Hive tables, improve timeliness, and promote it to more business scenarios. The current scheme only supports the case where a single Spark or Flink task writes to the table; it needs further improvement for the case of multiple tasks writing to a table concurrently.

That's all for this sharing, thank you all.

Authors: Lu Chengxiang, Ma Yifan
