iQIYI Data Lake in Action: Applying the Data Lake to Advertising Data

01

   Background

Advertising data mainly consists of ad requests for performance, brand, and ADX advertising, together with the series of logs generated along the delivery pipeline. After processing, the data is used in scenarios such as algorithm model training, advertising operations analysis, and ad placement decisions. The advertising business has high requirements on data timeliness, accuracy, and query performance. The overall advertising data pipeline currently adopts the Lambda architecture, with separate offline and real-time links, which brings high usage costs and the risk of data inconsistency.

To solve these problems, the advertising data team and the iQIYI big data team actively researched cutting-edge big data technologies and followed the rise of data lake technology from early on. A data lake not only supports large-scale data storage, but also offers near-real-time latency and interactive query efficiency, which fits the needs of advertising data scenarios well. Targeting several pain points and difficulties in advertising data, we made a series of attempts with the data lake. This article briefly introduces them by requirement and business scenario.

Beyond advertising, the data lake has been applied in more than 20 iQIYI business scenarios, greatly improving the efficiency of data circulation and helping the business move faster. For more technical details and applications of the data lake, see the previously published "iQIYI Data Lake in Action".

02

   Advertising Data Architecture

Advertising data analysis scenarios often need to query data from the past several months, which involves a large volume of data, while also requiring low end-to-end latency and fast queries. A data warehouse built on Hive cannot meet such requirements. Before migrating to the data lake, the advertising data pipeline adopted the Lambda architecture commonly used in the industry:

  • Real-time link: Spark Streaming jobs consume real-time data from Kafka and write it to Kudu. To improve query performance, the Kudu service is deployed on an independent OLAP cluster; for cost reasons, only the last 7 days of data are retained

  • Offline link: advertising queries also need the last 90 days of data, so the advertising Hive data of the last 90 days on the public cluster (shared by multiple businesses) is synchronized to Hive tables on the independent OLAP cluster

  • Query: queries are split between real-time and offline tables based on data progress, and Impala automatically stitches together the data in Kudu and Hive

As shown in Figure 2-1.


Figure 2-1 Advertising Lambda architecture

This approach has the following drawbacks:

  • The Lambda architecture relies on several different technical frameworks, which raises the cost of developing and using the data

  • Offline data is synchronized periodically, resulting in large data delays

  • The real-time link depends heavily on Kudu and cannot guarantee end-to-end data consistency

  • The independent OLAP cluster causes data redundancy and extra storage costs

To solve these pain points, the advertising team introduced data lake technology. The characteristics of the data lake are a much better fit for the needs of the advertising business:

  • Near-real-time writes: data freshness is determined by the commit frequency, so data delay can be at the minute level

  • Unified stream and batch storage: the data lake supports both real-time writes and offline overwrites, eliminating the need for two heterogeneous storage systems

  • Strong consistency: modifications to the data lake are atomic, enabling exactly-once semantics for real-time writes

  • Low cost: the data lake can share existing large-scale storage such as HDFS

Multiple advertising data scenarios have now been connected to the data lake, greatly improving data timeliness. The next chapter introduces our architectural transformation and related optimizations for these scenarios.

03

   Advertising Data Lake Applications

  • Real-time retrieval of business data

When advertisers check information about their ads, they generally look at the ad's budget, current spend, and inventory. Spend and inventory can be computed from the traffic data, with both real-time and offline reports available for query, but budget-related data lives in the MySQL business database and is pulled with Sqoop, so only offline reports are available, with a delay of more than one hour. To optimize the user experience and improve overall timeliness, the freshness of the budget table had to be improved. We used Flink CDC together with the update capability of Iceberg v2 tables to rebuild the budget link, as shown in Figure 3-1.


Figure 3-1 Budget link

Budget-related data mainly involves joining several tables such as advertisers and ad orders. We use the Flink CDC connector to read the MySQL binlog and write the joined result set into an Iceberg v2 table. When the table first went live, a new file was generated on every update, producing so many small files that after a few days the table could no longer be queried. To solve this, we first adjusted the table structure by adding bucket partitioning, combined with a scheduled compaction strategy on the bucket partitions; we then set the write distribution mode to hash, so that data is shuffled by partition before landing in the table, avoiding one file per writer node.
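For illustration, below is a minimal Flink (Java) sketch of such a pipeline. Everything in it is an assumption for the example: the budget table schema, the connection settings, and the catalog and table names are placeholders, and the Iceberg v2 sink is assumed to have been created in advance with bucket partitioning, upsert enabled, and hash write distribution.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class BudgetCdcToIcebergJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Iceberg commits data on each checkpoint, so the checkpoint interval
        // bounds the freshness of the budget table (here: 1 minute).
        env.enableCheckpointing(60_000L);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Hypothetical MySQL budget table read via the Flink CDC connector;
        // schema and connection settings are placeholders.
        tEnv.executeSql(
            "CREATE TABLE budget_src ("
                + "  order_id BIGINT,"
                + "  advertiser_id BIGINT,"
                + "  budget DECIMAL(18, 2),"
                + "  PRIMARY KEY (order_id) NOT ENFORCED"
                + ") WITH ("
                + "  'connector' = 'mysql-cdc',"
                + "  'hostname' = 'mysql.example.com',"
                + "  'port' = '3306',"
                + "  'username' = 'reader',"
                + "  'password' = '******',"
                + "  'database-name' = 'ad_business',"
                + "  'table-name' = 'budget'"
                + ")");

        // The Iceberg v2 sink is assumed to be created beforehand with:
        //   'format-version' = '2', 'write.upsert.enabled' = 'true',
        //   'write.distribution-mode' = 'hash' (shuffle by partition before writing)
        //   and a bucket partition spec, so each commit produces few, larger files.
        tEnv.executeSql(
            "INSERT INTO iceberg_catalog.ad_dw.dim_budget "
                + "SELECT order_id, advertiser_id, budget FROM budget_src");
    }
}
```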

The reports exposed externally all have time partitions, such as hourly or daily data, so the table produced by Flink CDC above cannot be used externally as-is. A downstream scheduled job performs a full read and generates hour-level reports, reducing the delay from 1 hour to 5 minutes. Budget can now be viewed alongside spend and inventory data, completing the latency optimization of the overall link.

  • Real-time data warehouse

Incremental reads from Iceberg are the key to building a real-time data warehouse on the data lake. To further verify feasibility, we used the inventory data, which has a large volume, for validation. Previously, the intermediate results of real-time advertising data were all written to Kafka, with only the externally facing report data landed to storage. Although Kafka meets the advertising requirements in terms of efficiency, data cannot be retained for long, so when a problem needs to be investigated it is very difficult to troubleshoot without intermediate detailed data. Thanks to Iceberg's large storage capacity and efficiency, a real-time data warehouse built on Iceberg can satisfy data traceability and rerun requirements while keeping data delay within 5 minutes. The solution is shown in Figure 3-2.


Figure 3-2 Incremental reads from Iceberg

The raw logs are joined with dictionary tables and then land in the lake as an ODS table. Downstream reports read the ODS table, compute dimensions and metrics to produce an intermediate table, and then generate the ADS table. Throughout this process, if a data issue arises or troubleshooting is needed, the intermediate detailed data is preserved in Iceberg and can be inspected at any time.

The main problem encountered here was again too many small files, which caused data delays. The ODS table is partitioned by day and hour, the job parallelism is 100, and the checkpoint interval is 1 minute, so 100 small files were generated every minute. The downstream task reading the ODS table had to handle too many small files and did not limit the number of snapshots read at a time, which made checkpoints too large and caused them to fail, resulting in large data delays. To solve this, a bucket partition was added to the table. In our experience a node can process about 160 MB per minute, so the number of buckets is chosen so that each commit produces files of roughly 100 MB. At the same time, the maximum number of snapshots read per planning cycle is limited. Note also that after a checkpoint failure the task recovers from the last checkpoint; if the checkpoint cannot be recovered, the task needs to be configured with a starting snapshot id, otherwise it will re-read the source table from the beginning.
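A hedged sketch of such an incremental read follows, assuming hypothetical catalog, table, and column names and a recent Iceberg Flink connector in which the 'streaming', 'monitor-interval', 'start-snapshot-id', and 'max-planning-snapshot-count' read options are available; the snapshot id shown is a placeholder.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class OdsIncrementalReadJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // 1-minute checkpoints, matching the commit cadence
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Continuously consume new snapshots of the ODS table and append the
        // derived rows to a downstream table. All names are placeholders.
        tEnv.executeSql(
            "INSERT INTO iceberg_catalog.ad_dw.dwd_inventory "
                + "SELECT dt, hr, order_id, cost, cnt "
                + "FROM iceberg_catalog.ad_dw.ods_inventory "
                + "/*+ OPTIONS("
                + "  'streaming' = 'true',"                 // incremental (streaming) read mode
                + "  'monitor-interval' = '60s',"           // poll for new snapshots every minute
                + "  'max-planning-snapshot-count' = '5',"  // cap snapshots consumed per planning cycle
                + "  'start-snapshot-id' = '1234567890'"    // placeholder; used when no checkpoint exists
                + ") */ "
                + "WHERE event_type = 'impression'");
    }
}
```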

After these optimizations went live, the ADS report delay is about 3-4 minutes, which is in line with expectations.

  • Real-time OLAP analysis

In the existing Lambda architecture for advertising data, real-time data is written to Kudu and offline data is synchronized to the OLAP cluster, with Impala providing unified queries. Because Impala resources are limited, frequent queries over millions of rows put so much pressure on the cluster that it can become unavailable. In addition, many reports only have offline data and therefore large delays. To relieve cluster pressure and improve data freshness, we write both real-time and offline data into Iceberg. The advertising Qilin hourly report is described below as a concrete example.

Qilin currently only has hourly reports, with a data delay of about 2-3 hours, which is relatively high. As the Qilin business has gradually expanded, requirements on data timeliness have grown and related data needs to be observed in real time, so a real-time hourly report project was launched, as shown in Figure 3-3.


Figure 3-3 Qilin Hourly Report Optimization

The real-time data mainly consists of dimension tables and raw logs. Dimension tables are incrementally synchronized to Redis in real time through Flink CDC. Raw logs are enriched by joining against the Redis dimension data (using asynchronous lookups and caching to improve join efficiency) to generate an ODS table, which is written to Kafka; the ODS table is then read to compute metrics and dimensions, and the results are written to an Iceberg table. Small-file problems are solved through shuffling and bucket partitioning, and active caching is enabled to speed up queries.
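The asynchronous dimension lookup can be sketched with Flink's async I/O and the Lettuce Redis client. Everything here is illustrative only: the Redis address, the key layout, the use of a plain order id as input, and the timeout/capacity values are assumptions, and a local cache in front of Redis is omitted for brevity.

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.api.StatefulRedisConnection;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.util.Collections;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/** Enrich a raw log keyed by order id with its dimension hash stored in Redis. */
public class RedisDimLookup extends RichAsyncFunction<String, Map<String, String>> {
    private transient RedisClient client;
    private transient StatefulRedisConnection<String, String> connection;

    @Override
    public void open(Configuration parameters) {
        client = RedisClient.create("redis://dim-redis.example.com:6379"); // placeholder address
        connection = client.connect();
    }

    @Override
    public void asyncInvoke(String orderId, ResultFuture<Map<String, String>> resultFuture) {
        // Lettuce returns a CompletionStage, so the lookup never blocks the task thread.
        connection.async()
            .hgetall("dim:order:" + orderId) // assumed key layout written by the CDC sync job
            .whenComplete((dim, err) -> {
                Map<String, String> result =
                    (err == null && dim != null) ? dim : Collections.<String, String>emptyMap();
                resultFuture.complete(Collections.singleton(result));
            });
    }

    @Override
    public void close() {
        if (connection != null) connection.close();
        if (client != null) client.shutdown();
    }

    /** Wire the lookup into a stream of order ids with async I/O. */
    public static DataStream<Map<String, String>> enrich(DataStream<String> orderIds) {
        return AsyncDataStream.unorderedWait(
            orderIds, new RedisDimLookup(), 500, TimeUnit.MILLISECONDS, 100);
    }
}
```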

The offline path reads the HDFS logs, computes the relevant business dimension tables through each layer of the offline data warehouse, and produces the Qilin report. After the offline report is produced, it overwrites the real-time table, with both hour-level and day-level overwrites, and active caching is enabled at the same time.
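The hour-level overwrite can be expressed as a batch INSERT OVERWRITE into the same Iceberg table that the real-time job appends to. Below is a minimal sketch in Flink batch SQL (the production offline job may well use a different engine); the table names, columns, and partition values are placeholders, and both catalogs are assumed to be registered already.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class QilinOfflineOverwriteJob {
    public static void main(String[] args) {
        // Batch mode: INSERT OVERWRITE into Iceberg is only supported for batch jobs.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Replace one hour partition of the real-time table with the offline result.
        tEnv.executeSql(
            "INSERT OVERWRITE iceberg_catalog.ad_dw.qilin_hourly "
                + "PARTITION (dt = '2023-07-01', hr = '10') "
                + "SELECT order_id, imp_cnt, clk_cnt, cost "
                + "FROM offline_dw.qilin_hourly_offline "
                + "WHERE dt = '2023-07-01' AND hr = '10'");
    }
}
```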

With this transformation of the real-time and offline paths, the overall delay is reduced from 2-3 hours to 3-4 minutes.

  • Real-time ETL data landing

In performance advertising, a series of logs related to ad behavior, such as ad exposures and clicks, are recorded via Tracking logs. As business requirements have iterated, more and more fields need to be carried in the Tracking log and the Tracking URL keeps getting longer, which causes two problems: (1) the ad response payload grows, increasing response latency; (2) the front end may truncate the Tracking URL when reporting, causing information loss. To solve these problems, the Tracking log was split into two real-time streams, "billing data" and "traffic data". However, to make downstream links easier to build and use, the two parts need to be joined back into a whole on the data side. Also considering future stream-batch unified computation and better timeliness for downstream computation, we decided to land the data in the lake. The architecture is shown in Figure 3-4.

Figure 3-4 Real-time ETL data landing

For the convenience of downstream users, the "traffic data" and "billing data" are merged into one table and written to Iceberg, which requires joining the two streams. Because of business needs and association accuracy requirements, the "traffic data" must be stored for a long time, and its volume is on the order of tens of terabytes, so we chose HBase for that storage; the "billing data" is then joined against HBase in real-time micro-batches. Since there may be a time gap between the two streams and the first lookup may miss, we retry the association, which basically ensures that the two datasets can be fully associated under reasonable conditions. Because 3 retries are configured, there is a data delay of more than ten minutes, which can be further optimized in the future. In addition, to address the problem of many small files in the Iceberg data, a small-file compaction strategy is configured, which significantly reduces the number of small files. A data lake warehouse application link has already been built on top of the data that has entered the lake; it will gradually be put into use and promoted as a template to other data lake warehouses.
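As an illustration of the small-file compaction, here is a minimal sketch using Iceberg's Flink maintenance action. The table location and target file size are assumptions; in practice the table would be loaded from the production catalog and the job run on a schedule.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;

public class CompactTrackingTable {
    public static void main(String[] args) {
        // Placeholder Hadoop table location; a Hive-catalog table would be
        // loaded with TableLoader.fromCatalog(...) instead.
        TableLoader loader = TableLoader.fromHadoopTable("hdfs://ns/warehouse/ad_dw/tracking_merged");
        loader.open();
        Table table = loader.loadTable();

        // Bin-pack small data files toward ~128 MB targets.
        Actions.forTable(table)
            .rewriteDataFiles()
            .targetSizeInBytes(128L * 1024 * 1024)
            .execute();
    }
}
```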

The link is now fairly stable, and the association success rate of key metrics is 99% or above. Compared with the offline link's delay of at least several hours, this link's delay is only about ten minutes (mainly due to the association retries), greatly improving data timeliness. Going forward, stream-batch unified computation can be built on the data lake so that calculation logic is unified.

04

   Future Outlook

The data lake is developing rapidly, and its adoption within the company is growing quickly. Next, advertising data will use the data lake to move toward stream-batch unification. Currently, offline data lands in HDFS with poor timeliness, and maintaining two sets of real-time and offline computation logic can easily cause data inconsistency while keeping development and maintenance costs high. With real-time ETL data landing in place, the real-time and offline code logic will be unified to achieve stream-batch integration.

In addition, to provide complete, queryable data externally, data progress must be exposed. Currently, to provide minute-level progress, the Iceberg table is partitioned by (dt, hour, timestamp), and progress is judged from the task delay and the record counts of the table's partitions. This partition structure produces a large number of small files, reading the table's metadata takes a long time, and the reported progress lags behind. We are now experimenting with a watermark-based scheme to determine the table's data progress; the results still need further testing and verification.
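While the watermark scheme is still being validated, one way to approximate table progress today is to read the latest snapshot's metadata through the Iceberg API. The sketch below uses an assumed table location; the production table would be loaded from the Hive catalog, and the commit timestamp is only a coarse proxy for progress.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

public class TableProgressCheck {
    public static void main(String[] args) {
        // Placeholder location; the real table lives in the Hive catalog.
        Table table = new HadoopTables(new Configuration())
            .load("hdfs://ns/warehouse/ad_dw/ads_report");

        Snapshot current = table.currentSnapshot();
        if (current != null) {
            // The latest commit time is a coarse lower bound on data progress; a
            // writer-provided watermark in the snapshot summary would be finer-grained.
            System.out.println("last commit: " + current.timestampMillis());
            System.out.println("summary: " + current.summary());
        }
    }
}
```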

We also look forward to the data lake demonstrating its strength in federated queries and Flink Table Store, bringing new possibilities to advertising data and other data scenarios.


