Youdao's quasi-real-time lakehouse practice built on Amoro Mixed Format

About the Author

Xie Yi, senior big data development engineer at NetEase Youdao, currently works mainly on real-time computing and lakehouse integration R&D.

Wang Tao, senior platform development engineer at NetEase, mainly engaged in big data and lakehouse platform construction.

Business Background

Youdao's data layer architecture is divided into offline and real-time parts. Offline computation mainly uses Hive and Spark, with batch jobs run on a schedule. The real-time part uses Flink plus Doris (version 0.14.0) to build a real-time data warehouse that processes event-tracking logs and change data from business databases in real time. The ODS layer ingests event logs from Kafka and business data from databases; DWD- and DWS-layer data are produced by the Flink compute engine and written into Doris. Doris data is also periodically synchronized to Hive for offline scheduled computation. This architecture has the following problems:

  1. High development and O&M costs: Flink SQL syntax differs substantially from Hive/Spark SQL, so migrating Hive/Spark jobs to Flink is expensive, and operating and tuning Flink jobs with large state is difficult.
  2. Weak support for combined full-plus-incremental streaming reads: it is hard to meet Youdao's need for Flink to read full Hive historical data followed by incremental Kafka data.
  3. Streaming and batch storage are not unified, doubling data development and storage costs and easily causing inconsistent data semantics between the two pipelines.
  4. Doris is a data island: it uses SSD storage, which is expensive and ill-suited to large-scale Dictionary log data, and maintaining two storage systems long term hinders cost optimization.
  5. Data must be continuously imported and exported between Hive and Doris. The long pipeline easily introduces instability: for example, under large-scale writes, exporting data from Doris to Hive occasionally loses data, and Doris does not support storing very long String values.

Based on the above problems, Youdao wanted to upgrade from Hive to an integrated lakehouse solution that supports unified streaming/batch reads and writes on unified storage, and to build minute/hour-level near-real-time data warehouses with Spark/Trino/Hive ETL to lower development and O&M costs. In most scenarios this can replace Doris's minute-level data warehouse role, cut database synchronization costs, and effectively reduce cost while improving efficiency.

Introducing Amoro Mixed Hive

Amoro Mixed Hive provides Hive read/write compatibility and data self-optimizing. On top of these, it offers two read paths with different freshness guarantees:

  1. Merge-on-read, which achieves minute-level data freshness.
  2. Plain Hive reads, which achieve hour-level freshness while also gaining support for updating and deleting Hive data. Existing downstream Hive jobs see their data freshness improve to hour level without any modification; for analysts accustomed to Hive, this lowers the barrier to adopting the new technology.

Hive table format compatibility

Amoro designed Mixed Hive to be compatible with Hive. The storage structure of Mixed Hive is shown in the figure: the BaseStore is an independent Iceberg table, with the Hive table's data forming part of the BaseStore, and the ChangeStore is another independent Iceberg table. Mixed Hive supports:

  • Schema, partitioning, and types consistent with the Hive format
  • Reading and writing a Mixed Hive table as a plain Hive table through the Hive connector
  • In-place upgrade of Hive tables to Mixed Hive tables, with no data rewriting or migration; the upgrade completes within seconds
  • Lakehouse features such as primary-key upsert, streaming read/write, ACID, and time travel

Hive read/write compatibility allows Hive tables to migrate seamlessly to Mixed Hive, transparently to both upstream and downstream consumers.
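As an illustration of the BaseStore/ChangeStore split described above, the following minimal Python sketch shows how a merge-on-read view can overlay primary-key changes from a ChangeStore onto BaseStore rows. This is not Amoro code; the record layout and the `upsert`/`delete` operation names are assumptions made purely for illustration.

```python
# Minimal merge-on-read sketch: overlay ChangeStore ops onto BaseStore rows.
# Illustrative only; Amoro's real file formats and merge logic differ.

def merge_on_read(base_rows, change_ops, key="id"):
    """base_rows: list of dicts (the BaseStore snapshot).
    change_ops: ordered list of (op, row) with op in {"upsert", "delete"}."""
    view = {row[key]: row for row in base_rows}
    for op, row in change_ops:          # replay changes in commit order
        if op == "upsert":
            view[row[key]] = row
        elif op == "delete":
            view.pop(row[key], None)
    return sorted(view.values(), key=lambda r: r[key])

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
changes = [("upsert", {"id": 2, "v": "b2"}),
           ("delete", {"id": 1, "v": "a"}),
           ("upsert", {"id": 3, "v": "c"})]
print(merge_on_read(base, changes))
# [{'id': 2, 'v': 'b2'}, {'id': 3, 'v': 'c'}]
```

Full optimizing (described below) periodically folds such changes back into the Hive directory, which is why plain Hive reads eventually see the same view at hour-level freshness.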

Hive data updates

Amoro uses self-optimizing to merge changes written in real time into Hive, implementing Hive data updates. The goal of self-optimizing is to provide, on top of the new data lake table formats, an out-of-the-box streaming lakehouse service with the experience of a database or traditional data warehouse. Self-optimizing includes, but is not limited to, file merging, deduplication, and sorting. Amoro divides the files in a table into two categories by size:

  • Fragment files: fragmented files, smaller than 16 MB in the default configuration. They need to be merged into larger files promptly to improve read performance.
  • Segment files: files larger than 16 MB in the default configuration. They already have a reasonable size but are still below the ideal 128 MB.

Based on this file classification, Amoro divides optimization into three kinds of tasks: minor, major, and full optimizing. Together they cover both write-friendly and read-friendly scenarios, balancing read performance against write performance. In particular, full optimizing periodically merges data written in real time into the Hive directory, refreshing the Hive data view and improving the freshness of Hive data. Continuous self-optimizing keeps the file size distribution of the table healthy, reduces the number of small files, and lowers the performance overhead of analytical queries.
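A simplified sketch of the classification and scheduling idea above. The thresholds mirror the defaults mentioned in the article (16 MB fragment threshold, 128 MB target), but the selection rules here are simplified assumptions for illustration, not Amoro's actual scheduler logic.

```python
# Toy sketch of Amoro-style file classification (not Amoro's real logic).
MB = 1024 * 1024
FRAGMENT_THRESHOLD = 16 * MB   # default fragment threshold from the article
TARGET_SIZE = 128 * MB         # ideal file size

def classify(size_bytes):
    """Files below the threshold are fragments; the rest are segments."""
    return "fragment" if size_bytes < FRAGMENT_THRESHOLD else "segment"

def pick_optimizing(file_sizes):
    """Toy policy: mostly fragments -> minor optimizing (merge small files);
    undersized segments -> major optimizing; otherwise full optimizing,
    which also folds accumulated changes into the Hive directory."""
    kinds = [classify(s) for s in file_sizes]
    if kinds.count("fragment") > len(kinds) * 0.5:
        return "minor"
    if any(k == "segment" and s < TARGET_SIZE
           for k, s in zip(kinds, file_sizes)):
        return "major"
    return "full"

sizes = [2 * MB, 5 * MB, 40 * MB]          # two fragments, one segment
print([classify(s) for s in sizes])        # ['fragment', 'fragment', 'segment']
print(pick_optimizing(sizes))              # 'minor'
```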

Implementation plan

Data link transformation

Based on Amoro, we made the following changes to the traditional data links:

  1. In terms of development method, source-layer ingestion moved from handwritten Flink SQL to the real-time data lake platform: with a few simple interactions, a business team can upgrade a Hive table and build its ingestion link into the lake.
  2. The database-to-Hive link, previously synchronized periodically by a data transfer service, was transformed into a real-time Mixed Hive table. Along with improved data freshness, the offline workflow baseline moves earlier and data output time is brought forward significantly.
  3. Amoro replaces Doris, reducing data link complexity, unifying streaming and batch storage, and improving stability.
  4. On the query side, freshness improves by querying the Mixed Hive table directly: data reports can reach minute-level freshness, and the original Hive-based report links improve to hour-level freshness.

Real-time data lake platform co-construction

To shield business developers from the learning cost of the underlying storage change, NetEase Hangyan built an internal real-time data lake development platform on top of Amoro. It encapsulates the whole flow from upgrading Hive tables to building ingestion links, letting users complete development and operations in one place and lowering the barrier and cost of adoption:

  • Upgrade Hive tables to Mixed Hive tables, including primary key and partition key configuration.
  • Create ingestion tasks from source systems to Mixed Hive tables, supporting both database CDC ingestion and log ingestion:
    • Based on NDC (NetEase Data Canal), a full-plus-incremental ingestion link writes the source database binlog directly into the Mixed Hive table.
    • Real-time ingestion links from log Kafka topics to Mixed Hive tables can be configured.

According to user research, compared with the previous Flink-based real-time ingestion architecture, development costs dropped by 65% and O&M costs by 40% after Amoro was introduced. With Flink ingestion handled by the platform, users can write Spark/Trino SQL for DWD- and DWS-layer ETL and build minute- and hour-level near-real-time data warehouses, greatly reducing development cost. Because storage is unified for streaming and batch, data development and data repair costs also fall. Meanwhile, Amoro's full/incremental Flink streaming read capability covers streaming scenarios with stricter freshness requirements.

On top of open-source Amoro and the Hangyan real-time data lake platform, Youdao has also been deeply involved in community contributions and platform co-construction, including:

  • Contributed ORC support to Mixed Format, removing the limitation that Amoro only supported Parquet-format Hive tables and avoiding upgrade-by-copying ORC tables to Parquet; this is expected to save 20 TB of redundant storage.
  • Built the Amoro platform monitoring system and automated O&M optimization to safeguard online table quality, data freshness, and cluster stability.
  • Added Hadoop proxy support for Trino queries on Amoro, implementing permission management based on Youdao's HDFS Ranger.
  • Multiple Amoro Optimizer improvements for stability, such as reporting Flink Optimizer task retries to AMS.
  • Multiple real-time data lake platform improvements for reliability and user experience, such as Amoro high-availability support in the platform.

Query optimization

Currently, Youdao mainly uses the MPP engine Trino for minute-level OLAP queries on Mixed Hive tables. In most scenarios the default query performance is sufficient, but in some scenarios with strict response-time requirements, Mixed Hive query performance fell short of business needs and lagged behind comparable environments. Amoro and Youdao analyzed Trino query profiles and underlying HDFS performance and found three optimization points. After optimization, Mixed Hive query performance improved markedly: query time dropped by 92%, close to querying static Hive data, and now meets business requirements. The three optimizations are:

  1. In the query plan stage, Amoro rewrote the original plan logic so that the sequence numbers needed to decide whether data has been deleted are obtained directly from the multi-threaded plan, eliminating the overhead of an extra single-threaded plan to fetch them.
  2. To address data skew, Amoro splits tasks that have low delete-file overhead but many data files into finer-grained tasks; the increased parallelism improved performance by 50%.
  3. Youdao optimized HDFS. The analysis showed that the p95 RPC response time through the Router reached 262 ms, far above the 5 ms of a healthy cluster. Switching to a directly connected HDFS cluster reduced the p95 RPC response time to 15 ms and cut the average Mixed Hive query time by 83.3%. In addition, where freshness requirements can be relaxed, querying the BaseStore of Mixed Hive directly still gives minute-level freshness with better performance, comparable to querying static Hive data.
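The skew fix in point 2 can be sketched as follows. The task model, thresholds, and splitting rule are simplified assumptions for illustration, not Amoro's implementation: a scan task with many data files but little delete-file overhead is split into finer subtasks so more workers run in parallel.

```python
# Toy sketch of splitting a skewed scan task into finer-grained subtasks.
# The thresholds and task model are illustrative, not Amoro's actual code.

def split_task(data_files, delete_bytes, max_files_per_split=4,
               delete_overhead_limit=32 * 1024 * 1024):
    """Split only when delete-file overhead is small but the data-file count
    is large; a task dominated by delete files is left whole so the delete
    files are not re-read by every subtask."""
    if delete_bytes > delete_overhead_limit or len(data_files) <= max_files_per_split:
        return [data_files]
    return [data_files[i:i + max_files_per_split]
            for i in range(0, len(data_files), max_files_per_split)]

files = [f"f{i}.parquet" for i in range(10)]
splits = split_task(files, delete_bytes=1024)
print(len(splits))  # 3 subtasks -> parallelism of 3 instead of 1
```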

Application

Amoro went into production at Youdao in early 2023. It now hosts 500+ online tables and 100+ TB of storage, with a daily storage increment of 200 GB. Spark/Trino issue 6,000+ queries per day on average across 10+ Youdao business departments. Minute-level near-real-time data warehouses have been implemented in scenarios such as renewals and ad placement, and Amoro's managed data self-optimizing effectively prevents small-file problems in these warehouse workloads, delivering continuous cost reduction and efficiency improvement.

In terms of replacing Doris, Youdao Dictionary has fully replaced Doris with Amoro and decommissioned its Doris cluster nodes. Youdao Premium Courses is also gradually replacing Doris with Amoro, with completion expected in the first half of next year.

In near-real-time data warehouse practice, three business departments have built near-real-time warehouses on Amoro, with end-to-end latency as low as 3 minutes. Meanwhile, real-time incremental writes to Mixed Hive tables replace the traditional full-data transfer tasks, moving the offline workflow baseline earlier; ADS table output time is brought forward by up to 6 hours.

In terms of business benefits, data output has improved from T+1 to hour/minute level, enabling faster and more effective decision analysis (ad placement, sales strategy, etc.) and bringing cost reduction or efficiency gains to many Youdao departments; for example, video playback time in the Dictionary community rose by 10% and click-through rate by 4.6%.

Community Contribution

In the Amoro open source community, Youdao has had 13 PRs merged, including:

  • Mixed Format supports the ORC file format
  • Flink DDL supports calculated columns and Watermark
  • Trino engine supports hadoop-proxy
  • Support HTTP request to create optimizer group
  • Table deletion operation optimization
  • Flink Optimizer failover retry reporting to AMS

Future Plans

  1. In terms of query performance, Amoro's analytical query performance now largely meets user needs, including most scenarios with strict response-time requirements. The platform-level goal is to keep 90% of query responses within 5 seconds. SSD-based Doris averages 1.8 seconds per query, while Amoro after optimization averages 5 seconds, which satisfies most query scenarios. We will continue to optimize query performance, for example by introducing Z-order in full optimizing to improve the data-skipping hit rate.
  2. Current ingestion tasks run one task per table, and with a large number of changing database tables, the resource utilization of ingestion tasks is low. Hangyan's real-time data lake platform has launched Flink session-mode ingestion, reusing a session's JobManager/TaskManager resources to improve utilization; this is expected to cut ingestion tasks' memory and CPU usage by roughly 30% to 50%.
  3. Paimon offers good usability in Flink partial-update scenarios, and we plan to try it through Amoro. We hope to use Amoro's Unified Catalog to manage Mixed Hive, Paimon, and Mixed Iceberg tables uniformly, so that Paimon tables can be used just like Mixed Hive while users keep their current Amoro-based development experience.
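The Z-order clustering mentioned in point 1 interleaves the bits of several column values into one sort key, so rows that are close in multiple dimensions land in the same files and min/max statistics prune more files for predicates on any of those columns. A minimal sketch for two unsigned integer columns (illustrative only; a real implementation handles arbitrary types and value scaling):

```python
# Minimal Z-order (Morton code) sketch for two unsigned integer columns.

def z_order_key(x, y, bits=16):
    """Interleave the low `bits` bits of x and y:
    x bits at even positions, y bits at odd positions."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

rows = [(3, 5), (0, 0), (3, 4), (1, 1)]
clustered = sorted(rows, key=lambda r: z_order_key(*r))
print(clustered)
# [(0, 0), (1, 1), (3, 4), (3, 5)]
# Writing files in this order keeps rows close in (x, y) together,
# so data skipping can prune files for filters on either column.
```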

End~

If you are interested in data lakes, lakehouse integration, table formats, or the Amoro community, please contact us for in-depth discussions.

More information about Amoro can be found at:

Community communication: search kllnn999 and add the Ashida Aina assistant on WeChat to join the community.

Author: Xie Yi & Wang Tao

Editor: Viridian



Origin my.oschina.net/u/6895272/blog/10322356