Zhejiang Telecom's real-time lakehouse practice with Amoro + Apache Iceberg

Amoro is a lakehouse management system built on open table formats such as Apache Iceberg. It provides a set of pluggable data self-optimizing mechanisms and management services, aiming to give users an out-of-the-box lakehouse experience.

01 About the author

Yu Zhiqiang, head of the big data center platform group at Zhejiang Telecom, has more than 10 years of experience in data warehousing and big data. He has worked as a DBA for the commercial MPP database Vertica and the open-source MPP database StarRocks. He currently focuses on building and implementing the lakehouse architecture based on Apache Iceberg and the data application mart built on open-source MPP products.

 

02 Apache Iceberg at Zhejiang Telecom

Why choose Iceberg

Zhejiang Telecom's Big Data Center is mainly responsible for aggregating the company's business data, producing the data warehouse, and supporting some data applications. Our big data architecture has gone through three broad stages of evolution so far.

Phase 1: Exploring the migration of the data warehouse to Hive

As the big data system iterated, we began building a Hive-based big data analysis system and simultaneously explored the feasibility of migrating the data warehouse to Hive. After switching to Hive, however, we ran into the following problems:

  • Offline batch processing on MapReduce is inefficient; production completion times lagged the commercial MPP database by 4-5 hours.

  • Hive lacks the constraints of a relational database, strict field type checking, ACID semantics, and supporting third-party tools and platforms, which drove up subsequent data quality maintenance costs and left data quality below standard.

Based on the above factors, the data warehouse team suspended the migration to Hive and went back to looking for a more affordable commercial MPP product (mainly to address the row-storage bottleneck of the original MPP database by introducing a columnar MPP database on the x86 architecture). The big data cluster that had been built continued the Hive exploration and took on data application tasks with low timeliness requirements. The process of writing data into Hive was as follows:

The extraction tool here was developed in Java by the local data collection team. It periodically reads from Oracle, generates formatted text files, and writes them into Hive, completing the synchronization of business data into the data warehouse. A key premise of this approach is that Oracle's read performance is guaranteed and the business system library is not affected. However, the scheduled-trigger model also meant that data freshness in Hive was relatively poor.

Phase 2: Business systems moving to the cloud drives architectural adjustments in the downstream data warehouse and application systems

Subsequently, as Zhejiang Telecom began migrating its systems to the cloud, the business system libraries gradually changed from Oracle to TeleDB (Telecom's self-developed relational database based on MySQL). The traditional data flow of reading the business library directly and writing into the data warehouse would put heavy pressure on the TeleDB business libraries. Driven by these issues, our data links also changed.

Two synchronization tools were used here. The first was an external product packaged around the open-source Canal; because it only adapted to TeleDB, it could not meet actual needs once TelePG (Telecom's self-developed database based on PostgreSQL) was later introduced into production. We then switched to Telecom's self-developed cross-IDC synchronization tool, which moved the data warehouse ODS layer from offline to real-time data collection. Even so, data freshness in the Hive system used for business analysis remained relatively poor.

Phase 3: Rethinking the direction of big data and building an integrated lakehouse cluster

After the cloud migration of the production systems stabilized, the data warehouse and application marts also began moving to the cloud. We rethought the positioning and direction of big data and decided to rebuild the gradually marginalized big data clusters to support real-time aggregation of all of Zhejiang Telecom's business system data and the integration and reconstruction of the data warehouse. The original CDH cluster was upgraded to the big data platform version developed by China Telecom. At the same time, we needed to choose a table format for the lakehouse foundation that supports real-time writes and reads. With data lake technology gradually maturing, and with the group encouraging big data teams to embrace open source, our research found that data lake products solve real-time data writing and reading very well.

After comparing Hudi, Delta Lake, and Iceberg, we concluded that Iceberg not only supports CDC data writes well but also has broader support across streaming and batch engines and MPP databases, so different engines can be chosen for different business analysis scenarios to meet performance needs. In addition, NetEase's data development and management platform EasyData (hereinafter NetEase EasyData) and Youshu BI support Iceberg well. In the end, we chose Iceberg as our new table format.

The data link here has gone through two stages:

  • Stage one:

Kafka data was written to Iceberg mainly through the real-time FlinkSQL development module of NetEase EasyData. However, as the guarantees of the original synchronization tool proved insufficient and the data middle platform was introduced, the data link was optimized and adjusted.
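A rough Flink SQL sketch of this stage is shown below: a Kafka topic is declared as a source table and continuously inserted into an Iceberg table. The topic, fields, catalog addresses, and table names are placeholders for illustration, not our production definitions.

```sql
-- Kafka source table (topic, fields, and broker addresses are illustrative)
CREATE TABLE kafka_ods_orders (
  order_id   BIGINT,
  status     STRING,
  event_time TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'ods_orders',
  'properties.bootstrap.servers' = 'kafka-broker:9092',
  'properties.group.id' = 'iceberg-ingest',
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json'
);

-- Iceberg catalog backed by the Hive Metastore (URIs are placeholders)
CREATE CATALOG iceberg_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',
  'uri' = 'thrift://hive-metastore:9083',
  'warehouse' = 'hdfs://ns1/warehouse/iceberg'
);

-- Continuously sink the Kafka stream into the Iceberg ODS table
INSERT INTO iceberg_catalog.ods.orders
SELECT order_id, status, event_time
FROM kafka_ods_orders;
```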

  • Stage two:

In the early days, TeleDB data was written to Iceberg in real time through NetEase EasyData's real-time data transmission (based on Flink CDC). As the number of online tasks grew, we found that its adaptation to Telecom's self-developed TeleDB and TelePG databases was incomplete. Later, given the tight project launch schedule, our team built a self-developed lake ingestion platform on top of Flink CDC capabilities to bring business data into the lake in real time. It currently runs stably overall, but resource optimization and ease of use still need further development.
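Because TeleDB is MySQL-compatible, a Flink CDC ingestion job is conceptually similar to the sketch below, which uses the open-source mysql-cdc connector as a stand-in for our self-developed collection platform; the connection details, schema, and table names are hypothetical.

```sql
-- CDC source over a MySQL-compatible business library (credentials and addresses are placeholders)
CREATE TABLE teledb_orders_cdc (
  order_id BIGINT,
  status   STRING,
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  'connector' = 'mysql-cdc',
  'hostname' = 'teledb-host',
  'port' = '3306',
  'username' = 'cdc_user',
  'password' = '******',
  'database-name' = 'biz_db',
  'table-name' = 'orders'
);

-- Upsert the change stream into an Iceberg v2 table (the target table needs a primary key)
INSERT INTO iceberg_catalog.ods.biz_orders /*+ OPTIONS('upsert-enabled'='true') */
SELECT order_id, status
FROM teledb_orders_cdc;
```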

Business practices

Our team's data comes from a wide range of sources, including business data from Zhejiang Telecom's B, M, and O domains and other systems. After the data lands in the ODS layer, it is processed through DWD and DWS, and various reports (real-time and offline) are then generated on the Youshu platform according to different business analysis scenarios. These reports are embedded into the applications used daily by business and analysis staff and are available to each operating node.

Different report usage scenarios determine the required data response speed, which in turn determines the compute engines we need. For daily self-service analysis, we use Spark and Trino to query Iceberg reports. Batch processing at the data warehouse production layer is mainly offline and relies primarily on Spark. For application-layer scenarios with high response-speed requirements, we use the Doris database (whose catalog accesses Iceberg tables directly) to support the various scenarios.
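For the Doris path, the direct access mentioned above goes through Doris's multi-catalog feature. A minimal sketch, assuming a Hive Metastore-backed Iceberg catalog with placeholder addresses and table names, looks roughly like this:

```sql
-- Register the Iceberg lakehouse as an external catalog in Doris (addresses are placeholders)
CREATE CATALOG iceberg_lake PROPERTIES (
  'type' = 'iceberg',
  'iceberg.catalog.type' = 'hms',
  'hive.metastore.uris' = 'thrift://hive-metastore:9083'
);

-- Query the Iceberg table directly from Doris for latency-sensitive reports
SWITCH iceberg_lake;
SELECT status, COUNT(*) AS cnt
FROM ods.orders
GROUP BY status;
```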

Usage

After settling on Iceberg as the table format for the data warehouse system, we have converted most of the ODS layer to Iceberg and will gradually convert DWD and DWS to Iceberg as well.

The Iceberg tables currently store 1.1 PB in total, and we expect 10-15 PB once the entire data warehouse transformation is complete. Nearly 10,000 Iceberg tables are written in real time via CDC, and we estimate 30,000 to 40,000 Iceberg tables in total after the initial transformation is finished.

03 Amoro in Zhejiang Telecom

Why choose Amoro

Real-time writes to Iceberg tables generate a large number of small files. In particular, to ensure data consistency we enabled upsert mode in many scenarios (in this mode, every written record produces one insert record and one delete record). The resulting mass of equality-delete files aggravates the small-file problem, hurts query performance significantly, and can even cause OOM on the engine side. Promptly monitoring and merging small files and reducing equality-delete files are therefore essential to keeping Iceberg tables usable.
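For reference, the upsert mode mentioned above is enabled per table. A minimal Flink SQL sketch with a hypothetical table looks like this (the table must use Iceberg format v2 and declare a primary/identifier key):

```sql
-- Hypothetical CDC-written table with upsert mode enabled
CREATE TABLE iceberg_catalog.ods.user_balance (
  user_id BIGINT,
  balance DECIMAL(18, 2),
  PRIMARY KEY (user_id) NOT ENFORCED
) WITH (
  'format-version' = '2',          -- v2 tables support row-level deletes
  'write.upsert.enabled' = 'true'  -- each write emits an insert plus an equality delete
);
```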

When merging Iceberg files, we mainly ran into the following difficulties:

  1. Merging equality-delete files was not effective: we used the merging method provided by the Iceberg community, periodically scheduling Flink/Spark tasks to merge files (see the sketch after this list). This rewrite-based merging is expensive; with hundreds of GB of data, each merge can take several hours or even longer, and equality-delete files may still not be removed promptly after the merge completes.
  2. A large backlog of equality-delete files caused merges to fail: once the table's file status was not detected in time, or an equality-delete backlog built up after a failed merge, the file merge planning, memory consumption, and read/write overhead became very large. The merge then often could not finish because of OOM, excessively long execution times, and similar problems, and the table could only be repaired by restoring it, which greatly increased maintenance costs.
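The sketch below illustrates the community merging approach referred to in item 1, using Iceberg's Spark procedures; the catalog, table name, threshold, and timestamp are illustrative.

```sql
-- Rewrite (compact) small data files; delete-file-threshold controls how many
-- delete files must reference a data file before it is picked up for rewriting
CALL spark_catalog.system.rewrite_data_files(
  table   => 'ods.orders',
  options => map('delete-file-threshold', '10')
);

-- Old snapshots must also expire before the replaced files are physically removed
CALL spark_catalog.system.expire_snapshots(
  table      => 'ods.orders',
  older_than => TIMESTAMP '2023-12-01 00:00:00'
);
```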

Given the above problems, we wanted a mature system that could promptly discover tables with many fragmented files and optimize them in time. Through research we found that Amoro fits this scenario very well and provides a variety of capabilities that help us better manage table optimization:

  • The Web UI displays table and optimizing metrics, helping us observe more intuitively each table's file status and write frequency, as well as the details and running status of optimizing jobs.

  • Resident tasks continuously and automatically optimize tables

  • Optimizer resources can be scaled flexibly

  • Optimizer groups isolate the optimizing resources of different tables

Usage

We connected all Iceberg tables shortly after deploying Amoro. We did a lot of testing and verification during the initial deployment and debugging phase and simultaneously improved how Iceberg tables are created (partitioning and so on). We are also very grateful to the community for its great support. We have currently divided the tables into 2 optimizer groups based on table size and business, using 78 cores and 776 GB of memory in total (memory consumption is relatively higher), and can manage and merge small files stably and efficiently.
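Amoro's self-optimizing behavior is driven by table properties, so routing a table to an optimizer group is a per-table setting. A sketch of how a CDC table might be assigned (the group name and threshold here are illustrative, not our production values):

```sql
-- Route the table to a dedicated optimizer group and tune the minor-optimizing trigger
ALTER TABLE iceberg_catalog.ods.orders SET TBLPROPERTIES (
  'self-optimizing.enabled' = 'true',
  'self-optimizing.group' = 'cdc-tables',            -- illustrative optimizer group name
  'self-optimizing.minor.trigger.file-count' = '12'  -- fragment-file count that triggers minor optimizing
);
```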

Effect

Amoro provides efficient and stable merging

Amoro provides a lightweight Minor Optimizing process that merges small files while avoiding rewriting large files, and converts equality-delete files into pos-delete files, which query faster. Compared with the Iceberg community's file merging method, this approach has lower overhead, so it can be executed more frequently, with execution cycles at the minute level, effectively preventing a backlog of small files.

Before using Amoro, the number of equality-delete files in Iceberg tables with frequent CDC writes often stayed above 1,000, and the equality-delete data associated with a single data file exceeded 5 GB.

After adopting Amoro, the number of equality-delete and small files in Iceberg tables stays below 50.

Amoro handles the backlog of equality-delete files

The problem after we enabled the upsert configuration for all Iceberg tables is that the tables contain a large number of equality-delete files, and once an equality-delete backlog occurs, the merge task easily runs out of memory. The reason is that in Iceberg's task planning mechanism, a data file is associated with all equality-delete files whose sequence number is larger than its own (by default a file's sequence number equals the sequence number of its snapshot, which increases automatically). Data files committed in historical snapshots therefore reference many equality-delete files, and during Iceberg MOR and file merging these equality-delete files must be read into memory to delete the matching data in the associated data files, which consumes a great deal of memory.

In response, the Amoro community optimized the optimizing process. In the upsert scenario, much of the data in the equality-delete files associated with a data file has nothing to do with that data file, so a lot of the equality-delete content read into memory is actually useless yet still occupies substantial memory. In addition, when planning self-optimizing tasks, Amoro can limit the size of the data files in each task through the self-optimizing.max-task-size-bytes configuration, so the data file size per task is predictable, while the equality-delete file size is not. We can therefore use the contents of the data files in a task to filter out equality-delete records that do not need to be cached in memory.

The concrete implementation builds a BloomFilter from the equality-field values of the data files in a task. Among the many equality-delete files associated with the data files being optimized, only the records whose equality-field values may intersect with those data files are loaded into memory to participate in MOR. This significantly reduces memory usage and avoids optimizer OOM in upsert scenarios with many equality-delete files.

04 Future planning

  • In the future, we will upgrade the current Iceberg version to Telecom's self-developed Iceberg version, first optimizing the number of files generated during real-time Flink writes to reduce the self-optimizing pressure on Amoro, and we will actively co-build with all parties in the Iceberg open source community.

  • Gradually move the DWD and DWS layers from commercial MPP databases to the Iceberg-based integrated lakehouse cluster

  • Considering that we may need to read the data lake in real time in the future, and Iceberg still does not support real-time reads of V2 tables, we are also investigating Amoro's Mixed Iceberg format to address real-time reading of the data lake.

Finally, I would like to thank the Amoro community again for its strong support, and I wish the community continued success.

 


End~

If you are interested in data lakes, lakehouse integration, table formats, or the Amoro community, please contact us for in-depth discussion.

More information about Amoro can be found at:

Community communication: search kllnn999 and add the Ashida Aina assistant on WeChat to join the community

Author: Yu Zhiqiang

Editor: Viridian
