[Meet Doris] Building the Youdao Premium Course Data Middle Platform on Apache Doris

Li Rongqian, Real-time Data Warehouse Lead, Youdao Premium Course Data Middle Platform Team


In this post we share the architectural evolution of the Youdao Premium Course data middle platform, and how Doris, as an MPP analytical database, effectively supports and empowers our growing business.

Starting from our experience in selecting a real-time data warehouse, the article focuses on the problems we encountered while using Doris and the adjustments and optimizations we made to address them.

1 Background

1.1 Business Scenario

Driven by business requirements, the current data layer architecture of Youdao Premium Course is split into two parts: offline and real-time. The offline side mainly handles event-tracking (buried-point) data and computes it periodically with batch processing.

The real-time stream data mainly comes from streams generated in real time by the various business systems and from database change logs. The processing has to account for accuracy, latency, and event ordering, which makes it considerably more complex.

Relying on its real-time computing capabilities, the Youdao Premium Course data middle platform team is mainly responsible for real-time data processing in the overall data architecture, and provides real-time data synchronization for the downstream offline data warehouse.

The main user roles served by the data middle platform and their data requirements are as follows:

  • Operations / strategy / course owners mainly look at the overall situation of students and query real-time aggregated data on course dimensions in the data middle platform
  • Tutors / sales focus on various real-time detail records of the students they serve
  • Quality control mainly reviews the overall data across the course / teacher / tutoring dimensions through T+1 offline reports
  • Data analysts run interactive analysis on the data synchronized T+1 from the data middle platform to the offline data warehouse

1.2 Early System Architecture and Business Pain Points of the Data Middle Platform

As shown in the figure above, in the data middle platform 1.0 architecture our real-time data storage relied mainly on Elasticsearch, which led to the following problems:

  1. Aggregate queries were inefficient
  2. Data compression ratio was poor
  3. Joins across indices are not supported; in business design we could only work around this with many large wide tables
  4. Standard SQL is not supported, so queries were costly to write

2 Real-time Data Warehouse Selection

Based on the pain points above, we started researching real-time data warehouses, investigating Doris, ClickHouse, TiDB + TiFlash, Druid, and Kylin.

Since our data middle platform team initially had only two developers and we had to operate and maintain the storage layer ourselves, we were very sensitive to operation and maintenance costs. On that basis we first ruled out Kylin and ClickHouse.

In terms of queries, most of our scenarios are multi-dimensional analysis over both detail and aggregated data, so Druid was also excluded.

Finally, we compared aggregation efficiency. Since Doris supports Bitmap columns and RollUp while TiDB + TiFlash does not, we ultimately chose Doris as the primary storage of our data middle platform.
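As a rough illustration of why Bitmap and RollUp mattered to us, the following is a minimal sketch of an aggregate table that keeps distinct learner counts as bitmaps; the table, column, and RollUp names are hypothetical, not our production schema:

```sql
-- Hypothetical aggregate table: per-course daily distinct learners kept as a
-- Bitmap, so COUNT DISTINCT style queries read pre-merged bitmaps.
CREATE TABLE dws_course_learner_uv (
    course_id   BIGINT COMMENT "course id",
    dt          DATE   COMMENT "stat date",
    learner_uv  BITMAP BITMAP_UNION COMMENT "distinct learner ids"
) AGGREGATE KEY (course_id, dt)
DISTRIBUTED BY HASH(course_id) BUCKETS 10;

-- Load from a hypothetical detail table, turning each learner id into a bitmap.
INSERT INTO dws_course_learner_uv
SELECT course_id, DATE(eventStamp), TO_BITMAP(student_id)
FROM dwd_exercise_record;

-- Distinct learners per course and day, answered from the merged bitmaps.
SELECT course_id, dt, BITMAP_UNION_COUNT(learner_uv) AS uv
FROM dws_course_learner_uv
GROUP BY course_id, dt;

-- A RollUp that pre-aggregates the same bitmap down to the date level only.
ALTER TABLE dws_course_learner_uv ADD ROLLUP r_daily_uv (dt, learner_uv);
```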

3 Data Middle Platform 2.0 Based on Apache Doris

3.1 Architecture Upgrade

After completing the real-time data warehouse selection, we made several architectural changes around Doris to get the most out of it, mainly in the following areas:

Flink dual write

We rewrote all Flink jobs so that, when writing to Elasticsearch, each job also outputs a bypass copy of the data to Kafka, and for complex nested data we created downstream tasks to transform it before sending it to Kafka. Doris then imports the data with Routine Load.
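For reference, a minimal Routine Load sketch is shown below; the database, table, topic, broker, and column names are hypothetical placeholders rather than our actual job definitions:

```sql
-- Hypothetical Routine Load job: continuously consume the Kafka bypass topic
-- written by the Flink dual-write jobs and load it into a Doris table.
CREATE ROUTINE LOAD edu_dw.load_dwd_student_event ON dwd_student_event
COLUMNS(student_id, course_id, event_type, eventStamp)
PROPERTIES
(
    "desired_concurrent_number" = "3",
    "format" = "json",
    "max_error_number" = "1000"
)
FROM KAFKA
(
    "kafka_broker_list" = "kafka-broker-1:9092,kafka-broker-2:9092",
    "kafka_topic" = "dwd_student_event",
    "property.group.id" = "doris_load_dwd_student_event"
);
```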

Doris on ES

Since our real-time data warehouse previously had only ES, in the early days of using Doris we created ES external tables in Doris to round out the Doris warehouse base tables. This also lowers query cost, and the business side can use the warehouse base tables transparently.
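The sketch below shows what such an ES external table definition might look like; the index, host, and column names are hypothetical:

```sql
-- Hypothetical ES external table: exposes an existing Elasticsearch index
-- through Doris so it can be joined with native Doris tables.
CREATE EXTERNAL TABLE dwd_student_profile_es (
    student_id   BIGINT,
    student_name VARCHAR(64),
    grade        VARCHAR(32),
    city         VARCHAR(64)
) ENGINE = ELASTICSEARCH
PROPERTIES (
    "hosts"    = "http://es-node-1:9200,http://es-node-2:9200",
    "index"    = "dwd_student_profile",
    "type"     = "_doc",
    "user"     = "",
    "password" = ""
);
```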

A typical query joins the students' basic information with various exercise details to assemble complete student data.
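A simplified sketch of such a query follows; the table and column names are hypothetical and do not reproduce the original demo:

```sql
-- Hypothetical join: enrich each student's profile (ES external table) with
-- exercise statistics aggregated from a native Doris detail table.
SELECT
    p.student_id,
    p.student_name,
    p.grade,
    COUNT(e.exercise_id)                              AS exercise_cnt,
    SUM(CASE WHEN e.is_correct = 1 THEN 1 ELSE 0 END) AS correct_cnt
FROM dwd_student_profile_es p
LEFT JOIN dwd_exercise_record e ON e.student_id = p.student_id
GROUP BY p.student_id, p.student_name, p.grade;
```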

Data synchronization

Back when we used ES, many tables had no write-time field, so data analysts had to scan whole tables every day and export the full data to Hive, which put a lot of pressure on our cluster and also increased data latency. After introducing Doris, we added three fields, eventStamp, updateStamp, and deleted, to all data warehouse tables:

  • eventStamp: the time the event occurred
  • updateStamp: the time the row was updated in Doris, generated during Routine Load
  • deleted: whether the row has been deleted. Since many of our real-time warehouse tables need to be synchronized to the offline warehouse regularly, deletions have to be soft deletes.

When synchronizing data downstream, either eventStamp or updateStamp can be chosen flexibly as the watermark for incremental synchronization.
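As an illustration only (hypothetical table name and window bounds), an incremental pull based on updateStamp might look like this:

```sql
-- Hypothetical incremental pull: rows changed in Doris during the last hour,
-- keyed on updateStamp; soft-deleted rows are included so the downstream
-- Hive table can replay the deletions.
SELECT *
FROM dwd_exercise_record
WHERE updateStamp >= '2021-06-01 10:00:00'
  AND updateStamp <  '2021-06-01 11:00:00';
-- For event-time driven tables, the same pattern applies with eventStamp.
```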

For data synchronization, we distinguish the different synchronization scenarios by the Hive table name suffix (an Export sketch follows the list):

  • _f: full synchronization every day/hour, via a full Doris Export
  • _i: incremental synchronization every day/hour, exported by partition via Doris Export or by table scans through the NetEase data transfer service
  • _d: mirror synchronization every day, via a full Doris Export
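A minimal Export sketch is shown below; the database, table, partition, HDFS path, and broker names are hypothetical:

```sql
-- Hypothetical export of one day's partition to HDFS (the _i case);
-- _f/_d tables would run the same statement without the PARTITION clause.
EXPORT TABLE edu_dw.dwd_exercise_record
PARTITION (p20210601)
TO "hdfs://namenode:8020/warehouse/export/dwd_exercise_record/"
PROPERTIES ("column_separator" = "\t")
WITH BROKER "hdfs_broker" ("username" = "hadoop", "password" = "");
```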

Indicator domain division / data layering

After sorting out the data in Elasticsearch and considering the subsequent business scenarios, we divided the data into the following four indicator domains:

Based on the indicator domains above, we started building the real-time data warehouse on a star schema, creating more than 20 warehouse base tables and more than 10 dimension tables in Doris, and built out a complete indicator system through the NetEase data platform.
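To make the table design concrete, here is a hedged sketch of what one such base table might look like, carrying the eventStamp / updateStamp / deleted fields described earlier; all names and settings are hypothetical:

```sql
-- Hypothetical DWD base table: unique-key detail table partitioned by event
-- time, carrying the eventStamp / updateStamp / deleted bookkeeping fields.
CREATE TABLE dwd_exercise_record (
    eventStamp   DATETIME COMMENT "time the event occurred",
    student_id   BIGINT   COMMENT "student id",
    exercise_id  BIGINT   COMMENT "exercise id",
    course_id    BIGINT   COMMENT "course id",
    is_correct   TINYINT  COMMENT "1 if answered correctly",
    updateStamp  DATETIME COMMENT "time the row was updated in Doris",
    deleted      TINYINT  COMMENT "soft-delete flag"
) UNIQUE KEY (eventStamp, student_id, exercise_id)
PARTITION BY RANGE (eventStamp) ()
DISTRIBUTED BY HASH(student_id) BUCKETS 16
PROPERTIES (
    "dynamic_partition.enable"    = "true",
    "dynamic_partition.time_unit" = "DAY",
    "dynamic_partition.end"       = "3",
    "dynamic_partition.prefix"    = "p",
    "dynamic_partition.buckets"   = "16"
);
```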

Micro-batch generation of the DWS/ADS layers

Since most of our scenarios analyze both detail and aggregated data, we built, on top of Doris's insert into select import method, a set of logic that periodically generates DWS/ADS layer data from DWD layer data, with latency down to the minute level. The overall multi-level warehouse table computation flow is as follows:

For detail data stored in TiDB or ES, we chose to perform window aggregation in Flink and write the results to the downstream Doris or ES tables. For detail data that exists only in Doris, since we mostly write asynchronously the data cannot be read back immediately, so we built an external scheduled execution engine with templated configuration that, at minute/hour granularity, scans the changes in the detail tables and writes them to the downstream aggregation tables. The template configuration looks roughly as follows:

The engine is configured with the source tables to monitor and their change fields. Within each configured time window it scans the source tables, merges the results into parameters, splits those parameters according to a configured threshold, and passes them into multiple insert SQL statements; a full T+1 aggregation then runs in the early morning every day to repair any incorrect data produced by the micro-batch computation.
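As an illustration of the kind of statement the engine renders (table names, columns, and the ${...} placeholders are hypothetical), a micro-batch step might look like this:

```sql
-- Hypothetical statement rendered from a template: re-aggregate only the
-- detail rows that changed inside the scanned window and write them to DWS.
INSERT INTO dws_course_exercise_day (course_id, dt, answer_cnt, correct_cnt)
SELECT
    course_id,
    DATE(eventStamp)                                AS dt,
    COUNT(*)                                        AS answer_cnt,
    SUM(CASE WHEN is_correct = 1 THEN 1 ELSE 0 END) AS correct_cnt
FROM dwd_exercise_record
WHERE updateStamp >= '${window_start}'   -- placeholders filled in by the engine
  AND updateStamp <  '${window_end}'
  AND deleted = 0
GROUP BY course_id, DATE(eventStamp);
```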

The specific calculation trigger logic is as follows:

Data lineage

By pulling Routine Load and Flink job metadata and through service-side reporting, we have built complete data lineage in the data middle platform for data developers and data analysts to query.

Since our Flink development mode is jar submission, in order to obtain a job's lineage we format and encapsulate the name of every operator, and the lineage service periodically pulls the /v1/jobs/overview data for analysis. The operator names are formatted into the following types:

  • Source:sourceTypeName [address] [attr]
  • Sink:sinkTypeName [address] [attr]

The specific lineage service logic is shown in the following figure:

After the lineage service analyzes this data internally, the lineage data is split into Nodes and Edges and written to NebulaGraph in batches. The front-end service can then query a complete lineage graph, as shown in the following figure:

3.2 Data Middle Platform 2.0 Architecture

With these system architecture adjustments around Doris, we completed the data middle platform 2.0 architecture:

  • Replaced Canal with NetEase's data canal service, which provides better data subscription monitoring
  • Introduced Redis / TiDB in the Flink computing layer for temporary / persistent caching
  • Split complex business logic into gRPC services to reduce the business logic kept inside Flink
  • Added a RESTful service in the data adaptation layer to handle some complex, case-by-case indicator retrieval requirements
  • Ran real-time-to-offline data synchronization through NetEase's offline scheduling
  • Added two data outlets: the data reporting system and the self-service analysis system

4 Benefits from Doris

1. The data import methods are simple; we use three of them for different business scenarios

  • Routine Load: real-time asynchronous data import

  • Broker Load: periodically synchronizes offline data warehouse data back into Doris for query acceleration (a hedged sketch appears after this list)

  • Insert into select: periodically generates DWS/ADS layer tables from the DWD layer tables

2. Storage footprint dropped from about 1 TB in the original Elasticsearch to about 200 GB in Doris

3. The cost of using the data warehouse is reduced

  • Doris supports the MySQL protocol, so data analysts can fetch data by themselves; ad hoc analysis no longer requires synchronizing Elasticsearch data to Hive first.

  • Some detail tables in ES are exposed through Doris external tables, which greatly reduces the query cost for the business side.

  • Because Doris supports joins, logic that previously had to query multiple indices and then combine the results in application memory can now be pushed down into Doris, which improves the stability of the query service and shortens response times.

  • Aggregation speed has improved greatly thanks to materialized views and columnar storage.
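For reference, the Broker Load path mentioned above might look like the following minimal sketch; the label, table, HDFS path, broker name, and credentials are hypothetical placeholders:

```sql
-- Hypothetical Broker Load job: pull an offline Hive/HDFS dump back into Doris
-- for query acceleration, going through an HDFS broker.
LOAD LABEL edu_dw.load_dim_course_20210601
(
    DATA INFILE("hdfs://namenode:8020/warehouse/hive/dim_course/dt=2021-06-01/*")
    INTO TABLE dim_course
    COLUMNS TERMINATED BY "\t"
    (course_id, course_name, subject, teacher_id)
)
WITH BROKER "hdfs_broker" ("username" = "hadoop", "password" = "")
PROPERTIES ("timeout" = "3600");
```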

5 Online performance

At present, dozens of real-time data reports are live, and P99 latency on the online cluster is stable at around 1 s. Some time-consuming analytical queries have also gone live, with P99 on the offline cluster stable at around 1 minute.

We have also completed a standardized data warehouse built on Doris and run a complete data development process end to end, which makes daily iteration on data requirements faster.

6 Summary and planning

The introduction of Doris has advanced data layering for Youdao Premium Course and accelerated the standardization of our real-time data warehouse.

On this basis, the data middle platform team provides a unified data interface for all business lines of the platform. On one hand it relies on Doris to produce real-time data dashboards; on the other hand it regularly synchronizes real-time warehouse data to the downstream offline warehouse for analysts' self-service analysis, providing data support for both real-time and offline scenarios.

For the follow-up work, we have made the following plans:

  1. Generate more upper-layer aggregation tables from the Doris detail tables to reduce computational pressure on Doris and improve the overall response time of the query service
  2. Implement a Flink-based Doris Connector so that Flink can read from and write to Doris directly
  3. Extend Doris on ES to support queries over nested data

Finally, I would like to thank the Baidu Palo (Doris) team for their strong support. They provided a great deal of technical help and resolved our problems very quickly, giving reliable backing to the rapid development of our data middle platform.


About Apache Doris (Incubating)

Apache Doris (Incubating) is an interactive SQL analysis database based on large-scale parallel processing technology. It was contributed by Baidu to the Apache Foundation in 2018 and is currently in the Apache Foundation incubator.

Doris official website: http://doris.incubator.apache.org/master/zh-CN/
Doris Github: https://github.com/apache/incubator-doris
Doris Gitee mirror: https://gitee.com/baidu/apache-doris
Doris developer mail group: [How to subscribe]
Doris WeChat public account:
