Application Practice | Data warehouse efficiency improved across the board: how Tongcheng Digital built its data warehouse on Apache Doris

Introduction: Tongcheng Digital, established in 2015, is the Tongcheng Group's financial services platform for the tourism industry. In 2020, Tongcheng Digital introduced Apache Doris and built data warehouse architecture 2.0 on it, drawn by its rich data access methods, strong parallel computing capabilities, and minimal operation and maintenance overhead. This article describes the evolution from architecture 1.0 to 2.0 and our practical experience with Doris, in the hope that it will be useful to others.

Author: Wang Xing, Senior Big Data Engineer at Tongcheng Digital Science and Technology Co., Ltd.

Business background

Business introduction

Tongcheng Digital is the Tongcheng Group's financial services platform for the tourism industry. Formerly known as Tongcheng Financial Services, it was officially established in 2015. With the vision of "digital technology leading the tourism industry", Tongcheng Digital is committed to empowering China's tourism industry through technology.

At present, Tongcheng Digital's business covers industrial financial services, consumer financial services, financial technology, and digital technology, and has served more than 10 million users across 76 cities.

Figure 1.1 Business Scenario - Business Introduction

Business needs

Business needs fall into four main categories:

  • Dashboards: mainly the real-time business cockpit and T+1 business dashboards.
  • Early warning: mainly risk-control circuit breaking, abnormal fund detection, and traffic monitoring.
  • Analysis: mainly ad hoc data query and analysis, and temporary data extraction.
  • Finance: mainly clearing and payment reconciliation.

Figure 1.2 Business Scenario - Business Requirements

We built our system architecture around these business requirements.

Architecture Evolution 1.0

Workflow

Figure 2.1 Architecture Evolution - Architecture 1.0

Architecture 1.0 was our first-generation architecture, built around StreamSets and Apache Kudu, a combination that was very popular a few years ago.

It collects database binlogs through StreamSets and writes them to Apache Kudu in real time; the data is then queried through Apache Impala and visualization tools. This approach suffers from a long pipeline and from poor reusability of some StreamSets configurations. In addition, Apache Kudu hits performance bottlenecks on multi-table joins and large-table joins, and places high demands on IO.

The real-time computing flow in the lower half of Figure 2.1 works much like the upper half: event-tracking data is sent to Kafka and processed in real time by Flink, and the results are written to the analysis database and to Hive, where they are used for joins in the data warehouse.

Strengths and Weaknesses

Figure 2.2 Architecture Evolution Advantages and Disadvantages

Advantages:

  • Architecture 1.0 uses the full CDH suite. CDH provides many big data components that are already integrated and ready to use, and their configuration is relatively flexible.
  • StreamSets supports visual drag-and-drop development and configuration, so developers took to it readily.

Disadvantages:

  • Too many components are involved, which raises maintenance costs; when data problems occur, the path for troubleshooting and repair is long.
  • The mix of technologies and the long development pipeline raise the learning cost and skill requirements for data warehouse staff. Developers have to switch between several tools, which makes development less smooth and less efficient.
  • Apache Kudu performs poorly on large-table joins.
  • Because the architecture is built on CDH, the offline cluster and the real-time cluster are not separated and compete for resources. Offline batch jobs consume a lot of IO or disk, so the timeliness of real-time data cannot be guaranteed.
  • StreamSets provides alerting, but its job recovery capabilities are limited. Configuring many tasks consumes a lot of JVM resources, which slows recovery.

Architecture Evolution 2.0

Workflow

Because the shortcomings of Architecture 1.0 far outweighed its advantages, in 2020 we surveyed the components available for real-time development and came across Apache Doris. After research and comparison, we decided to introduce Apache Doris in Architecture 2.0.

Figure 3.1 Architecture Evolution - Architecture 2.0

After the introduction of Apache Doris, the following changes have been made to the overall architecture:

  • MySQL data is collected into Kafka through Canal's CDC capability. Because Apache Doris integrates well with Kafka, Routine Load can be used to load the data with little effort.
  • The original offline computing pipeline needed only minor adjustments. Apache Doris supports importing Hive data through Broker Load, so data from the offline cluster can be loaded directly into Doris.

Selection of Doris

Figure 3.2 Architecture 2.0-Selection Doris

During the selection process, Apache Doris impressed us across the board:

  • Data access: Doris provides a rich set of import methods and supports many data sources.
  • Data connection: Doris supports connection methods such as JDBC and ODBC, so BI tools can connect to it easily for visualization. Doris also implements the MySQL protocol layer, so it can be accessed directly from a wide range of client tools.
  • SQL syntax: Doris supports standard SQL with MySQL-compatible syntax, so the learning cost for data warehouse staff is low.
  • MPP parallel computing: built on an MPP architecture, Doris offers strong parallel computing capabilities and handles large-table joins well.
  • Most importantly: the official Doris documentation is thorough, so users can get started quickly.

During the selection research we also evaluated ClickHouse. ClickHouse has high CPU utilization and performs well on single-table queries, but it did not perform well for multi-table queries or under high QPS.

Combining the above factors, we finally chose Apache Doris.

Doris Deployment Architecture

Figure 3.3 Architecture 2.0-Doris Deployment Architecture

The Apache Doris deployment architecture is very simple and consists mainly of FE and BE nodes:

FE is the frontend node. It handles user request access, metadata and cluster management, and query plan generation.

BE is the backend node. It is responsible for data storage and query plan execution.

Doris is very easy to operate and maintain:

In March we carried out a rolling migration of the machines in our data center, moving all 12 Doris nodes within three days. The operation was relatively simple: most of the time went into taking down, moving, and reinstalling the machines, while scaling FE nodes in and out took little time and needed only simple commands such as Add and Drop.

Special attention: try not to run Drop directly against a BE. When Drop is used for forced deletion, Doris prompts for manual confirmation, and after a forced deletion the data cannot be recovered. We therefore recommend taking the node offline with Decommission instead. Once data migration is complete, the BE node can simply be added back, which is more flexible.
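As a rough illustration of the node operations described above, the sketch below builds the SQL statements an operator would send to an FE over the MySQL protocol. `ALTER SYSTEM ADD FOLLOWER`, `ALTER SYSTEM DROP FOLLOWER`, `ALTER SYSTEM ADD BACKEND`, and `ALTER SYSTEM DECOMMISSION BACKEND` are Doris statements; the host names are hypothetical, and the ports assume the default `edit_log_port` (9010) and `heartbeat_service_port` (9050). The code only formats strings and does not connect to a cluster.

```python
"""Sketch: build Doris node-management statements (strings only, no cluster access)."""

FE_EDIT_LOG_PORT = 9010   # assumed default edit_log_port of an FE
BE_HEARTBEAT_PORT = 9050  # assumed default heartbeat_service_port of a BE


def add_follower(host: str, port: int = FE_EDIT_LOG_PORT) -> str:
    """Statement that adds an FE follower node."""
    return f'ALTER SYSTEM ADD FOLLOWER "{host}:{port}";'


def drop_follower(host: str, port: int = FE_EDIT_LOG_PORT) -> str:
    """Statement that removes an FE follower node."""
    return f'ALTER SYSTEM DROP FOLLOWER "{host}:{port}";'


def add_backend(host: str, port: int = BE_HEARTBEAT_PORT) -> str:
    """Statement that adds a BE node."""
    return f'ALTER SYSTEM ADD BACKEND "{host}:{port}";'


def decommission_backend(host: str, port: int = BE_HEARTBEAT_PORT) -> str:
    """Statement that migrates data off a BE before it leaves the cluster."""
    return f'ALTER SYSTEM DECOMMISSION BACKEND "{host}:{port}";'


if __name__ == "__main__":
    # Hypothetical hosts: retire the old BE gracefully, then register the new one.
    print(decommission_backend("be-old-01"))
    print(add_backend("be-new-01"))
```

Decommissioning lets Doris move the node's data to the remaining BEs first, which is why it is the safer choice compared with a forced drop.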

Doris real-time system architecture

Figure 3.4 Doris real-time system architecture

Data source: In the real-time system architecture, data comes from business lines such as industrial finance, consumer finance, and risk control, and is collected through Canal and API interfaces.

Data collection: Canal, managed through Canal-Admin, collects the data and sends it to the Kafka message queue, from which it is loaded into the Doris cluster through Routine Load.

Doris data warehouse: The Doris cluster is organized into three data warehouse layers: the DWD detail layer, which uses the Unique model, and the DWS summary layer and ADS application layer, which use the Aggregate model.

Data application: The architecture serves three purposes: real-time dashboards, ad hoc data analysis, and data services.

Features of Doris New Data Warehouse

Figure 3.5 Features of Doris' new warehouse

Data import is simple, and we use three different import methods depending on the scenario:

  • Routine Load: mainly used for business data access, running as a resident task that consumes Kafka. Once a Routine Load task is submitted, a resident job inside Doris consumes Kafka in real time, continuously reading data and importing it into Doris.
  • Broker Load: used for offline import tasks such as base dimension tables and historical data.
  • Insert Into: used for scheduled batch jobs that process DWD-layer data into the DWS and ADS layers.

Doris's data models improve our development efficiency:

  • The Unique model is used when loading data into the DWD layer, which effectively prevents duplicate records when data is consumed more than once.
  • The Aggregate model is used for aggregation. In Doris, Aggregate supports aggregation types such as Sum, Replace, Min, and Max. Relying on the Aggregate model during aggregation reduces the amount of SQL code, because Sum, Min, Max, and similar operations no longer have to be written by hand. This makes the step from the DWD layer to the DWS/ADS layers much easier.
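To make the two models more concrete, here is a minimal in-memory sketch of their semantics only: rows that share the same key columns are merged, and each value column is combined with its declared aggregation type. It is a simplified simulation written for this article, not how Doris implements storage; the function name `merge_rows` and the sample columns are hypothetical. The Unique model behaves like the case where every value column uses Replace.

```python
"""Sketch: simulate how rows with the same key are merged (semantics only)."""
from typing import Any, Callable, Dict, Iterable, List, Tuple

# How a new value is combined with the value already stored for the same key.
AGGREGATORS: Dict[str, Callable[[Any, Any], Any]] = {
    "SUM": lambda old, new: old + new,
    "MIN": min,
    "MAX": max,
    "REPLACE": lambda old, new: new,  # the later row wins
}


def merge_rows(
    rows: Iterable[Dict[str, Any]],
    key_columns: List[str],
    value_aggs: Dict[str, str],
) -> List[Dict[str, Any]]:
    """Merge rows that share the same key, column by column."""
    merged: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in merged:
            merged[key] = dict(row)  # first row for this key is stored as-is
            continue
        current = merged[key]
        for col, agg in value_aggs.items():
            current[col] = AGGREGATORS[agg](current[col], row[col])
    return list(merged.values())


if __name__ == "__main__":
    # Hypothetical DWD rows rolled up into a daily summary.
    detail = [
        {"dt": "2022-06-01", "product": "loan", "amount": 100, "max_amount": 100},
        {"dt": "2022-06-01", "product": "loan", "amount": 50, "max_amount": 50},
    ]
    print(merge_rows(detail, ["dt", "product"], {"amount": "SUM", "max_amount": "MAX"}))
```

Because the merge rule is declared once per column, the job that fills the summary layer can be a plain insert of detail rows rather than a hand-written GROUP BY for every table.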

Doris has a low barrier to entry and high query efficiency:

  • It supports the MySQL protocol and standard SQL, and its query syntax is highly compatible with MySQL, which is friendly to analysts.
  • It supports materialized views and Rollup materialized indexes. The idea behind a materialized view is similar to a Cube, and its pre-computation resembles Kylin's approach of trading space for time: a special table is generated underneath, and queries that hit the materialized view return quickly.

Special note: materialized views are helpful, but each one occupies additional storage space, and having too many of them reduces data import efficiency.

Doris has a minimalist system architecture with low operation and maintenance costs:

  • The system has only two modules, BE and FE, does not depend on third-party components such as ZooKeeper, and is easy to deploy.
  • The FE and BE processes are monitored and are restarted promptly when an exception occurs.

Doris Experience Summary

Figure 4.1 How to make Doris easier to use

While using Apache Doris, we have gathered some experience on making it friendlier for developers. Developers care most about the following areas:

  • Development: how to connect external data to Doris and implement ETL quickly, which determines how fast developers can deliver reports.
  • Scheduling management: developers do not want tasks to fail or become unstable after they go live, so task scheduling must be stable and recoverable.
  • Data query: the production network and the office network are separated, so the office network cannot connect directly to production. A desktop client cannot get around this separation; only a web-based solution can. How to query and analyze data safely and conveniently is therefore a concern for developers.
  • Cluster management: abnormal situations in the cluster should be detected and handled automatically and promptly.

In short, we want to build a platform that is efficient, high-quality, and stable.

Doris development optimization

To address these developer concerns, we have made a number of development optimizations.

Data access

For data access, we have semi-automated the work so that components can be generated quickly. Routine Load scripts are generated from the data source and table, and a Routine Load task can be created quickly by adjusting the Kafka broker or topic. Broker Load works similarly: after the data warehouse source is selected, the script needed for the Broker Load is generated. Loading data into Doris requires the target table to exist in advance, so the platform also generates CREATE TABLE statements from the source.
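The script generation described above can be sketched as a small template function. This is an illustration under stated assumptions rather than the platform's actual code: the function name `build_routine_load`, the database, table, topic, and broker names are all hypothetical, and the messages are assumed to be flat JSON objects whose keys match the column names. The statement follows the `CREATE ROUTINE LOAD ... FROM KAFKA` form with the `kafka_broker_list` and `kafka_topic` properties.

```python
"""Sketch: generate a Routine Load statement from table metadata."""
from typing import List, Optional


def build_routine_load(
    db: str,
    table: str,
    brokers: List[str],
    topic: str,
    columns: List[str],
    job_name: Optional[str] = None,
    group_id: Optional[str] = None,
) -> str:
    """Return a CREATE ROUTINE LOAD statement for a Kafka topic (JSON messages assumed)."""
    job = job_name or f"{table}_load"  # hypothetical naming convention
    kafka_props = {
        "kafka_broker_list": ",".join(brokers),
        "kafka_topic": topic,
    }
    if group_id:
        kafka_props["property.group.id"] = group_id
    kafka_lines = ",\n".join(f'    "{k}" = "{v}"' for k, v in kafka_props.items())
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f"COLUMNS({', '.join(columns)})\n"
        "PROPERTIES\n"
        "(\n"
        '    "format" = "json"\n'
        ")\n"
        "FROM KAFKA\n"
        "(\n"
        f"{kafka_lines}\n"
        ");"
    )


if __name__ == "__main__":
    # Hypothetical source metadata for one table.
    print(
        build_routine_load(
            db="dwd",
            table="dwd_orders",
            brokers=["kafka-1:9092", "kafka-2:9092"],
            topic="orders_binlog",
            columns=["order_id", "user_id", "amount"],
        )
    )
```

Switching a job to another cluster or topic then only means changing the `brokers` or `topic` argument, which matches the "modify the broker or topic" workflow described above.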

Figure 5.1 Data Platform - Doris Development

All of this relies on the underlying metadata: once the metadata for a given data source has been obtained, tasks can be generated quickly.

Task submission and maintenance

After a task is generated, the platform wraps the Routine Load submission. Because Routine Load runs as a resident job, we only need to submit it and its status becomes Running. Abnormal statuses are detected by the monitoring described in the next section.

Figure 5.2 Data Platform - Doris Development

Monitoring and Management

We can query the submitted Routine Load jobs and check them for abnormalities. We can also add the Routine Load jobs we care about to monitoring, which scans them automatically at regular intervals. When a problem occurs, it raises an alert and tries to bring the task back up.
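A periodic scan of this kind can be sketched as a pure decision function. It assumes the rows come from `SHOW ROUTINE LOAD` and have been converted to dictionaries with `Name` and `State` keys; the function name `plan_recovery` and the job names are hypothetical. Paused jobs are resumed with `RESUME ROUTINE LOAD FOR`, while stopped or cancelled jobs are only reported, since they have to be submitted again.

```python
"""Sketch: decide what to do with watched Routine Load jobs after a scan."""
from typing import Dict, Iterable, List, Set, Tuple


def plan_recovery(
    jobs: Iterable[Dict[str, str]],
    watched: Set[str],
) -> Tuple[List[str], List[str]]:
    """Return (resume statements, names of jobs that must be resubmitted)."""
    resume_sql: List[str] = []
    needs_resubmit: List[str] = []
    for job in jobs:
        name, state = job["Name"], job["State"]
        if name not in watched:
            continue  # only jobs added to monitoring are handled
        if state == "PAUSED":
            resume_sql.append(f"RESUME ROUTINE LOAD FOR {name};")
        elif state in ("STOPPED", "CANCELLED"):
            needs_resubmit.append(name)  # cannot be resumed, alert and recreate
    return resume_sql, needs_resubmit


if __name__ == "__main__":
    # Hypothetical scan result.
    scan = [
        {"Name": "dwd_orders_load", "State": "RUNNING"},
        {"Name": "dwd_users_load", "State": "PAUSED"},
        {"Name": "dwd_pay_load", "State": "CANCELLED"},
    ]
    print(plan_recovery(scan, {"dwd_orders_load", "dwd_users_load", "dwd_pay_load"}))
```

Keeping the decision separate from the database connection makes the rule easy to test and lets the scheduler decide how often to run the scan.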

Broker Load tasks can be monitored as well. Because a Broker Load label cannot be reused, we generate a UUID for each label, which spares users from having to think up unique names.
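The label scheme can be sketched as follows. This is a minimal illustration, not the platform's code: `new_label` and `build_broker_load` are hypothetical helpers, the HDFS path and broker name are placeholders, and the label is restricted to letters, digits, and underscores as a conservative assumption. The statement follows the `LOAD LABEL ... WITH BROKER` form of Broker Load.

```python
"""Sketch: generate a Broker Load statement with a UUID-based label."""
import re
import uuid

# Conservative assumption about allowed label characters.
LABEL_PATTERN = re.compile(r"^[A-Za-z0-9_]+$")


def new_label(table: str) -> str:
    """Return a label that is unique per submission."""
    return f"{table}_{uuid.uuid4().hex}"


def build_broker_load(
    db: str,
    table: str,
    hdfs_path: str,
    broker_name: str,
    file_format: str = "orc",
) -> str:
    """Return a LOAD LABEL statement that imports files through a broker."""
    label = new_label(table)
    return (
        f"LOAD LABEL {db}.{label}\n"
        "(\n"
        f'    DATA INFILE("{hdfs_path}")\n'
        f"    INTO TABLE {table}\n"
        f'    FORMAT AS "{file_format}"\n'
        ")\n"
        f"WITH BROKER {broker_name};"
    )


if __name__ == "__main__":
    # Hypothetical Hive table location and broker name.
    print(
        build_broker_load(
            db="dwd",
            table="dim_city",
            hdfs_path="hdfs://namenode:8020/warehouse/dim_city/*",
            broker_name="hdfs_broker",
        )
    )
```

Prefixing the UUID with the table name keeps labels unique while still making it easy to find a table's load history.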

Figure 5.3 Data Platform - Doris Development

As shown in Figure 5.3, the page also lets us pause and stop Routine Load jobs, which makes development and management easier.

Self-developed query page, integrated Doris Help function

Because the production and office network segments are isolated, we can only query through the web. We previously tried integrating Doris with Hue for querying. Doris can connect to Hue through the MySQL protocol, but with Hue everyone could query Doris data, so security could not be guaranteed and our permission requirements could not be met.

Figure 5.4 Data Platform - Doris Data Query

So we developed a query page within our own platform to solve this problem. The left side of the page lists all tables under each database, and the right side holds the query editor and the query results. In effect, it is a self-developed client similar to Navicat.

We have also integrated the Doris Help function, which offers assistance when you are unsure how to use Doris. With Doris Help integrated, we can search by keyword for syntax and examples.

Even without this integration, Help is available on the web page built into the FE node, which shows information about the whole cluster and includes the Help function. With our own query page and the Doris Help integration, it can be used directly, without having to connect to the FE with the Admin account.

Doris cluster monitoring page

We also developed a Doris cluster monitoring page that shows the node status of FE, BE, and Broker. When an abnormal situation occurs in the cluster, the monitoring system sends an automatic alert and tries to bring the cluster back up. The health of each node can also be checked on the page.

Figure 5.5 Data Platform - Doris Cluster Monitoring

These upper-layer applications rely mainly on the APIs and commands that Doris itself provides. What we have done is integrate those commands and present them to users in a friendlier way.

Benefits of the new architecture

Figure 6.1 Benefits of the new architecture

  • Data access: In the earlier StreamSets-based process, Kudu tables had to be created manually. For lack of tooling, creating a table and its task took 20-30 minutes. Now data can be connected quickly through the platform and its generated statements. The access process for each table has been shortened from 20-30 minutes to 3-5 minutes, roughly a 5-6x improvement in efficiency.
  • Data development: In the earlier architecture, aggregations and similar operations required a lot of lengthy SQL code. With Doris, we can use the built-in data models such as Unique and Aggregate, as well as the Duplicate model, which suits log scenarios well, and this greatly speeds up ETL development.
  • Query analysis: Doris provides materialized views and Rollup materialized indexes, which improve query efficiency. Doris also implements many optimizations for large-table joins, such as Runtime Filter and other join optimization strategies. By comparison, Apache Kudu requires deeper tuning experience to use well.
  • Data reports: With Kudu, report queries initially took 1-2 minutes to render, while Doris responds within seconds or even milliseconds.
  • Environment maintenance: Doris does not have the complexity of the Hadoop ecosystem, the overall pipeline is clearer, and the maintenance cost is much lower than that of Hadoop. Doris is especially convenient to operate during cluster migration.

Future outlook

Figure 7.1 Future Outlook

  • Try Doris Manager: Doris Manager is being promoted in the community, and we are preparing to adopt it for cluster maintenance and management and to take an active part in it.
  • Implement data access based on Flink CDC: The current architecture does not use Flink CDC. It still relies on Canal collecting data into Kafka and Doris loading from Kafka, which makes for a relatively long pipeline. Flink CDC could simplify the overall architecture further, but it still requires writing a certain amount of code, which is not friendly for BI staff to use directly. We want data warehouse staff to be able to work with SQL alone or through operations on a page. In planning architecture 3.0, we intend to introduce Flink CDC and extend the upper-layer applications. Our approach to introducing Flink CDC follows the idea that "fast is slow, slow is fast". The Flink community is developing very quickly, and only after learning from others' experience can we introduce it smoothly and then iterate on and optimize the architecture along the way.
  • Keep up with the community release plan: The Doris version we are running is relatively old, and newer versions bring major improvements in memory management, query performance, and other areas. We will follow the community release rhythm to upgrade the cluster and try out new features.
  • Strengthen the supporting systems: Our metrics management, such as the maintenance and management of report metadata and business metadata, still needs improvement. For data quality, a monitoring function already exists, but platform-wide monitoring and automated data monitoring still need to be strengthened.

Join the community

We welcome more people who love open source to join the Apache Doris community and take part in building it. Besides submitting PRs or issues on GitHub, you are also welcome to take part in the community's day-to-day activities, for example:

Take part in the community's call for articles and write pieces such as technical analyses and application practices; speak at online and offline Doris community events; answer questions in the Doris community user groups; and so on.

Finally, we invite more open source technology enthusiasts to join the Apache Doris community, grow together, and build the community ecosystem.

SelectDB is an open source technology company that provides the Apache Doris community with a team of full-time engineers, product managers, and support engineers, with the aim of growing the open source community ecosystem and creating an international industry standard for real-time analytical databases. SelectDB, its new-generation cloud-native real-time data warehouse built on Apache Doris, runs on multiple clouds and offers users and customers out-of-the-box capabilities.

Related Links:

SelectDB official website:

https://selectdb.com

Apache Doris official website:

http://doris.apache.org

Apache Doris Github:

https://github.com/apache/doris

Apache Doris developer mailing group:

[email protected]

Origin my.oschina.net/u/5735652/blog/5550562