Tongcheng Digital builds a unified real-time data warehouse on Apache Doris, speeding up queries by dozens of times

Guide to this article:

Tongcheng Digital is a financial technology service platform for the tourism industry under Tongcheng Group, providing digital financial technology services for upstream and downstream enterprises and individual consumers. In recent years, as Tongcheng Digital's business has expanded and its user base has grown, an efficient and reliable one-stop data center has become indispensable. To help business personnel improve the efficiency and quality of data development, Tongcheng Digital went through three generations of architecture evolution and finally introduced Apache Doris to build a unified real-time data warehouse. In subsequent practice, the real-time data warehouse was turned into a platform. The resulting one-stop Ark data platform gives business personnel an easy-to-use, easy-to-maintain system with self-service task development, flexible release, easy querying, and continuous monitoring.

Author: Chen Song, Big Data Platform Engineer of Tongcheng Digital

Tongcheng Digital, formerly known as Tongcheng Financial Services, is a financial technology service platform for the tourism industry under Tongcheng Group. The company builds its competitiveness on deep roots in the tourism ecosystem and provides digital financial technology services for upstream and downstream enterprises and individual consumers in the tourism industry chain. Its business covers tourism industry chain financial services, tourism consumer finance, payment technology, and other sectors. It has served more than 10 million users and covers 76 cities. Tongcheng Digital has completed its first round of strategic investment and has joined forces with a number of industrial financial institutions to develop a new business service platform for the tourism industry.

In recent years, as Tongcheng Digital's business has expanded and its user base has grown, we have increasingly needed a reliable and efficient data center to help the company understand its operations and formulate strategy. This includes, but is not limited to, analysis tools such as real-time business report dashboards, real-time business metric alerts, marketing user profiles and tags, and real-time monitoring for financial risk control. We therefore put more emphasis on building a real-time data warehouse, hoping it would help business personnel improve the efficiency and quality of data development and provide strong backing for business analysis.

With this in mind, we began exploring real-time data warehousing. To date, our data warehouse architecture has gone through three generations. After running the first-generation offline architecture and the second-generation Lambda architecture, we analyzed our requirements, researched the options, and finally introduced Apache Doris to build a unified real-time data warehouse. This article describes the evolution of the three generations in detail, explains how we built the one-stop data platform Ark on Apache Doris, and shares the cost reductions and efficiency gains we achieved in business use, system maintenance, and data warehouse development.

Early Architecture Evolution

In the early days of big data adoption, Tongcheng Digital built an offline data warehouse with Apache Hive at its core and used Hive to layer the warehouse. After data enters the offline warehouse from the source, it is processed through the ODS, DWD, and DWS layers and then written to application databases such as MySQL, Redis, and HBase for use by the reporting platform. Although this architecture offers low coupling and high stability, its disadvantages are also obvious. The main one is that a partial update requires merging the full data set. The process is lengthy, so updates take longer and data timeliness cannot be guaranteed. As data volume grew and partial updates became more frequent, the low computing efficiency and poor resource utilization of this architecture became increasingly apparent.

To address the problems of the first-generation architecture, we upgraded it. The second-generation architecture is a typical Lambda architecture. It keeps the original offline data warehouse and adds a real-time data warehouse built around Apache Flink and Apache Kafka. In this architecture, the offline link processes data in batches and is responsible for periodically correcting data that was processed incorrectly. The newly added real-time link uses Flink to stream-process the data sources, uses Kafka to layer the data warehouse, and finally writes to the application databases.

Although this architecture solves the low data timeliness of the first-generation architecture, long-term operation revealed several pain points:

  • Complex architecture and difficult operation and maintenance: Two links run at the same time. The real-time link processes data streams with Apache Flink and Apache Kafka, the offline link runs batch jobs with Apache Hive and Apache Spark, and the dimension tables of both links are stored in MySQL or Redis. As a result, the overall architecture involves too many components and the data processing flow is too complicated. In addition, the architecture computes the same data twice, which increases overall resource usage, raises operation and maintenance costs, and makes later maintenance harder.
  • High data development cost: The real-time data warehouse relies entirely on Apache Kafka for warehouse layering, and Kafka limits how long data is retained. Each new data import task requires additional development work, which greatly increases development cost.
  • Low data consistency: The same data is stream-processed in the real-time data warehouse and batch-processed in the offline data warehouse. The processing logic differs between the two, so data consistency and accuracy cannot be guaranteed. Because management systems from the first-generation architecture, such as data lineage and data quality, cannot be reused, an out-of-order problem in the real-time link requires replaying the full log to backtrack the data, which makes data repair more complex.

Comparing Apache Doris and ClickHouse

To solve the problems of the early architecture for good, we carried out in-depth product research before introducing a new architecture. We found that MPP databases can support unified real-time data analysis and can resolve both the complexity of the Lambda architecture and its inability to guarantee data consistency. Within this product segment, Apache Doris and ClickHouse were the closest fit for our business needs. We compared the two and found that Doris performed better and met our selection requirements. The details are as follows:

  • Ease of use: ClickHouse's SQL dialect is not fully standard SQL, whereas Apache Doris supports standard SQL and is compatible with the MySQL protocol, so developers can get started without extra learning cost.
  • Excellent Join performance: Doris supports distributed Joins, with high query flexibility and excellent performance. ClickHouse did not meet our current business needs because of its Join query limitations, function limitations, and poor maintainability.
  • Data import: Doris has complete data import functionality and supports multiple import methods such as Routine Load, Stream Load, and JDBC Insert Into. It keeps writes stable even with massive data volumes, and in our comparison its import performance was much higher than ClickHouse's.
  • Operation and maintenance: Doris has a streamlined architecture with only two roles, FE and BE, so deployment is simple and fast. Doris also scales out quickly and supports rolling upgrades, which only require replacing the relevant installation packages. ClickHouse depends heavily on additional components and needs a lot of preparatory work for both use and expansion, which requires a professional team to support daily development and operation and maintenance.

More importantly, Doris can support real-time data services, interactive data analysis, and offline data processing at the same time. Its Multi-Catalog feature provides federated query capabilities and can read from multiple data sources, which improves data accuracy and quality and simplifies task development. It also lets developers find the data they need more quickly, reducing query time and cost. The high runtime performance and low development cost of Apache Doris therefore match our needs for building a one-stop data platform.

A new generation of unified real-time data warehouse

After introducing Apache Doris, we refactored the architecture. As shown in the figure above, we use Apache Doris for unified data storage and computing. It completely replaces the original offline architecture and Lambda architecture and forms a one-stop data warehouse, which both ensures data consistency and simplifies the architecture, greatly reducing operation and maintenance costs. In addition, where data sources enter the real-time data warehouse, we added Input, a unified data integration engine that synchronizes data from multiple heterogeneous sources and unifies data ingestion. In short, introducing Doris unified data integration, storage, computing, and output for us, giving us a truly unified real-time data warehouse.

One-stop data platform based on Apache Doris

On top of the new-generation data warehouse, we built the one-stop data platform Ark. Our goal is for this platform to provide integrated services such as task development, task submission and testing, task scheduling and monitoring, data query, and cluster monitoring, so that task development becomes more efficient and task monitoring more reliable in day-to-day business. Our requirements were:

  • Data development: When external data is connected to Apache Doris, ETL development should be efficient and reports should be produced faster.
  • Scheduling management: After business personnel develop and launch tasks, the platform must keep task scheduling stable and be able to recover scheduling after failures.
  • Data query: The production and office networks are isolated from each other, so the office network cannot connect to the production network directly and queries can only cross the boundary through the Web. The platform should provide a safe and convenient way to query and analyze data.
  • Cluster management: When the cluster behaves abnormally, the platform should detect it in time and handle it automatically.

Generate task scripts with one click to improve task development efficiency

Apache Doris can connect to a rich set of data sources. Using this capability, the Ark platform reads metadata from different data sources, turns it into scripts, and generates tasks quickly. For data ingestion, the platform provides semi-automated code generation through rapid-generation components. As shown in the figure above, entering the data source or table information in the platform automatically generates a Routine Load script. You only need to change the Apache Kafka topic in this script, and the Routine Load task is ready to submit. Task development for Broker Load works the same way: after selecting the corresponding data warehouse source, the platform generates the script required for Broker Load. By building on Doris's ability to ingest multi-source heterogeneous data, the platform can quickly generate code and support efficient task development for both Routine Load and Broker Load.
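
The script generation described above can be sketched roughly as follows. This is a minimal illustration, not the Ark implementation: the `KafkaSource` class, the `render_routine_load` function, and the sample database, table, column, broker, and topic names are all hypothetical. It assumes a JSON-formatted Kafka topic and emits only a small subset of the properties a real Routine Load job would normally set.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KafkaSource:
    """Hypothetical description of a Kafka source registered in the platform."""
    brokers: str
    topic: str


def render_routine_load(db: str, table: str, columns: List[str],
                        source: KafkaSource, job_name: str = "") -> str:
    """Render a Doris CREATE ROUTINE LOAD statement from table metadata.

    Assumes the topic carries JSON messages; a real job usually needs more
    properties (concurrency, error tolerance, offsets) than are shown here.
    """
    job = job_name or f"{table}_routine_load"
    column_list = ", ".join(columns)
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f"COLUMNS({column_list})\n"
        "PROPERTIES (\n"
        '    "format" = "json"\n'
        ")\n"
        "FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{source.brokers}",\n'
        f'    "kafka_topic" = "{source.topic}"\n'
        ");"
    )


# Example: all names below are placeholders, not values from the article.
script = render_routine_load(
    db="ods",
    table="orders",
    columns=["order_id", "user_id", "amount"],
    source=KafkaSource(brokers="kafka-1:9092", topic="orders_topic"),
)
print(script)
```

With a template like this, switching a task to another Kafka topic only means changing the `topic` field, which mirrors the "modify the topic and submit" workflow described above.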

Automatic scheduling and monitoring to ensure the normal operation of tasks

After a task is developed and submitted, the platform can perform routine scheduling operations on Routine Load and Broker Load tasks, such as querying their status and checking for exceptions. Tasks that need special attention can be added to a monitoring list, and the system then scans them automatically at regular intervals. When a problem occurs, it raises an alert and tries to restart the task. Because a Routine Load job is a long-running resident process, the platform monitors these jobs continuously. For Broker Load and other periodic tasks, the platform raises warnings for failed tasks after each scheduled scan.
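
One way to picture the scan-and-restart logic is the sketch below. It is an assumption-laden illustration rather than Ark's code: the `plan_actions` function and the simplified `name`/`state` rows are hypothetical stand-ins for whatever the platform reads from `SHOW ROUTINE LOAD`. The state names and the `RESUME ROUTINE LOAD FOR` statement are Doris's own. The sketch treats only `PAUSED` jobs as restartable and only reports jobs that have reached a final state.

```python
from typing import Dict, List, Set, Tuple

# Routine Load job states reported by Doris.
RECOVERABLE = {"PAUSED"}              # can be resumed in place
TERMINAL = {"STOPPED", "CANCELLED"}   # cannot be resumed, only recreated


def plan_actions(jobs: List[Dict[str, str]],
                 watched: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (SQL statements to run, alert messages) for watched jobs.

    `jobs` is a simplified, hypothetical view of SHOW ROUTINE LOAD output:
    each row holds a fully qualified job name and its current state.
    """
    statements: List[str] = []
    alerts: List[str] = []
    for job in jobs:
        name, state = job["name"], job["state"]
        if name not in watched:
            continue  # only jobs on the monitoring list are handled
        if state in RECOVERABLE:
            statements.append(f"RESUME ROUTINE LOAD FOR {name};")
            alerts.append(f"{name} is {state}; attempting restart")
        elif state in TERMINAL:
            alerts.append(f"{name} is {state}; manual recreation needed")
    return statements, alerts


# Example scan with placeholder job names.
jobs = [
    {"name": "ods.orders_load", "state": "RUNNING"},
    {"name": "ods.users_load", "state": "PAUSED"},
    {"name": "ods.legacy_load", "state": "CANCELLED"},
    {"name": "ods.unwatched_load", "state": "PAUSED"},
]
statements, alerts = plan_actions(
    jobs, watched={"ods.orders_load", "ods.users_load", "ods.legacy_load"}
)
print(statements)
print(alerts)
```

Keeping the decision logic separate from the database connection makes the scan easy to run on a timer and easy to test without a live cluster.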

Safe and convenient visual query analysis

Because the production and office network segments are isolated, we can only run queries through the Web, which used to be cumbersome and inconvenient. To address this, we first tried integrating Hue and connecting it to Doris through the MySQL protocol for data queries. Although this simplified the query process, it carried potential data security risks.

Tongcheng Digital therefore developed its own internal query and analysis page with permission management to make queries secure. The Doris Help function is also integrated into the Ark platform, so business personnel can look up SQL syntax and examples by keyword and resolve routine query problems themselves. This lowers learning costs and makes queries more convenient for internal users.

Complete and intelligent cluster monitoring

The status of FE, BE, and Broker nodes can be monitored in real time on the Apache Doris cluster monitoring page. When the cluster behaves abnormally, the monitoring system sends an automatic alert and tries to bring the cluster back up, handling the problem early so that it does not grow into a bigger one. The cluster monitoring dashboard also helps us observe node health and judge the health of the cluster from the status of the FE nodes.
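
A health check of this kind could look roughly like the sketch below. It is a hedged illustration only: the `dead_nodes` and `cluster_report` functions are hypothetical, and the rows are assumed to be dictionaries built from `SHOW FRONTENDS` and `SHOW BACKENDS` output with `Host`, `Alive`, and `IsMaster` keys (the exact column names can differ between Doris versions). Broker nodes and the restart step are left out.

```python
from typing import Dict, List


def dead_nodes(rows: List[Dict[str, str]]) -> List[str]:
    """Return the hosts whose Alive flag is anything other than 'true'."""
    return [
        row["Host"]
        for row in rows
        if str(row.get("Alive", "")).lower() != "true"
    ]


def cluster_report(frontends: List[Dict[str, str]],
                   backends: List[Dict[str, str]]) -> Dict[str, object]:
    """Summarize node health from FE and BE status rows.

    The cluster is treated as healthy only if every node is alive and a
    live FE reports itself as master (an assumption of this sketch).
    """
    dead_fe = dead_nodes(frontends)
    dead_be = dead_nodes(backends)
    has_live_master = any(
        str(fe.get("IsMaster", "")).lower() == "true" and fe["Host"] not in dead_fe
        for fe in frontends
    )
    return {
        "dead_fe": dead_fe,
        "dead_be": dead_be,
        "healthy": has_live_master and not dead_fe and not dead_be,
    }


# Example with placeholder hosts: one follower FE is down.
frontends = [
    {"Host": "fe-1", "Alive": "true", "IsMaster": "true"},
    {"Host": "fe-2", "Alive": "false", "IsMaster": "false"},
]
backends = [
    {"Host": "be-1", "Alive": "true"},
    {"Host": "be-2", "Alive": "true"},
]
report = cluster_report(frontends, backends)
print(report)
```

A monitoring job could run this report on a schedule and send an alert, or trigger a restart script, whenever `healthy` is false.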

Benefits and Outcomes

Tongcheng Digital has now built a highly unified real-time data warehouse on Apache Doris, running on dozens of Doris nodes. We have also integrated Doris's capabilities into the Ark one-stop data platform, fulfilling the goals we originally set for Ark. Introducing Doris has brought us the following benefits and results:

  • Shorter development cycle: With the platform's one-click development function, business personnel can develop tasks on their own without submitting requirements to the big data team. Development time has dropped from about half an hour to only three minutes, which significantly shortens the task development cycle and improves development efficiency tenfold.
  • Flexible data development: With the Ark one-stop data platform, data can be developed and analyzed flexibly, and requirements can be launched quickly.
  • Unified data processing: Doris unifies data import, storage, and computing, ensures data consistency, and delivers a truly unified real-time warehouse.
  • Higher query efficiency: Response times have dropped from minutes to seconds or even milliseconds, improving query efficiency by dozens of times.
  • Lower learning costs: Because Apache Doris is compatible with the MySQL protocol and uses standard SQL, it is easy to use. Business personnel can work with big data as they would with an ordinary database, which further reduces learning costs.
  • Lower operation and maintenance costs: Doris is easy to deploy, and its streamlined architecture keeps the whole pipeline simple and easy to maintain.

Future Plans

In the future, we hope to build a real-time data warehouse ecosystem based on Apache Doris and use it at scale within Tongcheng Digital. We will continue to build and optimize the one-stop real-time data warehouse architecture on Apache Doris and strengthen its unified computing and storage and its stream-batch integration capabilities. We will also keep iterating on the Ark one-stop data platform so that the whole real-time data warehouse system becomes more timely, stable, and flexible. This includes improving the graphical features of the Ark data integration platform, adding data synchronization between more heterogeneous data sources, and enhancing the engine's data processing capability.

Secondly, we will keep following Apache Doris's data lake analysis capabilities. We hope to bring multi-source heterogeneous data into the lake to achieve unified storage and unified multi-paradigm computing, and ultimately to provide services externally through a unified Doris API. In addition, we are very interested in the early-adopter testing of Apache Doris 2.0, especially the optimized inverted index and the JSONB data type. In subsequent architecture optimization, we will consider using the inverted index to replace the existing log system and using the JSON data type to further improve query capability.

Special thanks to the SelectDB technical team for their long-term, timely responses and technical support. In the future, we will actively take part in community contributions and activities and grow together with the community. We welcome everyone to choose and use Doris, and we believe Doris will not let you down!


Origin my.oschina.net/u/5735652/blog/10084245