In Depth | Saving 60% of development hours: the implementation of the offline/real-time integrated data warehouse at Ctrip Travel

About the Author

Chengrui is a back-end development expert at Ctrip, focusing on real-time data processing, AI platform infrastructure, and data products.

This article introduces the implementation and practice of the offline/real-time integrated data warehouse in the Ctrip Travel team, covering business pain points, business goals, project structure, and project construction.

1. Business pain points

As the demand for real-time data grows, more and more business pain points that the offline data warehouse alone cannot address have been exposed, for example:

  • Real-time requirements are built in a siloed, "chimney" development model

  • Poor reusability of intermediate data

  • Disconnected from online data development

  • Long cycle from data production to data serving

  • Real-time tables/tasks are messy and hard to manage

  • Lack of real-time lineage, basic metadata, monitoring, etc.

  • No tooling for real-time data quality monitoring

  • High barrier to operating and maintaining real-time tasks, and a weak quality system

These typical problems pose great challenges to developer efficiency, quality, and management, and we urgently needed a systematic platform to solve them.

2. Business goals

Focusing on the known business pain points, and relying on the company's existing computing resources, storage resources, offline data warehouse standards, etc., our goal is to build capabilities at the levels of developer efficiency, quality, and management, as shown below:

2.1 Developer efficiency

  • Standardize offline and online data development, e.g. unified data processing, offline/online code compatibility, unified computing power, etc.

  • Minute-level data service deployment, with visual operations such as data API registration, publishing, and debugging that BI analysts can perform on their own.

2.2 Quality level

  • DQC on data content, e.g. whether the content is correct, complete, timely, and consistent with online data

  • Alerting on data tasks, e.g. delays, back pressure, throughput, and whether system resources are sufficient

2.3 Management level

  • A visual management platform, covering full-link lineage, data tables/tasks, quality coverage, and other basic information

  • End-to-end specifications for the integrated data warehouse, e.g. data modeling, data quality, data governance, and storage selection specifications

3. Project structure

The project structure is shown below. The system mainly consists of modules for raw data -> data development -> data services -> data quality -> data management, providing second-level real-time data processing and minute-level data service deployment for real-time data developers and for back-end data service development and consumption.

Data from different sources is first normalized by standardized ETL components and pre-processed by the traffic forwarding tool; the stream-batch fusion tool and the business data processing modules then build the data layer by layer and domain by domain. The produced data is exposed directly through the data service module as data APIs and is ultimately consumed by business applications. The whole link is backed by a corresponding quality and operations guarantee system.

4. Project construction

4.1 Data development

This module mainly includes data preprocessing tools and data development solution selection.

4.1.1 Traffic forwarding tool

Because there are many entry points and the traffic volume is large, the following main problems exist:

  • The same dimension may have multiple data sources and parsing methods.

  • The tracking-event data actually used accounts for only about 20% of the total; consuming the full stream is a serious waste of resources, and every downstream job repeats this consumption.

  • When new tracking events are added, data processing requires developer intervention (in extreme cases every downstream consumer is affected).

As shown in the figure below, the traffic forwarding tool can dynamically connect to multiple data sources, perform simple processing, standardize the valid data, and write it downstream, which solves the above problems. A minimal sketch of this forwarding logic follows.
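The sketch below is illustrative rather than the production tool. It assumes a raw tracking topic named raw_events, a standardized downstream topic std_events, and a whitelist of event types; only the whitelisted (~20%) traffic is forwarded, so downstream jobs no longer consume the full stream.

```java
import java.util.Set;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TrafficForwardJob {

    // Hypothetical whitelist: only these event types are needed downstream.
    private static final Set<String> WANTED = Set.of("order_created", "page_view", "pay_success");

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("raw_events")                       // assumed raw tracking topic
                .setGroupId("traffic-forward")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("std_events")                // standardized downstream topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "raw-events")
                // Keep only whitelisted event types (naive substring check for brevity);
                // a real job would also map each source-specific schema to one standard schema here.
                .filter(line -> WANTED.stream().anyMatch(line::contains))
                .sinkTo(sink);

        env.execute("traffic-forward");
    }
}
```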

4.1.2 Evolution of business data processing solutions

Solution 1 - Simple fusion of offline and real-time data

Background

The business needs were relatively simple at the beginning, such as computing a user's historical order volume in real time, or aggregating the attractions a user has purchased historically. Such simple requirements can be abstracted as simple aggregations of offline and real-time data: numeric addition/subtraction/multiplication/division, string append, deduplicated summaries, etc.

Solution

As shown in the figure below, the data provider supplies standardized T+1 and real-time data access; data processing covers T+1 and real-time data fusion, consistency verification, dynamic rule-engine processing, etc.; data storage supports horizontal scaling of the aggregated data, tag mapping, etc. A minimal sketch of the simplest fusion rule follows.
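As a concrete illustration of the simplest rule ("numeric add"), the hedged sketch below fuses a T+1 offline baseline with real-time increments. Class, method, and field names are illustrative assumptions, not the actual rule-engine code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class OrderCountFusion {

    // Baseline loaded from the T+1 offline table, e.g. user_id -> historical order count.
    private final Map<String, Long> baseline = new ConcurrentHashMap<>();
    // Real-time increments accumulated since the last offline refresh.
    private final Map<String, Long> delta = new ConcurrentHashMap<>();

    /** Called once a day after the offline T+1 table is published. */
    public void refreshBaseline(Map<String, Long> t1Snapshot) {
        baseline.clear();
        baseline.putAll(t1Snapshot);
        delta.clear();                         // increments are now folded into the new baseline
    }

    /** Called for every real-time order event ("numeric add" rule). */
    public void onOrderEvent(String userId) {
        delta.merge(userId, 1L, Long::sum);
    }

    /** Serving-side read: offline baseline + real-time delta. */
    public long historicalOrderCount(String userId) {
        return baseline.getOrDefault(userId, 0L) + delta.getOrDefault(userId, 0L);
    }
}
```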

Solution 2 - SQL support

Background

Although Solution 1 has the following advantages:

  • Simple layering and strong timeliness

  • Rule configuration responds quickly and can handle a large number of complex UDFs

  • Rule engine and other flexible processing

  • Compatible with the entire Java ecosystem

But there are also obvious disadvantages:

  • BI/SQL developers basically cannot participate, so everything depends heavily on back-end developers

  • Many scenarios are naturally expressed in SQL; implementing them in Java is costly and less stable

  • No proper data layering

  • Intermediate data is basically unavailable; persisting it would require recomputation, which wastes computing resources

Solution

As shown in the figure below, Kafka carries the data layering, Flink SQL serves as the compute engine, and OLAP handles data storage and per-layer queries, completing the layered construction of a typical data warehouse. A minimal sketch of such a layered Flink SQL job follows.
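The sketch below shows the layering pattern under assumed topic names and schemas: the ODS layer is a raw Kafka topic, Flink SQL aggregates it into a DWS layer carried by another Kafka topic (upsert-kafka), and the same DWS data can then be synced into an OLAP table for per-layer queries. Connector options and fields are illustrative.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class LayeredWarehouseJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // ODS layer: raw order events on Kafka.
        tEnv.executeSql(
            "CREATE TABLE ods_order (user_id STRING, amount DECIMAL(10,2), ts TIMESTAMP(3)) " +
            "WITH ('connector'='kafka', 'topic'='ods_order', " +
            "      'properties.bootstrap.servers'='kafka:9092', 'format'='json', " +
            "      'scan.startup.mode'='latest-offset')");

        // DWS layer: per-user aggregation, also carried by Kafka so the intermediate layer stays reusable.
        tEnv.executeSql(
            "CREATE TABLE dws_user_order (user_id STRING, order_cnt BIGINT, gmv DECIMAL(10,2), " +
            "  PRIMARY KEY (user_id) NOT ENFORCED) " +
            "WITH ('connector'='upsert-kafka', 'topic'='dws_user_order', " +
            "      'properties.bootstrap.servers'='kafka:9092', " +
            "      'key.format'='json', 'value.format'='json')");

        // The aggregation itself is plain SQL, so BI/SQL developers can own it.
        tEnv.executeSql(
            "INSERT INTO dws_user_order " +
            "SELECT user_id, COUNT(*) AS order_cnt, SUM(amount) AS gmv FROM ods_order GROUP BY user_id");
    }
}
```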


However, since Kafka and the OLAP storage engine are two separate systems, data inconsistencies can occur: for example, if Kafka is healthy but the database is abnormal, the intermediate layered data will be wrong while the final result is still correct. To address this, as shown below, we borrowed the binlog mechanism of traditional databases: the Kafka data strictly follows DB data changes, so the final result strictly depends on the intermediate layered results. Inconsistencies caused by failures of the components themselves still cannot be completely avoided, but most scenarios are already basically usable.

Solution 3

Background

Although Solution 2 has the following advantages:

  • SQL-based development

  • Naturally queryable layers

But there are also obvious disadvantages:

  • Data inconsistency problem

  • Binlog handles inserts well, but updates and deletes are hard: updates require a lot of deduplication, which is very unfriendly to SQL.

  • For long-window aggregations, operators such as max and min keep large Flink state and are prone to instability.

  • We also need to handle results being overwritten by out-of-order Kafka data.

solution

As shown in the figure below, we borrow the computing power of the storage engine: the binlog records on Kafka are used only as triggers for the calculation, and a Flink UDF queries the DB directly. A minimal sketch of such a UDF follows.
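The hedged sketch below shows the pattern: a Flink scalar UDF that pushes the aggregation (here a COUNT) down to the OLAP/DB engine, while the binlog record that invokes it only acts as a trigger. Connection details and table/column names are illustrative assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.flink.table.functions.FunctionContext;
import org.apache.flink.table.functions.ScalarFunction;

public class UserOrderCountUdf extends ScalarFunction {

    private transient Connection conn;

    @Override
    public void open(FunctionContext context) throws Exception {
        // One connection per task; a real job would pool connections and add retries.
        conn = DriverManager.getConnection("jdbc:mysql://olap-db:3306/dw", "reader", "***");
    }

    /** Invoked per binlog-triggered row; the heavy lifting (COUNT) runs inside the DB. */
    public Long eval(String userId) throws Exception {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT COUNT(*) FROM dwd_order WHERE user_id = ?")) {
            ps.setString(1, userId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L;
            }
        }
    }

    @Override
    public void close() throws Exception {
        if (conn != null) {
            conn.close();
        }
    }
}
```

Such a UDF would be registered with tEnv.createTemporarySystemFunction(...) and called in SQL over the binlog-triggered stream. Pushing the aggregation into the DB is what keeps the Flink state small, at the cost of per-record query load on the OLAP engine, which is exactly the trade-off listed under the disadvantages below.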

Advantages:

  • SQL-based development

  • Naturally queryable layers

  • Data consistency

  • Small Flink state

  • Supports long-window / persistent data aggregation

  • No need to worry about out-of-order binlog records, updates, etc.

Disadvantages:

  • High concurrency is not supported, and performance depends heavily on the OLAP engine; when consuming the data source we apply window-based rate limiting or scale the DB horizontally.

  • The link with the retract stream is broken at the sink; for example, a GROUP BY effectively degenerates into a blind upsert, and UDF-based aggregation cannot fully replace Flink's native aggregation.

Each solution has its applicable scenarios, and the choice needs to be made based on the business scenario and latency requirements. At present 86% of our scenarios can be handled by Solution 3, and with the offline/online integration features of Flink 1.16, all scenarios can be covered in the future.

4.2 Data services

This module provides capabilities covering data synchronization -> data storage -> data query -> data services. For simple scenarios it enables minute-level data service deployment, saving 90% of development hours. It implements a hard dependency on offline data DQC, engineering-side protection against DQC exceptions, client-to-interface-level resource isolation / rate limiting / circuit breaking, and full-link lineage management (client -> server -> table -> hive table -> hive lineage), providing on-demand interface deployment and operations support for various performance requirements. A hedged sketch of the interface-level protection is shown below.
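The sketch below illustrates the interface-level protection idea only; it is not Ctrip's actual service framework. It combines per-client rate limiting (Guava RateLimiter) with a DQC-driven circuit breaker that refuses to serve a dataset whose offline DQC failed. The DQC lookup and the limits are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.google.common.util.concurrent.RateLimiter;

public class DataApiGuard {

    // Per-client limiters (client id -> limiter); the 100 QPS default is illustrative.
    private final Map<String, RateLimiter> limiters = new ConcurrentHashMap<>();
    // Dataset -> latest DQC verdict, refreshed by the data quality pipeline.
    private final Map<String, Boolean> dqcPassed = new ConcurrentHashMap<>();

    public boolean allow(String clientId, String dataset) {
        // Circuit-break on bad data: never expose a dataset whose DQC failed.
        if (!dqcPassed.getOrDefault(dataset, false)) {
            return false;
        }
        // Rate limiting at the client -> interface level.
        RateLimiter limiter = limiters.computeIfAbsent(clientId, c -> RateLimiter.create(100.0));
        return limiter.tryAcquire();
    }

    public void updateDqc(String dataset, boolean passed) {
        dqcPassed.put(dataset, passed);
    }
}
```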

The architecture is as follows:


4.3 Data quality

This module is mainly divided into data content quality and data task quality.

4.3.1 Data content

Correctness/timeliness/stability

This part is further divided into data operation changes, data content consistency, data read consistency, data correctness/timeliness, etc. As shown in the figure below, for data changes, abnormal records can be written to the company's hickwall alarm center and alerts are issued according to the configured warning rules. For data content, scheduled tasks execute user-defined SQL statements and write the results to the alarm center, enabling second-level and minute-level warnings. A minimal sketch of such a scheduled check follows.
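A minimal sketch of the scheduled, user-defined SQL check, assuming a JDBC-accessible store, an illustrative timeliness rule, and a placeholder AlarmClient standing in for the real alarm-center client.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class DqcScheduler {

    // User-defined check (illustrative): the table must have produced rows in the last 10 minutes.
    private static final String CHECK_SQL =
            "SELECT COUNT(*) FROM dws_user_order WHERE update_time > NOW() - INTERVAL 10 MINUTE";

    public static void main(String[] args) {
        ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();
        pool.scheduleAtFixedRate(DqcScheduler::runCheck, 0, 1, TimeUnit.MINUTES);
    }

    private static void runCheck() {
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://olap-db:3306/dw", "dqc", "***");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(CHECK_SQL)) {
            long recent = rs.next() ? rs.getLong(1) : 0L;
            if (recent == 0) {
                AlarmClient.send("dws_user_order produced no rows in the last 10 minutes");
            }
        } catch (Exception e) {
            AlarmClient.send("DQC check failed to run: " + e.getMessage());
        }
    }

    /** Placeholder for the real alarm-center client. */
    static class AlarmClient {
        static void send(String msg) {
            System.err.println("[ALARM] " + msg);
        }
    }
}
```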

Read consistency

As shown in the figure below, for reads that join across tables, if one of the tables has a problem, in most cases the wrong data should not be displayed: only the historically correct data is shown, and the full data is displayed again after the table recovers.

For example, an exposure metric is computed by dividing Table 1 by Table 2 (Table 1 / Table 2). If Table 2's data production is abnormal and has produced no data for the last 2 hours, then when exposing the metric to users the business should only display data from more than 2 hours ago; for the abnormal range, the front end shows an exception hint, following the idea of Flink watermarks, and only the correct data is displayed. A small sketch of this watermark-style read follows.
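The hedged sketch below shows the watermark-style read: only time buckets up to the minimum of the two tables' latest produced times are served, so a lagging table never yields wrong ratios. Table and field names are illustrative.

```java
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentRatioReader {

    /**
     * @param table1 bucket time -> value from Table 1 (e.g. exposures); assumed non-empty
     * @param table2 bucket time -> value from Table 2 (may lag, e.g. 2h of missing data); assumed non-empty
     * @return bucket time -> table1/table2, only for buckets that both tables have produced
     */
    public Map<Instant, Double> read(TreeMap<Instant, Long> table1, TreeMap<Instant, Long> table2) {
        // "Watermark" = the latest time both tables have reached.
        Instant watermark = min(table1.lastKey(), table2.lastKey());

        Map<Instant, Double> result = new TreeMap<>();
        for (Map.Entry<Instant, Long> e : table1.headMap(watermark, true).entrySet()) {
            Long denominator = table2.get(e.getKey());
            if (denominator != null && denominator != 0) {
                result.put(e.getKey(), e.getValue() / (double) denominator);
            }
        }
        // Buckets after the watermark are withheld; the front end shows an "abnormal/late data" hint instead.
        return result;
    }

    private static Instant min(Instant a, Instant b) {
        return a.isBefore(b) ? a : b;
    }
}
```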

Offline/real-time consistency

For consistency between offline and real-time data, as shown in the figure below, we use a relatively simple approach: the real-time data is synchronized directly into Hudi, and the offline and real-time data are compared on Hudi, with differences written to the alarm center. A hedged sketch of such a comparison job follows.
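A hedged sketch of the comparison, assuming a JDBC gateway (e.g. a HiveServer2-compatible endpoint) that can read both the offline Hive table and the Hudi table, plus illustrative table names and tolerance; in practice the differences would be pushed to the alarm center rather than printed.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OfflineRealtimeDiffCheck {

    // Count rows per dt in both tables and keep partitions whose relative difference exceeds 1%.
    private static final String DIFF_SQL =
        "SELECT o.dt, o.cnt AS offline_cnt, r.cnt AS realtime_cnt " +
        "FROM (SELECT dt, COUNT(*) cnt FROM dw.offline_order GROUP BY dt) o " +
        "JOIN (SELECT dt, COUNT(*) cnt FROM dw.hudi_rt_order GROUP BY dt) r ON o.dt = r.dt " +
        "WHERE ABS(o.cnt - r.cnt) > o.cnt * 0.01";

    public static void main(String[] args) throws Exception {
        // Assumed JDBC endpoint of a query engine that can read both Hive and Hudi tables.
        try (Connection conn = DriverManager.getConnection("jdbc:hive2://gateway:10000/dw");
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(DIFF_SQL)) {
            while (rs.next()) {
                System.err.printf("[ALARM] dt=%s offline=%d realtime=%d%n",
                        rs.getString("dt"), rs.getLong("offline_cnt"), rs.getLong("realtime_cnt"));
            }
        }
    }
}
```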


4.3.2 Data tasks

Upstream tasks

Relying on the company's custom warning points, the alarm middle platform, the computing platform, and other tools, key indicators such as whether the upstream message queue is delayed and whether its volume is abnormal can be monitored and alerted on.

Current task

Key indicators of data processing tasks such as throughput, delay, back pressure, and resource usage can be monitored and alerted on, to prevent data tasks from staying abnormal for long periods.


4.4 Data management

This module links modules such as data processing and data quality together and provides a visual management platform, covering table lineage / basic information, DQC configuration, task status, monitoring, etc.

The figure below shows the lineage between each data table and its upstream and downstream data production tasks.


The figure below shows the details of a data table's quality information.


The figure below summarizes basic information about the various UDFs and tables.


5. Outlook

At present, the system can handle most of the team's data development needs. In the future we will continue to explore reliability, stability, and ease of use, for example improving the overall data governance system, building automatic data-repair tools and intelligent operations-and-maintenance components, and exploring the integration of serving and analytics.
