Didi's Real-Time Data Pipeline Construction: Component Selection in Practice

Foreword

As Didi's internal technology stack has converged, real-time component resources have been consolidated, and real-time development experience has accumulated across business lines, a set of well-established technology selections and implementation plans has taken shape for the company's different business scenarios. At the same time, we have found that most real-time developers equate real-time data construction with Flink development, pushing the other components of the real-time processing pipeline to the sidelines, and therefore fail to integrate those components efficiently enough to meet the real-time requirements of different business scenarios. Starting from the company's current typical real-time data development solutions, this article sorts out the technology selection for real-time data construction in different scenarios, to help everyone build real-time data more effectively and continuously deliver high-quality, stable real-time data that creates business value.

This article is organized as follows:

1. Main business scenarios of real-time data development in the company

2. Common solutions for real-time data development in the company

  • Data sources

  • Data channel

  • Sync center

  • Real-time development platform

  • Data sets

  • Real-time data applications

3. Selection of real-time data development components in specific scenarios

  • Real-time indicator monitoring scenario

  • Real-time BI analysis scenario

  • Real-time data online service scenario

  • Real-time feature and label system

4. Principles for using resources of each component

5. Summary and Outlook 

1. Main business scenarios of real-time data development in the company

At present, the main scenarios in which the company's business lines use real-time data fall into four categories:

[Figure: the four main real-time data scenarios]

Real-time indicator monitoring: for example, indicator stability monitoring on the product and engineering side, anomaly detection on business-side real-time indicators, and business health monitoring for operations dashboards. The main characteristics of this scenario are high requirements on data timeliness and a strong dependence on time series: the time axis is the main analysis dimension, while the complexity of the analysis itself is moderate.

Real-time BI analysis: mainly real-time dashboards and reports configured by data analysts and operations staff, including the company's operations dashboard, core real-time dashboards, and the real-time big screens in the exhibition hall. The main characteristics are extremely high requirements on data accuracy, tolerance for a certain delay in timeliness, and the need to support more complex analytical capabilities.

Real-time data online services: real-time indicators exposed mainly through API interfaces, mostly to feed data products. This scenario demands high data timeliness and accuracy; indicator computation is of moderate complexity, but query QPS requirements are very high, and the whole service must remain highly available while serving real-time data.

Real-time features: mainly used for machine learning model updates, recommendation prediction and strategies, and labeling systems. Requirements on timeliness, accuracy, and query QPS are moderate, but the implementation logic places higher demands on the real-time computing engine itself: strong stream processing capability, strong state storage, and rich connectivity to external components.

2. Common solutions for real-time data development in the company

[Figure: overall architecture of the company's real-time data development stack]

The company's standard real-time data development stack consists of six parts: real-time data collection, the data channel, data synchronization, real-time computation, real-time data set storage, and real-time data applications. The components used in each of these six parts are now basically stable, and each can be used flexibly on its corresponding platform.

Data sources

At present, the company's main real-time data sources are binlog logs produced by MySQL and public logs produced on business servers. MySQL binlogs are collected with Canal, Alibaba's open-source collection tool. Canal works by disguising itself as a MySQL slave: it speaks the MySQL replication protocol and sends a dump request to the MySQL master; the master then streams binary logs to Canal, which parses them and finally delivers the results to DDMQ. Public logs are business logs that follow the company's logging specification: a LogAgent is deployed on each business server, the Agent Manager generates the collection configuration, the agent pulls that configuration from the Agent Manager and starts the collection task, and the logs are finally sent to Kafka.
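As a concrete illustration, the sketch below shows how such a Canal binlog stream, once it lands in a Kafka-compatible topic, can be declared as a Flink SQL source table using Flink's built-in canal-json format. All topic, field, and server names here are assumptions for illustration, not the company's actual configuration.

```sql
-- Minimal sketch (assumed names): expose a Canal-collected MySQL binlog
-- stream as a Flink SQL source table via the built-in canal-json format.
CREATE TABLE ods_order_binlog (
    order_id    BIGINT,
    city_id     INT,
    status      INT,
    update_time TIMESTAMP(3),
    -- Event-time watermark so downstream windows can use update_time
    WATERMARK FOR update_time AS update_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'kafka',
    'topic'     = 'ods_order_binlog',              -- assumed topic name
    'properties.bootstrap.servers' = 'kafka:9092', -- assumed brokers
    'properties.group.id' = 'rt_order_etl',
    'scan.startup.mode' = 'latest-offset',
    'format'    = 'canal-json'                     -- decodes Canal binlog messages
);
```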

Data channel

The company's mainstream message channels are DDMQ and Kafka. All binlogs flow through DDMQ. DDMQ, open-sourced by Didi at the end of 2018, uses RocketMQ and Kafka as its underlying storage engines; its main features are support for delayed and transactional messages, along with sophisticated message forwarding and filtering. Public logs mainly use Kafka as the message channel, and the intermediate links of real-time tasks also mainly use Kafka as the storage medium. Kafka's main strengths are high scalability, a mature ecosystem, extremely efficient integration with Flink, and very convenient operations and maintenance.

Sync center

Its main function is to split the data collected from the sources into offline and real-time streams according to business needs. For offline scenarios, a data synchronization feature built on DataX completes end-to-end synchronization and lands the results in HDFS. For real-time scenarios, a Dsink task with an embedded real-time computing engine completes the collection configuration and pushes the results to a Kafka topic; at the same time, the data is also landed in HDFS to build incremental or full offline ODS tables.

Real-time development platform

At present, real-time task development in the company has been fully integrated into the real-time development platform of Shumeng (the one-stop data development platform), which supports two modes: Flink JAR and Flink SQL. As of June 2022, 8% of the real-time tasks running on the platform are JAR tasks and 92% are SQL tasks. For day-to-day development we recommend Flink 1.12 SQL: it helps keep indicators consistent and makes real-time tasks more maintainable. During development, users are also advised to use the local debugging feature to catch errors in real-time tasks as early as possible and improve the success rate of launching them. The work we typically complete on the real-time development platform is ETL or the computation of lightly aggregated indicators, with the results written to the downstream sink.
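A minimal Flink 1.12 SQL sketch of such a platform job follows, reusing the hypothetical ods_order_binlog source table declared earlier: a light per-minute aggregation whose results are written back to a downstream Kafka topic. All names are assumed for illustration.

```sql
-- Light-aggregation ETL sketch: per-minute order counts per city,
-- written to an assumed downstream Kafka topic.
CREATE TABLE dws_order_cnt_1min (
    window_start TIMESTAMP(3),
    city_id      INT,
    order_cnt    BIGINT
) WITH (
    'connector' = 'kafka',
    'topic'     = 'dws_order_cnt_1min',            -- assumed topic name
    'properties.bootstrap.servers' = 'kafka:9092', -- assumed brokers
    'format'    = 'json'
);

INSERT INTO dws_order_cnt_1min
SELECT
    TUMBLE_START(update_time, INTERVAL '1' MINUTE) AS window_start,
    city_id,
    COUNT(*) AS order_cnt
FROM ods_order_binlog
GROUP BY TUMBLE(update_time, INTERVAL '1' MINUTE), city_id;
```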

[Figure: flow chart of the local debugging feature]

Data sets

The downstream sinks for computation results generally include Kafka, Druid, ClickHouse, MySQL, HBase, ES, and Fusion. Intermediate results of real-time tasks, and the DWD-layer data of real-time data warehouses, are written to Kafka. Data used for indicator monitoring and alerting is written to Druid, exploiting Druid's characteristics as a time-series database to improve the performance of real-time indicator monitoring. For business BI analysis, data can be written to ClickHouse to configure a variety of BI dashboards. Indicator results computed with Flink can also be written directly to MySQL, HBase, ES, or Fusion; the specific selection is explained per business scenario in the next chapter. All downstream sinks are now integrated into the platform. When Druid is used, a DataSource generally needs to be configured on Woater (the unified indicator monitoring platform); when ClickHouse is used, a data set generally needs to be configured on Shuyi (the BI analysis platform).

[Figure: monitoring and alerting]

[Figure: real-time BI analysis]

Real-time data applications

For real-time result data, a common approach is to create real-time indicators on Woater (the unified indicator monitoring platform) and configure the corresponding real-time dashboards or alerts, meeting the business's needs for minute-level indicator monitoring and real-time curve analysis. You can also configure real-time reports on Shuyi (the BI analysis platform) using a Shumeng flow table (Druid's meta table) or a ClickHouse data set, to meet the business side's various BI analysis needs.

3. Selection of real-time data development components in specific scenarios

The links above constitute the main development path for current real-time tasks. During real-time development, we need to analyze each specific problem in light of the concrete business requirements and the capabilities of each platform, and choose the most suitable development option for each business scenario.

Real-time indicator monitoring scenario

Scenario characteristics: an obvious dependence on time series, high requirements on indicator timeliness, moderate requirements on indicator accuracy, high query QPS, and high requirements on the stability of real-time data output.

Pipeline:

[Figure: real-time indicator monitoring pipeline]

In this scenario, we recommend configuring a DataSource on Woater (the unified indicator monitoring platform) and defining the indicator and dimension columns according to the monitoring requirements. To improve query efficiency, an aggregation granularity must be configured; 30s or 1min is typical. For UV-like indicators, set the corresponding indicator column to the hyperUnique type to improve computing performance, and tune Druid's ability to consume topic data by setting its consumption partitions; we generally recommend that the number of topic partitions be an even multiple of the number of Druid partitions. The real-time indicators configured through the DataSource are then used to build real-time monitoring dashboards and alerts.
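For intuition, here is the kind of Druid SQL query such a DataSource serves for a monitoring dashboard: bucketing on the time axis at the configured granularity and computing an approximate UV from a hyperUnique metric column. The DataSource and column names are assumed for illustration.

```sql
-- Sketch of a monitoring query (assumed names): per-minute order counts
-- and approximate UV over the last hour.
SELECT
    TIME_FLOOR(__time, 'PT1M')    AS ts_minute,
    city_id,
    SUM(order_cnt)                AS order_cnt,
    APPROX_COUNT_DISTINCT(uv_hll) AS uv        -- uv_hll: hyperUnique metric column
FROM rt_order_monitor                          -- assumed DataSource name
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY 1, 2
ORDER BY 1;
```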

Hardening the core pipeline: for core monitoring scenarios, dual pipelines are required to guarantee the stability and timeliness of the real-time link.

[Figure: dual-pipeline architecture for core monitoring]

The dual real-time processing pipeline starts from the original data source and consists of three active-active parts: Flink tasks, result topics, and Druid tables. It also needs to support active-active switching at the level of individual real-time indicators, so that indicator queries remain stable and downstream monitoring avoids false alarms.

Real-time BI analysis scenario

Scenario characteristics: no strict dependence on time series, high requirements on indicator accuracy, tolerance for a certain delay, moderate query QPS, and the need to support flexible dimension + indicator combination queries.

Pipeline:

[Figure: real-time BI analysis pipeline]

The main solution for this scenario is to flatten the required dimension information as much as possible in the Flink task, and then write the flattened real-time data to ClickHouse local tables in micro-batches. We use the ClickHouse local table as the base table and configure different materialized views downstream according to business needs. For scenarios that require real-time deduplication by primary key, we can use ClickHouse's ReplacingMergeTree engine, and then expose the deduplicated materialized view as a Shuyi (BI analysis platform) data set or behind a Shulian (data service platform) API for downstream BI dashboards. For dashboards with fixed dimensions and indicators, we can instead build an aggregated view on top of the local table with the AggregatingMergeTree engine, keyed by the dimension fields the business requires; this improves query performance while still serving Shuyi data sets or API interfaces downstream. Finally, for the common case that needs neither real-time deduplication nor pre-aggregation, the Flink results or lightly pre-aggregated data can be written to a ClickHouse distributed table, and a Shuyi data set configured directly on it so users can build the indicator dashboards the business needs.

The main principles for choosing among the three table types:

  • For business scenarios that demand extremely high indicator accuracy and have a clear deduplication primary key, use ClickHouse's real-time deduplication (ReplacingMergeTree) tables, as in the sketch after this list.

  • For scenarios with high accuracy requirements, clearly defined dimensions and indicators, and complex query logic or high query QPS, pre-aggregate and use ClickHouse's aggregated view tables.

  • For scenarios with modest data volume and frequently changing business logic, directly use an ordinary ClickHouse distributed table for downstream dashboard configuration in the early stage, to support rapid iteration and ad-hoc data queries.
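As a sketch of the first option, the DDL below defines a ReplacingMergeTree local table that deduplicates rows sharing the same primary key, keeping the row with the newest version column. Table and field names are assumptions, not an actual production schema.

```sql
-- ReplacingMergeTree sketch (assumed names): rows with the same order_id
-- are collapsed at merge time, keeping the one with the largest update_time.
CREATE TABLE rt_order_local
(
    order_id    UInt64,
    city_id     UInt32,
    status      UInt8,
    gmv         Decimal(18, 2),
    update_time DateTime
)
ENGINE = ReplacingMergeTree(update_time)   -- version column
PARTITION BY toDate(update_time)
ORDER BY order_id;                         -- deduplication key

-- Queries that must see fully deduplicated data can force it with FINAL:
-- SELECT city_id, sum(gmv) FROM rt_order_local FINAL GROUP BY city_id;
```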

Real-time data online service scenario

Scenario characteristics: high requirements on indicator accuracy, high query QPS, and moderate requirements on data timeliness.

Pipeline:

[Figure: real-time data online service pipeline]

The main characteristic of this scenario is that the required real-time indicators must be pre-processed in one way or another. One approach is to complete the indicator computation inside the Flink task and write the final results in real time to MySQL or HBase, where the downstream data service platform wraps the continuously updated storage behind an API; this suits scenarios where business logic changes infrequently and data services are required. The other approach is to push the aggregation logic down: the Flink task mainly widens the data and performs simple pre-aggregation, the main indicator statistics are handed to a downstream OLAP engine, and the data service platform provides query APIs by wrapping the OLAP engine. The advantage is that the OLAP engine's pre-aggregation capability can keep serving efficient real-time indicators even when indicator logic changes frequently; the disadvantage is heavier query pressure on the OLAP engine, which needs more resources to sustain the service's high QPS.
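A hedged sketch of the first approach in Flink 1.12 SQL, again reusing the hypothetical ods_order_binlog source from earlier: the indicator is finalized in Flink and upserted into MySQL through the JDBC connector, so the data service platform can wrap the table behind an API. Connection details and names are assumed.

```sql
-- Upsert sink sketch (assumed names): the declared primary key puts the
-- JDBC connector into upsert mode, so MySQL rows are updated in real time.
CREATE TABLE rt_city_order_cnt (
    city_id   INT,
    dt        STRING,
    order_cnt BIGINT,
    PRIMARY KEY (city_id, dt) NOT ENFORCED
) WITH (
    'connector'  = 'jdbc',
    'url'        = 'jdbc:mysql://mysql-host:3306/rt_serving',  -- assumed
    'table-name' = 'rt_city_order_cnt',
    'username'   = 'rt_user',
    'password'   = '***'
);

INSERT INTO rt_city_order_cnt
SELECT city_id,
       DATE_FORMAT(update_time, 'yyyy-MM-dd') AS dt,
       COUNT(*) AS order_cnt
FROM ods_order_binlog
GROUP BY city_id, DATE_FORMAT(update_time, 'yyyy-MM-dd');
```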

Real-time feature and labeling system

Scenario characteristics: moderate requirements on indicator accuracy, high query QPS, large real-time state in computation, and the need to combine real-time and offline indicators.

Pipeline:

[Figure: real-time feature and label pipeline]

Such scenarios generally have clearly defined indicator and dimension columns, and a large number of real-time features or labels need to be served through the platform. The first option is to let the platform consume the topic data directly and wrap it as a feature or label service. The second option relies on the strong primary-key update capability of HBase and Fusion: both real-time and offline labels are written into the store, which is then connected to the platform to provide feature or label services for downstream algorithm engineers.
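A minimal sketch of the second option using Flink 1.12 SQL's HBase connector follows (an analogous table could target Fusion). Because HBase writes are keyed upserts, real-time and offline label jobs writing the same rowkey converge on a single row. All names are assumed.

```sql
-- HBase label-sink sketch (assumed names): each write upserts the row
-- identified by rowkey, so labels stay current per user.
CREATE TABLE user_label_hbase (
    rowkey STRING,
    cf ROW<order_cnt_1h BIGINT, last_city INT>,  -- one column family
    PRIMARY KEY (rowkey) NOT ENFORCED
) WITH (
    'connector'  = 'hbase-2.2',
    'table-name' = 'rt:user_label',              -- assumed HBase table
    'zookeeper.quorum' = 'zk1:2181,zk2:2181'     -- assumed quorum
);

-- A real-time label job would then end with something like:
-- INSERT INTO user_label_hbase
-- SELECT CAST(user_id AS STRING), ROW(order_cnt_1h, last_city) FROM ...;
```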

4. Principles for using resources of each component

Real-time data development involves many components. Following some basic principles for each component makes full use of resources and avoids a great deal of unnecessary cost while still meeting real-time development needs.

Data collection: the single-collection principle. When developing the real-time indicators a business requires, reuse upstream data sources as much as possible to keep the real-time and offline ODS layers unified.

DDMQ: one Flink task corresponds to one DDMQ consumer group; multiple topics may share one consumer group, but different real-time tasks should not reuse the same consumer group.

Kafka: single-partition traffic should not exceed 3 MB/s. For important real-time tasks, Kafka retention should be kept within 48-72 hours, so that at least two days of history can be replayed.

Flink: the source parallelism for Kafka and DDMQ should exactly match the number of partitions configured on the Kafka or DDMQ side, which gives the best consumption performance. Within the company, a single TM of a Flink task is fixed at slots = 2, taskmanager memory = 4096, containers.vcores = 2; these can be adjusted for different scenarios: for pure ETL, the number of slots per TM can be increased, and for memory-heavy tasks, the taskmanager memory can be raised. In normal development, the global parallelism of Kafka tasks should match the source parallelism. The global parallelism for DDMQ consumption depends on DDMQ traffic: set it to 3 for traffic in the range of (1000±500), scale proportionally beyond that, and estimate against the most time-consuming operator in the business logic.

Druid: when creating a Druid table, the aggregation granularity must be set; 30s or 1min is recommended, and the default data retention is 3 months. A Druid table created for a given business scenario needs explicitly specified dimension and indicator fields. Dimension fields should use the String type wherever possible, since Druid optimizes bitmap and inverted indexes for Strings; indicator fields should use approximate (sketch) types wherever business requirements allow, to improve the computation performance of real-time indicators.

ClickHouse: the flush interval of Flink real-time write tasks should be no less than 30s, and the write parallelism should be kept within 10 as much as possible. The data retention of a ClickHouse table should be kept to about 1 month. Time must be used as the partition field; fields of other types must not be used for partitioning. The native Flink2CK connector mode is recommended for real-time writes, to improve write stability and reduce ClickHouse CPU consumption; Flink2CK write throughput should be kept within 20 MB/s per parallel writer to indirectly safeguard the stability of the ClickHouse cluster.
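The table-level settings implied above can be sketched as follows (all names assumed): a time-based partition key plus a TTL that expires data after roughly one month.

```sql
-- MergeTree sketch (assumed names): time-based partitioning and a ~1 month
-- retention enforced by TTL, matching the usage principles above.
CREATE TABLE rt_events_local
(
    event_time DateTime,
    city_id    UInt32,
    metric     Float64
)
ENGINE = MergeTree
PARTITION BY toDate(event_time)        -- partition field must be time-based
ORDER BY (city_id, event_time)
TTL event_time + INTERVAL 1 MONTH;     -- ~1 month storage cycle
```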

5. Summary and Outlook

This article has summarized the mainstream real-time task development solutions and technology stacks for Didi's current business scenarios, giving users a foundation for moving from offline development to real-time data development and familiarizing product and operations staff with real-time pipeline development, thereby lowering the barrier to real-time data construction to a certain extent. It then used Didi's four typical business scenarios of real-time indicator monitoring, real-time BI analysis, real-time data online services, and real-time features to illustrate how component selection differs by scenario and which principles to follow in each, helping business developers formulate a reasonable real-time development plan for their specific data requirements and implement it quickly. Finally, it offered configuration suggestions for the main components in the real-time development process, to reduce development cost as much as possible while still completing users' real-time tasks, improve overall resource utilization, and achieve cost reduction and efficiency gains.


Source: blog.csdn.net/DiDi_Tech/article/details/131218696