Didi Data Service System Construction Practice

What is a data service?

The main process of big data development is divided into four stages: data integration, data development, data production, and data backflow. Data integration opens the channel for business system data to enter the big data environment; it usually takes three forms: periodic import into offline tables, real-time collection and cleaning into offline tables, and real-time writes to the corresponding data sources. Didi's internal synchronization center platform currently provides collection capabilities for multiple data sources such as MySQL, Oracle, MongoDB, and Publiclog. In data development and production, users build real-time and offline data warehouses and do data modeling based on task types such as SQL, Native, and Shell. Data backflow improves access performance by exporting offline data into OLAP engines, RDBMSs, and other stores, so that downstream services can query those data sources directly for data analysis and visualization.

Didi's internal Data Dream Factory provides a one-stop solution for data development and production, focusing on the efficiency, security, and stability of the data development and production pipeline.

c6def9231c2cef4a42c02284e7f2c233.png

Data development process

To deliver data to users systematically, we built a one-stop data consumption platform, which includes general-purpose data consumption products such as Shuyi, the intelligent data Q&A bot, and transaction analysis, as well as horizontally accumulated content products such as Polaris and the group exhibition hall. A one-stop consumption platform provides visualization and analysis capabilities by querying structured, standardized data. From the perspective of the technical architecture of data consumption products, query performance matters: data is backflowed into an appropriate multi-dimensional analysis storage engine according to how it will be queried, most commonly MySQL, ClickHouse, Druid, and StarRocks. Closing the query loop on these multi-dimensional analysis engines, extending computing capabilities, implementing performance optimizations, and guaranteeing query stability are therefore common, general-purpose capabilities shared across data consumption products.

In addition, other personalized data products, operations platforms, and B-end/C-end products urgently need data access capabilities. Unifying these data access capabilities into data services is also a core issue in building the data middle platform.

We did not build this capability overnight; the work falls into three stages.

45afc15bb52e16dbed6aad49f1a5b3df.png

Technical Architecture of Data Consumption Products

Phase one:

Build the data backflow capability of the synchronization center

In 2019, Didi kicked off the construction of Data System 2.0. As the core deliverable of the Data Dream Factory, the goal of its first phase was to build a one-stop data development and production platform, with the one-stop synchronization center as the key milestone. Through automated processes, the synchronization center ended the era of manually building synchronization tasks between data sources by submitting work orders; its core output was self-service management of the links through which data enters Hive. On top of that, we built new backflow links from Hive to MySQL, ClickHouse, Druid, HBase, and ES, completing platform support for data backflow.

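For illustration, here is a minimal sketch of what a backflow task definition might look like. The field names and the submit_backflow_task helper are hypothetical; they only indicate the kind of metadata such a task needs, not the synchronization center's actual API.

```python
# Hypothetical Hive -> ClickHouse backflow task definition (illustrative only).
backflow_task = {
    "source": {
        "type": "hive",
        "table": "app.app_order_summary_di",      # daily-partitioned APP table
        "partition": "dt=${yyyyMMdd}",             # run against the latest partition
    },
    "sink": {
        "type": "clickhouse",
        "cluster": "ck_bi_cluster",
        "table": "app_order_summary",
        "write_mode": "replace_partition",         # keeps re-runs idempotent
    },
    "schedule": "0 30 2 * * ?",                    # daily, after the upstream table lands
    "owner": "data_bp_user",
    "alert_on_failure": True,
}

def submit_backflow_task(task):
    """Pretend submission: validate the config and return a task id."""
    assert task["source"]["type"] == "hive" and task["sink"]["table"]
    return "task-" + task["sink"]["table"]

print(submit_backflow_task(backflow_task))
```
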
a4b1fd6a40f73f6743e35db21dc1ec85.png

The data service capability based on data backflow systematically covered related scenarios such as service points, Polaris, and Shuyi. It primarily solves the performance problem of business systems directly accessing the big data environment. Taking Shuyi as an example, backflowing data into ClickHouse reduced P90 query latency from 5s to under 2s, a clear improvement to Shuyi's user experience. The basic architecture of data products at this stage, especially on the query side, is similar to the figure below, with two abstracted modules: data backflow and data retrieval logic.

59452a1cac73f5ffdd91c23c18d4bd8c.png

Data backflow exports data to the multi-dimensional analysis storage engines and accumulates task management and operations capabilities. These data products are deeply integrated with the Data Dream Factory, whose strong offline task operation and maintenance capabilities guarantee data output. The access logic holds each product's specific query logic; except for Polaris, the other two products derived query-abstraction middleware from it, such as InSight's QE (QueryEngine) and Shuyi's query center. Data backflow and access logic are core capabilities of data products, and also modules that are extremely costly to build. Therefore, at this stage, products such as the intelligent data Q&A bot, data portals, and complex reports were built on the query and acceleration capabilities of Shuyi datasets to validate products quickly.

Phase two:

Build the Shulian data service platform and unify data services

Phase one provided online storage capabilities through data backflow, improved the performance of related system calls, and made staged contributions to the development of data products. As the business grew, the number of data tables increased, data access logic became fragmented and accumulated separately in different systems, and management costs kept rising. To improve retrieval performance, besides accelerating data into multi-dimensional analysis storage engines, data also has to be highly aggregated into a compact APP layer. APP-layer tables are strongly coupled to business requirements, so requirement changes often force APP-table changes to keep services supported. Because in phase one the data access logic was scattered across different systems, an APP-table change meant a large amount of work, including switching data sources for the affected dashboards and re-accepting dashboard quality; the process was very cumbersome.

Repeating data backflow and access logic across different data products also drags down the efficiency of building them. To improve efficiency, internal products were basically built on Shuyi datasets, for example the intelligent data Q&A bot and complex reports.

But this is not the optimal solution; the problems are mainly:

  • Building measures such as acceleration, rate limiting, and isolation on top of Shuyi datasets is very complicated. In particular, Shuyi's dataset acceleration comes in two fixed forms, first-level SQL-task acceleration and second-level ClickHouse acceleration, and the form is rigid.

  • Shuyi's queries are built on an MPP engine, so it is difficult to support relatively high-concurrency queries and point lookups.

  • Operations and maintenance guarantees are weak: acceleration tasks are all run by the platform, users have little visibility into them and cannot operate or maintain them themselves.

  • Shuyi datasets are a hard dependency of Shuyi itself, so it is difficult to carve them out and build service-oriented capabilities; at the time this was not Shuyi's first priority.

In summary, building a unified data service platform offered strong business benefits. Starting in early 2021, the Shulian ("data chain") platform was built for this purpose. Its basic idea is to manage acceleration links and query logic in a unified way, and to provide a unified, complete query gateway.

96bdd4dc7cb05a4645a75bbf71c66ca3.png

Shulian's basic capabilities are as follows (a sketch of calling such a gateway follows the list):

  • Diversified data sources : supports access to ES, MySQL, ClickHouse, HBase, Druid, and other data sources;

  • Multi-scenario data access : supports high-concurrency key/value queries, complex multi-dimensional analysis, and data download;

  • Unified access standards : a unified access gateway, a unified data access protocol, unified data operations and maintenance, and unified API management;

  • Data security control : supports auditing of sensitive data access and control over data download and data leaving the platform.

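As a rough illustration of what the unified access protocol buys downstream products, the sketch below shows a key/value lookup and a multi-dimensional query going through the same gateway. The URL, API ids, payload fields, and token handling are assumptions for the sketch, not Shulian's actual protocol.

```python
# Hypothetical client for a unified data-service gateway (illustrative only).
import json
import urllib.request

GATEWAY = "https://data-gateway.example.com/api/v1/query"   # placeholder URL

def call_api(api_id, params, token):
    """POST a query to the gateway; authentication, rate limiting, and routing
    to the right storage engine all happen behind this single entry point."""
    body = json.dumps({"apiId": api_id, "params": params}).encode("utf-8")
    req = urllib.request.Request(
        GATEWAY, data=body,
        headers={"Content-Type": "application/json", "Authorization": token},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# key/value point lookup (e.g. a feature service backed by HBase):
# call_api("feature.driver_profile", {"driver_id": "123"}, token="***")

# multi-dimensional analysis query (e.g. backed by ClickHouse):
# call_api("order.gmv_by_city", {"dt": "2023-07-01", "city_id": 1}, token="***")
```
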
After the Shulian platform was built, the time to build a data API dropped from days to minutes, and APIs can be built entirely through the UI without writing code. Shulian currently hosts more than 4,000 APIs, of which more than 1,600 are active weekly, serving more than 200 applications across all business lines and meeting the established construction goals.

Phase three:

Build Shulian's standard indicator service

Through the platform construction in phase two, the data products onboarded were mainly monitoring, dashboards, portals, operations systems, and security-related systems. These systems mainly care about the efficiency of building APIs, the management of API business logic, and API operations and maintenance. However, products built earlier, such as Shuyi and Polaris, already had these capabilities closed-loop inside the product, so it was hard for them to find an entry point for adopting Shulian, or to see the benefit.

For a long time, indicators in big data have been delivered as Hive tables plus indicator descriptions recorded in the indicator management tool. That is, the data warehouse team gives the business side a Hive table together with a textual description of the value logic. When the business side uses the Hive table to build dashboards or run ad-hoc queries, it has to re-derive the retrieval logic each time, which is inefficient. Meanwhile, the same indicator is often displayed on Polaris, Shuyi, and other products, and, most awkwardly, the values frequently disagree, making the consistency of indicator consumption a prominent problem.

The indicator management tool is a metadata management system for indicators and dimensions, built according to the indicator management methodology. Data teams spend a great deal of effort entering indicators and dimensions, yet the tool only provides entry and lookup capabilities; normative indicator construction can only rely on top-down governance and cannot operate effectively on its own. For indicator consistency, an indicator must come from a single source, and what is delivered cannot be a Hive table but the indicator itself; the indicator must be directly consumable.

The difficulties of service-oriented construction in phase two, the ambiguity in how Polaris and Shuyi consume indicators, and the dilemmas of the indicator management tool itself together gave rise to the servitization of standard indicators. The basic idea is shown in the figure: on one side, the indicator management tool provides model management and associates indicators with physical tables; on the other, Shulian provides a unified consumption gateway through which data products such as Shuyi and Polaris consume indicators.

1eed0bead0ee6f45a4e008eed3fd7663.png

For the servitization of standard indicators, metadata management needs to expand the expressiveness of indicators and dimensions, and use logical models to associate indicators and dimensions with specific fields of specific physical tables. To simplify downstream consumption logic, the standard indicator service needs to provide a degree of automated data retrieval. An indicator is usually implemented in several different physical tables, and consistency checks between these implementations can effectively avoid indicator ambiguity.

Metadata management

The most critical metadata for the standard indicator service are indicators, dimensions, and logical models, introduced in turn below.

Indicators

The indicator management methodology mainly defines derived indicators, along with calculated indicators and composite indicators, which are introduced to improve the semantic expressiveness of indicators. Derived indicators are indicators that can be served directly once the physical table (Hive/StarRocks/ClickHouse) has been developed, that is, indicators that must be materialized as fields on a physical table. Calculated indicators are computed from registered derived indicators and need not be materialized as fields on a physical table; the calculation currently supports only the four arithmetic operations (addition, subtraction, multiplication, and division). For example, cancellation rate after answering = cancelled orders after answering / orders answered on the same day. Composite indicators are generated from a registered derived indicator and a composite dimension, and likewise need not be materialized as fields on a physical table. For example, the composite indicator "net network + outbound + flower GMV" is generated from the derived indicator "GMV including openapi and QR-code payment" and the composite dimension "order aggregated business line", whose dimension value is composed as "network + outbound + flower". As shown in the figure below, composite indicators and calculated indicators can be nested within each other; currently a composite indicator may appear at most once in a nesting chain.

d92be30617d13f1e18b01b8d93b636e0.png

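To make the indicator kinds concrete, here is a minimal sketch of derived and calculated indicators and how a calculated indicator might be evaluated once its derived inputs have been queried; a composite indicator would additionally reference a composite dimension. The class and field names are illustrative, not the indicator tool's actual schema.

```python
# Simplified indicator metadata (illustrative only).
from dataclasses import dataclass

@dataclass
class DerivedIndicator:
    name: str
    column: str          # field materialized on a physical table
    agg: str = "SUM"

@dataclass
class CalculatedIndicator:
    name: str
    expression: str      # four arithmetic operations over derived indicators

answered = DerivedIndicator("answered_orders", "answered_cnt")
cancelled = DerivedIndicator("cancelled_after_answer", "cancel_after_answer_cnt")

# cancellation rate after answering = cancelled orders after answering / answered orders
cancel_rate = CalculatedIndicator(
    "cancel_rate_after_answer",
    "cancelled_after_answer / answered_orders",
)

def evaluate(calc, values):
    """Toy evaluator: compute a calculated indicator from its derived inputs."""
    return eval(calc.expression, {}, values)

print(evaluate(cancel_rate, {"cancelled_after_answer": 120, "answered_orders": 1500}))
```
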
Dimensions

Four dimension types are currently supported (a small sketch follows the list):

  • Dimension table dimension : an independent dimension table with a unique primary key and other attribute columns. Dimension-table dimensions let the warehouse build a star schema; if there are foreign keys, a snowflake schema that depends on multiple layers of dimension tables can be built. For example, the city dimension table whole_dw.dim_city.

  • Enumeration dimension : key/value pairs managed centrally. For example, the gender dimension with the key/value pairs male (M) / female (F).

  • Degenerate dimension : the dimension logic cannot be managed centrally; different physical tables have different implementations that nevertheless represent the same dimension. For example, for Polaris's business-line dimension, the logic that maps each sector's business-line id to Polaris's business-line id differs, and has to be decided in the specific implementation.

  • Derived dimension : unlike the degenerate dimension, the dimension logic can be managed centrally as a piece of processing code.

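A tiny sketch of how these dimension types might be represented in metadata; the enum and examples merely restate the list above and are not an actual schema.

```python
# Illustrative-only representation of the four dimension types.
from enum import Enum

class DimensionType(Enum):
    DIM_TABLE = "dimension table dimension"   # star/snowflake joins, e.g. whole_dw.dim_city
    ENUM = "enumeration dimension"            # centrally managed key/value pairs
    DEGENERATE = "degenerate dimension"       # logic differs per physical table
    DERIVED = "derived dimension"             # centrally managed processing code

# an enumeration dimension is just a managed key/value mapping
gender_dimension = {"M": "male", "F": "female"}
```
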
Logical model

The term logical model is interpreted differently in different places. In the indicator management tool, the logical model is the carrier that binds indicators and dimensions to physical tables. A logical model can bind the three indicator types (derived, calculated, and composite) and the four dimension types (dimension-table, enumeration, derived, and degenerate). The indicators and dimensions bound to a logical model can be bound directly to fields of the physical table, or to calculated fields built from those fields; calculated and composite indicators can also remain non-materialized and be resolved from their calculation or composition logic. A logical model can bind multiple indicators and dimensions, and conversely an indicator or dimension can be bound to multiple logical models. Put more plainly, the multiple implementations of an indicator are specified through logical models.

cb464788647266ce948eded484549272.png

When a logical model specifies a physical table, it also specifies the table's storage engine, data layout, and data warehouse layer. Shulian currently supports three storage engines: Hive, StarRocks, and ClickHouse; the data layout can be a general APP table, a Cube table, or a GroupingSets table; and the warehouse layer can be APP, DM, DWS, or DWD.

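To tie the pieces together, here is a hypothetical logical-model definition binding indicators and dimensions to one physical table, including the storage engine, data layout, and warehouse layer it declares. All names and fields are illustrative only.

```python
# Hypothetical logical-model definition (illustrative only).
logical_model = {
    "name": "order_core_model_1",
    "physical_table": "app.app_order_core_di",
    "storage_engine": "clickhouse",               # Hive / StarRocks / ClickHouse
    "data_layout": "app",                         # general APP / Cube / GroupingSets
    "warehouse_layer": "APP",                     # APP / DM / DWS / DWD
    "indicator_bindings": {
        "gmv": "gmv_amount",                              # bound directly to a field
        "finished_orders": "finish_cnt",
        "avg_order_value": "gmv_amount / finish_cnt",     # bound to a calculated field
    },
    "dimension_bindings": {
        "city": "city_id",                        # joins whole_dw.dim_city for attributes
        "business_line": "biz_line_id",
    },
}
```
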
Data retrieval logic automation

In building the standard indicator service on Shulian, automating the data retrieval logic enables centralized management (a single source of truth) on the one hand and improves efficiency on the other. The automation is mainly reflected in the following capabilities of the standard indicator service:

Access data through indicators and dimensions . Users only need to specify the required indicators and dimensions and fetch the data through the access interface. As described under the logical model, an indicator has a one-to-many relationship with logical models; the automated access logic selects the most suitable logical model based on the required indicators, dimensions, partition range, and the best-performing access method. Notably, when the derived indicators that a calculated indicator depends on can only be obtained from different models, the retrieval is completed through a federated query.

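A toy sketch of the model-selection idea, not Shulian's actual routing logic: keep the logical models that cover all requested indicators and dimensions, then prefer the engine assumed to answer fastest; if no single model covers the request, fall back to the federated-query path.

```python
# Illustrative-only logical-model selection.
ENGINE_PRIORITY = {"clickhouse": 0, "starrocks": 1, "hive": 2}   # assumed ordering

models = [
    {"physical_table": "app.app_order_core_di", "storage_engine": "clickhouse",
     "indicator_bindings": {"gmv", "finished_orders"},
     "dimension_bindings": {"city", "dt"}},
    {"physical_table": "dws.dws_order_wide_di", "storage_engine": "hive",
     "indicator_bindings": {"gmv", "finished_orders", "answered_orders"},
     "dimension_bindings": {"city", "dt", "business_line"}},
]

def select_model(models, indicators, dimensions):
    candidates = [
        m for m in models
        if set(indicators) <= set(m["indicator_bindings"])
        and set(dimensions) <= set(m["dimension_bindings"])
    ]
    if not candidates:
        return None                               # -> federated query across models
    return min(candidates, key=lambda m: ENGINE_PRIORITY[m["storage_engine"]])

best = select_model(models, {"gmv"}, {"city", "dt"})
print(best["physical_table"] if best else "federate across models")
```
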
Support for multiple data layouts . General APP tables, Cube tables, and GroupingSets tables are currently supported. The retrieval logic for the different layouts is shielded from users, who do not need to care about the layout of the underlying table.

Support for multiple data warehouse modeling schemas . Under the data warehouse modeling standard, single-table, star, and snowflake schemas can all be produced. For snowflake and star schemas, the automated access logic automatically joins the required dimension tables, enabling complex retrieval such as rolling up along different attributes of a dimension-table dimension.

Support for roll-up across daily, weekly, monthly, and quarterly granularities . Previously, data at different time granularities could only be provided by developing different tables; now, as long as query performance is guaranteed, time-granularity roll-up is handled automatically.

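As a condensed illustration of the capabilities above, the sketch below generates a weekly roll-up from a daily APP table with a dimension-table join. The table and column names are made up, and the SQL targets ClickHouse for the example; real SQL generation also has to handle Cube/GroupingSets layouts and the dialects of the other engines.

```python
# Illustrative-only SQL generation for a dimension join + weekly time roll-up.
def build_rollup_sql(indicator_col, dim_attr, start_dt, end_dt):
    return f"""
    SELECT toMonday(f.dt)         AS week,
           d.{dim_attr}           AS {dim_attr},
           SUM(f.{indicator_col}) AS {indicator_col}
    FROM app.app_order_core_di AS f
    JOIN whole_dw.dim_city     AS d ON f.city_id = d.city_id
    WHERE f.dt BETWEEN '{start_dt}' AND '{end_dt}'
    GROUP BY week, d.{dim_attr}
    """

print(build_rollup_sql("gmv_amount", "city_name", "2023-07-03", "2023-07-30"))
```
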
1bf284f3425dea430f264b8f2356b127.png

Consistency check

Besides improving retrieval efficiency through automation, indicator consistency is the other core motivation for building the standard indicator service on Shulian. Indicator consistency is ensured on the one hand through the unified consumption interface, and on the other hand through passive and automatic verification based on how the indicator is actually implemented.

For passive indicator verification, the user configures the indicators to verify on the platform. As shown in the figure below, an indicator such as "group-wide GMV for the last day" may be implemented in multiple logical models, so the verification logic runs a periodic comparison after those logical models have been produced. Another case is that "group-wide GMV for the last day" can be reproduced as "ride-hailing GMV for the last day" + "two-wheeler GMV for the last day" + "freight GMV for the last day"; as shown below, the verification can then run periodically after logical model_1 on the left and logical model_1', logical model_2', and logical model_3' on the right have all been produced.

15d95f2bd7986a2059a8a338d9357d3f.png

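A toy version of the passive check described above: query the same indicator from several logical models (or from a decomposition such as group GMV = ride-hailing + two-wheeler + freight), then compare the results within a tolerance. The function and thresholds are illustrative only.

```python
# Illustrative-only indicator consistency check.
def check_consistency(name, results, rel_tolerance=1e-6):
    """results: mapping of logical model (or decomposition) -> indicator value."""
    baseline = next(iter(results.values()))
    diffs = {
        model: abs(value - baseline) / max(abs(baseline), 1e-12)
        for model, value in results.items()
    }
    inconsistent = {m: d for m, d in diffs.items() if d > rel_tolerance}
    if inconsistent:
        print(f"[ALERT] indicator '{name}' diverges across models: {inconsistent}")
    return not inconsistent

check_consistency("group_gmv_last_day", {
    "logical model_1": 1_000_000.0,
    "model_1' + model_2' + model_3'": 1_000_250.0,   # summed decomposition
})
```
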
Automatic indicator verification differs from passive verification in that the model decomposition is generated automatically by the system, and the indicators that can be verified are also screened out by the system.

Access & Query Process

At present, the services that access Shulian's standard indicators include Polaris, Shuyi, and InSight, which are also Didi's core data products. Integrating the standard indicator service with these three products posed different challenges, which will be covered in detail in articles shared by other colleagues. Here we only briefly introduce the basic process of data product access and querying.

In the usual process, the data BP enters indicators and dimensions in the indicator management tool, data warehouse engineers build the warehouse according to the relevant data architecture methods and create logical models that bind the indicators and dimensions, and the managed metadata is synchronized to Shulian in real time.

Shuyi, Polaris, and InSight build reports and dashboards through the metadata interfaces. When a data query is initiated, the request is sent to Shulian; after model screening and optimization, the final SQL is generated and executed, and the query results are returned to the data product.

fd1fbcc9cd5a6a58e33f866ef8a295d9.png

The overall structure of the data service system

The Shulian platform aims to be a one-stop data service platform that is standardized, efficient, stable, and secure. The business scenarios it currently serves fall into local data services, offline data services, feature services, and the standard indicator service. Shulian is divided into a gateway layer and an engine layer. The gateway implements the unified entrance and provides capabilities such as authentication, rate limiting, caching, routing, and monitoring. The engine layer splits implementation into key/value scenarios, multi-dimensional analysis scenarios, and standard indicator service scenarios. The key/value scenario mainly serves feature services, that is, business scenarios such as Niudun and map features; the multi-dimensional analysis scenario mainly serves local and offline data services, that is, business scenarios such as Horus and Jiuxiao. The key/value and multi-dimensional analysis scenarios are the core capabilities from phase two, and the standard indicator service scenario is the core capability from phase three.

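A minimal sketch of the gateway-layer responsibilities listed above (authentication, rate limiting, caching, routing); this is purely illustrative and not Shulian's implementation.

```python
# Illustrative-only gateway handling: auth, fixed-window rate limit, cache, route.
import time

class Gateway:
    def __init__(self, qps_limit=100):
        self.qps_limit = qps_limit
        self.window, self.count = int(time.time()), 0
        self.cache = {}                              # (api, params) -> cached result

    def handle(self, token, api_id, params, backend):
        if token != "valid-token":                   # authentication (placeholder check)
            raise PermissionError("unauthorized")
        now = int(time.time())                       # fixed-window rate limiting
        if now != self.window:
            self.window, self.count = now, 0
        self.count += 1
        if self.count > self.qps_limit:
            raise RuntimeError("rate limited")
        key = (api_id, tuple(sorted(params.items())))
        if key not in self.cache:                    # caching + routing to the engine layer
            self.cache[key] = backend(api_id, params)
        return self.cache[key]

gw = Gateway()
print(gw.handle("valid-token", "order.gmv_by_city", {"dt": "2023-07-01"},
                backend=lambda api, p: {"rows": [[1, 123.4]]}))
```
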
To support diverse and complex data query demands, we also built a unified query middleware: DiQuery. Relying on the strong query capabilities of MPP engines, DiQuery provides a unified query capability serving data products such as Shulian and Shuyi. Besides single-table queries, DiQuery supports federated queries and complex LOD (level-of-detail) function queries, as well as extended calculations such as year-on-year ratios and four-week averages, with roll-up supported on top of these.

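A toy illustration of the extended calculations mentioned above (year-on-year ratio and trailing four-week average); the numbers are made up and this is not DiQuery's API.

```python
# Illustrative-only extended calculations.
daily_gmv = {"2022-07-10": 90.0, "2023-07-10": 108.0}
weekly_gmv = [100.0, 110.0, 95.0, 105.0]            # the last four weeks

yoy_ratio = daily_gmv["2023-07-10"] / daily_gmv["2022-07-10"] - 1
four_week_avg = sum(weekly_gmv) / len(weekly_gmv)

print(f"year-on-year: {yoy_ratio:+.1%}, four-week average: {four_week_avg}")
```
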
a004180e632d51f830ec66ec5968f91e.png

Summary and Outlook

Didi's data service system has evolved from the original data backflow tasks, through building a unified data service platform, to building the standard indicator service, moving step by step toward a better data service system. The servitization of standard indicators is this year's highlight, and it is developing rapidly through the close cooperation of data warehouse engineering, product, and platform engineering.

The current data service system decouples data production from data consumption. Next, we need to promote the standardization of data production, further address indicator consistency, improve the efficiency of data warehouse construction, and improve data quality from the indicator perspective. The servitization of standard indicators will be an important evolution of the data platform, advanced step by step, and the curtain is slowly rising on it across the industry.
