Data warehouse indicator system in practice

Indicator system

1. Pain point analysis

 Pain points are analyzed mainly from three perspectives: business, technology, and product.

  • Business perspective

    The indicators and dimensions for business analysis scenarios are not clearly defined;

    Requirements change frequently and are iterated repeatedly, data reports become bloated, and data quality is uneven;

    Users analyzing a specific business problem face a high cost in finding the data and then checking and confirming it.

  • Technical perspective

    Indicator definition: indicator names are confusing, indicators are not unique, and the caliber (calculation definition) maintained for each indicator is inconsistent;

    Indicator production: construction is duplicated and data reconciliation is costly;

    Indicator consumption: data outlets are not unified, output is duplicated, and output calibers are inconsistent.

  • Product perspective

    System-level product support is lacking: the data flow from production to consumption is not connected end to end at the product level.

2. Management objectives

  • Technical goals

    Unified management of indicators and dimensions: standardized indicator naming and calculation caliber, a unique statistical source, standardized dimension definitions, and consistent dimension values.

  • Business goals

    Unified data outlets and coverage of business scenarios.

  • Product goals

    Productization of the indicator system management tools; productization of the indicator system content to support decision-making, analysis, and operations, such as the decision-making Polaris product and intelligent operation analysis products.

3. Model Architecture

 Business Line 

Principle for defining business sectors: abstract at the business-logic level and subdivide at the physical organizational-structure level. Sectors can be subdivided hierarchically according to the actual business, with at most three levels recommended. The first-level division can be fixed by a unified company-wide specification, while the second and later levels can be split according to the actual business of each business line.

For example, at the business-logic level of Didi Chuxing, two-wheeled and four-wheeled vehicles both belong to the travel field, so a travel business sector can be abstracted (level 1). At the organizational level this sector is divided into level-2 lines such as online car-hailing and the inclusive (Puhui) line, which can be subdivided further as the business requires: online car-hailing into solo and shared rides, and the inclusive line into bicycles and enterprise-level services.
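The three-level recommendation above can be sketched as a small tree with a depth check. This is only an illustration: the `BusinessLine` class, the `MAX_LEVELS` constant, and the English node names are assumptions made for the example, not part of any Didi tool.

```python
from dataclasses import dataclass, field
from typing import List

MAX_LEVELS = 3  # the text recommends at most three levels of subdivision


@dataclass
class BusinessLine:
    """One node in a business-sector hierarchy (hypothetical structure)."""
    name: str
    level: int = 1
    children: List["BusinessLine"] = field(default_factory=list)

    def add_child(self, name: str) -> "BusinessLine":
        """Attach a sub-line one level below this node, enforcing the depth limit."""
        if self.level >= MAX_LEVELS:
            raise ValueError(
                f"cannot subdivide '{self.name}': limit is {MAX_LEVELS} levels"
            )
        child = BusinessLine(name=name, level=self.level + 1)
        self.children.append(child)
        return child

    def paths(self, prefix: str = "") -> List[str]:
        """Return every path from this node downward, parent first."""
        here = f"{prefix}/{self.name}" if prefix else self.name
        result = [here]
        for child in self.children:
            result.extend(child.paths(here))
        return result


# Level 1 sector, one level-2 line, and its level-3 subdivisions.
travel = BusinessLine("Travel")
car_hailing = travel.add_child("Online car-hailing")
car_hailing.add_child("Solo ride")
car_hailing.add_child("Shared ride")
```

Calling `travel.paths()` lists the sector and every subdivision beneath it, and trying to subdivide a level-3 node raises `ValueError`.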

 canonical definition 

  • Data domain

A data domain is a collection of business processes or dimensions abstracted for business analysis. A business process can be summarized as an indivisible behavioral event, and indicators are defined under a business process. A dimension is the context of a measurement: for a passenger order-calling event, for example, the order type is a dimension. To keep the whole system viable, data domains need to be abstracted carefully and then maintained and updated over the long term, and any change must go through the change process.

  • Business process

Refers to the company's business activity events, such as placing orders and making payments. A business process cannot be split further.

  • Time period

Specifies the time range or point in time of the statistics, such as the last 30 days, a natural (calendar) week, or cumulative up to a cutoff date.

  • Modification type

An abstract grouping of modifiers. A modification type belongs to a business domain; for example, the access terminal type in the log domain covers modifiers such as the APP terminal and the PC terminal.

  • Modifier

A modifier is a restriction on the business scenario of an indicator, abstracted separately from the statistical dimensions. Each modifier belongs to one modification type; for example, the access terminal type in the log domain contains the modifiers APP, PC, and so on.

  • Metrics/Atomic metrics

Atomic metrics and metrics have the same meaning. A metric is a measure based on a business event; it cannot be split further in the business definition and has a name with clear business meaning, such as payment amount.

  • Dimension

A dimension is the context of a measurement and reflects a class of attributes of the business. The collection of such attributes constitutes a dimension, which can also be called an entity object. Dimensions belong to a data domain; examples are the geographic dimension (country, region, province, city, and so on) and the time dimension (year, quarter, month, week, day, and so on).

  • Dimension attribute

Dimension attributes belong to a dimension. In the geographic dimension, for example, country name, country ID, and province name are all dimension attributes.

  • Indicators are mainly classified into atomic indicators, derived indicators, and derivative indicators.

  1. Atomic indicators

    Measures based on a single business event. They cannot be split further in the business definition and have names with clear business meaning, such as call volume and transaction amount.

  2. Derived indicators

    A derived indicator is one atomic indicator + multiple modifiers (optional) + a time period; it delimits the business statistical scope of the atomic indicator. There are two types of derived indicators:

    Transactional indicators: indicators that measure business processes, such as the number of orders and the order payment amount. For these, maintain the atomic indicators and modifiers, and create derived indicators on that basis.

    Stock indicators: statistics on certain states of entity objects (such as drivers and passengers), for example the total number of registered drivers and the total number of registered passengers. For these, also maintain the atomic indicators and modifiers and create derived indicators on that basis; the corresponding time period is generally "history up to the current time".

  3. Derivative indicators

    Compound indicators built on transactional and stock indicators, mainly of the ratio, proportion, and statistical-average types.
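To make the composition rule concrete, here is a minimal Python sketch of "one atomic indicator + optional modifiers + a time period", plus a ratio-type indicator computed from two values. All class, field, and indicator names (`AtomicIndicator`, `payment_amount`, `last_30d`, and so on) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class AtomicIndicator:
    name: str              # e.g. "payment_amount"
    business_process: str  # e.g. "payment"


@dataclass(frozen=True)
class Modifier:
    modification_type: str  # e.g. "access_terminal"
    value: str              # e.g. "APP"


@dataclass(frozen=True)
class TimePeriod:
    code: str  # e.g. "last_30d"
    days: int  # window length in days, ending on the statistics date

    def window(self, stat_date: date) -> Tuple[date, date]:
        """Inclusive (start, end) date range ending on stat_date."""
        return stat_date - timedelta(days=self.days - 1), stat_date


@dataclass(frozen=True)
class DerivedIndicator:
    """One atomic indicator + optional modifiers + a time period."""
    atomic: AtomicIndicator
    period: TimePeriod
    modifiers: Tuple[Modifier, ...] = ()


def ratio(numerator: float, denominator: float) -> Optional[float]:
    """Ratio-type compound indicator; None when the denominator is zero."""
    if denominator == 0:
        return None
    return numerator / denominator


pay_amt = AtomicIndicator("payment_amount", "payment")
app = Modifier("access_terminal", "APP")
last_30d = TimePeriod("last_30d", 30)

# "APP payment amount in the last 30 days"
app_pay_30d = DerivedIndicator(pay_amt, last_30d, (app,))
```

Because the dataclasses are frozen, two derived indicators built from the same components compare equal, which is one simple way to detect that the "same" indicator is being defined twice.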

 model design 

Construction mainly follows the dimensional modeling method. The detail fact tables of the basic business layer mainly store sets of dimension attributes together with metrics (atomic indicators). The summary fact tables of the analysis layer are stored separately by indicator category (deduplicated or non-deduplicated). Summary fact tables for non-deduplicated indicators store the set of statistical dimensions and the atomic or derived indicators, while summary fact tables for deduplicated indicators store only the set of statistical labels of the analysis entity.
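One way to see why the two indicator categories are stored differently: additive measures can be rolled up by summing summary rows, but distinct counts cannot. The sketch below uses made-up order rows and hypothetical names (`city_summary`, `driver_labels`, `active_drivers`); it is not Didi's actual table design.

```python
from collections import defaultdict

# Detail fact rows: (driver_id, city, order_amount). Values are made up.
orders = [
    ("d1", "Beijing", 20.0),
    ("d1", "Shanghai", 35.0),
    ("d2", "Beijing", 15.0),
    ("d2", "Beijing", 10.0),
]

# Non-deduplicated summary: one row per city holding additive measures.
city_summary = defaultdict(lambda: {"order_cnt": 0, "order_amt": 0.0})
for driver_id, city, amount in orders:
    city_summary[city]["order_cnt"] += 1
    city_summary[city]["order_amt"] += amount

# Additive measures roll up correctly by summing the summary rows.
total_orders = sum(row["order_cnt"] for row in city_summary.values())

# Deduplicated summary: one row per entity (driver) holding its labels,
# here simply the set of cities in which the driver has orders.
driver_labels = defaultdict(set)
for driver_id, city, _ in orders:
    driver_labels[driver_id].add(city)


def active_drivers(cities):
    """Count distinct drivers with at least one order in any given city."""
    wanted = set(cities)
    return sum(1 for labels in driver_labels.values() if labels & wanted)


# Summing per-city distinct counts double counts driver d1.
naive_sum = sum(active_drivers([city]) for city in city_summary)
correct = active_drivers(["Beijing", "Shanghai"])
```

Here `naive_sum` is larger than `correct` because driver `d1` has orders in both cities; keeping entity-level labels lets the distinct count be computed for any combination of conditions at query time.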

At the physical implementation level of the data warehouse, construction of the indicator system is guided mainly by the layered structure of the data warehouse model. Didi's indicator data is stored mainly in the DWM layer, which is the core management layer for indicators.

4. Indicator system metadata management

Apache Atlas | Metadata Management Framework

 Dimension management 

Dimension management covers basic information and technical information, which are maintained and managed by different roles.

  • The basic information corresponds to the business information of the dimension and is maintained by business managers, data product managers, or BI analysts. It mainly includes the dimension name, business definition, and business classification.

  • The technical information corresponds to the data information of the dimension and is maintained by data R&D. It mainly includes whether a dimension table exists (that is, whether the dimension is an enumeration dimension or has an independent physical dimension table), whether it is a date dimension, and the English and Chinese names of the code field and of the name field. If the dimension has a physical dimension table, it must be bound to that table and the fields corresponding to code and name must be set. If the dimension is an enumeration dimension, the corresponding code and name values must be filled in. Unified dimension management helps standardize data tables later and makes dimensions easier for users to query and use.
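The binding rules for the technical information can be expressed as a small validation routine. This is a sketch under assumptions: the `Dimension` class, its field names, and the table name `dim_city` are invented for the example and do not describe a real metadata schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Dimension:
    # Basic (business) information
    name: str
    business_definition: str
    # Technical information
    has_dim_table: bool                      # physical table vs. enumeration
    is_date_dimension: bool = False
    dim_table: Optional[str] = None          # bound physical dimension table
    code_field: Optional[str] = None
    name_field: Optional[str] = None
    enum_values: Dict[str, str] = field(default_factory=dict)  # code -> name

    def validate(self) -> List[str]:
        """Return a list of rule violations; an empty list means valid."""
        problems: List[str] = []
        if self.has_dim_table:
            if not self.dim_table:
                problems.append("physical dimension table is not bound")
            if not (self.code_field and self.name_field):
                problems.append("code and name fields are not set")
        elif not self.enum_values:
            problems.append("enumeration dimension has no code/name values")
        return problems


city = Dimension(
    "city", "City where the order was placed", has_dim_table=True,
    dim_table="dim_city", code_field="city_id", name_field="city_name",
)
terminal = Dimension(
    "access_terminal", "Client used to access the service",
    has_dim_table=False, enum_values={"1": "APP", "2": "PC"},
)
incomplete = Dimension("channel", "Acquisition channel", has_dim_table=True)
```

`city.validate()` and `terminal.validate()` return empty lists, while `incomplete.validate()` reports both the missing table binding and the missing code and name fields.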

 Indicator management 

Indicator management covers basic information, technical information, and derivative information, which are maintained and managed by different roles.

  • The basic information corresponds to the business information of the indicator and is maintained by business managers, data product managers, or BI analysts. It mainly includes attribution information (business sector, data domain, business process), basic information (indicator name, indicator English name, indicator definition, statistical algorithm description, indicator type (deduplicated or non-deduplicated)), and business scenario information (analysis dimensions, scenario description);

  • The technical information corresponds to the physical model information of the indicator and is maintained by data R&D. It mainly includes the corresponding physical table and field information;

  • The derivative information covers the associated derived or derivative indicators and the associated data applications and business scenarios. It lets users query which other indicators and data applications use a given indicator, and provides indicator lineage analysis for tracing data sources.

  • Atomic indicator definition: attribution information + basic information + business scenario information

  • Derived indicator definition: time period + modifier set + atomic indicator

  • Modification type: mainly includes type description, statistical algorithm description, data source (optional)
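The "used by" query and the lineage tracing mentioned under derivative information amount to graph traversal over indicator metadata. The following is a minimal in-memory sketch; the `USED_BY` mapping and every indicator and application name in it are hypothetical, and a real deployment would read these edges from a metadata store.

```python
from collections import deque
from typing import Dict, List, Set

# Edges point from an indicator to whatever consumes it directly.
USED_BY: Dict[str, List[str]] = {
    "order_cnt": ["order_cnt_last_7d", "order_cnt_last_30d"],
    "order_cnt_last_7d": ["completion_rate_last_7d"],
    "completed_order_cnt_last_7d": ["completion_rate_last_7d"],
    "completion_rate_last_7d": ["app:driver_dashboard"],
}


def reachable(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search: every node reachable from start, excluding start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Reverse every edge so consumers point back to their sources."""
    inverted: Dict[str, List[str]] = {}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def downstream(indicator: str) -> Set[str]:
    """Indicators and applications that depend on the given indicator."""
    return reachable(indicator, USED_BY)


def upstream(indicator: str) -> Set[str]:
    """Indicators the given indicator is built from (its lineage)."""
    return reachable(indicator, invert(USED_BY))
```

`downstream("order_cnt")` answers "who uses this indicator", and `upstream("completion_rate_last_7d")` traces an indicator back to its sources.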

5. Index system construction process 

 modeling process 

The modeling process mainly guides engineers, from a business perspective, in abstracting, categorizing, and unifying the business terminology of the indicators involved in a requirement scenario, which reduces communication costs and avoids duplicated construction of indicators later.

The analytical data system is the physical collection of summary fact tables in the model architecture; at the business-logic level, it abstracts the indicator system by business analysis object or scenario. Didi Chuxing mainly abstracts themes by analysis object, such as the driver theme, safety theme, experience theme, and city theme. Indicators are classified mainly by abstracting the actual business processes, for example driver transaction indicators, driver registration indicators, and driver growth indicators.

The basic data system is the physical collection of detail fact tables and basic dimension tables in the model architecture. At the business-logic level it is abstracted by actual business scenario, such as driver compliance and passenger registration, so that it reproduces the core business processes.

 Development Process 

The development process guides engineers, from a technical perspective, in the production, operation and maintenance, and quality control of the indicator system. It also serves as a bridge for communication and coordination between data product managers or data analysts and data warehouse R&D.

6. Construction of indicator system map

 Overview of the indicator system map 

The indicator system map, also known as the data analysis map, abstracts business analysis entities from actual business scenarios and then consolidates the business classifications, analysis indicators, and dimensions that each entity involves. Construction method: build the map from a business mindset and the user's point of view, tie business and data closely together, and organize indicators in a structured, classified way.

Construction purpose:

  • For users:

    Users can quickly locate the indicators and dimensions they need, and because the indicator system accumulates knowledge of business scenarios, they can satisfy their data needs quickly.

  • For R&D:

    It supports the design of subsequent indicator production models, gives data content clear boundaries, makes iterations of data system construction quantifiable, and helps turn data into managed assets.

  

 Indicator System Graph Model 

 An example of an indicator system map 

7. Productization of the indicator system

The set of products around the indicator system is built mainly along the indicator life cycle. Product tools connect the data flow end to end so that the indicator system can be managed in a unified, automated, standardized, and process-driven way. Because the essential goal of building an indicator system is to serve the business and realize data-driven business value, the core construction principle is "light on standards, heavy on scenarios, from control-oriented to service-oriented". By integrating tools, products, technologies, and organizations, users can use data more efficiently and iterate on business innovation faster.


Among these products, the one most closely tied to the indicator system methodology is the indicator dictionary tool. Its positioning and value are:

  • It carries the indicator management specification from method to implementation: it generates standardized indicators automatically, solves the problems of confusing indicator names and non-unique indicators, and eliminates ambiguity in the data

  • It provides standard indicator calibers and metadata information in a unified manner
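The uniqueness guarantee described for the indicator dictionary can be illustrated with a tiny registry that derives the name from the definition and rejects name collisions. Everything here is an assumption made for the example: the `IndicatorDictionary` class, the underscore-joined naming rule, and the sample modifiers are not taken from the actual tool.

```python
from typing import Dict, FrozenSet, Iterable, Tuple

# (atomic indicator, time period, set of modifiers)
Definition = Tuple[str, str, FrozenSet[str]]


class IndicatorDictionary:
    """Registry that keeps one standardized name per indicator definition."""

    def __init__(self) -> None:
        self._name_by_definition: Dict[Definition, str] = {}
        self._definition_by_name: Dict[str, Definition] = {}

    def register(self, atomic: str, period: str,
                 modifiers: Iterable[str] = ()) -> str:
        """Return the standardized name, creating it on first registration."""
        definition: Definition = (atomic, period, frozenset(modifiers))
        if definition in self._name_by_definition:
            # Same definition registered again: reuse the existing name.
            return self._name_by_definition[definition]
        # Hypothetical naming rule: sorted modifiers, atomic name, period.
        name = "_".join(sorted(definition[2]) + [atomic, period])
        if name in self._definition_by_name:
            raise ValueError(f"name '{name}' already denotes another definition")
        self._name_by_definition[definition] = name
        self._definition_by_name[name] = definition
        return name

    def __len__(self) -> int:
        return len(self._name_by_definition)
```

Because the name is generated rather than typed by hand, registering the same components in a different modifier order yields the same indicator, and two different definitions can never share a name.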

Tool design process (methodology->definition->production->consumption)

Indicator definition

Indicator production

Concluding remarks

This article has introduced the methodology and practice of indicator system construction, together with the construction of the supporting tool products. The indicator dictionary and the development tools are already connected into one process; once they are connected with the data consumption products, data services will be provided through DataAPI.

Origin blog.csdn.net/ytp552200ytp/article/details/126094043