Data warehouse construction and data governance

This article comes from Five Minutes to Learn Big Data, by Yuan Mo: a comprehensive overview of data warehouse construction and data governance.

Foreword

Reasons for data warehouse stratification:

  1. Trading space for time: heavy preprocessing improves the user experience of the application systems, at the cost of a large amount of redundant data in the warehouse. Without layering, a change in the business rules of a source system would ripple through the entire data cleaning process, creating an enormous amount of rework; pulling one hair moves the whole body.
  2. Layered management simplifies data cleaning: work that would otherwise be done in one step is split across several, which turns one complex job into several simple ones and a large black box into a white box. The processing logic of each layer is relatively simple and easy to understand, so it is easier to verify the correctness of every step, and when a data error occurs we usually only need to adjust one step.

Bill Inmon, the father of the data warehouse, defined a data warehouse as a subject-oriented, integrated, relatively stable collection of data that reflects historical changes and is used to support management decisions. The keywords in this definition are subject-oriented, integrated, stable, reflecting historical change, and supporting management decisions, and each of them is realized through the layered architecture.
A good layered architecture has the following benefits:
1. Clear data structure: each data layer has a well-defined scope, which makes data easier to locate and understand when it is used.
2. Data lineage tracking: the data served to business users or downstream systems is target data, generally derived from multiple source tables. When target data is abnormal, clear lineage makes it quick to locate the problem; lineage management is also an important part of metadata management.
3. Less repetitive development: under the principle of layer-by-layer processing, each lower layer contains the full data needed by the layer above, which keeps individual developers from re-extracting data from the source systems.
4. Organized data relationships: data relationships between source systems are complex; for example, customer information may exist simultaneously in the core system, credit system, wealth management system, and funds system. Which copy should a query use? The data warehouse models all data on the same subject uniformly, sorting complex relationships into a well-organized data model and avoiding this problem.
5. Shielding the influence of raw data: under the layer-by-layer principle, each layer's data is derived from the layer below, and skipping layers is not allowed. Raw data sits at the bottom of the warehouse, several processing layers away from the application layer, so changes in the raw data are absorbed during processing and the application layer stays stable.

Layering aims to provide fast data support for the current business, abstract a common framework that can empower other business lines in the future, supply stable and accurate data support for business growth, and give direction to new business development based on the existing models; in other words, to drive and empower the business with data.

Features of a good data warehouse:

  1. Stability: data output is stable and guaranteed
  2. Credible: clean data and high data quality
  3. Rich: the business covered by the data is broad enough
  4. Transparency: The data composition system is sufficiently transparent

Data Warehouse Design

Three dimensions of data warehouse design:

  • Functional architecture: clear structural hierarchy
  • Data Architecture: Data Quality Guaranteed
  • Technical architecture: easy to expand and easy to use

Data Warehouse Architecture

According to the process of data inflow and outflow, the data warehouse architecture can be divided into: source data, data warehouse, and data application.
The data in the warehouse comes from different data sources and feeds a variety of data applications. Data flows into the warehouse from the bottom up and is then opened to the applications above; the warehouse itself is a platform for integrated data management in the middle.

  • Source data: this layer keeps the data and structures of the peripheral systems unchanged and is not exposed externally; it is a staging layer, a temporary area for interface data that prepares for the next processing step.
  • Data warehouse: also called the detail layer. Data in the DW layer should be consistent, accurate, and clean, i.e. the source-system data after cleansing (removing impurities).
  • Data application: the data read directly by front-end applications, generated according to reporting and thematic-analysis requirements.

Acquiring data from the various sources, and transforming and moving data inside the warehouse, together constitute ETL (Extract, Transform, Load). ETL is the pipeline of the data warehouse; it can be regarded as the warehouse's blood, maintaining the metabolism of its data, and most of the effort in daily warehouse management and maintenance goes into keeping ETL running normally and stably.
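The Extract-Transform-Load flow described above can be sketched in a few lines. This is a minimal illustration in plain Python, with no warehouse engine assumed; all record layouts, field names, and cleaning rules are hypothetical:

```python
# Minimal ETL sketch. Source records, field names, and cleaning
# rules are illustrative, not from any real system.

def extract(source_rows):
    """Extract: pull raw records from a source system (here, a list)."""
    return list(source_rows)

def transform(rows):
    """Transform: clean and standardize, e.g. drop rows missing the key
    and normalize customer-name casing."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):
            continue  # cleaning rule: a record without its key is dropped
        row["customer_name"] = row["customer_name"].strip().title()
        cleaned.append(row)
    return cleaned

def load(rows, warehouse):
    """Load: append cleaned records into the target layer (a dict here)."""
    warehouse.setdefault("dwd_orders", []).extend(rows)
    return warehouse

source = [
    {"customer_id": "c1", "customer_name": "  alice  ", "amount": 10.0},
    {"customer_id": None, "customer_name": "ghost", "amount": 5.0},
]
wh = load(transform(extract(source)), {})
print(len(wh["dwd_orders"]))  # 1: the row without a customer_id was dropped
```

In a real warehouse each stage would be a scheduled job against actual storage, but the shape of the pipeline is the same.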
Building a data warehouse is like creating a new life, and the layered architecture is only the logical skeleton of that life. To grow flesh and blood on the skeleton, appropriate data modeling is required; whether the data warehouse turns out strong or weak, healthy or misshapen, depends on the modeling results.

Data Warehouse Modeling Method

There are many modeling methods for data warehouses; each represents a philosophical viewpoint, a way of abstracting and generalizing the world. Common methods include paradigm (normal-form) modeling, dimensional modeling, and entity modeling, and each looks at business problems from a different perspective.

  1. Paradigm modeling method
    Paradigm (normal-form) modeling is the method we commonly use when building data models. It is advocated mainly by Inmon and is chiefly a technique for organizing data storage in relational databases; most modeling in relational databases today uses third-normal-form modeling.
    A normal form is a set of relational schemas conforming to a certain level of rules. A database must be built according to certain rules, and in relational databases those rules are the normal forms; applying them is called normalization. Relational databases currently have six normal forms: first normal form (1NF), second normal form (2NF), third normal form (3NF), Boyce-Codd normal form (BCNF), fourth normal form (4NF), and fifth normal form (5NF).
    In data warehouse model design, third normal form is generally adopted. A relation in third normal form must satisfy three conditions:
    • Each attribute value is unique and unambiguous
    • Each non-key attribute must fully depend on the entire primary key, not on part of it
    • No non-key attribute may depend on an attribute of another relation; such an attribute should be moved into that other relation
      In Inmon's view, building the data warehouse model resembles building the enterprise data model of a business system. In a business system, the enterprise data model determines where data comes from, and it is divided into two levels, the subject domain model and the logical model. Likewise, the subject domain model can be regarded as the conceptual model of the business, while the logical model is the instantiation of the domain model in a relational database.
  2. Entity Modeling
    Entity modeling is not a mainstream method in data warehouse modeling; it derives from a school of philosophy. From a philosophical point of view, the objective world can be subdivided into entities and the relationships between them. We can bring this abstraction into data warehouse modeling: divide the whole business into individual entities, and treat the relationships between entities, together with the descriptions of those relationships, as the work data modeling needs to do.
    Although entity modeling may look abstract at first glance, it is actually easy to understand: any business process can be divided into three parts, entities, events, and descriptions. For example, consider the simple fact "Xiao Ming drives to school". Here "Xiao Ming" and "school" are entities, "going to school" describes a business process that we abstract as an event, and "drives to" can be regarded as a description of the event "going to school".
  3. Dimensional modeling method
    Dimensional modeling is advocated by Ralph Kimball, the other master of the data warehouse field; his The Data Warehouse Toolkit is the most popular modeling classic in data warehouse engineering. Dimensional modeling builds the model around the needs of analysis and decision-making, so the resulting data model serves analysis: it focuses on answering users' analytical questions quickly and also responds well to large, complex queries.
    At present, dimensional modeling is the method most commonly used in Internet companies.
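As a concrete illustration of dimensional modeling, the sketch below builds a tiny star schema, one fact table plus two dimension tables, and answers an analytical question by joining them. All table layouts, keys, and data are made up for the example:

```python
# A toy star schema: fact_sales references dim_customer and dim_product
# by surrogate keys. Data and names are purely illustrative.

dim_customer = {1: {"name": "Xiao Ming", "city": "Beijing"},
                2: {"name": "Xiao Hong", "city": "Shanghai"}}
dim_product = {10: {"name": "pen"}, 11: {"name": "notebook"}}

# Each fact row is one measurement at order-line (atomic) granularity.
fact_sales = [
    {"customer_key": 1, "product_key": 10, "amount": 3.0},
    {"customer_key": 1, "product_key": 11, "amount": 7.0},
    {"customer_key": 2, "product_key": 10, "amount": 2.0},
]

# Analytical query: total sales amount per city. Join each fact row to
# its customer dimension, then aggregate the additive measure.
totals = {}
for row in fact_sales:
    city = dim_customer[row["customer_key"]]["city"]
    totals[city] = totals.get(city, 0.0) + row["amount"]

print(totals)  # {'Beijing': 10.0, 'Shanghai': 2.0}
```

In a real warehouse the same shape appears as SQL joins between a fact table and its dimension tables, but the division of labor, dimensions for slicing and facts for measuring, is exactly what the example shows.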

modeling

In actual business we are handed a great deal of data; how do we use it to build a data warehouse? The author of The Data Warehouse Toolkit distills decades of practical business experience into the following four steps.

  1. Select the business process
    Dimensional modeling stays close to the business, so it must be based on business: selecting the business process means choosing, within the overall business flow, the part we need to model, according to the requirements raised by operations and the ease of future expansion. For example, a mall's overall flow spans the merchant side, the user side, and the platform side; if the operational requirements concern total order value, number of orders, and users' purchasing behavior, then when selecting the business process we consider the user-side data. Business selection is very important, because every subsequent step is based on this business data.
  2. Declare the granularity
    Let's start with an example: a user has one ID number and one household address, but multiple mobile phone numbers and multiple bank cards, so mobile-phone-number granularity and bank-card granularity are finer than user granularity, while a one-to-one relationship means the same granularity. Why emphasize the same granularity? Because dimensional modeling requires a single fact table to have a single granularity: do not mix different granularities in one fact table, and create separate fact tables for data at different granularities. When obtaining data from a given business process, it is strongly recommended to begin the design at atomic granularity, i.e. the finest granularity, because atomic granularity can withstand unexpected user queries. Rolled-up summary granularity, however, is very important for query performance, so for data with clear requirements we build rollup summaries tailored to those requirements, and for data whose requirements are unclear we keep atomic granularity.
  3. Confirm the dimensions
    Dimension tables serve as the entry point and descriptive identification for business analysis, so they are called the "soul" of the data warehouse. How do we recognize dimension attributes in a pile of data? If a column is a description of a specific value, a piece of text or a constant, or a participant in a constraint or in row identification, it is usually a dimension attribute. The Data Warehouse Toolkit tells us to hold firmly to the granularity of the fact table, so that all possible dimensions can be distinguished, and to ensure that dimension tables contain no duplicate data; a dimension's primary key should be unique.
  4. Confirm the facts
    The fact table is used for measurement, and measures are basically expressed as numeric values. Each row in the fact table corresponds to one measurement, and the data in each row is at a specific level of detail, called the granularity. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same granularity, which prevents double-counting problems. Sometimes it is hard to tell whether a column is a fact attribute or a dimension attribute. Remember that the most useful facts are numeric and additive, so ask whether the column is a measure that takes many values and participates in calculations; if so, the column is usually a fact.

Among these steps, granularity is crucial: granularity determines what a row of the fact table represents. It is recommended to begin the design at atomic-level granularity, because atomic granularity can withstand unpredictable user queries and atomic data can be combined in every possible way, whereas once a coarser granularity is chosen, users can no longer drill down to the details. Facts are the core of the whole dimensional model: both the snowflake model and the star model extend a fact table through foreign keys to associated dimension tables, producing wide model tables that support foreseeable query requirements, and queries ultimately land on the fact table.
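The trade-off between atomic granularity and rollup summaries can be shown in a few lines. The atomic fact rows below are hypothetical:

```python
from collections import defaultdict

# Atomic-granularity facts: one illustrative row per order line.
atomic_facts = [
    {"date": "2024-06-01", "user": "u1", "amount": 10.0},
    {"date": "2024-06-01", "user": "u2", "amount": 4.0},
    {"date": "2024-06-02", "user": "u1", "amount": 6.0},
]

# Roll up to daily granularity for a known reporting requirement.
# The atomic table is kept, so unforeseen queries (e.g. a per-user
# drilldown) remain answerable; the daily summary only speeds up
# the predictable ones.
daily = defaultdict(float)
for row in atomic_facts:
    daily[row["date"]] += row["amount"]

print(dict(daily))  # {'2024-06-01': 14.0, '2024-06-02': 6.0}
```

Had we stored only the daily rollup, the per-user detail would be unrecoverable, which is exactly why the text recommends starting from atomic granularity.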

Data Warehouse Layering in Actual Business


The concrete implementation of the data layer

  1. Data source layer ODS

  2. Data warehouse detail layer DW
    Each row in a DW fact table corresponds to one measurement, and the data in each row is at a specific level of detail, called the granularity. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same granularity, which prevents double-counting problems. Dimension tables generally have a single primary key, occasionally a composite one. Take care that a dimension table contains no duplicate data, or the data will fan out when it is joined to the fact table. Sometimes it is hard to tell whether a column is a fact attribute or a dimension attribute. Remember that the most useful facts are numeric and additive, so ask whether the column is a measure that takes many values and participates in calculations; if so, it is usually a fact. If the column is a description of a specific value, a piece of text or a constant, or a participant in a constraint or in row identification, it is usually a dimension attribute. Even so, the final judgment of dimension versus fact must be made in combination with the business.
  3. Data light aggregation layer DM
  4. Data Application Layer APP
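The fact-versus-dimension heuristic above (numeric and additive suggests a fact; textual, constant, or descriptive suggests a dimension attribute) can be sketched as a rough first-pass classifier. As the text stresses, this is only a rule of thumb and the final call must combine business knowledge; the function and sample columns are hypothetical:

```python
def guess_column_role(values):
    """Rough heuristic only: numeric values that make sense to add up
    look like fact measures; text or constant values look like
    dimension attributes. The business makes the final judgment."""
    if values and all(isinstance(v, (int, float)) for v in values):
        return "fact (candidate measure)"
    return "dimension attribute (candidate)"

print(guess_column_role([10.5, 3.0, 7.25]))    # fact (candidate measure)
print(guess_column_role(["pen", "notebook"]))  # dimension attribute (candidate)
```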

Summary:

data governance

The real difficulty in building a data warehouse lies not in its design but in data governance once the business develops and the business lines expand: asset management, data quality monitoring, and the construction of a data indicator system. The scope of data governance is in fact very wide, covering the management of the data itself, data security, data quality, data cost, and more. In the DAMA Data Management Body of Knowledge guide, data governance sits at the center of the data management "wheel diagram", serving as the general outline of the data management disciplines and providing the overall guiding strategy for all data management activities.


The Way of Data Governance

  1. Data governance requires systematic construction.
    To maximize the value of data, three elements must be in place: a reasonable platform architecture, complete governance services, and systematic operating methods. Choose the platform architecture appropriate to the enterprise's size, industry, and data volume; governance services must run through the entire data life cycle to ensure the completeness, accuracy, consistency, and timeliness of data throughout collection, processing, sharing, storage, and application; operating methods should include the optimization of norms, organization, platform, and process.

  2. Data governance needs a solid foundation.
    Data governance must advance step by step, but three aspects deserve attention from the very start: data standards, data quality, and data security. Standardized model management is the prerequisite for data governance, high-quality data is the prerequisite for data availability, and data security control is the prerequisite for data sharing and exchange.

  3. Data governance requires IT empowerment.
    Data governance is not a pile of normative documents; the norms, processes, and standards produced during governance must be implemented on the IT platform. Carry out data governance proactively in the data production process, "beginning with the end in mind", to avoid the passivity and rising operations costs caused by after-the-fact audits.

  4. Data governance needs to focus on the data itself.
    The essence of data governance is managing data. It is therefore necessary to strengthen metadata management and master data management, govern data from the source, and complete the data's related attributes and information, such as metadata, quality, security, business logic, and lineage, managing data production, processing, and use in a metadata-driven way.

  5. Data governance requires the integration of construction and management.
    Keeping the data model, lineage, and task scheduling consistent is the key to integrating construction and management. It helps resolve the inconsistency between data management and data production and avoids the inefficient mode in which the two run as separate, disconnected tracks.

Data Governance in Practice

The scope of data governance is very wide, and the most important part is data quality governance. Data quality itself spans the entire life cycle of the warehouse, from data generation -> data access -> data storage -> data processing -> data output -> data presentation; every stage needs quality governance, and the evaluation dimensions include completeness, standardization, consistency, accuracy, uniqueness, and relevance.
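A sketch of how two of these evaluation dimensions, completeness and uniqueness, might be checked mechanically on a table. The function, rows, and required fields are hypothetical illustrations:

```python
def quality_report(rows, required_fields, key_field):
    """Toy data-quality check: completeness (no missing required
    fields) and uniqueness (no duplicate keys). Illustrative only."""
    incomplete = sum(
        1 for r in rows
        if any(r.get(f) in (None, "") for f in required_fields)
    )
    keys = [r.get(key_field) for r in rows]
    duplicates = len(keys) - len(set(keys))
    return {"rows": len(rows), "incomplete": incomplete,
            "duplicate_keys": duplicates}

rows = [
    {"id": 1, "name": "a"},
    {"id": 1, "name": "b"},  # duplicate key
    {"id": 2, "name": ""},   # incomplete row
]
print(quality_report(rows, ["id", "name"], "id"))
# {'rows': 3, 'incomplete': 1, 'duplicate_keys': 1}
```

Real quality platforms run checks like these as scheduled rules per table and alert when thresholds are breached, but each rule reduces to a check of this shape.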

Normative Governance

Specifications are the guarantee of data warehouse construction. To avoid duplicated indicators and unreliable data quality, build everything according to specifications, in the most detailed, practical way possible.

  • Word roots
    Word roots are the basis of dimension and indicator management, divided into common roots and proprietary roots, to improve the usability and consistency of naming.

    1. Common root: the smallest unit that describes a thing, e.g. transaction -> trade.
    2. Proprietary root: a conventional or industry-specific description, e.g. US dollar -> USD.
  • Table naming convention

  1. Table names and field names are separated by underscores (e.g. clienttype -> client_type).
  2. Each part uses lowercase English words; fields that are general-purpose must conform to the general field definitions.
  3. Table names and field names must start with a letter.
  4. Table names and field names must not exceed 64 characters.
  5. Prefer existing root keywords (roots are managed in the warehouse's standard configuration), and regularly review newly added names for irregularities.
  6. Non-standard abbreviations are prohibited in the custom part of a table name.
  7. Table name = type + business topic + subtopic + table meaning + storage format + update frequency + suffix
  • Indicator naming standard
    Combining the characteristics of indicators with the root management specification, indicators are named in a structured way.
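The table-naming rules above (lowercase words, underscore separators, letter-initial, at most 64 characters, pattern "type + business topic + subtopic + table meaning + storage format + update frequency") can be enforced mechanically. A small sketch; the concrete part values are placeholders, since the original formula does not spell them out here:

```python
import re

MAX_LEN = 64
# lowercase words separated by single underscores, starting with a letter
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def build_table_name(*parts):
    """Join naming-convention parts with underscores and validate the
    result against the rules above. Part values are illustrative."""
    name = "_".join(p.lower() for p in parts)
    if len(name) > MAX_LEN:
        raise ValueError(f"table name exceeds {MAX_LEN} characters: {name}")
    if not NAME_RE.match(name):
        raise ValueError(f"table name violates naming rules: {name}")
    return name

# type + business topic + subtopic + meaning + storage format + frequency
print(build_table_name("dwd", "trade", "order", "detail", "hive", "daily"))
# dwd_trade_order_detail_hive_daily
```

Wiring a validator like this into the table-creation workflow is one way to make the convention self-enforcing rather than a document people must remember.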

Architecture Governance

  1. Data Layering
    An excellent, reliable data warehouse system needs a clear, layered data structure: keep the data layers stable, shield downstream consumers from upstream changes, and avoid overly long processing chains. A typical layered structure is as follows:

  2. Data Flow
    Stable business lines develop along the standard data flow, ODS -> DWD -> DWA -> APP. Unstable business or exploratory requirements may follow ODS -> DWD -> APP or ODS -> DWD -> DWT -> APP. Once the rationality of the data links is ensured, the model-layer reference principles are confirmed on that basis:

    • Normal flow: ODS -> DWD -> DWT -> DWA -> APP. If the pattern ODS -> DWD -> DWA -> APP appears, the subject domain is not fully covered: the DWD data should be landed into DWT, and DWD -> DWA is allowed only for tables used very infrequently.
    • Try to avoid wide DWA tables that use both DWD tables and the DWT table of the subject domain those DWD tables belong to.
    • In principle, avoid generating DWT tables from other DWT tables in the same subject domain, or ETL efficiency will suffer.
    • Direct use of ODS tables is prohibited in DWT, DWA, and APP; ODS tables may only be referenced by DWD.
    • Reverse dependencies are prohibited, e.g. a DWT table depending on a DWA table.
  3. Metadata Governance
    Metadata can be divided into technical metadata and business metadata. Technical metadata is used by the IT staff who develop and manage the data warehouse; it describes the data involved in warehouse development, management, and maintenance, including data source information, data transformation descriptions, the warehouse model, data cleaning and update rules, data mappings, and access permissions.
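The layer-reference principles listed above lend themselves to an automated dependency check. A sketch using the layer names from the text, simplified to the hard prohibitions (no skipping ODS past DWD, no reverse dependencies); the function name and rule wording are illustrative:

```python
# Layer order per the standard flow ODS -> DWD -> DWT -> DWA -> APP.
LAYER_RANK = {"ODS": 0, "DWD": 1, "DWT": 2, "DWA": 3, "APP": 4}

def check_dependency(consumer_layer, source_layer):
    """Return the rules violated by one table-to-table reference.
    Simplified sketch of the reference principles listed above."""
    violations = []
    # ODS tables may only be referenced by DWD.
    if source_layer == "ODS" and consumer_layer != "DWD":
        violations.append("ODS tables may only be referenced by DWD")
    # Reverse dependencies (e.g. DWT depending on DWA) are prohibited.
    if LAYER_RANK[consumer_layer] < LAYER_RANK[source_layer]:
        violations.append("reverse dependency is prohibited")
    return violations

print(check_dependency("DWA", "ODS"))  # ['ODS tables may only be referenced by DWD']
print(check_dependency("DWT", "DWA"))  # ['reverse dependency is prohibited']
print(check_dependency("DWD", "ODS"))  # []
```

Hooked into the scheduling system's lineage graph, a check like this can flag violating ETL jobs before they are deployed, which is where such rules are easiest to enforce.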


Origin blog.csdn.net/u012655441/article/details/124377092