2023 - A Detailed Explanation of the Entire Data Warehouse Construction System

1. The basic concept of data warehouse

  1. The difference between data warehouse and database

  2. Data Warehouse Hierarchical Architecture

  3. Data Warehouse Metadata Management

2. Data Warehouse Modeling Method

  1. Paradigm Modeling

  2. Dimensional Modeling

  3. Entity Modeling

3. Dimensional Modeling

  1. Types of Tables in Dimensional Modeling

  2. Three modes of dimensional modeling

  3. Dimensional modeling process

4. Layering of data warehouses in actual business

  1. Data source layer ODS

  2. Data detail layer DW

  3. Data light aggregation layer DM

  4. Data Application Layer APP

Basic concepts of data warehouse

Data Warehouse Concepts:

The English name is Data Warehouse, abbreviated DW or DWH. The purpose of a data warehouse is to build an analysis-oriented, integrated data environment that provides decision support (Decision Support) for the enterprise. It was created for analytical reporting and decision support.

The data warehouse itself does not "produce" any data, nor does it "consume" any data: the data comes from outside and is opened up to external applications, which is why it is called a "warehouse" rather than a "factory".

Basic Features:

A data warehouse is a subject-oriented, integrated, non-volatile and time-varying collection of data to support management decisions.

  1. Subject-oriented:

The biggest feature of a traditional database is application-oriented data organization, and the various business systems may be separate from one another. A data warehouse, by contrast, is subject-oriented. A subject is an abstract concept: a higher-level abstraction for synthesizing, classifying, analyzing and using the data in enterprise information systems. Logically, it corresponds to the analysis objects of a particular macro analysis domain in the enterprise.

  2. Integration:

The data of the data warehouse is obtained by extracting, cleaning, converting and summarizing the scattered, independent and heterogeneous database data, which ensures the consistency of the data in the data warehouse with respect to the entire enterprise.

The comprehensive data in the data warehouse cannot be obtained directly from the original database system. Therefore, before the data enters the data warehouse, it must go through unification and synthesis. This step is the most critical and complicated step in the construction of the data warehouse. The tasks to be completed include:

  • All contradictions in the source data must be unified, such as fields with the same name but different meanings, different names with the same meaning, inconsistent units, inconsistent field lengths, and so on.

  • Data synthesis and calculation must be performed. Some of the synthesis work in the data warehouse can be done while data is being extracted from the original databases, but much of it is generated inside the data warehouse, that is, after the data has entered the warehouse.

The following diagram illustrates a simple process of composite data for an insurance company, where the data related to the topic "Insurance" in the data warehouse comes from several different operational systems. The naming of data within these systems may be different, and the data format may also be different. Before storing data from different sources in the data warehouse, these inconsistencies need to be removed.

(Figure: data warehouse theme)

  3. Non-volatile (non-updatable):

The data in the data warehouse reflects the content of historical data for a long period of time . It is a collection of database snapshots at different points in time, and the derived data based on these snapshots for statistics, synthesis and reorganization.

Data non-volatility is mainly from the application's point of view. Most operations that data warehouse users perform are queries or complex mining; once data enters the warehouse, it is generally retained for a long time. A data warehouse sees a large number of query operations but few modifications or deletions. Therefore, after data has been processed and integrated into the warehouse, it is rarely updated and usually only needs to be loaded and refreshed periodically.

  4. Time-varying:

A data warehouse contains historical data at various granularities. Data in a warehouse may relate to a particular day, week, month, quarter or year. The purpose of a data warehouse is to discover hidden patterns by analyzing the enterprise's business operations over a past period. Although users of the warehouse cannot modify the data, that does not mean its data never changes. The results of analysis can only reflect the past; when the business changes, the mined patterns lose their timeliness. Therefore the data in the warehouse must be updated to meet decision-making needs. From this perspective, data warehouse construction is a project, but it is also a process. The data of a data warehouse changes over time in the following respects:

(1) The data time limit of the data warehouse is generally much longer than that of the operational data.
(2) The operational system stores current data, while the data in the data warehouse is historical data.
(3) The data in the data warehouse are added in chronological order, and they all have time attributes.

1. The difference between a data warehouse and a database

The difference between a database and a data warehouse is really the difference between OLTP and OLAP.

Operational processing, called OLTP (On-Line Transaction Processing), can also be described as a transaction-oriented processing system. It consists of the daily operations performed on the database for specific business, usually querying or modifying a small number of records. Users care most about response time, data security, integrity, and the number of concurrently supported users. As the main means of data management, traditional database systems are mainly used for operational processing. Relational databases such as MySQL and Oracle generally belong to OLTP.

Analytical processing, called OLAP (On-Line Analytical Processing), generally analyzes historical data of certain subjects to support management decisions.

First of all, we must understand that the emergence of data warehouse is not to replace the database. Database is transaction-oriented design, data warehouse is subject-oriented design. Databases generally store business data, and data warehouses generally store historical data.

The database design is to avoid redundancy as much as possible. Generally, it is designed for a certain business application, such as a simple User table, which can record simple data such as user name and password, which is suitable for business applications, but not suitable for analysis. The design of the data warehouse intentionally introduces redundancy, and is designed according to the analysis requirements, analysis dimensions, and analysis indicators .

Databases are designed for capturing data, data warehouses are designed for analyzing data .

Take banking, for example. The database is the data platform of the transaction system. Every transaction made by the customer in the bank will be written into the database and recorded. Here, it can be simply understood as using the database to keep accounts. The data warehouse is the data platform of the analysis system. It obtains data from the transaction system, summarizes and processes it, and provides decision-makers with a basis for decision-making. For example, how many transactions occur in a certain branch of a bank in a month, and what is the current deposit balance of the branch. If there are more deposits and more consumer transactions, then it is necessary to set up an ATM in the area.

Obviously, a bank's transaction volume is huge, usually counted in the millions or even tens of millions. The transaction system is real-time and requires timeliness: if depositing a sum of money took a customer tens of seconds, it would be unbearable. This requires the database to store data only for a short period. The analysis system is after-the-fact and must provide all valid data within the time period of interest. That data is massive and the summary calculations are slower, but as long as effective analysis data can be provided, the goal is achieved.
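To make the OLTP/OLAP contrast concrete, here is a small sketch (not from the article; the schema and values are invented) using Python's built-in sqlite3: the same table answers a point query in OLTP style and a historical aggregate in OLAP style.

```python
import sqlite3

# In-memory database standing in for a bank's transaction store
# (illustrative schema, not taken from any real system).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        txn_id INTEGER PRIMARY KEY,
        branch TEXT,
        account TEXT,
        amount REAL,
        txn_date TEXT
    )
""")
rows = [
    (1, "North", "A001", 500.0, "2021-03-01"),
    (2, "North", "A002", -120.0, "2021-03-02"),
    (3, "South", "A003", 300.0, "2021-03-02"),
    (4, "North", "A001", 250.0, "2021-03-15"),
]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?, ?)", rows)

# OLTP-style: touch a single record, optimized for response time.
balance_change = conn.execute(
    "SELECT amount FROM transactions WHERE txn_id = ?", (1,)
).fetchone()[0]

# OLAP-style: scan and aggregate history to support a decision
# ("how many transactions per branch this month, and for how much?").
per_branch = conn.execute("""
    SELECT branch, COUNT(*) AS txn_count, SUM(amount) AS net_amount
    FROM transactions
    WHERE txn_date BETWEEN '2021-03-01' AND '2021-03-31'
    GROUP BY branch
    ORDER BY branch
""").fetchall()

print(balance_change)   # 500.0
print(per_branch)       # [('North', 3, 630.0), ('South', 1, 300.0)]
```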

Data warehouses arose when large numbers of databases already existed, in order to mine data resources further and support decision-making. A data warehouse is by no means merely a so-called "large database".

2. Data warehouse layered architecture

According to the flow of data in and out, the data warehouse architecture can be divided into: source data, data warehouse, and data application.

(Figure: data warehouse layered architecture)

The data in the data warehouse comes from different source data and provides various data applications. The data flows into the data warehouse from bottom to top and then opens up applications to the upper layer. The data warehouse is only a platform for integrated data management in the middle.

Source data : There is no change in the data of this layer, and the data structure and data of the peripheral system are directly used, and it is not open to the outside world; it is a temporary storage layer, which is a temporary storage area for interface data, and prepares for the next step of data processing.

Data warehouse : also known as the detail layer, the data in the DW layer should be consistent, accurate, and clean data, that is, the data after cleaning (removing impurities) from the source system data.

Data application : the data source directly read by the front-end application; the data generated according to the report and thematic analysis requirements.

The data warehouse acquires data from various data sources, and the conversion and flow of data within the warehouse can be regarded as the process of ETL (Extract, Transform, Load). ETL is the pipeline of the data warehouse; it can also be regarded as the warehouse's blood, maintaining the metabolism of its data. Most of the effort in the daily management and maintenance of a data warehouse goes into keeping ETL running normally and stably.
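The ETL process described above can be sketched in a few lines of Python. All field names here (uid, amount_fen, and so on) are hypothetical; this only shows the extract/transform/load split, not a production pipeline.

```python
# A minimal ETL sketch (hypothetical field names): extract raw rows,
# transform them into a unified format, load them into the warehouse layer.

def extract(source_rows):
    """Extract: pull raw records from a source system as-is."""
    return list(source_rows)

def transform(rows):
    """Transform: unify naming, units and formats (the cleaning step)."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "user_id": str(row["uid"]).strip(),                 # unify naming
            "amount_cny": round(row["amount_fen"] / 100.0, 2),  # fen -> yuan
        })
    return cleaned

def load(warehouse, rows):
    """Load: append cleaned records into the target table."""
    warehouse.extend(rows)
    return warehouse

source = [{"uid": " 1001 ", "amount_fen": 2599}, {"uid": "1002", "amount_fen": 100}]
dw_table = []
load(dw_table, transform(extract(source)))
print(dw_table)
# [{'user_id': '1001', 'amount_cny': 25.99}, {'user_id': '1002', 'amount_cny': 1.0}]
```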

So why layer a data warehouse?

  • Trade space for time: a large amount of preprocessing improves the user experience (efficiency) of the application system, so there will be a great deal of redundant data in the data warehouse. Without layering, a change in the business rules of a source system would affect the whole system, and the data cleaning process would require a huge amount of rework.

  • Layered management simplifies the data cleaning process, because work originally done in one step is divided into multiple steps, which is equivalent to splitting one complex job into several simple ones, turning a large black box into a white box. The processing logic of each layer is relatively simple and easy to understand, making it easier to guarantee the correctness of every step; when a data error occurs, we often only need to adjust one particular step.

3. Data warehouse metadata management

Metadata (Meta Data) mainly records the definitions of the models in the data warehouse and the mapping relationships between levels, and monitors the state of the warehouse's data and the running status of ETL tasks. Metadata is generally stored and managed uniformly through a metadata repository, and its main purpose is to coordinate and keep consistent the design, deployment, operation and management of the data warehouse.

Metadata is an important part of the data warehouse management system. Metadata management is a key component in the enterprise data warehouse. It runs through the entire process of data warehouse construction and directly affects the construction, use and maintenance of the data warehouse.

  • One of the main steps in building a data warehouse is ETL. At this time, metadata will play an important role. It defines the mapping from the source data system to the data warehouse, the rules of data conversion, the logical structure of the data warehouse, the rules of data update, the history of data import, and the loading cycle. It is through metadata that experts in data extraction and transformation as well as data warehouse administrators efficiently build data warehouses.

  • When using the data warehouse, users access data through metadata, clarify the meaning of data items and customize reports.

  • As a data warehouse grows in size and complexity, it cannot do without proper metadata management, which includes adding or removing external data sources, changing data cleaning methods, controlling erroneous queries, and scheduling backups.

Metadata can be divided into technical metadata and business metadata. Technical metadata is used by the IT staff who develop and manage the data warehouse; it describes the data involved in warehouse development, management and maintenance, including data source information, data transformation descriptions, data warehouse models, data cleaning and update rules, data mappings, access rights, and so on. Business metadata serves management and business analysts; it describes the data from a business perspective, including business terms, what data the warehouse contains, where the data is located and its availability, helping business users better understand what data is available in the warehouse and how to use it.
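As a rough illustration of the technical/business split, here is a toy metadata repository in Python. The table name, fields and rules are all invented for the example.

```python
# A toy metadata repository sketch (all names hypothetical), separating
# technical metadata from business metadata as described above.

metadata_repository = {
    "dw.fact_orders": {
        "technical": {
            "source": "oltp.orders",                     # data source info
            "transform": "amount_fen / 100 -> amount",   # conversion rule
            "refresh": "daily 02:00",                    # loading cycle
        },
        "business": {
            "term": "Order fact",
            "description": "One row per completed customer order",
            "owner": "sales-analytics",
        },
    }
}

def describe(table):
    """Answer a business user's question: what is this data?"""
    meta = metadata_repository[table]["business"]
    return f'{meta["term"]}: {meta["description"]} (owner: {meta["owner"]})'

print(describe("dw.fact_orders"))
```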

From the above we can see that metadata not only defines the schema, sources, and extraction and transformation rules of the data in the warehouse, but is also the basis on which the entire data warehouse system runs; metadata connects the loose components of the system into an organic whole.

Data Warehouse Modeling Method

There are many modeling methods for data warehouses, and each represents a philosophical viewpoint, a way of abstracting and summarizing the world. Common methods include paradigm modeling, dimensional modeling, and entity modeling, and each looks at business problems from a different perspective.

1. Paradigm Modeling

Paradigm modeling is the method we commonly use when building data models. It is mainly advocated by Inmon and mainly addresses, at the technical level, how data is stored in relational databases. At present, most modeling in relational databases uses third-normal-form modeling.

A normal form is a collection of relational schemas conforming to a certain level. Constructing a database must follow certain rules, and in a relational database, such rules are paradigms, and this process is also called normalization. There are currently six normal forms of relational databases: first normal form (1NF), second normal form (2NF), third normal form (3NF), Boyce-Codd normal form (BCNF), fourth normal form (4NF) and fifth normal form (5NF) .

In the model design of the data warehouse, the third normal form is generally adopted. A relation in third normal form must have the following three conditions:

  • Each attribute value is unique and has no ambiguity;

  • Each non-primary attribute must be fully dependent on the entire primary key, not a part of the primary key;

  • Each non-primary attribute cannot depend on attributes in other relations, because in this case, this attribute should be attributed to other relations.
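A small sketch of the third condition (no dependencies on attributes of other relations), with hypothetical order data: customer_city depends on customer_id rather than on the order key, so normalization moves it into its own relation.

```python
# Sketch of removing a transitive dependency to reach 3NF (hypothetical
# data): in the flat table, customer_city is determined by customer_id,
# not by order_id, so it belongs in a separate customers relation.

flat_orders = [
    {"order_id": 1, "customer_id": "C1", "customer_city": "Beijing", "amount": 100},
    {"order_id": 2, "customer_id": "C1", "customer_city": "Beijing", "amount": 80},
    {"order_id": 3, "customer_id": "C2", "customer_city": "Shanghai", "amount": 50},
]

# 3NF decomposition: each non-key attribute depends only on its own key.
customers = {}
orders = []
for row in flat_orders:
    customers[row["customer_id"]] = {"customer_city": row["customer_city"]}
    orders.append({"order_id": row["order_id"],
                   "customer_id": row["customer_id"],
                   "amount": row["amount"]})

print(customers)
# {'C1': {'customer_city': 'Beijing'}, 'C2': {'customer_city': 'Shanghai'}}
print(len(orders))  # 3
```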

(Figure: paradigm modeling)

According to Inmon's point of view, the construction method of the data warehouse model is similar to the enterprise data model of the business system. In the business system, the enterprise data model determines the source of data, and the enterprise data model is also divided into two levels, namely the subject domain model and the logic model. Similarly, the subject domain model can be regarded as the conceptual model of the business model, while the logical model is the instantiation of the domain model on the relational database.

2. Dimensional modeling method

The dimensional model was advocated by Ralph Kimball, another master in the data warehouse field, whose The Data Warehouse Toolkit is the most popular modeling classic in data warehouse engineering. Dimensional modeling builds the model based on the needs of analysis and decision-making, so the resulting data model serves analysis needs: it focuses on answering users' analysis questions faster, and delivers better response performance for large, complex queries.

(Figure: dimensional modeling)

Typical representatives are the well-known star schema, and the snowflake schema used in some special scenarios.

The more important concepts in dimensional modeling are fact table (Fact table) and dimension table (Dimension table). The simplest description is to build data warehouses and data marts based on fact tables and dimension tables.

At present, the most commonly used modeling method in Internet companies is dimensional modeling, which will be explained later.

3. Entity Modeling

Entity modeling is not a common method in data warehouse modeling; it derives from a school of philosophy. From a philosophical point of view, the objective world can be subdivided into entities and the relationships between them. We can bring this abstraction into the modeling of a data warehouse: divide the whole business into individual entities, and the relationships between entities, together with descriptions of those relationships, are the work our data modeling needs to accomplish.

Although entity modeling may seem somewhat abstract at first glance, it is actually quite easy to understand: any business process can be divided into three parts, entities, events, and descriptions, as shown in the following figure:

(Figure: entity modeling)

The figure above is an abstraction. Suppose we describe a simple fact: "Xiao Ming drives to school". Taking this business fact as an example, "Xiao Ming" and "school" can each be regarded as an entity; "going to school" describes a business process and can be abstracted as a specific "event"; and "drives" can be regarded as a description of the event "going to school".

Dimensional modeling

Dimensional modeling is currently in wide use; it is a modeling method intended specifically for analytical databases, data warehouses, and data marts. A data mart can be understood as a "small data warehouse".

1. Types of tables in dimensional modeling

1. Fact table

Operational events occur in the real world, and the measurable values they produce are stored in the fact table. At the lowest level of granularity, each fact table row corresponds to one measurement event, and vice versa.

Fact tables represent measures on the subject of analysis . For example, we can understand a purchase behavior as a fact.

(Figure: facts and dimensions)

The order table in the figure is a fact table; you can understand it as an operational event occurring in reality: each time an order is completed, a record is added to it. Characteristics of a fact table: the table stores little descriptive content itself, being largely a collection of keys, with each ID corresponding to a record in a dimension table. The fact table contains foreign keys referencing each dimension table and can be joined to them. Fact table measures are usually numeric, and the number of records keeps increasing, so the table's data volume grows rapidly.

Detail table (wide table):

In a fact table, several attributes sometimes together make up one field, for example year, month, day, hour, minute and second combined into a time. When statistics must be grouped by one of those attributes, string slicing and splicing are required, which is extremely inefficient. For example:

local_time
2021-03-18 06:31:42

For the convenience of analysis, a field in the fact table can be split apart to form new fields. Because the result has more fields, it is called a wide table, and the original becomes a narrow table.

Expanding the above local_time field yields the following 6 fields:

year month day hour m s
2021 03 18 06 31 42

And because the information in the wide table is clearer and more detailed, it can also be called a detail table.
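The widening step above can be sketched with Python's standard datetime module; the function name widen is invented for the example.

```python
from datetime import datetime

# Turn the narrow local_time field into the six wide-table columns shown
# above, so grouping by year/month/day etc. needs no string slicing.

def widen(local_time):
    t = datetime.strptime(local_time, "%Y-%m-%d %H:%M:%S")
    return {"year": t.year, "month": t.month, "day": t.day,
            "hour": t.hour, "m": t.minute, "s": t.second}

print(widen("2021-03-18 06:31:42"))
# {'year': 2021, 'month': 3, 'day': 18, 'hour': 6, 'm': 31, 's': 42}
```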

2. Dimension table

Each dimension table contains a single primary key column. The primary key of a dimension table can serve as a foreign key in any fact table associated with it, and the descriptive context of a dimension row should correspond exactly to the fact table rows. Dimension tables are usually wide, flat, denormalized tables containing a large number of low-granularity text attributes.

A dimension represents an angle from which you analyze the data. For example, when analyzing product sales you can choose to analyze by category or by region; each such angle constitutes a dimension. The user table, business table and time table in the fact table diagram are all dimension tables. Each has a unique primary key and stores detailed descriptive information.

In general, there is no need to strictly follow normalization principles in a data warehouse, because its primary function is analysis-oriented querying and it does not involve data updates. A fact table is designed around correctly recording historical information, and a dimension table is designed around being able to aggregate subject content from the appropriate angles.

2. Three modes of dimensional modeling

1. Star Schema

The star schema (Star Schema) is the most commonly used dimensional modeling method. It is centered on the fact table, with all dimension tables connected directly to it, like a star. A star-schema model consists of one fact table and a set of dimension tables, and has the following characteristics:

  • Dimension tables are associated only with the fact table; there are no associations between dimension tables;

  • The primary key of each dimension table is a single column, and it is placed in the fact table as a foreign key connecting the two;

  • With the fact table as the core, the dimension tables are distributed around it in a star shape.
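A minimal star-schema sketch using sqlite3 (table names and values invented for illustration): the fact table holds only foreign keys and a measure, each dimension joins directly to it, and an analytical query groups by a dimension attribute.

```python
import sqlite3

# Star schema sketch: a central fact table of keys and measures,
# joined directly to its dimension tables (no dim-to-dim joins).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_orders (
        order_id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES dim_user(user_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount REAL
    );
    INSERT INTO dim_user VALUES (1, 'Xiao Ming'), (2, 'Xiao Hong');
    INSERT INTO dim_product VALUES (10, 'Books'), (11, 'Food');
    INSERT INTO fact_orders VALUES (100, 1, 10, 25.0), (101, 1, 11, 8.0),
                                   (102, 2, 10, 30.0);
""")

# Analytical query: aggregate the measure by a dimension attribute.
result = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(result)  # [('Books', 55.0), ('Food', 8.0)]
```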

2. Snowflake Mode

The snowflake schema (Snowflake Schema) is an extension of the star schema: a dimension table in a snowflake schema can itself have further dimension tables. Although this model is more normalized than the star schema, it is harder to understand and its maintenance cost is relatively high; in terms of performance, the multiple levels of dimension tables must be joined, so performance is also lower than the star schema. It is therefore not very common.

(Figure: snowflake schema)

3. Constellation pattern

The constellation schema is an extension of the star schema. The star schema is based on one fact table, while the constellation schema is based on multiple fact tables that share dimension information. The two modeling methods introduced above have multiple dimension tables corresponding to a single fact table, but in many cases there is more than one fact table in the dimension space, and one dimension table may be used by multiple fact tables. In the later stages of business development, most dimensional models adopt the constellation schema.

(Figure: constellation schema)

3. Dimensional modeling process

We know that the table types in dimensional modeling include fact tables and dimension tables, and the schemas include the star, snowflake and constellation models. But in actual business we are handed a large pile of data; how do we use it to build a data warehouse? The author of The Data Warehouse Toolkit has summarized the following four steps for us from decades of practical business experience. Please remember them!

The four steps of dimensional modeling in The Data Warehouse Toolkit:

(Figure: the four steps of dimensional modeling)

Please keep these four steps in mind. Whatever the business, follow them in order; they interlock, each building on the previous one. The following breaks down in detail how to do each step.

1. Select business process
Dimensional modeling is closely tied to the business, so modeling must be based on the business. Selecting a business process, as the name implies, means choosing which business we need to model out of the entire business flow, based on the requirements provided by operations and on future scalability. For example, a mall's overall flow is divided into the merchant side, the user side and the platform side. If the operational requirements are the total order amount, the number of orders and users' purchase behavior, then when selecting the business process we consider the data on the user side. Business selection is very important, because all subsequent steps are based on this business's data.

2. Declare granularity
Let me give an example. A user has an ID number, a registered home address, multiple mobile phone numbers and multiple bank cards. The mobile phone number granularity and the bank card granularity are finer than the user granularity, while the ID number and home address have a one-to-one relationship with the user, i.e. the same granularity. Why emphasize the same granularity? Because dimensional modeling requires that a single fact table have one uniform granularity: do not mix multiple different granularities in the same fact table, and create separate fact tables for data of different granularities. When obtaining data from a given business process, it is strongly recommended to start the design from atomic granularity, that is, from the finest granularity, because atomic granularity can withstand unpredictable user queries. Rolled-up summary granularity, however, is very important for improving query performance. So for data with clear requirements we build rolled-up summary granularity tailored to those requirements, and for data with unclear requirements we build atomic granularity.
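A tiny sketch of the atomic-versus-rollup idea, using invented click data: the atomic rows keep the finest grain (one row per event), and a daily rollup is derived from them for faster queries.

```python
from collections import defaultdict

# Atomic granularity: one row per click event (invented data).
atomic_facts = [
    {"day": "2021-03-18", "user": "u1", "clicks": 1},
    {"day": "2021-03-18", "user": "u2", "clicks": 1},
    {"day": "2021-03-19", "user": "u1", "clicks": 1},
]

# Rollup summary granularity: derived from the atomic rows, coarser
# and query-friendly; the atomic table is kept for unexpected queries.
daily_rollup = defaultdict(int)
for row in atomic_facts:
    daily_rollup[row["day"]] += row["clicks"]

print(dict(daily_rollup))  # {'2021-03-18': 2, '2021-03-19': 1}
```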

3. Confirm dimensions
Dimension tables are the entry points and descriptive identifiers for business analysis, so they are also called the "soul" of the data warehouse. How do we identify which attributes in a pile of data are dimension attributes? If a column is a description of something specific, is text or a constant, or participates in constraints and row identification, it is usually a dimension attribute. The Data Warehouse Toolkit tells us to firmly grasp the granularity of the fact table so that all possible dimensions can be distinguished, and to ensure there is no duplicate data in the dimension table; the dimension's primary key should be unique.

4. Confirm the fact
The fact table is used for measurement and is basically represented by quantitative values. Each row in the fact table corresponds to one measurement, and each row's data is at a specific level of detail, called the granularity. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same granularity, which ensures metrics are not double-counted. Sometimes it is hard to tell whether a column is a fact attribute or a dimension attribute. Remember that the most useful facts are numeric and additive. So check whether the column is a measure that takes many values and participates in calculations: in that case, the column is usually a fact.

Data Warehouse Layering in Actual Business

The layering of the data warehouse must be done in conjunction with the company's business, and the responsibilities of each layer must be clearly defined, ensuring the stability of the data layers and shielding downstream from change. The following layered structure is generally adopted:

(Figure: data hierarchy)

Implementation of the data layer

Four diagrams illustrate the specific implementation of each layer:

  • Data source layer ODS

(Figure: data source layer)

The data source layer mainly imports various business data to the big data platform as a snapshot storage of business data.

  • Data detail layer DW

(Figure: data detail layer)

Each row in the fact table corresponds to one measurement, and each row's data is at a specific level of detail, called the granularity. Remember: all measures in the same fact table must have the same granularity.

Dimension tables generally have a single primary key, and a few are joint primary keys. Be careful not to have duplicate data in the dimension table, otherwise data divergence will occur when it is associated with the fact table .

Sometimes it is hard to tell whether a column is a fact attribute or a dimension attribute. Remember that the most useful facts are numeric and additive. So check whether the column is a measure that takes many values and participates in calculations: in that case it is usually a fact. If the column is a description of something specific, is text or a constant, or participates in constraints and row identification, then it is usually a dimension attribute. Even so, the final judgment of dimension versus fact still has to be made in combination with the business.

  • Data light aggregation layer DM

(Figure: data light aggregation layer)

This layer is called the light summary layer, meaning that summarization of the data begins here, but it is not a complete summary: data of the same granularity is joined and summarized, and related data of different granularities can also be summarized, in which case the granularity must first be unified through aggregation and similar operations.
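A sketch of light aggregation with invented data: hourly page views are first rolled up to the daily grain so they can be combined with order counts that are already daily.

```python
from collections import defaultdict

# Light-aggregation sketch (illustrative field names): unify granularity
# first, then summarize at the shared grain.

hourly_pv = [("2021-03-18 06", 10), ("2021-03-18 07", 15), ("2021-03-19 06", 5)]
daily_orders = {"2021-03-18": 3, "2021-03-19": 1}

# Step 1: unify granularity, hourly page views become daily page views.
daily_pv = defaultdict(int)
for hour, pv in hourly_pv:
    daily_pv[hour[:10]] += pv  # "2021-03-18 06" -> "2021-03-18"

# Step 2: summarize at the shared (daily) grain.
summary = {day: {"pv": daily_pv[day], "orders": daily_orders[day]}
           for day in daily_orders}
print(summary)
# {'2021-03-18': {'pv': 25, 'orders': 3}, '2021-03-19': {'pv': 5, 'orders': 1}}
```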

  • Data Application Layer APP

(Figure: data application layer)

The tables in the data application layer are provided to users, and with them the construction of the data warehouse comes to an end. Next, data is fetched in different ways according to different needs, for example displaying reports directly, supplying data to colleagues doing data analysis, or supporting other business needs.

Finally

Technology serves business, business creates value for the company, and technology without business is meaningless. Therefore, the construction of the data warehouse is closely related to the business. The business of the company is different, and the construction of the data warehouse is also different. Only the suitable one is the best.


Origin: blog.csdn.net/ytp552200ytp/article/details/130685944