Data warehouse construction | ODS, DWD, DWM and other layers in theory and practice (recommended reading)

Contents of this article:
1. Data flow
2. Application examples
3. What is data warehouse DW
4. Why layering is required
5. Data layering
6. Data mart
7. Problem summary

Guided reading

When building a data warehouse, organizing and managing data requires not only a vertical division into subject domains by business but also a horizontal layering specification for the warehouse. This article analyzes the layers of an enterprise data warehouse, in the hope that it helps.

Because the article is long, this is not the final version; the full PDF version can be obtained at the end of the article.

Anyone who works on data warehouses knows that one of the first tasks of data warehouse model design is model layering, which shows how important layering is in the model design process. A well-designed layering scheme is a core element in the success of a data warehouse project, and making data easy to understand and highly reusable is the core goal of layering.

1. Data flow

2. Application example

 

3. What is Data Warehouse DW

The data warehouse (abbreviated DW or DWH) emerged once databases already existed in large numbers; it is a complete theoretical system that covers ETL, scheduling, and modeling. The purpose of building a data warehouse is to provide a foundation for front-end query and analysis. It is mainly used for OLAP (Online Analytical Processing): it supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results. Popular data warehouse products in the industry currently include AWS Redshift, Greenplum, and Hive. The data warehouse is not the final destination of the data but a preparation for it; this preparation includes cleaning, escaping, classification, reorganization, merging, splitting, and statistics.

1. Main Features

  • Subject-oriented

    • An operational database is organized around transaction processing tasks, while the data in a data warehouse is organized by subject area.

    • A subject is a key aspect that users care about when using the data warehouse to make decisions; one subject relates to multiple operational information systems.

  • Integrated

    • Source data must be processed, unified, and consolidated.

    • During processing, inconsistencies in the source data must be eliminated so that the information in the data warehouse is consistent, global information about the entire enterprise (including the relationships between data).

  • Non-volatile

    • The data in DW is not up-to-date, but comes from other data sources

    • The data warehouse is mainly to provide data for decision analysis, and the operations involved are mainly data query

  • Time-variant

    • The data in the data warehouse needs to be marked with time attributes

Comparison with a database

  • DW: Specifically designed for data analysis, which involves reading large amounts of data to understand relationships and trends between data

  • Database: used to capture and store data

4. Why layering is required

Questions involved in data warehousing:

  1. Why build a data warehouse?

  2. Why manage data quality?

  3. Why manage metadata?

  4. What is the role of each layer in the data warehouse layering?

In practice, we all hope that our data flows in an orderly manner and that designers and users can clearly see the entire life cycle of the data, as in the left figure below. In reality, however, the data we face is likely to be highly complex and chaotic: we may end up with a data system whose table dependencies are tangled and even circular, as in the right figure below.

To solve these problems, we need an effective method for organizing, managing, and processing data that makes the data system more orderly; this method is data layering. The benefits of data layering:

  • Clear data structure: each data layer has its own scope and responsibilities, which makes the data easier to understand, use, and maintain

  • Complex problem simplification: Decompose a complex task into multiple steps to complete step by step, each layer only solves a specific problem

  • Unified data definitions: data layering provides a single data outlet and consistent definitions for the metrics that are output

  • Reduce repetitive development: standardize data layering and develop a common middle layer, which can greatly reduce repetitive computing work
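
One way to keep table dependencies orderly is to check that every table reads only from layers below its own; under that rule the dependency graph cannot contain cycles. The following is a minimal sketch, not a feature of any particular tool: the layer order, the table-name prefixes, the sample tables, and the function `find_layer_violations` are all assumptions made for illustration.

```python
# Assumed layer order, lowest (closest to the source systems) first.
LAYER_ORDER = ["ods", "dwd", "dwm", "dws", "ads"]
LAYER_RANK = {name: rank for rank, name in enumerate(LAYER_ORDER)}


def layer_of(table):
    """Return the layer prefix of a table named like 'dwd_orders'."""
    prefix = table.split("_", 1)[0]
    if prefix not in LAYER_RANK:
        raise ValueError(f"unknown layer prefix in table name: {table}")
    return prefix


def find_layer_violations(dependencies):
    """Return (table, upstream) pairs where a table reads from its own or a higher layer."""
    violations = []
    for table, upstreams in dependencies.items():
        for upstream in upstreams:
            if LAYER_RANK[layer_of(upstream)] >= LAYER_RANK[layer_of(table)]:
                violations.append((table, upstream))
    return violations


# Hypothetical dependency maps: table -> tables it reads from.
orderly = {
    "dwd_orders": ["ods_orders"],
    "dwm_orders_daily": ["dwd_orders"],
    "ads_sales_report": ["dwm_orders_daily"],
}
chaotic = {
    "dwd_orders": ["ods_orders", "ads_sales_report"],  # reads from a higher layer
    "ads_sales_report": ["dwd_orders"],                # closes a circular dependency
}

print(find_layer_violations(orderly))  # []
print(find_layer_violations(chaotic))  # [('dwd_orders', 'ads_sales_report')]
```

Real projects often relax the rule, for example by allowing reads within the same layer, in which case a separate cycle check would be needed.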

5. Data layering

Each company can divide its layers differently according to its own business needs. A mature layering currently in common use is: data operation layer ODS, data warehouse layer DW, and data application layer ADS (APP).

1. Data operation layer ODS

Data operation layer: Operational Data Store, the data preparation area, also known as the source layer. Data from the data sources enters this layer after extraction, transformation (cleaning), and loading, that is, the ETL process. The main functions of this layer:

  • ODS is the staging area for the downstream data warehouse layers

  • Provide raw data for the DWD layer

  • Reduce the impact on business systems

When source data is loaded into this layer, a series of operations may be applied on ingestion, such as denoising (for example, a record in which a person's age is 300 is abnormal and needs to be handled in advance), deduplication (for example, a personal data table that contains two duplicate records for the same ID needs a deduplication step), and field naming standardization. However, because later data problems may need to be traced back to the source, doing too much cleaning at this layer is not recommended. It is also acceptable to ingest the original data as is, depending on the layering needs of the specific business. The data in this layer is the source for all subsequent data warehouse processing. Ways to source the data:

  • Business database

    • Sqoop is often used for extraction, for example once a day.

    • For real-time needs, consider using Canal to monitor MySQL's binlog and ingest changes in real time.

  • Tracking (event) logs

    • Logs are generally saved as files, and Flume can be used to synchronize them on a schedule

    • Spark Streaming or Flink can be used for real-time ingestion

    • Kafka also works

  • Message queue: data from ActiveMQ, Kafka, etc.
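
The light cleanup on ingestion described above (field naming, denoising, deduplication) can be sketched in a few lines. This is an illustration only: the source field names `ID` and `Age`, the age limit of 150, and the function `light_clean` are assumptions, and dropping abnormal rows is just one possible way to handle them.

```python
def light_clean(records, max_age=150):
    """Rename fields, drop implausible ages, and keep the first record per user_id."""
    seen_ids = set()
    cleaned = []
    for raw in records:
        # Field naming standardization: hypothetical source names -> snake_case.
        row = {"user_id": raw["ID"], "age": raw["Age"]}
        # Denoising: skip rows whose age is outside a plausible range.
        if not 0 <= row["age"] <= max_age:
            continue
        # Deduplication: keep only the first row seen for each user_id.
        if row["user_id"] in seen_ids:
            continue
        seen_ids.add(row["user_id"])
        cleaned.append(row)
    return cleaned


# Hypothetical rows as they might arrive from a source system.
source_rows = [
    {"ID": 1, "Age": 34},
    {"ID": 2, "Age": 300},  # abnormal age
    {"ID": 1, "Age": 34},   # duplicate ID
    {"ID": 3, "Age": 27},
]
print(light_clean(source_rows))
# [{'user_id': 1, 'age': 34}, {'user_id': 3, 'age': 27}]
```

Keeping the untouched source rows alongside the cleaned ones preserves the traceability that the ODS layer is meant to provide.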

2. Data warehouse layer DW

The data warehouse layer can be divided into three layers from top to bottom: data detail layer DWD, data middle layer DWM, and data service layer DWS.

1) Data detail layer DWD

Data detail layer: Data Warehouse Details, DWD (data cleaning/DWI). This layer is the isolation layer between the business layer and the data warehouse and keeps the same data granularity as the ODS layer. It mainly cleans and normalizes ODS data, for example by removing empty data, dirty data, and outliers. To improve the usability of the detail layer, this layer usually applies dimension degeneration: dimensions are folded into the fact table to reduce joins between the fact table and dimension tables.
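
Dimension degeneration can be pictured as copying frequently used dimension attributes onto each detail row, so that later queries do not need the join. The sketch below assumes hypothetical tables `ods_orders` and `dim_city` and a helper named `degenerate_dimensions`; it keeps one output row per input row, so the granularity stays the same as in ODS.

```python
def degenerate_dimensions(fact_rows, dim_rows, key, attributes):
    """Copy selected dimension attributes into each fact row, one output row per fact row."""
    dim_by_key = {dim[key]: dim for dim in dim_rows}
    widened = []
    for fact in fact_rows:
        dim = dim_by_key.get(fact[key], {})
        row = dict(fact)
        for attr in attributes:
            row[attr] = dim.get(attr)  # None when no matching dimension row exists
        widened.append(row)
    return widened


# Hypothetical detail rows and a small city dimension.
ods_orders = [
    {"order_id": 101, "city_id": 1, "amount": 30.0},
    {"order_id": 102, "city_id": 2, "amount": 12.5},
]
dim_city = [
    {"city_id": 1, "city_name": "Beijing"},
    {"city_id": 2, "city_name": "Shanghai"},
]

dwd_orders = degenerate_dimensions(ods_orders, dim_city, "city_id", ["city_name"])
print(dwd_orders[0])
# {'order_id': 101, 'city_id': 1, 'amount': 30.0, 'city_name': 'Beijing'}
```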

2) Data middle layer DWM

Data middle layer: Data Warehouse Middle, DWM. Based on the data of the DWD layer, this layer performs light aggregation to generate a series of intermediate result tables, which improves the reusability of common indicators and reduces repeated processing.

In short, data is aggregated along the common core dimensions to compute the corresponding statistical indicators.
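
As a rough illustration of such light aggregation, the sketch below groups detail rows by two core dimensions and computes an order count and a total amount. The column names, the sample rows, and the function `aggregate_daily` are assumptions made for this example.

```python
from collections import defaultdict


def aggregate_daily(detail_rows):
    """Aggregate detail rows by (order_date, city_name) into order count and total amount."""
    totals = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for row in detail_rows:
        bucket = totals[(row["order_date"], row["city_name"])]
        bucket["order_count"] += 1
        bucket["total_amount"] += row["amount"]
    return [
        {"order_date": order_date, "city_name": city_name, **metrics}
        for (order_date, city_name), metrics in sorted(totals.items())
    ]


# Hypothetical DWD detail rows.
dwd_rows = [
    {"order_date": "2024-01-01", "city_name": "Beijing", "amount": 30.0},
    {"order_date": "2024-01-01", "city_name": "Beijing", "amount": 20.0},
    {"order_date": "2024-01-01", "city_name": "Shanghai", "amount": 12.5},
    {"order_date": "2024-01-02", "city_name": "Beijing", "amount": 5.0},
]

dwm_orders_daily = aggregate_daily(dwd_rows)
for row in dwm_orders_daily:
    print(row)
```

Downstream tables can then reuse `dwm_orders_daily` instead of each recomputing the same counts and sums from the detail rows.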

3) Data service layer DWS

Data service layer: Data Warehouse Service, DWS (wide tables, user behavior, light aggregation). Based on the data in DWM, this layer integrates and aggregates data into a service layer for analyzing a particular subject area. Its tables are generally wide tables used for downstream business queries, OLAP analysis, data distribution, and so on. Generally speaking, this layer has relatively few tables, and each table covers more business content. Because these tables have many fields, they are generally called wide tables.

  • User behavior, lightly aggregated from DWD

  • Mainly do some light summarization of ODS/DWD layer data.
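
A wide table can be thought of as several per-key metric tables merged into one row per key. The following sketch assumes two hypothetical DWM tables keyed by `user_id` and a helper named `build_wide_table`; prefixing each column with its source table name is one simple way to avoid name clashes, not a fixed convention.

```python
def build_wide_table(key, **metric_tables):
    """Merge several per-key metric tables into one wide row per key."""
    wide = {}
    for table_name, rows in metric_tables.items():
        for row in rows:
            target = wide.setdefault(row[key], {key: row[key]})
            for column, value in row.items():
                if column != key:
                    # Prefix the column with its source table to avoid name clashes.
                    target[f"{table_name}_{column}"] = value
    return [wide[k] for k in sorted(wide)]


# Hypothetical DWM tables, both keyed by user_id.
dwm_orders = [{"user_id": 1, "order_count": 3}, {"user_id": 2, "order_count": 1}]
dwm_visits = [{"user_id": 1, "page_views": 40}]

dws_user_behavior = build_wide_table("user_id", orders=dwm_orders, visits=dwm_visits)
print(dws_user_behavior)
# [{'user_id': 1, 'orders_order_count': 3, 'visits_page_views': 40},
#  {'user_id': 2, 'orders_order_count': 1}]
```

In this sketch a user with no rows in one source table simply lacks those columns; a real wide table would normally fill them with nulls or defaults.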

3. Data application layer ADS

Data application layer: Application Data Service, ADS (APP/DAL/DF), holding report results. This layer mainly serves data products and data analysis. Its data is generally stored in systems such as ES, Redis, and PostgreSQL for use by online systems; it may also be stored in Hive or Druid for data analysis and data mining. For example, commonly used report data is stored here.

4. Fact Table

A fact table stores fact records, such as system logs or sales records. The records in a fact table grow continuously, as in an e-commerce order table, so a fact table is usually much larger than other tables.

5. Dimension table (DIM)

A dimension table, sometimes called a lookup table, is the counterpart of a fact table. It stores the attribute values of dimensions and can be joined to fact tables; in effect, attributes that appear frequently in fact tables are extracted, standardized, and managed in a single table. Dimension tables mainly come in two kinds:

  • High-cardinality dimension data: generally tables such as user tables and product tables, where the amount of data may reach tens or hundreds of millions of rows.

  • Low-cardinality dimension data: generally configuration tables, such as a table that maps enumeration values to their Chinese meanings, or a date dimension table; the amount of data may range from single digits to tens of thousands of rows.

6. Temporary table TMP

Each layer's calculations produce many temporary tables, so a dedicated DWTMP layer is set up to store the temporary tables of the data warehouse.

6. Data Mart

In a narrow sense, the data mart is the ADS layer; in a broad sense, it refers to the data that is synchronized from the DWD, DWS, and ADS layers on Hadoop to RDS. A data mart meets the needs of a specific department or group of users and stores data in a multidimensional way, including the definition of dimensions, the indicators to be calculated, and the hierarchy of dimensions, in order to generate data cubes for decision analysis. In terms of scope, its data is extracted from an enterprise-wide database, a data warehouse, or a more specialized data warehouse. The point of a data mart is that it caters to the specific needs of a professional user group in terms of analysis, content, performance, and ease of use. Users of a data mart expect data to be presented in terms they are familiar with.

Figure: Data warehousing structure with data marts
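
To make the idea of a data cube concrete, the sketch below pre-computes the sum of one indicator for every combination of the chosen dimensions, including the grand total. The dimensions, the sample sales rows, and the function `build_cube` are assumptions for illustration; real data marts would normally rely on an OLAP engine rather than hand-written code.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Sum `measure` for every combination of `dimensions`, including the grand total."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in dims)] += row[measure]
            cube[dims] = dict(totals)
    return cube


# Hypothetical sales rows with two dimensions and one indicator.
sales = [
    {"region": "north", "product": "tea", "amount": 10.0},
    {"region": "north", "product": "rice", "amount": 5.0},
    {"region": "south", "product": "tea", "amount": 7.0},
]

cube = build_cube(sales, ["region", "product"], "amount")
print(cube[()])            # {(): 22.0}
print(cube[("region",)])   # {('north',): 15.0, ('south',): 7.0}
print(cube[("product",)])  # {('tea',): 17.0, ('rice',): 5.0}
```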

Difference from a data warehouse

A data mart is a subset of the enterprise-level data warehouse; it mainly serves department-level business and covers only a specific subject. To resolve the tension between flexibility and performance, a data mart is added to the data warehouse architecture as a small, department-level or workgroup-level data warehouse. Data marts store pre-computed data for specific users to meet their performance needs, and they can relieve the bottleneck of accessing the data warehouse to a certain extent. In theory, an overall data warehouse should be designed first, and data marts built afterwards. In practice this is rarely done in China: teams generally start by building a data mart for a specific subject (such as an enterprise's customer information) and build the data warehouse later. The order in which data warehouses and data marts are built is closely tied to the design method; as an engineering discipline, data warehousing has no right or wrong here. In terms of data structure, a data warehouse is a subject-oriented, integrated collection of data, while a data mart is usually defined with a star or snowflake schema and generally consists of a fact table and several dimension tables.
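
A star structure with one fact table and several dimension tables can be tried out with Python's built-in `sqlite3` module. The schema, table names, and sample rows below are hypothetical and serve only to show the shape of a typical query against such a data mart.

```python
import sqlite3

# In-memory database holding a tiny, hypothetical star schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, segment TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id INTEGER REFERENCES dim_date(date_id),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'retail'), (2, 'wholesale');
    INSERT INTO dim_date VALUES (20240101, '2024-01'), (20240201, '2024-02');
    INSERT INTO fact_sales VALUES
        (1, 20240101, 10.0), (1, 20240201, 4.0), (2, 20240101, 25.0);
    """
)

# Join the fact table to its dimensions and aggregate by dimension attributes.
rows = conn.execute(
    """
    SELECT c.segment, d.month, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY c.segment, d.month
    ORDER BY c.segment, d.month
    """
).fetchall()
print(rows)
# [('retail', '2024-01', 10.0), ('retail', '2024-02', 4.0), ('wholesale', '2024-01', 25.0)]
```

A snowflake structure would further split the dimension tables, for example moving `segment` into its own table referenced from `dim_customer`.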

7. Problem summary

1. What is the difference between ODS and DWD?

Q: I still don't understand the difference between the ODS and DWD layers. It feels as if the DWD layer is useless once you have the ODS layer.

A: Ideally, if the data in the ODS layer is well-structured and can basically meet most of our needs, that is of course good, and the DWD layer is not really necessary. In reality, however, the quality of ODS data is hard to guarantee: the data comes from many sources, and each party that pushes data has its own push logic. In that case, we need the additional DWD layer to mask some of these low-level differences.

Q: I see. So DWD mainly cleans and normalizes the data of the ODS layer, and DWS mainly does some light summarization of the ODS layer data?

A: Yes, it can be roughly understood that way.

2. What does the APP layer do?

Q: I feel that the data mart layer has no place to go. Should each business's data mart tables live in DWD or in the app layer?

A: This question is not easy to answer. I think the main thing is to be clear about what the data mart layer is for. If your data mart layer holds wide tables that the business side can use directly, put them in the app layer. If the data mart you mean is a more general concept, then DWS, DWD, and app together all count as data mart content.

Q: Is the data stored in Redis and ES considered part of the app layer?

A: Yes. In my personal understanding, the app layer mainly stores relatively mature tables that the business side can use. These tables can stay in Hive or be imported from Hive into systems with better query performance, such as Redis or ES.

Data warehouse construction full version tutorial PDF document


Origin blog.csdn.net/helloHbulie/article/details/124147383