Data warehouse construction | ODS, DWD, DWM, and other layers in theory and practice

Contents of this article:

1. Data flow
2. Application examples
3. What is a data warehouse DW
4. Why layering
5. Data layering
6. Data mart
7. Problem summary

Introduction

In building a data warehouse, organizing and managing data requires not only dividing subject areas vertically by business, but also standardizing the horizontal layering of the warehouse. This article analyzes the layering of enterprise data warehouses, in the hope that it helps you.

Anyone who works on data warehouses knows that one of the first tasks in data warehouse model design is model layering, which shows how important layering is in the design process. Indeed, a well-designed layering scheme is a core element of a successful data warehouse project. The core goal of layering is to make data easy to understand and highly reusable.

3. What is a data warehouse (DW)

A data warehouse (abbreviated DW or DWH) is a complete theoretical system, covering ETL, scheduling, and modeling, that is built once a large number of databases already exist. The purpose of building a data warehouse is to serve as the basis for front-end query and analysis. It is mainly used for OLAP (Online Analytical Processing): it supports complex analytical operations, focuses on decision support, and provides intuitive, easy-to-understand query results. Products currently popular in the industry include AWS Redshift, Greenplum, and Hive. The data warehouse is not the final destination of the data; it prepares the data for its final destination. This preparation includes cleaning, conversion, classification, reorganization, merging, splitting, and statistics.

1. Main features

  • Subject-oriented
    • An operational database is organized around transaction processing tasks, while the data in a data warehouse is organized by subject area.
    • A subject is a key aspect that users care about when using the data warehouse to make decisions. A subject usually relates to multiple operational information systems.
  • Integrated
    • Source data must be processed and integrated, unified and consolidated.
    • During processing, inconsistencies in the source data must be eliminated so that the information in the data warehouse is consistent, global information about the entire enterprise (including the relationships between data).
  • Non-volatile
    • The data in the DW is not the latest operational data; it comes from other data sources.
    • The data warehouse mainly provides data for decision analysis, and the operations involved are mainly queries.
  • Time-variant
    • The data in the data warehouse must carry time attributes to support decision-making.

Comparison with a database

  • DW: designed specifically for data analysis, which involves reading large amounts of data to understand the relationships and trends in it
  • Database: used to capture and store data

4. Why layering

Questions that come up around a data warehouse:

  1. Why build a data warehouse?
  2. Why manage data quality?
  3. Why manage metadata?
  4. What is the role of each layer in data warehouse layering?

In practice, we all hope that data flows in an orderly way, and that designers and users clearly understand its entire life cycle. In reality, however, the data we face is likely to be highly complex, with chaotic layering. We may end up with a data system whose table dependencies are tangled and even circular.

To solve these problems, we need an effective method of organizing, managing, and processing data that makes the data system more orderly. That method is data layering. The benefits of data layering:

  • Clear data structure: each data layer has its own role and responsibilities, which makes the data easier to understand when it is used and maintained.
  • Simplification of complex problems: a complex task is broken into multiple steps, and each layer solves only one specific problem.
  • Consistent data definitions: layering provides a single data outlet and a consistent definition (caliber) for the figures that are output.
  • Less duplicated development: standardized layering and a shared middle layer greatly reduce repeated computation.
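
As a rough illustration of the difference between an orderly layered system and one with circular table dependencies, the sketch below checks a dependency map with Python's standard `graphlib` module. The table names and the `build_order` helper are hypothetical and exist only for this example; real warehouses usually get this information from a scheduler or metadata system.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical tables; each table maps to the set of tables it reads from.
dependencies = {
    "ods_orders": set(),
    "dwd_orders": {"ods_orders"},
    "dwm_orders_daily": {"dwd_orders"},
    "dws_user_wide": {"dwm_orders_daily"},
    "ads_sales_report": {"dws_user_wide"},
}


def build_order(deps):
    """Return a valid build order, or None if the dependencies contain a cycle."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError:
        return None


# Cleanly layered tables can always be ordered from ODS up to ADS.
layered_order = build_order(dependencies)

# A detail table that reads from a report table creates a circular dependency.
chaotic = dict(dependencies)
chaotic["dwd_orders"] = {"ods_orders", "ads_sales_report"}
chaotic_order = build_order(chaotic)
```

When every table reads only from lower layers, a build order always exists; as soon as a lower layer reads from a higher one, no order can be found.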

5. Data Layering

Each company can divide its data into different layers according to its own business needs. A relatively mature layering scheme today is: the data operation layer ODS, the data warehouse layer DW, and the data application layer ADS (APP).

1. Data operation layer ODS

Data operation layer: Operational Data Store, the data preparation area, also known as the source-aligned layer. Data from the data sources enters this layer after extraction, cleaning, and transfer, that is, the ETL process. The main functions of this layer:

  • ODS is the preparation area for the downstream data warehouse layers
  • Provides raw data for the DWD layer
  • Reduces the impact on business systems

When source data is loaded into this layer, a series of operations may be applied on ingestion, such as denoising (for example, a record in which a person's age is 300 is abnormal and needs some handling in advance), deduplication (for example, two duplicate records with the same ID in a personal data table need to be deduplicated on ingestion), and field naming standardization. However, to keep the data traceable in the future, it is not recommended to do too much cleaning in this layer. It is also fine to ingest the original data intact, depending on how the business defines its layers. The data in this layer is the source for all subsequent data warehouse processing. Data sources include:

  • Business databases
    • Sqoop is often used for extraction, for example once a day.
    • For real-time needs, consider using Canal to monitor the MySQL binlog and ingest changes in real time.
  • Tracking (event) logs
    • Logs are generally saved as files, and Flume can be used to synchronize them on a schedule.
    • Spark Streaming or Flink can be used for real-time ingestion.
    • Kafka is also an option.
  • Message queues: data from ActiveMQ, Kafka, etc.
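
The light ingestion-time handling described for the ODS layer (deduplication by ID, field naming standardization, and dealing with an implausible age) might look roughly like the sketch below. The rows, the `FIELD_NAMES` mapping, the `MAX_AGE` threshold, and the `load_to_ods` helper are all assumptions made for illustration. Abnormal values are flagged rather than dropped, in keeping with the advice not to over-clean at this layer.

```python
# Hypothetical raw rows as they might arrive from a source table.
raw_rows = [
    {"ID": 1, "userName": "Ann", "Age": 31},
    {"ID": 1, "userName": "Ann", "Age": 31},   # duplicate record for the same ID
    {"ID": 2, "userName": "Bob", "Age": 300},  # implausible age
    {"ID": 3, "userName": "Cid", "Age": 45},
]

# Assumed naming standard and plausibility threshold.
FIELD_NAMES = {"ID": "id", "userName": "user_name", "Age": "age"}
MAX_AGE = 120


def load_to_ods(rows):
    """Rename fields, keep the first record per id, and flag implausible ages."""
    seen_ids = set()
    ods_rows = []
    for row in rows:
        renamed = {FIELD_NAMES[key]: value for key, value in row.items()}
        if renamed["id"] in seen_ids:
            continue  # deduplicate on ingestion
        seen_ids.add(renamed["id"])
        # Flag instead of deleting, so the original value stays traceable.
        renamed["age_is_valid"] = 0 <= renamed["age"] <= MAX_AGE
        ods_rows.append(renamed)
    return ods_rows


ods_rows = load_to_ods(raw_rows)
```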

2. Data warehouse layer DW

The data warehouse layer can be divided into three layers from top to bottom:

  • Data detail layer DWD
  • Data middle layer DWM
  • Data service layer DWS

1) Data detail layer DWD

Data detail layer: Data Warehouse Detail, DWD (data cleaning; also called DWI). This layer is the isolation layer between the business layer and the data warehouse, and it keeps the same data granularity as the ODS layer. It mainly cleans and standardizes the ODS data, for example by removing empty values, dirty data, and outliers. To make the detail layer easier to use, this layer usually applies dimension degeneration: dimension attributes are folded into the fact table, which reduces the joins between the fact table and dimension tables.
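
A minimal sketch of what a DWD step could do, assuming hypothetical `ods_orders` rows and a small `dim_city` lookup: rows with empty or dirty amounts are removed, and the city name is folded into each detail row so that later queries do not need to join the dimension table. The granularity stays at one row per order.

```python
# Hypothetical ODS rows and a small dimension lookup.
ods_orders = [
    {"order_id": 101, "user_id": 1, "city_id": 10, "amount": 25.0},
    {"order_id": 102, "user_id": 2, "city_id": 20, "amount": None},  # empty value
    {"order_id": 103, "user_id": 1, "city_id": 10, "amount": -5.0},  # dirty value
    {"order_id": 104, "user_id": 3, "city_id": 20, "amount": 40.0},
]
dim_city = {10: "Beijing", 20: "Shanghai"}


def build_dwd(orders, cities):
    """Clean ODS rows and degenerate the city dimension into each detail row."""
    dwd_rows = []
    for order in orders:
        if order["amount"] is None or order["amount"] < 0:
            continue  # drop empty and dirty records
        row = dict(order)
        # Dimension degeneration: copy the attribute into the fact row.
        row["city_name"] = cities.get(order["city_id"], "unknown")
        dwd_rows.append(row)
    return dwd_rows


dwd_orders = build_dwd(ods_orders, dim_city)
```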

2) Data middle layer DWM

Data middle layer: Data Warehouse Middle, DWM. Building on the DWD data, this layer performs light aggregation, generates a series of intermediate result tables, improves the reusability of common metrics, and reduces repeated processing.

Put simply, it aggregates over the common core dimensions and calculates the corresponding statistical metrics.
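
As a hedged illustration of light aggregation over common core dimensions, the sketch below groups hypothetical detail rows by date and city and computes an order count and an order amount. The row layout and the `build_dwm` helper are assumptions for this example only.

```python
from collections import defaultdict

# Hypothetical DWD detail rows.
dwd_orders = [
    {"order_date": "2023-07-01", "city_name": "Beijing", "amount": 25.0},
    {"order_date": "2023-07-01", "city_name": "Beijing", "amount": 15.0},
    {"order_date": "2023-07-01", "city_name": "Shanghai", "amount": 40.0},
    {"order_date": "2023-07-02", "city_name": "Beijing", "amount": 10.0},
]


def build_dwm(rows):
    """Aggregate detail rows by (order_date, city_name) into an intermediate table."""
    totals = defaultdict(lambda: {"order_count": 0, "order_amount": 0.0})
    for row in rows:
        key = (row["order_date"], row["city_name"])
        totals[key]["order_count"] += 1
        totals[key]["order_amount"] += row["amount"]
    return dict(totals)


dwm_orders_daily = build_dwm(dwd_orders)
```

Downstream tables can then reuse `dwm_orders_daily` instead of each recomputing the same counts and sums from the detail rows.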

3) Data service layer DWS

Data service layer: Data Warehouse Service, DWS (wide tables, user behavior, light aggregation). Building on the base data in DWM, this layer integrates and summarizes data into a service layer for analyzing a particular subject area. It generally consists of wide tables and supports downstream business queries, OLAP analysis, data distribution, and so on. Generally speaking, this layer has relatively few tables; each table covers more business content, and because it has many fields, it is usually called a wide table.

  • User behavior, lightly aggregated from DWD
  • Mainly light summarization of ODS/DWD layer data
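
One way to picture a wide table is as a merge of several intermediate results into a single row per subject key. The sketch below assumes two hypothetical per-user intermediate tables and combines them into one wide row per user, filling metrics a user lacks with defaults. The names and default values are illustrative assumptions.

```python
# Hypothetical per-user intermediate results from the middle layer.
dwm_user_orders = {
    1: {"order_count": 2, "order_amount": 40.0},
    2: {"order_count": 1, "order_amount": 40.0},
}
dwm_user_visits = {
    1: {"visit_count": 7},
    3: {"visit_count": 2},
}

# Assumed defaults for metrics a user has no data for.
WIDE_DEFAULTS = {"order_count": 0, "order_amount": 0.0, "visit_count": 0}


def build_dws_user_wide(*sources):
    """Merge several per-user metric tables into one wide row per user."""
    wide = {}
    for source in sources:
        for user_id, metrics in source.items():
            row = wide.setdefault(user_id, {"user_id": user_id, **WIDE_DEFAULTS})
            row.update(metrics)
    return wide


dws_user_wide = build_dws_user_wide(dwm_user_orders, dwm_user_visits)
```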

3. Data application layer ADS

Data application layer: Application Data Service, ADS (APP/DAL/DF), which holds report results. This layer mainly serves data products and data analysis. Its data is generally stored in systems such as ES, Redis, and PostgreSQL for online systems to use; it may also be stored in Hive or Druid for data analysis and data mining. Commonly used report data, for example, lives here.

4. Fact table

A fact table stores fact records, such as system logs and sales records. Its records grow continuously, as with the order table of an e-commerce site, so a fact table is usually much larger than other tables.

5. Dimension table (DIM)

A dimension table, sometimes called a lookup table, is the counterpart of the fact table. It stores the attribute values of a dimension and can be joined to the fact table. In effect, attributes that repeat frequently in the fact table are extracted, standardized, and managed in a single table. Dimension tables mainly fall into two kinds:

  • High-cardinality dimension data: generally tables such as a user table or a product table, whose data volume may reach tens of millions or hundreds of millions of rows.
  • Low-cardinality dimension data: generally configuration tables, such as the Chinese labels for enumerated fields, or a date dimension table; the data volume may range from single digits to tens of thousands of rows.
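
To make the relationship between the two kinds of table concrete, the sketch below takes hypothetical sales records whose product attributes repeat on every row, extracts those attributes into a small dimension table, and leaves the fact rows holding only a key and a measure. The record layout, the generated `product_key`, and the helper name are assumptions for illustration.

```python
# Hypothetical sales records with product attributes repeated on every row.
sales_records = [
    {"sale_id": 1, "product_name": "Pen", "category": "Stationery", "quantity": 3},
    {"sale_id": 2, "product_name": "Pen", "category": "Stationery", "quantity": 1},
    {"sale_id": 3, "product_name": "Mug", "category": "Kitchen", "quantity": 2},
]


def split_fact_and_dimension(records):
    """Extract repeated product attributes into a dimension table keyed by product_key."""
    dim_product = {}   # product_key -> attribute values
    key_by_attrs = {}  # (product_name, category) -> product_key
    fact_sales = []
    for record in records:
        attrs = (record["product_name"], record["category"])
        if attrs not in key_by_attrs:
            product_key = len(key_by_attrs) + 1
            key_by_attrs[attrs] = product_key
            dim_product[product_key] = {"product_name": attrs[0], "category": attrs[1]}
        fact_sales.append({
            "sale_id": record["sale_id"],
            "product_key": key_by_attrs[attrs],  # reference into the dimension table
            "quantity": record["quantity"],
        })
    return fact_sales, dim_product


fact_sales, dim_product = split_fact_and_dimension(sales_records)
```

The fact table keeps growing with every sale, while the dimension table grows only when a new product appears.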

6. Temporary table TMP

The calculations in each layer produce many temporary tables, so a dedicated DWTMP layer is set up to store the data warehouse's temporary tables.

6. Data Mart

In a narrow sense, the data mart is the ADS layer. In a broad sense, it refers to the data that is synchronized from the DWD, DWS, and ADS layers in Hadoop to RDS. A data mart serves the needs of a specific department or group of users and stores data in a multidimensional way, which includes defining dimensions, the metrics to be calculated, and dimension hierarchies, and generating data cubes oriented to decision analysis. In scope, its data is drawn from enterprise-wide databases, the data warehouse, or more specialized data warehouses. The point of a data mart is that it caters to the particular needs of a professional user group in terms of analysis, content, presentation, and ease of use. Data mart users expect data to be expressed in terms they are familiar with.

[Figure: data warehouse structure with data marts]

Differences from the data warehouse

A data mart is a subset of the enterprise data warehouse. It mainly serves department-level business and is oriented to a single subject. To resolve the tension between flexibility and performance, the data mart is added to the data warehouse architecture as a small, department-level or workgroup-level data warehouse. It stores data precomputed for specific users to meet their performance requirements, and it can relieve the bottleneck of accessing the data warehouse to some extent.

In theory, the overall data warehouse should come first and the data marts second. In practice, this is rarely done in China. Teams there generally start with a data mart on a specific subject (such as an enterprise's customer information) and build the data warehouse afterwards. The order in which the data warehouse and the data marts are built is closely tied to the design method. Data warehousing is an engineering discipline, so neither order is right or wrong.

In terms of data structure, a data warehouse is a subject-oriented, integrated collection of data. A data mart is usually defined as a star schema or a snowflake schema, and generally consists of one fact table and several dimension tables.
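
The star structure mentioned above, one fact table surrounded by several dimension tables, can be sketched with Python's built-in `sqlite3` module. The table names, columns, and sample rows are hypothetical; a real data mart would live in the warehouse or in a serving database rather than in an in-memory SQLite database.

```python
import sqlite3

# In-memory database used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, month TEXT);
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    amount REAL
);
INSERT INTO dim_customer VALUES (1, 'North'), (2, 'South');
INSERT INTO dim_date VALUES (20230701, '2023-07'), (20230801, '2023-08');
INSERT INTO fact_sales VALUES
    (1, 20230701, 100.0),
    (2, 20230701, 50.0),
    (1, 20230801, 70.0);
""")

# A typical star-schema query: join the fact table to its dimensions,
# then aggregate the measure by dimension attributes.
rows = conn.execute("""
SELECT c.region, d.month, SUM(f.amount)
FROM fact_sales AS f
JOIN dim_customer AS c ON c.customer_key = f.customer_key
JOIN dim_date AS d ON d.date_key = f.date_key
GROUP BY c.region, d.month
ORDER BY c.region, d.month
""").fetchall()
conn.close()
```

A snowflake schema would differ only in that a dimension such as `dim_customer` is itself split into further normalized tables.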

7. Problem summary

1. What is the difference between ODS and DWD?

Question: I still don't quite understand the ODS and DWD layers. Once there is an ODS layer, DWD seems unnecessary to me.

Answer: Ideally, if the ODS layer is very regular and can meet most of our needs, that is of course good, and in that case the DWD layer really is not necessary. In reality, however, the ODS layer usually needs an additional DWD layer on top of it to mask some underlying differences.

Question: I think I understand. Does that mean DWD mainly performs data cleaning and normalization on the ODS data, and DWS mainly performs light summarization of the ODS data?

Answer: Yes, that is roughly the right way to understand it.

2. What does the APP layer do?

Question: I feel that there is no place to put the DWS layer. Should the DWS tables go in DWD or in APP?

Answer: This question is not easy to answer. I think the main thing is to clarify what the DWS layer does. If your DWS layer contains wide tables that the business side can use directly, you can put them in the APP layer. If the data mart you mentioned is a more general concept, then DWS, DWD, and APP can all be considered part of the data mart.

Question: Is the data stored in Redis and ES considered part of the APP layer?

Answer: Yes. In my personal understanding, the APP layer mainly stores relatively mature tables that the business side can use. These tables can live in Hive, or be imported from Hive into Redis, ES, or another system with better query performance.

Origin blog.csdn.net/xljlckjolksl/article/details/131609258