About Data Warehouse Hierarchical Design

Foreword:

I don't work on the data warehouse itself, but I still need to understand how data warehouses work.

In practice, the layering varies from person to person. When I asked colleagues about the differences between the layers, the answers were not very clear.

So when I get the chance, I'd like to sit down with colleagues from the data warehouse team~

1. Explanation of various terms

1.1 What is ODS?

  • The ODS layer is the easiest to understand. Basically, data is pulled from the source tables and run through ETL. For example, MySQL tables are mapped into Hive, and those Hive tables form the ODS layer.
  • ODS stands for Operational Data Store. It is "subject-oriented" and is also called the data operation layer. It is the layer closest to the data in the data source: source data is extracted, cleaned, and transferred (the well-known ETL process) and then loaded into this layer. Data at this layer is generally organized the same way the source business systems classify it. However, it is not identical to the raw data. A series of operations is applied when the source data is loaded, such as denoising (for example, a record that gives a person's age as 300 is abnormal and needs to be handled in advance), deduplication (for example, two duplicate records under the same ID in a personal data table), and standardizing field names.
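
The cleaning steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular ETL tool works: the column names, the `FIELD_RENAMES` mapping, and the `MAX_PLAUSIBLE_AGE` threshold are all assumptions made up for the example.

```python
# Hypothetical mapping from source column names to a standardized convention.
FIELD_RENAMES = {"ID": "user_id", "Age": "age", "userName": "user_name"}
MAX_PLAUSIBLE_AGE = 120  # assumed threshold; an age of 300 fails this check


def clean_for_ods(rows):
    """Standardize field names, blank out implausible ages, and deduplicate by ID."""
    seen_ids = set()
    cleaned = []
    for row in rows:
        # Field naming standardization.
        record = {FIELD_RENAMES.get(key, key): value for key, value in row.items()}
        # Denoising: keep the record but null out the abnormal value.
        age = record.get("age")
        if age is not None and not 0 <= age <= MAX_PLAUSIBLE_AGE:
            record["age"] = None
        # Deduplication: keep only the first record seen for each ID.
        if record["user_id"] in seen_ids:
            continue
        seen_ids.add(record["user_id"])
        cleaned.append(record)
    return cleaned


source_rows = [
    {"ID": 1, "Age": 300, "userName": "alice"},
    {"ID": 2, "Age": 28, "userName": "bob"},
    {"ID": 2, "Age": 28, "userName": "bob"},
]
ods_rows = clean_for_ods(source_rows)
print(ods_rows)
```

Whether an abnormal value should be nulled, clamped, or cause the whole record to be dropped is a business decision; nulling is just the choice made in this sketch.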

1.2 Data warehouse layer DW?

The data warehouse layer (DW) is the main body of the data warehouse. Here, data from the ODS layer is built into various data models organized by subject. This layer is closely tied to dimensional modeling.

Segmentation:

  1. Data detail layer: DWD (Data Warehouse Detail)
  2. Data middle layer: DWM (Data Warehouse Middle)
  3. Data service layer: DWS (Data Warehouse Service)

1.2.1 DWD detailed layer?

Detail layer (ODS, Operational Data Store, DWD: data warehouse detail)

  • Concept: This is the detailed data layer of the data warehouse. It consolidates the data of the STAGE layer and reduces the complexity of extraction. The ODS/DWD information model is organized mainly around the way the enterprise processes business transactions, and it brings the data of each business domain together in one place. The granularity of this layer is the same as that of the STAGE layer, and it is a shared resource for analysis.
  • Data generation method: Some data comes directly from Kafka, and some is synthesized from interface layer data and historical data.
  • Note: I am still not very clear on what this STAGE layer is.

1.2.2 DWM light summary layer (MID or DWB, data warehouse basis)

  • Concept: The light summary layer is a transitional layer between the DWD layer and the DM layer. It performs light aggregation and summary statistics on the production data of the DWD layer (this can include complex cleaning and processing, such as generating session data from PV logs). The main difference between the light summary layer and DWD is their area of application. DWD data comes from production systems and is accumulated to cover needs that cannot be foreseen, while the light summary layer holds fine-grained statistics accumulated for analytical applications.
  • Data generation method: Light summary tables are generated from the detail layer according to specific business requirements. Detail layer data that needs complex cleaning or MapReduce (MR) processing is also processed first and then loaded into the light summary layer.
  • Log storage method: internal table, parquet file format.
  • Log deletion method: long-term storage.
  • Table schema: Tables are generally partitioned by day. For data with no time concept, the partition field is chosen according to the specific business.
  • Database and table naming: The database name is dwb. The table name format is tentatively dwb + date + business table name, to be determined.
  • Old data update method: direct overwrite.
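
As an illustration of the "session data from PV logs" processing mentioned above, here is a minimal Python sketch. It assumes each PV record is a `(user_id, timestamp)` pair and that a session ends after 30 minutes of inactivity; both the record shape and the `SESSION_GAP` cutoff are assumptions for the example, not something this article specifies.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff


def build_sessions(page_views):
    """Group (user_id, timestamp) PV records into per-user sessions."""
    views_by_user = defaultdict(list)
    for user_id, ts in page_views:
        views_by_user[user_id].append(ts)

    sessions = []
    for user_id, timestamps in views_by_user.items():
        timestamps.sort()
        current = {"user_id": user_id, "start": timestamps[0],
                   "end": timestamps[0], "pv_count": 0}
        for ts in timestamps:
            if ts - current["end"] > SESSION_GAP:
                # Gap too long: close the current session and open a new one.
                sessions.append(current)
                current = {"user_id": user_id, "start": ts, "end": ts, "pv_count": 0}
            current["end"] = ts
            current["pv_count"] += 1
        sessions.append(current)
    return sessions


pv_log = [
    ("u1", datetime(2021, 6, 30, 9, 0)),
    ("u1", datetime(2021, 6, 30, 9, 10)),
    ("u2", datetime(2021, 6, 30, 9, 5)),
    ("u1", datetime(2021, 6, 30, 10, 30)),
]
sessions = build_sessions(pv_log)
```

In a real warehouse this step would more likely be a Hive or Spark job over a day partition; the plain-Python version only shows the logic.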

1.2.3 DWS subject layer (DM, data mart; or DWS, data warehouse service)

  • Concept: Also known as the data mart or wide table layer. Data is divided by business area, such as traffic, orders, and users, and a wide table with many fields is generated for each area to serve subsequent business queries, OLAP analysis, data distribution, and so on.
  • Data generation method: Generated by computing over the light summary layer and detail layer data.
  • Log storage method: Impala internal tables, Parquet file format.
  • Log deletion method: long-term storage.
  • Table schema: Tables are generally partitioned by day. For data with no time concept, the partition field is chosen according to the specific business.
  • Database and table naming: The database name is dm. The table name format is tentatively dm + date + business table name, to be determined.
  • Old data update method: direct overwrite.
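
The combination of day partitions and direct overwrite means that rerunning a day's job replaces that day's data instead of appending to it. The sketch below models this with a plain dictionary standing in for a partitioned table; the `write_partition` helper and the `dm_orders` table are hypothetical names used only for illustration.

```python
from datetime import date


def write_partition(table, dt, rows):
    """Overwrite the partition for day `dt`; a rerun replaces the old rows."""
    table[dt.isoformat()] = list(rows)


dm_orders = {}  # hypothetical in-memory stand-in: partition value -> rows

write_partition(dm_orders, date(2021, 6, 30), [{"user_id": 1, "order_cnt": 2}])
# A rerun for the same day replaces the old data rather than appending to it.
write_partition(dm_orders, date(2021, 6, 30), [{"user_id": 1, "order_cnt": 3}])
write_partition(dm_orders, date(2021, 7, 1), [{"user_id": 2, "order_cnt": 1}])
```

Because each run only touches its own partition, a failed or corrected job can be rerun for one day without affecting the others.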

1.3 APP?

Data product layer (APP): this layer provides the result data used by data products.

It mainly serves data products and data analysis. The data is generally stored in ES, MySQL, and similar systems for use by online systems, and may also be stored in Hive or Druid for data analysis and data mining.

The report data we often talk about, and the large wide tables, generally live here.

Application layer (App)

  • Concept: The application layer is driven by business needs. It holds the results computed from the previous three layers, which can be queried and displayed directly or imported into MySQL for use.
  • Data generation method: Generated from the detail layer, the light summary layer, and the data mart layer. As a rule, the data should come mainly from the mart layer.
  • Log storage method: Impala internal tables, Parquet file format.
  • Log deletion method: long-term storage.
  • Table schema: Tables are generally partitioned by day. For data with no time concept, the partition field is chosen according to the specific business.
  • Database and table naming: The database name is tentatively apl. Depending on the business, it does not have to be limited to a single database. (In practice, ours are named app_.)
  • Old data update method: direct overwrite.

[Figure: from the article "[Talk about Data Warehouse] How to Design Data Hierarchy Elegantly"]

1.4 Source of data

There are two main sources of data:

Business databases: Sqoop is often used to extract data from them.

Our own business databases are received through databus, so we only need to consume the data from Kafka.

For real-time needs, you can consider using Canal to listen to MySQL's binlog and ingest changes in real time. (I'll write up Canal when I get the chance.)

Tracking logs: online systems write all kinds of logs, which are generally saved as files. We can use Flume to extract them periodically, or use Spark Streaming or Storm to ingest them in real time. Kafka, of course, also plays a key role here.

We also use Filebeat to collect logs, send them to Kafka, and then process them.

Note: This layer should not be plain data ingestion. Some data cleaning should be considered here as well, such as handling abnormal fields, standardizing field names, and unifying time fields. These steps are easy to overlook, but they matter, and they become very useful later when we generate features automatically.
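
As one example of the cleaning mentioned in the note, unifying time fields could look roughly like the sketch below. The list of source formats in `KNOWN_FORMATS`, the target format, and the decision to treat numeric values as Unix epoch seconds in UTC are all assumptions for illustration; real sources will differ.

```python
from datetime import datetime, timezone

# Assumed source formats; a real pipeline would maintain its own list.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S", "%Y%m%d%H%M%S")
TARGET_FORMAT = "%Y-%m-%d %H:%M:%S"


def normalize_time(value):
    """Return the timestamp as 'YYYY-MM-DD HH:MM:SS', or None if it cannot be parsed."""
    if isinstance(value, (int, float)):
        # Treat numbers as Unix epoch seconds in UTC (an assumption).
        return datetime.fromtimestamp(value, tz=timezone.utc).strftime(TARGET_FORMAT)
    if not isinstance(value, str):
        return None
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt).strftime(TARGET_FORMAT)
        except ValueError:
            continue  # try the next known format
    return None


raw_times = ["2021-06-30 08:15:00", "2021/06/30 08:15:00", "20210630081500", "not a time"]
normalized = [normalize_time(value) for value in raw_times]
```

Returning `None` for unparseable values keeps the abnormal rows visible downstream instead of silently dropping them.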

1.5 ODS, DW → App layer

There are mainly two types:

  1. Daily scheduled tasks: For example, our typical daily jobs compute the previous day's data every morning so that the reports can be read in the morning. Such tasks are usually computed with Hive, Spark, or hand-written MR programs, and the final results are written to Hive, HBase, MySQL, ES, or Redis.
  2. Real-time data: This part is mainly used by real-time systems, such as our real-time recommendation and real-time user portrait systems. We generally compute it with Spark Streaming, Storm, or Flink, and the results finally land in ES, HBase, or Redis.

1.6 Dimension table layer DIM?

Dimension table layer (Dimension)
Finally, we add a dimension table layer, which mainly contains two kinds of data:

High-cardinality dimension data: generally tables such as the user table and the product table. The data volume may reach tens of millions or hundreds of millions of rows.

Low-cardinality dimension data: generally configuration tables, such as the Chinese meaning corresponding to an enumeration value, or a date dimension table. The data volume may range from single digits to tens of thousands of rows.
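
A date dimension table is a typical low-cardinality dimension that can simply be generated. The following is a minimal sketch; the column names (`date_key`, `day_of_week`, `is_weekend`, and so on) are assumptions chosen for the example, and a real date dimension usually carries many more attributes.

```python
from datetime import date, timedelta


def build_date_dim(start, end):
    """Generate one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": day.strftime("%Y%m%d"),
            "date": day.isoformat(),
            "year": day.year,
            "month": day.month,
            "day_of_week": day.isoweekday(),  # 1 = Monday ... 7 = Sunday
            "is_weekend": day.isoweekday() >= 6,
        })
        day += timedelta(days=1)
    return rows


date_dim = build_date_dim(date(2021, 6, 28), date(2021, 7, 4))
```

Even ten years of days is only a few thousand rows, which is why this kind of table sits comfortably in the low-cardinality group.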

1.7 Simple Layered Diagram of Hierarchy

See the figure below. When DWD-layer data is processed further, the result belongs to the DWM layer (MID layer). (Our data warehouse still has a lot in the dwm layer.)

[Figure: from the article "[Talk about Data Warehouse] How to Design Data Hierarchy Elegantly"]

Here is an explanation of the roles of DWS, DWD, DIM, and TMP.

  • DWS: The light summary layer. It makes a preliminary summary of user behavior from the ODS layer, abstracts some common dimensions (time, ip, id), and computes statistics along those dimensions, such as the number of products a user purchased in each time period from different login IPs. Doing a light summary here makes later calculations more efficient: computing behavior over only 7, 30, or 90 days on top of it is much faster. We hope that 80% of business calculations can be served from our DWS layer rather than from ODS.
  • DWD: This layer mainly solves data quality and data completeness problems. For example, a user's profile information comes from many different tables, and there are often problems such as delays and data loss. To make the data easier for downstream users to work with, we can shield them from these issues at this layer (for example, by consolidating multiple tables).
  • DIM: This layer is relatively simple. As an example, information such as country code and country name, geographical location, Chinese name, and national flag image is stored in the DIM layer.
  • TMP: Each layer produces many temporary tables during its calculations, so a dedicated DWTMP layer is set up to hold the temporary tables of our data warehouse.
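
To see why a daily light summary makes the 7-day, 30-day, and 90-day calculations cheap, consider the sketch below. It assumes a hypothetical DWS table keyed by `(user_id, day)` that stores the number of products purchased that day; the window totals then only sum a handful of pre-aggregated rows instead of rescanning detail records.

```python
from datetime import date, timedelta


def window_total(daily_summary, user_id, end_day, days):
    """Sum a user's daily purchase counts over the `days` days ending on end_day."""
    total = 0
    for offset in range(days):
        day = end_day - timedelta(days=offset)
        total += daily_summary.get((user_id, day), 0)
    return total


# Hypothetical DWS rows: (user_id, day) -> number of products purchased that day.
dws_daily = {
    (1, date(2021, 6, 30)): 2,
    (1, date(2021, 6, 25)): 1,
    (1, date(2021, 6, 1)): 4,
    (1, date(2021, 4, 10)): 3,
}

end = date(2021, 6, 30)
windows = {days: window_total(dws_daily, 1, end, days) for days in (7, 30, 90)}
print(windows)
```

The same daily rows serve all three windows, which is the kind of reuse the 80% goal above is aiming for.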

2. Questions

2.1 DWS and DWD?

Question 1: The relationship between dws and dwd

Q: Are dws and dwd parallel rather than sequential?

Answer: Parallel. Both belong to the dw layer.

Q: But for the same data, aren't the two processes actually serial?

Answer: dws does summarization, while dwd has the same granularity as ods. There is no dependency between the two layers.

Question: I see. So the summaries in dws are built on data that has not been processed for quality and completeness, or is that quality-related processing done separately? Why not summarize on top of dwd? What I'm really asking is whether the lightly summarized data in dws has gone through any data quality processing.

Answer: Data goes directly from ods to dws without passing through dwd. Let me give an example. I make a light summary of your browsing behavior and put it directly in dws. Your profile table, however, has to be assembled from many tables: we build one complete profile table from four or five personal information tables and put it in dwd. Then, at the app layer, suppose we need to produce a portrait table containing user information and the user's behavior over the past year. We take the user information directly from dwd, run another layer of statistics on top of dws, and combine the two into an app table. Of course, this is not absolute. Whether dws depends on dwd comes down to whether there is such a need.
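
The flow in this answer can be sketched as three small steps: ods to dwd (merge profile tables), ods to dws (light summary of browsing), and dwd plus dws to app (portrait table). Everything here is a toy illustration: the table contents, field names, and the rule that earlier profile tables take precedence are assumptions, not part of the original discussion.

```python
def merge_profiles(*profile_tables):
    """DWD step: merge several per-user profile tables into one record per user.

    Earlier tables win; later tables only fill fields that are missing or None.
    """
    merged = {}
    for table in profile_tables:
        for user_id, fields in table.items():
            record = merged.setdefault(user_id, {})
            for key, value in fields.items():
                if record.get(key) is None:
                    record[key] = value
    return merged


def summarize_browsing(ods_page_views):
    """DWS step: light summary of browsing behavior, straight from ODS rows."""
    counts = {}
    for user_id, _url in ods_page_views:
        counts[user_id] = counts.get(user_id, 0) + 1
    return counts


def build_portrait(dwd_profiles, dws_browse_counts):
    """APP step: one portrait row per user, combining DWD and DWS."""
    return {
        user_id: {**profile, "pv_count": dws_browse_counts.get(user_id, 0)}
        for user_id, profile in dwd_profiles.items()
    }


# Hypothetical ODS inputs.
ods_basic = {1: {"name": "alice", "city": None}}
ods_contact = {1: {"city": "Hangzhou", "phone": "n/a"}, 2: {"city": "Beijing"}}
ods_page_views = [(1, "/home"), (1, "/item/42"), (2, "/home")]

dwd = merge_profiles(ods_basic, ods_contact)
dws = summarize_browsing(ods_page_views)
app_portrait = build_portrait(dwd, dws)
```

Note that `summarize_browsing` never reads from `dwd`, which mirrors the point that dws and dwd are parallel and only meet at the app layer.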

2.2 What is the difference between ODS and DWD?

Question: I still don't quite understand the difference between the ods and dwd layers. With the ods layer in place, dwd feels useless to me.

Answer: Here is how I understand it. Ideally, if the data in the ods layer is very regular and can basically meet most of our needs, that is of course good, and the dwd layer is not really necessary. In reality, however, it is difficult to guarantee the quality of data at the ods layer. The data comes from various sources, and each pusher has its own push logic. In that case, we need an additional dwd layer to mask some of the underlying differences.

Question: I think I understand. Does this mean that dwd mainly does data cleaning and normalization on the ods layer, and dws mainly does light summarization of the ods layer data?

Answer: Yes, it can be roughly understood in this way.

2.3 What does the app layer do?

Question 3: What does the app layer do?

Question: I feel there is nowhere to put the data mart layer. Should the data mart tables of each business go in dwd or app?

Answer: This question is not easy to answer. I think the main thing is to be clear about what the data mart layer does. If your data mart layer contains wide tables that the business side can use directly, just put them in the app layer. If the data mart layer you mean is a more general concept, then dws, dwd, and app can all be considered part of the data mart.

Q: Is the data stored in Redis and ES considered part of the app layer?

Answer: Yes. In my personal understanding, the app layer mainly holds relatively mature tables that the business side can use. These tables can stay in Hive, or be imported from Hive into a system with better query performance, such as Redis or ES.

3. Summary

Another blogger's picture sums it up well:

A subject is an abstract concept: a high-level way of consolidating, classifying, analyzing, and using the data in enterprise information systems. Each subject basically corresponds to one macro-level analysis domain. Logically, it corresponds to the objects of analysis involved in that domain within the enterprise. For example, "sales analysis" is an analysis domain, so the subject of that data warehouse application is "sales analysis".


Origin blog.csdn.net/kuangfeng88588/article/details/118378742