Data warehouse architecture model design reference

1. Data technology architecture

1.1. Technical architecture

[Figure: technical architecture diagram]

1.2. Data stratification

The data warehouse is divided into three layers, from bottom to top: data introduction layer (ODS, Operation Data Store), data common layer (CDM, Common Data Model) and data application layer (ADS, Application Data Service).
The layering of the data warehouse and the uses of each level are shown in the figure below.
[Figure: data warehouse layering and the uses of each layer]
● Data introduction layer ODS (Operation Data Store): stores raw, unprocessed data in the data warehouse system. It is structurally consistent with the source systems and serves as the data preparation area of the data warehouse. It is mainly responsible for introducing basic data into MaxCompute and recording the historical changes of that data.
● Data common layer CDM (Common Data Model, also known as the common data model layer): includes the DIM dimension tables, DWD, and DWS, and is built by processing ODS-layer data. It mainly completes data processing and integration, establishes consistent dimensions, builds reusable detailed fact tables oriented to analysis and statistics, and summarizes indicators at public granularity.
● Common dimension layer (DIM): based on the dimensional modeling concept, establishes consistent dimensions for the entire enterprise and reduces the risk of inconsistent calculation calibers and algorithms. Tables in the common dimension layer are often also called logical dimension tables; dimensions and dimension logical tables usually correspond one to one.
● Public summary granularity fact layer (DWS): takes the analyzed subject object as the modeling driver and, based on the indicator requirements of upper-layer applications and products, builds public-granularity summary indicator fact tables, physicalizing the model as wide tables. It constructs statistical indicators with standardized naming and consistent calibers, provides public indicators to the upper layer, and establishes summary wide tables and detailed fact tables. Tables in this layer are often called summary logical tables and store derived indicator data.
● Detailed-granularity fact layer (DWD): takes the business process as the modeling driver and, based on the characteristics of each specific business process, builds the finest-grained detailed fact tables. Combined with the data usage characteristics of the enterprise, some important dimension attribute fields of the detailed fact tables can be made appropriately redundant, that is, wide-table processing. Tables in this layer are often also called logical fact tables.
● Data application layer ADS (Application Data Service): stores the personalized statistical indicator data of data products, generated by processing the CDM and ODS layers.

The data classification architecture divides the ODS layer into three parts: the data preparation area, the offline data area, and the quasi-real-time data area. The overall data classification architecture is shown in the figure below.
[Figure: overall data classification architecture]
Data from the source business systems is synchronized to the ODS layer of the data warehouse through the synchronization and integration tool. After data development produces wide fact tables, public aggregation is performed along dimensions such as product and region. The data processing flow is as follows:
[Figure: data processing flow]

1.3. Data division and namespace convention

Divide the data by business and agree on naming conventions. It is recommended to agree on English abbreviations for business names in combination with the data layers; these abbreviations serve as an important reference when naming project spaces, tables, fields, and so on during subsequent data development. An illustrative naming sketch follows this list.
● Divide by business: when naming, divide according to the main business to guide the division principles, naming principles, and ODS projects used in the physical model. For example, the English abbreviation of Alibaba's "Taobao" business can be defined as "tb".
● Divide by data domain: when naming, divide the CDM-layer data into data domains to manage the data effectively and guide the naming of data tables. For example, the English abbreviation of the "transaction" data domain can be defined as "trd".
● Divide by business process: when a data domain contains multiple business processes, names can also be divided by business process. A business process is an objective or abstract business behavior viewed from the perspective of data analysis. For example, the English abbreviation of the "refund" business process in the transaction data domain can be named "rfd_ent".
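
The following is a hypothetical naming sketch (the table name, fields, and process abbreviation "rfd" are illustrative only) showing how agreed abbreviations for layer, business, data domain, business process, and refresh frequency can compose into a table name:

-- Hypothetical composition: <layer>_<business>_<data domain>_<business process>_<refresh frequency>
-- dwd (detailed-granularity fact layer) + tb ("Taobao" business) + trd (transaction domain)
-- + rfd (illustrative abbreviation for the refund process) + di (daily increment)
CREATE TABLE IF NOT EXISTS dwd_tb_trd_rfd_di
(
    refund_id     STRING COMMENT 'Refund ID (illustrative field)',
    refund_fee    DOUBLE COMMENT 'Refund amount (illustrative field)'
)
COMMENT 'Hypothetical example of the naming convention above'
PARTITIONED BY (ds STRING COMMENT 'YYYYMMDD');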

2. Data model

Models are reflections and abstractions of real things and help us better understand the objective world. A data model defines the relationships and structures between data so that we can obtain the data we want in a regular, predictable way. For example, in a supermarket, the layout of goods follows specific specifications, and goods are placed according to consumers' buying habits and the flow of people.

● The role of the data model. Data model design is the first step of data warehouse work after business requirements analysis. A good data model helps us store data better, obtain data more efficiently, and ensure consistency between data.
● Basic principles of model design
● High cohesion and low coupling: which records and fields a logical or physical model contains should follow the high-cohesion, low-coupling principle of basic software design methodology, considered from two perspectives, the business characteristics and the access characteristics of the data: design data with similar or related business and the same granularity into one logical or physical model; store data that is frequently accessed together in the same place, and store data that is rarely accessed together separately.
● Separate the core model from the extended model, establishing a core model and extended model system. The fields of the core model support common core business, while the fields of the extended model support personalization or the needs of a small number of applications. When the core model must be associated with the extended model, do not let extended fields intrude excessively into the core model, so as not to damage the core model's architectural simplicity and maintainability.
● Public processing logic sinks to a single bottom layer. Public processing logic should be encapsulated and implemented at the bottom layer that data scheduling depends on; do not expose common processing logic to the application layer, and do not let common logic exist in multiple places at the same time.
● Cost and performance balance: appropriate data redundancy can be exchanged for query and refresh performance, but excessive redundancy and data duplication are not appropriate.
● Data can be rolled back: with the processing logic unchanged, running the data multiple times at different times must produce identical, deterministic results.
● Consistency: fields that carry the same meaning must have the same field names in different tables.
● Clear and understandable naming. Table naming conventions must be clear and consistent, and table names must be easy for downstream users to understand and use.

2.1. Data access ODS layer

The ODS layer stores the rawest data obtained from the business systems and is the source data for the upper layers. Data in business systems is usually very detailed data that has accumulated over a long period of time and is accessed frequently; it is application-oriented data.

2.1.1. Table design

The main data in the ODS layer includes transaction system order details, user information details, product details, and so on. This data is unprocessed and is the rawest data. Logically, it is stored in the form of two-dimensional tables. Although strictly speaking the ODS layer does not belong to the scope of data warehouse modeling, properly planning the ODS layer and doing data synchronization well is also very important.
Note
● Table and field naming should stay consistent with the business system, but additional identifiers are needed to distinguish incremental and full tables. For example, a _delta suffix can be agreed on to identify an incremental table (as in the ods_..._delta tables below).
● Pay special attention to naming conflicts. For example, tables in different business systems may have the same name; to distinguish two tables with the same name, add the source database name as a prefix or suffix. If a field name happens to collide with a reserved keyword, this can be resolved by appending a _col1 suffix. A hypothetical sketch follows.
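
The following is a hypothetical sketch of both conflict cases, assuming two source databases named crm and oms that each contain a table named user, and a source column named order that collides with a reserved keyword:

-- Same-named source tables are distinguished by prefixing the source database name.
CREATE TABLE IF NOT EXISTS ods_crm_user
(
    id            STRING COMMENT 'User ID in the crm source system',
    order_col1    STRING COMMENT 'Column originally named "order"; renamed with a _col1 suffix to avoid the keyword'
)
COMMENT 'Hypothetical: the "user" table synchronized from the crm database'
PARTITIONED BY (ds STRING COMMENT 'YYYYMMDD');
CREATE TABLE IF NOT EXISTS ods_oms_user
(
    id            STRING COMMENT 'User ID in the oms source system'
)
COMMENT 'Hypothetical: the "user" table synchronized from the oms database'
PARTITIONED BY (ds STRING COMMENT 'YYYYMMDD');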

2.1.2. Table creation example

In this example, 6 ODS tables are used:
● Record product information for auction: ods_auction.
● Record product information for normal sales: ods_sale.
● Record extended user information: ods_users_extra.
● Record the newly added product transaction order information: ods_biz_order_delta.
● Record the newly added logistics order information: ods_logistics_order_delta.
● Record the newly added payment order information: ods_pay_order_delta.
The table creation statement is as follows:

CREATE TABLE IF NOT EXISTS ods_auction
(
    id                             STRING COMMENT 'Product ID',
    title                          STRING COMMENT 'Product name',
    gmt_modified                   STRING COMMENT 'Product last modified date',
    price                          DOUBLE COMMENT 'Product transaction price, in yuan',
    starts                         STRING COMMENT 'Product listing time',
    minimum_bid                    DOUBLE COMMENT 'Auction starting price, in yuan',
    duration                       STRING COMMENT 'Validity period / sales cycle, in days',
    incrementnum                   DOUBLE COMMENT 'Auction bid increment',
    city                           STRING COMMENT 'City where the product is located',
    prov                           STRING COMMENT 'Province where the product is located',
    ends                           STRING COMMENT 'Sales end time',
    quantity                       BIGINT COMMENT 'Quantity',
    stuff_status                   BIGINT COMMENT 'Product condition: 0 brand new, 1 idle, 2 second-hand',
    auction_status                 BIGINT COMMENT 'Product status: 0 normal, 1 deleted by user, 2 off the shelves, 3 never listed',
    cate_id                        BIGINT COMMENT 'Product category ID',
    cate_name                      STRING COMMENT 'Product category name',
    commodity_id                   BIGINT COMMENT 'Category ID',
    commodity_name                 STRING COMMENT 'Category name',
    umid                           STRING COMMENT 'Buyer umid'
)
COMMENT 'Product auction ODS table'
PARTITIONED BY (ds STRING COMMENT 'Format: YYYYMMDD')
LIFECYCLE 400;
CREATE TABLE IF NOT EXISTS ods_sale
(
    id                             STRING COMMENT 'Product ID',
    title                          STRING COMMENT 'Product name',
    gmt_modified                   STRING COMMENT 'Product last modified date',
    starts                         STRING COMMENT 'Product listing time',
    price                          DOUBLE COMMENT 'Product price, in yuan',
    city                           STRING COMMENT 'City where the product is located',
    prov                           STRING COMMENT 'Province where the product is located',
    quantity                       BIGINT COMMENT 'Quantity',
    stuff_status                   BIGINT COMMENT 'Product condition: 0 brand new, 1 idle, 2 second-hand',
    auction_status                 BIGINT COMMENT 'Product status: 0 normal, 1 deleted by user, 2 off the shelves, 3 never listed',
    cate_id                        BIGINT COMMENT 'Product category ID',
    cate_name                      STRING COMMENT 'Product category name',
    commodity_id                   BIGINT COMMENT 'Category ID',
    commodity_name                 STRING COMMENT 'Category name',
    umid                           STRING COMMENT 'Buyer umid'
)
COMMENT 'Normal product sales ODS table'
PARTITIONED BY (ds STRING COMMENT 'Format: YYYYMMDD')
LIFECYCLE 400;
CREATE TABLE IF NOT EXISTS ods_users_extra
(
    id                STRING COMMENT 'User ID',
    logincount        BIGINT COMMENT 'Number of logins',
    buyer_goodnum     BIGINT COMMENT 'Number of positive reviews as a buyer',
    seller_goodnum    BIGINT COMMENT 'Number of positive reviews as a seller',
    level_type        BIGINT COMMENT 'Shop level: 1 level-1 shop, 2 level-2 shop, 3 level-3 shop',
    promoted_num      BIGINT COMMENT 'Service level: 1 A-level service, 2 B-level service, 3 C-level service',
    gmt_create        STRING COMMENT 'Creation time',
    order_id          BIGINT COMMENT 'Order ID',
    buyer_id          BIGINT COMMENT 'Buyer ID',
    buyer_nick        STRING COMMENT 'Buyer nickname',
    buyer_star_id     BIGINT COMMENT 'Buyer star-rating ID',
    seller_id         BIGINT COMMENT 'Seller ID',
    seller_nick       STRING COMMENT 'Seller nickname',
    seller_star_id    BIGINT COMMENT 'Seller star-rating ID',
    shop_id           BIGINT COMMENT 'Shop ID',
    shop_name         STRING COMMENT 'Shop name'
)
COMMENT 'User extension table'
PARTITIONED BY (ds STRING COMMENT 'Format: yyyymmdd')
LIFECYCLE 400;
CREATE TABLE IF NOT EXISTS ods_biz_order_delta
(
    biz_order_id         STRING COMMENT 'Order ID',
    pay_order_id         STRING COMMENT 'Payment order ID',
    logistics_order_id   STRING COMMENT 'Logistics order ID',
    buyer_nick           STRING COMMENT 'Buyer nickname',
    buyer_id             STRING COMMENT 'Buyer ID',
    seller_nick          STRING COMMENT 'Seller nickname',
    seller_id            STRING COMMENT 'Seller ID',
    auction_id           STRING COMMENT 'Product ID',
    auction_title        STRING COMMENT 'Product title',
    auction_price        DOUBLE COMMENT 'Product price',
    buy_amount           BIGINT COMMENT 'Purchase quantity',
    buy_fee              BIGINT COMMENT 'Purchase amount',
    pay_status           BIGINT COMMENT 'Payment status: 1 unpaid, 2 paid, 3 refunded',
    logistics_id         BIGINT COMMENT 'Logistics order ID',
    mord_cod_status      BIGINT COMMENT 'Logistics status: 0 initial, 1 order accepted, 2 order acceptance timed out, 3 pickup succeeded, 4 pickup failed, 5 receipt confirmed, 6 receipt confirmation failed, 7 logistics order canceled by user',
    status               BIGINT COMMENT 'Status: 0 order normal, 1 order invisible',
    sub_biz_type         BIGINT COMMENT 'Business type: 1 auction, 2 purchase',
    end_time             STRING COMMENT 'Transaction end time',
    shop_id              BIGINT COMMENT 'Shop ID'
)
COMMENT 'Daily incremental table of successful transaction orders'
PARTITIONED BY (ds STRING COMMENT 'Format: yyyymmdd')
LIFECYCLE 7200;
CREATE TABLE IF NOT EXISTS ods_logistics_order_delta
(
    logistics_order_id STRING COMMENT 'Logistics order ID',
    post_fee           DOUBLE COMMENT 'Logistics fee',
    address            STRING COMMENT 'Shipping address',
    full_name          STRING COMMENT 'Consignee full name',
    mobile_phone       STRING COMMENT 'Mobile phone',
    prov               STRING COMMENT 'Province',
    prov_code          STRING COMMENT 'Province ID',
    city               STRING COMMENT 'City',
    city_code          STRING COMMENT 'City ID',
    logistics_status   BIGINT COMMENT 'Logistics status: 1 not shipped, 2 shipped, 3 received, 4 returned, 5 being allocated',
    consign_time       STRING COMMENT 'Shipping time',
    gmt_create         STRING COMMENT 'Order creation time',
    shipping           BIGINT COMMENT 'Shipping method: 1 surface mail, 2 express, 3 EMS',
    seller_id          STRING COMMENT 'Seller ID',
    buyer_id           STRING COMMENT 'Buyer ID'
)
COMMENT 'Daily incremental table of transaction logistics orders'
PARTITIONED BY (ds STRING COMMENT 'Date')
LIFECYCLE 7200;
CREATE TABLE IF NOT EXISTS ods_pay_order_delta
(
    pay_order_id     STRING COMMENT 'Payment order ID',
    total_fee        DOUBLE COMMENT 'Total amount payable (quantity * unit price)',
    seller_id        STRING COMMENT 'Seller ID',
    buyer_id         STRING COMMENT 'Buyer ID',
    pay_status       BIGINT COMMENT 'Payment status: 1 awaiting buyer payment, 2 awaiting seller shipment, 3 transaction successful',
    pay_time         STRING COMMENT 'Payment time',
    gmt_create       STRING COMMENT 'Order creation time',
    refund_fee       DOUBLE COMMENT 'Refund amount (including shipping fee)',
    confirm_paid_fee DOUBLE COMMENT 'Amount for which receipt has been confirmed'
)
COMMENT 'Incremental table of transaction payment orders'
PARTITIONED BY (ds STRING COMMENT 'Date')
LIFECYCLE 7200;

2.1.3. Storage design

In order to meet the needs of historical data analysis, a time dimension can be added as a partition field to ODS-layer tables. In practice, you can choose incremental storage, full storage, or zipper storage.
● Incremental storage: incremental storage by day, with the business date as the partition; each partition stores the daily incremental business data. For example:
○ On January 1, user A visited e-commerce store B of company A, and company A's e-commerce log generated record t1. On January 2, user A visited e-commerce store C of company A, and the log generated record t2. Using incremental storage, t1 is stored in the January 1 partition and t2 is stored in the January 2 partition.
○ On January 1, user A purchased product B on company A's e-commerce website, and the transaction log generated record t1. On January 2, user A returned product B, and the transaction log updated record t1. Using incremental storage, the originally purchased t1 record is stored in the January 1 partition, and the updated t1 is stored in the January 2 partition.

[Explanation] ODS tables with strong transactional characteristics, such as transactions and logs, are suitable for incremental storage. Such tables have large data volumes, so the storage cost of full storage would be high; in addition, downstream applications have little need to access the full historical data (such needs can be met later through aggregation in the data warehouse). For example, a log ODS table has no data-update business process, so UNIONing all the incremental partitions together yields the full data set (see the query sketch after the zipper storage example below).

● Full storage: full storage by day, with the business date as the partition; each partition stores a full snapshot of the business data as of that business date. For example, on January 1, seller A published two products, B and C, on company A's e-commerce website, and the front-end product table generated records t1 and t2. On January 2, seller A removed product B from the shelves and published product D; the front-end product table updated record t1 and created a new record t3. Using full storage, records t1 and t2 are stored in the January 1 partition, and the updated t1 together with t2 and t3 are stored in the January 2 partition.
[Explanation] For slowly changing dimensional data with a small amount of data, such as product categories, full storage can be used directly.
● Zipper storage records all change data with daily granularity by adding two new timestamp fields (start_dt and end_dt). Usually the partition fields are also these two timestamp fields.
Examples of zipper storage are as follows.

Product    start_dt    end_dt      Seller    Status
B          20160101    20160102    A         On the shelves
C          20160101    30001231    A         On the shelves
B          20160102    30001231    A         Off the shelves

In this way, downstream applications can obtain historical snapshots by restricting the timestamp fields. For example, to access the data as of January 1, it is only necessary to restrict start_dt <= 20160101 and end_dt > 20160101 (see the query sketch below).
[Explanation] For large amounts of slowly changing dimensional data, such as member information tables, zipper tables can be used to store them.
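
The following are minimal query sketches for the storage methods above. The incremental query assumes the ods_pay_order_delta table from the table creation example; the zipper query assumes a hypothetical zipper table named dim_member_zip with the start_dt and end_dt fields described above.

-- Reconstruct a full data set from daily incremental partitions
-- (log-style tables with no data-update business process):
SELECT  *
FROM    ods_pay_order_delta
WHERE   ds <= '20160102';          -- union of all incremental partitions up to the business date

-- Point-in-time query against a zipper table: data as of January 1.
SELECT  *
FROM    dim_member_zip             -- hypothetical zipper table
WHERE   start_dt <= '20160101'
AND     end_dt   >  '20160101';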

2.1.4. Data quality specifications

● Each ODS full table must be configured with a unique-key field identifier.
● Each ODS full table must have comments.
● Each ODS full table must have monitoring for empty partition data.
● It is recommended to monitor the enumeration value changes and the enumeration value distribution of important enumeration-type fields in important tables.
● It is recommended to set up week-over-week monitoring of the table size and the number of records in ODS tables; no week-over-week change may indicate that the source system has been migrated or taken offline. A minimal monitoring sketch follows this list.
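
The following is a minimal monitoring sketch (the partition values and the alert rule are illustrative), counting records in the current partition and in the corresponding partition one week earlier; a count of 0 or an unchanged week-over-week count should trigger an alert:

SELECT  ds,
        COUNT(*) AS record_cnt               -- number of records per daily partition
FROM    ods_biz_order_delta
WHERE   ds IN ('20160108', '20160101')       -- current partition vs. the same weekday last week
GROUP BY ds;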

2.1.5. Other specifications

● A source table of a system is only allowed to be synchronized into the data warehouse once, to keep the table structure consistent.
● When Data Integration synchronizes full data, it writes directly into the current-day partition of the full table.
● All ODS-layer tables are stored as tables partitioned by statistical date, and data costs are controlled through storage management and lifecycle policies.
● If new fields are added in the source system, the data integration synchronization job needs to be reconfigured. If a field in the target table does not exist in the source system, Data Integration automatically fills it with NULL.

2.2. CDM common dimension layer (DIM)

The common dimension layer (DIM) is built on the dimensional modeling concept to establish consistent dimensions across the entire enterprise.

The common dimension layer (DIM) is mainly composed of dimension tables. A dimension is a logical concept, a perspective from which to measure and observe the business. Dimension tables are physical tables built on the data platform based on dimensions and their attributes, following the wide-table design principle. Therefore, building the common dimension layer (DIM) first requires defining the dimensions.

2.2.1. Define dimensions

When dividing data domains and building the bus matrix, dimensions need to be defined in combination with analysis of the business processes. Taking the marketing business segment of e-commerce company A as an example, in the transaction data domain we focus on the business process of confirming receipt (transaction success). The confirm-receipt business process mainly involves two business perspectives, which depend on the two dimensions of product and receiving location (in this tutorial, it is assumed that receiving and purchasing happen at the same location). From the product perspective, the following dimensions can be defined:
● Product ID
● Product name
● Product price
● Product condition: 0 means brand new, 1 means idle, and 2 means second-hand.
● Product category ID
● Product category name
● Category ID
● Category name
● Buyer ID
● Product status: 0 means normal, 1 means user deleted, 2 means removed from shelves, 3 means never put on shelves.
● The city where the product is located
● The province where the product is located
From a regional perspective, the following dimensions can be defined:
● Buyer ID
● City code
● City name
● Province code
● Province name

As the core of dimensional modeling, dimensions must be unique in an enterprise-level data warehouse. Taking company A's product dimension as an example, there is one and only one dimension definition. For example, the province-code dimension conveys consistent information across all business processes.

2.2.2. Design dimension table

After completing the dimension definition, you can supplement the dimensions to generate a dimension table. When designing dimension tables, you need to pay attention to:
● It is recommended that a dimension table contain no more than 10 million records.
● When joining dimension tables with other tables, it is recommended that you use Map Join.
● Avoid updating dimension table data too frequently.
When designing a dimension table, you need to consider the following aspects:
● The stability of the data in the dimension table. For example, company A's e-commerce members usually do not expire, but member data may be updated at any time. In this case, consider using a single partition to store the full data. If there are records that will no longer be updated, you may need to create a historical table and a daily table separately: the daily table stores the currently valid records, preventing the table's data volume from ballooning; the historical table writes each expired record into the partition corresponding to its expiry time, so that a single partition stores the records that expired on that date.
● Whether the dimension table needs to be split vertically. If a dimension table has a large number of unused attributes, or queries become slow because the table carries too many attribute fields, consider splitting the fields and creating multiple dimension tables.
● Whether the dimension table needs to be split horizontally. If there are clear boundaries between records, consider splitting it into multiple tables or designing multi-level partitions.
● The production time of core dimension tables, which usually has strict requirements.

The main steps in designing dimension tables are as follows:

  1. Initially define the dimensions to ensure dimension consistency.
  2. Determine the main dimension table (a star schema is used in this tutorial). The main dimension table is usually a data introduction layer (ODS) table that is directly synchronized from the business system. For example, s_auction is the product table synchronized from the front-end product center system, and it is the main dimension table.
  3. Determine the related dimension tables. The data warehouse integrates data from business source systems, and there are correlations between tables within or across business systems. Based on business analysis, determine which tables are associated with the main dimension table and select some of them to generate dimension attributes. Taking the product dimension as an example, business analysis shows that products are related to categories, sellers, shops, and other dimensions.
  4. Determine the dimension attributes. This mainly includes two stages: the first selects dimension attributes from the main dimension table or generates new ones; the second selects dimension attributes from the related dimension tables or generates new ones. Taking the product dimension as an example, dimension attributes are selected or generated from the main dimension table (s_auction) and from related dimension tables such as category, seller, and shop. When designing dimension attributes, note the following:
    ○ Generate as many dimension attributes as possible.
    ○ Give as many meaningful textual descriptions as possible.
    ○ Distinguish between numeric attributes and facts.
    ○ Try to distill out common, reusable dimension attributes.

2.2.3. Design criteria

● Consistent dimension specification. The field names, data types, and data contents of the same dimension attribute in different physical tables of the common dimension layer must be consistent, except in the following situations:
● In different physical tables, if another name must be used because the dimension plays a different role, that other name must still be an alias of the canonical dimension attribute. For example, once a standard member ID is defined and a table needs to represent both a buyer ID and a seller ID, the buyer ID and seller ID are defined for the member ID at the design specification stage. If the names are temporarily inconsistent for historical reasons, a standard dimension attribute must still be defined in the canonical dimension, and the differing physical names must be aliases of that standard dimension attribute.
● Combination and splitting of dimensions
● Combination principles
○ Implement fields with strong business relevance described by a dimension in one physical dimension table. Strong relevance means the fields often need to be queried or displayed together, there is a natural relationship between the attributes of the two dimensions, and so on. For example, the basic attributes of a product and the brand it belongs to.
○ For unrelated dimensions, a miscellaneous (junk) dimension can be considered where appropriate (for example, for transactions): a transaction miscellaneous dimension can be built to collect the special tag attributes, business classifications, and other information of transactions. Miscellaneous dimensions can also be degenerated into the fact table, but this easily makes the fact table relatively large and the processing complicated.
○ So-called behavioral dimensions are aggregated indicators that are treated as dimensions by downstream applications. If necessary, measures can be redundantly stored in dimension tables as behavioral dimensions.
● Splitting and redundancy
○ Dimension tables with too many dimension attributes drawn from many sources (such as a member table) can be split appropriately:
○ Split into a core table and an extended table. The core table has relatively few fields and is produced earlier, so it is used first. The extended table has many fields and may redundantly include some core-table fields; it is produced later and is more suitable for data analysts.
○ Based on how weakly the dimension attributes are related to each other business-wise, split weakly related attributes into multiple physical tables for storage.
○ For dimension tables with a large number of records (such as the product table), some subsets can be appropriately made redundant to reduce the amount of data scanned downstream:
○ A related dimension table containing only the members with activity on that day can be generated based on whether there was any activity, so as to reduce the amount of data scanned by applications.
○ Appropriate subset redundancy can be applied based on the size of the data range scanned by the business it belongs to.

2.2.4. Table creation example

In this example, the final dimension table creation statements are as follows.

CREATE TABLE IF NOT EXISTS dim_asale_itm
(
    item_id              BIGINT COMMENT 'Product ID',
    item_title           STRING COMMENT 'Product name',
    item_price           DOUBLE COMMENT 'Product transaction price, in yuan',
    item_stuff_status    BIGINT COMMENT 'Product condition: 0 brand new, 1 idle, 2 second-hand',
    cate_id              BIGINT COMMENT 'Product category ID',
    cate_name            STRING COMMENT 'Product category name',
    commodity_id         BIGINT COMMENT 'Category ID',
    commodity_name       STRING COMMENT 'Category name',
    umid                 STRING COMMENT 'Buyer ID',
    item_status          BIGINT COMMENT 'Product status: 0 normal, 1 deleted by user, 2 off the shelves, 3 never listed',
    city                 STRING COMMENT 'City where the product is located',
    prov                 STRING COMMENT 'Province where the product is located'
)
COMMENT 'Full product dimension table'
PARTITIONED BY (ds STRING COMMENT 'Date, yyyymmdd');
CREATE TABLE IF NOT EXISTS dim_pub_area
(
    buyer_id       STRING COMMENT 'Buyer ID',
    city_code      STRING COMMENT 'City code',
    city_name      STRING COMMENT 'City name',
    prov_code      STRING COMMENT 'Province code',
    prov_name      STRING COMMENT 'Province name'
)
COMMENT 'Common region dimension table'
PARTITIONED BY (ds STRING COMMENT 'Date partition, format yyyymmdd')
LIFECYCLE 3600;
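
The following is a minimal population sketch for dim_asale_itm, assuming the dimension table is refreshed daily by unioning the ods_auction and ods_sale tables from the ODS example, and that a DataWorks-style scheduling parameter ${bizdate} supplies the business date:

INSERT OVERWRITE TABLE dim_asale_itm PARTITION (ds = '${bizdate}')
SELECT  *
FROM    (
            -- Auction products of the current business date
            SELECT  CAST(id AS BIGINT) AS item_id, title, price, stuff_status,
                    cate_id, cate_name, commodity_id, commodity_name, umid,
                    auction_status, city, prov
            FROM    ods_auction
            WHERE   ds = '${bizdate}'
            UNION ALL
            -- Normally sold products of the current business date
            SELECT  CAST(id AS BIGINT), title, price, stuff_status,
                    cate_id, cate_name, commodity_id, commodity_name, umid,
                    auction_status, city, prov
            FROM    ods_sale
            WHERE   ds = '${bizdate}'
        ) t;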

2.3. CDM fine-grained fact layer (DWD)

The detailed-granularity fact layer takes the business process as the modeling driver and, based on the characteristics of each specific business process, builds the finest-grained detailed fact tables. Combined with the data usage characteristics of the enterprise, some important dimension attribute fields of the detailed fact tables can be made appropriately redundant, that is, wide-table processing.

The fact tables of the public summary granularity fact layer (DWS) and the detailed granularity fact layer (DWD) are the core of data warehouse dimensional modeling and must be designed closely around the business processes. A business process is described by obtaining the measures that describe it, including the dimensions it references and the measures related to it. Measures are usually numeric data and form the basis of the fact logical table. The descriptive information of the fact logical table consists of fact attributes, and the foreign key fields among the fact attributes are related through the corresponding dimensions.

The degree of business detail expressed by one record in a fact table is called the granularity. Generally, granularity can be expressed in two ways: the level of detail represented by the combination of dimension attributes, or the specific business meaning represented.

Facts, which measure the business process, are usually integer or floating-point numeric values. There are three types: additive, semi-additive, and non-additive:

● Additive facts mean that they can be summarized according to any dimension associated with the fact table.
● Semi-additive facts can only be summarized along specific dimensions, not all dimensions. For example, inventory can be summarized by location and product, but it is meaningless to accumulate inventory for each month of the year by time dimension.
● Non-additive facts, such as ratio facts, cannot be summarized along any dimension. For non-additive facts, aggregation can be achieved by decomposing them into additive components (see the sketch after this list).
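
The following is a minimal sketch of that decomposition, assuming a hypothetical summary table dws_trd_itm_stat_1d that stores the two additive components (refund count and order count) per category: the components are summed first, and the non-additive ratio is derived only after aggregation.

SELECT  cate_id,
        SUM(refund_cnt)                  AS refund_cnt,    -- additive component
        SUM(order_cnt)                   AS order_cnt,     -- additive component
        SUM(refund_cnt) / SUM(order_cnt) AS refund_rate    -- non-additive ratio, derived after summing
FROM    dws_trd_itm_stat_1d              -- hypothetical summary table
WHERE   ds = '20160101'
GROUP BY cate_id;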

Fact tables are usually narrower and longer than dimension tables, and their rows grow faster. Dimension attributes can also be stored in fact tables; dimension columns stored in the fact table in this way are called degenerate dimensions, which can speed up queries. Like dimensions stored in dimension tables, degenerate dimensions can be used to filter fact-table queries, implement aggregation operations, and so on.

The fine-grained fact layer (DWD) is usually divided into three types: transaction fact table, periodic snapshot fact table and cumulative snapshot fact table.

Transactional fact table

Transaction fact tables are used to describe business processes, track measurement events at a certain point in space or time, and store the most atomic data, also called atomic fact tables. Transactional fact tables are mainly used to analyze behavior and track events. The transaction fact table obtains event or behavior details in the business process, and then through the association between facts and dimensions, various event-related metrics can be easily counted, such as browsing UVs, search times, etc.

● Design transactional fact tables based on the analysis of data application requirements. If there is significant downstream demand for analysis indicators of a certain business process event, consider building a transactional fact table for that event process.
● Transactional fact tables generally use the event occurrence date or time as the partition field; this partitioning facilitates data scanning and partition pruning for downstream jobs (see the sketch after this list).
● Redundantly including subsets in detail-layer fact tables helps reduce the IO overhead of upper-layer data access.
● Degenerating dimensions of the detail-layer fact table into the fact table helps reduce the JOIN cost of upper-layer data access.
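
The following is a minimal partition-pruning sketch against the dwd_asale_trd_ord_di table created in the table creation example below; restricting the partition column ds means only the required daily partition is scanned:

SELECT  buyer_id,
        COUNT(*) AS order_cnt          -- orders per buyer on the business date
FROM    dwd_asale_trd_ord_di
WHERE   ds = '20160101'                -- only the 20160101 partition is scanned
GROUP BY buyer_id;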

Periodic snapshot fact table

Periodic snapshot fact tables record facts at regular, predictable intervals. The periodic snapshot fact table is mainly used to analyze status or stock facts. Snapshots sample state metrics at predetermined intervals.

Cumulative snapshot fact table

The cumulative snapshot fact table is used to express key step events between the beginning and end of the process, covering the entire life cycle of the process, and usually has multiple date fields to record key time points. As the cumulative snapshot fact table continues to change over its lifetime, records will also be modified as processes change.

The cumulative snapshot fact table is a fact table constructed based on joint analysis of multiple business processes. For example, the circulation link of purchase orders, etc.

The cumulative snapshot fact table is mainly used to analyze the time intervals and cycles between events. For example, use the interval between payment and delivery of a transaction to analyze delivery speed, or analyze payment refund rates in the payment and refund links, etc.

The cumulative snapshot fact table can also help analyze a small number of statistical indicators that are not very sensitive to refresh time. For example, if the current transactional fact table does not support them and there are only a few such indicators, and you need to analyze the closing and shipment of transactions, the calculation can be based on the cumulative snapshot fact table.

2.3.1. Detailed granularity fact table design

The design principles for detailed-granularity fact tables are as follows:
● Usually, a fine-grained fact table is associated with only one dimension.
● Include as many facts as possible relevant to the business process.
● Select only facts relevant to the business process.
● Decompose nonadditivity facts into additive components.
● The granularity must be declared before selecting dimensions and facts.
● There cannot be multiple facts of different granularities in the same fact table.
● The units of facts must be consistent.
● Handle NULL values with caution.
● Use degenerate dimensions to improve the usability of fact tables.

The overall design process of the detailed granularity fact table is shown in the figure below.
[Figure: overall design process of the detailed-granularity fact table]
The transaction business processes and their measures have already been defined among the consistent measures. Detailed fact tables focus on model design for the business processes. The design of a detailed fact table can be divided into four steps: select the business process, determine the granularity, select the dimensions, and determine the facts (measures). The granularity mainly records the semantic description of the business activity without expanding the dimensions. When building a detailed fact table, you need to choose the existing tables on which to develop the detail-layer data and know what granularity of data the records of those tables store.

2.3.2. Table creation example

The DWD layer in this example is mainly composed of three tables:
● Transaction commodity information fact table: dwd_asale_trd_itm_di.
● Transaction member information fact table: dwd_asale_trd_mbr_di.
● Transaction order information fact table: dwd_asale_trd_ord_di.

Degenerate dimensions are fully used to improve query efficiency. The table creation statements are as follows.

CREATE TABLE IF NOT EXISTS dwd_asale_trd_itm_di
(
    item_id              BIGINT COMMENT 'Product ID',
    item_title           STRING COMMENT 'Product name',
    item_price           DOUBLE COMMENT 'Product price',
    item_stuff_status    BIGINT COMMENT 'Product condition: 0 brand new, 1 idle, 2 second-hand',
    item_prov            STRING COMMENT 'Product province',
    item_city            STRING COMMENT 'Product city',
    cate_id              BIGINT COMMENT 'Product category ID',
    cate_name            STRING COMMENT 'Product category name',
    commodity_id         BIGINT COMMENT 'Category ID',
    commodity_name       STRING COMMENT 'Category name',
    buyer_id             BIGINT COMMENT 'Buyer ID'
)
COMMENT 'Transaction product information fact table'
PARTITIONED BY (ds STRING COMMENT 'Date')
LIFECYCLE 400;
CREATE TABLE IF NOT EXISTS dwd_asale_trd_mbr_di
(
    order_id         BIGINT COMMENT 'Order ID',
    bc_type          STRING COMMENT 'Business classification',
    buyer_id         BIGINT COMMENT 'Buyer ID',
    buyer_nick       STRING COMMENT 'Buyer nickname',
    buyer_star_id    BIGINT COMMENT 'Buyer star-rating ID',
    seller_id        BIGINT COMMENT 'Seller ID',
    seller_nick      STRING COMMENT 'Seller nickname',
    seller_star_id   BIGINT COMMENT 'Seller star-rating ID',
    shop_id          BIGINT COMMENT 'Shop ID',
    shop_name        STRING COMMENT 'Shop name'
)
COMMENT 'Transaction member information fact table'
PARTITIONED BY (ds STRING COMMENT 'Date')
LIFECYCLE 400;
CREATE TABLE IF NOT EXISTS dwd_asale_trd_ord_di
(
    order_id              BIGINT COMMENT 'Order ID',
    pay_order_id          BIGINT COMMENT 'Payment order ID',
    pay_status            BIGINT COMMENT 'Payment status: 1 unpaid, 2 paid, 3 refunded',
    succ_time             STRING COMMENT 'Order transaction end time',
    item_id               BIGINT COMMENT 'Product ID',
    item_quantity         BIGINT COMMENT 'Purchase quantity',
    confirm_paid_amt      DOUBLE COMMENT 'Amount of the order for which receipt has been confirmed',
    logistics_id          BIGINT COMMENT 'Logistics order ID',
    mord_prov             STRING COMMENT 'Consignee province',
    mord_city             STRING COMMENT 'Consignee city',
    mord_lgt_shipping     BIGINT COMMENT 'Shipping method: 1 surface mail, 2 express, 3 EMS',
    mord_address          STRING COMMENT 'Consignee address',
    mord_mobile_phone     STRING COMMENT 'Consignee mobile phone',
    mord_fullname         STRING COMMENT 'Consignee name',
    buyer_nick            STRING COMMENT 'Buyer nickname',
    buyer_id              BIGINT COMMENT 'Buyer ID'
)
COMMENT 'Transaction order information fact table'
PARTITIONED BY (ds STRING COMMENT 'Date')
LIFECYCLE 400;
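
The following is a minimal population sketch for dwd_asale_trd_ord_di, assuming the wide order fact table is built by joining the daily increments of the ODS order, logistics, and payment tables, and that ${bizdate} is a scheduling parameter supplying the business date:

INSERT OVERWRITE TABLE dwd_asale_trd_ord_di PARTITION (ds = '${bizdate}')
SELECT  CAST(o.biz_order_id AS BIGINT)   AS order_id,
        CAST(o.pay_order_id AS BIGINT)   AS pay_order_id,
        o.pay_status,
        o.end_time                       AS succ_time,
        CAST(o.auction_id AS BIGINT)     AS item_id,
        o.buy_amount                     AS item_quantity,
        p.confirm_paid_fee               AS confirm_paid_amt,
        o.logistics_id,
        l.prov                           AS mord_prov,
        l.city                           AS mord_city,
        l.shipping                       AS mord_lgt_shipping,
        l.address                        AS mord_address,
        l.mobile_phone                   AS mord_mobile_phone,
        l.full_name                      AS mord_fullname,
        o.buyer_nick,
        CAST(o.buyer_id AS BIGINT)       AS buyer_id
FROM    ods_biz_order_delta o
LEFT JOIN ods_logistics_order_delta l
       ON o.logistics_order_id = l.logistics_order_id
      AND l.ds = '${bizdate}'
LEFT JOIN ods_pay_order_delta p
       ON o.pay_order_id = p.pay_order_id
      AND p.ds = '${bizdate}'
WHERE   o.ds = '${bizdate}';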

2.4. CDM summary layer (DWS)

The public summary granularity fact layer uses the analyzed subject object as a modeling driver, and builds a public granularity summary indicator fact table based on the indicator requirements of the upper-layer applications and products. A table in the common summary layer usually corresponds to a derived indicator.

2.4.1. Public summary fact table design principles

Aggregation refers to summarizing data above its original granularity. The DWS public summary layer performs subject-oriented aggregation modeling for the analysis objects. In this tutorial, the final analysis goals are: the total sales of a certain category (for example, kitchenware) in each province in the last day, the names of the top 10 products by sales in that category, and the distribution of user purchasing power in each province. Therefore, we can summarize the last day's data from perspectives such as finally successfully traded products, categories, and buyers.

Note
● Aggregation does not span facts. Aggregation is a summary of the original star schema. To obtain query results consistent with the original model, the dimensions and measures of the aggregation must be consistent with the original model, so aggregation does not span fact tables.
● Aggregation improves query performance, but it also increases the difficulty of ETL maintenance. When the level-1 category corresponding to a subcategory changes, the data already summarized in the aggregation table needs to be readjusted.

In addition, the following principles need to be followed when designing the DWS layer:

● Data publicity: consider whether a summarized aggregation can be provided to third parties. You can judge whether aggregation along a certain dimension is frequently used in data analysis; if so, it is necessary to summarize the detailed data and deposit it into an aggregation table.
● Do not cross data domains. A data domain is an abstraction that classifies and aggregates data at a higher level. Data domains are usually classified by business process; for example, transactions belong to the transaction domain, while product additions and modifications belong to the product domain.
● Distinguish statistical periods. The table name should indicate the statistical period of the data. For example, _1d means the last day, _td means up to the current day, and _nd means the last N days.

Examples are as follows:
● dws_asale_trd_byr_subpay_1d (buyer-granularity payment summary fact table of e-commerce company A for the last day)
● dws_asale_trd_byr_subpay_td (buyer-granularity payment summary fact table of e-commerce company A up to the current day)
● dws_asale_trd_byr_cod_nd (buyer-granularity cash-on-delivery transaction summary fact table of e-commerce company A for the last N days)
● dws_asale_itm_slr_td (seller-granularity product inventory summary table of e-commerce company A up to the current day)
● dws_asale_itm_slr_hh (seller-granularity product hourly summary table of e-commerce company A; the time dimension is hour)
● dws_asale_itm_slr_mm (seller-granularity product minute summary table of e-commerce company A; the time dimension is minute)

2.4.2. Table creation example

The DWS-layer table creation statements that meet the business requirements are as follows.

CREATE TABLE IF NOT EXISTS dws_asale_trd_byr_ord_1d
(
    buyer_id                BIGINT COMMENT 'Buyer ID',
    buyer_nick              STRING COMMENT 'Buyer nickname',
    mord_prov               STRING COMMENT 'Consignee province',
    cate_id                 BIGINT COMMENT 'Product category ID',
    cate_name               STRING COMMENT 'Product category name',
    confirm_paid_amt_sum_1d DOUBLE COMMENT 'Total amount of orders with confirmed receipt in the last day'
)
COMMENT 'Buyer-granularity summary fact table of all transactions in the last day'
PARTITIONED BY (ds STRING COMMENT 'Partition field, YYYYMMDD')
LIFECYCLE 36000;
CREATE TABLE IF NOT EXISTS dws_asale_trd_itm_ord_1d
(
    item_id                 BIGINT COMMENT 'Product ID',
    item_title              STRING COMMENT 'Product name',
    cate_id                 BIGINT COMMENT 'Product category ID',
    cate_name               STRING COMMENT 'Product category name',
    mord_prov               STRING COMMENT 'Consignee province',
    confirm_paid_amt_sum_1d DOUBLE COMMENT 'Total amount of orders with confirmed receipt in the last day'
)
COMMENT 'Product-granularity summary fact table of transactions in the last day'
PARTITIONED BY (ds STRING COMMENT 'Partition field, YYYYMMDD')
LIFECYCLE 36000;
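
The following is a minimal aggregation sketch for dws_asale_trd_byr_ord_1d, assuming the last-day buyer summary is produced from the DWD order and product fact tables above and that ${bizdate} is a scheduling parameter supplying the business date:

INSERT OVERWRITE TABLE dws_asale_trd_byr_ord_1d PARTITION (ds = '${bizdate}')
SELECT  ord.buyer_id,
        ord.buyer_nick,
        ord.mord_prov,
        itm.cate_id,
        itm.cate_name,
        SUM(ord.confirm_paid_amt) AS confirm_paid_amt_sum_1d   -- confirmed-receipt amount in the last day
FROM    dwd_asale_trd_ord_di ord
LEFT JOIN dwd_asale_trd_itm_di itm
       ON ord.item_id = itm.item_id
      AND itm.ds = '${bizdate}'
WHERE   ord.ds = '${bizdate}'
GROUP BY ord.buyer_id, ord.buyer_nick, ord.mord_prov, itm.cate_id, itm.cate_name;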


Origin blog.csdn.net/docsz/article/details/132165997