Hive data warehouse construction manual

1 Layering and modeling theory of data warehouse

1.1 Purpose of data warehouse

  • Integrate all business data of the company and establish a unified data center
  • Generate business reports for decision making
  • Provide operational data support for website operations
  • Serve as a data source for each business line, forming a virtuous circle in which business data feeds back into the businesses
  • Analyze user behavior data, using data mining to reduce costs and improve the return on investment
  • Develop data products that directly or indirectly benefit the company

1.2 Data Warehouse Operation Architecture Diagram

img

1.3 The difference between data mart and data warehouse

Data Mart: a miniature data warehouse. It usually has less data, fewer subject areas, and less historical data, so it is department-level and generally serves only a limited scope of management.

Data warehouse (Data Warehouse): The data warehouse is enterprise-level and can provide decision support means for the operation of various departments of the entire enterprise.

img

1.4 Data Warehouse Hierarchy

1.4.1 Reasons for Stratification

  • Simplify complex problems: decompose complex tasks into multiple layers to complete, and each layer only handles simple tasks, which is convenient for locating problems.
  • Reduce repetitive development: standardize the data layering, and use the middle layer data to reduce a large number of repeated calculations and increase the reusability of calculation results.
  • Isolate raw data: decouple raw data from statistical data, both to contain data anomalies and to protect sensitive data.

1.4.2 Basic Hierarchical Model

ODS (data source layer, raw data) -- ETL --> DWD (data detail layer) -- Hive SQL --> DWS (data summary layer) -- Sqoop --> ADS (data application layer: reports, user profiles)

img

1.5 Data Warehouse Hierarchy

1.5.1 Overview of Data Warehouse Hierarchy

In Alibaba's data system, it is recommended to divide the data warehouse into three layers, from bottom to top:

Data introduction layer ODS (Operation Data Store): stores raw, unprocessed data in the data warehouse system. It is consistent in structure with the source systems and serves as the data preparation area of the data warehouse. Its main duties are importing basic data into MaxCompute and recording the historical changes of that basic data.

Data public layer CDM (Common Data Model, also known as the common data model layer): includes the DIM dimension tables, DWD, and DWS, processed from ODS layer data. It mainly performs data processing and integration: establishing consistent dimensions, building reusable detail fact tables for analysis and statistics, and summarizing indicators at public granularity. Given current business characteristics, only the DWD layer is established for now.

  • Detail-grained fact layer (DWD): With the business process as the modeling driver, based on the characteristics of each specific business process, build the most fine-grained detail-level fact table. Some important dimension attribute fields of the detailed fact table can be appropriately redundant in combination with the data usage characteristics of the enterprise, that is, wide table processing.

  • Data middle layer (DWM, Data Warehouse Middle): based on the DWD layer, this layer performs light aggregation and generates a series of intermediate tables, improving the reusability of public indicators and reducing repeated processing. Intuitively, it aggregates common core dimensions and computes the corresponding statistical indicators.

  • Public summary granularity fact layer (DWS): With the subject object of analysis as the modeling driver, based on the upper-level application and product indicator requirements, build a public granular summary indicator fact table, and physicalize the model by means of a wide table. Construct statistical indicators with naming conventions and consistent caliber, provide public indicators for the upper layer, and establish summary wide tables and detailed fact tables.

  • Common Dimension Layer (DIM): Based on the concept of dimensional modeling, establish a consistent dimension for the entire enterprise. Reduce the risk of inconsistent data calculation caliber and algorithms. The tables in the common dimension layer are usually also called logical dimension tables, and dimensions and dimension logical tables are usually in one-to-one correspondence.

  • Data application layer ADS (Application Data Service): stores the personalized statistical indicator data of data products, generated by processing CDM and ODS layer data.

Layer names and abbreviations:

  • Data introduction layer (ODS, Operation Data Store)
  • Data public layer (CDM, Common Data Model)
  • Common dimension layer (DIM, Dimension)
  • Data warehouse detail layer (DWD, Data Warehouse Detail)
  • Data summary layer (DWS, Data Warehouse Service)
  • Data application layer (ADS, Application Data Service)

1.5.2 Uses of each level

**1) Data introduction layer (ODS, Operation Data Store):** raw data is stored in the data warehouse system with almost no processing. Its structure is basically consistent with the source systems, and it is the data preparation area of the data warehouse. The raw data mainly consists of event tracking data (log data) and business operation data (binlog); the data sources are mainly MySQL, HDFS, Kafka, etc.

2) Data public layer (CDM, Common Data Model, also known as the common data model layer): includes the DIM dimension tables, DWD, and DWS, processed from ODS layer data. It mainly performs data processing and integration: establishing consistent dimensions, building reusable detail fact tables for analysis and statistics, and summarizing indicators at public granularity. This layer includes three sub-layers:

  • Common dimension layer (DIM):
  1. Based on the concept of dimensional modeling, establish a consistent dimension for the entire enterprise. Reduce the risk of inconsistent data calculation caliber and algorithms. The tables in the common dimension layer are usually also called logical dimension tables, and dimensions and dimension logical tables are usually in one-to-one correspondence.
  2. Three storage engines are mainly used: MySQL, HBase, and Redis. MySQL can be used when the dimension table data is relatively small. When individual records are small but query QPS is high, Redis can be used to reduce machine memory pressure. For scenarios where the data volume is relatively large and the application is not particularly sensitive to changes in dimension table data, HBase can be used.
  • Data Warehouse Detail Layer (DWD) :
  1. The ODS layer is cleaned and landed on this layer, which is generally the finest granularity.
  2. With the business process as the modeling driver, build the most fine-grained detail-level fact tables based on the characteristics of each specific business process. In combination with the enterprise's data usage patterns, some important dimension attribute fields can be appropriately made redundant in the detail fact tables, i.e., wide-table processing.
  • Data Summarization Layer (DWS):
  1. Slight aggregation of the DWD layer, and aggregation of some accumulative indicators to increase reusability.
  2. Taking the subject object of analysis as the modeling driver, based on the upper-level application and product index requirements, build a summary index fact table with public granularity, and physicalize the model by means of wide table. Construct statistical indicators with naming conventions and consistent caliber, provide public indicators for the upper layer, and establish summary wide tables and detailed fact tables. The tables of the public summary granularity fact layer are usually also called summary logical tables, which are used to store derived indicator data.

3) Data application layer (ADS, Application Data Service): stores the personalized statistical indicator data of data products, generated by processing CDM and ODS layer data.

1.6 Development Specifications

1.6.1 Naming rules

1) ods layer

Incremental data: {project_name}.ods_{data_source}_{source_table_name}_delta
Full data:       {project_name}.ods_{data_source}_{source_table_name}

Data source codes:
01 -> HDFS data
02 -> MySQL data
03 -> Redis data
04 -> MongoDB data
05 -> TiDB data

Examples:
Behavior log table: ods_01_action_log
User table: ods_02_user

2) dim layer

Public-area dimension table:       {project_name}.dim_pub_{custom_name_tag}
Business-specific dimension table: {project_name}.dim_{business_abbr}_{custom_name_tag}

Examples:
Public area dimension table: dim_pub_area
Public date dimension table: dim_pub_date
Full product table of company A's e-commerce line: dim_asale_itm

3) dwd layer

Cross-business common table:         {project_name}.dwd_pub_{custom_name_tag}
Business-specific incremental table: {project_name}.dwd_{business_abbr}_{custom_name_tag}_di
Business-specific full table:        {project_name}.dwd_{business_abbr}_{custom_name_tag}_df

Examples:
Transaction member info fact table: dwd_asale_trd_mbr_di
Transaction product info fact table: dwd_asale_trd_itm_di
Transaction order info fact table: dwd_asale_trd_ord_di

4) dws layer

Cross-business common table:   {project_name}.dws_pub_{custom_name_tag}
Last-1-day summary fact table: {project_name}.dws_{business_abbr}_{custom_name_tag}_1d
Last-N-days summary fact table: {project_name}.dws_{business_abbr}_{custom_name_tag}_nd
History-to-date summary table: {project_name}.dws_{business_abbr}_{custom_name_tag}_td
Hourly summary table:          {project_name}.dws_{business_abbr}_{custom_name_tag}_hh

Examples:
dws_asale_trd_byr_subpay_1d (company A, buyer granularity, staged-payment transactions, 1-day summary fact table)
dws_asale_trd_byr_subpay_td (company A, buyer granularity, staged payments, history-to-date summary table)
dws_asale_trd_byr_cod_nd (company A, buyer granularity, cash-on-delivery transactions, last-N-days summary fact table)
dws_asale_itm_slr_td (company A, seller granularity, product history-to-date inventory summary table)
dws_asale_itm_slr_hh (company A, seller granularity, product hourly summary table) --- hour granularity
dws_asale_itm_slr_mm (company A, seller granularity, product minute-level summary table) --- minute granularity

5) ads layer

{project_name}.ads_{business_abbr}_{custom_name_tag}

Examples:
Order statistics table: ads_nshop_order_form
Order payment statistics table: ads_nshop_orderpay_form

1.7 Misunderstandings of layering

Layering inside the data warehouse is not done for its own sake. Layering exists to solve problems such as organizing ETL tasks and workflows, controlling the flow of data, managing read and write permissions, and satisfying different needs.

A common practice in the industry is to divide the data warehouse (DW) into dwd, dwb, dws, dim, mid, and many other layers. In practice, however, we often cannot state the clear boundaries between these layers, or we can state them but complex business scenarios prevent us from actually enforcing them.

Therefore, the three layers of data layering, ODS, DWD, and DWS, are generally the most basic:

img

As for how to split the DW layer, it is defined according to the specific business needs and company scenarios. Generally speaking, it needs:

  • The purpose of layering is to organize the data flow and quickly support the business;
  • It should be organized along subject domains and business domains;
  • There must be no reverse dependencies between layers;
  • If a requirement can be supported with ODS layer data alone, the business side can use the landing layer directly, which also enables quick, low-cost data exploration and experimentation;
  • Once the layering specification is determined, it is best to follow it consistently, as an agreed team convention;
  • Data lineage, data dependencies, the data dictionary, data naming conventions, etc. should be supported first.

There is no single most correct way to layer a DW, only the way most suitable for you.

1.8 Misunderstandings of wide tables

Wide tables are introduced at the data warehouse (DW) layer. There is so far no precise definition of a wide table. A common practice is to join many dimensions, along with facts rolled up or drilled down from other fact tables, onto a certain fact table, forming a table that contains both a large number of dimensions and the related facts.

Using wide tables has a certain convenience: the user neither needs to handle joins with dimension tables nor needs to understand what dimension tables and fact tables are.
However, as the business grows, we still cannot predictably decide how many dimensions should be made redundant in a wide table, nor clearly define where the bottom line of dimension redundancy lies.

A possible situation is that in order to meet the usage requirements, the existing columns in the dimension table must be continuously added to the wide table. This directly leads to frequent changes in the table structure of the wide table.

What we currently do is:

  • According to subject domain and business domain, sort out all the nodes of a given business;
  • Use the data of key nodes as the basis of the fact table, then horizontally extend it with roll-up data from other fact tables (including some statistical indicators), and vertically add the dimensions corresponding to the primary keys on those nodes;
  • The design of wide tables does not depend on specific business requirements but matches the overall business line;
  • Prefer dimensional modeling over wide tables where possible.

Why prefer dimensional modeling over wide tables as much as possible? Even with redundant fields and data, dimensional modeling still represents the full data, and it works better than the wide-table mode. Reasons:

  • Dimensional modeling is based on established facts. Since it is a fact table, the granularity of the fact table basically does not change as long as that part of the business does not change;
  • The fact table and dimension tables are decoupled: changes to a dimension table basically do not affect the fact table, and only the downstream result tables need their data refreshed;
  • New dimensions can be added dynamically following the star or snowflake schema;
  • The dimensional model can serve as the basis of wide tables: once the overall data flow is determined, the corresponding wide tables can be regenerated from the dimensional model for fast business support.

2 dimension table and fact table

2.1 Dimension table

Dimension table: generally describes the facts. Each dimension table corresponds to an object or concept in the real world, for example: user, product, date, region, etc.

Dimension table characteristics:

  • Dimension tables are wide: they have many attributes (relatively many columns)
  • Compared with fact tables, the number of rows is relatively small: usually fewer than 100,000
  • The content is relatively fixed: e.g., code lists

Time dimension table:

date ID      day of week  day of year  quarter  holiday
2020-01-01   2            1            1        New Year
2020-01-02   3            2            1        none
2020-01-03   4            3            1        none
2020-01-04   5            4            1        none
2020-01-05   6            5            1        none
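A time dimension table like the one above could be declared in Hive roughly as follows. This is an illustrative sketch: the table name dim_pub_date (borrowed from the naming rules earlier) and the column names are assumptions, not project definitions.

```sql
-- Illustrative sketch only: a date dimension table matching the sample rows above.
CREATE TABLE IF NOT EXISTS dim_pub_date
(
    `date_id`     string COMMENT 'date, e.g. 2020-01-01',
    `day_of_week` int    COMMENT 'day of week, 1-7',
    `day_of_year` int    COMMENT 'day of year, 1-366',
    `quarter`     int    COMMENT 'quarter, 1-4',
    `holiday`     string COMMENT 'holiday name, or none'
)
COMMENT 'public date dimension table'
STORED AS orc;
```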

2.2 Fact table

Fact tables: every data warehouse contains one or more fact tables. Fact tables may contain business sales data, such as records produced by cash-register transactions, and they typically contain a large number of rows. The main characteristic of a fact table is that it contains numerical data (facts), and this numerical information can be aggregated to provide historical information about the business unit. Each fact table contains a multi-part key made up of foreign keys that reference the primary keys of the related dimension tables, while the dimension tables hold the descriptive attributes of the fact records. Fact tables should not contain descriptive information, nor any data other than the numeric measure fields and the key fields that relate the facts to the dimension tables.

A fact table contains two types of "measures": those that can be rolled up (accumulated) and those that cannot. The most useful measures are additive ones, whose sums are meaningful; users obtain aggregate information by accumulating them, e.g., summarizing the sales of a specific item across a group of stores over a specific time period. Non-additive measures can also appear in fact tables, but their aggregated results are generally meaningless: for example, when measuring temperature at different locations in a building, adding up the temperatures at all the locations is meaningless, but averaging them makes sense.

Generally speaking, a fact table is associated with one or more dimension tables, and users can use one or more dimension tables when building cubes from a fact table.

Each row of data in the fact table represents a business event (an order, payment, refund, review, etc.). "Fact" refers to the measures of a business event (countable times, quantities, amounts, etc.). For example, on May 21, 2020, Mr. Song Song bought a bottle of sea dog ginseng pills on JD.com for 250 yuan. Dimension tables: time, user, product, merchant. Facts: 250 yuan, one bottle.

Each row of a fact table includes: additive numeric measure values and foreign keys connecting to dimension tables, usually two or more foreign keys.

Characteristics of a fact table:

  • Very large
  • Relatively narrow content: few columns (mainly foreign key IDs and measure values)
  • Changes frequently, with many new rows added every day

1) Transactional fact table

Each transaction or event is taken as a unit, such as a sales order record or a payment record, and becomes a row in the fact table. Once the transaction is committed and the row is inserted, the data no longer changes; the update method is incremental.

2) Periodic snapshot fact table

Periodic snapshot fact tables do not retain all data, but only data at fixed time intervals, such as daily or monthly sales, or monthly account balances.

For example, shopping carts, with addition and subtraction of products, may change at any time, but we are more concerned about how many products are there at the end of each day, which is convenient for our later statistical analysis.
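A daily snapshot load for the shopping cart example could be sketched as follows; the tables fact_cart_snapshot and cart and their columns are hypothetical, not part of the project schema.

```sql
-- Sketch: persist one row per user/product as of end of day into a daily partition.
INSERT OVERWRITE TABLE fact_cart_snapshot PARTITION (dt = '2022-11-18')
SELECT
    user_id,
    sku_id,
    quantity          -- quantity in the cart at snapshot time
FROM cart;
```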

3) Cumulative snapshot fact table

A cumulative snapshot fact table is used to track changes to business facts. For example, the data warehouse may need to accumulate or store the point-in-time data of each business stage of the order from the time the order is placed to the time when the order is packaged, shipped, and signed for to track the progress of the order statement cycle. When this business process is in progress, the records in the fact table are also constantly updated.

order id  user id  order time  packing time  delivery time  sign-for time  order amount
...       ...      3-8         3-8           3-9            3-10           ...
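Refreshing such a table can be sketched as overlaying today's milestone events onto yesterday's snapshot. All table and column names below (fact_order_accumulate, ods_order_milestones) are hypothetical assumptions for illustration.

```sql
-- Sketch: rows for in-flight orders are rewritten each day as milestones fill in.
INSERT OVERWRITE TABLE fact_order_accumulate PARTITION (dt = '2022-11-19')
SELECT
    prev.order_id,
    prev.user_id,
    prev.order_time,
    COALESCE(delta.packing_time,  prev.packing_time)  AS packing_time,
    COALESCE(delta.delivery_time, prev.delivery_time) AS delivery_time,
    COALESCE(delta.sign_time,     prev.sign_time)     AS sign_time,
    prev.order_amount
FROM (SELECT * FROM fact_order_accumulate WHERE dt = '2022-11-18') prev
LEFT JOIN ods_order_milestones delta
       ON prev.order_id = delta.order_id;
```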

3 Data Warehouse Modeling Planning

3.1 ODS layer

How do we plan and process user behavior data and business data on HDFS?

(1) Keep the original appearance of the data without any modification, and play the role of backing up the data.

(2) The data is compressed to reduce disk storage space (for example: the original data is 100G, which can be compressed to about 10G)

(3) Create a partition table to prevent subsequent full table scans

3.2 DIM layer and DWD layer

The DIM layer and the DWD layer need dimensional models, which generally adopt the star schema; the overall shape that emerges is usually a constellation schema.

Dimensional modeling generally follows the following four steps:

Select business process -> Declare granularity -> Determine dimensions -> Determine facts

(1) Select business process

In the business system, select the business line we are interested in, such as order business, payment business, refund business, logistics business, and a business line corresponds to a fact table.

(2) Declare granularity

Data granularity refers to the level of refinement or comprehensiveness of data stored in the data warehouse.

Declaring granularity means precisely defining what a row of data in the fact table represents. The smallest granularity should be selected as much as possible to meet various needs.

A typical granularity declaration looks like this:

A row of data in the order fact table represents a commodity item in an order.

A row of data in the payment fact table represents a payment record.
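The first declaration could correspond to a detail fact table whose grain is one commodity item per order. A hedged sketch follows; the table and column names are illustrative, not project definitions.

```sql
-- Sketch: a detail fact table at "one commodity item in one order" granularity.
CREATE TABLE IF NOT EXISTS dwd.fact_order_item
(
    `order_id`     string COMMENT 'foreign key to the order',
    `sku_id`       string COMMENT 'foreign key to the commodity',
    `user_id`      string COMMENT 'foreign key to the user',
    `quantity`     int    COMMENT 'number of pieces of this item',
    `final_amount` DECIMAL(11,2) COMMENT 'amount actually paid for this item'
)
COMMENT 'order detail fact table; one row = one commodity item in one order'
PARTITIONED BY (dt string)
STORED AS orc;
```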

(3) Determine the dimension

The main role of dimensions is to describe business facts, mainly expressing information such as "who, where, and when".

The principle for determining dimensions is: whether to analyze indicators of related dimensions in subsequent requirements. For example, it is necessary to make statistics on when the most orders were placed, which region placed the most orders, and which user placed the most orders. Dimensions that need to be determined include: time dimension, region dimension, and user dimension.

(4) Determine the facts

The word "fact" here refers to the measurement value in the business (times, number, number of pieces, amount, which can be accumulated), such as order amount, order number, etc.

At the DWD layer, the business process is used as the modeling drive, and based on the characteristics of each specific business process, the most fine-grained fact table of the detail layer is constructed. Fact tables can be appropriately widened.

The association between the fact table and the dimension table is relatively flexible, but in order to meet more complex business requirements, you can associate the tables that can be associated as much as possible.

(dimensions: time, user, area, merchandise, coupon, activity)

Business process    Measures
Order               shipping fee / discount amount / original amount / final amount
Order details       number of pieces / discount amount / original amount / final amount
Payment             payment amount
Add to cart         number of pieces / amount
Favorite            count
Review              count
Order cancellation  number of pieces / amount
Refund              number of pieces / amount
Coupon collection   count

So far, the dimensional modeling of the data warehouse has been completed, and the DWD layer is driven by business processes.

The DWS layer, DWT layer, and ADS layer are all demand-driven, and have nothing to do with dimensional modeling.

Both DWS and DWT build wide tables organized by theme. A theme is like an angle from which the problem is observed, and corresponds to a dimension table.

3.3 DWS layer and DWT layer

The DWS layer and the DWT layer are collectively referred to as the wide table layer. The design ideas of the two layers are roughly the same, as illustrated by the following case.

1) The problem: two requirements, count the number of orders in each province, and count the total order amount in each province.

2) Naive approach: join the province table with the order table, group by province, and then compute. The same data is computed twice, and in practice there are many more similar scenarios.

So how can the design avoid double counting?

For the above scenario, a region wide table can be designed whose primary key is the region ID and whose fields include: order count, order amount, payment count, payment amount, etc. All of these indicators are computed once and the results saved in the wide table, which effectively avoids repeated computation.
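A sketch of such a province-level wide table computation follows; the table names (dws_region_wide, dwd_fact_order, dim_province) and columns are assumptions for illustration, not the project's actual schema.

```sql
-- Sketch: compute all province-level indicators once and persist them,
-- so each downstream requirement reads the wide table instead of re-joining.
INSERT OVERWRITE TABLE dws_region_wide PARTITION (dt = '2022-11-18')
SELECT
    p.province_id,
    COUNT(o.order_id)                          AS order_count,  -- number of orders
    SUM(o.order_amount)                        AS order_amount, -- total order amount
    COUNT(IF(o.is_paid = 1, o.order_id, NULL)) AS pay_count,    -- number of payments
    SUM(IF(o.is_paid = 1, o.order_amount, 0))  AS pay_amount    -- payment amount
FROM dwd_fact_order o
JOIN dim_province p ON o.province_id = p.province_id
WHERE o.dt = '2022-11-18'
GROUP BY p.province_id;
```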

3) Summary:

(1) Which wide tables need to be built: based on dimensions.

(2) Fields in the wide table: look at the fact table from the perspective of different dimensions, focusing on the aggregated measure values of the fact table.

(3) The difference between the DWS and DWT layers: the DWS layer stores the day's summary behavior of each subject object, such as the number of orders and the order amount in each region that day, while the DWT layer stores the cumulative behavior of each subject object, such as the number and amount of orders in each region over the last 7 days (or 15, 30, 60 days).
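The DWT-style cumulative tables can be derived from the DWS daily tables. A hedged sketch under assumed table and column names (dws_region_order_1d, dwt_region_order_nd):

```sql
-- Sketch: roll a daily DWS table up into rolling 7-day and 30-day windows.
INSERT OVERWRITE TABLE dwt_region_order_nd PARTITION (dt = '2022-11-18')
SELECT
    province_id,
    SUM(IF(dt >= date_sub('2022-11-18', 6),  order_count, 0)) AS order_count_7d,
    SUM(IF(dt >= date_sub('2022-11-18', 29), order_count, 0)) AS order_count_30d
FROM dws_region_order_1d
WHERE dt >= date_sub('2022-11-18', 29)   -- scan only the days inside the largest window
GROUP BY province_id;
```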

3.4 ADS layer

Analyze the major thematic indicators of the e-commerce system separately.

4 Hive Data Warehouse Combat

img

img

1 Construction of ODS layer

1.1 ODS Layer Functions and Responsibilities

1) Keep the original appearance of the data without any modification, and play the role of backing up the data.

2) Data is compressed by LZO to reduce disk storage space. 100G data can be compressed to within 10G.

3) Create a partition table to prevent subsequent full table scans, and use partition tables extensively in enterprise development.

4) Create external tables. In enterprise development, apart from creating internal tables as temporary tables for one's own use, most scenarios create external tables.

img

1.2 ODS layer construction - data import - full coverage

No partitions are needed; each synchronization first deletes and then writes, directly overwriting.

Applicable to situations where there will not be any new additions or changes to the data.

For example, dimension data such as region dictionary tables, time, and gender rarely or never changes, and only the latest values need to be kept.

Here we take the t_district area dictionary table as an example to explain.

DROP TABLE if exists yp_ods.t_district;
CREATE TABLE yp_ods.t_district
(
    `id` string COMMENT 'primary key ID',
    `code` string COMMENT 'district code',
    `name` string COMMENT 'district name',
    `pid`  int COMMENT 'parent ID',
    `alias` string COMMENT 'alias'
)
comment 'district dictionary table'
row format delimited fields terminated by '\t' 
stored as orc tblproperties ('orc.compress'='ZLIB');

sqoop data synchronization

Because the tables are stored in ORC format, the HCatalog API is required when using Sqoop to import data.

-- The data in the target table can be truncated before the Sqoop import
truncate table yp_ods.t_district;

Method 1: import with a single map task
sqoop import  \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select * from t_district where \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_district \
--m 1

1.3 ODS Layer Construction – Data Import – Incremental Synchronization

A new date partition is added every day, and the new data of the day is synchronized and stored.

For example, login log table, access log table, transaction record table, commodity evaluation table, order evaluation table, etc.

Here we take the t_user_login login log table as an example to explain.

DROP TABLE if exists yp_ods.t_user_login;
CREATE TABLE if not exists yp_ods.t_user_login(
   id string,
   login_user string,
   login_type string COMMENT 'login type (used when logging in)',
   client_id string COMMENT 'push identifier id (used for login, third-party login, registration, payment callbacks, and pushing messages to users)',
   login_time string,
   login_ip string,
   logout_time string
) 
COMMENT 'user login record table'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc tblproperties ('orc.compress' = 'ZLIB');

sqoop data synchronization

  • First time (full amount)

1. Regardless of the mode, the first synchronization is a full load; in subsequent periodic synchronizations, the scope of the synchronized data is controlled via the where condition.

2. ${TD_DATE} represents the partition date, which should normally be the day before today: jobs usually run after midnight to process the previous day's work, so the data's partition field belongs to the previous day.

3. For demonstration purposes, ${TD_DATE} is hardcoded first.
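A quick shell check of how such a partition date can be derived with GNU date; the fixed reference date is only there to make the output reproducible.

```shell
# Compute the ODS partition date as "yesterday" relative to a fixed reference day.
# The reference date 2022-11-19 is an assumption chosen so the output is deterministic.
REF_DATE='2022-11-19'
TD_DATE=$(date -d "${REF_DATE} -1 day" +%Y-%m-%d)
echo "${TD_DATE}"   # prints 2022-11-18, the value hardcoded in the sqoop --query below
```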

sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *,'2022-11-18' as dt from t_user_login where  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_user_login \
--m 1
  • Periodic (incremental synchronization)
#!/bin/bash
date -s '2022-11-20'  # simulate: set the system date so the increment for the 19th is imported

# Assume today is 2022-11-20, so yesterday is 2022-11-19
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_user_login where 1=1 and (login_time between '${TD_DATE} 00:00:00' and 
'${TD_DATE} 23:59:59') and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_user_login \
-m 1

1.4 ODS layer construction - data import - new and updated synchronization

A new date partition is added every day, and the new and updated data of the day is synchronized and stored.

Applicable to both new and updated data, such as user tables, order tables, product tables, etc.

Here we take the t_store shop table as an example to explain.

drop table if exists yp_ods.t_store;
CREATE TABLE yp_ods.t_store
(
    `id`                 string COMMENT 'primary key',
    `user_id`            string,
    `store_avatar`       string COMMENT 'store avatar',
    `address_info`       string COMMENT 'detailed store address',
    `name`               string COMMENT 'store name',
    `store_phone`        string COMMENT 'contact phone',
    `province_id`        INT COMMENT 'ID of the province the store is in',
    `city_id`            INT COMMENT 'ID of the city the store is in',
    `area_id`            INT COMMENT 'ID of the county the store is in',
    `mb_title_img`       string COMMENT 'mobile store page header background image',
    `store_description` string COMMENT 'store description',
    `notice`             string COMMENT 'store announcement',
    `is_pay_bond`        TINYINT COMMENT 'whether a deposit has been paid; 1: yes, 0: no',
    `trade_area_id`      string COMMENT 'ID of the trade area the store belongs to',
    `delivery_method`    TINYINT COMMENT 'delivery method; 1: pickup; 3: both pickup and delivery; 2: merchant delivery',
    `origin_price`       DECIMAL,
    `free_price`         DECIMAL,
    `store_type`         INT COMMENT 'store type; 22: Tianjie online store, 23: physical store, 24: directly operated store, 33: member-zone store',
    `store_label`        string COMMENT 'store logo',
    `search_key`         string COMMENT 'store search keywords',
    `end_time`           string COMMENT 'business closing time',
    `start_time`         string COMMENT 'business opening time',
    `operating_status`   TINYINT COMMENT 'operating status; 0: closed, 1: open',
    `create_user`        string,
    `create_time`        string,
    `update_user`        string,
    `update_time`        string,
    `is_valid`           TINYINT COMMENT '0: closed, 1: open, 3: store application in progress',
    `state`              string COMMENT 'usable payment types: MONEY cash payment; CASHCOUPON cash coupon payment',
    `idCard`             string COMMENT 'ID card',
    `deposit_amount`     DECIMAL(11,2) COMMENT 'total trade-area subscription fee',
    `delivery_config_id` string COMMENT 'related ID in the delivery configuration table',
    `aip_user_id`        string COMMENT 'Allinpay payment identifier ID',
    `search_name`        string COMMENT 'fuzzy-search name field: name_ + real name',
    `automatic_order`    TINYINT COMMENT 'whether auto-accepting orders is enabled; 1: yes, 0: no',
    `is_primary`         TINYINT COMMENT 'whether this is the head store; 1: yes, 2: no',
    `parent_store_id`    string COMMENT 'id of the parent store, valid only when is_primary is 2'
)
comment 'store table'
partitioned by (dt string) 
row format delimited fields terminated by '\t' 
stored as orc tblproperties ('orc.compress'='ZLIB');

sqoop data synchronization

The key to implementing new-and-updated synchronization is that the table has two time-related fields:

create_time: the creation time; never modified once set

update_time: the update time; modified whenever the data changes

The scope of the synchronized data is controlled via the where condition.

  • First time (full load)
sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *,'2022-11-18' as dt  from t_store where 1=1 and \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_store \
-m 1
  • Periodic (incremental)
#!/bin/bash
date -s '2022-11-20'
TD_DATE=`date -d '1 days ago' "+%Y-%m-%d"`
/usr/bin/sqoop import "-Dorg.apache.sqoop.splitter.allow_text_splitter=true" \
--connect 'jdbc:mysql://192.168.88.80:3306/yipin?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true' \
--username root \
--password 123456 \
--query "select *, '${TD_DATE}' as dt from t_store where 1=1 and ((create_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59') or (update_time between '${TD_DATE} 00:00:00' and '${TD_DATE} 23:59:59')) and  \$CONDITIONS" \
--hcatalog-database yp_ods \
--hcatalog-table t_store \
-m 1

Finally, all ODS layer tables are imported from MySQL.


1.5 Summary

This section introduced the construction of the ODS layer of the Hive new retail data warehouse project, mainly in three ways:

  1. full coverage synchronization
  2. incremental synchronization
  3. new-and-updated synchronization

2 Construction of DWD layer

2.1 Functions and Responsibilities of DWD Layer

Clean the data in the ODS layer tables according to the data cleaning rules, applying them as the actual situation requires.

Note: if a cleaning rule can be implemented in SQL, implement it in SQL; if a rule is very cumbersome or impossible to express in SQL, consider cleaning the data with MapReduce or Spark code instead.
  • The dwd layer is called the detailed data layer in Chinese.

  • The main function:

    • Data cleaning and transformation, providing quality assurance;
    • Distinguish between facts and dimensions.
  • Table name specification

    dwd.fact_xxxxxx

    Order main table, order settlement, order group, order refund, order product snapshot, shopping cart, store collection, etc.

    dwd.dim_yyyyyy

    User, area, time, store, business district, address information, product, product category, brand, etc.

2.2 DWD layer construction - regional dimension table - full coverage import

DROP TABLE if EXISTS yp_dwd.dim_district;
CREATE TABLE yp_dwd.dim_district(
  id string COMMENT '主键ID', 
  code string COMMENT '区域编码', 
  name string COMMENT '区域名称', 
  pid string COMMENT '父级ID', 
  alias string COMMENT '别名')
COMMENT '区域字典表'
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');

Full coverage operation

INSERT overwrite TABLE yp_dwd.dim_district
select * from yp_ods.t_district
WHERE code IS NOT NULL AND name IS NOT NULL;

2.3 DWD layer construction - order evaluation form - incremental import

-- Note: each incremental load of data is stored in its own partition
DROP TABLE if EXISTS yp_dwd.fact_goods_evaluation;
CREATE TABLE yp_dwd.fact_goods_evaluation(
  id string, 
  user_id string COMMENT '评论人id', 
  store_id string COMMENT '店铺id', 
  order_id string COMMENT '订单id', 
  geval_scores int COMMENT '综合评分', 
  geval_scores_speed int COMMENT '送货速度评分0-5分(配送评分)', 
  geval_scores_service int COMMENT '服务评分0-5分', 
  geval_isanony tinyint COMMENT '0-匿名评价,1-非匿名', 
  create_user string, 
  create_time string, 
  update_user string, 
  update_time string, 
  is_valid tinyint COMMENT '0 :失效,1 :开启')
COMMENT '订单评价表'
partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as orc 
tblproperties ('orc.compress' = 'SNAPPY');
  • First import (full load)
-- load from the ODS layer
INSERT overwrite TABLE yp_dwd.fact_goods_evaluation PARTITION(dt)
select 
   id,
   user_id,
   store_id,
   order_id,
   geval_scores,
   geval_scores_speed,
   geval_scores_service,
   geval_isanony,
   create_user,
   create_time,
   update_user,
   update_time,
   is_valid,
   substr(create_time, 1, 10) as dt  
from yp_ods.t_goods_evaluation;
  • Incremental import operations
INSERT into TABLE yp_dwd.fact_goods_evaluation PARTITION(dt)
select 
   id,
   user_id,
   store_id,
   order_id,
   geval_scores,
   geval_scores_speed,
   geval_scores_service,
   geval_isanony,
   create_user,
   create_time,
   update_user,
   update_time,
   is_valid,
   substr(create_time, 1, 10) as dt
from yp_ods.t_goods_evaluation
where dt='2022-11-19';

2.4 DWD layer construction - order fact table - loop + zipper import

The zipper table (slowly changing dimension table) is a key interview topic: if you are interviewing for a data-warehouse-related position, interviewers are particularly fond of asking about it.

DROP TABLE if EXISTS yp_dwd.fact_shop_order;
CREATE TABLE if not exists yp_dwd.fact_shop_order(  -- zipper table
  id string COMMENT '根据一定规则生成的订单编号',
  order_num string COMMENT '订单序号',
  buyer_id string COMMENT '买家的userId',
  store_id string COMMENT '店铺的id',
  order_from string COMMENT '此字段可以转换 1.安卓\; 2.ios\; 3.小程序H5 \; 4.PC',
  order_state int COMMENT '订单状态:1.已下单\; 2.已付款, 3. 已确认 \;4.配送\; 5.已完成\; 6.退款\;7.已取消',
  create_date string COMMENT '下单时间',
  finnshed_time timestamp COMMENT '订单完成时间,当配送员点击确认送达时,进行更新订单完成时间,后期需要根据订单完成时间,进行自动收货以及自动评价',
  is_settlement tinyint COMMENT '是否结算\;0.待结算订单\; 1.已结算订单\;',
  is_delete tinyint COMMENT '订单评价的状态:0.未删除\;  1.已删除\;(默认0)',
  evaluation_state tinyint COMMENT '订单评价的状态:0.未评价\;  1.已评价\;(默认0)',
  way string COMMENT '取货方式:SELF自提\;SHOP店铺负责配送',
  is_stock_up int COMMENT '是否需要备货 0:不需要    1:需要    2:平台确认备货  3:已完成备货 4平台已经将货物送至店铺 ',
  create_user string,
  create_time string,
  update_user string,
  update_time string,
  is_valid tinyint COMMENT '是否有效  0: false\; 1: true\;   订单是否有效的标志',
  end_date string COMMENT '拉链结束日期'
) COMMENT '订单表'
partitioned by (start_date string)
row format delimited fields terminated by '\t'
stored as orc tblproperties ('orc.compress' = 'SNAPPY');


  • First import
  • For a dynamic partition insert, don't forget the relevant parameters (set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict;).
  • If a field of the ODS layer table is an enumeration type, it can be converted with a case when statement during the ETL from ODS to DWD.
INSERT overwrite TABLE yp_dwd.fact_shop_order PARTITION (start_date)
SELECT 
   id,
   order_num,
   buyer_id,
   store_id,
   case order_from 
      when 1
      then 'android'
      when 2
      then 'ios'
      when 3
      then 'miniapp'
      when 4
      then 'pcweb'
      else 'other'
      end
      as order_from,
   order_state,
   create_date,
   finnshed_time,
   is_settlement,
   is_delete,
   evaluation_state,
   way,
   is_stock_up,
   create_user,
   create_time,
   update_user,
   update_time,
   is_valid,
   '9999-99-99' end_date,
    dt as start_date
FROM yp_ods.t_shop_order;


  • Zipper operation
insert overwrite table yp_dwd.fact_shop_order partition (start_date)
select *
from (
   -- 1. the new partition of the ODS table (new and updated data)
         select id,
                order_num,
                buyer_id,
                store_id,
                case order_from
                    when 1
                        then 'android'
                    when 2
                        then 'ios'
                    when 3
                        then 'miniapp'
                    when 4
                        then 'pcweb'
                    else 'other'
                    end
                    as order_from,
                order_state,
                create_date,
                finnshed_time,
                is_settlement,
                is_delete,
                evaluation_state,
                way,
                is_stock_up,
                create_user,
                create_time,
                update_user,
                update_time,
                is_valid,
                '9999-99-99' end_date,
               '2022-11-19' as start_date
         from yp_ods.t_shop_order
         where dt='2022-11-19'

         union all

    -- 2. historical zipper data; use the joined id to decide whether to update the end_time validity period
         select
             fso.id,
             fso.order_num,
             fso.buyer_id,
             fso.store_id,
             fso.order_from,
             fso.order_state,
             fso.create_date,
             fso.finnshed_time,
             fso.is_settlement,
             fso.is_delete,
             fso.evaluation_state,
             fso.way,
             fso.is_stock_up,
             fso.create_user,
             fso.create_time,
             fso.update_user,
             fso.update_time,
             fso.is_valid,
             -- 3. update end_time: if no changed row matched, or the row is already expired history, keep the original end_time; otherwise set end_time to the day before the batch date (valid until yesterday)
             if (tso.id is not null and fso.end_date='9999-99-99',date_add(tso.dt, -1), fso.end_date) end_time,
             fso.start_date
         from yp_dwd.fact_shop_order fso left join (select * from yp_ods.t_shop_order where dt='2022-11-19') tso
         on fso.id=tso.id
     ) his
order by his.id, start_date;
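The union all + left join logic above can be sketched in plain Python (a hypothetical illustration with simplified columns, not the project's actual code): every changed id closes its currently open row by setting end_date to the day before the batch date, and the new version is appended as an open row.

```python
from datetime import date, timedelta

OPEN_END = "9999-99-99"  # open-row sentinel used by the table above

def zipper_merge(history, batch, batch_dt):
    """history: rows with id/value/start_date/end_date.
    batch: new/updated rows for batch_dt. Returns the new zipper table."""
    changed_ids = {r["id"] for r in batch}
    merged = []
    for row in history:
        row = dict(row)
        # close the currently-open row of every changed id
        if row["id"] in changed_ids and row["end_date"] == OPEN_END:
            d = date.fromisoformat(batch_dt) - timedelta(days=1)
            row["end_date"] = d.isoformat()
        merged.append(row)
    # append the new version of each changed row as an open row
    for r in batch:
        merged.append({"id": r["id"], "value": r["value"],
                       "start_date": batch_dt, "end_date": OPEN_END})
    return sorted(merged, key=lambda r: (r["id"], r["start_date"]))

history = [{"id": "o1", "value": "paid", "start_date": "2022-11-18",
            "end_date": OPEN_END}]
batch = [{"id": "o1", "value": "finished"}]
merged = zipper_merge(history, batch, "2022-11-19")
```

After the merge, the old "paid" row is valid from 2022-11-18 to 2022-11-18, and the new "finished" row is open from 2022-11-19, which is exactly what the if(...) on end_date in the SQL above produces.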

2.5 Summary

This section introduced the construction of the DWD layer of the Hive new retail project, mainly in three ways:

  1. full coverage import
  2. incremental import
  3. loop + zipper import

3 Construction of DWS layer

3.1 Functions and Responsibilities of DWS Layer

DWS layer: statistical analysis organized by subject; this layer generally performs the finest-grained statistical operations.

3.1.1 Dimension combination:

  • date

  • date + city

  • date + city + business district

  • date + city + business district + store

  • date + brand

  • date + category

  • date + major category + medium category

  • date + major category + medium category + minor category

3.1.2 Indicators:

Sales revenue, platform revenue, delivery turnover, mini-program turnover, Android APP turnover, Apple APP turnover, PC mall turnover, order volume, review volume, negative review volume, delivery volume, refund volume, mini-program orders, Android APP orders, Apple APP orders, and PC mall orders.

3.2 Wide Table of Sales Theme Statistics

Finally, a group_type field is required to identify which dimension combination each indicator row was aggregated from.

3.3 Summary

The Grouping sets model is suitable for building sparse, multi-dimension, multi-indicator wide tables: different dimension combinations can be placed in the same wide table, which makes later querying convenient, and when building the aggregated fields the aggregation can be customized per dimension, which is more flexible.
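As a hedged sketch of the idea (plain Python with invented dimensions, not Hive's actual execution), GROUPING SETS can be pictured as running one aggregation per dimension combination over the same detail rows, with group_type recording which combination produced each output row:

```python
from collections import defaultdict

def grouping_sets(rows, dim_sets, measure):
    """Aggregate `measure` over every dimension combination in dim_sets,
    tagging each output row with the combination it came from."""
    out = []
    for dims in dim_sets:
        acc = defaultdict(float)
        for r in rows:
            key = tuple(r[d] for d in dims)
            acc[key] += r[measure]
        for key, total in acc.items():
            row = dict(zip(dims, key))
            row["group_type"] = "+".join(dims)  # which dimension combo this is
            row["total"] = total
            out.append(row)
    return out

detail = [
    {"dt": "2022-11-19", "city": "bj", "store": "s1", "amount": 10.0},
    {"dt": "2022-11-19", "city": "bj", "store": "s2", "amount": 5.0},
    {"dt": "2022-11-19", "city": "sh", "store": "s3", "amount": 7.0},
]
result = grouping_sets(detail, [("dt",), ("dt", "city")], "amount")
```

The date-only rows and the date+city rows land in the same result set, distinguished only by group_type, which is the sparse-wide-table shape the DWS layer query produces.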

The Full join model mainly targets situations with few dimensions and many indicators. Its main idea is:

  1. Use the with statement to extract the key fields of the dwb_order_detail table
  2. First count the data of 6 small result tables
  3. Full join the data of the 6 small result tables
  4. Extract data from the result table of full join
  5. Deduplication, remove the duplicate data of date and goods_id
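The five steps can be sketched in Python as well (hypothetical table names and metrics, not the project's real columns): several small result tables share a (date, goods_id) key, and a full outer join stitches them into one wide row per key:

```python
def full_outer_join(tables):
    """tables: list of dicts mapping a shared key -> metric dict.
    Returns one wide dict per key, like a SQL FULL JOIN of small result tables."""
    all_keys = set()
    for t in tables:
        all_keys |= t.keys()
    wide = {}
    for k in sorted(all_keys):
        row = {}
        for t in tables:
            row.update(t.get(k, {}))  # a missing side behaves like NULL
        wide[k] = row
    return wide

# three small result tables keyed by (dt, goods_id)
sales  = {("2022-11-19", "g1"): {"sale_amount": 100.0}}
orders = {("2022-11-19", "g1"): {"order_cnt": 3},
          ("2022-11-19", "g2"): {"order_cnt": 1}}
refund = {("2022-11-19", "g2"): {"refund_cnt": 1}}
wide = full_outer_join([sales, orders, refund])
```

A key present in only one small table still produces a wide row, with the other metrics simply absent; that is what FULL JOIN's NULL-padding achieves in SQL, and keying every small table by the same (date, goods_id) is what makes the final deduplication step work.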


Origin blog.csdn.net/An1090239782/article/details/128796976