Local Data Warehouse Project (2) - Detailed Process of Building a Business Data Warehouse

1 Description

Building on the business data from "Local Data Warehouse Project (1) - Detailed Process of Local Data Warehouse Construction", this article builds a business data warehouse locally.
Generate the simulated business data by executing the provided SQL scripts in order.

The SQL scripts are available here:

Link: https://pan.baidu.com/s/1AhLIuTNIyJ_GBD7M0b2RoA
Extraction code: 1lm8

After the scripts run, the business tables in MySQL are populated with the generated data.

2 Import business data into the data warehouse

In the previous article, "Local Data Warehouse Project (1) - Detailed Process of Building a Local Data Warehouse", the overall process of data collection and analysis was completed. For the business data warehouse, Sqoop is used to import the data from MySQL into HDFS.

2.1 Install Sqoop

2.1.1 Unzip and rename

tar -zxvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop-1.4.6

2.1.2 Configure the SQOOP_HOME environment variable

SQOOP_HOME=/root/soft/sqoop-1.4.6
PATH=$PATH:$JAVA_HOME/bin:$SHELL_HOME:$FLUME_HOME/bin:$HIVE_HOME/bin:$KAFKA_HOME/bin:$ZOOKEEPER_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SQOOP_HOME/bin
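
Assuming the variables above are defined in /etc/profile, reload it afterwards so the sqoop command is on the PATH:

source /etc/profile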

2.1.3 Configure sqoop-env.sh

mv sqoop-env-template.sh sqoop-env.sh
export HADOOP_COMMON_HOME=/root/soft/hadoop-2.7.2
export HADOOP_MAPRED_HOME=/root/soft/hadoop-2.7.2
export HIVE_HOME=/root/soft/hive
export ZOOKEEPER_HOME=/root/soft/zookeeper-3.4.10
export ZOOCFGDIR=/root/soft/zookeeper-3.4.10

2.1.4 Copy the jdbc driver of mysql to the lib directory of sqoop
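
For example, assuming the connector jar is named mysql-connector-java-5.1.27-bin.jar (the exact file name depends on the version downloaded):

cp mysql-connector-java-5.1.27-bin.jar /root/soft/sqoop-1.4.6/lib/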

2.1.5 Test the connection

bin/sqoop list-databases --connect jdbc:mysql://192.168.2.100:3306/ --username root --password 123456

If a list of databases is printed, the Sqoop installation succeeded.

2.2 Import data into HDFS with Sqoop

The following script uses Sqoop to import the business tables from MySQL into HDFS. The first argument selects a single table (or all for every table) and the second argument is the date to import, so the script can be scheduled to run once a day.

#!/bin/bash

db_date=$2
echo $db_date
db_name=gmall

# $1: table name (also the HDFS directory name), $2: query that selects the data to import
import_data() {
/root/soft/sqoop-1.4.6/bin/sqoop import \
--connect jdbc:mysql://192.168.2.100:3306/$db_name \
--username root \
--password 123456 \
--target-dir /origin_data/$db_name/db/$1/$db_date \
--delete-target-dir \
--num-mappers 1 \
--fields-terminated-by "\t" \
--query "$2"' and $CONDITIONS;' \
--null-string '\\N' \
--null-non-string '\\N'
}

import_sku_info(){
  import_data "sku_info" "select 
id, spu_id, price, sku_name, sku_desc, weight, tm_id,
category3_id, create_time
  from sku_info where 1=1"
}

import_user_info(){
  import_data "user_info" "select 
id, name, birthday, gender, email, user_level, 
create_time 
from user_info where 1=1"
}

import_base_category1(){
  import_data "base_category1" "select 
id, name from base_category1 where 1=1"
}

import_base_category2(){
  import_data "base_category2" "select 
id, name, category1_id from base_category2 where 1=1"
}

import_base_category3(){
  import_data "base_category3" "select id, name, category2_id from base_category3 where 1=1"
}

import_order_detail(){
  import_data "order_detail" "select 
    od.id, 
    order_id, 
    user_id, 
    sku_id, 
    sku_name, 
    order_price, 
    sku_num, 
    o.create_time  
  from order_info o, order_detail od
  where o.id=od.order_id
  and DATE_FORMAT(o.create_time,'%Y-%m-%d')='$db_date'"
}

import_payment_info(){
  import_data "payment_info" "select 
    id,  
    out_trade_no, 
    order_id, 
    user_id, 
    alipay_trade_no, 
    total_amount,  
    subject, 
    payment_type, 
    payment_time 
  from payment_info 
  where DATE_FORMAT(payment_time,'%Y-%m-%d')='$db_date'"
}

import_order_info(){
  import_data "order_info" "select 
    id, 
    total_amount, 
    order_status, 
    user_id, 
    payment_way, 
    out_trade_no, 
    create_time, 
    operate_time  
  from order_info 
  where (DATE_FORMAT(create_time,'%Y-%m-%d')='$db_date' or DATE_FORMAT(operate_time,'%Y-%m-%d')='$db_date')"
}

case $1 in
  "base_category1")
     import_base_category1
;;
  "base_category2")
     import_base_category2
;;
  "base_category3")
     import_base_category3
;;
  "order_info")
     import_order_info
;;
  "order_detail")
     import_order_detail
;;
  "sku_info")
     import_sku_info
;;
  "user_info")
     import_user_info
;;
  "payment_info")
     import_payment_info
;;
   "all")
   import_base_category1
   import_base_category2
   import_base_category3
   import_order_info
   import_order_detail
   import_sku_info
   import_user_info
   import_payment_info
;;
esac

Note:
① By default, Sqoop converts MySQL NULL values into the string 'null' when importing.
② Hive uses \N to represent NULL.
③ To convert MySQL NULL values into something else during import, use --null-string and --null-non-string:

--null-string a: if a MySQL column is a string type (varchar, char) and its value is NULL, store a in its place when importing into Hive.
--null-non-string b: if a MySQL column is a non-string type and its value is NULL, store b in its place when importing into Hive.

④ To convert specific values back to MySQL NULL when exporting, use --input-null-string and --input-null-non-string:

--input-null-string a: when exporting from Hive to MySQL, if a string column's value is a, write NULL to MySQL instead.
--input-null-non-string b: when exporting from Hive to MySQL, if a non-string column's value is b, write NULL to MySQL instead.
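
For example, a sketch of a Sqoop export that maps Hive's \N back to MySQL NULL (the table name and export directory here are illustrative, not part of the original workflow):

/root/soft/sqoop-1.4.6/bin/sqoop export \
--connect jdbc:mysql://192.168.2.100:3306/gmall \
--username root \
--password 123456 \
--table ads_gmv_sum_day \
--export-dir /wavehouse/gmall/ads/ads_gmv_sum_day \
--input-fields-terminated-by "\t" \
--input-null-string '\\N' \
--input-null-non-string '\\N'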

Execute the script to import the data.
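
Assuming the script above is saved as, say, sqoop_import.sh, a full import for one day looks like this; a crontab entry along the same lines can schedule it nightly (paths and times are illustrative):

chmod +x sqoop_import.sh
./sqoop_import.sh all 2023-01-04

# Illustrative crontab entry: import yesterday's data at 01:00 every night
0 1 * * * /root/bin/sqoop_import.sh all `date -d "-1 day" +\%F`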

3 ODS layer

3.1 Create ODS tables

3.1.1 Create order table

drop table if exists ods_order_info;
create external table ods_order_info (
    `id` string COMMENT 'order id',
    `total_amount` decimal(10,2) COMMENT 'order amount',
    `order_status` string COMMENT 'order status',
    `user_id` string COMMENT 'user id',
    `payment_way` string COMMENT 'payment method',
    `out_trade_no` string COMMENT 'payment transaction number',
    `create_time` string COMMENT 'creation time',
    `operate_time` string COMMENT 'operation time'
) COMMENT 'order table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ods/ods_order_info/';

3.1.2 Create order detail table

drop table if exists ods_order_detail;
create external table ods_order_detail( 
    `id` string COMMENT 'order detail id',
    `order_id` string COMMENT 'order id', 
    `user_id` string COMMENT 'user id',
    `sku_id` string COMMENT 'sku id',
    `sku_name` string COMMENT 'sku name',
    `order_price` string COMMENT 'sku unit price',
    `sku_num` string COMMENT 'sku quantity',
    `create_time` string COMMENT 'creation time'
) COMMENT 'order detail table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t' 
location '/wavehouse/gmall/ods/ods_order_detail/';

3.1.3 Create commodity information table

drop table if exists ods_sku_info;
create external table ods_sku_info( 
    `id` string COMMENT 'sku id',
    `spu_id` string COMMENT 'spu id', 
    `price` decimal(10,2) COMMENT 'price',
    `sku_name` string COMMENT 'sku name',
    `sku_desc` string COMMENT 'sku description',
    `weight` string COMMENT 'weight',
    `tm_id` string COMMENT 'brand id',
    `category3_id` string COMMENT 'category id',
    `create_time` string COMMENT 'creation time'
) COMMENT 'sku table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ods/ods_sku_info/';

3.1.4 Create user table

drop table if exists ods_user_info;
create external table ods_user_info( 
    `id` string COMMENT 'user id',
    `name` string COMMENT 'name',
    `birthday` string COMMENT 'birthday',
    `gender` string COMMENT 'gender',
    `email` string COMMENT 'email',
    `user_level` string COMMENT 'user level',
    `create_time` string COMMENT 'creation time'
) COMMENT 'user info table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ods/ods_user_info/';

3.1.5 Create first-level category table

drop table if exists ods_base_category1;
create external table ods_base_category1( 
    `id` string COMMENT 'id',
    `name` string COMMENT 'name'
) COMMENT 'first-level category'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ods/ods_base_category1/';

3.1.6 Create second-level category table

drop table if exists ods_base_category2;
create external table ods_base_category2( 
    `id` string COMMENT 'id',
    `name` string COMMENT 'name',
    `category1_id` string COMMENT 'first-level category id'
) COMMENT 'second-level category'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ods/ods_base_category2/';

3.1.7 Create third-level category table

drop table if exists ods_base_category3;
create external table ods_base_category3(
    `id` string COMMENT 'id',
    `name` string COMMENT 'name',
    `category2_id` string COMMENT 'second-level category id'
) COMMENT 'third-level category'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ods/ods_base_category3/';

3.1.8 Create payment flow table

drop table if exists ods_payment_info;
create external table ods_payment_info(
    `id`   bigint COMMENT 'id',
    `out_trade_no`    string COMMENT 'external trade number',
    `order_id`        string COMMENT 'order id',
    `user_id`         string COMMENT 'user id',
    `alipay_trade_no` string COMMENT 'Alipay transaction number',
    `total_amount`    decimal(16,2) COMMENT 'payment amount',
    `subject`         string COMMENT 'transaction subject',
    `payment_type`    string COMMENT 'payment type',
    `payment_time`    string COMMENT 'payment time'
   )  COMMENT 'payment info table'
PARTITIONED BY (`dt` string)
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ods/ods_payment_info/';

3.2 Import data

load data inpath '/origin_data/gmall/db/order_info/2023-01-04' OVERWRITE into table gmall.ods_order_info partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/order_info/2023-01-05' OVERWRITE into table gmall.ods_order_info partition(dt='2023-01-05');

load data inpath '/origin_data/gmall/db/order_detail/2023-01-04' OVERWRITE into table gmall.ods_order_detail partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/order_detail/2023-01-05' OVERWRITE into table gmall.ods_order_detail partition(dt='2023-01-05');

load data inpath '/origin_data/gmall/db/sku_info/2023-01-04' OVERWRITE into table gmall.ods_sku_info partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/sku_info/2023-01-05' OVERWRITE into table gmall.ods_sku_info partition(dt='2023-01-05');

load data inpath '/origin_data/gmall/db/user_info/2023-01-04' OVERWRITE into table gmall.ods_user_info partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/user_info/2023-01-05' OVERWRITE into table gmall.ods_user_info partition(dt='2023-01-05');

load data inpath '/origin_data/gmall/db/payment_info/2023-01-04' OVERWRITE into table gmall.ods_payment_info partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/payment_info/2023-01-05' OVERWRITE into table gmall.ods_payment_info partition(dt='2023-01-05');

load data inpath '/origin_data/gmall/db/base_category1/2023-01-04' OVERWRITE into table gmall.ods_base_category1 partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/base_category1/2023-01-05' OVERWRITE into table gmall.ods_base_category1 partition(dt='2023-01-05');

load data inpath '/origin_data/gmall/db/base_category2/2023-01-04' OVERWRITE into table gmall.ods_base_category2 partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/base_category2/2023-01-05' OVERWRITE into table gmall.ods_base_category2 partition(dt='2023-01-05');

load data inpath '/origin_data/gmall/db/base_category3/2023-01-04' OVERWRITE into table gmall.ods_base_category3 partition(dt='2023-01-04');
load data inpath '/origin_data/gmall/db/base_category3/2023-01-05' OVERWRITE into table gmall.ods_base_category3 partition(dt='2023-01-05');

The statements above can be wrapped in a script that takes the date as a parameter and is scheduled to run once a day.
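
A minimal sketch of such a wrapper, following the same pattern as the DWD import script in section 4.2 (the script name ods_db.sh is illustrative):

#!/bin/bash
# Illustrative ODS loading script: takes the date as $1, defaults to yesterday
APP=gmall
hive=/root/soft/hive/bin/hive

if [ -n "$1" ]; then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi

sql="
load data inpath '/origin_data/$APP/db/order_info/$do_date' OVERWRITE into table $APP.ods_order_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/order_detail/$do_date' OVERWRITE into table $APP.ods_order_detail partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/sku_info/$do_date' OVERWRITE into table $APP.ods_sku_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/user_info/$do_date' OVERWRITE into table $APP.ods_user_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/payment_info/$do_date' OVERWRITE into table $APP.ods_payment_info partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category1/$do_date' OVERWRITE into table $APP.ods_base_category1 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category2/$do_date' OVERWRITE into table $APP.ods_base_category2 partition(dt='$do_date');
load data inpath '/origin_data/$APP/db/base_category3/$do_date' OVERWRITE into table $APP.ods_base_category3 partition(dt='$do_date');
"
$hive -e "$sql"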

4 DWD layer

4.1 Create DWD tables

4.1.1 Create order table

drop table if exists dwd_order_info;
create external table dwd_order_info (
    `id` string COMMENT '',
    `total_amount` decimal(10,2) COMMENT '',
    `order_status` string COMMENT ' 1 2 3 4 5',
    `user_id` string COMMENT 'id',
    `payment_way` string COMMENT '',
    `out_trade_no` string COMMENT '',
    `create_time` string COMMENT '',
    `operate_time` string COMMENT ''
) 
PARTITIONED BY (`dt` string)
stored as parquet
location '/wavehouse/gmall/dwd/dwd_order_info/'
tblproperties ("parquet.compression"="snappy");

4.1.2 Create order details table

drop table if exists dwd_order_detail;
create external table dwd_order_detail( 
    `id` string COMMENT '',
    `order_id` string COMMENT '', 
    `user_id` string COMMENT 'id',
    `sku_id` string COMMENT 'id',
    `sku_name` string COMMENT '',
    `order_price` string COMMENT '',
    `sku_num` string COMMENT '',
    `create_time` string COMMENT ''
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/wavehouse/gmall/dwd/dwd_order_detail/'
tblproperties ("parquet.compression"="snappy");

4.1.3 Create user table

drop table if exists dwd_user_info;
create external table dwd_user_info( 
    `id` string COMMENT 'id',
    `name` string COMMENT '', 
    `birthday` string COMMENT '',
    `gender` string COMMENT '',
    `email` string COMMENT '',
    `user_level` string COMMENT '',
    `create_time` string COMMENT ''
) 
PARTITIONED BY (`dt` string)
stored as parquet
location '/wavehouse/gmall/dwd/dwd_user_info/'
tblproperties ("parquet.compression"="snappy");

4.1.4 Create payment flow table

drop table if exists dwd_payment_info;
create external table dwd_payment_info(
    `id`   bigint COMMENT '',
    `out_trade_no`    string COMMENT '',
    `order_id`        string COMMENT '',
    `user_id`         string COMMENT '',
    `alipay_trade_no` string COMMENT '',
    `total_amount`    decimal(16,2) COMMENT '',
    `subject`         string COMMENT '',
    `payment_type`    string COMMENT '',
    `payment_time`    string COMMENT ''
   )  
PARTITIONED BY (`dt` string)
stored as parquet
location '/wavehouse/gmall/dwd/dwd_payment_info/'
tblproperties ("parquet.compression"="snappy");

4.1.5 Create commodity (SKU) table

drop table if exists dwd_sku_info;
create external table dwd_sku_info(
    `id` string COMMENT 'skuId',
    `spu_id` string COMMENT 'spuid',
    `price` decimal(10,2) COMMENT '',
    `sku_name` string COMMENT '',
    `sku_desc` string COMMENT '',
    `weight` string COMMENT '',
    `tm_id` string COMMENT 'id',
    `category3_id` string COMMENT 'category3 id',
    `category2_id` string COMMENT 'category2 id',
    `category1_id` string COMMENT 'category1 id',
    `category3_name` string COMMENT 'category3 name',
    `category2_name` string COMMENT 'category2 name',
    `category1_name` string COMMENT 'category1 name',
    `create_time` string COMMENT ''
) 
PARTITIONED BY (`dt` string)
stored as parquet
location '/wavehouse/gmall/dwd/dwd_sku_info/'
tblproperties ("parquet.compression"="snappy");

4.2 Import data

#!/bin/bash
# Define variables for easier modification
APP=gmall
hive=/root/soft/hive/bin/hive

# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
	do_date=$1
else 
	do_date=`date -d "-1 day" +%F`  
fi 

sql="

set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table "$APP".dwd_order_info partition(dt)
select * from "$APP".ods_order_info 
where dt='$do_date' and id is not null;
 
insert overwrite table "$APP".dwd_order_detail partition(dt)
select * from "$APP".ods_order_detail 
where dt='$do_date'   and id is not null;

insert overwrite table "$APP".dwd_user_info partition(dt)
select * from "$APP".ods_user_info
where dt='$do_date' and id is not null;
 
insert overwrite table "$APP".dwd_payment_info partition(dt)
select * from "$APP".ods_payment_info
where dt='$do_date' and id is not null;

insert overwrite table "$APP".dwd_sku_info partition(dt)
select  
    sku.id,
    sku.spu_id,
    sku.price,
    sku.sku_name,
    sku.sku_desc,
    sku.weight,
    sku.tm_id,
    sku.category3_id,
    c2.id category2_id,
    c1.id category1_id,
    c3.name category3_name,
    c2.name category2_name,
    c1.name category1_name,
    sku.create_time,
    sku.dt
from
    "$APP".ods_sku_info sku
join "$APP".ods_base_category3 c3 on sku.category3_id=c3.id 
    join "$APP".ods_base_category2 c2 on c3.category2_id=c2.id 
    join "$APP".ods_base_category1 c1 on c2.category1_id=c1.id 
where sku.dt='$do_date'  and c2.dt='$do_date'
and c3.dt='$do_date' and c1.dt='$do_date'
and sku.id is not null;
"
$hive -e "$sql"
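
Assuming the script above is saved as, say, ods_to_dwd_db.sh and made executable, run it with the date to process:

chmod +x ods_to_dwd_db.sh
./ods_to_dwd_db.sh 2023-01-04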


5 DWS layer

5.1 User Behavior Wide Table

The goal is to aggregate each user's daily behavior into a single wide, multi-column row, so that after joining user dimension information the data can be analyzed statistically from different angles.

5.1.1 Create a user behavior wide table

drop table if exists dws_user_action;
create external table dws_user_action 
(   
    user_id         string         comment 'user id',
    order_count     bigint         comment 'order count',
    order_amount    decimal(16,2)  comment 'order amount',
    payment_count   bigint         comment 'payment count',
    payment_amount  decimal(16,2)  comment 'payment amount',
    comment_count   bigint         comment 'comment count'
) COMMENT 'daily user action wide table'
PARTITIONED BY (`dt` string)
stored as parquet
location '/wavehouse/gmall/dws/dws_user_action/';

5.1.2 Import data

with 
tmp_order as
(
    select 
        user_id, 
        count(*) order_count,
        sum(oi.total_amount) order_amount
    from dwd_order_info oi
    where date_format(oi.create_time,'yyyy-MM-dd')='2023-01-04'
    group by user_id
) ,
tmp_payment as
(
    select
        user_id, 
        sum(pi.total_amount) payment_amount, 
        count(*) payment_count 
    from dwd_payment_info pi 
    where date_format(pi.payment_time,'yyyy-MM-dd')='2023-01-04'
    group by user_id
),
tmp_comment as
(
    select
        user_id,
        count(*) comment_count
    from dwd_comment_log c
    where date_format(c.dt,'yyyy-MM-dd')='2023-01-04'
    group by user_id
)

insert overwrite table dws_user_action partition(dt='2023-01-04')
select
    user_actions.user_id,
    sum(user_actions.order_count),
    sum(user_actions.order_amount),
    sum(user_actions.payment_count),
    sum(user_actions.payment_amount),
    sum(user_actions.comment_count)
from 
(
    select
        user_id,
        order_count,
        order_amount,
        0 payment_count,
        0 payment_amount,
        0 comment_count
    from tmp_order

    union all
    select
        user_id,
        0,
        0,
        payment_count,
        payment_amount,
        0
    from tmp_payment

    union all
    select
        user_id,
        0,
        0,
        0,
        0,
        comment_count
    from tmp_comment
 ) user_actions
group by user_id;
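
A quick sanity check of the loaded partition (illustrative query):

select * from dws_user_action where dt='2023-01-04' limit 10;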

6 Requirements

6.1 Requirement 1

Compute the GMV total.
GMV (gross merchandise volume) is the total transaction amount within a given period (for example a day, a week, or a month).

drop table if exists ads_gmv_sum_day;
create external table ads_gmv_sum_day(
    `dt` string COMMENT 'statistics date',
    `gmv_count`  bigint COMMENT 'number of orders that day',
    `gmv_amount`  decimal(16,2) COMMENT 'total order amount that day',
    `gmv_payment`  decimal(16,2) COMMENT 'payment amount that day'
) COMMENT 'GMV'
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ads/ads_gmv_sum_day/';

insert data

INSERT INTO TABLE ads_gmv_sum_day
SELECT 
'2023-01-04' dt, 
sum(order_count) gmv_count, 
sum(order_amount) gmv_amount, 
sum(payment_amount) gmv_payment
FROM 
dws_user_action
WHERE dt='2023-01-04'
GROUP BY dt;

6.2 Requirement 2

User freshness and conversion-funnel analysis

6.2.1 ADS layer: ratio of new devices to daily active devices (user freshness)

build table

drop table if exists ads_user_convert_day;
create external table ads_user_convert_day( 
    `dt` string COMMENT 'statistics date',
    `uv_m_count`  bigint COMMENT 'active devices that day',
    `new_m_count`  bigint COMMENT 'new devices that day',
    `new_m_ratio`   decimal(10,2) COMMENT 'ratio of new devices to daily active devices'
) COMMENT 'conversion rate'
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ads/ads_user_convert_day/';

Data import. The ads_uv_count (daily active devices) and ads_new_mid_count (new devices) tables referenced below were built in part (1).

insert into table ads_user_convert_day
select
    '2023-01-04' dt,
    sum(uc.dc) sum_dc,
    sum(uc.nmc) sum_nmc,
    cast(sum(uc.nmc)/sum(uc.dc)*100 as decimal(10,2)) new_m_ratio
from 
(
    select
        day_count dc,
        0 nmc
    from ads_uv_count
	where dt='2023-01-04'
    union all
    select
        0 dc,
        new_mid_count nmc
    from ads_new_mid_count
    where create_date='2023-01-04'
)uc;


6.2.2 ADS layer: user behavior funnel analysis

create table

drop table if exists ads_user_action_convert_day;
create external  table ads_user_action_convert_day(
    `dt` string COMMENT 'statistics date',
    `total_visitor_m_count`  bigint COMMENT 'total visitors',
    `order_u_count` bigint     COMMENT 'users who placed orders',
    `visitor2order_convert_ratio`  decimal(10,2) COMMENT 'visit-to-order conversion rate',
    `payment_u_count` bigint     COMMENT 'users who paid',
    `order2payment_convert_ratio` decimal(10,2) COMMENT 'order-to-payment conversion rate'
 ) COMMENT 'user behavior funnel analysis'
row format delimited  fields terminated by '\t'
location '/wavehouse/gmall/ads/ads_user_action_convert_day/';

insert data

insert into table ads_user_action_convert_day
select 
    '2023-01-04',
    uv.day_count,
    ua.order_count,
    cast(ua.order_count/uv.day_count as  decimal(10,2)) visitor2order_convert_ratio,
    ua.payment_count,
    cast(ua.payment_count/ua.order_count as  decimal(10,2)) order2payment_convert_ratio
from  
(
	select 
    	dt,
        sum(if(order_count>0,1,0)) order_count,
        sum(if(payment_count>0,1,0)) payment_count
    from dws_user_action
	where dt='2023-01-04'
	group by dt
)ua join ads_uv_count  uv on uv.dt=ua.dt;


6.3 Requirement 3

Brand repurchase rate
Requirement: on a monthly basis, count for each brand the users who purchased it two or more times, and compute their share of all its buyers. For example, if 100 users bought a brand in a month and 30 of them bought it at least twice, the repurchase rate is 30%.

6.3.1 Create table in DWS layer

drop table if exists dws_sale_detail_daycount;
create external table dws_sale_detail_daycount
(   
    user_id string comment 'user id',
    sku_id string comment 'sku id',
    user_gender string comment 'user gender',
    user_age string comment 'user age',
    user_level string comment 'user level',
    order_price decimal(10,2) comment 'sku price',
    sku_name string comment 'sku name',
    sku_tm_id string comment 'brand id',
    sku_category3_id string comment 'third-level category id',
    sku_category2_id string comment 'second-level category id',
    sku_category1_id string comment 'first-level category id',
    sku_category3_name string comment 'third-level category name',
    sku_category2_name string comment 'second-level category name',
    sku_category1_name string comment 'first-level category name',
    spu_id string comment 'spu id',
    sku_num int comment 'quantity purchased',
    order_count string comment 'orders placed that day',
    order_amount string comment 'order amount that day'
) COMMENT 'user purchase detail table'
PARTITIONED BY (`dt` string)
stored as parquet
location '/wavehouse/gmall/dws/dws_user_sale_detail_daycount/'
tblproperties ("parquet.compression"="snappy");

data import

with
tmp_detail as
(
    select
        user_id,
        sku_id, 
        sum(sku_num) sku_num,   
        count(*) order_count, 
        sum(od.order_price*sku_num) order_amount
    from dwd_order_detail od
    where od.dt='2023-01-05'
    group by user_id, sku_id
)  
insert overwrite table dws_sale_detail_daycount partition(dt='2023-01-05')
select 
    tmp_detail.user_id,
    tmp_detail.sku_id,
    u.gender,
    months_between('2023-01-05', u.birthday)/12  age, 
    u.user_level,
    price,
    sku_name,
    tm_id,
    category3_id,
    category2_id,
    category1_id,
    category3_name,
    category2_name,
    category1_name,
    spu_id,
    tmp_detail.sku_num,
    tmp_detail.order_count,
    tmp_detail.order_amount 
from tmp_detail 
left join dwd_user_info u on tmp_detail.user_id =u.id and u.dt='2023-01-05'
left join dwd_sku_info s on tmp_detail.sku_id =s.id and s.dt='2023-01-05';

6.3.2 ADS layer

build table

drop table if exists ads_sale_tm_category1_stat_mn;
create external table ads_sale_tm_category1_stat_mn
(   
    tm_id string comment 'brand id',
    category1_id string comment 'first-level category id',
    category1_name string comment 'first-level category name',
    buycount bigint comment 'number of buyers',
    buy_twice_last bigint comment 'buyers with two or more purchases',
    buy_twice_last_ratio decimal(10,2) comment 'repurchase rate',
    buy_3times_last bigint comment 'buyers with three or more purchases',
    buy_3times_last_ratio decimal(10,2) comment 'multiple-repurchase rate',
    stat_mn string comment 'statistics month',
    stat_date string comment 'statistics date'
)   COMMENT 'repurchase rate statistics'
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ads/ads_sale_tm_category1_stat_mn/';

insert data

insert into table ads_sale_tm_category1_stat_mn
select   
    mn.sku_tm_id,
    mn.sku_category1_id,
    mn.sku_category1_name,
    sum(if(mn.order_count>=1,1,0)) buycount,
    sum(if(mn.order_count>=2,1,0)) buy_twice_last,
    sum(if(mn.order_count>=2,1,0))/sum(if(mn.order_count>=1,1,0)) buy_twice_last_ratio,
    sum(if(mn.order_count>=3,1,0)) buy_3times_last,
    sum(if(mn.order_count>=3,1,0))/sum(if(mn.order_count>=1,1,0)) buy_3times_last_ratio,
    date_format('2023-01-04','yyyy-MM') stat_mn,
    '2023-01-04' stat_date
from 
(
    select 
        user_id, 
        sd.sku_tm_id,
        sd.sku_category1_id,
        sd.sku_category1_name,
        sum(order_count) order_count
    from dws_sale_detail_daycount sd 
    where date_format(dt,'yyyy-MM')=date_format('2023-01-04','yyyy-MM')
    group by user_id, sd.sku_tm_id, sd.sku_category1_id, sd.sku_category1_name
) mn
group by mn.sku_tm_id, mn.sku_category1_id, mn.sku_category1_name;


6.4 Requirement 4

Rank the top ten products by repurchase rate within each user level.
Create a table

drop table if exists ads_ul_rep_ratio;
create table ads_ul_rep_ratio(
    user_level string comment 'user level',
    sku_id string comment 'sku id',
    buy_count bigint comment 'total buyers',
    buy_twice_count bigint comment 'buyers with two or more purchases',
    buy_twice_rate decimal(10,2) comment 'second-purchase repurchase rate',
    rank string comment 'rank',
    state_date string comment 'statistics date'
) COMMENT 'repurchase rate statistics'
row format delimited fields terminated by '\t'
location '/wavehouse/gmall/ads/ads_ul_rep_ratio/';

insert data

with 
tmp_count as(
  select -- order count per user per sku within each user level
    user_level,
    user_id,
    sku_id,
    sum(order_count) order_count
  from dws_sale_detail_daycount
  where dt<='2023-01-04'
  group by user_level, user_id, sku_id
)
insert overwrite table ads_ul_rep_ratio
select
  *
from(
  select
    user_level,
    sku_id,
    sum(if(order_count >=1, 1, 0)) buy_count,
    sum(if(order_count >=2, 1, 0)) buy_twice_count,
    sum(if(order_count >=2, 1, 0)) / sum(if(order_count >=1, 1, 0)) * 100  buy_twice_rate,
    row_number() over(partition by user_level order by sum(if(order_count >=2, 1, 0)) / sum(if(order_count >=1, 1, 0)) desc) rn,
    '2023-01-04'
  from tmp_count
  group by user_level, sku_id
) t1
where rn<=10;

Next come the data visualization and task scheduling of the local data warehouse project; for details, see "Local Data Warehouse Project (3) - Data Visualization and Task Scheduling".
