Data warehouse ETL (shell + SQL): code templates for summary statistics by day, week, month, and quarter

Table of Contents

0 Preface

1 Summary of date usage in the shell

1.1 Basic grammar

1.2 date shows the current time

1.3 date displays non-current time

1.4 date Set system time

2 Statistics by day

3 Statistics by week

4 Monthly statistics

5 Statistics by quarter

6 Summary



0 Preface

In data warehouse ETL, we often need to compute batch statistics over several time dimensions: day, week, month, and quarter. The usual development pattern is to embed SQL in a shell script, so that the script can be run as a scheduled task and shell functions can take the place of SQL stored procedures. In this article the day, week, month, and quarter boundaries are all computed with the shell's date command, which simplifies the SQL and makes the code easier to maintain.

In the shell, dates and times are usually obtained with date -d or date --date.
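The "SQL inside a shell script" pattern described above can be sketched as a small shell function that builds the statement from a date argument. This is only an illustration: the table names demo_db.daily_in and demo_db.daily_out and the column device_id are hypothetical and do not come from the original templates.

```shell
#!/bin/bash
# Sketch: a shell function standing in for a stored procedure.
# Table and column names below are hypothetical placeholders.
build_daily_sql() {
    day="$1"   # partition date, YYYY-MM-DD
    cat <<EOF
insert overwrite table demo_db.daily_out
partition (compute_day='${day}')
select device_id, count(*) as cnt
from demo_db.daily_in
where compute_day='${day}'
group by device_id;
EOF
}

# Build the statement for one day and print it instead of running it.
sql=$(build_daily_sql "2020-10-25")
echo "${sql}"
```

In a real job the generated text would be passed to hive -e, as the templates in the following sections do.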

1 Summary of date usage in the shell

1.1 Basic grammar

1) Basic grammar

date [OPTION]... [+FORMAT]

2) Option description

Table 1-20

Option              Function
-d<time string>     Display the time indicated by the specified "time string" instead of the current time
-s<date time>       Set the system date and time

3) Parameter description

Table 1-21

Parameter              Function
<+date time format>    Specify the date and time format used when displaying

1.2 date shows the current time

1) Basic grammar

(1) date (function description: display the current time)

(2) date +%Y (function description: display the current year)

(3) date +%m (function description: display the current month)

(4) date +%d (function description: display the current day of the month)

(5) date "+%Y-%m-%d %H:%M:%S" (function description: display year, month, day, hour, minute, and second)

2) Case practice

(1) Display current time information

[root@bigdata-1 ~]# date

2020年 10月 26日 星期一 13:40:45 CST

(2) Display the current year, month, and day

[root@bigdata-1 ~]# date +%Y%m%d

20201026

(3) Display the current year, month, day, hour, minute, and second

[root@bigdata-1 ~]# date "+%Y-%m-%d %H:%M:%S"

2020-10-26 13:42:15
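The templates later in this article rely on a few more format specifiers (%F, %w, %V). As a small sketch, assuming GNU date, they can be tried on a fixed reference date so that the result is reproducible; the variable names are only illustrative.

```shell
#!/bin/bash
# Format specifiers used by the templates, applied to a fixed date
# (2020-10-26, a Monday). Requires GNU date for the -d option.
ref="2020-10-26"
iso_date=$(date -d "${ref}" +%F)   # shorthand for +%Y-%m-%d
weekday=$(date -d "${ref}" +%w)    # day of week, 0 = Sunday ... 6 = Saturday
iso_week=$(date -d "${ref}" +%V)   # ISO week number, weeks start on Monday
month=$(date -d "${ref}" +%m)      # two-digit month
echo "${iso_date} weekday=${weekday} week=${iso_week} month=${month}"
```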

1.3 date displays non-current time

1) Basic grammar

(1) date -d '1 days ago' (function description: display the time one day ago)

(2) date -d '-1 days ago' (function description: display the time one day ahead, i.e. tomorrow)

2) Case practice

(1) Display the previous day

[root@bigdata-1 ~]# date -d '1 days ago'

2020年 10月 25日 星期日 13:42:45 CST
[root@bigdata-1 ~]# date -d '-1 days'

2020年 10月 25日 星期日 13:43:25 CST

[root@bigdata-1 ~]# date -d 'last day'

2020年 10月 25日 星期日 13:43:51 CST


(2) Display tomorrow's time

[root@bigdata-1 ~]# date -d '1 days'

2020年 10月 27日 星期二 13:44:53 CST

[root@bigdata-1 ~]# date -d '-1 days ago'

2020年 10月 27日 星期二 13:44:28 CST

[root@bigdata-1 ~]# date -d 'next day'

2020年 10月 27日 星期二 13:47:37 CST

(3) Display a time relative to a specified date

[root@bigdata-1 ~]# date -d '2020-10-26 3 months ago'

2020年 07月 26日 星期日 00:00:00 CST
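The relative expressions above can be wrapped in a small helper so that scripts compute offsets from an explicit base date rather than from "now". This is a sketch assuming GNU date; the function name shift_day is hypothetical.

```shell
#!/bin/bash
# Sketch: shift a base date by a signed number of days (GNU date).
shift_day() {
    # $1 = base date (YYYY-MM-DD), $2 = signed offset in days, e.g. -1 or +1
    date -d "$1 $2 days" +%F
}

yesterday=$(shift_day 2020-10-26 -1)
tomorrow=$(shift_day 2020-10-26 +1)
three_months_ago=$(date -d "2020-10-26 3 months ago" +%F)
echo "${yesterday} ${tomorrow} ${three_months_ago}"
```

Passing the base date explicitly also makes it easy to re-run a job for a past date.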

1.4 date Set system time

1) Basic grammar

       date -s "time string"

2) Case practice

       (1) Set the current time of the system

[root@bigdata-1 ~]# date -s "2020-10-26 13:52:18" 

2 Statistics by day

#!/bin/bash

# 1. Get the date
lastday=$(date --date '-1 days' +%F) # yesterday's date (today's run processes yesterday's data)
if [ "$1" != "" ]; then
    lastday=$1 # an explicit date argument overrides the default
fi
# 2. Define variables
hive='/usr/idp/current/hive-client/bin/hive'
APP=phmdwdb
input_table="${APP}.input_table_name"
output_table="${APP}.output_table_name"

# 3. Write the SQL (xxx and the dotted lines are placeholders to fill in)
sql="
insert overwrite table ${output_table}
PARTITION (compute_day='${lastday}')
select
      xxx
     ,xxx
     ,xxx
from ${input_table}
where compute_day='${lastday}'
...........
...........
;
"

# 4. Execute the SQL

${hive} -e "${sql}" >>/tmp/${output_table}.log 2>&1
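Because the daily template accepts a date as its first argument, a range of past days can be backfilled by calling it once per day. The following is a minimal sketch assuming GNU date; the function name each_day and the script name daily_stat.sh are hypothetical.

```shell
#!/bin/bash
# Sketch: enumerate every date from $1 to $2 inclusive, one per line.
each_day() {
    d="$1"
    # Compare as YYYYMMDD integers, which sort chronologically.
    while [ "$(date -d "${d}" +%Y%m%d)" -le "$(date -d "$2" +%Y%m%d)" ]; do
        echo "${d}"
        d=$(date -d "${d} +1 days" +%F)
    done
}

days=$(each_day 2020-10-30 2020-11-02)
for day in ${days}; do
    # daily_stat.sh is a hypothetical name for the daily template above.
    echo "would run: ./daily_stat.sh ${day}"
done
```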

3 Statistics by week

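#!/bin/bash

# The weekly table depends on the daily table, so the input table is the daily table.
# 1. Get the dates. The job runs every Monday and computes last week's statistics.
today=$(date +%F) # current date
if [ "$1" != "" ]; then
    today=$1 # optional override; set before the dates below are derived from it
fi
start_week=$(date -d "${today} -7 days" +%F) # Monday of last week
end_week=$(date -d "${today} -1 days" +%F) # Sunday of last week
day=$(date -d "${today}" +%w) # day of week matched in the if condition (1 = Monday)
compute_week=$(date -d "${start_week}" +%V) # %V: weeks start on Monday. %U: weeks start on Sunday

# 2. Define variables
hive='/usr/idp/current/hive-client/bin/hive'
APP=phmdwdb
input_table="${APP}.input_table_name"
output_table="${APP}.output_table_name"

# 3. Write the SQL (xxx and the dotted lines are placeholders to fill in)
sql="
insert overwrite table ${output_table}
PARTITION (compute_week='${compute_week}')
select
      xxx
     ,xxx
     ,xxx
from ${input_table}
where compute_day>='${start_week}' and compute_day<'${today}' -- the weekly statistics depend on the daily table, so filter on compute_day
...........
...........
;
"

# 4. Execute the SQL, only on Mondays

if [ "${day}" = "1" ]; then

   ${hive} -e "${sql}" >>/tmp/${output_table}.log 2>&1
else
   echo 'Statistics for the previous week are only computed on Mondays'
fi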
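The weekly template derives last week's range by assuming the run date is a Monday. As a hedged variation, the same range can be derived from any date by first stepping back to the Monday of the current week. This is a sketch assuming GNU date; the function name last_week_range is hypothetical.

```shell
#!/bin/bash
# Sketch: print "<monday> <sunday> <iso_week>" for the week before the
# one that contains the date given in $1.
last_week_range() {
    dow=$(date -d "$1" +%u)                              # 1 = Monday ... 7 = Sunday
    this_monday=$(date -d "$1 -$((dow - 1)) days" +%F)   # Monday of the current week
    monday=$(date -d "${this_monday} -7 days" +%F)
    sunday=$(date -d "${this_monday} -1 days" +%F)
    echo "${monday} ${sunday} $(date -d "${monday}" +%V)"
}

last_week_range 2020-10-26   # a Monday
last_week_range 2020-10-28   # a Wednesday of the same week
```

Both calls describe the same previous week, so a late or manual re-run would still target the right partition.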

4 Monthly statistics

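#!/bin/bash

# The monthly table depends on the daily table, so the input table is the daily table.
# 1. Get the dates. The job runs on the first day of each month and computes last month's statistics.
today=$(date +%Y-%m-%d) # current date
if [ "$1" != "" ]; then
    today=$1 # optional override; set before the dates below are derived from it
fi
start_date=$(date -d "${today} -1 days" +%Y-%m-01) # first day of last month (when today is the 1st)
end_date=$(date -d "${today} last day" +%Y-%m-%d) # last day of last month
day=$(date -d "${today}" +%d) # day of month to match; the job only runs on the 1st, when this is 01
compute_month=$(date -d "${start_date}" +%Y-%m) # key of the static partition: last month

# 2. Define variables
hive='/usr/idp/current/hive-client/bin/hive'
APP=phmdwdb
input_table="${APP}.input_table_name"
output_table="${APP}.output_table_name"

# 3. Write the SQL (xxx and the dotted lines are placeholders to fill in)
sql="
insert overwrite table ${output_table}
PARTITION (compute_month='${compute_month}')
select
      xxx
     ,xxx
     ,xxx
from ${input_table}
where compute_day>='${start_date}' and compute_day<'${today}' -- the monthly statistics depend on the daily table, so filter on compute_day
...........
...........
;
"

# 4. Execute the SQL. For the monthly job the value to match is 01.

if [ "${day}" = "01" ]; then

   ${hive} -e "${sql}" >>/tmp/${output_table}.log 2>&1
else
   echo 'Statistics for the previous month are only computed on the first day of the month'
fi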
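The monthly template assumes it runs on the 1st. As a sketch of a variation that works for any run date, the previous month's boundaries can be derived from the first day of the current month. GNU date is assumed; the function name last_month_range is hypothetical.

```shell
#!/bin/bash
# Sketch: print "<first_day> <last_day> <YYYY-MM>" of the month before
# the one that contains the date given in $1.
last_month_range() {
    first_of_month=$(date -d "$1" +%Y-%m-01)              # 1st of the current month
    start=$(date -d "${first_of_month} -1 month" +%F)     # 1st of the previous month
    end=$(date -d "${first_of_month} -1 days" +%F)        # last day of the previous month
    echo "${start} ${end} $(date -d "${start}" +%Y-%m)"
}

last_month_range 2020-11-01
last_month_range 2020-03-15   # previous month is February of a leap year
```

Anchoring on the 1st avoids the ambiguity of expressions such as "1 month ago" on the 29th to 31st of a month.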

5 Statistics by quarter

 Note: This script uses division when computing the quarter number. The division is done with the bc calculator, where scale sets the number of decimal places and the operator is written with a space on each side. For integer arithmetic in the shell, $((a+b)) or the older $[a+b] form is generally used; for decimal arithmetic, bc is generally used, with scale controlling the number of decimal places.
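As a minimal sketch of the two arithmetic styles in the note, the quarter number can be computed from a two-digit month either with integer arithmetic or with bc. The leading-zero handling is an addition for illustration: in $(( )) a value such as 08 would otherwise be read as an invalid octal number. The bc lines run only if bc is installed.

```shell
#!/bin/bash
# Sketch: month number -> quarter number, two ways.
month="08"                        # as returned by date +%m
m=${month#0}                      # strip one leading zero so $(( )) sees decimal 8
quarter=$(( (m - 1) / 3 + 1 ))    # integer division truncates toward zero
echo "quarter=${quarter}"

if command -v bc >/dev/null 2>&1; then
    # bc reads 08 as decimal, and scale=0 truncates the division.
    echo "scale=0; (${month} - 1) / 3 + 1" | bc
    # With scale=2 the result keeps two decimal places.
    echo "scale=2; 7 / 3" | bc
fi
```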

#!/bin/bash

# The quarterly table depends on the monthly table, so the input table is the monthly table.
# 1. Get the dates. The job runs at the start of each quarter and computes last quarter's statistics.
today=$(date +%F) # current date
if [ "$1" != "" ]; then
    today=$1 # optional override; set before the dates below are derived from it
fi
current_month=$(date -d "${today}" +%Y-%m) # current month
start_month=$(date -d "${today} 3 month ago" +%Y-%m) # first month of last quarter
end_month=$(date -d "${today} -1 month" +%Y-%m) # last month of last quarter
match=$(date -d "${today}" +%m) # value matched in the condition below
temp=$(date -d "${start_month}-01" +%m) # month number of last quarter's first month, an intermediate value
compute_quarter=$(echo "scale=0; (${temp}-1) / 3 + 1" | bc) # key of the static partition: quarter number derived from the month

# 2. Define variables
hive='/usr/idp/current/hive-client/bin/hive'
APP=phmdwdb
input_table="${APP}.input_table_name"
output_table="${APP}.output_table_name"

# 3. Write the SQL (xxx and the dotted lines are placeholders to fill in)
sql="
insert overwrite table ${output_table}
PARTITION (compute_quarter='${compute_quarter}')
select
      xxx
     ,xxx
     ,xxx
from ${input_table}
where compute_month>='${start_month}' and compute_month<'${current_month}' -- the quarterly statistics depend on the monthly table, so filter on compute_month
...........
...........
;
"

# 4. Execute the SQL. The months to match are 01, 04, 07 and 10, so the job runs once per quarter.

if [ "${match}" = "01" ]; then
    ${hive} -e "set mapred.job.name=Q4 statistics;${sql}" >>/tmp/${output_table}.log 2>&1
elif [ "${match}" = "04" ]; then
    ${hive} -e "set mapred.job.name=Q1 statistics;${sql}" >>/tmp/${output_table}.log 2>&1
elif [ "${match}" = "07" ]; then
    ${hive} -e "set mapred.job.name=Q2 statistics;${sql}" >>/tmp/${output_table}.log 2>&1
elif [ "${match}" = "10" ]; then
    ${hive} -e "set mapred.job.name=Q3 statistics;${sql}" >>/tmp/${output_table}.log 2>&1
else
    echo 'Statistics are only computed at the start of a quarter'
fi
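The quarterly template derives its month range from a run date in the first month of a quarter. As a hedged variation for arbitrary run dates, the previous quarter can be derived from the first month of the current quarter. GNU date is assumed; the function name prev_quarter_range is hypothetical.

```shell
#!/bin/bash
# Sketch: print "<start YYYY-MM> <end YYYY-MM> <quarter>" of the quarter
# before the one that contains the date given in $1.
prev_quarter_range() {
    y=$(date -d "$1" +%Y)
    m=$(date -d "$1" +%m); m=${m#0}                   # strip leading zero for $(( ))
    q_first=$(( (m - 1) / 3 * 3 + 1 ))                # first month of the current quarter
    cur_start=$(printf '%s-%02d-01' "${y}" "${q_first}")
    start=$(date -d "${cur_start} -3 month" +%Y-%m)   # first month of the previous quarter
    end=$(date -d "${cur_start} -1 month" +%Y-%m)     # last month of the previous quarter
    sm=$(date -d "${cur_start} -3 month" +%m); sm=${sm#0}
    echo "${start} ${end} $(( (sm - 1) / 3 + 1 ))"
}

prev_quarter_range 2021-01-01
prev_quarter_range 2020-08-15
```

Note that the quarter number alone does not identify the year; a real partition key may need to include it.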

6 Summary

The templates in this article are used frequently in real ETL statistics work. The article summarizes the common time-dimension statistics jobs in a data warehouse (daily, weekly, monthly, and quarterly) and abstracts them into templates for readers' reference and use.

 


Origin blog.csdn.net/godlovedaniel/article/details/109264067