Table of contents
- Rows to columns
- Columns to rows
- Filtering on digits / Chinese characters
- Calculating durations in Hive (filtering out unclosed time periods)
- Date format conversion
- Dynamic Partition & Static Partition
- for loop in shell
Usage notes
Rows to columns
The data and tables are as follows:
name        constellation  blood_type
Sun Wukong  Aries          A
Haihai      Sagittarius    A
Song Song   Aries          B
Pig Bajie   Aries          A
Fengjie     Sagittarius    A
Xiao Ming   Aries          B
The requirements are as follows:
Group people with the same constellation and blood type together:
Sagittarius, A Haihai|Fengjie
Aries, A Sun Wukong|Pig Bajie
Aries, B Song Song|Xiao Ming
Analysis:
- First use the concat_ws function to join the constellation and blood type with ","
- Group by the joined constellation-and-blood-type key
- Use the collect_set function to aggregate the names into an array
- Use the concat_ws function to join the aggregated names with "|"
The implementation is as follows:
SELECT
t1.c_b,
CONCAT_WS("|",collect_set(t1.name))
FROM (
SELECT
NAME,
CONCAT_WS(',',constellation,blood_type) c_b
FROM person_info
) t1
GROUP BY t1.c_b;
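As a sanity check outside Hive, the same group-and-join logic can be sketched in Python (an in-memory stand-in for person_info; note that collect_set also deduplicates, which the plain list here skips because the names are unique):

```python
from collections import OrderedDict

rows = [
    ("Sun Wukong", "Aries", "A"),
    ("Haihai", "Sagittarius", "A"),
    ("Song Song", "Aries", "B"),
    ("Pig Bajie", "Aries", "A"),
    ("Fengjie", "Sagittarius", "A"),
    ("Xiao Ming", "Aries", "B"),
]

# CONCAT_WS(',', constellation, blood_type) builds the grouping key;
# collect_set gathers the names; CONCAT_WS('|', ...) joins them.
groups = OrderedDict()
for name, constellation, blood_type in rows:
    key = ",".join([constellation, blood_type])
    groups.setdefault(key, []).append(name)

result = {key: "|".join(names) for key, names in groups.items()}
print(result["Sagittarius,A"])  # Haihai|Fengjie
```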
Columns to rows
The data are as follows:
movie                   category
"Person of Interest"    suspense,action,science fiction,drama
"Lie to me"             suspense,cops,action,psychology,drama
"Wolf Warrior 2"        war,action,disaster
The requirements are as follows:
Expand the comma-separated categories in movie category into one row per category:
"Person of Interest" suspense
"Person of Interest" action
"Person of Interest" science fiction
"Person of Interest" drama
"Lie to me" suspense
"Lie to me" cops
"Lie to me" action
"Lie to me" psychology
"Lie to me" drama
"Wolf Warrior 2" war
"Wolf Warrior 2" action
"Wolf Warrior 2" disaster
Analysis:
- First use the split function to split the category string into an array on ","
- Then use the explode function together with LATERAL VIEW to flatten the array back into rows alongside the movie column

The implementation is as follows:
SELECT
movie,
category_name
FROM
movie_info
LATERAL VIEW
explode(split(category,",")) movie_info_tmp AS category_name;
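What explode + LATERAL VIEW computes can be sketched in Python (in-memory stand-in for movie_info):

```python
movies = [
    ("Person of Interest", "suspense,action,science fiction,drama"),
    ("Lie to me", "suspense,cops,action,psychology,drama"),
    ("Wolf Warrior 2", "war,action,disaster"),
]

# split(category, ",") -> array; explode + LATERAL VIEW pairs each
# array element back with the movie column.
exploded = [(movie, category)
            for movie, categories in movies
            for category in categories.split(",")]

for movie, category in exploded[:3]:
    print(movie, category)
```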
My implementation
-- get_json_all_keys is a custom UDF that returns all keys of the JSON value, joined by ','
SELECT u_id, u_tag
FROM ods_user_action_score
LATERAL VIEW
explode(split(get_json_all_keys(action_data),',')) u_tag_tmp AS u_tag
WHERE action_data IS NOT NULL AND action_data <> '[]' AND action_data <> '';
Filtering on digits / Chinese characters
-- contains Chinese characters: rlike '[\\u4e00-\\u9fa5]'
-- contains digits: rlike '[0-9]'
SELECT u_tag, u_id
FROM dws_user_tags
WHERE u_tag RLIKE '[\\u4e00-\\u9fa5]' OR u_tag RLIKE '[0-9]';
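The same character classes work in Python's re module, which is a quick way to test the patterns before running them in Hive:

```python
import re

# Same classes as the RLIKE patterns above:
# [\u4e00-\u9fa5] matches common CJK ideographs, [0-9] matches digits.
has_cjk_or_digit = re.compile(r"[\u4e00-\u9fa5]|[0-9]")

tags = ["hello", "tag42", "标签", "mixed标签1"]
matched = [t for t in tags if has_cjk_or_digit.search(t)]
print(matched)  # ['tag42', '标签', 'mixed标签1']
```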
Calculating durations in Hive (filtering out unclosed time periods)
lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time )
lag( xxx, xxx, xxx) over ( PARTITION BY xxx ORDER BY xxx)
Hive's analytic functions, also called window functions, exist in Oracle as well and are mainly used for statistical analysis of data.
The LAG and LEAD analytic functions expose, within the same query, the value of a column N rows before the current row (LAG) or N rows after it (LEAD) as an extra column. This can replace a self-join of the table, and LAG/LEAD are more efficient. OVER() refers to the result set of the current query, and the clause inside the parentheses defines how that result set is partitioned and ordered for the function.
Case description
User behavior analysis: analyzing data such as users' online or browsing time, while filtering out unclosed time periods.
data structure
id  user_id  behavior_type  date_time
1 101 quit_time 2021-11-01 10:34:08
2 101 landing_time 2021-11-01 10:34:14
3 101 quit_time 2021-11-01 10:34:21
4 101 landing_time 2021-11-01 10:34:25
5 101 quit_time 2021-11-01 10:39:14
6 101 landing_time 2021-11-01 10:47:34
7 101 quit_time 2021-11-01 10:48:05
8 101 landing_time 2021-11-01 14:19:09
9 101 quit_time 2021-11-01 14:20:28
10 101 landing_time 2021-11-01 14:36:30
11 101 quit_time 2021-11-01 14:41:38
12 101 landing_time 2021-11-01 14:55:41
13 123062 landing_time 2021-11-01 10:07:07
14 123062 landing_time 2021-11-01 10:20:20
15 123062 landing_time 2021-11-01 15:06:10
16 123062 landing_time 2021-11-01 15:08:48
17 123062 landing_time 2021-11-01 15:21:57
18 123062 landing_time 2021-11-01 15:41:01
19 123062 landing_time 2021-11-01 16:37:50
20 123062 landing_time 2021-11-01 16:50:47
21 123062 landing_time 2021-11-01 17:11:37
22 123491 landing_time 2021-11-01 21:52:57
23 123491 quit_time 2021-11-01 21:52:59
24 123511 landing_time 2021-11-01 17:03:25
25 123511 quit_time 2021-11-01 17:04:38
26 123511 landing_time 2021-11-01 17:04:40
27 123511 quit_time 2021-11-01 17:05:27
28 123511 landing_time 2021-11-01 17:20:51
The requirement is to calculate each user's online duration and to filter out unclosed time periods.
Use the lead window function: lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time ) takes the next record's time as the end time of the current period.
The condition ( lead( behavior_type, 1, null ) over ( PARTITION BY user_id ORDER BY date_time )) = 'quit_time' inside the CASE WHEN means the end time is only kept when the next row is a logout; this is what filters out unclosed time periods.
SELECT
date_id,
user_id,
CASE WHEN behavior_type = 'quit_time' THEN logout_ts ELSE login_ts END AS login_ts,
CASE WHEN behavior_type = 'quit_time' THEN login_ts ELSE logout_ts END AS logout_ts
FROM(
SELECT date_id,user_id,date_time AS login_ts,logout_ts,rn,behavior_type
FROM(
SELECT
date_id,behavior_type,user_id,date_time,
CASE
WHEN behavior_type = 'landing_time'
and (lead( behavior_type, 1, null ) over ( PARTITION BY user_id ORDER BY date_time )) = 'quit_time'
THEN
lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time )
ELSE null
END AS logout_ts,
ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY date_time ASC ) rn
FROM dws_base_event_user
WHERE dt = '${do_date}' and user_id <> '0' and (behavior_type = 'landing_time' OR behavior_type = 'quit_time' )
) tt
WHERE behavior_type = 'landing_time' OR ( behavior_type = 'quit_time' AND rn = 1 ) ) ttt;
The result is as follows
id date user login_ts logout_ts
1 20211101 101 NULL 2021-11-01 10:34:08
2 20211101 101 2021-11-01 10:34:14 2021-11-01 10:34:21
3 20211101 101 2021-11-01 10:34:25 2021-11-01 10:39:14
4 20211101 101 2021-11-01 10:47:34 2021-11-01 10:48:05
5 20211101 101 2021-11-01 14:19:09 2021-11-01 14:20:28
6 20211101 101 2021-11-01 14:36:30 2021-11-01 14:41:38
7 20211101 101 2021-11-01 14:55:41 NULL
8 20211101 123062 2021-11-01 10:07:07 NULL
9 20211101 123062 2021-11-01 10:20:20 NULL
10 20211101 123062 2021-11-01 15:06:10 NULL
11 20211101 123062 2021-11-01 15:08:48 NULL
12 20211101 123062 2021-11-01 15:21:57 NULL
13 20211101 123062 2021-11-01 15:41:01 NULL
14 20211101 123062 2021-11-01 16:37:50 NULL
15 20211101 123062 2021-11-01 16:50:47 NULL
16 20211101 123062 2021-11-01 17:11:37 NULL
17 20211101 123491 2021-11-01 21:52:57 2021-11-01 21:52:59
18 20211101 123511 2021-11-01 17:03:25 2021-11-01 17:04:38
19 20211101 123511 2021-11-01 17:04:40 2021-11-01 17:05:27
20 20211101 123511 2021-11-01 17:20:51 NULL
Once these records are available, unclosed time periods can be filtered out by their NULLs, and the durations can then be calculated.
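The lead-based pairing can be sketched in Python (hypothetical in-memory events, already ordered per user the way the window's ORDER BY would order them; only a subset of user 101's rows is shown):

```python
from itertools import groupby
from operator import itemgetter

events = [  # (user_id, behavior_type, date_time), sorted by user and time
    ("101", "quit_time", "2021-11-01 10:34:08"),
    ("101", "landing_time", "2021-11-01 10:34:14"),
    ("101", "quit_time", "2021-11-01 10:34:21"),
    ("101", "landing_time", "2021-11-01 14:55:41"),
]

sessions = []
for user, grp in groupby(events, key=itemgetter(0)):
    rows = list(grp)
    for i, (_, btype, ts) in enumerate(rows):
        nxt = rows[i + 1] if i + 1 < len(rows) else None  # lead(..., 1, null)
        if btype == "landing_time":
            # close the session only if the next event is a logout
            logout = nxt[2] if nxt and nxt[1] == "quit_time" else None
            sessions.append((user, ts, logout))
        elif btype == "quit_time" and i == 0:
            # dangling logout at rn = 1: the login time is unknown
            sessions.append((user, None, ts))

print(sessions)
```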
Duration calculation
unix_timestamp() returns the current Hive system time as a Unix timestamp, accurate to the second.
unix_timestamp(ymdhms) converts a time string in the 2018-05-23 07:15:50 format to a Unix timestamp.
unix_timestamp() - unix_timestamp(ymdhms) subtracts the two timestamps; since a timestamp counts seconds, the result is the number of seconds between the two times.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) % 60 AS int) is the seconds part of the difference.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) / 60 AS int) % 60 is the minutes part of the difference.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) / (60 * 60) AS int) % 24 is the hours part of the difference.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) / (60 * 60 * 24) AS int) is the number of days of difference.
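The same seconds/minutes/hours/days arithmetic, sketched in Python (diff stands in for the difference of the two Unix timestamps; Hive's CAST(... AS int) truncates like integer division for positive values):

```python
diff = 93784  # e.g. unix_timestamp() - unix_timestamp(ymdhms)

seconds = diff % 60              # CAST(diff % 60 AS int)
minutes = diff // 60 % 60        # CAST(diff / 60 AS int) % 60
hours = diff // (60 * 60) % 24   # CAST(diff / (60 * 60) AS int) % 24
days = diff // (60 * 60 * 24)    # CAST(diff / (60 * 60 * 24) AS int)

print(f"{days}d {hours}h {minutes}m {seconds}s")  # 1d 2h 3m 4s
```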
Date format conversion
Method 1: from_unixtime + unix_timestamp
-- convert 20171205 to 2017-12-05
select from_unixtime(unix_timestamp('20171205','yyyyMMdd'),'yyyy-MM-dd');
-- convert 2017-12-05 to 20171205
select from_unixtime(unix_timestamp('2017-12-05','yyyy-MM-dd'),'yyyyMMdd');
Method 2: date_format
-- convert 2017-12-05 to 20171205
select date_format('2017-12-05','yyyyMMdd');
-- number of days between two dates; the dates must be in yyyy-MM-dd format
SELECT datediff('2021-07-25', '2021-07-01');
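The equivalent conversions in Python's datetime module, useful for double-checking the patterns:

```python
from datetime import datetime, date

# from_unixtime(unix_timestamp('20171205','yyyyMMdd'),'yyyy-MM-dd')
dashed = datetime.strptime("20171205", "%Y%m%d").strftime("%Y-%m-%d")
print(dashed)  # 2017-12-05

# date_format('2017-12-05','yyyyMMdd')
compact = datetime.strptime("2017-12-05", "%Y-%m-%d").strftime("%Y%m%d")
print(compact)  # 20171205

# datediff('2021-07-25', '2021-07-01')
days_between = (date(2021, 7, 25) - date(2021, 7, 1)).days
print(days_between)  # 24
```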
Dynamic Partition & Static Partition
INSERT OVERWRITE TABLE dwt_user_measure_topic PARTITION(dt,measure='UB005_')
select
u.date_id,
'UB004' measure_id,
u.count_num measure_value,
cast(u.parent_app_id as STRING) parent_app_id,
null province,
null city,
null measure_hour,
null extend,
'注册用户日活跃量' remark,
date_format(current_timestamp, 'yyyyMMddHHmmss') etl_stamp ,
from_unixtime(unix_timestamp(u.date_id,'yyyyMMdd'),'yyyy-MM-dd') dt
from
(SELECT
count(DISTINCT(case when u1.is_visitor='0' then u1.user_id else null end)) count_num,
u2.parent_id parent_app_id,u1.date_id
from
dws_base_event_user u1
LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id
GROUP BY u2.parent_id,u1.date_id) u
Hive error
FAILED: SemanticException [Error 10094]: Line 4:56 Dynamic partition cannot be the parent of a static partition ''UB005_''
Note that a dynamic-partition insert does not allow the parent partition column to be dynamic while a child partition column is static: that would require creating, under every dynamically generated parent partition, the partition defined by the child's static value.
Change it to:
INSERT OVERWRITE TABLE dwt_user_measure_topic PARTITION(dt,measure)
select
u.date_id,
'UB004' measure_id,
u.count_num measure_value,
cast(u.parent_app_id as STRING) parent_app_id,
null province,
null city,
null measure_hour,
null extend,
'注册用户日活跃量' remark,
date_format(current_timestamp, 'yyyyMMddHHmmss') etl_stamp ,
from_unixtime(unix_timestamp(u.date_id,'yyyyMMdd'),'yyyy-MM-dd') dt,
'UB005_' measure
from
(SELECT
count(DISTINCT(case when u1.is_visitor='0' then u1.user_id else null end)) count_num,
u2.parent_id parent_app_id,u1.date_id
from
dws_base_event_user u1
LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id
GROUP BY u2.parent_id,u1.date_id) u
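As an aside (not part of the error above), fully dynamic inserts like PARTITION(dt, measure) typically also require dynamic partitioning to be enabled in nonstrict mode; Hive's default strict mode demands at least one static partition column. The usual session settings are:

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
```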
for loop in shell
#$1 and $2 are parameters passed in from the Hue workflow
first=$1
second=$2
while [ "$first" != "$second" ]
do
echo $first
sql_topic="
use dd_database_bigdata;
--DWT layer: user activity topic table
insert overwrite table dwt_user_active_topic
select
nvl (new.user_id, old.user_id) user_id,
nvl (new.parent_app_id, old.parent_app_id) parent_app_id,
IF (old.user_id IS NULL,new.mobile_type,old.mobile_type_first) mobile_type_first,
IF (old.user_id IS NULL,new.province,old.province_first) province_first,
IF (old.user_id IS NULL,new.city,old.city_first) city_first,
IF (old.login_date_first IS NULL,'$first',old.login_date_first) login_date_first,
IF (new.user_id IS NOT NULL,from_unixtime(unix_timestamp(new.date_id,'yyyyMMdd'),'yyyy-MM-dd'),old.login_date_last) login_date_last,
IF (new.user_id IS NOT NULL,new.active_count,0) login_day_count,
nvl (old.login_count, 0) +nvl (new.active_count, 0) login_count,
nvl (old.log_count, 0) +IF (new.active_count > 0, 1, 0) log_count,
nvl (new.is_visitor, old.is_visitor) is_visitor
FROM
(SELECT * FROM dwt_user_active_topic) old
FULL OUTER JOIN
(SELECT u1.*,u2.parent_id parent_app_id FROM dws_user_active_daily u1
LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id
WHERE dt = '$first') new ON old.user_id = new.user_id and new.parent_app_id = old.parent_app_id;"
echo $sql_topic
first=`date -d "-1 days ago ${first}" +%Y-%m-%d` # GNU date: "-1 days ago" advances first by one day
$hive -e "$sql_topic"
done
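The same date iteration can be sketched in Python, assuming yyyy-MM-dd start and end dates (end exclusive, matching the `!=` test in the while loop):

```python
from datetime import date, timedelta

def date_range(first: str, second: str):
    """Yield yyyy-MM-dd date strings from first (inclusive) to second (exclusive)."""
    cur = date.fromisoformat(first)
    end = date.fromisoformat(second)
    while cur != end:
        yield cur.isoformat()
        cur += timedelta(days=1)  # same step as: date -d "-1 days ago $first"

print(list(date_range("2021-11-01", "2021-11-04")))
# ['2021-11-01', '2021-11-02', '2021-11-03']
```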
More notes to be added...