Hive SQL daily usage notes

Table of contents

usage record

rows to columns (行转列)

columns to rows (列转行)

filtering for digits / Chinese characters

calculating durations in Hive (filtering out unclosed time periods)

case description

Duration calculation

date format conversion

Dynamic Partition & Static Partition

looping over dates in shell


usage record

rows to columns (行转列)

The data and tables are as follows:
Sun Wukong	Aries	A
Da Hai	Sagittarius	A
Song Song	Aries	B
Zhu Bajie	Aries	A
Feng Jie	Sagittarius	A
Xiao Bai	Aries	B

The requirements are as follows:
group people with the same constellation and blood type together

Sagittarius,A	Da Hai|Feng Jie
Aries,A	Sun Wukong|Zhu Bajie
Aries,B	Song Song|Xiao Bai

Analysis:

  • First use the concat_ws function to join the constellation and blood type with ","
  • Group by the joined constellation-and-blood-type value
  • Use the collect_set function to aggregate the names into an array
  • Use the concat_ws function to join the aggregated names with "|"

The implementation is as follows:

SELECT
  t1.c_b,
  CONCAT_WS("|", collect_set(t1.name))
FROM (
  SELECT
    name,
    CONCAT_WS(',', constellation, blood_type) c_b
  FROM person_info
) t1
GROUP BY t1.c_b;
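
One note on the aggregation step: collect_set deduplicates names within each group. If duplicates should be kept, collect_list is the drop-in alternative, as in this variant of the query above:

SELECT
  t1.c_b,
  -- collect_list keeps duplicate names; collect_set would drop them
  CONCAT_WS("|", collect_list(t1.name))
FROM (
  SELECT
    name,
    CONCAT_WS(',', constellation, blood_type) c_b
  FROM person_info
) t1
GROUP BY t1.c_b;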

columns to rows (列转行)

The data are as follows:
"Suspect Tracker" suspense, action, science fiction, drama
"Lie to me" suspense, cops, action, psychological, drama
"Wolf Warrior 2" war, action, disaster

The requirements are as follows:
expand the comma-separated movie categories into one row per category

"Person of Interest" suspense
"Person of Interest" action
"Person of Interest" science fiction
"Person of Interest" drama
"Lie to me" suspense
"Lie to me" cops
"Lie to me" action
"Lie to me" psychology
"Lie to me" Drama
"Wolf Warrior 2" War
"Wolf Warrior 2" Action
"Wolf Warrior 2" Disaster

Analysis:

First use the split function to split the category field into an array on ",", then use the explode function together with LATERAL VIEW to expand the array into one row per category.

The implementation is as follows:

SELECT
  movie,
  category_name
FROM movie_info
LATERAL VIEW explode(split(category, ",")) movie_info_tmp AS category_name;

My implementation:

-- get_json_all_keys is a custom UDF that returns all keys of a JSON string, joined by ','
SELECT u_id, u_tag
FROM ods_user_action_score
LATERAL VIEW explode(split(get_json_all_keys(action_data), ',')) u_tag_tmp AS u_tag
WHERE action_data IS NOT NULL AND action_data <> '[]' AND action_data <> '';

filtering for digits / Chinese characters

-- contains Chinese characters: rlike '[\\u4e00-\\u9fa5]'
-- contains digits:             rlike '[0-9]'
select u_tag, u_id from dws_user_tags
where u_tag rlike '[\\u4e00-\\u9fa5]' or u_tag rlike '[0-9]';
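
A quick sanity check with literal values (illustrative only; both expressions should return true):

select '标签abc' rlike '[\\u4e00-\\u9fa5]', 'tag42' rlike '[0-9]';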

calculating durations in Hive (filtering out unclosed time periods)

lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time ) 

lag( col, offset, default ) over ( PARTITION BY ... ORDER BY ... )

Hive's analytic functions, also called window functions, exist in Oracle as well and are mainly used for statistical analysis of data.
The LAG and LEAD analytic functions pull the value of a column from the N-th preceding row (LAG) or the N-th following row (LEAD) of the same result set into the current row as an extra column, all within one query. This can replace a self-join of the table, and LAG/LEAD are more efficient. over() denotes the result set of the current query, and the clause inside the parentheses describes how that result set is partitioned and ordered before the function is applied.
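
As a minimal side-by-side sketch, using a hypothetical table t(user_id, date_time) (table and column names made up for illustration):

SELECT
  user_id,
  date_time,
  -- date_time from the previous row within the same user_id partition
  lag( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time ) AS prev_time,
  -- date_time from the next row within the same user_id partition
  lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time ) AS next_time
FROM t;

The third argument is the default returned at the edge of a partition, where no previous/next row exists.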

case description

User behavior analysis: analyzing data such as users' online time or browsing time, while filtering out unclosed time periods.

data structure

id	user_id	behavior_type	date_time
1	101	quit_time	2021-11-01 10:34:08
2	101	landing_time	2021-11-01 10:34:14
3	101	quit_time	2021-11-01 10:34:21
4	101	landing_time	2021-11-01 10:34:25
5	101	quit_time	2021-11-01 10:39:14
6	101	landing_time	2021-11-01 10:47:34
7	101	quit_time	2021-11-01 10:48:05
8	101	landing_time	2021-11-01 14:19:09
9	101	quit_time	2021-11-01 14:20:28
10	101	landing_time	2021-11-01 14:36:30
11	101	quit_time	2021-11-01 14:41:38
12	101	landing_time	2021-11-01 14:55:41
13	123062	landing_time	2021-11-01 10:07:07
14	123062	landing_time	2021-11-01 10:20:20
15	123062	landing_time	2021-11-01 15:06:10
16	123062	landing_time	2021-11-01 15:08:48
17	123062	landing_time	2021-11-01 15:21:57
18	123062	landing_time	2021-11-01 15:41:01
19	123062	landing_time	2021-11-01 16:37:50
20	123062	landing_time	2021-11-01 16:50:47
21	123062	landing_time	2021-11-01 17:11:37
22	123491	landing_time	2021-11-01 21:52:57
23	123491	quit_time	2021-11-01 21:52:59
24	123511	landing_time	2021-11-01 17:03:25
25	123511	quit_time	2021-11-01 17:04:38
26	123511	landing_time	2021-11-01 17:04:40
27	123511	quit_time	2021-11-01 17:05:27
28	123511	landing_time	2021-11-01 17:20:51

The requirement is to calculate the online duration of each user and filter out unclosed time periods.

Use the lead(...) over (partition by ... order by ...) window function: lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time ) takes the time of the next record as the end time of the current one.

The condition ( lead( behavior_type, 1, null ) over ( PARTITION BY user_id ORDER BY date_time )) = 'quit_time' inside the CASE WHEN requires the next row to be a logout record; anything else leaves the end time NULL, which marks the period as unclosed so it can be filtered out.

SELECT
  date_id,
  user_id,
  CASE WHEN behavior_type = 'quit_time' THEN logout_ts ELSE login_ts END AS login_ts,
  CASE WHEN behavior_type = 'quit_time' THEN login_ts ELSE logout_ts END AS logout_ts
FROM (
  SELECT date_id, user_id, date_time AS login_ts, logout_ts, rn, behavior_type
  FROM (
    SELECT
      date_id, behavior_type, user_id, date_time,
      CASE
        WHEN behavior_type = 'landing_time'
          AND (lead( behavior_type, 1, null ) OVER ( PARTITION BY user_id ORDER BY date_time )) = 'quit_time'
        THEN lead( date_time, 1, null ) OVER ( PARTITION BY user_id ORDER BY date_time )
        ELSE null
      END AS logout_ts,
      ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY date_time ASC ) AS rn
    FROM dws_base_event_user
    WHERE dt = '${do_date}' AND user_id <> '0'
      AND (behavior_type = 'landing_time' OR behavior_type = 'quit_time')
  ) tt
  WHERE behavior_type = 'landing_time' OR (behavior_type = 'quit_time' AND rn = 1)
) ttt;

The result is as follows

id	date_id	user_id	login_ts	logout_ts
1	20211101	101	NULL	2021-11-01 10:34:08
2	20211101	101	2021-11-01 10:34:14	2021-11-01 10:34:21
3	20211101	101	2021-11-01 10:34:25	2021-11-01 10:39:14
4	20211101	101	2021-11-01 10:47:34	2021-11-01 10:48:05
5	20211101	101	2021-11-01 14:19:09	2021-11-01 14:20:28
6	20211101	101	2021-11-01 14:36:30	2021-11-01 14:41:38
7	20211101	101	2021-11-01 14:55:41	NULL
8	20211101	123062	2021-11-01 10:07:07	NULL
9	20211101	123062	2021-11-01 10:20:20	NULL
10	20211101	123062	2021-11-01 15:06:10	NULL
11	20211101	123062	2021-11-01 15:08:48	NULL
12	20211101	123062	2021-11-01 15:21:57	NULL
13	20211101	123062	2021-11-01 15:41:01	NULL
14	20211101	123062	2021-11-01 16:37:50	NULL
15	20211101	123062	2021-11-01 16:50:47	NULL
16	20211101	123062	2021-11-01 17:11:37	NULL
17	20211101	123491	2021-11-01 21:52:57	2021-11-01 21:52:59
18	20211101	123511	2021-11-01 17:03:25	2021-11-01 17:04:38
19	20211101	123511	2021-11-01 17:04:40	2021-11-01 17:05:27
20	20211101	123511	2021-11-01 17:20:51	NULL

With this result, simply filter out the unclosed time periods (the rows with a NULL timestamp) and compute the duration.
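
That final step as a sketch, assuming the result above is wrapped in a subquery or view named sessions (a made-up name for illustration):

-- keep only closed periods and compute the duration in seconds
SELECT
  user_id,
  login_ts,
  logout_ts,
  unix_timestamp(logout_ts) - unix_timestamp(login_ts) AS duration_sec
FROM sessions
WHERE login_ts IS NOT NULL
  AND logout_ts IS NOT NULL;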

Duration calculation

unix_timestamp() returns Hive's current system time as a Unix timestamp, accurate to the second.
   unix_timestamp(ymdhms) converts a time string in the '2018-05-23 07:15:50' format into a Unix timestamp.
   unix_timestamp() - unix_timestamp(ymdhms) subtracts the two timestamps; the unit is seconds, so the result is the number of seconds between the two times.
   CAST((unix_timestamp() - unix_timestamp(ymdhms)) % 60 AS int) is the seconds part of the difference.
   CAST((unix_timestamp() - unix_timestamp(ymdhms)) / 60 AS int) % 60 is the minutes part of the difference.
   CAST((unix_timestamp() - unix_timestamp(ymdhms)) / (60 * 60) AS int) % 24 is the hours part of the difference.
   CAST((unix_timestamp() - unix_timestamp(ymdhms)) / (60 * 60 * 24) AS int) is the days part of the difference.
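
Put together as a runnable sketch (the literal timestamp is just the example value from above):

-- elapsed time between now and '2018-05-23 07:15:50', split into days/hours/minutes/seconds
SELECT
  CAST((unix_timestamp() - unix_timestamp('2018-05-23 07:15:50')) / (60 * 60 * 24) AS int)  AS diff_days,
  CAST((unix_timestamp() - unix_timestamp('2018-05-23 07:15:50')) / (60 * 60) AS int) % 24  AS diff_hours,
  CAST((unix_timestamp() - unix_timestamp('2018-05-23 07:15:50')) / 60 AS int) % 60         AS diff_minutes,
  CAST((unix_timestamp() - unix_timestamp('2018-05-23 07:15:50')) % 60 AS int)              AS diff_seconds;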

date format conversion

Method 1: from_unixtime + unix_timestamp

-- convert 20171205 to 2017-12-05
select from_unixtime(unix_timestamp('20171205','yyyyMMdd'),'yyyy-MM-dd');

-- convert 2017-12-05 to 20171205
select from_unixtime(unix_timestamp('2017-12-05','yyyy-MM-dd'),'yyyyMMdd');

Method 2: date_format

-- convert 2017-12-05 to 20171205
select date_format('2017-12-05','yyyyMMdd');

-- number of days between two dates; the inputs must be in yyyy-MM-dd format
SELECT datediff('2021-07-25', '2021-07-01');

Dynamic Partition & Static Partition

The following insert uses a dynamic primary partition (dt) together with a static secondary partition (measure='UB005_'):

INSERT OVERWRITE TABLE dwt_user_measure_topic PARTITION(dt,measure='UB005_') 
select
  u.date_id, 
  'UB004' measure_id, 
  u.count_num measure_value, 
  cast(u.parent_app_id as STRING) parent_app_id, 
  null province,
  null city,
  null measure_hour,
  null extend,
  '注册用户日活跃量' remark, 
  date_format(current_timestamp, 'yyyyMMddHHmmss') etl_stamp ,
  from_unixtime(unix_timestamp(u.date_id,'yyyyMMdd'),'yyyy-MM-dd') dt
from 
  (SELECT 
      count(DISTINCT(case when u1.is_visitor='0' then u1.user_id else null end)) count_num, 
      u2.parent_id parent_app_id,u1.date_id
    from 
      dws_base_event_user u1 
      LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id 
    GROUP BY u2.parent_id,u1.date_id) u

Hive reports the following error:

FAILED: SemanticException [Error 10094]: Line 4:56 Dynamic partition cannot be the parent of a static partition ''UB005_''

Note that dynamic partitioning does not allow a dynamic primary partition column combined with a static secondary partition column: that would force every primary partition to create the partition defined by the static secondary value.

Change it to:

INSERT OVERWRITE TABLE dwt_user_measure_topic PARTITION(dt,measure) 
select
  u.date_id, 
  'UB004' measure_id, 
  u.count_num measure_value, 
  cast(u.parent_app_id as STRING) parent_app_id, 
  null province,
  null city,
  null measure_hour,
  null extend,
  '注册用户日活跃量' remark, 
  date_format(current_timestamp, 'yyyyMMddHHmmss') etl_stamp ,
  from_unixtime(unix_timestamp(u.date_id,'yyyyMMdd'),'yyyy-MM-dd') dt,
  'UB005_' measure
from 
  (SELECT 
      count(DISTINCT(case when u1.is_visitor='0' then u1.user_id else null end)) count_num, 
      u2.parent_id parent_app_id,u1.date_id
    from 
      dws_base_event_user u1 
      LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id 
    GROUP BY u2.parent_id,u1.date_id) u
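
In addition, dynamic partition inserts in Hive generally require these session settings (a reminder; they are not part of the original job, and defaults vary by version):

-- enable dynamic partitioning
set hive.exec.dynamic.partition=true;
-- 'nonstrict' lets every partition column be dynamic; the default 'strict'
-- mode requires at least one static partition column
set hive.exec.dynamic.partition.mode=nonstrict;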

looping over dates in shell

# $1 and $2 are parameters passed in from the Hue workflow:
# the start date and the (exclusive) end date to iterate over, in yyyy-MM-dd format
first=$1
second=$2

# loop one day at a time until the start date reaches the end date
while [ "$first" != "$second" ]
do
echo $first

sql_topic="
use dd_database_bigdata;
--DWT layer: user activity topic table
insert overwrite table dwt_user_active_topic
select 
	nvl (new.user_id, old.user_id) user_id,
	nvl (new.parent_app_id, old.parent_app_id) parent_app_id,
	IF (old.user_id IS NULL,new.mobile_type,old.mobile_type_first) mobile_type_first,
	IF (old.user_id IS NULL,new.province,old.province_first) province_first,
	IF (old.user_id IS NULL,new.city,old.city_first) city_first,
	IF (old.login_date_first IS NULL,'$first',old.login_date_first) login_date_first,
	IF (new.user_id IS NOT NULL,from_unixtime(unix_timestamp(new.date_id,'yyyyMMdd'),'yyyy-MM-dd'),old.login_date_last) login_date_last,
	IF (new.user_id IS NOT NULL,new.active_count,0) login_day_count,
	nvl (old.login_count, 0) +nvl (new.active_count, 0) login_count,
	nvl (old.log_count, 0) +IF (new.active_count > 0, 1, 0) log_count,
	nvl (new.is_visitor, old.is_visitor) is_visitor
FROM
	(SELECT * FROM dwt_user_active_topic) old
	FULL OUTER JOIN 
	(SELECT u1.*,u2.parent_id parent_app_id FROM dws_user_active_daily u1
	LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id 
    WHERE dt = '$first') new ON old.user_id = new.user_id and new.parent_app_id = old.parent_app_id;"
echo $sql_topic

# advance to the next day ("-1 days ago" is GNU date syntax equivalent to +1 day)
first=`date -d "-1 days ago ${first}" +%Y-%m-%d`

# $hive is assumed to hold the path to the hive CLI binary
$hive -e "$sql_topic"
done

Backfill still being executed . . .

Origin blog.csdn.net/xieedeni/article/details/121330808