Hive SQL: notes from day-to-day use

Rows to columns

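-- Build a "constellation,blood_type" key per person, then gather the
-- distinct names sharing that key into one "|"-separated string.
SELECT
    t1.c_b,
    CONCAT_WS('|', collect_set(t1.name))
FROM (
    SELECT
        name,
        CONCAT_WS(',', constellation, blood_type) c_b
    FROM person_info
) t1
GROUP BY t1.c_b;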
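
To make the grouping concrete, here is a small Python model of what the query above computes. It is only a sketch: the sample rows are invented for illustration, and `rows_to_columns` is a hypothetical helper name. Hive's collect_set does not guarantee element order, so the sketch sorts the names purely to get a deterministic result.

```python
from collections import defaultdict


def rows_to_columns(rows):
    """Group names by "constellation,blood_type" and join them with "|"."""
    groups = defaultdict(set)  # a set mirrors collect_set's de-duplication
    for name, constellation, blood_type in rows:
        groups[f"{constellation},{blood_type}"].add(name)
    # Sorting is an addition for determinism; Hive gives no order guarantee.
    return {key: "|".join(sorted(names)) for key, names in groups.items()}


# Hypothetical sample rows (not taken from the original notes).
sample = [
    ("Ann", "Aries", "A"),
    ("Bob", "Sagittarius", "A"),
    ("Cid", "Aries", "A"),
    ("Dee", "Aries", "B"),
]
result = rows_to_columns(sample)
```

Each key of `result` corresponds to one output row of the GROUP BY.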

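Columns to rows

Sample data:
《疑犯追踪》	suspense,action,sci-fi,drama
《Lie to me》	suspense,crime,action,psychological,drama
《战狼 2》	war,action,disaster

Requirement:
Expand the array data in the movie category column into one row per category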

《疑犯追踪》	suspense
《疑犯追踪》	action
《疑犯追踪》	sci-fi
《疑犯追踪》	drama
《Lie to me》	suspense
《Lie to me》	crime
《Lie to me》	action
《Lie to me》	psychological
《Lie to me》	drama
《战狼 2》	war
《战狼 2》	action
《战狼 2》	disaster

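Analysis:

First use the split function to split category on "," into an array.
Then combine lateral view with explode: explode turns each array element into its own row, and lateral view joins those rows back to the source row they came from.
Implementation: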

SELECT
    movie,
    category_name
FROM movie_info
LATERAL VIEW explode(split(category, ',')) movie_info_tmp AS category_name;
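
The split plus explode step can be modelled in a few lines of Python. This is a sketch of the semantics only, not of how Hive executes the query; `explode_categories` is a hypothetical name and the two input rows are illustrative placeholders.

```python
def explode_categories(rows, sep=","):
    """Yield one (movie, category_name) pair per element of the split list."""
    for movie, category in rows:
        # category.split(sep) plays the role of split(category, ',');
        # the inner loop plays the role of LATERAL VIEW explode(...).
        for category_name in category.split(sep):
            yield movie, category_name


# Illustrative input rows: (movie, comma-separated categories).
movie_info = [
    ("movie_a", "suspense,crime,action"),
    ("movie_b", "war,action"),
]
exploded = list(explode_categories(movie_info))
```

Every input row contributes as many output rows as it has categories, with the movie value repeated on each of them.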

My implementation

-- get_json_all_keys is a custom UDF that returns all keys of a JSON value as one ','-separated string
SELECT u_id, u_tag
FROM ods_user_action_score
LATERAL VIEW explode(split(get_json_all_keys(action_data), ',')) u_tag_tmp AS u_tag
WHERE action_data IS NOT NULL AND action_data <> '[]' AND action_data <> '';

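Filtering values that contain digits or Chinese characters

-- contains a Chinese character: rlike '[\\u4e00-\\u9fa5]'
-- contains a digit: rlike '[0-9]'
-- the table needs the alias u1, because the WHERE clause refers to it
select u1.u_tag, u1.u_id from dws_user_tags u1
where u1.u_tag rlike '[\\u4e00-\\u9fa5]' or u1.u_tag rlike '[0-9]';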
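
The same two patterns can be tried out quickly in Python before running them against the table. This is a sketch with a hypothetical helper name; `re.search` is used because rlike matches when the pattern occurs anywhere in the string.

```python
import re

# Same character classes as in the rlike patterns above.
HAS_CJK = re.compile(r"[\u4e00-\u9fa5]")
HAS_DIGIT = re.compile(r"[0-9]")


def contains_cjk_or_digit(tag):
    """True if tag has at least one Chinese character or one digit."""
    return bool(HAS_CJK.search(tag) or HAS_DIGIT.search(tag))


checks = {tag: contains_cjk_or_digit(tag) for tag in ["tag", "tag1", "标签", ""]}
```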

Computing durations in Hive (filtering out unclosed intervals)

lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time )

lag( xxx, xxx, xxx) over ( PARTITION BY xxx ORDER BY xxx)

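Hive's analytic functions are also called window functions. Oracle offers the same kind of functions, and they are mainly used for statistical analysis of data.
Within a single query, LAG and LEAD return the value of a column from the row N rows before (LAG) or N rows after (LEAD) the current row, as a separate column. Their three arguments are the column, the offset N, and the default returned when no such row exists. This can replace a self-join of the table and is usually more efficient. The OVER() clause defines the window over the current result set: PARTITION BY splits the rows into groups, and ORDER BY orders the rows inside each group.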
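
A minimal Python sketch of the LAG and LEAD semantics, assuming the input list is already one partition sorted by the ORDER BY column and that the offset is not negative. The function names mirror the SQL functions, but they are only illustrative helpers.

```python
def lag(values, offset=1, default=None):
    """For each row, the value `offset` rows earlier, or `default` if none."""
    return [
        values[i - offset] if i - offset >= 0 else default
        for i in range(len(values))
    ]


def lead(values, offset=1, default=None):
    """For each row, the value `offset` rows later, or `default` if none."""
    return [
        values[i + offset] if i + offset < len(values) else default
        for i in range(len(values))
    ]


times = ["10:34:08", "10:34:14", "10:34:21"]
next_times = lead(times)      # each row paired with the following time
previous_times = lag(times)   # each row paired with the preceding time
```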

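Case description

User behavior analysis: measure figures such as how long users stay online or how long they browse, filtering out unclosed intervals.

Data structure

id	user	type	time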
1	101	quit_time	2021-11-01 10:34:08
2	101	landing_time	2021-11-01 10:34:14
3	101	quit_time	2021-11-01 10:34:21
4	101	landing_time	2021-11-01 10:34:25
5	101	quit_time	2021-11-01 10:39:14
6	101	landing_time	2021-11-01 10:47:34
7	101	quit_time	2021-11-01 10:48:05
8	101	landing_time	2021-11-01 14:19:09
9	101	quit_time	2021-11-01 14:20:28
10	101	landing_time	2021-11-01 14:36:30
11	101	quit_time	2021-11-01 14:41:38
12	101	landing_time	2021-11-01 14:55:41
13	123062	landing_time	2021-11-01 10:07:07
14	123062	landing_time	2021-11-01 10:20:20
15	123062	landing_time	2021-11-01 15:06:10
16	123062	landing_time	2021-11-01 15:08:48
17	123062	landing_time	2021-11-01 15:21:57
18	123062	landing_time	2021-11-01 15:41:01
19	123062	landing_time	2021-11-01 16:37:50
20	123062	landing_time	2021-11-01 16:50:47
21	123062	landing_time	2021-11-01 17:11:37
22	123491	landing_time	2021-11-01 21:52:57
23	123491	quit_time	2021-11-01 21:52:59
24	123511	landing_time	2021-11-01 17:03:25
25	123511	quit_time	2021-11-01 17:04:38
26	123511	landing_time	2021-11-01 17:04:40
27	123511	quit_time	2021-11-01 17:05:27
28	123511	landing_time	2021-11-01 17:20:51

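The requirement is to compute each user's online time while filtering out unclosed intervals.

Use the lead ... over (partition by ... order by ...) window function: lead( date_time, 1, null ) over ( PARTITION BY user_id ORDER BY date_time ) takes the time of the same user's next record as the end time.

In the CASE WHEN, the condition and ( lead( behavior_type, 1, null ) over ( PARTITION BY user_id ORDER BY date_time )) = 'quit_time' requires the next record to be a logout event. A login whose next record is not a logout gets a NULL logout_ts, which marks the interval as unclosed.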

SELECT
    date_id,
    user_id,
    -- A leading quit_time row (rn = 1) has no login: swap the columns so that
    -- its time ends up in logout_ts and login_ts stays NULL.
    CASE WHEN behavior_type = 'quit_time' THEN logout_ts ELSE login_ts END AS login_ts,
    CASE WHEN behavior_type = 'quit_time' THEN login_ts ELSE logout_ts END AS logout_ts
FROM (
    SELECT date_id, user_id, date_time AS login_ts, logout_ts, rn, behavior_type
    FROM (
        SELECT
            date_id, behavior_type, user_id, date_time,
            -- logout_ts is set only when a login is directly followed by a logout
            CASE
                WHEN behavior_type = 'landing_time'
                     AND lead(behavior_type, 1, null) OVER (PARTITION BY user_id ORDER BY date_time) = 'quit_time'
                THEN lead(date_time, 1, null) OVER (PARTITION BY user_id ORDER BY date_time)
                ELSE null
            END AS logout_ts,
            ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY date_time ASC) rn
        FROM dws_base_event_user
        WHERE dt = '${do_date}'
          AND user_id <> '0'
          AND (behavior_type = 'landing_time' OR behavior_type = 'quit_time')
    ) tt
    -- keep every login, plus a logout only when it is the user's first record
    WHERE behavior_type = 'landing_time' OR (behavior_type = 'quit_time' AND rn = 1)
) ttt;

Result:

id	date	user	login_ts	logout_ts
1	20211101	101	NULL	2021-11-01 10:34:08
2	20211101	101	2021-11-01 10:34:14	2021-11-01 10:34:21
3	20211101	101	2021-11-01 10:34:25	2021-11-01 10:39:14
4	20211101	101	2021-11-01 10:47:34	2021-11-01 10:48:05
5	20211101	101	2021-11-01 14:19:09	2021-11-01 14:20:28
6	20211101	101	2021-11-01 14:36:30	2021-11-01 14:41:38
7	20211101	101	2021-11-01 14:55:41	NULL
8	20211101	123062	2021-11-01 10:07:07	NULL
9	20211101	123062	2021-11-01 10:20:20	NULL
10	20211101	123062	2021-11-01 15:06:10	NULL
11	20211101	123062	2021-11-01 15:08:48	NULL
12	20211101	123062	2021-11-01 15:21:57	NULL
13	20211101	123062	2021-11-01 15:41:01	NULL
14	20211101	123062	2021-11-01 16:37:50	NULL
15	20211101	123062	2021-11-01 16:50:47	NULL
16	20211101	123062	2021-11-01 17:11:37	NULL
17	20211101	123491	2021-11-01 21:52:57	2021-11-01 21:52:59
18	20211101	123511	2021-11-01 17:03:25	2021-11-01 17:04:38
19	20211101	123511	2021-11-01 17:04:40	2021-11-01 17:05:27
20	20211101	123511	2021-11-01 17:20:51	NULL

With this result, filter out the rows that contain NULL (the unclosed intervals) and compute the durations.
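
The whole pipeline (pair each login with the next record, drop unclosed intervals, sum the durations) can be sketched in Python as follows. The events are a subset of the sample rows shown earlier (users 123511 and 123491); the function names are hypothetical, and the sketch assumes timestamps always use the fixed yyyy-MM-dd HH:mm:ss layout so that string order equals time order.

```python
from datetime import datetime
from itertools import groupby

FMT = "%Y-%m-%d %H:%M:%S"


def closed_sessions(events):
    """events: (user_id, behavior_type, date_time) tuples.

    Returns (user_id, login_ts, logout_ts) for closed intervals only, i.e.
    a landing_time row directly followed by a quit_time row of the same user.
    """
    sessions = []
    ordered = sorted(events, key=lambda e: (e[0], e[2]))
    for user_id, group in groupby(ordered, key=lambda e: e[0]):
        rows = list(group)
        # zip(rows, rows[1:]) pairs each row with its successor, like LEAD(..., 1).
        for current, following in zip(rows, rows[1:]):
            if current[1] == "landing_time" and following[1] == "quit_time":
                sessions.append((user_id, current[2], following[2]))
    return sessions


def online_seconds(sessions):
    """Total online time per user, in seconds."""
    totals = {}
    for user_id, login_ts, logout_ts in sessions:
        delta = datetime.strptime(logout_ts, FMT) - datetime.strptime(login_ts, FMT)
        totals[user_id] = totals.get(user_id, 0) + int(delta.total_seconds())
    return totals


events = [
    ("123511", "landing_time", "2021-11-01 17:03:25"),
    ("123511", "quit_time", "2021-11-01 17:04:38"),
    ("123511", "landing_time", "2021-11-01 17:04:40"),
    ("123511", "quit_time", "2021-11-01 17:05:27"),
    ("123511", "landing_time", "2021-11-01 17:20:51"),  # unclosed, dropped
    ("123491", "landing_time", "2021-11-01 21:52:57"),
    ("123491", "quit_time", "2021-11-01 21:52:59"),
]
sessions = closed_sessions(events)
totals = online_seconds(sessions)
```

For these rows user 123511 has two closed intervals (73 s and 47 s) and user 123491 has one (2 s); the trailing login at 17:20:51 has no logout and is ignored.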

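Computing the duration

unix_timestamp() returns the current time as a Unix timestamp, a bigint count of seconds since the epoch.
unix_timestamp(ymdhms) converts a time string in the default yyyy-MM-dd HH:mm:ss format, such as 2018-05-23 07:15:50, into a Unix timestamp in seconds.
unix_timestamp() - unix_timestamp(ymdhms) subtracts the two timestamps; since both are in seconds, the result is the number of seconds between the two times.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) % 60 AS int) is the seconds component of the difference.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) / 60 AS int) % 60 is the minutes component.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) / (60 * 60) AS int) % 24 is the hours component.
CAST((unix_timestamp() - unix_timestamp(ymdhms)) / (60 * 60 * 24) AS int) is the number of whole days.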
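
The arithmetic behind those four expressions is easier to check on a concrete number. A small Python sketch, assuming a non-negative difference in seconds (`split_duration` is a hypothetical helper name):

```python
def split_duration(total_seconds):
    """Split a non-negative number of seconds into (days, hours, minutes, seconds)."""
    minutes, seconds = divmod(total_seconds, 60)  # seconds = total % 60
    hours, minutes = divmod(minutes, 60)          # minutes = (total // 60) % 60
    days, hours = divmod(hours, 24)               # hours = (total // 3600) % 24
    return days, hours, minutes, seconds


# 93784 s = 1 day (86400) + 2 hours (7200) + 3 minutes (180) + 4 seconds
parts = split_duration(93784)
```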

Date format conversion

Method 1: from_unixtime + unix_timestamp
 
-- convert 20171205 to 2017-12-05
select from_unixtime(unix_timestamp('20171205','yyyyMMdd'),'yyyy-MM-dd');
 
-- convert 2017-12-05 to 20171205
select from_unixtime(unix_timestamp('2017-12-05','yyyy-MM-dd'),'yyyyMMdd');

Method 2: date_format

-- convert 2017-12-05 to 20171205
select date_format('2017-12-05','yyyyMMdd');

-- number of days between two dates; the dates must be in yyyy-MM-dd format
SELECT datediff('2021-07-25', '2021-07-01');
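
For cross-checking the expected values outside Hive, the same conversions can be reproduced with Python's datetime module. This is a sketch with hypothetical helper names. Note that the pattern letters differ: Python writes the month as %m, whereas in Hive's Java-style patterns MM is the month and mm is the minute.

```python
from datetime import datetime


def reformat_date(value, src_fmt, dst_fmt):
    """Parse value with src_fmt and render it with dst_fmt."""
    return datetime.strptime(value, src_fmt).strftime(dst_fmt)


def date_diff(end, start, fmt="%Y-%m-%d"):
    """Whole days from start to end, like datediff(end, start)."""
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days


dashed = reformat_date("20171205", "%Y%m%d", "%Y-%m-%d")
compact = reformat_date("2017-12-05", "%Y-%m-%d", "%Y%m%d")
days = date_diff("2021-07-25", "2021-07-01")
```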

Dynamic partitions & static partitions


INSERT OVERWRITE TABLE dwt_user_measure_topic PARTITION(dt,measure='UB005_') 
select
  u.date_id, 
  'UB004' measure_id, 
  u.count_num measure_value, 
  cast(u.parent_app_id as STRING) parent_app_id, 
  null province,
  null city,
  null measure_hour,
  null extend,
  '注册用户日活跃量' remark, 
  date_format(current_timestamp, 'yyyyMMddHHmmss') etl_stamp ,
  from_unixtime(unix_timestamp(u.date_id,'yyyyMMdd'),'yyyy-MM-dd') dt  -- MM is the month; mm would mean minutes
from 
  (SELECT 
      count(DISTINCT(case when u1.is_visitor='0' then u1.user_id else null end)) count_num, 
      u2.parent_id parent_app_id,u1.date_id
    from 
      dws_base_event_user u1 
      LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id 
    GROUP BY u2.parent_id,u1.date_id) u

Hive reports an error

FAILED: SemanticException [Error 10094]: Line 4:56 Dynamic partition cannot be the parent of a static partition ''UB005_''

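Note that a dynamic partition column must not be the parent of a static one: the primary partition cannot be dynamic while the sub-partition is static, because every primary partition would then have to create the sub-partition defined by the static value.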
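
As a rough model of the rule behind Error 10094, the Python sketch below scans a PARTITION(...) spec in order and reports the first static column that appears after a dynamic one. The function name and the (column, value) representation are invented for illustration; Hive's own check is part of its semantic analysis and is not reproduced here.

```python
def first_invalid_static(partition_spec):
    """partition_spec: list of (column, value) in PARTITION(...) order.

    value is None for a dynamic partition column and a literal for a static
    one. Returns the first static column that follows a dynamic column, or
    None if the spec respects the ordering rule.
    """
    seen_dynamic = False
    for column, value in partition_spec:
        if value is None:
            seen_dynamic = True
        elif seen_dynamic:
            return column
    return None


# PARTITION(dt, measure='UB005_'): dynamic parent, static child, rejected
rejected = first_invalid_static([("dt", None), ("measure", "UB005_")])
# PARTITION(dt, measure): both dynamic, accepted
accepted = first_invalid_static([("dt", None), ("measure", None)])
```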

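Changed to the following, with both partition columns dynamic (this also requires hive.exec.dynamic.partition.mode=nonstrict, because strict mode demands at least one static partition column):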

INSERT OVERWRITE TABLE dwt_user_measure_topic PARTITION(dt,measure) 
select
  u.date_id, 
  'UB004' measure_id, 
  u.count_num measure_value, 
  cast(u.parent_app_id as STRING) parent_app_id, 
  null province,
  null city,
  null measure_hour,
  null extend,
  '注册用户日活跃量' remark, 
  date_format(current_timestamp, 'yyyyMMddHHmmss') etl_stamp ,
  from_unixtime(unix_timestamp(u.date_id,'yyyyMMdd'),'yyyy-MM-dd') dt,  -- MM is the month; mm would mean minutes
  'UB005_' measure
from 
  (SELECT 
      count(DISTINCT(case when u1.is_visitor='0' then u1.user_id else null end)) count_num, 
      u2.parent_id parent_app_id,u1.date_id
    from 
      dws_base_event_user u1 
      LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id 
    GROUP BY u2.parent_id,u1.date_id) u

Looping in shell (while loop over dates)

# $1 and $2 are parameters passed in by the Hue workflow:
# the start date (inclusive) and the end date (exclusive), both as yyyy-MM-dd
first=$1
second=$2
# Hive CLI to run; falls back to "hive" on PATH when $hive is not set by the caller
hive=${hive:-hive}

while [ "$first" != "$second" ]
do
echo "$first"

sql_topic="
use dd_database_bigdata;
-- DWT layer: user activity topic table
insert overwrite table dwt_user_active_topic
select 
	nvl (new.user_id, old.user_id) user_id,
	nvl (new.parent_app_id, old.parent_app_id) parent_app_id,
	IF (old.user_id IS NULL,new.mobile_type,old.mobile_type_first) mobile_type_first,
	IF (old.user_id IS NULL,new.province,old.province_first) province_first,
	IF (old.user_id IS NULL,new.city,old.city_first) city_first,
	IF (old.login_date_first IS NULL,'$first',old.login_date_first) login_date_first,
	IF (new.user_id IS NOT NULL,from_unixtime(unix_timestamp(new.date_id,'yyyyMMdd'),'yyyy-MM-dd'),old.login_date_last) login_date_last,
	IF (new.user_id IS NOT NULL,new.active_count,0) login_day_count,
	nvl (old.login_count, 0) +nvl (new.active_count, 0) login_count,
	nvl (old.log_count, 0) +IF (new.active_count > 0, 1, 0) log_count,
	nvl (new.is_visitor, old.is_visitor) is_visitor
FROM
	(SELECT * FROM dwt_user_active_topic) old
	FULL OUTER JOIN 
	(SELECT u1.*,u2.parent_id parent_app_id FROM dws_user_active_daily u1
	LEFT JOIN ods_app_config u2 on u1.app_id = u2.app_id 
    WHERE dt = '$first') new ON old.user_id = new.user_id and new.parent_app_id = old.parent_app_id;"
# quote the variable so that the * in the SQL is not expanded by the shell
echo "$sql_topic"

$hive -e "$sql_topic"

# advance to the next day (GNU date)
first=$(date -d "${first} +1 day" +%Y-%m-%d)
done
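
The loop's date stepping can be sketched in Python to check which dates a given pair of arguments will cover. The helper name is hypothetical. One deliberate difference from the shell loop is that the sketch stops with `<` rather than `!=`, so an end date earlier than the start date yields nothing instead of looping forever.

```python
from datetime import date, timedelta


def date_range(first, second):
    """Yield yyyy-MM-dd dates from first (inclusive) to second (exclusive)."""
    current = date.fromisoformat(first)
    end = date.fromisoformat(second)
    while current < end:
        yield current.isoformat()
        current += timedelta(days=1)


days_to_process = list(date_range("2021-10-30", "2021-11-02"))
```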

More to be added.