A Summary of Common Hive Usage Issues (Part 1)

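Dear reader: I'm glad to have my articles read, but writing and editing original content takes effort. If you repost this article, please credit the source and include a link to the original post as well as the author's blog: https://blog.csdn.net/vensmallzeng. If you found the article helpful, a like is much appreciated. Thank you!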

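Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides a simple SQL query interface, translating SQL statements into MapReduce jobs for execution.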

Some commonly used commands are summarized below:

1. List the partition values of the table student_comment_behavior

SHOW PARTITIONS student_comment_behavior;

2. Show the CREATE statement of the table student_comment_behavior

SHOW CREATE TABLE student_comment_behavior;

3. Show the columns of the table student_comment_behavior

DESC student_comment_behavior;

4. Query the unionid values in partition 20191119 of the table student_comment_behavior

SELECT unionid FROM student_comment_behavior where dt = 20191119 LIMIT 1000

5. Return the unionid and commentid of users with traveltypeid = 1, i.e. users with the family attribute

SELECT unionid, commentid FROM student_comment_behavior_base where dt = 20191119 and traveltypeid = 1 LIMIT 1000

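6. Building on item 5, return the comments of users with the family attribute. Note that with LEFT JOIN every row of a is returned whether or not it has a match in b; use JOIN instead if only the comments of family users should be kept.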

Select a.content
From
(SELECT * from teacher_commentitem_stg where tablepartition='commentitem7') a
Left join
(SELECT unionid, commentid FROM student_comment_behavior_base
where dt = 20191119 and traveltypeid = 1) b
On a.parentid = b.commentid
limit 100

7. Change the partition filter to widen the query range

Select a.content
From
(SELECT * from teacher_commentitem_stg where tablepartition > '1') a
Left join
(SELECT unionid, commentid FROM student_comment_behavior_base
where dt = 20191119 and traveltypeid = 1) b
On a.parentid = b.commentid

8. Return all comments whose content contains "亲子" (parent-child)

Select a.content
From
(SELECT * from teacher_commentitem_stg where tablepartition > '1' and content like '%亲子%') a
Left join
(SELECT unionid, commentid FROM student_comment_behavior_base
where dt = 20191119 and traveltypeid = 1) b
On a.parentid = b.commentid
limit 10

9. Widen the family-related search to more keywords (parent-child, child, kid, daughter, son, family)

Select a.content
From
(SELECT * from teacher_commentitem_stg where tablepartition > '1' and (content like '%亲子%' or content like '%孩子%' or content like '%小孩%' or content like '%女儿%' or content like '%儿子%' or content like '%家庭%')) a
Left join
(SELECT unionid, commentid FROM student_comment_behavior_base
where dt = 20191119 and traveltypeid = 1) b
On a.parentid = b.commentid

10. Write the query results into the table zzh_comment_20191122

CREATE table zzh_comment_20191122 as
Select a.content
From
(SELECT * from teacher_commentitem_stg where tablepartition > '1' and (content like '%亲子%' or content like '%孩子%' or content like '%小孩%' or content like '%女儿%' or content like '%儿子%' or content like '%家庭%')) a
Left join
(SELECT unionid, commentid FROM student_comment_behavior_base
where dt > '1' and traveltypeid = 1) b
On a.parentid = b.commentid

11. Deduplicate the query results with DISTINCT. If the table is very large, DISTINCT is best avoided; consider deduplicating with GROUP BY instead.

CREATE table zzh_comment_20191122 as
Select distinct a.content
From
(SELECT * from student_commentitem_stg where tablepartition > '1' and (content like '%亲子%' or content like '%孩子%' or content like '%小孩%' or content like '%女儿%' or content like '%儿子%' or content like '%家庭%')) a
Left join
(SELECT unionid, commentid FROM student_comment_behavior_base
where dt > '1' and traveltypeid = 1) b
On a.parentid = b.commentid
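As a quick sanity check that DISTINCT and GROUP BY deduplicate to the same rows, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for Hive, and the `comments` table and its rows are made up for illustration; it says nothing about how either form performs on a real cluster.

```python
import sqlite3

# SQLite stands in for Hive here; the table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (parentid INTEGER, content TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        (1, "great for kids"),
        (2, "great for kids"),
        (3, "quiet room"),
        (4, "quiet room"),
        (5, "nice pool"),
    ],
)

# Deduplicate with DISTINCT.
distinct_rows = conn.execute(
    "SELECT DISTINCT content FROM comments ORDER BY content"
).fetchall()

# Deduplicate with GROUP BY on the same column.
grouped_rows = conn.execute(
    "SELECT content FROM comments GROUP BY content ORDER BY content"
).fetchall()

print(distinct_rows == grouped_rows)  # True
print(len(distinct_rows))             # 3
```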

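12. What are the differences between join, left join, right join, full join, left semi join, and cross join?

join: an inner join by default; returns only the rows that have a match in both tables.

left join: the table on the left is the main table. Every row of the main table appears in the result at least once (more than once if it matches several rows on the right), and the columns of the right table are NULL where there is no match. The OUTER keyword is optional: left join and left outer join are equivalent.

right join: the mirror image of left join. The table on the right is the main table, every one of its rows is kept, and the columns of the left table are NULL where there is no match.

full join: a full outer join; returns the rows of both tables, matched where possible, with NULL in the columns of whichever side has no match. Hive does not apply the mapjoin optimization to a full join.

left semi join: the table before the keyword is the main table; returns the rows of that table whose key has a match in the other table under the ON condition. Only columns of the main table can be selected, and each of its rows appears at most once.

cross join (Cartesian product): returns the Cartesian product of the two tables; no join key is required.

Note: in Hive, specify the join keys in the ON clause, not in WHERE; otherwise Hive may first compute the Cartesian product and only then filter.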
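The row counts produced by the different join types can be checked on a tiny example. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive; the tables `a` and `b` are hypothetical. SQLite has no LEFT SEMI JOIN, so it is emulated with EXISTS, and right join and full join are left out because older SQLite versions do not support them.

```python
import sqlite3

# SQLite stands in for Hive here; tables a and b are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, name TEXT);
CREATE TABLE b (id INTEGER, tag TEXT);
INSERT INTO a VALUES (1, 'a1'), (2, 'a2'), (3, 'a3');
INSERT INTO b VALUES (1, 'x'), (1, 'y'), (4, 'z');
""")

queries = {
    # Only keys present in both tables; id 1 matches two rows of b.
    "inner": "SELECT a.id, b.tag FROM a JOIN b ON a.id = b.id",
    # Every row of a is kept; id 1 appears twice, ids 2 and 3 get NULL tags.
    "left": "SELECT a.id, b.tag FROM a LEFT JOIN b ON a.id = b.id",
    # Emulates LEFT SEMI JOIN: rows of a with a match, each at most once.
    "semi": "SELECT a.id FROM a WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id)",
    # Cartesian product: 3 rows x 3 rows.
    "cross": "SELECT a.id, b.id FROM a CROSS JOIN b",
}

counts = {name: len(conn.execute(sql).fetchall()) for name, sql in queries.items()}
print(counts)  # {'inner': 2, 'left': 4, 'semi': 1, 'cross': 9}
```

Note that the left join returns 4 rows although `a` has only 3: a key that matches several rows on the right is repeated.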

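12. Run the query for comments that contain the family keywords and the query for comments that do not, combine them with UNION ALL, deduplicate with "row_number() over(PARTITION BY parentid ORDER BY parentid DESC) rank" by keeping only the rows where rank = 1, and write the final result into the table student_family_comment_acquire.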

CREATE TABLE if not exists student_family_comment_acquire(
    `unionid`      string comment 'unionid',
    `comment`      string comment 'comment',
    `label`      string comment 'label'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
location '/data/hotel/dev/student_family_comment_acquire';


insert overwrite table student_family_comment_acquire

Select b.unionid, c.content, 1
From
( SELECT * FROM
(SELECT *, row_number() over(PARTITION BY parentid ORDER BY parentid DESC) rank from dev_hotel.hotel_revenue_commentitem_stg where tablepartition > '1' and (content like '%亲子%' or content like '%孩子%' or content like '%小孩%' or content like '%女儿%' or content like '%儿子%' or content like '%家庭%')) a 
where a.rank=1) c
Left join 
(SELECT commentid, unionid  FROM dev_hotel.user_comment_behavior_base
 where dt > '1' and traveltypeid = 1) b 
On c.parentid = b.commentid

union all

Select b.unionid, c.content, 0
From
( SELECT * FROM
(SELECT *, row_number() over(PARTITION BY parentid ORDER BY parentid DESC) rank from dev_hotel.hotel_revenue_commentitem_stg where tablepartition > '1' and (content not like '%亲子%' and content not like '%孩子%' and content not like '%小孩%' and content not like '%女儿%' and content not like '%儿子%' and content not like '%家庭%')) a 
where a.rank=1) c

Left join 

(SELECT commentid, unionid  FROM dev_hotel.user_comment_behavior_base
 where dt > '1' and traveltypeid = 1) b 
On c.parentid = b.commentid

ORDER BY rand()
LIMIT 10000000
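The row_number() deduplication pattern can be tried out on a small table. This is a sketch using Python's built-in sqlite3 module as a stand-in for Hive (window functions need SQLite 3.25 or newer); the `items` table and its `created` column are hypothetical. Because ordering by the partition column itself leaves the choice of the surviving row arbitrary, the sketch orders by `created` to make the pick deterministic.

```python
import sqlite3

# SQLite stands in for Hive here; the items table and the created column are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (parentid INTEGER, content TEXT, created INTEGER)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (10, "first reply", 1),
        (10, "second reply", 2),
        (20, "only reply", 1),
    ],
)

# Number the rows inside each parentid group, newest first, and keep row 1.
rows = conn.execute("""
    SELECT parentid, content
    FROM (
        SELECT parentid, content,
               row_number() OVER (PARTITION BY parentid ORDER BY created DESC) AS rn
        FROM items
    ) t
    WHERE rn = 1
    ORDER BY parentid
""").fetchall()

print(rows)  # [(10, 'second reply'), (20, 'only reply')]
```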

13. Take 100,000 positive and 100,000 negative samples, such that no id appears in both the positive and the negative set.

CREATE TABLE if not exists dev_hotel.user_neg_pos_comment_acquire(
`unionid` string comment 'unionid',
`comment` string comment 'comment',
`label` string comment 'label'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
location '/data/hotel/dev/user_neg_pos_comment_acquire';



with sam_pos as (SELECT b.unionid, a.content FROM(
SELECT * from dev_hotel.hotel_revenue_commentitem_stg where tablepartition > '1'
and parentid > 0
and content not like '%尊敬%' and content not like '%亲爱%' and content not like '%点评%'
and content not like '%好评%' and content not like '%评论%' and content not like '%评价%' and content not like '%宾客%'
and content not like '%您好%' and content not like '%感谢您%' and content not like '%亲,%' and content not like '%客官%'
and content not like '%欢迎%' and (content like '%亲子%' or content like '%孩子%' or content like '%小孩%'
or content like '%女儿%' or content like '%儿子%' or content like '%家庭%')
) a

Left join

(SELECT commentid, unionid FROM (
select u.unionid, h_index.commentid
from dev_hotel.hotel_revenue_commentindex_stg h_index
join dev_hotel.user_portrait_unionid u
on h_index.userid = u.e_cardno
where u.unionid is NOT NULL and u.unionid <> ''
) user_comment
) b
On a.parentid = b.commentid
where b.commentid is not null and b.commentid <> ''
),


sam_total as (
SELECT b.unionid, a.content FROM
(SELECT * from dev_hotel.hotel_revenue_commentitem_stg where tablepartition > '1'
and parentid > 0
) a

Left join

(SELECT commentid, unionid FROM (
select u.unionid, h_index.commentid
from dev_hotel.hotel_revenue_commentindex_stg h_index
join dev_hotel.user_portrait_unionid u
on h_index.userid = u.e_cardno
where u.unionid is NOT NULL and u.unionid <> ''
) user_comment
) b
On a.parentid = b.commentid
where b.commentid is not null and b.commentid <> ''
)


-- ORDER BY / LIMIT apply to a single branch of a UNION ALL only when that branch is wrapped in a subquery
insert overwrite table dev_hotel.user_neg_pos_comment_acquire

select unionid, content, label
from (select unionid, content, 1 as label
from sam_pos
ORDER BY rand()
LIMIT 100000
) pos

union all

select unionid, content, label
from (select sam_total.unionid, sam_total.content, 0 as label
from sam_total
left join sam_pos on sam_total.unionid = sam_pos.unionid
where sam_pos.unionid is null
ORDER BY rand()
LIMIT 100000
) neg
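The negative set stays disjoint from the positive set because of the "left join ... where sam_pos.unionid is null" pattern (an anti-join): any user who has at least one positive comment is dropped from the negatives entirely. A minimal sketch of that step, using Python's built-in sqlite3 module as a stand-in for Hive and made-up rows for sam_total and sam_pos:

```python
import sqlite3

# SQLite stands in for Hive here; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sam_total (unionid TEXT, content TEXT)")
conn.execute("CREATE TABLE sam_pos (unionid TEXT, content TEXT)")
conn.executemany(
    "INSERT INTO sam_total VALUES (?, ?)",
    [
        ("u1", "trip with my kids"),
        ("u1", "good breakfast"),
        ("u2", "quiet room"),
        ("u3", "close to the station"),
    ],
)
conn.executemany(
    "INSERT INTO sam_pos VALUES (?, ?)",
    [("u1", "trip with my kids")],
)

# Anti-join: keep only rows of sam_total whose unionid never appears in sam_pos.
sam_neg = conn.execute("""
    SELECT t.unionid, t.content
    FROM sam_total t
    LEFT JOIN sam_pos p ON t.unionid = p.unionid
    WHERE p.unionid IS NULL
    ORDER BY t.unionid
""").fetchall()

pos_ids = {row[0] for row in conn.execute("SELECT unionid FROM sam_pos")}
neg_ids = {row[0] for row in sam_neg}

print(sam_neg)            # [('u2', 'quiet room'), ('u3', 'close to the station')]
print(pos_ids & neg_ids)  # set()
```

User u1's neutral comment "good breakfast" is excluded as well, since the exclusion works per user id, not per comment.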

14. Retrieve c.order_id, d.room_type, and c.person_number to check whether the room type matches the number of guests

With order_map_customer as (SELECT b.customer_id, a.order_id, b.reser_no, a.room_type FROM
(select * from dev_hotel.user_portrait_order_base) a
join
(select * from base_elong.dshord_reserve_guests where customer_id > 0) b
On a.order_id = b.reser_no
where cast (b.reser_no as string) <> '')


SELECT c.order_id, d.room_type, c.person_number from
(SELECT order_id, count(DISTINCT customer_id) as person_number FROM order_map_customer
group by order_id) c
join
(select * from order_map_customer) d
On c.order_id = d.reser_no
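The core of this check is counting distinct guests per order. As a sketch of that aggregation only, the example below uses Python's built-in sqlite3 module as a stand-in for Hive, with a hypothetical, simplified order_map_customer table. It groups by order_id and room_type in a single pass, which returns one row per order under the assumption that each order has a single room type.

```python
import sqlite3

# SQLite stands in for Hive here; the table layout and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_map_customer (order_id INTEGER, customer_id INTEGER, room_type TEXT)"
)
conn.executemany(
    "INSERT INTO order_map_customer VALUES (?, ?, ?)",
    [
        (100, 1, "double"),
        (100, 2, "double"),
        (100, 2, "double"),  # the same guest listed twice is counted once
        (200, 3, "single"),
    ],
)

# Count distinct guests per order, keeping the room type alongside.
rows = conn.execute("""
    SELECT order_id, room_type, COUNT(DISTINCT customer_id) AS person_number
    FROM order_map_customer
    GROUP BY order_id, room_type
    ORDER BY order_id
""").fetchall()

print(rows)  # [(100, 'double', 2), (200, 'single', 1)]
```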

Recommended reading:

[1] https://www.cnblogs.com/lixiaochun/p/9446350.html

[2] https://blog.csdn.net/mrlevo520/article/details/74906302

[3] https://blog.csdn.net/xiaoshunzi111/article/details/48727831

[4] http://lxw1234.com/archives/2015/06/315.htm

Knowledge builds up day by day; let's keep improving together. More notes will be added over time. To be continued.


Reposted from blog.csdn.net/Vensmallzeng/article/details/103249597