Hive Data Manipulation Language (DML)

1. Hive's join strategies and when to use each: map-side join versus reduce-side join
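As a rough illustration of the two strategies, the sketch below joins a large table first against a small one and then against another large one. The tables `big_orders`, `small_dim` and `big_payments` and their columns are hypothetical. `hive.auto.convert.join` and the `MAPJOIN` hint are standard Hive features, but whether a map-side join is actually chosen depends on the Hive version and its small-table size threshold, and newer versions may ignore the hint by default.

```sql
-- Map-side join: the small table is loaded into memory on each mapper,
-- so the join needs no shuffle or reduce phase.
set hive.auto.convert.join=true;   -- let Hive convert eligible joins automatically
select /*+ MAPJOIN(d) */ o.orderid, d.name
from big_orders o
join small_dim d on (o.dim_id = d.id);

-- Reduce-side (common) join: both tables are shuffled by the join key
-- and matched in the reducers; this is the fallback for two large tables.
set hive.auto.convert.join=false;
select o.orderid, p.amount
from big_orders o
join big_payments p on (o.orderid = p.orderid);
```

A map-side join suits a large fact table joined to a small dimension table; a reduce-side join is the general case when neither side fits in memory.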
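2. Where to place single-table filter conditions in complex queries: the difference between where and on, including column pruning and partition pruning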

Column pruning:

select * from order_table limit 10;
//optimized: read only the columns that are needed
select user,item from order_table limit 10;

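Partition pruning: include the partition column in the query conditions:

select count(orderid) from order_table where dt='2014-11-11' and to_date(sale_time)='2014-11-11' and hour(sale_time)=10;

To check which partitions a query scans, use explain dependency; its output shows the partitions that will be read, which helps when tuning HiveQL.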
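A minimal sketch of that check, reusing the `order_table` and `dt` names from the query above. The result is a JSON document with `input_tables` and `input_partitions` entries; the exact layout may vary between Hive versions.

```sql
-- Lists the tables and partitions the query would read, without running it.
explain dependency
select count(orderid)
from order_table
where dt='2014-11-11';
```

If `input_partitions` lists partitions other than dt=2014-11-11, the partition filter is not being applied and the query is scanning more data than intended.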

When consecutive joins use the same key, Hive can run them in fewer jobs:

select a.val,b.val,c.val from a join b on (a.key = b.key1) join c on (b.key1 = c.key); //do not use c.key = b.key2

Using multi-insert and union all
Using multi-insert:

insert overwrite table tmp1 select ... from a where condition1;
insert overwrite table tmp2 select ... from a where condition2;
//optimized: scan a only once
from a
insert overwrite table tmp1 select ... where condition1
insert overwrite table tmp2 select ... where condition2;

Avoid Cartesian products:

select ...
from order_table ot
left outer join (
select * from order_table_info oti
where (oti.mobile <> 'unknown' or oti.imsi <> 'unknown')
and oti.imei <> 'unknown'
and oti.pt='$data_desc'
) a
on ot.app_id=a.app_id and ot.imei=a.imei
//filter early in the subquery, and put the join conditions in on

Filter out unneeded data before the join:

//the query below should filter the partitions first, in subqueries, and then join
select a.val,b.val from a left outer join b on (a.key=b.key) where a.ds='2009-07-07' and b.ds='2009-07-07'
//optimized
select x.val,y.val from (select key,val from a where a.ds='2009-07-07') x left outer join (select key,val from b where b.ds='2009-07-07') y
on x.key=y.key

3. Usage of, and differences between, distribute by / sort by / cluster by / order by

//distinct vs. group by: avoid distinct where possible and use group by instead
select distinct key from table
select key from table group by key

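//order by optimization
1. order by produces a globally ordered result, so the sort runs in a single reducer and cannot be parallelized.
2. sort by sorts only within each reducer: each reducer's output is ordered, but the overall result is not. It is more efficient and is usually combined with distribute by.
3. cluster by col is equivalent to distribute by col sort by col, but the sort direction cannot be specified.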
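The three points above can be sketched as follows. `order_table`, `orderid` and `sale_time` come from the earlier examples; the `user_id` column is a hypothetical addition for illustration.

```sql
-- order by: total order over the whole result, produced by a single reducer.
select user_id, orderid from order_table order by user_id;

-- distribute by + sort by: rows with the same user_id go to the same reducer,
-- and each reducer sorts its own rows; the direction can be set per column.
select user_id, orderid, sale_time
from order_table
distribute by user_id
sort by user_id asc, sale_time desc;

-- cluster by: shorthand for distribute by user_id sort by user_id.
select user_id, orderid from order_table cluster by user_id;
```

When only per-key ordering is needed (for example, each user's orders by time), the second form avoids funnelling all rows through one reducer.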

4. Using Hive functions:
sysdate() (developed in-house at JD), date_format(), concat, substr, like, split, explode, etc.

select col1,col2,newCol from table lateral view explode(myCol) adTable as newCol

//use explode to turn one row into several, then summarize orders along different dimensions
select type,code,sum(sales) from(
	select split(part,'_')[1] as type,
			split(part,'_')[0] as code,
			sales from order_table LATERAL VIEW explode(split(concat(province,'_1-',city,'_2-',country,'_3'),'-'))
			adTable AS part
)df group by type,code;

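5. Notes on union all
Across different tables, union all behaves like multiple inputs; over the same table, it is like a single map pass that emits multiple records.
Hive's handling of union all is limited to non-nested queries.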

select * from (
select c1,c2,c3 from t1 group by c1,c2,c3
union all
select c1,c2,c3 from t2 group by c1,c2,c3)
t3;
//if t1 and t2 are small, use the form below
select c1,c2,c3 from
(select c1,c2,c3 from t1
union all
select c1,c2,c3 from t2
) t3
group by c1,c2,c3;

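When union all spans too many different tables, load each source into its own partition of a temporary table instead:

insert overwrite table test partition (flag='1')
select sndaid,mobile from test1;
insert overwrite table test partition (flag='2')
select sndaid,mobile from test2;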

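6. Causes of data skew and how to resolve it
//data skew
The job stays at 99% for a long time, and the task page shows that only a few reduce subtasks are unfinished. Those reducers receive far more data than the others: the record count of a single reducer is much larger than the average.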
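The notes stop at the symptom, so the following is only a sketch of commonly used mitigations, not necessarily the ones the author had in mind. The three settings are standard Hive configuration properties; the query and its `user_id` column are hypothetical.

```sql
-- Partial aggregation on the map side reduces the rows shuffled to each reducer.
set hive.map.aggr=true;
-- Run a skewed group by as two jobs: the first spreads keys randomly across
-- reducers for partial aggregation, the second merges results by the real key.
set hive.groupby.skewindata=true;
-- Process heavily skewed join keys in a separate follow-up map join.
set hive.optimize.skewjoin=true;

select user_id, count(orderid)
from order_table
group by user_id;
```

These settings trade an extra job for better balance, so they pay off only when a few keys dominate the data.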

7. More efficient ways to query Hive partitioned tables (the kinds of partitioned tables common in an EDW)

8. Multi-table insert

9. Joining on NULL values in Hive:
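A join key that is NULL in many rows can never match, yet in a reduce-side join all those rows are shuffled to the same reducer. Below is a minimal sketch of two usual workarounds, assuming hypothetical tables `log` and `users` joined on a string column `user_id`, with `name` as a hypothetical column of `users`.

```sql
-- Option 1: join only the non-NULL keys, then add the NULL-key rows back unmatched.
select t.user_id, t.name from (
  select a.user_id, b.name
  from log a
  left outer join users b on (a.user_id = b.user_id)
  where a.user_id is not null
  union all
  select a.user_id, cast(null as string) as name
  from log a
  where a.user_id is null
) t;

-- Option 2: replace NULL keys with random values so they spread across reducers;
-- the 'hive_' prefix is assumed not to collide with any real user_id.
select a.user_id, b.name
from log a
left outer join users b
on (case when a.user_id is null then concat('hive_', rand()) else a.user_id end = b.user_id);
```

Option 2 reads `log` only once, which usually makes it the cheaper of the two.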

10. Dynamic partitions:

//set parameters
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
//create the partitioned table
create table test(
sndaid string,
mobile string
)partitioned by (dt string)
stored as rcfile;
//insert data with dynamic partitioning; the partition column dt must come last in the select list
insert overwrite table test partition(dt)
select sndaid,mobile,dt from test2;

11. Data sampling:
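The notes give no example here, so the following is only a sketch of Hive's `tablesample` clause applied to the `order_table` used earlier; the sampling fractions are arbitrary.

```sql
-- Bucket sampling: hash rows into 10 buckets by a random value and keep bucket 1,
-- i.e. roughly 10% of the rows. This still scans the whole table.
select * from order_table tablesample(bucket 1 out of 10 on rand()) s;

-- Block sampling: read roughly 10 percent of the input data (by size, not by rows).
select * from order_table tablesample(10 percent) s;
```

Sampling on a column the table is already bucketed by, instead of `rand()`, lets Hive read only the matching buckets rather than the full table.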
