Table of Contents

1. Fetch Property
2. Local Mode
3. Table Optimization
4. Handling Data Skew
5. Parallel Execution
6. JVM Reuse
7. Speculative Execution
8. Compression

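1. Fetch Property

In older versions of Hive, hive.fetch.task.conversion defaults to minimal in hive-default.xml.template. Once it is changed to more, full-table selects (SELECT *), column selects, and LIMIT queries are served directly by a fetch task without launching MapReduce.
In newer versions of Hive, the default is already more.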

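2. Local Mode

When the data volume is small, a query can run in local mode on a single machine.
This is enabled by setting hive.exec.mode.local.auto to true.
The setting can be applied directly in the Hive CLI or in beeline.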

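set hive.exec.mode.local.auto=true;
-- Maximum total input size; local mode is used when the input is smaller than this. Default: 134217728 (128 MB)
set hive.exec.mode.local.auto.inputbytes.max=536870912;
-- Maximum number of input files; local mode is used when the file count is smaller than this. Default: 4
set hive.exec.mode.local.auto.input.files.max=30;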

For small data volumes, this can speed up queries several-fold.
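
The two thresholds above can be read as a simple eligibility check. The following is a minimal sketch of that check only, not Hive's actual implementation; the function name `should_run_locally` and its parameters are hypothetical, and Hive may apply further conditions that are not modeled here.

```python
MB = 1024 * 1024

def should_run_locally(total_input_bytes, num_input_files,
                       max_bytes=134217728, max_files=4):
    """Return True when both inputs fall below the local-mode thresholds.

    max_bytes stands in for hive.exec.mode.local.auto.inputbytes.max and
    max_files for hive.exec.mode.local.auto.input.files.max (hypothetical
    helper, for illustration only).
    """
    return total_input_bytes < max_bytes and num_input_files < max_files

# With the defaults, a 100 MB input in 3 files qualifies; 200 MB does not.
small_job = should_run_locally(100 * MB, 3)
large_job = should_run_locally(200 * MB, 3)

# With the raised thresholds from the settings above, 200 MB in 20 files qualifies.
tuned_job = should_run_locally(200 * MB, 20, max_bytes=536870912, max_files=30)
```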

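3. Table Optimization

In a JOIN, the conventional advice is to put the small table, or the table whose keys are more scattered, on the left and the large table on the right.
In practice, the optimizer in newer Hive versions already handles this, and the order no longer makes a noticeable difference.

Null key filtering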

hive (test)> insert overwrite table jointable 
select n.* from (select * from nidtable where id is not null) n 
left join bigt b on n.id = b.id;

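Null key conversion

Sometimes rows with a null key still have to be kept. In that case the null values can be replaced with random values, so that the rows are spread fairly evenly across the reducers and data skew is avoided.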

hive (test)> set mapreduce.job.reduces = 5;
hive (test)> insert overwrite table jointable
select n.* from nidtable n full join bigt b on 
case when n.id is null then concat('hive', rand()) else n.id end = b.id;
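
To see why the random replacement helps, the sketch below simulates how keys are assigned to reducers with and without it. It is an illustration under stated assumptions: `reducer_for` uses a CRC32 hash as a stand-in for the real partitioner, and all function names are hypothetical.

```python
import random
import zlib

def reducer_for(key, num_reducers):
    # Deterministic stand-in for a hash partitioner (assumption, not Hive's own).
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers

def salt_null_key(key, rng):
    # Mirrors: case when id is null then concat('hive', rand()) else id end
    return f"hive{rng.random()}" if key is None else key

def partition_counts(keys, num_reducers, salt, seed=0):
    # Count how many rows each reducer would receive.
    rng = random.Random(seed)
    counts = [0] * num_reducers
    for key in keys:
        effective_key = salt_null_key(key, rng) if salt else key
        counts[reducer_for(effective_key, num_reducers)] += 1
    return counts

# 1000 rows with a null key plus 100 rows with ordinary keys.
keys = [None] * 1000 + list(range(100))
unsalted = partition_counts(keys, 5, salt=False)  # all nulls land on one reducer
salted = partition_counts(keys, 5, salt=True)     # nulls are spread out
```

Without the replacement, every null key hashes to the same reducer, which then receives at least 1000 rows; with it, the largest reducer load drops well below that.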

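Map-side join

hive.auto.convert.join=true is the default, so a map-side join is enabled automatically when the conditions are met.
The small-table threshold can be customized: set hive.mapjoin.smalltable.filesize=25000000

Run a statement that joins a small table to a large table: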
```
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
join bigtable  b
on s.id = b.id;
```
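When the small-table condition is met, the client starts Task A, which converts the small table s into a HashTable data structure, writes it to local disk, and loads it into the DistributedCache. Task B then launches map tasks that scan the large table b. Every mapper can read the data in the DistributedCache, so each mapper joins the two tables directly and writes its own output file, giving as many output files as there are mappers.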

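Because the join is performed on the map side and in memory, no reduce task needs to be launched and the shuffle phase is skipped, which saves resources and improves join efficiency to some extent.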
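
The build-then-probe idea behind a map-side join can be sketched in a few lines. This is a simplified in-memory illustration, not Hive's implementation; the function names and the sample rows are made up for the example.

```python
from collections import defaultdict

def build_hash_table(small_rows, key):
    # Task A in the description: load the small table into a hash table.
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table

def map_side_join(big_rows, hash_table, key):
    # Task B: each mapper streams its share of the big table and probes
    # the hash table, so no shuffle or reduce step is needed.
    for row in big_rows:
        for match in hash_table.get(row[key], []):
            yield {**match, **row}

small = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
big = [{"id": 1, "url": "x"}, {"id": 1, "url": "y"}, {"id": 3, "url": "z"}]

joined = list(map_side_join(big, build_hash_table(small, "id"), "id"))
```

Only the rows of the big table with id 1 find a match, so `joined` holds two rows; the row with id 3 is dropped, as in an inner join.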
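
Map-side aggregation

Enabled by default: hive.map.aggr=true
The threshold on the number of entries for map-side aggregation can be customized: hive.groupby.mapaggr.checkinterval=100000
Automatic load balancing (default false): hive.groupby.skewindata=true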

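With load balancing enabled, if the number of entries exceeds the threshold, the query plan launches two MR jobs. In the first job, rows are distributed to reducers randomly and each reducer performs a partial aggregation, much like a Combiner. In the second job, rows with the same group by key go to the same reducer, which completes the final aggregation. Make sure the computation stays correct when it is split this way: an average, for example, cannot be obtained by averaging partial averages.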
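
The two-job plan can be simulated as follows. This is a minimal sketch, assuming rows are `(key, value)` pairs; the function names are hypothetical. Each partial result carries a sum and a count, which is what keeps an average correct across the two stages.

```python
import random
from collections import defaultdict

def stage_one(rows, num_reducers, seed=0):
    # First job: rows go to a random reducer, which pre-aggregates per key.
    rng = random.Random(seed)
    partials = [defaultdict(lambda: [0, 0]) for _ in range(num_reducers)]
    for key, value in rows:
        acc = partials[rng.randrange(num_reducers)][key]
        acc[0] += value  # partial sum
        acc[1] += 1      # partial count
    return partials

def stage_two(partials):
    # Second job: partial results for the same key are merged.
    totals = defaultdict(lambda: [0, 0])
    for partial in partials:
        for key, (part_sum, part_count) in partial.items():
            totals[key][0] += part_sum
            totals[key][1] += part_count
    return {key: (s, c) for key, (s, c) in totals.items()}

# A skewed input: one hot key with 1000 rows, one cold key with 10 rows.
rows = [("hot", 2)] * 1000 + [("cold", 4)] * 10
partials = stage_one(rows, num_reducers=4)
totals = stage_two(partials)
averages = {key: s / c for key, (s, c) in totals.items()}
```

No single first-stage reducer sees all 1000 rows of the hot key, yet the merged totals and the averages derived from them match a single-pass aggregation.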

Deduplication
```
select count(distinct id) from bigtable;
```
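With large data volumes, deduplicate with a group by subquery instead, since count(distinct) is typically handled by a single reducer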
```
select count(id) from (select id from bigtable group by id) a;
```
This launches one extra job, but identical ids are merged on the map side.

Dynamic partitioning

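When a data file is very large and creating partitions by hand is too tedious, Hive can infer the partition names from the values of a specified column and create the partitions automatically.
For example, data can be partitioned by date, with each day's or each week's data placed in its own partition for easier querying.

Dynamic partitioning (enabled by default): hive.exec.dynamic.partition=true
Non-strict mode (in strict mode at least one partition must be specified statically): hive.exec.dynamic.partition.mode=nonstrict
Maximum total number of dynamic partitions across all nodes: hive.exec.max.dynamic.partitions=1000
Maximum number of dynamic partitions on a single node: hive.exec.max.dynamic.partitions.pernode=100
Maximum number of HDFS files the whole job may create: hive.exec.max.created.files=100000

Create the table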
```
create table business
(uid bigint, name string, click_num int) 
partitioned by (p_time date)
row format delimited fields terminated by ',';
```
Insert data
```
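insert into table business
partition(p_time)
-- List the columns explicitly instead of using *; the dynamic partition column must come last
-- (this assumes data_table has a p_time column)
select uid, name, click_num, p_time from data_table;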
```
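
Conceptually, a dynamic-partition insert routes each row by the value of its last selected column. The sketch below illustrates that routing and a partition limit in plain Python; the function name, the error message, and the sample rows are hypothetical, and real Hive enforces its limits differently.

```python
from collections import defaultdict

def route_to_partitions(rows, max_partitions=1000):
    # The last column of each row is treated as the partition value (p_time).
    partitions = defaultdict(list)
    for *columns, p_time in rows:
        name = f"p_time={p_time}"
        if name not in partitions and len(partitions) >= max_partitions:
            # Stand-in for exceeding the dynamic partition limit.
            raise RuntimeError("too many dynamic partitions")
        partitions[name].append(tuple(columns))
    return dict(partitions)

rows = [
    (1, "a", 3, "2020-01-01"),
    (2, "b", 5, "2020-01-02"),
    (3, "c", 7, "2020-01-01"),
]
routed = route_to_partitions(rows)
```

Here `routed` has two partitions, one per distinct date, and the partition column itself is no longer stored in the row data.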
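
Bucketing
Manual partitioning
Filter as early as possible to narrow the data before joining (placing the filter in a subquery applies it before the join)
Use EXPLAIN to inspect the query plan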

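4. Handling Data Skew

The sections above covered several situations that can lead to data skew, along with ways to deal with them. A few other common methods follow.

Adjusting the number of mappers

The InputFormat class has two methods: getSplits(), which returns a list of InputSplit, and createRecordReader().

In FileInputFormat, getSplits() determines the split size as follows: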
```
computeSplitSize(blockSize, minSize, maxSize) = Math.max(minSize, Math.min(maxSize, blockSize)) = blockSize = 128M
```

Lower maxSize: setting maxSize below the block size increases the number of mappers.

Set the maximum split size

set mapreduce.input.fileinputformat.split.maxsize=xxx;
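
The effect of the maximum split size on the mapper count can be worked through numerically. This is a rough sketch: `compute_split_size` follows the max/min formula given above, while `estimate_mappers` is a hypothetical helper that ignores details such as multiple files and the slack Hadoop allows on the last split.

```python
MB = 1024 * 1024

def compute_split_size(block_size, min_size, max_size):
    # Same formula as above: max(minSize, min(maxSize, blockSize)).
    return max(min_size, min(max_size, block_size))

def estimate_mappers(file_size, block_size=128 * MB, min_size=1,
                     max_size=2**63 - 1):
    # Approximate mapper count for a single file: one mapper per split.
    split_size = compute_split_size(block_size, min_size, max_size)
    return -(-file_size // split_size)  # ceiling division

default_split = compute_split_size(128 * MB, 1, 2**63 - 1)  # 128 MB
default_mappers = estimate_mappers(1024 * MB)                # 8
tuned_mappers = estimate_mappers(1024 * MB, max_size=64 * MB)  # 16
```

Halving the maximum split size from the 128 MB block size to 64 MB doubles the estimated number of mappers for a 1 GB file.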

Adjusting the number of reducers

Set the number directly
```
set mapreduce.job.reduces = 15;
```
To make the change permanent, set this parameter in mapred-site.xml

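Adjust the configuration

hive.exec.reducers.bytes.per.reducer=256000000  -- amount of data each reducer handles by default (x)
hive.exec.reducers.max=1009  -- default maximum number of reducers per job (y)
N = min(y, total input size / x)  -- formula for the number of reducers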
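
The formula can be checked with a few concrete numbers. The helper below is a hypothetical sketch of the formula as stated, rounding up and never returning fewer than one reducer; the rounding and the lower bound are assumptions added for the example.

```python
def estimate_reducers(total_bytes, bytes_per_reducer=256_000_000,
                      max_reducers=1009):
    # N = min(y, total / x), rounded up, with at least one reducer.
    needed = -(-total_bytes // bytes_per_reducer)  # ceiling division
    return max(1, min(max_reducers, needed))

ten_gb = estimate_reducers(10_000_000_000)   # 10e9 / 256e6 = 39.06 -> 40
one_tb = estimate_reducers(10**12)           # capped at max_reducers
tiny = estimate_reducers(1000)               # never below 1
```

A 10 GB input yields 40 reducers, while a 1 TB input would need 3907 and is therefore capped at 1009.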

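Merging small files

When hive.input.format is set to HiveInputFormat, small files are not combined.
Setting it to CombineHiveInputFormat (already the default in recent Hive releases) combines small files into larger splits.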
```
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
```
A related parameter: set hive.merge.mapfiles=true (merges small output files at the end of a map-only job)

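5. Parallel Execution

A Hive query is made up of multiple stages, such as sampling, merging, and limit. By default Hive executes only one stage at a time. Some jobs contain many stages that do not depend on one another, and these can run in parallel.

set hive.exec.parallel=true;              -- enable parallel execution of stages
set hive.exec.parallel.thread.number=16;  -- maximum parallelism allowed for a single SQL statement; default is 8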
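
As an analogy for what these two settings control, the sketch below runs a list of independent stages either one at a time or on a bounded thread pool. It is only an illustration: `run_stages` and its parameters are hypothetical names, and Hive's scheduler also has to respect dependencies between stages, which this sketch does not model.

```python
from concurrent.futures import ThreadPoolExecutor

def run_stages(stages, parallel=True, thread_number=8):
    # parallel plays the role of hive.exec.parallel, thread_number that of
    # hive.exec.parallel.thread.number (analogy only).
    if not parallel:
        return [stage() for stage in stages]
    with ThreadPoolExecutor(max_workers=thread_number) as pool:
        futures = [pool.submit(stage) for stage in stages]
        # Results are collected in submission order.
        return [future.result() for future in futures]

# Five independent stages, each returning a value.
stages = [lambda i=i: i * i for i in range(5)]
sequential = run_stages(stages, parallel=False)
concurrent = run_stages(stages, parallel=True, thread_number=16)
```

Both modes produce the same results; the parallel mode simply allows independent stages to overlap in time.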

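6. JVM Reuse

By default Hadoop forks a new JVM for each map or reduce task. JVM startup can add considerable overhead, especially when a job consists of a large number of tasks that each process a small file.
The number of tasks that reuse one JVM instance can be configured in mapred-site.xml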

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit. 
  </description>
</property>

The drawback is that, with JVM reuse enabled, the task slots in use stay occupied and are released only when the job completes.

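7. Speculative Execution

Hadoop detects tasks that are running slowly within a job, launches additional attempts that process the same data in parallel, and uses the result of whichever attempt finishes first.

Speculative execution settings in Hadoop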

<property>
  <name>mapreduce.map.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some map tasks 
               may be executed in parallel.</description>
</property>

<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
  <description>If true, then multiple instances of some reduce tasks 
               may be executed in parallel.</description>
</property>

Hive provides its own speculative execution setting as well, which can be changed directly

<property>
	<name>hive.mapred.reduce.tasks.speculative.execution</name>
 	<value>true</value>
	<description>Whether speculative execution for reducers should be turned on. </description>
</property>

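8. Compression

snappy (not splittable) or lzo (splittable once indexed) is generally used, since both are fast.

List the compression codecs the Hadoop installation supports
hadoop checknative

Steps to add snappy compression:

Download the snappy source package and build it into the Hadoop distribution
Upload and unpack the build, go to lib/native under its Hadoop root directory, and copy all files into lib/native under the existing Hadoop root directory, overwriting what is there
Distribute to the cluster and restart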

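Compression in the map stage

This reduces the amount of data transferred between mappers and reducers.

1. Enable compression of Hive intermediate data
set hive.exec.compress.intermediate=true;
2. Enable map output compression
set mapreduce.map.output.compress=true;
3. Set the map output compression codec
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;

Compression in the reduce stage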

1. Enable compression of Hive final output
set hive.exec.compress.output=true;
2. Enable compression of MapReduce final output
set mapreduce.output.fileoutputformat.compress=true;
3. Set the codec for MapReduce final output
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Use block compression for MapReduce final output
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

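Setting the storage format

Common formats: TextFile, ORC, Parquet
A common combination is snappy + ORC; ORC is a storage format that combines row and column storage.
ORC uses ZLIB compression by default, which compresses better than snappy but is slower, so it is generally not used.

Create a table stored as ORC with snappy compression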

create table data_orc_snappy(
	track_time string,
	url string,
	session_id string,
	referer string,
	ip string,
	end_user_id string,
	city_id string
)
row format delimited fields terminated by '\t'
stored as orc tblproperties ("orc.compress"="SNAPPY");
