Hive Tuning (Part Two)

The following are some tuning strategies used with Hive.

1. Fetch fetching

Fetch fetching means that for some queries Hive does not need to run MapReduce at all. For example: SELECT * FROM employees; in this case, Hive can simply read the files in the storage directory corresponding to employees and output the query results to the console.
In the hive-default.xml.template file, the default value of hive.fetch.task.conversion is more (in older versions of Hive the default was minimal). After this property was changed to more, global searches, column lookups and limit queries no longer go through MapReduce.

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have
      any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion (fetch is disabled entirely; everything goes through MapReduce)
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT
                   (fetch is used only for select *, filters on partition columns, and limit; no MapReduce)
      2. more  : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
                   (column projections, filters and limit all skip MapReduce)
    </description>
  </property>

You can also change the value of this parameter temporarily on the hive command line:

hive (default)> set hive.fetch.task.conversion=more;
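With the property set to more, the example from the start of this section runs as a simple fetch task, with no MapReduce job launched (employees is the sample table from above; the name column is assumed for illustration):

hive (default)> select * from employees;             -- fetch task: no MapReduce
hive (default)> select name from employees limit 10; -- projection + limit: still a fetch task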

2. Local mode

Sometimes the amount of input data for a Hive job is very small. In these cases, the overhead of triggering execution of the query may take far longer than the actual job itself. For most such cases, Hive's local mode can handle all of the work on a single machine, which can shorten execution time significantly for small data sets. That is, only one map task and one reduce task are started, and they run on a single host. The relevant parameters are as follows:

// Enable local MR; based on the settings below, Hive decides automatically whether to use local mode
set hive.exec.mode.local.auto=true;  

// Maximum input size for local MR; local mode is used when the input is smaller than this value
// (default 134217728 bytes, i.e. 128 MB)
set hive.exec.mode.local.auto.inputbytes.max=50000000;

// Maximum number of input files for local MR; local mode is used when there are fewer input files than this value (default 4)
set hive.exec.mode.local.auto.input.files.max=10;
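A minimal usage sketch (the emp table here is only an illustration; any query whose input falls under the thresholds qualifies):

set hive.exec.mode.local.auto=true;
-- with input below the limits above, this job runs in local mode on a single host
select count(*) from emp;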

3. Table optimization

3.1 Joining a large table with a small table

Put the table whose keys are relatively dispersed and whose data volume is small on the left side of the join; this effectively reduces the chance of out-of-memory errors, because the left table is read first. Going further, map join can be used to load small dimension tables (1000 rows or fewer) into memory first, completing the "reduce" on the map side.
Actual tests have found that newer versions of Hive optimize both small-table JOIN large-table and large-table JOIN small-table; there is no noticeable difference between putting the small table on the left or the right.

3.2 Joining a large table with a large table

For this experiment you can start Hadoop's JobHistory server to view how the jobs executed, including execution time and other details.

Configure mapred-site.xml:
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>bigdata111:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>bigdata111:19888</value>
</property>

Start the history server:
mr-jobhistory-daemon.sh start historyserver

Open the history server web UI:
http://192.168.1.102:19888

Figure 3.1 Results of the Hive large-table join jobs (JobHistory web UI)

In the job results you can see many execution parameters, such as the execution time.

3.2.1 Filtering null keys

Sometimes a join times out because certain keys have too much data: rows with the same key are all sent to the same reducer, which then runs out of memory. In that case we should examine these unusual keys carefully. In many cases the data for these keys is abnormal, and we need to filter it out in the SQL statement. For example, if the rows whose key is NULL are abnormal data, they should be filtered out. For example:

insert overwrite table jointable
select n.* from (select * from nullidtable where id is not null) n
left join ori o on n.id = o.id;
Here the rows of nullidtable whose id is NULL are filtered out before the join.
Note, however, that you should only filter this way when you are sure the NULL-key rows are invalid data; if they are valid data you cannot use this approach.

3.2.2 Converting null keys

Sometimes a blank key corresponds to a lot of data, but the data is not abnormal and must be included in the join result. In that case we can assign a random value to the rows whose key field is blank, so that the data is distributed uniformly at random across different reducers. For example:

insert overwrite table jointable
select n.* from nullidtable n full join ori o on 
case when n.id is null then concat('hive', rand()) else n.id end = o.id;

The case when n.id is null then concat('hive', rand()) else n.id end expression checks whether id is NULL; if it is, a random value is substituted, otherwise the original id is used directly.

3.3 Enabling automatic map join

If MapJoin is not enabled or the query does not meet its conditions, the Hive parser converts the join into a Common Join, i.e. the join is completed in the reduce stage, where data skew occurs easily. With MapJoin, the small table can be loaded entirely into memory and joined on the map side, avoiding the reducer altogether.
We can configure the table-size threshold below which a map join is used instead of a reduce join:

(1) Enable automatic MapJoin selection:
set hive.auto.convert.join = true;   // default is true

(2) Size threshold between large and small tables (below 25 MB is considered a small table by default):
set hive.mapjoin.smalltable.filesize=25000000;
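As a sketch of the effect (smalltable is a hypothetical table below the 25 MB threshold; bigtable and jointable reuse the earlier example names): with the settings above, Hive automatically converts the following into a map join, loading smalltable into memory on the map side:

insert overwrite table jointable
select b.id from bigtable b
join smalltable s on b.id = s.id;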

3.4 Load balancing for group by

By default, aggregation happens in the reduce stage: in the map stage, rows with the same key are distributed to one reducer, which skews badly when one key has too much data. But not all aggregation needs to be done on the reduce side; many aggregations can first be partially performed on the map side, with the reduce side only combining the partial results.

(1) Whether to aggregate on the map side (default true):
    hive.map.aggr = true
(2) Number of entries over which map-side aggregation is applied:
    hive.groupby.mapaggr.checkinterval = 100000
(3) Load-balance when the data is skewed (default false):
    hive.groupby.skewindata = true
When this option is set to true, the generated query plan contains two MR jobs. In the first MR job, the map output is distributed randomly across the reducers; each reducer performs a partial aggregation and emits its result. Because rows with the same Group By key may land on different reducers, load is balanced. The second MR job then distributes the pre-aggregated results to reducers by Group By key (which guarantees that identical Group By keys go to the same reducer) and completes the final aggregation.
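For example (bigtable reuses the earlier example name), with hive.groupby.skewindata=true an aggregation over a skewed key such as the following is compiled into the two MR jobs described above:

select id, count(*) from bigtable group by id;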

3.5 Deduplicated counts: group by first, then count

Normally, when we want to count the number of rows after deduplication, we write the statistic like this:

select count(distinct id) from bigtable;

This approach has a serious flaw: because the deduplication is global, the MapReduce job cannot use multiple reducer tasks. If it did, each reducer would only deduplicate its own portion, and global deduplication would not be guaranteed. The load on that single reducer is therefore very heavy. It can be optimized as follows:

select count(id) from (select id from bigtable group by id) a;

The first MapReduce job groups by id, which already deduplicates, and the group by can use multiple reducer tasks, relieving the pressure on any single reducer. A second MapReduce job then counts the rows of the grouped output. So one job becomes two tasks; note that this approach only pays off with large data volumes, otherwise the extra job scheduling consumes more resources without improving efficiency.

3.6 Row and column filtering before a join

Column filtering: try not to use select *; specify the columns to be queried instead.
Row filtering: in an outer join, if you want to filter out some rows of a table, filter first and then join; do not join the two tables and filter afterwards, because the data is larger after the join and filtering takes longer.
Filtering after the join:

select o.id from bigtable b
join ori o on o.id = b.id
where o.id <= 10;

So do not put the where clause after the join; it is bad practice and takes a very long time on large data volumes.

Filtering before the join:

select b.id from bigtable b
join (select id from ori where id <= 10 ) o on b.id = o.id;

Here the ori table is filtered on the id column first, and the filtered data is then joined with bigtable.

3.7 Enabling dynamic partitioning

If a Hive table is partitioned, normally we must explicitly specify which partition to insert into at insert time. If dynamic partitioning is enabled, the partition field of the imported data determines which partition each row goes to automatically, and partitions that do not exist are created automatically.

(1) Enable dynamic partitioning (default true, enabled):
hive.exec.dynamic.partition=true

(2) Set non-strict mode (the dynamic-partition mode defaults to strict, which requires at least one partition to be specified statically; nonstrict allows all partition fields to be dynamic):
hive.exec.dynamic.partition.mode=nonstrict

(3) Maximum total number of dynamic partitions that may be created across all MR-executing nodes:
hive.exec.max.dynamic.partitions=1000

(4) Maximum number of dynamic partitions that may be created on each MR-executing node. Set this according to the actual data. For example, if the source data contains one year of data, i.e. the day field has 365 values, this parameter must be set greater than 365; the default of 100 would cause an error:
hive.exec.max.dynamic.partitions.pernode=100

(5) Maximum number of HDFS files that may be created in the whole MR job:
hive.exec.max.created.files=100000

(6) Whether to throw an exception when an empty partition is generated. Usually does not need to be set:
hive.error.on.empty.partition=false

Example:
Requirement: insert the data of ori into the corresponding partitions of the target table ori_partitioned_target according to its time field (e.g. 20111230000008).

(1) Create the partitioned source table:
create table ori_partitioned(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) 
partitioned by (p_time bigint) 
row format delimited fields terminated by '\t';
(2) Load data into the partitioned table:
hive (default)> load data local inpath '/opt/module/datas/ds1' into table ori_partitioned partition(p_time='20111230000010') ;
hive (default)> load data local inpath '/opt/module/datas/ds2' into table ori_partitioned partition(p_time='20111230000011') ;
(3) Create the target partitioned table:
create table ori_partitioned_target(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) PARTITIONED BY (p_time STRING) row format delimited fields terminated by '\t';
(4) Set the dynamic-partition parameters:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
set hive.exec.max.created.files = 100000;
set hive.error.on.empty.partition = false;

hive (default)> insert overwrite table ori_partitioned_target partition (p_time) 
select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;
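You can then list the partitions of the target table to confirm that the dynamic partitions were created:

hive (default)> show partitions ori_partitioned_target;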

4. Data skew

4.1 Setting a reasonable number of map tasks

4.1.1 Many small files produce many map tasks

As discussed in the MapReduce material, by default each file gets its own splits and a file produces at least one split, so a large number of small files inevitably generates a large number of map tasks. The problem is the same in Hive.
Solution:
Merge small files before the map phase to reduce the number of maps. CombineHiveInputFormat (the default input format) can merge small files; HiveInputFormat has no such merging capability.
set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

4.1.2 A single map task is overloaded

When every map task runs very slowly, likely because of complex processing logic, consider lowering the split size to increase the number of maps and reduce the workload of each one.
How to increase the number of maps: the split size is computeSplitSize(blockSize, minSize, maxSize) = Math.max(minSize, Math.min(maxSize, blockSize)), which defaults to blockSize = 128M. Lowering maxSize below blockSize increases the number of maps.

Set the maximum split size to 100 bytes:
hive (default)> set mapreduce.input.fileinputformat.split.maxsize=100;

This is only an example; the actual value should be decided according to the situation.

4.2 Setting a reasonable number of reduce tasks

The relevant parameters:

(1) Amount of data processed by each reducer (default 256 MB):
hive.exec.reducers.bytes.per.reducer=256000000
(2) Maximum number of reducers per job (default 1009):
hive.exec.reducers.max=1009
(3) Formula for computing the number of reducers:
N = min(parameter (2), total input size / parameter (1))
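For example, with the defaults above and 1 GB of total input, N = min(1009, 1024 MB / 256 MB) = 4 reducers.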

Note:

1) Starting and initializing too many reducers also consumes time and resources;
2) There are as many output files as there are reducers; if many small files are generated and those files become the input of the next task, the small-file problem arises again.
When setting the number of reducers, two principles also need to be considered: use an appropriate number of reducers for large data volumes, and keep the amount of data handled by a single reduce task appropriate.
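The reducer count can also be pinned explicitly rather than derived from the formula above (a session-level setting; the value 15 is only an illustration, chosen per the principles just listed):

set mapreduce.job.reduces=15;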

5. Parallel execution

A Hive query is converted into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, or other stages Hive may need during execution. By default, Hive executes only one stage at a time. A particular job, however, may contain many stages, and they may not all depend on one another; some stages can be executed in parallel, which may shorten the total job execution time. The more stages that can run in parallel, the faster the job may complete.
Setting the parameter hive.exec.parallel to true enables concurrent execution. In a shared cluster, however, note that if the number of parallel stages in a job increases, cluster utilization increases as well.

set hive.exec.parallel=true;              // enable parallel task execution
set hive.exec.parallel.thread.number=16;  // maximum parallelism allowed for a single sql (default 8)

6. Strict mode

Hive provides a strict mode that prevents users from running queries that might have unintended harmful effects. The default value of hive.mapred.mode is nonstrict (non-strict mode). Turning on strict mode means changing hive.mapred.mode to strict, which disables three types of queries.

<property>
    <name>hive.mapred.mode</name>
    <value>strict</value>
    <description>
      The mode in which the Hive operations are being performed. 
      In strict mode, some risky queries are not allowed to run. They include:
        Cartesian Product.
        No partition being picked up for a query.
        Comparing bigints and strings.
        Comparing bigints and doubles.
        Orderby without limit.
    </description>
  </property>

Strict mode restricts SQL statements under the following three conditions:
1) For partitioned tables, queries are not allowed unless the where clause contains a filter on the partition field that limits the scope; in other words, users are not allowed to scan all partitions. The reason for this restriction is that partitioned tables usually hold very large and rapidly growing data sets; a query without a partition limit could consume unacceptably large resources.
2) Queries that use order by must include a limit statement. Because order by distributes all result data to a single reducer for sorting, forcing users to add LIMIT prevents that reducer from executing for a very long time.
3) Queries that produce a Cartesian product are restricted. Users who know relational databases well might expect to execute a query with a WHERE clause instead of a JOIN ... ON clause, since a relational database optimizer can efficiently convert the WHERE clause into an ON clause. Unfortunately, Hive does not perform this optimization, so if the tables are big enough the query can get out of control.
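As a sketch, each of the following statements (reusing table names from the earlier examples) would be rejected under strict mode:

select * from ori_partitioned;        -- 1) partitioned table with no partition filter in where
select * from bigtable order by id;   -- 2) order by without limit
select * from bigtable a, ori b;      -- 3) Cartesian product (join without an on condition)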

7. JVM reuse

JVM reuse is a Hadoop tuning parameter that has a very big impact on Hive performance, especially in scenarios with many small files that are hard to avoid, or with many tasks, where most executions are very short.
Hadoop's default configuration usually forks a new JVM to execute each map or reduce task, and JVM startup can cause considerable overhead, especially when a job contains hundreds or thousands of tasks. With JVM reuse, a JVM instance can be reused up to N times within the same job. N can be configured in Hadoop's mapred-site.xml file; typical values are between 10 and 20, and the exact value needs to be tested against the specific business scenario.

<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <value>10</value>
  <description>How many tasks to run per jvm. If set to -1, there is
  no limit. 
  </description>
</property>

This feature has a drawback: with JVM reuse enabled, task slots stay occupied for reuse and are only released when the task, and ultimately the whole job, is finished; only then are all the occupied JVMs released. If a job is "unbalanced", with a few reduce tasks taking much longer than the other reduce tasks, the slots reserved by the whole job stay idle yet unavailable to other jobs until the last task finishes and they are released.

8. Speculative execution

For speculative execution in MapReduce itself, see the relevant MapReduce material; it is not repeated here.
Hive itself also provides a configuration item to control reduce-side speculative execution:

<property>
    <name>hive.mapred.reduce.tasks.speculative.execution</name>
    <value>true</value>
    <description>Whether speculative execution for reducers should be turned on. </description>
  </property>
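For reference, the Hadoop-side speculative-execution switches can also be set from the Hive session; both default to true in Hadoop 2.x:

set mapreduce.map.speculative=true;      -- map-side speculative execution
set mapreduce.reduce.speculative=true;   -- reduce-side speculative execution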

It is difficult to give a specific recommendation for tuning these speculative-execution variables. If you are very sensitive to runtime deviation, you may want to turn these features off. If a map or reduce task needs to run for a long time because its input data volume is huge, the waste caused by launching speculative attempts can be enormous.

9. Compression

See the compression-related content in "hive -- basic principles". The main optimizations are reducing the amount of data transferred between map and reduce, and reducing the size of the final reduce output files.
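A minimal sketch of the standard switches involved (codec selection is omitted here; see the article referenced above):

set hive.exec.compress.intermediate=true;             -- compress data passed between MR stages
set mapreduce.map.output.compress=true;               -- compress map output
set hive.exec.compress.output=true;                   -- compress the final query output
set mapreduce.output.fileoutputformat.compress=true;  -- Hadoop-side switch for final output compression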

10. Viewing the execution plan

When executing a SQL task, you can use explain to view the expected execution plan and check whether there are points that can be optimized.

(1) View the execution plan of the following statement:
hive (default)> explain select * from emp;

(2) View the detailed execution plan:
hive (default)> explain extended select * from emp;

Source: blog.51cto.com/kinglab/2447321