Hive Tuning Tips

1. Fetch Task Conversion

set hive.fetch.task.conversion=more; -- default

Fetch task conversion means that for certain queries Hive does not need to run MapReduce at all.
With this property set to more, whole-table lookups, column lookups, LIMIT lookups, and the like all skip MapReduce. With it set to none, every statement goes through MapReduce.
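
For instance, under more a simple lookup runs as a fetch task with no MapReduce job. A minimal sketch, assuming a hypothetical table emp:

select * from emp;             -- whole-table lookup, no MapReduce
select ename from emp;         -- column lookup, no MapReduce
select ename from emp limit 3; -- LIMIT lookup, no MapReduce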

2. Local Mode

set hive.exec.mode.local.auto=true; -- enable local mode

In local mode, Hive runs all tasks of a job on a single machine. For small data sets this can shorten execution time significantly.
1. Maximum input data size for local MR: after local mode is enabled, local MR is used when the input data size is below this value

set hive.exec.mode.local.auto.inputbytes.max=134217728; -- default

2. Maximum number of input files for local MR: after local mode is enabled, local MR is used when the number of input files is below this value

set hive.exec.mode.local.auto.input.files.max=4; -- default
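Putting the three settings together, a minimal sketch (the query and table small_table are illustrative; the input is assumed to be under both thresholds):

set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.input.files.max=4;
-- This now runs as local MR on a single machine:
select count(*) from small_table;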

3. Table Optimization

3.1 Small table JOIN large table (traditionally the small table goes on the left)

Note: newer versions of Hive optimize both small-table JOIN large-table and large-table JOIN small-table; there is no longer a significant difference between putting the small table on the left or the right.

3.2 Large table JOIN large table

When a table has many NULL values in the join key, every NULL becomes the same key during the MapReduce shuffle, so a huge number of values for that single key arrive at one reducer and can exhaust its memory. Find a way to filter out or handle the NULL keys.
1. Filter out rows whose join key is NULL

   insert overwrite table jointable
   select n.* from (select * from nullidtable where id is not null) n
   left join ori o on n.id = o.id;

2. Keep the NULL rows but assign them random join keys, so that no single key collects them all

insert overwrite table jointable
select n.* from nullidtable n full join ori o on 
case when n.id is null then concat('hive', rand()) else n.id end = o.id;

Note: This method can solve the problem of data skew

3.3 MapJoin

If MapJoin is not enabled, or the query does not meet the MapJoin conditions, the Hive parser converts the join into a Common Join, i.e. the join completes in the Reduce stage, where data skew occurs easily. With MapJoin the small table is loaded entirely into memory and joined on the map side, so no reducer has to handle it.

Enable MapJoin:

set hive.auto.convert.join = true; -- default

Set the threshold that separates small tables from large tables (by default a table under 25 MB counts as small):

 set hive.mapjoin.smalltable.filesize=25000000;
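A sketch of a join these settings would convert to a MapJoin, assuming smalltable is under the 25 MB threshold (table and column names are illustrative):

-- smalltable is loaded into memory and joined on the map side; no reducer handles the join
select b.id, b.name
from bigtable b
join smalltable s on b.id = s.id;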

3.4 GROUP BY

By default, the Map stage distributes rows with the same key to one reducer, so when a single key carries too much data the job skews. Not all aggregation has to be completed on the reduce side, though: many aggregations can be partially done on the map side, with the reduce side assembling the final result.

Parameters for map-side aggregation
Whether to aggregate on the map side (default true):

hive.map.aggr = true

Number of rows on which the map side performs aggregation:

hive.groupby.mapaggr.checkinterval = 100000

Perform load balancing when the data is skewed (default false):

hive.groupby.skewindata = true

Note: when this option is set to true, the resulting query plan has two MR jobs. In the first MR job, the map output is distributed randomly to reducers; each reducer does a partial aggregation and emits the result. Since rows with the same GROUP BY key may land on different reducers in this step, the load is balanced. The second MR job then distributes the pre-aggregated results to reducers by GROUP BY key (this guarantees rows with the same key go to the same reducer) and completes the final aggregation.
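
Putting the three settings together, a sketch (the table and column are illustrative):

set hive.map.aggr = true;
set hive.groupby.mapaggr.checkinterval = 100000;
set hive.groupby.skewindata = true;
-- A skewed key is first spread over random reducers for partial aggregation,
-- then a second MR job re-aggregates by deptno to produce the final result:
select deptno, count(*) from emp group by deptno;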

3.5 COUNT(DISTINCT) deduplicated counting

COUNT(DISTINCT ...) runs as a single MapReduce job. That does not matter when the data is small, but with large data a single job struggles to finish, because the deduplication falls on one reducer. In that case use GROUP BY to deduplicate first and count afterwards, which takes two MapReduce jobs. With set mapreduce.job.reduces = 5; the first job runs with a map stage and 5 reducers, spreading the deduplication load. Although this uses one extra job, for large data volumes it is definitely worth it.
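
A sketch of the rewrite, reusing the bigtable example from section 3.6:

-- Single MapReduce job: one reducer has to deduplicate every id
select count(distinct id) from bigtable;

-- Two jobs: 5 reducers share the deduplication in the first one
set mapreduce.job.reduces = 5;
select count(id) from (select id from bigtable group by id) a;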

3.6 Column and row filtering

  • Column filtering: in SELECT, take only the columns you need; use partition filtering wherever possible; avoid SELECT *.
  • Row filtering: filter early. In an outer join, if the filter condition for the secondary table is written in the WHERE clause, the full tables are joined first and only then filtered.

Examples:
1. Join the two tables first, then filter with the WHERE condition

hive (default)> select o.id from bigtable b join ori o on o.id = b.id where o.id <= 10;

2. Filter with a subquery first, then join the tables

hive (default)> select b.id from bigtable b join (select id from ori where id <= 10 ) o on b.id = o.id;

3.7 Dynamic Partitioning

In a relational database, when rows are inserted into a partitioned table, the database automatically routes each row to the right partition based on the value of the partition column. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but it requires configuration before use.
First set the following properties:

set hive.exec.dynamic.partition = true;             -- enable dynamic partitioning
set hive.exec.dynamic.partition.mode = nonstrict;   -- allow all partition columns to be dynamic
set hive.exec.max.dynamic.partitions = 1000;        -- max dynamic partitions for the whole job
set hive.exec.max.dynamic.partitions.pernode = 100; -- max dynamic partitions per mapper/reducer node
set hive.exec.max.created.files = 100000;           -- max HDFS files the whole job may create
set hive.error.on.empty.partition = false;          -- do not fail when an empty partition is generated

Dynamic partitioning example (the partition column p_time must come last in the SELECT; Hive uses its value to choose the partition):

insert overwrite table ori_partitioned_target partition (p_time)
select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;
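For context, a hypothetical definition of the target table; the column types are assumptions, the point is that p_time appears only in PARTITIONED BY:

create table ori_partitioned_target(
    id bigint, time bigint, uid string, keyword string,
    url_rank int, click_num int, click_url string)
partitioned by (p_time string)
row format delimited fields terminated by '\t';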

4. Data Skew

4.1 Setting a reasonable number of maps

Set the maximum split size (a smaller value produces more splits and therefore more maps):

set mapreduce.input.fileinputformat.split.maxsize=???
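As an illustration (the numbers are assumptions, not recommendations): with the default 128 MB split size, halving the maximum split size roughly doubles the number of maps for the same input:

set mapreduce.input.fileinputformat.split.maxsize=67108864; -- 64 MB: about twice as many maps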

4.2 Merging small files

Merge small files before execution to reduce the number of maps: CombineHiveInputFormat (the system default input format) can merge small files; HiveInputFormat has no merging capability for small files.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

4.3 Increasing the number of maps for complex files

When the input is not large but each record is expensive to process, raising the number of maps spreads the work over more tasks:

set mapreduce.job.maps=???

4.4 Setting a reasonable number of reducers

1. Method one: adjust the following parameters
Data processed per reducer (default 256 MB):

hive.exec.reducers.bytes.per.reducer=256000000

Maximum number of reducers per job (default 1009):

hive.exec.reducers.max=1009

Formula for the number of reducers:

N = min(parameter 2, total input size / parameter 1)
i.e. N = min(hive.exec.reducers.max, total input bytes / hive.exec.reducers.bytes.per.reducer)
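A worked example under the defaults, for a hypothetical job reading 1 GB (1073741824 bytes) of input:

N = min(1009, 1073741824 / 256000000) = min(1009, ~4.2), rounded up to 5 reducers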

2. Method two: set the reducer count directly

set mapreduce.job.reduces=???

3. More reducers is not always better

  • Too many reducers waste startup and initialization time and resources;
  • Each reducer also writes its own output file, so many reducers mean many output files; if these small files feed the next task, the small-file problem reappears. Choosing the reducer count therefore follows two principles: use enough reducers to handle large data volumes, and keep the amount of data a single reducer processes appropriate.

4.5 Parallel execution

Setting the parameter hive.exec.parallel to true enables concurrent execution of job stages that do not depend on each other. On a shared cluster, though, be aware that if a job's parallel stages increase, cluster utilization will rise.

set hive.exec.parallel=true;             -- enable parallel execution of job stages
set hive.exec.parallel.thread.number=16; -- maximum parallelism for one SQL statement, default 8

Origin: blog.51cto.com/14309075/2415633