01-hive optimization lessons

A summary of Hive optimization lessons.

Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can run on Hadoop. SQL for Hive should be designed around the characteristics of distributed computing, which differ from those of a traditional relational database, so it is necessary to let go of some of the habits formed while developing against relational databases.

Basic principles:

1: Filter data as early as possible to reduce the amount of data entering each stage; add partition predicates when querying partitioned tables, and select only the fields that are actually needed. For example:

select ... from A
join B
on A.key = B.key
where A.userid > 10
  and B.userid < 10
  and A.dt = '20120417'
  and B.dt = '20120417';

It should be rewritten as:

select .... from (select .... from A
                  where dt = '20120417'
                    and userid > 10
                 ) a
join (select .... from B
      where dt = '20120417'
        and userid < 10
     ) b
on a.key = b.key;

 

2: Apply past experience when computing over a historical database (that is, optimize the approach according to how the data will be used): compute over and query historical data by partition, as sketched below.
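A minimal sketch of the partitioning idea, using a hypothetical history table (table and column names are illustrative):

-- store history partitioned by day, so each day can be computed and read independently
create table user_history (
    userid bigint,
    action string
)
partitioned by (dt string);

-- downstream queries then touch only the partitions they need
select count(1) from user_history where dt = '20120417';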

 

3: Keep operations as atomic as possible; avoid a single SQL statement that contains complex logic.

Intermediate tables can be used to carry the complex logic step by step, as in the sketch below.
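A minimal sketch of breaking complex logic into steps with an intermediate table (names are illustrative):

-- step 1: materialize the complex part once as an intermediate table
create table tmp_user_pv as
select userid, count(1) as pv
from user_log
where dt = '20120417'
group by userid;

-- step 2: the follow-up statement stays simple and atomic
select userid from tmp_user_pv where pv > 100;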

4: For join operations, take care to put the small table on the left side of the join (at present, a lot of TCL queries put the small table on the right side of the join).

Otherwise the join will consume a large amount of disk and memory, as shown below.
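A sketch of the recommended ordering (table names are illustrative); in a common reduce-side join, Hive buffers the tables listed first in memory, so the smaller table should come first:

select b.userid, b.pv
from small_dim a      -- smaller table on the left
join big_fact b       -- larger table streamed last
on a.userid = b.userid;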

 

5: If a union all has more than two branches, or each branch carries a large amount of data, split it into multiple insert into statements; in actual testing this improved execution time by about 50%.

insert overwrite table tablename partition (dt= ....)

select ..... from (

                   select ... from A

                   union all

                   select ... from B

                   union all

                   select ... from C

                               ) R

where ...;

 

It can be rewritten as:

insert into table tablename partition (dt= ....)
select .... from A
where ...;

insert into table tablename partition (dt= ....)
select .... from B
where ...;

insert into table tablename partition (dt= ....)
select .... from C
where ...;

 

6: Before writing SQL, first understand the characteristics of the data itself. If the query contains join or group by operations, watch out for data skew.

If data skew occurs, handle it with settings such as the following:

set hive.exec.reducers.max=200;

set mapred.reduce.tasks=200;  -- increase the number of reducers

set hive.groupby.mapaggr.checkinterval=100000;  -- a group-by key with more records than this is split; choose the value according to the data volume

set hive.groupby.skewindata=true;  -- set to true if skew occurs during group by

set hive.skewjoin.key=100000;  -- a join key with more records than this has its join split; choose the value according to the data volume

set hive.optimize.skewjoin=true;  -- set to true if skew occurs during join

 

1: Make the server do as much work as possible in each pass, aiming for the highest system throughput. For example:

(1) Make a job do as many things as possible; work that one job can complete should not be spread over two jobs.

In general, if a task launched earlier can take on some extra work at little cost, do it then, so that multiple later tasks can reuse the result. Model design is closely tied to this, and a good model is especially important.

(2) Set the number of reducers reasonably.

Too few reducers fails to exploit the real power of Hadoop's parallel computing, while too many reducers creates a large number of small files. Only you know your data volume and resources best, so find a sensible compromise.

(3) Use the hive.exec.parallel parameter to control whether different jobs within the same SQL statement may run simultaneously, improving job concurrency. A sketch of enabling it follows.
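A minimal sketch of enabling parallel execution (the thread count is illustrative):

set hive.exec.parallel=true;             -- allow independent jobs of one query to run at the same time
set hive.exec.parallel.thread.number=8;  -- upper bound on jobs running in parallel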

 

2: Make the server do as little work as possible and take the optimal path, aiming for minimal resource consumption.

For example:

(1) Pay attention to how joins are used.

If one of the tables is small, use a map join; otherwise use an ordinary reduce join. Note that Hive loads the data of the tables listed earlier in the join into memory, so put the smaller tables before the large ones to reduce the consumption of memory resources. A sketch follows.
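A minimal sketch of a map join (table names are illustrative):

-- let Hive convert joins with a small table into map joins automatically
set hive.auto.convert.join=true;

-- or hint the small table explicitly
select /*+ mapjoin(d) */ f.userid, d.region
from fact_log f
join dim_user d           -- d is small enough to be copied to every mapper
on f.userid = d.userid;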

(2) Pay attention to the small-files problem.

There are two common approaches in Hive.

The first is to use CombineHiveInputFormat, which packs multiple small files into a single input split, reducing the number of map tasks:

set mapred.max.split.size=256000000;

set mapred.min.split.size.per.node=256000000;

set mapred.min.split.size.per.rack=256000000;

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

The second is to set Hive parameters so that an extra MR job is launched to merge the small output files:

hive.merge.mapredfiles = false  -- whether to merge the small files output at the reduce end; defaults to false

hive.merge.size.per.task = 256*1000*1000  -- target size of the merged files
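A minimal sketch of turning the merge job on (the size matches the default mentioned above):

set hive.merge.mapredfiles=true;         -- merge small files produced at the reduce end
set hive.merge.size.per.task=256000000;  -- target size of each merged file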

 

(3) Pay attention to data skew.

Common approaches in Hive:

Setting hive.groupby.skewindata = true makes Hive generate two MR jobs. In the first job, the map output is distributed to the reducers randomly and each reducer does a partial pre-aggregation, so a few keys with an excessive number of records no longer overload a single reducer; the second job then aggregates the partial results by key, mitigating the skew caused by hot keys.

Setting hive.map.aggr = true (the default) does combiner-style aggregation at the map side. If the map-side data is mostly distinct keys, this aggregation achieves nothing and the combiner is pure overhead. Hive is thoughtful enough to account for this with hive.groupby.mapaggr.checkinterval = 100000 (default) and hive.map.aggr.hash.min.reduction = 0.5 (default): it pre-aggregates the first 100,000 rows, and if (rows remaining after aggregation) / 100000 > 0.5, it gives up on further aggregation.
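A minimal sketch combining the two settings for a skewed group by (table and column names are illustrative):

set hive.map.aggr=true;            -- combiner-style partial aggregation on the map side
set hive.groupby.skewindata=true;  -- two-stage aggregation that first spreads hot keys randomly

select userid, count(1) as pv
from page_view
group by userid;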

 

(4) Make good use of multi insert and union all.

multi insert suits the scenario where the same source table is processed with different logic and inserted into different tables or partitions: the source table needs to be scanned only once, and while the number of jobs is unchanged, the number of source-table scans is reduced. A sketch follows.
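A minimal sketch of multi insert (table and column names are illustrative):

-- one scan of the source table feeds two destination tables
from user_log
insert overwrite table active_users select userid where pv >= 100
insert overwrite table silent_users select userid where pv < 100;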

union all can also reduce the number of table scans and the number of jobs. Typically, queries generated earlier with different logic under different conditions are combined with union all and then aggregated with a single group by. A union all over different tables corresponds to multiple inputs; a union all over the same table is roughly equivalent to a single map pass emitting multiple records.

(5) Parameter tuning.

Cluster parameters come in a wide variety. For example, specific parameters can be set for a particular job, such as JVM reuse or the number of reduce-side copy threads (suitable when maps are fast and produce large output).

If there are many small tasks, e.g. each finishing within a minute, reduce the number of tasks to cut the cost of task initialization. JVM reuse can be configured to reduce per-task overhead, as sketched below.
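A minimal sketch using the classic MapReduce parameter names that appear elsewhere in this post (the values are illustrative):

set mapred.job.reuse.jvm.num.tasks=10;  -- each JVM is reused for up to 10 tasks before exiting
set mapred.reduce.parallel.copies=20;   -- more shuffle copy threads, for fast maps with large output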


Origin: blog.csdn.net/qq_35281775/article/details/52980014