Hive is a tool that parses SQL-like statements and generates MapReduce jobs that run on Hadoop. Design your SQL around the characteristics of distributed computing: Hive differs from a traditional relational database,
so some of the habits formed while working with relational databases have to be unlearned.
Basic principles:
1: Filter the data as early as possible to reduce the amount of data at each stage: add partition predicates when querying partitioned tables, and select only the fields you actually need.
select ... from A
join B
on A.key = B.key
where A.userid>10
and B.userid<10
and A.dt='20120417'
and B.dt='20120417';
It should be rewritten as:
select .... from (select .... from A
where dt='20120417'
and userid>10
) a
join ( select .... from B
where dt='20120417'
and userid < 10
) b
on a.key = b.key;
2: Reuse historical computation results (that is, optimize the usage pattern according to the purpose):
store the results of historical calculations in partitioned tables so later queries can reuse them instead of recomputing.
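As a minimal sketch of this idea (table and column names are hypothetical), a daily aggregate can be materialized once into a partitioned table and then reused by later queries:

```sql
-- Hypothetical example: materialize a daily aggregate once, keyed by partition.
create table if not exists daily_user_stats (
  userid  bigint,
  pv      bigint
)
partitioned by (dt string);

insert overwrite table daily_user_stats partition (dt='20120417')
select userid, count(1) as pv
from access_log
where dt='20120417'
group by userid;

-- Later queries read the pre-computed partition instead of re-scanning access_log.
select sum(pv) from daily_user_stats where dt='20120417';
```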
3: Keep each operation as atomic as possible; avoid packing complex logic into a single SQL statement.
Complex logic can be broken up using intermediate tables.
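A minimal sketch of splitting complex logic with an intermediate table (all table and column names here are illustrative assumptions):

```sql
-- Step 1: materialize the pre-filtered set as an intermediate table.
create table tmp_active_users as
select userid
from user_log
where dt='20120417'
group by userid
having count(1) > 10;

-- Step 2: join against the small, pre-filtered intermediate result.
select u.userid, u.city
from tmp_active_users t
join user_profile u on t.userid = u.userid;
```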
4: In a join, pay attention to putting the small table on the left side of the join (at the moment a lot of SQL in TCL has the small table on the right side of the join).
Otherwise the join consumes a large amount of disk and memory.
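A minimal sketch of the ordering rule (table and column names are hypothetical): Hive buffers the tables listed before the last one in a join and streams the last table, so list the small table first.

```sql
select d.category, count(1)
from small_dim d        -- small table: buffered in memory
join big_fact f         -- large table: streamed through the reducers
  on d.key = f.key
group by d.category;
```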
5: If a union all has more than two branches, or each branch carries a large amount of data, it should be split into multiple insert into statements; in actual testing this improved execution time by 50%.
insert overwrite table tablename partition (dt= ....)
select ..... from (
select ... from A
union all
select ... from B
union all
select ... from C
) R
where ...;
It can be rewritten as:
insert into table tablename partition (dt= ....)
select .... from A
WHERE ...;
insert into table tablename partition (dt= ....)
select .... from B
WHERE ...;
insert into table tablename partition (dt= ....)
select .... from C
WHERE ...;
6: Before writing SQL, understand the characteristics of the data itself; if there are join or group by operations, watch out for data skew.
If data skew occurs, handle it as follows:
set hive.exec.reducers.max=200;
set mapred.reduce.tasks=200; -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval=100000; -- a group-by key whose record count exceeds this value is split; the exact value depends on your data volume
set hive.groupby.skewindata=true; -- set to true if skew occurs during group by
set hive.skewjoin.key=100000; -- a join key whose record count exceeds this value is split; the exact value depends on your data volume
set hive.optimize.skewjoin=true; -- set to true if skew occurs during a join
(1) Do as many things as possible in one job; never use two jobs for what one job can do.
In general, work that can be done together up front should be done together, so that multiple later tasks can reuse the result. This is closely tied to model design, and a good data model is especially important here.
(2) Set a reasonable number of reducers.
Too few reducers fail to exploit Hadoop's parallelism; too many create a serious small-files problem. Only you know your data volume and resources well enough to find the right compromise.
(3) The parameter hive.exec.parallel controls whether independent jobs within the same SQL statement can run simultaneously, improving job-level concurrency.
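A minimal sketch of enabling this per session (the thread count is an illustrative choice, not a recommendation):

```sql
set hive.exec.parallel=true;             -- allow independent stages of one query to run concurrently
set hive.exec.parallel.thread.number=8;  -- max stages to run at once; tune to cluster capacity
```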
2: Make the server do as little work as possible, and take the best path to the result, so that resource consumption is minimal.
such as:
(1) Use joins carefully.
If one of the tables is small, use a map join; otherwise use a common (reduce-side) join. Note that Hive loads the data of the tables listed before the last one in a join into memory, so put smaller tables before the large table to reduce memory consumption.
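A minimal sketch of requesting a map join via a hint (table names are hypothetical; newer Hive versions can also convert joins automatically with hive.auto.convert.join=true):

```sql
-- The MAPJOIN hint asks Hive to load small_dim into memory on every mapper,
-- so the join completes in the map stage with no reduce step.
select /*+ MAPJOIN(d) */ d.category, f.amount
from small_dim d
join big_fact f
  on d.key = f.key;
```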
(2) Pay attention to the small-files problem.
There are two common approaches in Hive.
The first is to use CombineHiveInputFormat, which packs multiple small files into a single input split, reducing the number of map tasks:
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=256000000;
set mapred.min.split.size.per.rack=256000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
The second is to set Hive parameters that start an additional MR job to merge the small output files:
hive.merge.mapredfiles -- whether to merge the output files of the reduce stage; the default is false (set it to true to enable merging)
hive.merge.size.per.task -- size of the merged files; the default is 256 * 1000 * 1000
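These merge parameters can be applied per session; a minimal sketch (values are the commonly cited defaults, adjust to your cluster):

```sql
set hive.merge.mapfiles=true;            -- merge small files produced by map-only jobs (default true)
set hive.merge.mapredfiles=true;         -- also merge the output of map-reduce jobs (default false)
set hive.merge.size.per.task=256000000;  -- target size for merged files, in bytes
```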
(3) Pay attention to data skew.
Common approaches in Hive:
The first: set hive.groupby.skewindata=true, which generates two MR jobs. The first job distributes the map output randomly across reducers for pre-aggregation, which smooths out the skew caused by a few keys having far more records than the rest.
The second: do map-side aggregation (a combiner) via hive.map.aggr=true (the default is true). If the records on a map side are mostly distinct, the combiner aggregates nothing and is pure overhead. Hive guards against this with hive.groupby.mapaggr.checkinterval=100000 (default) and hive.map.aggr.hash.min.reduction=0.5 (default): the first 100,000 rows are pre-aggregated, and if (rows remaining after aggregation) / 100000 > 0.5, aggregation is abandoned.
(4) Make good use of multi insert and union all.
multi insert suits scenarios where the same source table is processed with different logic and inserted into different tables: the source table is scanned only once, so the number of jobs is unchanged while the number of source-table scans drops.
union all can likewise reduce the number of table scans and jobs; a typical pattern is to generate intermediate results under different conditions, union all them, and then run a single group by. A union all over different tables corresponds to multiple map inputs; a union all over the same table is roughly one map pass with multiple outputs.
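A minimal sketch of multi insert (table names are hypothetical): one scan of the source table feeds two destination tables.

```sql
-- One FROM clause, several INSERTs: src is scanned a single time.
from src
insert overwrite table dest_high partition (dt='20120417')
  select key, value where key > 100
insert overwrite table dest_low partition (dt='20120417')
  select key, value where key <= 100;
```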
(5) Tune parameters.
Besides the cluster-wide parameters,
job-specific parameters can also be set, such as JVM reuse and the number of reduce copy threads (useful for jobs whose maps are fast and produce large output).
If there are many tasks and each is small, e.g. finishing in under a minute, reduce the task-initialization overhead; configuring JVM reuse can cut this cost.
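A minimal sketch of the job-level settings mentioned above (the values are illustrative, not recommendations):

```sql
set mapred.job.reuse.jvm.num.tasks=10;  -- reuse each JVM for up to 10 tasks instead of forking per task
set mapred.reduce.parallel.copies=10;   -- reduce-side threads copying map output (default 5)
```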