1. Fetch Task
set hive.fetch.task.conversion=more; (default)
Fetch means that for some queries Hive does not need to run MapReduce at all.
With this property set to more, whole-table scans, single-column selects and LIMIT queries are all answered directly by a fetch task, without MapReduce. Set to none, every statement goes through MapReduce.
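For example, with more enabled, simple queries like the following run as fetch tasks (the table name emp is illustrative):

```sql
-- No MapReduce job is launched for any of these when fetch conversion is 'more'
SELECT * FROM emp;
SELECT ename FROM emp;
SELECT ename FROM emp LIMIT 3;
```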
2. Local Mode
set hive.exec.mode.local.auto=true; (enable local mode)
Hive's local mode runs the whole job on a single machine. For small data sets, this can shorten execution time significantly.
1. Maximum amount of input data for local MR: after local mode is enabled, local MR is used when the input size is below this value.
set hive.exec.mode.local.auto.inputbytes.max=134217728 (default)
2. Maximum number of input files for local MR: local MR is used when the number of input files is below this value.
set hive.exec.mode.local.auto.input.files.max=4 (default)
3. Table Optimizations
3.1 Small table JOIN large table (traditionally the small table had to be on the left)
Note: newer versions of Hive optimize both small-table-JOIN-large-table and large-table-JOIN-small-table, so there is no longer a significant difference between putting the small table on the left or the right.
3.2 Large table JOIN large table
When a table contains many NULL values, they all become the same key during the shuffle of the MapReduce job; a single reducer then receives an enormous list of values for that key, which can run it out of memory. So find a way to filter out (or scatter) the NULL values.
1. Keep only the rows whose join key is not NULL
insert overwrite table jointable select n.* from
(select * from nullidtable where id is not null ) n left join ori o on n.id = o.id;
2. Keep the NULL rows but assign each a random key, so the key is never NULL and the rows spread across reducers
insert overwrite table jointable
select n.* from nullidtable n full join ori o on
case when n.id is null then concat('hive', rand()) else n.id end = o.id;
Note: the random-key approach also solves the data-skew problem caused by NULL keys.
3.3 MapJoin
If MapJoin is disabled or the query does not meet the MapJoin conditions, the Hive parser converts the join into a Common Join, i.e. the join is completed in the Reduce stage, where data skew occurs easily. With MapJoin, the small table is loaded entirely into memory and the join is done on the map side, so no reducer has to handle it.
Enable MapJoin:
set hive.auto.convert.join=true; (default)
Threshold separating small tables from large ones (at most 25 MB counts as a small table by default):
set hive.mapjoin.smalltable.filesize=25000000;
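With these settings, a join against a table under the threshold is converted automatically; an explicit hint can also force it (table and column names here are illustrative):

```sql
-- smalltable (under 25 MB) is broadcast to every mapper; no reduce-side join
SELECT /*+ MAPJOIN(s) */ b.id, s.name
FROM bigtable b JOIN smalltable s ON b.id = s.id;
```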
3.4 Group By
By default, the map phase sends all rows with the same key to one reducer, so a single oversized key skews the whole job. Not all aggregation has to happen on the reduce side: much of it can be done as a partial aggregation on the map side, with the reduce side only computing the final result.
Parameters for map-side aggregation:
Whether to aggregate on the map side (default true):
hive.map.aggr = true
Number of rows over which map-side aggregation is performed (default 100000):
hive.groupby.mapaggr.checkinterval = 100000
Load balancing when the data is skewed (default false):
hive.groupby.skewindata = true
Note: when this option is set to true, the generated query plan contains two MR jobs. In the first job, map output is distributed randomly to the reducers, each of which performs a partial aggregation and emits its result; rows with the same Group By key may land on different reducers, which is exactly what balances the load. The second job distributes the pre-aggregated results to the reducers by Group By key (this guarantees identical keys reach the same reducer) and completes the final aggregation.
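With both switches on, an ordinary aggregation needs no rewriting; Hive splits the plan itself (table and column names are illustrative):

```sql
SET hive.map.aggr = true;
SET hive.groupby.skewindata = true;
-- Compiled into two MR jobs: a randomly-partitioned partial aggregation,
-- then a final aggregation grouped by key
SELECT id, count(*) FROM bigtable GROUP BY id;
```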
3.5 Count(Distinct) deduplicated counting
COUNT(DISTINCT ...) runs as a single MapReduce job with one reduce task; that is harmless on small data, but on a large data set the lone reducer struggles to finish. In that case replace it with a GROUP BY followed by COUNT, which uses two jobs: after set mapreduce.job.reduces = 5; the first job deduplicates with its maps and 5 reducers, spreading the reduce load. It costs an extra job, but on large volumes of data it is definitely worth it.
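A sketch of the rewrite, with bigtable standing in for the real table:

```sql
SET mapreduce.job.reduces = 5;
-- Instead of: SELECT count(DISTINCT id) FROM bigtable;
-- job 1 deduplicates across 5 reducers, job 2 counts the result
SELECT count(id) FROM (SELECT id FROM bigtable GROUP BY id) t;
```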
3.6 Row and column filtering
- Column filtering: in SELECT, take only the columns you need, use partition filters wherever possible, and avoid SELECT *.
- Row filtering: cut rows early. With partition pruning, and with outer joins in particular, if the filter condition on the secondary table is written in WHERE, the whole table is joined first and filtered afterwards; push the filter into a subquery instead.
Examples:
1. First join the two tables, then filter with the WHERE condition:
hive (default)> select o.id from bigtable b join ori o on o.id = b.id where o.id <= 10;
2. Filter in a subquery first, then join the result:
hive (default)> select b.id from bigtable b join (select id from ori where id <= 10 ) o on b.id = o.id;
3.7 Dynamic Partitioning
In a relational database, when inserting into a partitioned table the database automatically routes each row to the right partition based on the value of the partition column. Hive provides a similar mechanism, dynamic partitioning (Dynamic Partition), but it must be configured before use.
First set the following properties:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
set hive.exec.max.dynamic.partitions = 1000;
set hive.exec.max.dynamic.partitions.pernode = 100;
set hive.exec.max.created.files = 100000;
set hive.error.on.empty.partition = false;
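The INSERT below assumes a target table whose partition column p_time is declared last; a sketch of the DDL it implies (column types are assumptions):

```sql
CREATE TABLE ori_partitioned_target (
  id BIGINT, time BIGINT, uid STRING, keyword STRING,
  url_rank INT, click_num INT, click_url STRING
) PARTITIONED BY (p_time STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```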
An example of dynamic partition insertion:
insert overwrite table ori_partitioned_target partition (p_time)
select id, time, uid, keyword, url_rank, click_num, click_url, p_time from ori_partitioned;
4. Data Skew
4.1 Set a reasonable number of map tasks
Set the maximum split size:
set mapreduce.input.fileinputformat.split.maxsize=???
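Lowering the maximum split size produces more splits and therefore more map tasks, which helps when each map does heavy per-row work. The value below is purely illustrative:

```sql
-- ~100 MB splits instead of the 128 MB block-size default => more map tasks
SET mapreduce.input.fileinputformat.split.maxsize=104857600;
```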
4.2 Merge small files
Merge small files before map execution to reduce the number of map tasks: CombineHiveInputFormat (the system default format) can merge small files; HiveInputFormat has no such merging capability.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
4.3 Increase the number of maps for complex files
set mapreduce.job.maps =???
4.4 Set a reasonable number of reduce tasks
1. Method one: adjust the per-reducer parameters
Amount of data processed by each reducer (default 256 MB):
hive.exec.reducers.bytes.per.reducer=256000000
Maximum number of reducers per job (default 1009):
hive.exec.reducers.max=1009
Formula for the number of reducers:
N = min(parameter 2, total input size / parameter 1)
For example, with 2.5 GB of input and the defaults above: N = min(1009, 2560 MB / 256 MB) = 10.
2. Method two: set the number of reducers directly:
set mapreduce.job.reduces=???
3. More reducers is not always better
- Starting and initializing too many reducers consumes time and resources;
- Moreover, each reducer produces one output file, so too many reducers means many small output files; if those small files are the input of the next task, the small-file problem reappears. When setting the number of reducers, weigh two principles: use an appropriate number of reducers for a large amount of data, and keep the amount of data a single reduce task handles appropriate.
4.5 parallel execution
Setting the parameter hive.exec.parallel to true enables concurrent execution of independent stages. On a shared cluster, be aware that if a job runs more stages in parallel, cluster utilization goes up.
set hive.exec.parallel=true; // enable parallel task execution
set hive.exec.parallel.thread.number=16; // maximum parallelism allowed for one SQL statement, default 8