1. Streamtable hint
When Hive parses SQL, it assumes by default that the last table in a JOIN is the large table: it tries to cache the other tables in memory and then streams the last table through for the computation. Since users do not always put the large table last, a hint can be added to the SQL to tell the query optimizer explicitly which table is the large (streamed) one.
For example: select /*+ STREAMTABLE(a) */ a.id from log a left join user b on a.uid = b.uid;
2. Map-side join
set hive.auto.convert.join=true;
If one of the tables is small enough, it is cached entirely in memory; as the mapper scans the largest table, each row is matched against the small tables in memory, which eliminates the reduce phase that a conventional join requires. The maximum size of a "small" table (in bytes) is configurable:
hive.mapjoin.smalltable.filesize=25000000
Note: Hive does not apply this optimization to RIGHT OUTER JOIN or FULL OUTER JOIN.
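The two settings above can be sketched together with a join that qualifies for auto-conversion (table and column names here are hypothetical, not from the source):

```sql
-- With auto-convert enabled, Hive turns this into a map-side join
-- as long as the `user` table is below the smalltable threshold.
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;  -- 25 MB threshold

select a.id, b.name
from log a                      -- large table, streamed by mappers
join user b on a.uid = b.uid;   -- small table, cached in memory
```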
3. Local mode
set hive.exec.mode.local.auto=true; (default is false)
A job actually runs in local mode only when it meets all of the following conditions:
1. The total input size of the job is smaller than hive.exec.mode.local.auto.inputbytes.max (default 128MB)
2. The number of maps of the job must be smaller than the parameter: hive.exec.mode.local.auto.tasks.max (default 4)
3. The reduce number of the job must be 0 or 1
A trivial query such as select * from a; does not use MapReduce at all.
For other simple queries, MapReduce is started, but it can still run in local mode, e.g.:
select a.name,a.age from user a;
It can also be configured in hive_home/conf/hive-site.xml
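The local-mode conditions above correspond to these settings (a sketch; the values shown are the defaults, adjust them per cluster):

```sql
set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;  -- 128 MB input cap
set hive.exec.mode.local.auto.tasks.max=4;               -- max map tasks
```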
4. Parallel execution
set hive.exec.parallel=true;
When Hive executes a query, it converts it into one or more stages: MapReduce stages, sampling stages, merge stages, limit stages, or other stages of the Hive execution process. By default, Hive executes only one stage at a time. A given job, however, may contain many stages, and some of them may not depend on one another; executing such independent stages in parallel can shorten the overall execution time.
It can also be configured in hive_home/conf/hive-site.xml
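A sketch of the parallel-execution settings; the thread-number parameter (default 8) caps how many independent stages may run at once:

```sql
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=8;  -- max stages running concurrently
```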
5. Strict Mode
set hive.mapred.mode=strict; strict mode
set hive.mapred.mode=nonstrict; non-strict mode
Hive's strict mode prevents users from executing queries that may have unintended, harmful effects.
First: for partitioned tables, a query is not allowed to run unless the WHERE clause contains a filter on the partition column to limit the data range.
Second: a query that uses ORDER BY must also use LIMIT. Because ORDER BY sends all result rows to a single reducer for sorting, forcing the user to add a LIMIT clause prevents that reducer from running for an excessively long time.
Third: Cartesian-product queries are restricted; a JOIN must be written with an ON clause.
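The three restrictions can be illustrated with hedged examples (table names `sales` and `dim`, and the `dt` partition column, are hypothetical):

```sql
set hive.mapred.mode=strict;

-- Rejected: no partition filter on a partitioned table.
-- select * from sales;
-- Accepted:
select * from sales where dt = '2023-01-01';

-- Rejected: ORDER BY without LIMIT.
-- select * from sales where dt = '2023-01-01' order by amount;
-- Accepted:
select * from sales where dt = '2023-01-01' order by amount limit 100;

-- Rejected: Cartesian product (JOIN without ON).
-- select * from sales a join dim b;
```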
6. Set the number of mappers and reducers
set hive.exec.reducers.max=(the total number of reduce slots in the cluster*1.5)/(the average number of queries in execution)
The default number of reducers in hive is 3
It can also be configured in hive_home/conf/hive-site.xml
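A sketch of reducer tuning (the concrete values are assumptions, not from the source); when the reducer count is not fixed, Hive estimates it from input size via the bytes-per-reducer setting:

```sql
set hive.exec.reducers.max=32;                       -- cluster-specific cap
set hive.exec.reducers.bytes.per.reducer=256000000;  -- ~256 MB of input per reducer
-- Or fix the number of reducers explicitly for one query:
set mapred.reduce.tasks=10;
```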
7. JVM reuse
set mapred.job.reuse.jvm.num.tasks=10
It can also be configured in hadoop's mapred-site.xml
By default, Hadoop forks a new JVM for each map or reduce task. The JVM startup cost can be considerable, especially when a job contains hundreds or thousands of tasks. JVM reuse allows a JVM instance to be reused up to N times within the same job. The drawback is that with JVM reuse enabled, a reused JVM keeps holding its task slot and does not release it until the job ends, so a few long-running skewed tasks can leave slots occupied but idle.
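The same setting can be made cluster-wide in Hadoop's mapred-site.xml; a sketch (the value 10 mirrors the `set` command above):

```xml
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>10</value>
  <!-- a value of -1 reuses the JVM without limit within a job -->
</property>
```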
8.hive dynamic partition
Reference address https://blog.csdn.net/oracle8090/article/details/72627135
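Since this section only links out, here is a brief sketch of the usual dynamic-partition settings and an INSERT that derives the partition value from the data (table and column names are hypothetical):

```sql
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic
set hive.exec.max.dynamic.partitions=1000;       -- safety cap on partitions created

-- The partition column `dt` is filled from the last SELECT column.
insert overwrite table log_part partition (dt)
select uid, url, dt from log_staging;
```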
9.set hive.map.aggr=true;
Equivalent to running a combiner on the map side: partial aggregation is done in the mapper before data is shuffled.
10. Group by optimization
set hive.groupby.skewindata=true;
If data skew occurs during GROUP BY, set this to true; Hive then generates two MR jobs: the first distributes map output randomly across reducers for partial aggregation, and the second completes the final aggregation by group key.
set hive.groupby.mapaggr.checkinterval=100000;
when the number of records aggregated for a group key exceeds this value, the aggregation is optimized
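The group-by settings above can be sketched together with a query that benefits (table and column names are hypothetical; a few hot `uid` values would otherwise overload one reducer):

```sql
set hive.map.aggr=true;                     -- map-side partial aggregation
set hive.groupby.mapaggr.checkinterval=100000;
set hive.groupby.skewindata=true;           -- two-stage aggregation for skewed keys

select uid, count(*) as pv
from log
group by uid;
```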
More Hive configuration information: https://blog.csdn.net/chaoping315/article/details/8500407