Hive optimization

1. Configuration optimization

When Hive parses a join, it assumes by default that the last table is the large one: it tries to cache the other tables and then streams the last table through for the calculation. Users often do not put the large table last, so a hint can be added to the SQL to tell the query optimizer which table is the large one.

For example: select /*+ streamtable(a) */ a.id from log a left join user b on a.uid = b.uid

2. Map-side join

 set hive.auto.convert.join=true;

If one of the tables is small enough, it is cached entirely in memory. Each mapper that scans the largest table can then match its rows against the in-memory small table one by one, which eliminates the reduce step that a conventional join requires.

The size threshold for a small table (in bytes) can be configured:

set hive.mapjoin.smalltable.filesize=25000000;

Note: Hive does not support this optimization for right outer joins and full outer joins.
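
As a rough sketch of how these two settings fit together, the session below turns on automatic conversion and then runs an ordinary join. The log and user tables come from the earlier example; the assumption that user is the small table, and the b.name column, are for illustration only.

```sql
-- Let Hive convert a common join into a map-side join when it can.
set hive.auto.convert.join=true;
-- Tables up to this many bytes are treated as small (value taken from the text above).
set hive.mapjoin.smalltable.filesize=25000000;

-- Assumption: user is under the threshold and log is the large table.
-- If so, user is loaded into memory and the join needs no reduce stage.
select a.id, b.name
from log a
join user b on a.uid = b.uid;
```

Running explain on the query is one way to check whether the conversion applied: the plan should then contain a map join operator rather than a reduce-side join.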

3. Local mode

set hive.exec.mode.local.auto=true; (default is false)

A job actually runs in local mode only when it meets all of the following conditions:
1. The size of the job's input data must be smaller than hive.exec.mode.local.auto.inputbytes.max (default 128MB)
2. The number of map tasks of the job must be smaller than hive.exec.mode.local.auto.tasks.max (default 4)
3. The number of reduce tasks of the job must be 0 or 1

A very simple query such as the following does not start MapReduce at all:
select * from a;
A query such as the following does start a MapReduce job, but that job can run in local mode:

select a.name,a.age from user a;

It can also be configured in hive_home/conf/hive-site.xml
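
A minimal session-level sketch of the local mode settings is shown below. The property names and default values are the ones listed in this section; they may be named differently in other Hive versions, so treat them as assumptions to check against your installation.

```sql
-- Let Hive decide per job whether local mode can be used.
set hive.exec.mode.local.auto=true;
-- Input size limit for local mode: 128 MB = 128 * 1024 * 1024 bytes.
set hive.exec.mode.local.auto.inputbytes.max=134217728;
-- Map task limit for local mode (default quoted in this section).
set hive.exec.mode.local.auto.tasks.max=4;

-- A small projection like this still needs a job, but the job can run locally
-- if the conditions listed above are met.
select a.name, a.age from user a;
```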

4. Parallel execution

set hive.exec.parallel=true;

When Hive runs a query, it converts it into one or more stages. These can be MapReduce stages, sampling stages, merge stages, limit stages, or other stages in the Hive execution process. By default, Hive executes only one stage at a time. However, a particular job may contain many stages that do not all depend on one another; such stages can be executed in parallel, which can shorten the overall execution time.

It can also be configured in hive_home/conf/hive-site.xml
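
The sketch below shows the kind of query that may benefit. The two subqueries read different tables and do not depend on each other, so their stages are candidates for running at the same time. The thread-count setting and its value of 8 are assumptions added for illustration, and the tables are the log and user tables used earlier.

```sql
-- Allow independent stages of one query to run concurrently.
set hive.exec.parallel=true;
-- Upper limit on the number of stages run at once (8 is assumed here).
set hive.exec.parallel.thread.number=8;

-- The two counts are independent of each other, so Hive may run them in parallel.
select t.src, t.cnt
from (
  select 'log' as src, count(*) as cnt from log
  union all
  select 'user' as src, count(*) as cnt from user
) t;
```

Parallel stages compete for the same cluster resources, so the gain depends on how much spare capacity the cluster has.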

5. Strict Mode

set hive.mapred.mode=strict; (strict mode)

set hive.mapred.mode=nonstrict; (non-strict mode)

The strict mode provided by Hive prevents users from running queries that may have unexpected adverse effects. It restricts three kinds of query.

First: a query on a partitioned table is not allowed to run unless its where clause contains a filter on the partition field to limit the data range.

Second: a query that uses order by must also use limit. To perform the sort, order by sends all the result data to a single reducer, so forcing the user to add a limit clause keeps that reducer from running for an excessively long time.

Third: Cartesian products are restricted. A join must be written with an on clause.
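
The three restrictions can be illustrated with the sketch below. It assumes, purely for illustration, that log is partitioned by a string column named dt; that column does not appear in the original examples. The rejected forms are left as comments so the script itself runs.

```sql
set hive.mapred.mode=strict;

-- 1. Partitioned table: a partition filter is required.
-- Rejected in strict mode:  select a.id from log a;
select a.id from log a where a.dt = '2020-01-01';

-- 2. order by must be combined with limit.
-- Rejected in strict mode:  select a.id from log a where a.dt = '2020-01-01' order by a.id;
select a.id from log a where a.dt = '2020-01-01' order by a.id limit 100;

-- 3. No Cartesian products: the join needs an on clause.
-- Rejected in strict mode:  select a.id from log a join user b where a.dt = '2020-01-01';
select a.id
from log a
join user b on a.uid = b.uid
where a.dt = '2020-01-01';
```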

6. Set the number of mappers and reducers

A suggested value for hive.exec.reducers.max is (the total number of reduce slots in the cluster * 1.5) / (the average number of queries running at the same time).

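Hive does not use a fixed default number of reducers: unless mapred.reduce.tasks is set explicitly, it estimates the number from the input size and hive.exec.reducers.bytes.per.reducer, up to the limit in hive.exec.reducers.max.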

It can also be configured in hive_home/conf/hive-site.xml
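
As a worked example of the reducer limit formula, take a hypothetical cluster with 100 reduce slots and about 5 queries running at the same time. Both numbers are made up for illustration.

```sql
-- Hypothetical cluster: 100 reduce slots, about 5 concurrent queries.
-- (100 * 1.5) / 5 = 30
set hive.exec.reducers.max=30;
```

With this limit, no single query asks for more than 30 reducers, which leaves room for the other queries running at the same time.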

7. JVM reuse

set mapred.job.reuse.jvm.num.tasks=10;

It can also be configured in hadoop's mapred-site.xml

By default, Hadoop usually forks a new JVM for each map and reduce task. JVM startup can then cause considerable overhead, especially when a job contains hundreds or thousands of tasks. JVM reuse lets a JVM instance be reused N times within the same job. It has one drawback: with JVM reuse turned on, the task slots stay occupied and are not released until the job ends.

8. Hive dynamic partition

Reference: https://blog.csdn.net/oracle8090/article/details/72627135

9. Map-side aggregation

set hive.map.aggr=true;

This is equivalent to running a combiner on the map side.

10. Join optimization (when most of the join keys are null, the data will be skewed)
set hive.optimize.skewjoin=true;
If data skew occurs during a join, set this to true to enable automatic optimization.
set hive.skewjoin.key=10000;
When the number of records for a join key exceeds this value, the key is treated as skewed and the optimization is applied.
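
A minimal sketch of the skew join settings in use follows, with the log and user tables from the earlier examples. The second query is a separate, commonly used workaround for null join keys rather than part of the settings above, and it assumes uid is a string column.

```sql
-- Let Hive handle skewed join keys at run time.
set hive.optimize.skewjoin=true;
-- A join key with more rows than this is treated as skewed (value from the text).
set hive.skewjoin.key=10000;

select a.id
from log a
join user b on a.uid = b.uid;

-- Workaround for null keys (assumption: uid is a string column).
-- Each null uid is replaced by a random value that matches nothing in user,
-- so the null rows are spread over many reducers instead of landing on one.
select a.id
from log a
left join user b
  on (case when a.uid is null
           then concat('null_', cast(rand() as string))
           else a.uid end) = b.uid;
```

The result of the left join is unchanged by the rewrite, because a null uid never matches a row in user either way.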

11. group by optimization

set hive.groupby.skewindata=true;
If skew occurs during group by, set this to true.
set hive.groupby.mapaggr.checkinterval=100000;
The number of rows processed on the map side between checks of the aggregation.
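
The group by settings can be sketched as below, combined with map-side aggregation and using the log table from the earlier examples. With hive.groupby.skewindata turned on, Hive plans the aggregation as two jobs: the first spreads rows randomly across reducers and aggregates partially, and the second groups the partial results by the real key.

```sql
-- Aggregate partially on the map side, like a combiner.
set hive.map.aggr=true;
-- Use the two-job plan so a hot key is not sent to a single reducer in one step.
set hive.groupby.skewindata=true;

-- Count the log records per user; a few very active uids would otherwise
-- overload the reducers that receive them.
select a.uid, count(*) as cnt
from log a
group by a.uid;
```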



Hive configuration information: https://blog.csdn.net/chaoping315/article/details/8500407
