Hive performance optimization summary (4)

Table join optimization 

1. Put the large table last

Hive assumes that the last table in the query is a large table. It will cache other tables and scan the last table.

So it is usually necessary to put the small tables first, or to mark which table is the large (streamed) table with a hint: /*+ STREAMTABLE(table_name) */
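For example, a minimal sketch (the table and column names users, orders and user_id are hypothetical) that streams the assumed large table while the smaller one is cached:

select /*+ STREAMTABLE(o) */ o.order_id, u.name
from users u
join orders o on u.id = o.user_id;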

2. Use the same connection key

When joining 3 or more tables, only one MapReduce job will be generated if every on clause uses the same join key.
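A sketch (hypothetical tables a, b, c all joined on the same key user_id), which Hive can execute as a single MapReduce job:

select a.val, b.val, c.val
from a
join b on a.user_id = b.user_id
join c on a.user_id = c.user_id;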

3. Filter data as early as possible

Reduce the amount of data in each stage: add partition predicates when querying partitioned tables, and select only the columns that are actually needed.
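For instance, a sketch against a hypothetical table logs partitioned by dt, which prunes partitions and reads only the needed columns:

select user_id, url
from logs
where dt = '2019-01-01'
  and url like '%error%';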

4. Keep each SQL statement as atomic as possible

Avoid packing complex logic into a single SQL statement; intermediate tables can be used to break the logic into simpler steps, as in the sketch below.
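A minimal sketch (the table names are hypothetical) of splitting one complex query into two simpler ones via an intermediate table:

create table tmp_daily_uv as
select dt, count(distinct user_id) as uv
from logs
group by dt;

insert overwrite table report
select t.dt, t.uv, d.channel
from tmp_daily_uv t
join dim_date d on t.dt = d.dt;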

 

 

Notice:

Performance analysis of joins between small tables and large tables in Hive:

https://blog.csdn.net/WYpersist/article/details/80001475

Replace union all with insert into

If union all combines more than 2 parts, or each union part has a large amount of data, the statement should be split into multiple insert into statements. In actual tests this improved execution time by roughly 50%.

 

Such as:

insert overwrite table tablename partition (dt= ....)
select ..... from (
    select ... from A
    union all
    select ... from B
    union all
    select ... from C
) R
where ...;

can be rewritten as:

insert into table tablename partition (dt= ....) select .... from A WHERE ...;
insert into table tablename partition (dt= ....) select .... from B WHERE ...;
insert into table tablename partition (dt= ....) select .... from C WHERE ...;

order by & sort by 

order by: sorts the query results globally, which takes a long time; requires set hive.mapred.mode=nonstrict

sort by: local (per-reducer) sorting rather than global sorting, which improves efficiency.
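A sketch (hypothetical table and columns); distribute by is commonly combined with sort by so that rows with the same key reach the same reducer before being sorted locally:

select user_id, event_time
from logs
distribute by user_id
sort by user_id, event_time;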

 

transform+python

A way to embed custom logic in the Hive data flow. Through the transform statement, logic that is inconvenient to implement in Hive can be written in Python, and the results are then written back into a Hive table.

Syntax:

select transform({column names1})

using '**.py'

as {column names2}

from {table name}

If there are other dependencies besides the Python script, you can use ADD ARCHIVE.
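A sketch of the syntax above (the script weekday_mapper.py and the tables u_data / u_data_new are hypothetical; the script is registered first with ADD FILE):

add file /path/to/weekday_mapper.py;

insert overwrite table u_data_new
select transform (userid, movieid, rating, unixtime)
using 'python weekday_mapper.py'
as (userid, movieid, rating, weekday)
from u_data;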

The limit statement quickly produces results

Under normal circumstances, the Limit statement still needs to execute the entire query statement, and then return some results.

There is a configuration property that can be turned on to avoid this --- sample the data source

hive.limit.optimize.enable=true --- Enable the function of sampling the data source

hive.limit.row.max.size --- set the minimum sampling size

hive.limit.optimize.limit.file --- set the maximum number of files to sample

Disadvantage: it is possible that some data will never be processed
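A sketch of turning the optimization on (the values shown are only illustrative):

set hive.limit.optimize.enable=true;
set hive.limit.row.max.size=100000;
set hive.limit.optimize.limit.file=10;

select * from logs limit 100;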

local execution

For small datasets, the overhead of launching the tasks for a query can exceed the time spent actually executing the job, so in such cases all tasks can be processed on a single machine (or sometimes in a single process) through local mode.

set oldjobtracker=${hiveconf:mapred.job.tracker};
set mapred.job.tracker=local;
set mapred.tmp.dir=/home/edward/tmp;

-- run the SQL statement here

set mapred.job.tracker=${oldjobtracker};

--You can let Hive automatically start this optimization when appropriate by setting the property hive.exec.mode.local.auto to true, or you can put this setting in the $HOME/.hiverc file.

--When a job meets the following conditions, the local mode can be used:

1. The input data size of the job must be smaller than the parameter hive.exec.mode.local.auto.inputbytes.max (default 128MB)

2. The number of map tasks of the job must be less than the parameter hive.exec.mode.local.auto.tasks.max (default 4)

3. The number of reduce tasks of the job must be 0 or 1

The maximum amount of memory used by the local-mode child JVM can be controlled by the parameter hive.mapred.local.mem (default 0).
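A sketch of letting Hive decide automatically, using the default thresholds mentioned above:

set hive.exec.mode.local.auto=true;
set hive.exec.mode.local.auto.inputbytes.max=134217728;
set hive.exec.mode.local.auto.tasks.max=4;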

parallel execution

Hive will convert a query into one or more stages, including: MapReduce stage, sampling stage, merge stage, limit stage, etc. By default, only one stage is executed at a time. However, some stages can be executed in parallel if they are not interdependent.

set hive.exec.parallel=true; // enable concurrent execution of independent stages

set hive.exec.parallel.thread.number=16; // the maximum degree of parallelism allowed for one SQL statement; the default is 8

It will consume more system resources.

Adjust the number of mappers and reducers

1 Map stage optimization

The number of maps is mainly determined by: the total number of input files, the size of the input files, and the file block size configured for the cluster (default 128MB, which cannot be customized from within Hive).

Example:

a) Suppose there is a file a in the input directory with a size of 780MB; hadoop will split file a into 7 blocks (6 blocks of 128MB and 1 block of 12MB), producing 7 map tasks.

b) Suppose there are 3 files a, b, c in the input directory with sizes of 10MB, 20MB, and 130MB; hadoop will split them into 4 blocks (10MB, 20MB, 128MB, and 2MB), producing 4 map tasks.

That is, if a file is larger than the block size (128MB) it will be split, and if it is smaller than the block size it is treated as a single block.

Map execution time: Time for map task startup and initialization + time for logic processing.

1) Reduce the number of maps

If there are a large number of small files (smaller than 128MB), many map tasks will be generated. The way to handle this is:

set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;

--The three parameters above determine the size of the merged file splits: anything larger than the block size of 128MB is split at 128MB; anything smaller than 128MB but larger than 100MB is split at 100MB; and whatever is smaller than 100MB (including small files and the remainders left over from splitting large files) is merged together.

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; --Merge small files before execution

2) Increase the number of maps

When the input files are large, the task logic is complex, and the map execution is very slow, you can consider increasing the number of maps to reduce the amount of data processed by each map , thereby improving the execution efficiency of the task.

set mapred.reduce.tasks=?
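One common way this parameter is used (a sketch, not from the original text; the table names are hypothetical): rewrite the table's data into more files by forcing more reducers, so that later queries on the new table get more map tasks:

set mapred.reduce.tasks=10;

create table test_a_split as
select *
from test_a
distribute by rand();

Because the rows are spread randomly across the 10 reducers, the new table consists of 10 files instead of one large file.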

2 Reduce stage optimization

Adjustment method:

-- set mapred.reduce.tasks=?

-- set hive.exec.reducers.bytes.per.reducer = ?

Generally, Hive's estimation function automatically calculates the number of reducers from the total size of the input files: Number of reducers = InputFileSize / hive.exec.reducers.bytes.per.reducer (capped by hive.exec.reducers.max).
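A worked sketch (the per-reducer size is illustrative; the default has varied between Hive versions, for example 1GB in older releases and 256MB since Hive 0.14):

set hive.exec.reducers.bytes.per.reducer=256000000;
-- For roughly 2.56GB of input, Hive estimates 2,560,000,000 / 256,000,000 = 10 reducers.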

 

strict mode

set hive.mapred.mode=strict ------ prevents users from executing queries that may have unexpected adverse effects

--For partitioned tables, a partition predicate must be specified

--Queries using order by must include a limit clause, because order by distributes all the result data to a single reducer in order to perform the sort

--Cartesian product queries are restricted: when two tables are joined, there must be an on clause
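A sketch (hypothetical partitioned table logs) of a query strict mode rejects and a version it accepts:

set hive.mapred.mode=strict;

-- Rejected: no partition predicate, and order by without limit
-- select * from logs order by event_time;

-- Accepted
select *
from logs
where dt = '2019-01-01'
order by event_time
limit 100;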

data skew

Symptom: the task progress stays at 99% (or 100%) for a long time. The task monitoring page shows that only a few (1 or several) reduce subtasks have not finished, because the amount of data they process differs greatly from that of the other reducers.

The number of records handled by a single reduce differs from the average number of records by a large margin, usually 3 times or more, and its running time is much longer than the average running time.

Reasons:

1) The key distribution is uneven

2) The characteristics of the business data itself

3) Insufficient consideration when designing the table

4) Some SQL statements are themselves prone to data skew

Keyword | Situation | Consequence
join | One of the tables is small, but its keys are concentrated | The data distributed to one or a few reducers is far above the average
join | Large table joined with large table, but the join key contains too many 0 or null values | All the null values are processed by a single reduce, which is very slow
group by | The group by dimension is too small and a certain value occurs too many times | The reduce that processes that value takes a long time
count distinct | Too many rows contain a special value | The reduce that processes the special value is time-consuming

 

solution:

Parameter adjustment

hive.map.aggr=true
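A sketch of the parameter adjustment; hive.map.aggr enables map-side partial aggregation (similar to a combiner). hive.groupby.skewindata is another Hive parameter commonly paired with it for skewed group by, added here as an assumption since the original text lists only the first one:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;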

 
