Table join optimization
1. Put the large table last
Hive assumes that the last table in a query is the largest. It buffers the other tables in memory and streams the last table through the reducers.
So it is usually best to put small tables first, or to mark the large table explicitly with a hint: /*+ STREAMTABLE(table_name) */
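For example (table and column names here are illustrative, not from the source):

```sql
-- Tell Hive to stream the large table regardless of its position in the query:
SELECT /*+ STREAMTABLE(orders) */ o.order_id, u.name
FROM orders o            -- large table (streamed)
JOIN users u             -- small table (buffered in memory)
  ON o.user_id = u.user_id;
```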
2. Use the same connection key
When joining 3 or more tables, Hive generates only one MapReduce job if every on clause uses the same join key.
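A sketch of this rule (illustrative table names):

```sql
-- Every ON clause joins on u.user_id, so Hive compiles this into a single
-- MapReduce job:
SELECT u.name, o.order_id, p.amount
FROM users u
JOIN orders o   ON u.user_id = o.user_id
JOIN payments p ON u.user_id = p.user_id;
-- If the second ON clause used o.order_id instead, Hive would need two jobs.
```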
3. Filter data as early as possible
Reduce the amount of data at each stage: add partition filters when querying partitioned tables, and select only the columns that are actually needed.
4. Keep statements atomic
Avoid packing complex logic into a single SQL statement; use intermediate tables to break complex logic into steps.
Note: for a performance analysis of joins between small and large tables in Hive, see:
https://blog.csdn.net/WYpersist/article/details/80001475
replace union all with insert into
If a union all has more than 2 branches, or each branch carries a large amount of data, split it into multiple insert into statements. In practice this can cut execution time by roughly 50%.
Such as:
insert overwrite table tablename partition (dt=....)
select ... from (
  select ... from A
  union all
  select ... from B
  union all
  select ... from C
) R
where ...;
can be rewritten as:
insert into table tablename partition (dt=....) select ... from A where ...;
insert into table tablename partition (dt=....) select ... from B where ...;
insert into table tablename partition (dt=....) select ... from C where ...;
order by & sort by
order by: sorts the query result globally, which is slow. Under strict mode it requires a limit clause; to run it without one, set hive.mapred.mode=nonstrict.
sort by: sorts locally within each reducer rather than globally, which is much more efficient.
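A sketch of the usual pairing (illustrative table and columns): sort by alone only orders rows within each reducer, so distribute by is typically added to control which reducer each row goes to.

```sql
-- Rows for the same user_id go to the same reducer, and each reducer's
-- output is sorted locally; no global sort is performed.
SELECT user_id, event_time
FROM events
DISTRIBUTE BY user_id
SORT BY user_id, event_time;
```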
transform+python
A way to embed custom logic in Hive's data flow. Through the transform statement, logic that is awkward to express in HiveQL can be implemented in Python, and the output written back into a Hive table.
Syntax:
select transform({column names1})
using '**.py'
as {column names2}
from {table name}
If the script has dependencies beyond a single Python file, ship them with ADD ARCHIVE (or ADD FILE).
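A minimal sketch of such a script (the table, columns, and file name are hypothetical). Hive feeds each row to the script on stdin as tab-separated text and reads tab-separated rows back from stdout:

```python
#!/usr/bin/env python
# upper_name.py -- a minimal TRANSFORM script (hypothetical table/columns).
# Hive sends rows as tab-separated lines on stdin and expects
# tab-separated lines on stdout.
import sys

def transform_line(line):
    # Input row: user_id <TAB> name; output row: user_id <TAB> NAME
    user_id, name = line.rstrip('\n').split('\t')
    return '%s\t%s' % (user_id, name.upper())

if __name__ == '__main__':
    for raw in sys.stdin:
        print(transform_line(raw))
```

It would then be registered and invoked roughly like: ADD FILE upper_name.py; SELECT TRANSFORM(user_id, name) USING 'python upper_name.py' AS (user_id, name_upper) FROM users;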
The limit statement quickly produces results
Normally, a limit query still executes the entire query and then returns a subset of the results.
There is a configuration property that avoids this by sampling the data source instead:
hive.limit.optimize.enable=true --- enable sampling of the data source for limit queries
hive.limit.row.max.size --- lower bound on the amount of data to sample
hive.limit.optimize.limit.file --- upper bound on the number of files to sample
Drawback: some input data may never be processed, so the result can differ from running the full query.
local execution
For small data sets, the overhead of submitting and scheduling the job can exceed the time spent actually executing it, so the whole query can instead run on a single machine (in a single process) in local mode.
set oldjobtracker=${hiveconf:mapred.job.tracker};
set mapred.job.tracker=local;
set mapred.tmp.dir=/home/edward/tmp;
-- ... run the SQL statement here ...
set mapred.job.tracker=${oldjobtracker};
--You can let Hive apply this optimization automatically when appropriate by setting hive.exec.mode.local.auto=true; this setting can also be placed in the $HOME/.hiverc file.
--When a job meets the following conditions, the local mode can be used:
1. The input data size of the job must be smaller than the parameter: hive.exec.mode.local.auto.inputbytes.max ( default 128MB)
2. The number of maps of the job must be less than the parameter: hive.exec.mode.local.auto.tasks.max ( default 4)
3. The number of reduce tasks of the job must be 0 or 1
The maximum amount of memory used by the child JVM in local mode can be controlled with the parameter hive.mapred.local.mem (default 0).
parallel execution
Hive will convert a query into one or more stages, including: MapReduce stage, sampling stage, merge stage, limit stage, etc. By default, only one stage is executed at a time. However, some stages can be executed in parallel if they are not interdependent.
set hive.exec.parallel=true; -- enable concurrent execution
set hive.exec.parallel.thread.number=16; -- maximum parallelism allowed for one SQL statement; the default is 8
Note that parallel execution consumes more system resources.
Adjust the number of mappers and reducers
1 Map stage optimization
The number of maps is determined mainly by: the total number of input files, the input file sizes, and the HDFS block size configured for the cluster (dfs.blocksize, 128 MB by default).
Example:
a) Suppose the input directory contains one file a of 780 MB. Hadoop splits it into 7 blocks (6 blocks of 128 MB and one of 12 MB), producing 7 map tasks.
b) Suppose the input directory contains 3 files a, b, c of 10 MB, 20 MB, and 130 MB. Hadoop produces 4 splits (10 MB, 20 MB, 128 MB, and 2 MB), hence 4 map tasks.
That is, a file larger than the block size (128 MB) is split, and a file smaller than the block size forms one split by itself.
Map execution time: Time for map task startup and initialization + time for logic processing.
1 ) Reduce the number of maps
If there are a large number of small files (less than 128M ), multiple maps will be generated. The processing method is:
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
--These three parameters determine the size of the merged splits: blocks larger than 128 MB are split at 128 MB; pieces between 100 MB and 128 MB are split at 100 MB; and what remains (small files and leftover fragments) is merged together.
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; --Merge small files before execution
2) Increase the number of maps
When the input files are large, the task logic is complex, and the map execution is very slow, you can consider increasing the number of maps to reduce the amount of data processed by each map , thereby improving the execution efficiency of the task.
set mapred.max.split.size to a smaller value, so that each map processes less data.
2 Reduce stage optimization
Adjustment method:
-- set mapred.reduce.tasks=?
-- set hive.exec.reducers.bytes.per.reducer = ?
Generally, Hive estimates the reducer count automatically from the total size of the input files: number of reducers = min(total input size / hive.exec.reducers.bytes.per.reducer, hive.exec.reducers.max)
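A worked example of the estimate (the sizes are illustrative):

```sql
-- Suppose the input totals 10 GB and each reducer should handle about 256 MB:
set hive.exec.reducers.bytes.per.reducer=268435456;  -- 256 MB
-- estimated reducers = 10240 MB / 256 MB = 40

-- Or override the estimate with a fixed count:
set mapred.reduce.tasks=15;
```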
strict mode
set hive.mapred.mode=strict ------ prevents users from executing queries that may have unexpectedly bad effects
--For partitioned tables, a partition filter must be specified.
--Queries using order by must include a limit clause, because order by sends all result rows to a single reducer to perform the global sort.
--Cartesian product queries are restricted: a join between two tables must have an on clause.
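Queries that satisfy the three constraints look roughly like this (table and column names are illustrative):

```sql
-- 1) Partition filter required:
SELECT * FROM sales WHERE dt = '2020-01-01';

-- 2) ORDER BY requires LIMIT:
SELECT * FROM sales WHERE dt = '2020-01-01'
ORDER BY amount DESC LIMIT 100;

-- 3) Joins require an ON clause (no Cartesian products):
SELECT a.id, b.info FROM t1 a JOIN t2 b ON a.id = b.id;
```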
data skew
Symptom: task progress stays at 99% (or 100%) for a long time, and the monitoring page shows that only a few (one or several) reduce subtasks are unfinished, because the amount of data they process differs greatly from the other reducers.
The record count of a single reducer differs from the average by 3x or more, and its runtime is much longer than the average.
Causes:
1) Uneven key distribution
2) Characteristics of the business data itself
3) Poor choices when designing the table
4) Some SQL statements are inherently prone to data skew
| Keyword | Situation | Consequence |
| --- | --- | --- |
| join | One table is small, but its keys are concentrated | The data sent to one or a few reducers is far above the average |
| join | Large table joined to large table, but the join field has too many 0 or null values | All the null values are handled by a single reducer, which is very slow |
| group by | The group-by dimension has too few distinct values, and one value is overrepresented | The reducer handling that value takes a long time |
| count distinct | Too many special values | The reducer handling the special values is time-consuming |
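For the null-heavy join case in the table above, one common mitigation is to scatter the null keys with a random value so they no longer all land on one reducer (table and column names are illustrative):

```sql
SELECT a.*, b.info
FROM logs a
LEFT JOIN users b
  ON CASE WHEN a.user_id IS NULL
          THEN concat('null_', rand())   -- spread null keys across reducers;
          ELSE a.user_id END             -- they can never match, same as before
     = b.user_id;
```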
Solutions:
Parameter tuning:
hive.map.aggr=true -- partial aggregation on the map side, equivalent to a combiner
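A setting commonly paired with hive.map.aggr for skewed group by workloads (a sketch of the standard parameter combination):

```sql
set hive.map.aggr=true;            -- partial aggregation on the map side
set hive.groupby.skewindata=true;  -- load balancing for skewed group by: Hive runs
                                   -- two MR jobs, the first distributing keys
                                   -- randomly for partial aggregation, the second
                                   -- merging the final result by key
```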