Optimization of Hive

1. Overall architecture optimization

The Hive computing engine supports not only MapReduce but also Tez, Spark, and others. Depending on the engine chosen, different resource schedulers and storage systems can be used.

Overall architecture optimization points:
1. Partition by date according to business requirements, and use dynamic partitioning.
Relevant parameter setting:
hive.exec.dynamic.partition=true (the default since Hive 0.14)
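A minimal sketch of a dynamic-partition insert; the table and column names here are hypothetical:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;  -- allow all partition columns to be dynamic

-- Each distinct value of dt becomes its own partition directory.
INSERT OVERWRITE TABLE orders_partitioned PARTITION (dt)
SELECT order_id, amount, order_date AS dt
FROM orders_raw;
```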

2. Compress data to reduce disk storage space and the number of I/O operations.
Related parameter settings:
Compress job output files with Gzip at BLOCK granularity:

mapreduce.output.fileoutputformat.compress=true

mapreduce.output.fileoutputformat.compress.type=BLOCK

mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec

The map output is also compressed with Gzip:

mapreduce.map.output.compress=true

mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec

Compress Hive's final output and intermediate results:


hive.exec.compress.output=true

hive.exec.compress.intermediate=true
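The settings above can also be applied per-session rather than cluster-wide; a sketch of enabling them from the Hive CLI:

```sql
-- Enable compression for this session only.
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;
SET mapreduce.map.output.compress=true;
SET mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapreduce.output.fileoutputformat.compress=true;
SET mapreduce.output.fileoutputformat.compress.type=BLOCK;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
```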

3. Save Hive intermediate tables as SequenceFile, which saves serialization and deserialization time.

Related parameter settings:

hive.query.result.fileformat=SequenceFile
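A sketch of creating an intermediate/staging table stored as SequenceFile; the table and column names are hypothetical:

```sql
SET hive.query.result.fileformat=SequenceFile;

-- Intermediate results materialized in SequenceFile format.
CREATE TABLE stage_orders STORED AS SEQUENCEFILE
AS SELECT order_id, amount FROM orders_raw WHERE amount > 0;
```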

2. Map-stage optimization

Hive operators in the Map stage:
[figure omitted: Map-stage operators]

The execution process is:
[figure omitted: Map-stage execution flow]

Reducer-count (reduce split) algorithm:

Related parameter settings (defaults):
hive.exec.reducers.max=999
hive.exec.reducers.bytes.per.reducer=1G
reduce task num = min(hive.exec.reducers.max, ceil(input.size / hive.exec.reducers.bytes.per.reducer)); the number of reducers can then be adjusted to actual needs.
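A sketch of tuning the reducer count for a session, using the formula above:

```sql
-- With 10 GB of input and these defaults, Hive plans
-- min(999, ceil(10e9 / 1e9)) = 10 reduce tasks.
SET hive.exec.reducers.max=999;
SET hive.exec.reducers.bytes.per.reducer=1000000000;

-- Or override the computed value outright:
SET mapred.reduce.tasks=20;
```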

3. Job optimization
1. Local execution
Local execution mode is off by default; for small data sets it can speed up queries by skipping cluster scheduling.
Related parameters:
hive.exec.mode.local.auto=true

By default, local execution kicks in only when hive.exec.mode.local.auto.inputbytes.max=128MB (input below 128 MB), hive.exec.mode.local.auto.tasks.max=4 (at most 4 map tasks), and the job needs at most one reduce task.

Performance test:

data volume (×10,000)  operation  normal execution (s)  local execution (s)
170                    group by   36                    16
80                     count      34                    6
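A sketch of enabling automatic local mode with its guard thresholds; the queried table is hypothetical:

```sql
SET hive.exec.mode.local.auto=true;
SET hive.exec.mode.local.auto.inputbytes.max=134217728; -- 128 MB
SET hive.exec.mode.local.auto.tasks.max=4;

-- Small queries within the thresholds now run in-process on the client:
SELECT dt, count(*) FROM small_table GROUP BY dt;
```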

2. Map join
Map join is enabled by default; the size threshold is hive.auto.convert.join.noconditionaltask.size=10MB.
The table loaded into memory must be a purely scanned table (no operations such as group by on it). If both tables in the join meet this condition, the table named in a /*+ MAPJOIN(...) */ hint does not take effect and only the smaller table is loaded into memory; otherwise, the qualifying scanned table is chosen.

4. SQL optimization
The overall optimization strategies are as follows:
Remove columns the query does not need.
Push Where-condition filtering down into the TableScan stage.
Use partition information to read only the qualifying partitions.
Use map-side joins where the conditions are met: the large table drives the join, and the small table is loaded into every mapper's memory.
Adjust the join order to ensure the large table is the driving table.
For tables with skewed data distribution, avoid concentrating data on a few reducers by splitting the job into two map-reduce stages: the first stage shuffles on the Distinct column and partially aggregates on the reduce side to shrink the data; the second stage aggregates on the group-by column.
Use a map-side hash table for partial aggregation to reduce the volume of data the reduce side must process.
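The two-stage skew strategy above can be sketched for a count(distinct) query; the table and column names are hypothetical:

```sql
-- Hive can apply the two-stage plan automatically:
SET hive.groupby.skewindata=true;

-- The equivalent manual rewrite of
--   SELECT dt, count(DISTINCT user_id) FROM logs GROUP BY dt;
-- first deduplicates, then counts:
SELECT dt, count(*) AS uv
FROM (
  SELECT dt, user_id
  FROM logs
  GROUP BY dt, user_id   -- stage 1: shuffle on the distinct column, partial aggregation spreads the skew
) t
GROUP BY dt;             -- stage 2: final aggregation on the group-by column
```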
