1. Optimization of the overall architecture
Hive's execution engine supports not only MapReduce but also Tez, Spark, and others. Depending on the engine, different resource scheduling and storage systems can be used.
Overall architecture optimization points:
1. Partition tables by date according to business requirements, and use dynamic partitioning.
Related parameter settings:
hive.exec.dynamic.partition=true (already the default in Hive 0.14)
2. Compress the data to reduce disk usage and the amount of I/O.
Related parameter settings:
Compress the job output files with Gzip at the BLOCK level:
mapreduce.output.fileoutputformat.compress=true
mapreduce.output.fileoutputformat.compress.type=BLOCK
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec
The map output is also compressed with Gzip:
mapreduce.map.output.compress=true
mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.GzipCodec
Compress Hive's final output and intermediate results:
hive.exec.compress.output=true
hive.exec.compress.intermediate=true
3. Store Hive intermediate tables as SequenceFile, which saves serialization and deserialization time.
Related parameter settings:
hive.query.result.fileformat=SequenceFile
2. Optimization of the Map stage
Hive operators:
The execution process is:
Algorithm for determining the number of reducers:
Related parameter settings, shown with their default values (these are the defaults before Hive 0.14; from 0.14 on they are 1009 and 256 MB):
hive.exec.reducers.max=999
hive.exec.reducers.bytes.per.reducer=1G
number of reduce tasks = min(hive.exec.reducers.max, input size / hive.exec.reducers.bytes.per.reducer)
The number of reducers can be adjusted according to actual needs.
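The reducer formula can be checked with a few lines of Python. This is only a sketch of the arithmetic, not Hive's own code: the function name `estimate_reducers` is hypothetical, 1G is taken to mean 1,000,000,000 bytes, and rounding up with a floor of one reducer is an assumption.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Estimate the reduce task count as min(max_reducers, input / bytes_per_reducer).

    Assumptions of this sketch: the division is rounded up, and at least
    one reducer is always used.
    """
    if input_bytes < 0:
        raise ValueError("input_bytes must not be negative")
    if bytes_per_reducer <= 0 or max_reducers <= 0:
        raise ValueError("bytes_per_reducer and max_reducers must be positive")
    # Integer ceiling division avoids floating-point rounding surprises.
    wanted = (input_bytes + bytes_per_reducer - 1) // bytes_per_reducer
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    # 10 GB of input with the defaults above needs 10 reducers.
    print(estimate_reducers(10_000_000_000))
```

Lowering `bytes_per_reducer` raises the reducer count for the same input, which is the usual way to add parallelism to a slow reduce stage.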
3. Job optimization
1. Local execution
Local execution mode is turned off by default. Enabling it for small data sets speeds up execution.
Related parameters:
hive.exec.mode.local.auto=true
By default, a job runs locally only when all of these conditions hold: hive.exec.mode.local.auto.inputbytes.max=128MB, hive.exec.mode.local.auto.tasks.max=4, and at most one reduce task.
Performance test:
Data volume (×10,000)   Operation   Normal execution time (s)   Local execution time (s)
170                     group by    36                          16
80                      count       34                          6
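The three local-mode conditions can be written as a single check. The sketch below is illustrative and not taken from Hive: `can_run_locally` is a hypothetical name, 128MB is taken as 128 * 1024 * 1024 bytes, and treating the limits as inclusive is an assumption.

```python
def can_run_locally(input_bytes, num_map_tasks, num_reduce_tasks,
                    max_input_bytes=128 * 1024 * 1024, max_tasks=4):
    """Return True when a job satisfies all three local-mode conditions.

    Assumption: each limit is inclusive (a job exactly at the limit qualifies).
    """
    return (
        input_bytes <= max_input_bytes   # hive.exec.mode.local.auto.inputbytes.max
        and num_map_tasks <= max_tasks   # hive.exec.mode.local.auto.tasks.max
        and num_reduce_tasks <= 1        # at most one reduce task
    )


if __name__ == "__main__":
    # A 50 MB job with 2 map tasks and 1 reduce task qualifies.
    print(can_run_locally(50 * 1024 * 1024, 2, 1))
```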
2. Map join
Map join is enabled by default.
Related parameter:
hive.auto.convert.join.noconditionaltask.size=10MB
A table loaded into memory must be a plain scanned table, with no operations such as group by applied to it. If both tables of the join meet this condition, the table named in the /*+ MAPJOIN */ hint is ignored and only the smaller table is loaded into memory. Otherwise, the scanned table that meets the condition is selected.
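The idea behind a map join can be sketched in Python as a hash join: the small table is built into an in-memory lookup, and the large table is streamed past it with no shuffle or reduce step. This illustrates the technique only, not Hive's implementation; `map_join` and the sample rows are hypothetical.

```python
from collections import defaultdict


def map_join(large_rows, small_rows, key):
    """Inner-join two lists of dict rows, hashing the small side in memory."""
    # Build phase: load the whole small table into a hash table,
    # as every mapper would do before reading its input split.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    # Probe phase: stream the large (driving) table and emit matches.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            # On a column name clash, the value from the large table wins.
            joined.append({**match, **row})
    return joined


if __name__ == "__main__":
    orders = [{"user_id": 1, "amount": 10}, {"user_id": 3, "amount": 7}]
    users = [{"user_id": 1, "name": "a"}]
    print(map_join(orders, users, "user_id"))
```

The memory cost is proportional to the small table only, which is why the size threshold applies to the table being loaded rather than to the driving table.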
4. SQL optimization
The overall optimization strategy is as follows:
Remove unnecessary columns from the query.
Apply Where conditions in the TableScan stage so that rows are filtered as early as possible.
Use partition information to read only the partitions that meet the conditions.
Use map-side joins, with the large table as the driving table and the small table loaded into the memory of every mapper.
Adjust the join order so that the large table is the driving table.
For tables with uneven data distribution, split the aggregation into two map-reduce stages so that data does not concentrate on a few reducers. The first stage shuffles on the Distinct column and aggregates partially on the reduce side to shrink the data. The second stage aggregates by the group-by column.
Use hash-based partial aggregation on the map side to reduce the amount of data processed on the reduce side.
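The last two strategies can be simulated in plain Python. The sketch below is an illustration under assumptions, not Hive's execution plan: all function names are hypothetical, reducers are modeled as list slots, and the query is assumed to be a COUNT(DISTINCT value) grouped by key.

```python
from collections import defaultdict


def map_side_partial_sum(rows):
    """Hash-aggregate (key, amount) rows inside one mapper, before the shuffle."""
    partial = defaultdict(int)
    for key, amount in rows:
        partial[key] += amount
    return dict(partial)


def reduce_merge(partials):
    """Merge the per-mapper partial sums on the reduce side."""
    totals = defaultdict(int)
    for partial in partials:
        for key, amount in partial.items():
            totals[key] += amount
    return dict(totals)


def two_stage_count_distinct(rows, num_reducers=4):
    """COUNT(DISTINCT value) GROUP BY key in two simulated map-reduce stages."""
    # Stage 1 shuffle: route by (key, value), so the rows of a hot key
    # spread over several reducers. Each reducer deduplicates its pairs.
    stage1 = [set() for _ in range(num_reducers)]
    for key, value in rows:
        stage1[hash((key, value)) % num_reducers].add((key, value))

    # Stage 1 reduce: every reducer emits a partial distinct count per key.
    partials = []
    for pairs in stage1:
        counts = defaultdict(int)
        for key, _value in pairs:
            counts[key] += 1
        partials.append(dict(counts))

    # Stage 2: shuffle by the group-by key alone and sum the partial counts.
    return reduce_merge(partials)


if __name__ == "__main__":
    rows = [("a", 1), ("a", 1), ("a", 2), ("b", 1), ("a", 3)]
    print(two_stage_count_distinct(rows))
```

Each distinct (key, value) pair lands on exactly one stage-1 reducer, so summing the partial counts in stage 2 gives the exact distinct count per key.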