Hive enterprise-level development optimization in detail

1. Background of the problem

In Hive offline data warehouse development, a well-built data task normally finishes within a reasonable time. When the indicator data of the reporting/application layer keeps arriving late, and on inspection some tasks turn out to have been running for more than 10 hours, something is clearly unreasonable. At that point we should think about how to optimize the data task pipeline, mainly from the following perspectives:

  1. Start from the Hive logic of the data task itself, i.e. HiveQL logic optimization, which leans toward understanding the business perspective

  2. Start from the resource settings of the cluster, i.e. Hive parameter tuning, which leans toward the technical perspective

  3. Start from the task settings of the overall data pipeline, and check whether the scheduling of task execution is unreasonable

  4. Start from the ease of use of data and the reusability of models in the data warehouse, and materialize intermediate model tables for intermediate logic that can be reused.

Attached is a partial screenshot of the mind map from my own notes summarizing these points.

Let's first go through some common Hive optimization strategies, with practical examples included to help understanding~

Hive optimization article outline

  1. Column pruning and partition pruning

  2. Early data convergence

  3. Predicate Push Down (PPD)

  4. Multiple outputs: read the source table once and write multiple result tables

  5. Choosing the right sort

  6. join optimization

  7. Reasonable choice of file storage format and compression method

  8. Solve the problem of too many small files

  9. count(distinct) and group by

  10. Parameter tuning

  11. Solve the problem of data skew

2. Hive optimization

1. Column pruning and partition pruning

Pruning, as the name suggests, means not reading data you don't need.
Column pruning: avoid writing select * from table directly. First, it hurts readability because it is not obvious which columns are actually used; second, selecting extra columns increases IO transfer.
Partition pruning: remember to add partition filter conditions when querying a partitioned table. For example, for a table partitioned by time, always add the partition filter.
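
A minimal sketch of both prunings together (table and column names here, table_a / field_a / field_b / dt, are illustrative, matching the examples later in this article):

-- Avoid: reads every column and scans every partition
select * from table_a;

-- Better: select only the needed columns and filter on the partition field
select field_a, field_b
from table_a
where dt = date_sub(current_date, 1)
;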

2. Early data convergence

Filter conditions that can be applied early should be pushed into the subquery as much as possible, to reduce the amount of data the subquery outputs.

-- Original script
select
     a.field_a, a.field_b, b.field_a, b.field_b
from 
(
    select field_a, field_b
    from table_a
    where dt = date_sub(current_date,1)
) a 
left join 
(
    select field_a, field_b
    from table_b
    where dt = date_sub(current_date,1)
) b 
    on a.field_a = b.field_a
where a.field_b <> ''
and b.field_b <> 'xxx'
;

-- Optimized script (early data convergence)
select
     a.field_a, a.field_b, b.field_a, b.field_b
from 
(
    select field_a, field_b
    from table_a
    where dt = date_sub(current_date,1)
    and field_b <> ''
) a 
left join 
(
    select field_a, field_b
    from table_b
    where dt = date_sub(current_date,1)
    and field_b <> 'xxx'
) b 
    on a.field_a = b.field_a
;

3. Predicate Pushdown

What is predicate pushdown (PPD for short)? It means moving filter expressions as close to the data source as possible without changing the query result, so that irrelevant data can be skipped during actual execution and the filter conditions can run on the map side. This reduces the data output on the map side, achieves early data convergence, reduces the amount of data shuffled across the cluster, saves cluster resources, and improves task performance.
Hive enables predicate pushdown by default; it is controlled by the parameter hive.optimize.ppd=true.
"Pushed down" means the predicate filter runs on the map side; "not pushed down" means it runs on the reduce side.
The rules of predicate pushdown mainly split into pushdown of ON-condition filters in a join and pushdown of WHERE-condition filters. I compiled a diagram to make this easier to understand.

Core rule: in a join, ON-condition filters on the preserved-row table cannot be pushed down; WHERE-condition filters on the null-supplying table cannot be pushed down.

-- Example: in the script below, the filter on table a after ON is not pushed down to the map side and runs on the reduce side, while the filter on table a after WHERE is pushed down and runs on the map side
select
     a.field_a, a.field_b, b.field_a, b.field_b
from table_a a
left join table_b b
on a.field_a <> '' -- filter on table a in the ON clause
where a.field_b <> 'xxx' -- filter on table a in the WHERE clause
;

Notes on predicate pushdown:
If an expression contains a non-deterministic function, the predicate for the entire expression will not be pushed down. For example, in the following script, the whole filter is executed on the reduce side:

select a.*
from a join b 
on a.id = b.id
where a.ds = '2019-10-09' 
and a.create_time = unix_timestamp()
;

Because unix_timestamp() above is a non-deterministic function whose value cannot be known at compile time, the entire expression is not pushed down; that is, ds = '2019-10-09' is not filtered in advance. rand() is another example of such a non-deterministic function.
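
If the deterministic part of the filter still needs to run early, one option, echoing the data-convergence idea from section 2, is to move it into a subquery. A minimal sketch, assuming the same illustrative tables a and b from the example above:

-- Apply the deterministic partition filter in a subquery so it converges the data early,
-- while the non-deterministic part stays in the outer where
select t.*
from (
    select *
    from a
    where ds = '2019-10-09'             -- deterministic filter, applied before the join
) t
join b
    on t.id = b.id
where t.create_time = unix_timestamp()  -- non-deterministic part evaluated later
;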

Attached are two detailed case analyses of predicate pushdown; follow the links below:

① https://cloud.tencent.com/developer/article/1616687

② https://cloud.tencent.com/developer/article/1616689

4. Multiple outputs

When we have a "query once, insert many times" scenario, we can use the multi-insert syntax to reduce the number of times the source table is read and improve performance.

-- Read the source table once and write to multiple target tables at the same time
from table_source
insert overwrite table table_a
select *
where dt = date_sub(current_date,1)
and event_name = 'event_A'
insert overwrite table table_b
select *
where dt = date_sub(current_date,1)
and event_name = 'event_B'
insert overwrite table table_c
select *
where dt = date_sub(current_date,1)
and event_name = 'event_C'
;

Notes on multiple outputs:

  • Normally a single SQL statement supports at most 128 output branches; exceeding 128 raises an error.

  • When a multi-insert writes to different partitions of the same partitioned table, a single SQL statement must not mix insert overwrite and insert into; use one of them consistently.

5. Choosing the right sort

  • order by
    Global sort that runs through a single reducer. When the table is large it can easily fail to finish and performance is poor, so use it with caution. In strict mode a limit clause is required.

  • sort by
    Local sort: guarantees that the output of each single reducer is ordered, but provides no global ordering.

  • distribute by
    Divides the data into different reducers by the specified field; it controls how data is routed from the map side to the reduce side. Hive hash-partitions rows by the distribute by fields and the number of reducers.

  • cluster by
    Has the ability of distribute by and also the ability of sort by, so cluster by can be understood as distribute by + sort by.

The following sorting-optimization example takes the 100 youngest users from a user information table with 1 billion rows. The implementation also reflects a core big-data idea: divide and conquer, splitting a big job into smaller jobs.

 
 
-- Original script
select *
from tmp.user_info_table
where dt = '2022-07-04'
order by age -- global sort, single reducer
limit 100
;

-- Optimized script
set mapred.reduce.tasks=50; -- set the number of reducers to 50
select *
from tmp.user_info_table
where dt = '2022-07-04'
distribute by (case when age<20 then 0
        when age >=20 and age <= 40 then 1
        else 2
    end
) -- distribute by controls how map output is divided among reducers and prevents rows from being assigned at random; the case when bucketing is used because the scattered raw age values would distribute unevenly, and launching too many reducers wastes time and resources
sort by age -- multiple reducers sort in parallel, each reducer's output is ordered
limit 100 -- take the first 100; since the data is locally sorted by age, these are guaranteed to be the youngest 100
;

Summary of sorting choices:

  • order by performs a global sort but runs through a single reducer; with large data volumes it can easily fail to finish, so use it with caution.

  • sort by performs a local sort: each reducer's output is ordered, and map output is distributed randomly to the reducers. To get a global ordering while still using multiple reducers, nest one more layer, e.g. select * from (select * from table_name sort by col_name limit N) t order by col_name limit N. This produces 2 jobs: the inner local sort and the outer merge into a global sort.

  • distribute by hash-distributes the data to the corresponding reducers by the specified field.

  • When the distribute field and the sort field are the same, cluster by can be used as shorthand for distribute by + sort by (see the small example below), but cluster by only sorts in ascending order and does not accept ASC or DESC.
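
As a small illustration of the last point, the two statements below are equivalent when the distribute field and the sort field are the same (reusing the illustrative user table from the example above):

-- distribute by + sort by on the same field
select * from tmp.user_info_table where dt = '2022-07-04' distribute by age sort by age;
-- shorthand with cluster by (ascending order only)
select * from tmp.user_info_table where dt = '2022-07-04' cluster by age;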

6. Join optimization

A join that Hive completes in the reduce phase is a common (shuffle) join; a join completed in the map phase is a map join.

  • Converge the data volume in advance, so that useless data does not participate in the association
    This echoes the earlier data-convergence and predicate-pushdown sections: converge the data volume as early as possible. This applies not only to joins, but equally before any other complex computation.

  • Use scenario for left semi join
    left semi join originally appeared as an efficient way to work around Hive's lack of support for in/exists subqueries. Although it contains the word "left", it does not keep all rows of the left table. Its effect is similar to a join, but the final result only contains columns from the left table, and in some scenarios the result differs from that of a join.

select a.*
from 
(
select 1 as id,'a' as name 
union all 
select 2 as id,'b' as name 
) a 
left semi join 
( 
select 1 as id,'b' as name 
union all 
select 1 as id,'c' as name 
) b 
    on a.id = b.id
    
-- Can you guess the result of the left semi join?
id  name
1   a
-- And if the script above used a plain join instead, what would the result be?
id  name
1   a
1   a

Notes for left semi join:

  • Filter conditions on the right table can only be written after on, not after where

  • The final result can only contain columns from the left table; columns from the right table cannot be selected

  • The main difference from a plain join shows up when the right table has duplicate keys: left semi join stops after it finds one matching row in the right table and keeps a single row, while a plain join keeps iterating through every matching row in the right table. So pay attention to whether the actual data contains duplicates and whether they should be kept.

  • Large table join small table scenario
    When joining a large table with a small table, put the small table on the left and the large table on the right. The join happens in the reduce phase, and before Hive 2.x the left-hand table was loaded into memory, so putting a large table on the left risks an out-of-memory error. From Hive 2.x onwards this has been optimized and no longer needs attention; the engine handles it for us.

  • Enable map join where appropriate
    A map join distributes the (small) joined table directly into the memory of the map tasks, so the join is executed on the map side and no reduce-phase join is needed, which improves execution efficiency. If one table is relatively small, it is best to enable map join; Hive enables automatic map join by default.

set hive.auto.convert.join = true;
-- threshold separating large and small tables (by default, below 25M is considered a small table)
set hive.mapjoin.smalltable.filesize=26214400;
  • Large table join large table scenario
    For example, suppose table a contains many rows with null keys, while table b contains no null keys.

-- Original HQL without optimization
select  a.id 
from a left join b
on a.id = b.id

1. Null-key filtering: filter out rows whose key is null
During the join, rows with the same key are sent to the same reducer. If there are too many null keys, that reducer runs out of memory and the join times out. So if rows with a null key are not needed, filter out these abnormal rows first.

-- HQL with null-key filtering: handle the null keys in a subquery first, then join
select a.id 
from (select * from a where id is not null) a
join b
on a.id = b.id

2. Null-key conversion: scatter the keys when null-key rows must be kept in the join
Sometimes rows with a null key are not abnormal and still need to be kept, but too many null keys landing on a single reducer cause data skew even if memory does not overflow, and data skew is very bad for cluster resource utilization. We can replace the null key with a random value, making sure the substitutes are not all identical, which reduces the probability of skew. Although processing the join key like this adds some overall execution time, it lightens the load on the hot reducer.

-- HQL with null-key conversion: use case when plus a random number
select a.id 
from a left join b
on case when a.id is null then concat('hive', rand()) else a.id end = b.id
  • Avoid Cartesian products
    Try to avoid Cartesian products, i.e. avoid joins with no on condition or with an invalid on condition, because Hive can only use one reducer to complete a Cartesian product. Strict mode helps here: a query that produces a Cartesian product raises an error in strict mode. A small sketch follows below.
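
A minimal sketch of the difference, using the illustrative table_a and table_b from earlier examples:

-- Cartesian product: no on condition, every row of a is paired with every row of b, and one reducer does all the work
select a.field_a, b.field_b
from table_a a
join table_b b
;

-- Fix: join on a real key so the work can be distributed
select a.field_a, b.field_b
from table_a a
join table_b b
    on a.field_a = b.field_a
;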

7. Reasonable selection of file storage format and compression method

I wrote a separate article introducing Hive's common storage formats and compression methods. For details, see that earlier article:
Link: https://mp.weixin.qq.com/s/RndQKF5y9Mto7QfgiiAOvQ

8. Solve the problem of too many small files

  • First, what is a small file and how does it arise?
    As the name suggests, a small file is simply a file with very little data. Small files are produced when data is loaded into a Hive table, for example:

-- Import method ①
insert into table A values();  -- every such statement produces one file in the Hive table, but this import method is rare in production
-- Import method ②
load data local inpath 'local file / local directory path' overwrite into table A;  -- importing files/directories: the Hive table ends up with as many files as were imported
-- Import method ③
insert overwrite table A select * from B;  -- importing data via a query is the most common method in production

In MR, as many reducers as there are, that many files are output: number of files = number of reducers * number of partitions. If a simple job has no reduce phase and only a map phase, then number of files = number of maps * number of partitions. For example, a job with 10 reducers writing 100 dynamic partitions can produce up to 1,000 files. So the number of reducers (or maps) and the number of partitions ultimately determine the number of output files, and you can control the file count of a Hive table by adjusting them.

  • What is the impact of too many small files?
    First, from the HDFS side: too many small files burden the NameNode, since its metadata takes up a large amount of memory, which hurts HDFS performance.
    Second, from the Hive side: when querying, each small file is treated as a block and starts its own map task, and the time to start and initialize a map task is far longer than its actual processing time, which wastes a lot of resources.

  • How to fix too many small files

1. Use Hive's built-in concatenate command to merge small files
Note that the concatenate command only works for tables stored as ORC or RCFile, and it does not support specifying the number of files after merging.

-- For a non-partitioned table
alter table test_table concatenate;
-- For a partitioned table
alter table test_table partition(dt = '2022-07-16') concatenate;

2. Adjust parameters to reduce the number of maps

  • Merge small files on the map input side

-- 102400000 bytes ≈ 100M

-- Maximum input size per map (this value determines the number of files after merging)
set mapred.max.split.size=102400000;
-- Minimum split size on a single node (this value determines whether files on multiple DataNodes need to be merged)
set mapred.min.split.size.per.node=102400000;
-- Minimum split size under a single rack/switch (this value determines whether files across racks need to be merged)
set mapred.min.split.size.per.rack=102400000;

-- The three settings above determine the size of merged splits: files >128M are split at 128M, files between 100M and 128M are split at 100M, and the remaining files <100M are merged directly
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;    -- merge small files before map execution
  • Merge small files on the map output and reduce output side

-- Merge the map-side output, default true
set hive.merge.mapfiles = true;
-- Merge the reduce-side output, default false
set hive.merge.mapredfiles = true;
-- Target size of the merged files
set hive.merge.size.per.task = 256*1000*1000;
-- When the average size of output files is below this value, start a separate MapReduce job to merge the files
set hive.merge.smallfiles.avgsize=16000000; 

3. Adjust parameters to reduce the number of reducers

-- Hive's distribute by clause controls exactly how MR partitions the data; set the number of reducers, then use the distribute function to spread the data evenly across them.

-- Set the number of reducers directly
set mapreduce.job.reduces=10;

-- Run the following statement to spread the data evenly across the reducers
set mapreduce.job.reduces=10;
insert overwrite table A partition(dt)
select * from B
distribute by rand();
Explanation: with the reducer count set to 10, rand() generates a random number x, and x % 10 decides the reducer, so rows enter the reducers at random and no output file ends up much larger or smaller than the others.

9. count(distinct) and group by

When computing deduplicated metrics, such as the number of users in each age group, people usually write count(distinct user_id) directly. When the table is small the impact is not noticeable, but with large data volumes count(distinct) costs a lot of performance, because the deduplication is handled by a single reduce task, which easily skews the reduce side. The usual optimization is to deduplicate with a group by in an inner query and then count in the outer query instead, as sketched below.
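
A minimal sketch of the rewrite, assuming an illustrative user table with user_id, age and a dt partition:

-- Original: the deduplication inside count(distinct) is handled by a single reducer
select age, count(distinct user_id) as user_cnt
from tmp.user_info_table
where dt = '2022-07-04'
group by age
;

-- Rewrite: deduplicate with an inner group by, then count in the outer query (2 jobs)
select age, count(user_id) as user_cnt
from (
    select age, user_id
    from tmp.user_info_table
    where dt = '2022-07-04'
    group by age, user_id
) t
group by age
;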

Notes:
Whether the inner group by + outer count rewrite always beats a direct count(distinct user_id) depends on the situation. If the table is not particularly large, the group by rewrite may actually perform worse in some cases. So analyze the specific business scenario; optimization is never local, it is global.

  • Hive 3.x added an optimization for count(distinct): with the hive.optimize.countdistinct setting, even when the data is skewed Hive can optimize it automatically by changing the execution logic of the SQL.

  • The inner group by + outer count approach produces 2 jobs, which consumes more disk and network I/O resources.

10. Parameter tuning

  • When querying a partitioned table, the where clause must include a partition filter

  • When using order by for a global sort, add limit to cap the number of rows returned

  • Restrict Cartesian-product queries

  • set hive.optimize.countdistinct=true; enables automatic optimization of count(distinct)

  • set hive.auto.convert.join = true; enables automatic map join
    set hive.mapjoin.smalltable.filesize=26214400; sets the threshold separating large and small tables (by default, below 25M is considered a small table)

  • set hive.exec.parallel=true; allows the stages of one SQL statement to execute in parallel
    set hive.exec.parallel.thread.number=16; sets the maximum degree of parallelism allowed for one SQL statement; the default is 8. By default, Hive executes only one stage at a time. With parallel execution enabled, stages of a SQL statement that do not depend on each other run in parallel, which can shorten the overall job execution time and improve cluster resource utilization. Of course this only helps when the system has idle resources; without spare resources there is nothing to run in parallel. A small sketch follows after this list.

  • set hive.map.aggr=true; enables partial aggregation on the map side; the default is true
    set hive.groupby.skewindata = true; enables load balancing when group by data is skewed; the default is false. The generated query plan contains two MapReduce jobs. In the first MR job, map output is distributed randomly to the reducers, and each reducer does a partial aggregation and outputs its result; because rows with the same group by key may end up on different reducers, the load is balanced. The second MR job then distributes the pre-aggregated results to the reducers by group by key (this guarantees that identical group by keys reach the same reducer) and completes the final aggregation.

  • set hive.mapred.mode=strict; enables strict mode; the default is nonstrict. In strict mode the three kinds of unreasonable queries listed at the top of this section raise an error: querying a partitioned table without a partition filter, order by without limit, and Cartesian products.

  • set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; merges small files on the map side before execution

  • set hive.exec.compress.output=true; controls whether Hive's final query output is compressed
    set mapreduce.output.fileoutputformat.compress=true; controls whether the output of the MapReduce job is compressed

  • set hive.cbo.enable=false; turns off CBO optimization (the default is true). CBO can automatically optimize the order of multiple joins in HQL and choose an appropriate join algorithm.
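
A minimal sketch of parallel execution paying off, reusing the illustrative table_a and table_b from earlier: the two sides of the union all do not depend on each other, so with parallel execution enabled their stages can run at the same time.

set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;

-- the two count stages are independent and can run in parallel
select 'a' as source, count(1) as cnt
from table_a
where dt = date_sub(current_date,1)
union all
select 'b' as source, count(1) as cnt
from table_b
where dt = date_sub(current_date,1)
;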

11. Solve the problem of data skew

  • What is data skew
    Data skew is when a large number of rows with the same key are partitioned to the same reducer, producing a "one worker overloaded while the rest sit idle" situation. This defeats the purpose of parallel computing: all other nodes finish and then wait for the single busy node, dragging down overall efficiency.

  • Obvious symptoms of data skew
    Task progress stays at 99% for a long time; checking the task monitoring page shows that only a small number (one or a few) of reduce subtasks have not finished, because the amount of data they handle is far larger than that of the other reducers.

  • What is the root cause of data skew?
    Keys are unevenly distributed, so the data handled by the reducers is uneven.

  • How to avoid data skew as much as possible
    Distributing data evenly across the reducers is the fundamental way to avoid skew. Two typical cases: the data skew of join operations and its solutions are covered in point 6 above (join optimization, the "large table join large table" scenario), and the other is to set the numbers of maps and reducers reasonably, as follows.

  • Reasonably set the number of maps and reduce

1. Map-side optimization
Normally a job produces one or more map tasks based on the input directory. The number of maps mainly depends on the total number of input files, their total size, and the file block size configured for the cluster.
Since Hadoop 2.7.3 the default HDFS block size is 128M. Each Hive table corresponds to files on HDFS. When a task runs, every 128M of a file is one block, and each block is handled by one map task. Files over 128M are split into blocks; files under 128M form a block on their own.
So: ① What if there are too many small files?
The number of map tasks grows, and the time to start and initialize a map task is far longer than its actual processing time, which wastes cluster resources.
② Is everything fine as long as every file is close to 128M?
No. Suppose a file is 127M but the table has only one or two fields and the file holds tens of millions of rows. If the processing logic is complex, a single map task handling it is still slow.
③ Are more maps always better?
That statement is one-sided. More maps increase parallelism, but starting and initializing a map takes far longer than its processing logic; the more maps you start and initialize, the more cluster resources you waste.

How to reduce the number of maps and lower resource waste?
The following settings effectively merge small files into larger ones (many into one).

-- 102400000 bytes ≈ 100M

-- Maximum input size per map (this value determines the number of files after merging)
set mapred.max.split.size=102400000;
-- Minimum split size on a single node (this value determines whether files on multiple DataNodes need to be merged)
set mapred.min.split.size.per.node=102400000;
-- Minimum split size under a single rack/switch (this value determines whether files across racks need to be merged)
set mapred.min.split.size.per.rack=102400000;

-- The three settings above determine the size of merged splits: files >128M are split at 128M, files between 100M and 128M are split at 100M, and the remaining files <100M are merged directly
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;    -- merge small files before map execution

Sometimes Hive optimization does not noticeably improve execution time, but it greatly improves computing-resource usage.

How to increase the number of maps, so each map handles less data and the task runs faster?
The following is effectively splitting large files into smaller ones (one into many).

According to the MapReduce split formula computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))), lowering maxSize below the block size makes the splits smaller and thus increases the number of maps.

mapreduce.input.fileinputformat.split.minsize (minimum split size): default 1. Setting it larger than blockSize makes the splits larger than the block size and reduces the number of maps.
mapreduce.input.fileinputformat.split.maxsize (maximum split size): default equals the block size. Setting it smaller than blockSize makes the splits smaller and increases the number of maps.
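
A minimal sketch of increasing the map count this way (the value is illustrative: 64M expressed in bytes, below the 128M block size):

-- lower the maximum split size below the block size so each map handles a smaller slice
set mapreduce.input.fileinputformat.split.maxsize=67108864;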

2. Reduce-side optimization
If the number of reducers is set too high, many small files are produced, which burdens the NameNode, and those small output files may become the input of the next task, again creating a small-file problem. If it is set too low, each reducer handles too much data and may hit an OOM exception.
If not specified, Hive computes the number of reducers from hive.exec.reducers.bytes.per.reducer (the amount of data each reduce task handles, default 1G) and hive.exec.reducers.max (the maximum number of reducers per task, default 1009), as min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer). So the reducer count can be adjusted via these two parameters, but the simplest way is to set it directly with the parameters below.

-- Manually specify the number of reducers
set mapred.reduce.tasks=50;
-- Set the number of reducers for each job
set mapreduce.job.reduces=50;

So: ① Are more reducers always better?
No. Like maps, starting and initializing reducers takes time and resources, and too many reducers produce many output files, which again creates a small-file problem.
② In what situations does only a single reducer run even though the reducer count has been set?

  • The input data volume itself is less than 1G

  • An aggregation is computed without group by, for example select count(1) from test_table where dt = 20201228;

  • order by is used

  • The query contains a Cartesian product

A summary of reasonably setting the numbers of maps and reducers:

  • set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; // the system default input format; merges small files before map execution and reduces the number of maps

  • set mapreduce.input.fileinputformat.split.maxsize = 100; // adjust the maximum split size so that maxSize is below the block size, which increases the number of maps

  • According to the MapReduce split formula computeSplitSize(Math.max(minSize, Math.min(maxSize, blockSize))): lowering maxSize below the block size makes the splits smaller, which increases the number of maps.

3. Summary

  1. In daily Hive development, build the habit of converging data as early as possible, so useless data never participates in the computation

  2. Do not over-optimize; it may be useless or even counterproductive, and the effort spent tuning may not be proportional to the return

  3. Extract public, reusable logic into temporary or intermediate tables to improve reusability. Emphasize reuse!

  4. Understand how HiveQL executes under the hood, and optimize according to those rules

  5. Understanding the requirements is the premise of code optimization; pay attention to the global data pipeline, and know the common Hive optimization strategies

  6. Be careful with parameter tuning, for example grabbing all available memory: avoid letting the tuning of your own task affect the resource allocation of other tasks on the cluster. Optimization is only real optimization when it is global!
