Big data interview questions: Solutions to data skew

        In daily work, data skew mainly occurs in the Reduce stage and rarely in the Map stage. Map-side skew would have to come from uneven HDFS data storage, but HDFS splits data evenly into blocks of an essentially fixed size, so it is uncommon; Reduce-side skew is almost always caused by a small number of keys carrying a disproportionately large share of the data.

Solutions:

1: For GROUP BY skew, enable skew-aware aggregation:

set hive.groupby.skewindata=true;

        If a task is stuck at 99% for a long time, data skew has almost certainly occurred, and adjusting this parameter is recommended to achieve load balancing. Principle: the generated query plan contains two MR jobs. In the first MR job, the Map output is distributed randomly to the Reducers, and each Reducer performs a partial aggregation and emits its result; since rows with the same GROUP BY key may land on different Reducers, the load is balanced. The second MR job then distributes the pre-aggregated results to the Reducers by the GROUP BY key (this time guaranteeing that rows with the same key reach the same Reducer) and completes the final aggregation.
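A minimal usage sketch, assuming a hypothetical table user_events in which a handful of user_id values hold most of the rows:

set hive.groupby.skewindata=true;

SELECT user_id, COUNT(*) AS event_cnt -- partial aggregation in job 1, final aggregation in job 2
FROM user_events
GROUP BY user_id;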

2: When joining a small table to a large table, use a map join:

set hive.auto.convert.join=true; -- automatically convert eligible joins to MAPJOIN; default is true
set hive.mapjoin.smalltable.filesize=25000000; -- tables smaller than this size in bytes are loaded into memory; default is 25000000 (25MB)
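A sketch of the resulting join, assuming hypothetical tables dim_city (small) and fact_orders (large). With auto-conversion enabled, Hive rewrites this as a map join on its own; the MAPJOIN hint merely shows the explicit form (recent Hive versions ignore the hint by default):

SELECT /*+ MAPJOIN(c) */ o.order_id, c.city_name -- small table c is broadcast to every mapper, no Reduce stage
FROM fact_orders o
JOIN dim_city c ON o.city_id = c.city_id;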

3: In JOIN operations, make sure the join keys do not contain large numbers of duplicate or NULL values; all NULL keys hash to the same Reducer, concentrating the load there. A common workaround is sketched below.
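A hedged sketch of the usual NULL-key workaround, assuming hypothetical tables logs and users with a string user_id column: rewrite NULL keys as distinct random values that can never match, so they scatter across Reducers instead of piling onto one:

SELECT a.*, b.city_name
FROM logs a
LEFT JOIN users b
  ON (CASE WHEN a.user_id IS NULL
           THEN CONCAT('null_', CAST(RAND() AS STRING)) -- never matches, but spreads the NULL rows
           ELSE a.user_id END) = b.user_id;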

4: Use COUNT(DISTINCT id) deduplication statistics with caution, and try to replace them with other methods, such as the GROUP BY rewrite below.
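A common replacement, sketched against a hypothetical table big_table: deduplicate with GROUP BY in a subquery first, so the work is spread across many Reducers instead of being funneled through the single Reducer that a global COUNT(DISTINCT ...) requires:

-- Instead of: SELECT COUNT(DISTINCT id) FROM big_table;
SELECT COUNT(*)
FROM (
    SELECT id FROM big_table GROUP BY id
) t;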
