Hive --- The causes of data skew and how to solve it

1. Definition of Data Skew

Data skew occurs when data is processed in parallel but distributed unevenly across partitions, so that the data in one partition is significantly larger than in the others and a disproportionate amount of it lands on one or a few computing nodes. Those nodes process their share far more slowly than the average node, become the bottleneck of the whole job, and drag down overall computing performance.

2. Several solutions to data skew

1. Data skew caused by null values

First determine whether the keys causing the skew can simply be filtered out in advance. For an inner join, Hive filters out rows with null join keys by default; for a left join and other outer joins, however, the rows from the preserved side are retained. Filtering out null keys is therefore appropriate when: 1. the join is not an inner join; and 2. the rows with null keys are not needed in the result.

Two ways to handle null keys:

(1) Filter out the null keys before the join:

insert overwrite table jointable
select n.*
from (select * from nullidtable where id is not null) n
left join bigtable o on n.id = o.id;

(2) Sometimes a key is null yet has a large amount of associated data, and that data is not abnormal and must be included in the join result. In that case, we can replace the null key in the table containing it with a random value, so that the rows are distributed randomly and evenly across different reducers, as in the following query:

insert overwrite table jointable
select n.*
from nullidtable n
full join bigtable o on nvl(n.id, rand()) = o.id;

-- nvl(a, b): if a is null, use the value of b in its place.

2. Joining a large table with a small table: use MapJoin

First, Task A, a MapReduce Local Task generated on the local client, reads the small table's data from HDFS into an in-memory hash table. When the read finishes, it serializes the in-memory hash table to a file on local disk and compresses that hash-table file into a tar file.
Next comes Task B, a map-only MapReduce task with no reduce stage. When it starts, the tar file from the previous step is placed in the Hadoop distributed cache, which copies the tar file to the local disk of each mapper and decompresses it. Each mapper can then deserialize the hash-table file back into memory and perform the join as before: every record of the large table is looked up against the small table's hash table from the DistributedCache, and matching results are output directly.

Benefits: there is no shuffle stage, which saves a large amount of network transfer, and there is no reduce stage, which prevents data skew.
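In practice, Hive can convert such joins into map joins automatically. As a minimal sketch of the relevant session settings, reusing the bigtable name from the examples above (smalltable here is a placeholder name, not from the original):

-- Let Hive rewrite eligible joins as map joins automatically
set hive.auto.convert.join = true;
-- Tables below this size in bytes are treated as the small side (25 MB is the usual default)
set hive.mapjoin.smalltable.filesize = 25000000;

-- Alternatively, force a map join explicitly with a hint on the small table's alias
select /*+ MAPJOIN(s) */ b.id, s.name
from bigtable b
join smalltable s on b.id = s.id;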

3. Data skew caused by group by: use two-stage aggregation

Principle: two-stage aggregation means aggregating locally first and then globally. During the local stage, a random prefix is attached to each key, so originally identical keys become different new keys, and the data that would have been processed by a single task is spread across multiple tasks, each aggregating its own share. This relieves the single task that would otherwise handle too much data. The random prefix is then stripped off, and a global aggregation over the partial results produces the final answer.
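As a hedged illustration, the two stages might look like this in HiveQL for a count per key; skewed_table and key are placeholder names, and salting into 10 buckets is an arbitrary choice:

-- Stage 1: attach a random salt to each key and aggregate locally per (key, salt);
-- Stage 2: drop the salt and sum the partial counts globally
select real_key, sum(partial_cnt) as total_cnt
from (
    select key as real_key, salt, count(1) as partial_cnt
    from (select key, floor(rand() * 10) as salt from skewed_table) s
    group by key, salt
) agg
group by real_key;

Hive can also do this automatically: setting hive.groupby.skewindata=true makes the planner generate two MapReduce jobs, the first distributing keys randomly for partial aggregation and the second producing the final grouping.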

4. Consider whether the number of partitions is too small and increase it appropriately (Spark SQL defaults to 200 shuffle partitions, which can be raised)
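A minimal sketch of raising the parallelism, assuming Spark SQL on one side and Hive on MapReduce on the other; the concrete values here are illustrative, not recommendations:

-- Spark SQL: raise the number of shuffle partitions (default 200)
set spark.sql.shuffle.partitions = 400;

-- Hive on MapReduce: set the number of reduce tasks directly,
-- or lower the bytes handled per reducer so more reducers are launched
set mapreduce.job.reduces = 400;
set hive.exec.reducers.bytes.per.reducer = 134217728;  -- 128 MB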
