Do you understand data divergence and data skew?


question

In day-to-day data development work, data divergence and data skew are both fairly common problems. How do you detect them, and how do you avoid them? Note: interviewers often ask candidates this question.

answer

In response to the question above, several experienced practitioners gave the following answers:

data divergence

Destiny: To diagnose data divergence, check whether the right-hand table of the left join contains duplicate rows for the join key. If it does, the join becomes one-to-many and the result may diverge.
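The check Destiny describes can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the table contents, the `user_id` join key, and the helper name `duplicated_join_keys` are all hypothetical and do not come from the original answers.

```python
from collections import Counter

def duplicated_join_keys(rows, key):
    """Return {key_value: count} for join-key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical right-hand (dimension) table: user_id 2 appears twice.
right_table = [
    {"user_id": 1, "city": "Beijing"},
    {"user_id": 2, "city": "Shanghai"},
    {"user_id": 2, "city": "Shenzhen"},
]

dupes = duplicated_join_keys(right_table, "user_id")
print(dupes)  # {2: 2}
```

An empty result means the join key is unique in the right table, so a left join on it cannot multiply rows from the left table.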

Zhiyuan went straight to the root cause: data divergence is caused by non-unique values in the join field.

Mr. Nic shared a real case of data divergence he ran into: in the data reported by one tracking event, a single user's data turned out to be duplicated about 2000 times. The processing job at the time had to explode three columns of that data and then combine them with a full join, which ended up generating 2000*2000*2000 records.
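To see why the row count multiplies like this, here is a small-scale sketch with 3 duplicates instead of 2000. The `join` helper and the column names are hypothetical, and a naive inner join stands in for the full join in the original case; it is not the actual job Mr. Nic describes.

```python
def join(left, right, key):
    """Naive inner join on `key`; every matching pair yields one output row."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

n = 3  # stand-in for the ~2000 duplicates in the original case

# Three exploded columns, each with n rows for the same user id.
col_a = [{"uid": "u1", "a": i} for i in range(n)]
col_b = [{"uid": "u1", "b": i} for i in range(n)]
col_c = [{"uid": "u1", "c": i} for i in range(n)]

joined = join(join(col_a, col_b, "uid"), col_c, "uid")
print(len(joined))  # 27, i.e. n ** 3
print(2000 ** 3)    # 8000000000
```

With 2000 duplicates per column, the same pattern produces eight billion rows for a single user.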

Pursue selflessness: Data divergence can occur when a fact table is joined with a dimension table, or a master table with its detail table. The usual cause is that one row in the fact table matches multiple rows in the dimension table, so the row is repeated in the result.

Don’t touch my cheese: My understanding of data divergence is that the join produces a Cartesian product: because the join condition is not unique, rows are duplicated. To deal with it, first decide whether the data is actually supposed to diverge, that is, what result set you need. Second, before joining, check whether the join condition is unique; if it is not, aggregate or deduplicate first. You should also configure an audit that checks whether the row counts of the tables before and after the join are consistent, and whether the sums of numeric fields such as amounts are consistent.
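The deduplicate-then-audit routine described above might look like the following minimal sketch. The tables, field names, and helpers (`dedupe_by_key`, `left_join`, `audit`) are hypothetical, and keeping the first row per key is only one possible deduplication rule; the right rule depends on the business meaning of the data.

```python
def dedupe_by_key(rows, key):
    """Keep the first row seen for each key value (an assumed rule)."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())

def left_join(left, right, key):
    """Left join: one output row per matching pair; unmatched left rows are kept."""
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append(dict(l))
    return out

def audit(before, after, amount_field):
    """True when both the row count and the amount total are unchanged."""
    same_rows = len(before) == len(after)
    same_total = (sum(r[amount_field] for r in before)
                  == sum(r[amount_field] for r in after))
    return same_rows and same_total

orders = [
    {"order_id": 1, "user_id": 1, "amount": 100},
    {"order_id": 2, "user_id": 2, "amount": 50},
]
users = [
    {"user_id": 1, "city": "Beijing"},
    {"user_id": 2, "city": "Shanghai"},
    {"user_id": 2, "city": "Shenzhen"},  # duplicate join key
]

diverged = left_join(orders, users, "user_id")
clean = left_join(orders, dedupe_by_key(users, "user_id"), "user_id")

print(audit(orders, diverged, "amount"))  # False: 3 rows, total 200
print(audit(orders, clean, "amount"))     # True: 2 rows, total 150
```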

data skew

On the problem of data skew, the author has already published several articles with solutions. For details, see:

The only way to develop data - data skew

Hive Topic - Data Skew Positioning

Spark data skew solutions

Is your data skewed?


Here is how two practitioners understand data skew:

The heartbeat of delusional recovery: Data skew is essentially a hotspot problem. In a distributed job, one or a few nodes end up processing most of the data set, so a single machine becomes the performance bottleneck. A common approach is to locate the skewed keys first and then choose a remedy according to how many skewed keys there are, for example adding a prefix to the key and removing the prefix again in the final aggregation.
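The prefix technique mentioned above is often called key salting. Below is a minimal single-process sketch of the idea, assuming a simple sum aggregation; the helper names, the 0.5 threshold, and the bucket count are hypothetical choices, and in a real Hive or Spark job each salted key would be handled by a different reducer or task.

```python
import random
from collections import Counter, defaultdict

def find_hot_keys(keys, threshold):
    """Return keys whose share of all records exceeds `threshold`."""
    counts = Counter(keys)
    return {k for k, n in counts.items() if n / len(keys) > threshold}

def salted_sum(records, hot_keys, buckets=4, seed=0):
    """Two-stage sum: aggregate on (salt, key), then strip the salt and combine."""
    rng = random.Random(seed)
    # Stage 1: hot keys are spread over `buckets` salted keys.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(buckets) if key in hot_keys else 0
        partial[(salt, key)] += value
    # Stage 2: remove the salt and merge the partial sums.
    final = defaultdict(int)
    for (_salt, key), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)

records = [("hot", 1)] * 97 + [("a", 1), ("b", 1), ("c", 1)]
hot = find_hot_keys([k for k, _ in records], threshold=0.5)
print(hot)                       # {'hot'}
print(salted_sum(records, hot))  # {'hot': 97, 'a': 1, 'b': 1, 'c': 1}
```

The final totals are the same as an unsalted aggregation; the only difference is that the work for the hot key is split into several smaller partial aggregations first.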

Don't touch my cheese: The most obvious sign of data skew is a reduce stage that stalls at 99%, and it typically shows up when a large table is joined with a small table. The fix is to filter out unnecessary data before the join, so that the large table is cut down in size as much as possible.


Origin blog.csdn.net/qq_28680977/article/details/125035232