Data skew in big data

What is data skew

When we use Hive to query data, sometimes even a simple join statement runs for a very long time. We often assume the cause is insufficient cluster resources, but in most cases it is data skew.

There are generally two cases of data skew:

Few distinct values: a single value accounts for a large proportion of the data; common in fields such as gender, education level, and age.

Many distinct values: each individual value accounts for an extremely small proportion of the data; common in fields such as revenue and order amount.

Data skew is very common in the MapReduce programming model: a large number of identical keys are assigned to the same partition by the partitioner, producing a situation where one reducer is exhausted while the others sit idle. This defeats the purpose of parallel computing, and efficiency becomes very low.

Reasons for data skew

When the task progress stays at 99% (or 100%) for a long time, the task monitoring page shows that only a small number (one or a few) of reduce subtasks remain unfinished, because the amount of data they process differs greatly from the other reducers. This is the most direct symptom of data skew.

The reasons for this can be roughly divided into the following points:

1) Uneven key distribution

2) Characteristics of the business data itself

3) Poor design decisions when the table was created

4) Some SQL statements inherently cause data skew

It typically shows up in common operations such as join, group by, and count(distinct), which are discussed below.

Features of Hadoop computing framework

Before we look at how to avoid data skew, let's review the characteristics of the Hadoop computing framework:

A large data volume is not a big problem; data skew is;

Queries that generate a large number of MapReduce jobs are relatively inefficient. For example, even if the tables involved hold only millions of rows, performing many joins can generate a dozen jobs, which takes a long time, because MapReduce job initialization is relatively slow;

Aggregate functions (UDAFs, User Defined Aggregate Functions) such as sum, count, max, and min are not afraid of data skew: Hadoop performs partial aggregation on the map side, so skew is not a problem for them;

count(distinct) is inefficient on large data volumes, and multiple count(distinct)s are worse still, because count(distinct) groups by the GROUP BY field and deduplicates by sorting on the DISTINCT field; this distribution is often heavily skewed. For example, to compute male UV and female UV over Taobao's 3 billion page views per day, grouping by gender allocates only 2 reducers, each processing 1.5 billion rows.
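The gender UV example above can be sketched in HiveQL. The table and column names (`logs`, `uid`, `gender`) are hypothetical, chosen only to illustrate the skew pattern:

```sql
-- Hypothetical log table: logs(uid STRING, gender STRING, ...).
-- Grouping billions of rows by a field with only 2 values means the
-- work lands on just 2 reducers -- a classic skew pattern.
SELECT gender, COUNT(DISTINCT uid) AS uv
FROM logs
GROUP BY gender;
```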

Common optimization techniques

First of all, understand the data's distribution; solving the skew problem based on your own knowledge of the business data is a good choice;

Increase the JVM (Java Virtual Machine) memory. This suits the case where there are few distinct values: when a handful of values dominate, often only hardware-level tuning helps, and increasing JVM memory can significantly improve running efficiency;
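For Hive on MapReduce, memory can be raised per session with MRv2 parameters; the values below are purely illustrative and should be sized to your cluster:

```sql
-- Illustrative values only; tune to the cluster's container limits.
SET mapreduce.map.memory.mb=4096;
SET mapreduce.map.java.opts=-Xmx3276m;
SET mapreduce.reduce.memory.mb=8192;
SET mapreduce.reduce.java.opts=-Xmx6553m;
```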

Increase the number of reducers. This suits the case where there are many distinct values: here the likely problem is that many keys land in the same partition, so a single reducer ends up doing an outsized share of the work;
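In Hive the reducer count can be controlled in two ways; the numbers below are examples, not recommendations:

```sql
-- Either fix the reducer count directly (example value)...
SET mapred.reduce.tasks=200;
-- ...or let Hive derive it from data size: bytes handled per reducer.
SET hive.exec.reducers.bytes.per.reducer=256000000;
```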

Redesign the key. One solution is to append a random number to the key in the map stage: salted keys will not (except with small probability) be assigned to the same node in large numbers, and the random number is stripped off again in the reduce stage;
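The salting idea can be expressed as a two-stage aggregation in HiveQL. The table and columns (`orders`, `product_id`) are hypothetical; the salt is computed once in an inner subquery so the GROUP BY sees a stable value:

```sql
-- Stage 1 (inner): attach a random salt 0-9, then partially aggregate
-- by (salt, key) so one hot key spreads over up to 10 reducers.
-- Stage 2 (outer): drop the salt and combine the partial counts.
SELECT product_id, SUM(cnt) AS total
FROM (
    SELECT salt, product_id, COUNT(1) AS cnt
    FROM (
        SELECT cast(floor(rand() * 10) AS int) AS salt, product_id
        FROM orders
    ) salted
    GROUP BY salt, product_id
) partial
GROUP BY product_id;
```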

Use a combiner to merge. The combiner runs in the map stage, an intermediate step before reduce; it can optionally pre-merge records with the same key, acting as a local reduce, before handing data to the reducers. This reduces the volume of data the map side sends to the reduce side (saving network bandwidth) and reduces the data pulled during the shuffle between map and reduce (less disk IO). (hive.map.aggr = true)
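In Hive, this map-side pre-aggregation is switched on with the setting mentioned above; the check-interval knob shown is an optional, related setting:

```sql
-- Enable map-side partial aggregation (Hive's combiner-like behavior).
SET hive.map.aggr=true;
-- Optional: how many rows Hive samples before deciding whether
-- map-side aggregation is worthwhile.
SET hive.groupby.mapaggr.checkinterval=100000;
```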

Setting a reasonable number of map and reduce tasks can effectively improve performance (for example, for a computation on the order of 100k rows, using 160 reducers is quite wasteful; 1 is enough);

When the data volume is large, use count(distinct) with caution, as it is prone to skew;
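A common rewrite avoids the single-reducer deduplication by splitting the work into a GROUP BY followed by a plain count. Table and column names (`logs`, `uid`) are hypothetical:

```sql
-- Skew-prone form: one reducer must deduplicate every uid.
-- SELECT COUNT(DISTINCT uid) FROM logs;

-- Rewrite: deduplicate with GROUP BY first (spread across many
-- reducers), then count the deduplicated rows.
SELECT COUNT(1)
FROM (
    SELECT uid
    FROM logs
    GROUP BY uid
) t;
```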

hive.groupby.skewindata=true;

This setting enables load balancing when data is skewed. When it is set to true, the generated query plan contains two MR jobs. In the first job, the map output is distributed randomly to the reducers; each reducer performs a partial aggregation and emits its result, so rows with the same Group By key may be spread across different reducers, balancing the load. The second job then distributes the pre-aggregated results to reducers by the Group By key (guaranteeing that identical keys reach the same reducer) and completes the final aggregation.
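Putting it together, the setting is applied per session and the query itself is unchanged; table and column names (`logs`, `gender`) are hypothetical:

```sql
SET hive.groupby.skewindata=true;
-- The same query now compiles into two MR jobs: the first randomly
-- distributes map output and partially aggregates; the second groups
-- by the real key to finish the aggregation.
SELECT gender, COUNT(1)
FROM logs
GROUP BY gender;
```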

 

Origin www.cnblogs.com/songyuejie/p/12730983.html