Several solutions for data skew when Spark is combined with Hive

Symptoms of data skew:

Most tasks finish quickly, while a few run very slowly or fail with out-of-memory errors.

 

Locating where the data is skewed:

Submit the job in client mode and observe the log to find the stage whose tasks are abnormally slow or failing.

 

Solutions

1. Perform data aggregation during the Hive ETL stage: aggregate all rows with the same key into a single row in advance, so the Spark job may no longer need a shuffle on that key, which removes the data skew.

When the keys themselves cannot be aggregated this way, choose another suitable granularity. For example, if the data contains a number of cities and a number of occupations, pre-aggregate at a granularity such as (city, occupation).
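A minimal sketch of what this pre-aggregation could look like, written here as Spark SQL against Hive since the original Hive ETL script is not shown; the table and column names (`user_visits`, `user_visits_agg`, `city`, `occupation`, `visit_cnt`) are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object HiveEtlPreAggregation {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-etl-pre-aggregation")
      .enableHiveSupport()
      .getOrCreate()

    // Pre-aggregate at (city, occupation) granularity during the ETL stage,
    // so downstream jobs read already-aggregated rows and need no shuffle on the skewed key.
    spark.sql(
      """
        |INSERT OVERWRITE TABLE user_visits_agg
        |SELECT city, occupation, SUM(visit_cnt) AS visit_cnt
        |FROM user_visits
        |GROUP BY city, occupation
      """.stripMargin)

    spark.stop()
  }
}
```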

 

2. Filter out the keys that cause the skew

If the business allows the data for certain keys to be discarded: for example, when two keys each correspond to 100,000 rows while the other keys have only a few dozen rows each, those two keys can simply be filtered out.
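A minimal sketch, assuming the skewed keys are already known and that `pairRdd` is an existing `RDD[(String, Long)]` (both names are illustrative):

```scala
// Hypothetical: the two keys known to dominate the data.
val skewedKeys = Set("key_a", "key_b")

// Drop the skewed keys before the shuffle happens.
val filteredRdd = pairRdd.filter { case (key, _) => !skewedKeys.contains(key) }

val result = filteredRdd.reduceByKey(_ + _)
```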

 

3. Increase the parallelism of the reduce side so that keys are spread across more tasks; this alone may be enough to resolve the skew.
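A minimal sketch, assuming `pairRdd` is an existing `RDD[(String, Long)]`; the partition count of 1000 is only illustrative:

```scala
// Pass a larger partition count to the shuffle operator so keys spread over more reduce tasks.
val result = pairRdd.reduceByKey(_ + _, numPartitions = 1000)
```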

 

4. Two-stage (double) aggregation (for non-join operators such as groupByKey and reduceByKey)

1. Prefix each key with a random number and a separator, such as "2_", so that the rows for any single key are spread relatively evenly across the salted keys, then perform the first aggregation.

2. Strip the random prefix (the "x_" part) from the keys of the RDD produced by the first aggregation, then aggregate again on the original keys.
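A minimal sketch of the two-stage aggregation, assuming `pairRdd` is an existing `RDD[(String, Long)]`; the salt range of 10 is illustrative:

```scala
import scala.util.Random

// Stage 1: salt each key with a random prefix so one hot key spreads over many salted keys.
val salted = pairRdd.map { case (key, value) =>
  val salt = Random.nextInt(10) // illustrative salt range
  (s"${salt}_$key", value)
}
val firstAgg = salted.reduceByKey(_ + _)

// Stage 2: strip the salt prefix and aggregate again on the original key.
val unsalted = firstAgg.map { case (saltedKey, value) =>
  (saltedKey.split("_", 2)(1), value)
}
val finalAgg = unsalted.reduceByKey(_ + _)
```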

 

5. Convert reduce join to map join

That is, put the data of one RDD into a broadcast variable and use it inside a map-style operation on the other RDD to pair the records directly, without a join operation: one RDD performs a mapToPair operation, while the other RDD is collected, turned into a map, and broadcast.
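A minimal sketch of the broadcast map join, written here with Scala's flatMap rather than the Java mapToPair mentioned above; it assumes `sc` is an existing SparkContext, `largeRdd` and `smallRdd` are existing `RDD[(String, String)]`s, and the small side fits in memory:

```scala
// Collect the small RDD to the driver as a map and broadcast it to every executor.
val smallMap = smallRdd.collectAsMap()
val smallBroadcast = sc.broadcast(smallMap)

// Join on the map side: look each key up in the broadcast map, with no shuffle involved.
val joined = largeRdd.flatMap { case (key, largeValue) =>
  smallBroadcast.value.get(key).map(smallValue => (key, (largeValue, smallValue)))
}
```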

 

Under what circumstances can a reduce join be converted into a map join?

When joining two RDDs, one of them must be noticeably smaller: for example, one RDD has 1 million rows and the other 10,000 rows (or one has 100 million rows and the other 1 million).

One of the RDDs has to be relatively small, because after broadcasting, a full copy of the small RDD's data resides in the block manager of every executor.

Make sure the executors have enough memory to hold the data of that small RDD.

This way no shuffle operation occurs at all, so data skew cannot occur; the skew that a join operation might cause is eliminated at the root.

For data skew in a join, try this method first whenever one of the RDDs is relatively small; the effect is very good.

 

For joins, this is not only about data skew: even when there is no skew at all, the reduce-join-to-map-join conversion described above can be given priority.

Instead of an ordinary join that shuffles the data, use a simple map-side join at the cost of a little extra memory; when it is feasible, use it first.

Skipping the shuffle and going straight to a map operation: will the performance be much higher? Definitely.

6. Sample and split out the skewed key: first sample about 10% of one of the RDDs to be joined (say rdd1) and find the key with the largest row count. Then filter the rows for that key out of both RDDs, splitting each of rdd1 and rdd2 into a skewed part and a normal part. Add a random prefix to the key in the skewed part of rdd1; in the skewed part of rdd2, where the key may appear in only one row, use a loop to generate one copy per possible prefix. Join the two skewed parts, join the two normal parts as usual, and then union the two results.
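A minimal sketch of this split-and-salt join, assuming `rdd1` and `rdd2` are existing `RDD[(String, String)]`s and that a single skewed key is extracted; the sample fraction and salt range are illustrative:

```scala
import scala.util.Random

val saltRange = 10 // illustrative salt range

// 1. Sample ~10% of rdd1 and find the key with the largest row count.
val skewedKey = rdd1.sample(withReplacement = false, 0.1)
  .map { case (key, _) => (key, 1L) }
  .reduceByKey(_ + _)
  .map { case (key, count) => (count, key) }
  .sortByKey(ascending = false)
  .first()._2

// 2. Split both RDDs into the skewed part and the normal part.
val skewed1 = rdd1.filter { case (key, _) => key == skewedKey }
val normal1 = rdd1.filter { case (key, _) => key != skewedKey }
val skewed2 = rdd2.filter { case (key, _) => key == skewedKey }
val normal2 = rdd2.filter { case (key, _) => key != skewedKey }

// 3. Salt the skewed part of rdd1 with a random prefix, and expand each row
//    of rdd2's skewed part once per possible prefix so every salted key can match.
val salted1 = skewed1.map { case (key, value) => (s"${Random.nextInt(saltRange)}_$key", value) }
val expanded2 = skewed2.flatMap { case (key, value) =>
  (0 until saltRange).map(i => (s"${i}_$key", value))
}

// 4. Join the skewed parts (now evenly spread), strip the salt,
//    join the normal parts as usual, and union the two results.
val skewedJoined = salted1.join(expanded2)
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
val result = skewedJoined.union(normal1.join(normal2))
```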

 

When does this not apply?

If many keys in the RDD cause data skew, it is best not to use this approach.

 

7. Expansion method

Expand one RDD n times (the larger n, the better, as long as memory allows): for every row of that RDD, emit n copies whose keys are prefixed with 0_ through (n-1)_. Add a random prefix in the same range to each key of the other RDD,

then join the two RDDs.
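A minimal sketch, assuming `rdd1` and `rdd2` are existing `RDD[(String, String)]`s, with `rdd2` being the one that is expanded; the expansion factor of 10 is illustrative:

```scala
import scala.util.Random

val n = 10 // illustrative expansion factor

// Add a random prefix in [0, n) to every key of rdd1.
val prefixed1 = rdd1.map { case (key, value) => (s"${Random.nextInt(n)}_$key", value) }

// Expand rdd2 n times, once per possible prefix, so every salted key can find a match.
val expanded2 = rdd2.flatMap { case (key, value) =>
  (0 until n).map(i => (s"${i}_$key", value))
}

// Join on the salted keys, then strip the prefix from the result keys.
val joined = prefixed1.join(expanded2)
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }
```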

 

Applicable scenarios

Both RDDs are large, and many keys contribute to the data skew.

 

Limitation

High memory consumption, since one RDD is expanded n times.
