Spark data skew handling in practice

Determining whether data skew exists

Use the Spark Web UI to check which stage is currently running. To decide whether the job suffers from data skew, look at the Shuffle Write Size / Records assigned to each task of the slowest stage and compare the largest task with the average of the other tasks; a large ratio indicates skew.

Practice

Locating the skew

[Figure: Spark Web UI stage list, with one stage running far longer than the others]

As shown in the figure, most stages finish within a few minutes, while this stage has been running for a long time and only a single task is still incomplete. The probability of data skew in this situation is very high, so we click into the stage to look at the details.
[Figure: stage detail page with the DAG visualization]
From the DAG we can see that there is a leftOuterJoin operator, so the slow task must be produced by its shuffle. Next, look at how Shuffle Write Size / Records is distributed across the tasks.
[Figure: task list showing Shuffle Write Size / Records per task]
We can see that the value for the task with index 0 is about two orders of magnitude larger than for the other tasks, so this is an obvious case of data skew.

Solution

Through the DAG we can easily locate the corresponding code. Knowing that the skew is caused by the left join, we can sample the two input tables, for example with sample, and count the top-N keys.
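A minimal sketch of the sampling approach, assuming the left input is a key/value RDD named rdd_left (a hypothetical name):

    // Sample roughly 10% of the records without replacement, count them per key,
    // and print the 20 most frequent keys. Keys that dominate the sample (or empty
    // keys) are the usual suspects for join skew.
    val topKeys = rdd_left
      .sample(withReplacement = false, fraction = 0.1)
      .map { case (k, _) => (k, 1L) }
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(20)
    topKeys.foreach { case (k, cnt) => println(s"$k\t$cnt") }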
[Figure: key counts from the sample, sorted in descending order]

Here we sort by key count and find that the empty key accounts for a large proportion of the left table, so we add a filter before the join to drop empty keys:

    // drop records whose join key is empty before the leftOuterJoin
    rdd_left.filter(f => !f._1.isEmpty)

Next, we resubmit the job and find that the leftOuterJoin stage now completes within two minutes.
Of course, not everything went smoothly, and the next case of data skew appeared soon afterwards. This time it was an inner join with hundreds of millions of rows on the left and a little over 8 million on the right. Following the previous method, we first checked whether a few keys were clearly dominant and found that this was not the case. So we switched to another approach that is commonly used in Spark: broadcasting. Since the right side is a small table, we can broadcast it and replace the join with a map.

    val rdd1 = .. //sc.parallelize(Array(("aa",1),("bb",2),("cc",6)))
    val rdd2 = .. //sc.parallelize(Array(("aa",3),("dd",4),("aa",5)))
    // broadcast the small table as a map so every executor holds a local copy
    val b_df = sc.broadcast(rdd2.collectAsMap())
    // val rdd3 = rdd1.join(rdd2)
    val rdd3 = rdd1.mapPartitions { iter =>
      val small = b_df.value
      // keep only keys present in the small table (inner-join semantics), no shuffle needed
      iter.flatMap { case (k, v) => small.get(k).map(v2 => (k, (v, v2))) }
    }

Extensions

Commonly used methods for handling data skew include increasing the parallelism of shuffle operations, two-stage aggregation, sampling the skewed keys and splitting the join, and joining with random prefixes and an expanded RDD. If you work in data development, these basic techniques should already be familiar.
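As a quick illustration of the first technique, a minimal sketch of raising shuffle parallelism; pairRdd and spark are hypothetical names for a key/value RDD and the SparkSession:

    // Pass an explicit partition count to the shuffle operator: more reduce tasks
    // mean fewer keys per task, which dilutes moderate skew (it does not help when
    // a single key dominates).
    val counts = pairRdd.reduceByKey(_ + _, 1000)
    // For Spark SQL group by / join, raise the shuffle partition count instead.
    spark.conf.set("spark.sql.shuffle.partitions", "1000")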

Two-stage aggregation (Meituan example)

Applicable scenarios: this solution suits cases where an aggregation-type shuffle operator such as reduceByKey is applied to an RDD, or a group by statement in Spark SQL performs grouped aggregation.

Implementation idea: the core idea is to aggregate in two stages. The first stage is partial aggregation: each key is given a random prefix, for example a random number within 10, so that identical keys become different ones. For instance, (hello, 1) (hello, 1) (hello, 1) (hello, 1) may become (1_hello, 1) (1_hello, 1) (2_hello, 1) (2_hello, 1). An aggregation such as reduceByKey is then run on the prefixed data, producing partial results like (1_hello, 2) (2_hello, 2). The prefix is then removed from each key, giving (hello, 2) (hello, 2), and a second, global aggregation produces the final result, such as (hello, 4).

Why it works: adding a random prefix turns one key into several different keys, so the data that would otherwise be processed by a single task is spread across multiple tasks for partial aggregation, solving the problem of a single task handling too much data. Removing the prefix and aggregating again then yields the final global result. The principle is shown in the figure below.

Advantages: the effect is very good for data skew caused by aggregation-type shuffle operations. It usually eliminates the skew, or at least greatly alleviates it, improving Spark job performance several times over.

Disadvantages: it only applies to aggregation-type shuffle operations, so its scope is relatively narrow. For join-type shuffles, other solutions are needed.
[Figure: two-stage aggregation with random key prefixes]

// Step 1: prefix every key in the RDD with a random number.
JavaPairRDD<String, Long> randomPrefixRdd = rdd.mapToPair(
        new PairFunction<Tuple2<Long,Long>, String, Long>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, Long> call(Tuple2<Long, Long> tuple)
                    throws Exception {
                Random random = new Random();
                int prefix = random.nextInt(10);
                return new Tuple2<String, Long>(prefix + "_" + tuple._1, tuple._2);
            }
        });

// Step 2: partially aggregate the keys carrying the random prefix.
JavaPairRDD<String, Long> localAggrRdd = randomPrefixRdd.reduceByKey(
        new Function2<Long, Long, Long>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Long call(Long v1, Long v2) throws Exception {
                return v1 + v2;
            }
        });

// Step 3: strip the random prefix from every key in the RDD.
JavaPairRDD<Long, Long> removedRandomPrefixRdd = localAggrRdd.mapToPair(
        new PairFunction<Tuple2<String,Long>, Long, Long>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<Long, Long> call(Tuple2<String, Long> tuple)
                    throws Exception {
                long originalKey = Long.valueOf(tuple._1.split("_")[1]);
                return new Tuple2<Long, Long>(originalKey, tuple._2);
            }
        });

// Step 4: globally aggregate the RDD whose random prefixes have been removed.
JavaPairRDD<Long, Long> globalAggrRdd = removedRandomPrefixRdd.reduceByKey(
        new Function2<Long, Long, Long>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Long call(Long v1, Long v2) throws Exception {
                return v1 + v2;
            }
        });

Join using a random prefix and an expanded RDD

Applicable scenarios: if a large number of keys in the RDD cause data skew during a join, splitting out individual keys is pointless, and this last solution is the only way left to handle the problem.

Implementation idea: the idea is basically similar to "Solution 6".
* First, inspect the data distribution in the RDD/Hive table and find the one causing the skew, for example a table where many keys each correspond to more than 10,000 records.
* Then prefix every record in that RDD with a random number within n.
* At the same time, expand the other, normal RDD: turn each of its records into n records, prefixed in turn with 0 through n-1.
* Finally, join the two processed RDDs.

Why it works: adding a random prefix turns identical keys into different keys, so these processed "different keys" can be distributed across multiple tasks instead of one task handling a large number of identical keys. The difference from "Solution 6" is that the previous solution applies special processing only to the data of a small number of skewed keys; since that processing expands only part of an RDD, its memory footprint is small. This solution targets the case of many skewed keys, where it is impossible to split out just some of them, so the entire RDD has to be expanded, which places high demands on memory resources.

Advantages: it can handle join-type data skew in most cases, the effect is relatively significant, and the performance improvement is very good.

Disadvantages: this scheme alleviates data skew rather than avoiding it entirely, and because the whole RDD has to be expanded, it requires a lot of memory.

Practical experience: while developing a data requirement we found that a join caused data skew. Before optimization the job took about 60 minutes; after applying this scheme it finished in about 10 minutes, a 6x performance improvement.

// First, expand the RDD whose keys are relatively evenly distributed by a factor of 100.
JavaPairRDD<String, Row> expandedRDD = rdd1.flatMapToPair(
        new PairFlatMapFunction<Tuple2<Long,Row>, String, Row>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Iterable<Tuple2<String, Row>> call(Tuple2<Long, Row> tuple)
                    throws Exception {
                List<Tuple2<String, Row>> list = new ArrayList<Tuple2<String, Row>>();
                // Emit 100 copies of each record, prefixed with 0_ through 99_.
                for(int i = 0; i < 100; i++) {
                    list.add(new Tuple2<String, Row>(i + "_" + tuple._1, tuple._2));
                }
                return list;
            }
        });

// Then prefix every record of the other RDD (the one with skewed keys) with a random number below 100.
JavaPairRDD<String, String> mappedRDD = rdd2.mapToPair(
        new PairFunction<Tuple2<Long,String>, String, String>() {
            private static final long serialVersionUID = 1L;
            @Override
            public Tuple2<String, String> call(Tuple2<Long, String> tuple)
                    throws Exception {
                Random random = new Random();
                int prefix = random.nextInt(100);
                return new Tuple2<String, String>(prefix + "_" + tuple._1, tuple._2);
            }
        });

// Finally, join the two processed RDDs.
JavaPairRDD<String, Tuple2<String, Row>> joinedRDD = mappedRDD.join(expandedRDD);

Reference: https://tech.meituan.com/2016/05/12/spark-tuning-pro.html
