[Big Data] Data skew in Spark and Spark SQL: symptoms and solutions

Data skew

Classification

  • join: one of the tables is small but its keys are concentrated, so the data routed to one or a few reducers is far above the average
  • join: a large table joined with a small table where the join key contains too many null values; all the null keys go to a single reducer, which is very slow
  • group by: the group-by dimension is too small and one value appears far too often; the reducer that processes that value is very slow
  • count distinct: too many rows share one special value; the reducer that handles that special value is slow

Data Skew Cause Analysis

Data skew symptoms
  • The job sits at 99% progress for a long time, and the monitoring progress bar shows that only a few reduce tasks have not completed.
  • Some tasks take far longer to process than the average task.
  • Executors throw Java heap space or OutOfMemoryError errors, or executors are lost (executor dead).
Data-related causes
  • Choose the table with the most even key distribution as the driving table, and apply column pruning.
  • For a join between a large table and a small table, remember to use a map join: the small table is loaded into memory and the join is completed on the map side.
  • The most common case! When two large tables are joined, the join key contains a large number of null values or empty keys.
  • When the data types of the join key do not match, convert the types before joining.

Common shuffle operators

  • Deduplication
def distinct()
def distinct(numPartitions: Int)
  • Aggregation
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def groupBy[K](f: T => K, p: Partitioner): RDD[(K, Iterable[T])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]
  • Sorting
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]
def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length)(implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]
  • repartition
def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)(implicit ord: Ordering[T] = null): RDD[T]
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
  • Set or table operations
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]

Common shuffle SQL operations

  • Aggregate functions
group by + (sum, count, count distinct, max, min, avg, etc.)
  • Join operations
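
For instance, both of the statements below trigger a shuffle (they reference the tables created in the data-preparation section that follows):

-- group by shuffles rows so that equal user_id values meet on the same reducer
select user_id, count(1) from t_log group by user_id;
-- a non-broadcast join shuffles both sides by the join key
select a.user_id, b.name from t_log a join t_user b on a.user_id = b.id;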

Data preparation

Generate users.txt, log.txt, log.txt_null and count.txt with a data-generation program

Data file size

du -sh users.txt log.txt log.txt_null count.txt

2.0G    log.txt (skewed: key = 1)
1.9G    log.txt_null (contains null values)
3.7G    count.txt
324K    users.txt
drop table t_user;
create table t_user (
        id string,
        name string,
        role string,
        sex string,
        birthday string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

drop table t_log;
create table t_log (
        id string,
        user_id string,
        method string,
        response string,
        url string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

drop table t_log_null;
create table t_log_null (
        id string,
        user_id string,
        method string,
        response string,
        url string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

drop table t_count;
create table t_count (
        id string,
        user_id string,
        role_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

drop table t_relation;
create table t_relation (
        id string,
        user_id string,
        role_id string,
        name string,
        sex string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

drop table t_join;
create table t_join (
        id string,
        name string,
        role string,
        url string,
        method string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';

-- load the data
load data local inpath '/data/users.txt' into table t_user;
load data local inpath '/data/log.txt' into table t_log;
load data local inpath '/data/log.txt_null' into table t_log_null;
load data local inpath '/data/count.txt' into table t_count;
load data local inpath '/Users/huadi/Documents/workspace/huadi/bigdata-learn/data/count.txt' into table t_relation;

Row counts

select count(0) from t_log;

+------------+
|    _c0     |
+------------+
|  40000000  |
+------------+

select count(0) from t_log_null;

+------------+
|    _c0     |
+------------+
|  40000000  |
+------------+

select count(0) from t_user;

+----------+
|   _c0    |
+----------+
|  10000   |
+----------+

-- key distribution of field user_id

select * from (select user_id, count(*) cou from t_log group by user_id) t order by cou desc limit 10;

+----------+-----------+
|  user_id |   count   |
+----------+-----------+
|     1    |  8000000  |
+----------+-----------+
|   8797   |    3415   |
+----------+-----------+
|   9548   |    3402   |
+----------+-----------+
|   5332   |    3398   |
+----------+-----------+
|   6265   |    3395   |
+----------+-----------+
|   4450   |    3393   |
+----------+-----------+
|   3279   |    3393   |
+----------+-----------+
|   888    |    3393   |
+----------+-----------+
|   5573   |    3390   |
+----------+-----------+
|   3986   |    3388   |
+----------+-----------+

-- the value 1 is extremely frequent

select * from (select user_id, count(*) cou from t_log_null group by user_id) t order by cou desc limit 10;

+----------+-----------+
|  user_id |   count   |
+----------+-----------+
|          |  36000000 |
+----------+-----------+
|   8409   |    485    |
+----------+-----------+
|   3503   |    482    |
+----------+-----------+
|   8619   |    476    |
+----------+-----------+
|   7172   |    475    |
+----------+-----------+
|   6680   |    472    |
+----------+-----------+
|   4439   |    470    |
+----------+-----------+
|   815    |    466    |
+----------+-----------+
|   7778   |    465    |
+----------+-----------+
|   3140   |    463    |
+----------+-----------+

The simulated data contains a very large number of empty (null) user_id values

Common scenarios

Note: the examples below are run on the spark-sql engine

Running the SQL

# spark-sql command and parameters; each SQL statement below is passed in through the -e flag
spark-sql --executor-memory 5g --executor-cores 2 --num-executors 8 --conf spark.sql.shuffle.partitions=50 --conf spark.driver.maxResultSize=2G -e "${sql}"

Common Optimization Configurations

spark.sql.shuffle.partitions -- increase shuffle parallelism
spark.sql.autoBroadcastJoinThreshold -- enable map-side (broadcast) join and control the maximum size of the broadcast table
spark.sql.optimizer.metadataOnly -- metadata-only query optimization
-- available since Spark 2.3.3:
spark.sql.adaptive.enabled -- automatically adjust the degree of parallelism
spark.sql.adaptive.shuffle.targetPostShuffleInputSize -- target amount of data processed by each post-shuffle task
spark.sql.adaptive.skewedJoin.enabled -- automatically handle join data skew
spark.sql.adaptive.skewedPartitionFactor -- skew factor used to decide whether a partition is skewed
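
As a sketch, these options can also be set per session inside spark-sql; the values below are only illustrative, not recommendations:

-- raise shuffle parallelism (illustrative value)
SET spark.sql.shuffle.partitions=200;
-- broadcast tables up to 25 MB
SET spark.sql.autoBroadcastJoinThreshold=26214400;
-- let adaptive execution adjust post-shuffle parallelism (Spark 2.3+)
SET spark.sql.adaptive.enabled=true;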

JOIN data skew:

To reproduce the skew, first disable the map-side (broadcast) join by setting spark.sql.autoBroadcastJoinThreshold = -1

  • Null problem
INSERT OVERWRITE TABLE t_join SELECT a.user_id AS id, b.name, b.role, a.url, a.method FROM t_log_null a JOIN t_user b ON a.user_id = b.id;

If the join key of the driving table (a.user_id here) contains too many NULL or empty values, data skew occurs

The solution is as follows:

  • Filter out useless null values
INSERT OVERWRITE TABLE t_join SELECT a.user_id AS id, b.name, b.role, a.url, a.method FROM t_log_null a JOIN t_user b ON a.user_id = b.id WHERE a.user_id != '';


  • Replace the empty keys with a random value
INSERT OVERWRITE TABLE t_join SELECT a.user_id AS id, b.name, b.role, a.url, a.method FROM (SELECT id, IF(user_id == '', rand(), user_id) AS user_id, method, response, url FROM t_log_null) a LEFT JOIN t_user b ON a.user_id = b.id;

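A variant of the same idea, sketched under the assumption that the rows with an empty user_id must be kept by the left join but should never collide with a real id, salts them with a recognizable prefix:

-- hypothetical variant: salt empty keys with a 'null_' prefix so they spread across reducers
-- and can never match a real user id
INSERT OVERWRITE TABLE t_join
SELECT a.user_id AS id, b.name, b.role, a.url, a.method
FROM (SELECT id,
             IF(user_id == '', concat('null_', cast(rand() AS string)), user_id) AS user_id,
             method, response, url
      FROM t_log_null) a
LEFT JOIN t_user b ON a.user_id = b.id;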

  • A large table joined with a small table can be handled with a map join

Enable the map-side join by setting spark.sql.autoBroadcastJoinThreshold = 26214400 (25 MB)

INSERT OVERWRITE TABLE t_join select a.user_id AS id, b.name, b.role, a.url, a.method from t_log a join t_user b on a.user_id = b.id

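Alternatively (a sketch, assuming the small table comfortably fits in executor memory), the broadcast can be forced with Spark SQL's broadcast hint instead of relying on the size threshold:

-- force t_user (aliased b) to be broadcast regardless of the threshold
INSERT OVERWRITE TABLE t_join
SELECT /*+ BROADCAST(b) */ a.user_id AS id, b.name, b.role, a.url, a.method
FROM t_log a JOIN t_user b ON a.user_id = b.id;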

  • A few join keys carry too much data

First check whether the driving table can be pre-aggregated (deduplicated) before the join

-- example
select count(1) from t_log t1 inner join t_user t2 on t1.user_id = t2.id
-- solution: pre-aggregate the driving table, then join
select sum(t1.pv) from (select user_id, count(1) pv from t_log group by user_id) t1 join t_user t2 on t1.user_id = t2.id
  • Joining columns with mismatched data types also causes data skew!

For example, the ID field in the registration table is int, while the ID field in the login table contains both string and int values. When the two tables are joined on ID, the default hash partitioning is computed on the int-typed IDs, so all records with string-typed IDs end up in a single reducer.

Solution: convert the numeric type to a string type

on haha.ID = cast(xixi.ID as string)
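
A small sketch with hypothetical tables t_reg (id stored as int) and t_login (user_id stored as string), just to show where the cast goes:

-- hypothetical tables; cast the int side so both join keys hash as strings
SELECT l.user_id, r.name
FROM t_login l
JOIN t_reg r ON l.user_id = cast(r.id AS string);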

GROUP BY data skew:

  • GROUP BY + COUNT(DISTINCT) over data with many duplicates
select user_id, count(distinct role_id) from t_count group by user_id;

Running this directly throws a GC overhead limit exceeded error.


If there are many duplicate (user_id, role_id) pairs, deduplicate first and then group by.
The solution is as follows

The distribute by clause controls how map output is distributed to reducers: rows with the same distribute-by value go to the same reducer. Distributing by rand() spreads rows randomly, so every partition receives roughly the same number of rows.

select user_id, count(1) from ( select distinct user_id, role_id from t_count distribute by rand()) t group by user_id
  • Abnormal data causes skew
    If the abnormal records do not affect the statistics, simply filter them out.
  • The key distribution is extremely uneven and some keys are heavily concentrated


  • Add a random prefix to the skewed key and aggregate in two stages (local aggregation followed by global aggregation); see the sketch below.
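
A minimal sketch of the two-stage aggregation on t_log, assuming the goal is a per-user_id row count and a salt space of 10 prefixes:

-- stage 0: attach a random prefix 0..9 to the skewed key
-- stage 1: aggregate on the salted key (the skewed key is now spread over 10 reducers)
-- stage 2: strip the prefix and aggregate again to get the final result
SELECT split(salted_key, '_')[1] AS user_id,
       sum(pv) AS pv
FROM (
    SELECT salted_key, count(1) AS pv
    FROM (
        SELECT concat(cast(floor(rand() * 10) AS string), '_', user_id) AS salted_key
        FROM t_log
    ) s
    GROUP BY salted_key
) t
GROUP BY split(salted_key, '_')[1];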

Distinct data skew:

The solution is as follows:
Under the hood, distinct calls the reduceByKey() operator, so if the keys are skewed the whole computation is skewed. Instead of running distinct directly on the data, add distribute by rand(), or group by the columns first and then select them.

-- original
select distinct user_id, role_id from t_count;
-- optimized 1: spread rows randomly before deduplicating
select distinct user_id, role_id from t_count distribute by rand();
-- optimized 2: replace distinct with group by
select user_id, role_id from (select user_id, role_id from t_count group by user_id, role_id) t;


Source: blog.csdn.net/u013412066/article/details/129793810