Data Skew
Classification
| Operation | Situation | Consequence |
|---|---|---|
| join | One of the tables has a small amount of data, but its keys are concentrated | The data sent to one or a few reducers is far above the average |
| join | Large table and small table, with too many null values | All of these null values are handled by a single reducer, which is slow |
| group by | The group by dimension is too small, and one value has far too many rows | The reducer that processes that value is very slow |
| count distinct | Too many occurrences of a special value | The reducer that handles that special value is slow |
Data Skew Cause Analysis
Symptoms of Data Skew
- Job progress stays at 99% for a long time, and the monitoring progress bar shows that only a few reduce tasks have not finished.
- The processing time of a single task is far greater than the average processing time.
- Executors report Java heap space or OutOfMemoryError errors, executors die, and so on.
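To make these symptoms concrete, the following pure-Python sketch (not Spark code) simulates hash partitioning of a skewed key column and reports how much larger the biggest reduce partition is than the average. The helper names `partition_sizes` and `skew_ratio`, the use of `zlib.crc32` as a stand-in for the engine's hash function, and the scaled-down key list are assumptions made for illustration.

```python
import zlib


def partition_sizes(keys, num_partitions):
    """Count how many records each reduce partition receives under hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        # crc32 is only a deterministic stand-in for the engine's hash function
        sizes[zlib.crc32(str(key).encode()) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means perfectly even)."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical scaled-down data: one hot key ("1") holds 20% of all rows
keys = ["1"] * 2000 + [str(i) for i in range(2, 8002)]
sizes = partition_sizes(keys, 50)
print(f"largest partition is {skew_ratio(sizes):.1f}x the average")
```

A ratio far above 1 corresponds to the "one task runs much longer than the average" symptom: the task that owns the hot key has to process all of its rows alone.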
Causes in the Data
- Driving table: choose the table whose keys are most evenly distributed as the driving table, and prune unneeded columns.
- Large table joined to a small table: remember to use a map join. The small table is loaded into memory first and the join is completed on the map side, so no reduce stage is needed.
- Large table joined to a large table (the most common case): the join fields contain a large number of null values or null keys.
- Mismatched data types in the join fields: convert them to the same type first.
Common shuffle operators
- Deduplication
def distinct()
def distinct(numPartitions: Int)
- Aggregation
def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]
def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)]
def groupBy[K](f: T => K, p: Partitioner): RDD[(K, Iterable[T])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)(seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]
def combineByKey[C](createCombiner: V => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) =>
- Sorting
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length): RDD[(K, V)]
def sortBy[K](f: (T) => K, ascending: Boolean = true, numPartitions: Int = this.partitions.length
- repartition
def coalesce(numPartitions: Int, shuffle: Boolean = false, partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)
- Set and join operations
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def subtractByKey[W: ClassTag](other: RDD[(K, W)]): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], numPartitions: Int): RDD[(K, V)]
def subtractByKey[W: ClassTag](other: RDD[(K, W)], p: Partitioner): RDD[(K, V)]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
Common shuffle SQL operations
- aggregate function
GROUP BY + (sum, count, count distinct, max, min, avg, etc.)
sum, count, count distinct, max, min, avg, etc.
- join function
Data Preparation
Generate the users.txt, log.txt, log.txt_null, and count.txt data files with a program
Data file size
du -sh users.txt log.txt log.txt_null count.txt
2.0G log.txt (skewed on key value 1)
1.9G log.txt_null (contains null values)
3.7G count.txt
324K users.txt
drop table t_user;
create table t_user (
id string,
name string,
role string,
sex string,
birthday string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
drop table t_log;
create table t_log (
id string,
user_id string,
method string,
response string,
url string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
drop table t_log_null;
create table t_log_null (
id string,
user_id string,
method string,
response string,
url string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
drop table t_count;
create table t_count (
id string,
user_id string,
role_id string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
drop table t_relation;
create table t_relation (
id string,
user_id string,
role_id string,
name string,
sex string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
drop table t_join;
create table t_join (
id string,
name string,
role string,
url string,
method string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
-- Import the data
load data local inpath '/data/users.txt' into table t_user;
load data local inpath '/data/log.txt' into table t_log;
load data local inpath '/data/log.txt_null' into table t_log_null;
load data local inpath '/data/count.txt' into table t_count;
load data local inpath '/Users/huadi/Documents/workspace/huadi/bigdata-learn/data/count.txt' into table t_relation;
Data volume
select count(0) from t_log;
+------------+
| _c0 |
+------------+
| 40000000 |
+------------+
select count(0) from t_log_null;
+------------+
| _c0 |
+------------+
| 40000000 |
+------------+
select count(0) from t_user;
+----------+
| _c0 |
+----------+
| 10000 |
+----------+
-- Key distribution of the user_id field
select * from (select user_id, count(*) cou from t_log group by user_id) t order by cou desc limit 10;
+----------+-----------+
| user_id | count |
+----------+-----------+
| 1 | 8000000 |
+----------+-----------+
| 8797 | 3415 |
+----------+-----------+
| 9548 | 3402 |
+----------+-----------+
| 5332 | 3398 |
+----------+-----------+
| 6265 | 3395 |
+----------+-----------+
| 4450 | 3393 |
+----------+-----------+
| 3279 | 3393 |
+----------+-----------+
| 888 | 3393 |
+----------+-----------+
| 5573 | 3390 |
+----------+-----------+
| 3986 | 3388 |
+----------+-----------+
-- The value 1 occurs far more often than any other key
select * from (select user_id, count(*) cou from t_log_null group by user_id) t order by cou desc limit 10;
+----------+-----------+
| user_id | count |
+----------+-----------+
| | 36000000 |
+----------+-----------+
| 8409 | 485 |
+----------+-----------+
| 3503 | 482 |
+----------+-----------+
| 8619 | 476 |
+----------+-----------+
| 7172 | 475 |
+----------+-----------+
| 6680 | 472 |
+----------+-----------+
| 4439 | 470 |
+----------+-----------+
| 815 | 466 |
+----------+-----------+
| 7778 | 465 |
+----------+-----------+
| 3140 | 463 |
+----------+-----------+
The simulated data contains a large number of null (empty) user_id values
Common Scenarios
Note: the examples below use the spark-sql engine
Running the SQL
# Command and parameters for running SQL; each SQL statement below is passed through the -e option
spark-sql --executor-memory 5g --executor-cores 2 --num-executors 8 --conf spark.sql.shuffle.partitions=50 --conf spark.driver.maxResultSize=2G -e "${sql}"
Common Optimization Configurations
spark.sql.shuffle.partitions -- raises shuffle parallelism
spark.sql.autoBroadcastJoinThreshold -- enables map-side (broadcast) joins and sets the maximum size of a table that may be broadcast
spark.sql.optimizer.metadataOnly -- metadata-only query optimization
From spark-2.3.3 onward:
spark.sql.adaptive.enabled -- automatically adjusts the degree of parallelism
spark.sql.adaptive.shuffle.targetPostShuffleInputSize -- controls the target amount of data processed by each task
spark.sql.adaptive.skewedJoin.enabled -- automatically handles data skew in joins
spark.sql.adaptive.skewedPartitionFactor -- sets the skew factor
JOIN data skew:
First disable map-side joins by setting spark.sql.autoBroadcastJoinThreshold = -1, so that the joins below go through a shuffle
- Null problem
INSERT OVERWRITE TABLE t_join SELECT a.user_id AS id, b.name, b.role, a.url, a.method FROM t_log_null a JOIN t_user b ON a.user_id = b.id;
If the join field of the main table (a.user_id here) contains too many NULL values, it may cause data skew
The solution is as follows:
- Filter out useless null values
INSERT OVERWRITE TABLE t_join SELECT a.user_id AS id, b.name, b.role, a.url, a.method FROM t_log_null a JOIN t_user b ON a.user_id = b.id WHERE a.user_id != '';
- Replace null keys with random values
INSERT OVERWRITE TABLE t_join SELECT a.user_id AS id, b.name, b.role, a.url, a.method FROM (SELECT id, IF(user_id = '', cast(rand() AS string), user_id) AS user_id, method, response, url FROM t_log_null) a LEFT JOIN t_user b ON a.user_id = b.id;
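The effect of replacing empty keys with random values can be sketched in plain Python (this is a simulation, not Spark code). The function names, the `null_` prefix (added here so a salted key can never collide with a real id), `zlib.crc32` as a stand-in hash function, and the scaled-down data are all assumptions for illustration.

```python
import random
import zlib


def salt_empty_keys(user_ids, seed=0):
    """Replace every empty user_id with a random key that matches no real user."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [uid if uid != "" else f"null_{rng.random()}" for uid in user_ids]


def max_partition_share(keys, num_partitions=50):
    """Fraction of all rows that lands in the largest hash partition."""
    sizes = [0] * num_partitions
    for key in keys:
        sizes[zlib.crc32(key.encode()) % num_partitions] += 1
    return max(sizes) / len(keys)


# Hypothetical scaled-down t_log_null: 90% of the rows have an empty user_id
user_ids = [""] * 9000 + [str(i) for i in range(1, 1001)]
salted = salt_empty_keys(user_ids)

print(max_partition_share(user_ids))  # dominated by the single empty key
print(max_partition_share(salted))    # salted keys spread across partitions
```

Because the salted keys match no row in the user table, a LEFT JOIN still returns those rows with empty user columns, which is the same result the unsalted empty keys would have produced.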
- A large table joined to a small table can be handled with a map join
Enable map-side joins by setting spark.sql.autoBroadcastJoinThreshold = 26214400 (25 MB)
INSERT OVERWRITE TABLE t_join SELECT a.user_id AS id, b.name, b.role, a.url, a.method FROM t_log a JOIN t_user b ON a.user_id = b.id;
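As a rough illustration of why a map join avoids skew, here is a minimal pure-Python sketch of a broadcast hash join. It is not how Spark implements the operator; the function name `broadcast_hash_join`, the two-column row shapes, and the sample rows are hypothetical, and ids in the small table are assumed to be unique.

```python
def broadcast_hash_join(large_rows, small_rows):
    """Inner-join (user_id, url) rows with (id, name) rows without any shuffle."""
    # Build phase: the whole small table becomes an in-memory lookup table.
    # Assumes ids in the small table are unique.
    lookup = {user_id: name for user_id, name in small_rows}
    # Probe phase: each large-table row is matched locally, row by row.
    return [
        (user_id, lookup[user_id], url)
        for user_id, url in large_rows
        if user_id in lookup
    ]


t_user = [("1", "alice"), ("2", "bob")]  # hypothetical sample rows
t_log = [("1", "/a"), ("1", "/b"), ("3", "/c"), ("2", "/d")]
print(broadcast_hash_join(t_log, t_user))
```

Since every large-table row is matched where it already lives, rows are never regrouped by key, so a hot key cannot pile up on a single reducer.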
- Some join keys have too many rows
First check whether the main table can be deduplicated (pre-aggregated) before the join
-- Example
select count(1) from t_log t1 inner join t_user t2 on t1.user_id = t2.id;
-- Solution
select sum(t1.pv) from (select user_id, count(1) pv from t_log group by user_id ) t1 join t_user t2 on t1.user_id = t2.id;
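The two queries return the same number as long as t_user.id is unique. A small pure-Python check of that equivalence, with hypothetical function names and sample data, might look like this:

```python
from collections import Counter


def count_after_join(log_user_ids, user_ids):
    """count(1) over the join: one output row per matching log row."""
    users = set(user_ids)  # assumes user ids are unique
    return sum(1 for uid in log_user_ids if uid in users)


def count_with_preaggregation(log_user_ids, user_ids):
    """Aggregate the log per user_id first, then join and sum the partial counts."""
    pv = Counter(log_user_ids)  # select user_id, count(1) pv ... group by user_id
    users = set(user_ids)
    return sum(n for uid, n in pv.items() if uid in users)


log_user_ids = ["1"] * 5 + ["2", "3", "3", "9"]  # "9" has no matching user
user_ids = ["1", "2", "3"]
print(count_after_join(log_user_ids, user_ids))
print(count_with_preaggregation(log_user_ids, user_ids))
```

After pre-aggregation the join sees at most one row per user_id, so a hot key contributes a single row to the join instead of millions.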
- Joining fields of different data types also causes data skew
For example, the ID field in the registration table is of type int, while the ID field in the login table holds both string and int values. When the two tables are joined on ID, the default hash is computed on the int-typed ID, so all records with string-typed IDs are sent to a single reducer.
Solution: convert the numeric type to a string type
on haha.ID = cast(xixi.ID as string)
GROUP BY data skew:
- GROUP BY + COUNT DISTINCT with too much duplicate data
select user_id, count(distinct role_id) from t_count group by user_id;
Running this query fails immediately with a GC overhead limit error.
If the combination column_1 + column_2 (here user_id + role_id) contains many duplicates, deduplicate first and then GROUP BY.
The solution is as follows:
The DISTRIBUTE BY keyword controls how the map output is distributed: rows with the same value of the given field are sent to the same reduce node. If the field is rand(), a random number, each partition receives roughly the same number of rows.
select user_id, count(1) from ( select distinct user_id, role_id from t_count distribute by rand()) t group by user_id
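That deduplicating first and then counting rows gives the same answer as COUNT(DISTINCT ...) can be checked with a small pure-Python sketch. The function names and sample rows are hypothetical, and the sketch only models the result, not the execution plan.

```python
from collections import Counter, defaultdict


def count_distinct_direct(rows):
    """select user_id, count(distinct role_id) ... group by user_id"""
    seen = defaultdict(set)
    for user_id, role_id in rows:
        seen[user_id].add(role_id)
    return {user_id: len(roles) for user_id, roles in seen.items()}


def count_distinct_two_step(rows):
    """Deduplicate (user_id, role_id) pairs first, then count rows per user_id."""
    deduped = set(rows)  # select distinct user_id, role_id
    return dict(Counter(user_id for user_id, _ in deduped))  # group by user_id


rows = [("u1", "r1"), ("u1", "r1"), ("u1", "r2"), ("u2", "r1"), ("u2", "r1")]
print(count_distinct_direct(rows))
print(count_distinct_two_step(rows))
```

The second form lets the heavy deduplication run on the full (user_id, role_id) pair, so the final GROUP BY only has to count already unique rows.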
- Abnormal data leads to data skew
If it does not affect the statistical results, simply filter out the useless data.
- The key distribution is extremely uneven, and some keys are over-concentrated
Add a random value to the key and aggregate in two stages (local aggregation + global aggregation).
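Two-stage aggregation with a random value added to the key can be sketched in pure Python as follows. This is a simulation under assumptions, not Spark code: the function name `two_stage_count`, the number of salts, the fixed seed, and the sample keys are all chosen for illustration.

```python
import random
from collections import Counter


def two_stage_count(keys, num_salts=8, seed=0):
    """Count rows per key via salted local aggregation, then global aggregation."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    # Stage 1 (local): aggregate on (salt, key), which splits each hot key
    # into up to num_salts smaller groups.
    partial = Counter((rng.randrange(num_salts), key) for key in keys)
    # Stage 2 (global): drop the salt and merge the partial counts per key.
    total = Counter()
    for (_salt, key), n in partial.items():
        total[key] += n
    return partial, total


# Hypothetical skewed input: key "1" dominates
keys = ["1"] * 8000 + [str(i) for i in range(2, 2002)]
partial, total = two_stage_count(keys)
print(max(partial.values()))  # largest group any single task has to aggregate
print(total["1"])             # final count for the hot key is unchanged
```

The first stage spreads the hot key over several tasks, and the second stage only has to merge a handful of partial results per key. This works for aggregates that can be merged from partial results, such as count, sum, max, and min.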
DISTINCT data skew:
The solution is as follows:
Under the hood, distinct calls the reduceByKey() operator. If the keys are skewed, the whole computation is skewed. In that case, do not apply distinct to the data directly: add DISTRIBUTE BY, or group the data first and then select from the result.
-- Original
select distinct user_id, role_id from t_count;
-- Optimized, option 1
select distinct user_id, role_id from t_count distribute by rand();
-- Optimized, option 2
select user_id, role_id from (select user_id, role_id from t_count group by user_id, role_id) t;