Small note: data skew caused by a sort merge join

Scenario

A colleague's job takes an order table with about 200 million rows covering three years, joins it with several dimension tables, and writes the result back to Hive. Each run took about three hours. Another job of mine works on a table of roughly the same size, about 200 million rows, also joined with several dimension tables, yet it took only 6 minutes.

Colleague’s tasks:

My task:

Data volume:

Troubleshooting

First, open the Spark History Server web UI, find this task, and check which job takes a long time. There is one job that takes about 2 hours:

Looking at the execution plan of this job, the large table on the left has 99 GB of data while the small table on the right has only 16 MB, yet a sort merge join is used. Automatic broadcast join had been enabled in the Spark conf for tables under roughly 40 MB, but it did not take effect.

spark.sql.autoBroadcastJoinThreshold=41485760
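For reference, the same setting can also be applied per session with a SET statement in Spark SQL; a minimal sketch using the value from the conf above:

-- Tables whose estimated size is below this threshold (here roughly 40 MB)
-- are broadcast automatically instead of going through a sort merge join.
SET spark.sql.autoBroadcastJoinThreshold=41485760;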

Looking at the event timeline, I found that all the data is in one task and the other tasks have no data at all. No wonder it takes so long.

Looking at the code, there are no aggregation operations such as GROUP BY or DISTINCT, only joins.

However, it turned out that every join in his query uses the entire dimension table. Normally, before joining a left table with a right table, you should query both tables first, taking only the columns you need (SELECT) and filtering out unnecessary rows (WHERE). First, this makes the code more readable. Second, when the data volume is large, it greatly reduces the time and memory spent on data transfer, computation, and storage. Third, with the sub-query done in advance, Spark can estimate the data volume: when it is small enough, a broadcast join is chosen automatically; otherwise a sort merge join is used.

For example, a material dimension table has more than 200 fields, but only 4 of them are needed in this calculation. Selecting those four columns in advance cuts the data volume by roughly 98%.
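A minimal sketch of such a pre-filtered join, reusing the fact and dimension table names from the query further down; material_name is an illustrative column, not one from the original query:

SELECT
  tmall.countercode AS terminal_code,
  mat.material_name
FROM fact_transactionlog_online_tmall_filter tmall
LEFT JOIN (
  -- keep only the columns this calculation needs, so the dimension
  -- table shrinks enough for Spark to choose a broadcast join
  SELECT material_code, material_name, pt_month
  FROM ldldws.dim_material
) mat
ON tmall.material_code = mat.material_code
AND mat.pt_month = tmall.txddate_month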

My code:

 

Judgment

All the data ends up in one task even though there is no aggregation in the code. Although the raw data (the file sizes in HDFS) is somewhat skewed (mostly 1.1 GB files, with a small portion around 400+ MB), that alone would not cause this. My guess is that during the sort merge join the data is repartitioned by the join key, and the skewed key distribution pushes almost all of the rows into a single task. This is an extreme case, but there is no other explanation at the moment.
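A quick way to check for this kind of key skew, assuming material_code is the dominant join key as in the query below, is to count rows per key value and see whether a few values (often NULL or a default code) hold most of the data:

-- Rows per join key; if one or two values dominate, the corresponding
-- partitions of the sort merge join will be hugely oversized.
SELECT material_code, COUNT(*) AS cnt
FROM fact_transactionlog_online_tmall_filter
GROUP BY material_code
ORDER BY cnt DESC
LIMIT 20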

Measures

Rewriting all the dimension table joins as sub-queries would mean large code changes, so I recommended that my colleague use the BROADCAST hint to mark the dimension tables for broadcasting manually. If multiple tables need to be broadcast, separate them with commas. For example:

SELECT
  /*+ BROADCAST(mat,oms,mdm,mapping,ter,org,scs) */
  tmall.countercode AS terminal_code
  ...
  ...
  ...
FROM
  fact_transactionlog_online_tmall_filter tmall
LEFT JOIN
  ldldws.dim_material mat
  ON tmall.material_code = mat.material_code AND mat.pt_month = tmall.txddate_month
LEFT JOIN
  ldldws.dim_product_oms oms
  ON tmall.material_code = oms.oms_code AND mat.material_code IS NULL
LEFT JOIN
...
...
...

 

Result

After modifying the code, the Spark execution plan is shown in the figure:

 

Event timeline:


Total execution time:


After the modification, the execution time dropped by about 2 hours. Because my colleague used CREATE TABLE ... AS to build intermediate tables in the SQL, writing those intermediate results to disk adds to the running time. If it is not really necessary, it is better to use a sub-query directly instead of creating an intermediate table; that would shorten the running time further, as in the sketch below.
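A minimal sketch of that change; filtered_orders and order_id are illustrative names, the point is that the intermediate result stays inside one query instead of being written out to a table:

-- Instead of CREATE TABLE tmp AS ... followed by a second query that
-- reads tmp, keep the intermediate result as a CTE / sub-query:
WITH filtered_orders AS (
  SELECT order_id, material_code
  FROM fact_transactionlog_online_tmall_filter
  WHERE order_id IS NOT NULL
)
SELECT /*+ BROADCAST(mat) */
  o.order_id,
  mat.material_code
FROM filtered_orders o
LEFT JOIN ldldws.dim_material mat
  ON o.material_code = mat.material_code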

Summary

1. Data skew can occur in any shuffle operation, including joins.

2. Before joining tables, use sub-queries to filter out unneeded rows and columns and reduce the data volume.


Origin blog.csdn.net/x950913/article/details/108253988