Summary of Spark development issues

A record and analysis of problems encountered in Spark development.


1 View logs

Log source | How to obtain | Characteristics
Scheduling system | View the scheduling log directly | Fastest way to locate simple problems; the log is the most concise, so it does not support in-depth analysis
Spark UI | Look the job up by application ID on the Spark UI | Graphical display, easy to analyze problems and the execution process; more complete logs
YARN log | yarn logs -applicationId <application_id> > res.log | Plain-text log, but the most detailed; suited to in-depth analysis

Log loss

1. Logs are lost

1) The node where the driver was running has been lost, e.g. due to cluster scale-in or spot instance reclamation.


2 Insufficient resources

Both the driver and the executors may run short of memory.

1. Error code -134

Aggregate function causes memory overflow

1. Error description

Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err


2. Functions

collect_set, collect_list


3. Reason

A certain key (null, 'unknown', the empty string, etc.) occurs far too many times, so the aggregated array of values for that key grows too large and goes out of range (array out of range);

4. Solutions

1) Increase memory

2) Find the keys that exceed the limit and filter them out (a sketch follows);
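A minimal Spark SQL sketch of solution 2), with hypothetical names (event_table, id, pkg, dt are not from the source): exclude the known hot keys before aggregating.

-- Hypothetical table/column names; adjust to your own schema.
select
    id,
    collect_set(pkg) as pkgs
from
    event_table
where
    dt = '20230326'
    -- exclude the keys whose aggregated arrays would go out of range
    and id is not null
    and id not in ('', 'unknown')
group by
    id;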

2. Error code -104

Broadcast join causes memory overflow

diagnostics: Application application_id failed 2 times due to AM Container for appattempt_id exited with  exitCode: -104
Container is running beyond physical memory limits. Current usage: 2.4 GB of 2.4 GB physical memory used; 4.4 GB of 11.9 GB virtual memory used. Killing container.
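The source lists no fix for this one; a minimal sketch of a common mitigation, assuming the overflow comes from a table that Spark broadcasts automatically: disable (or lower) the auto-broadcast threshold so the join falls back to a shuffle join, or raise the driver/AM memory at submit time.

-- -1 disables automatic broadcast joins (the default threshold is 10MB);
-- a smaller positive byte value merely lowers the threshold.
set spark.sql.autoBroadcastJoinThreshold=-1;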
 

3. Error code -137

Container out of memory

1. Error description

Job aborted due to stage failure: Task 2 in stage 26.0 failed 4 times, most recent failure: Lost task 2.3 in stage 26.0 (TID 3253, ip-10-20-68-111.eu-west-1.compute.internal, executor 43): ExecutorLostFailure (executor 43 exited caused by one of the running tasks) Reason: Container from a bad node: container_e03_1634810603944_186010_01_000121 on host: ip-10-20-68-111.eu-west-1.compute.internal. Exit status: 137. Diagnostics: [2022-05-27 01:49:51.535]Container killed on request. Exit code is 137

2. Solutions

Resolve the "Container killed on request. Exit code is 137" error in Spark on Amazon EMR

Spark task Container killed Exit code 137 error on AWS EMR
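Besides the launch-time fixes described in the articles above (which have to go on spark-submit), one mitigation that can be applied from SQL is to spread the data over more shuffle partitions so each task holds less in memory at once; a hedged sketch:

-- More, smaller shuffle partitions lower the per-task memory footprint (default is 200).
set spark.sql.shuffle.partitions=1000;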

4. Error code - OutOfMemoryError

1 driver memory overflow

1) Error message

java.lang.OutOfMemoryError: GC overhead limit exceeded
-XX:OnOutOfMemoryError="kill -9 %p"
Executing /bin/sh -c "kill -9 29082"...
java.lang.OutOfMemoryError: Java heap space
-XX:OnOutOfMemoryError="kill -9 %p
Executing /bin/sh -c "kill -9 23463

2) Possible causes and solutions

Cause: a larger table is broadcast
a. Increase memory;
b. Stop broadcasting that table

Cause: too many data source partitions
a. Increase memory;
b. Reduce the number of upstream partitions

Cause: too much data is collected to the driver
a. Increase memory;
b. Reduce the result set

Cause: the server hosting the driver is short of resources
a. Appropriately reduce the driver memory;
b. Move the driver to a server with sufficient resources

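For the "larger table is broadcast" case, besides disabling the auto-broadcast threshold, Spark 3.0+ join hints can force a sort-merge join for a single statement; a sketch with hypothetical tables big_a and big_b:

-- The MERGE hint requests a sort-merge join, so neither side is broadcast through the driver.
select /*+ MERGE(bb) */
    aa.id
from
    big_a aa
join
    big_b bb
    on aa.id = bb.id;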

3. Cache invalidation

Problem description: in Spark 3, when the cached table's WHERE clause contains an IN (subquery) filter, later queries do not hit the cache and instead go back and read the source data.
-- Form that does NOT hit the cache:
cache table test_tamp1 as(
    select
        id
    from
        table_name_01
    where
        dt = '20230326'
        and pkg in (select pkg from table_name_02)
    group by
        id
);



Solution
-- Replace the IN with a (left semi) join:
cache table test_tamp1 as(
    select
        id
    from
        table_name_01 aa
    left semi join
        table_name_02 bb
        on aa.pkg = bb.pkg
    where
        dt = '20230326'
    group by
        id
);

-- Or set this parameter:
set spark.sql.legacy.storeAnalyzedPlanForView=true;

4 Data Skew

Possible Causes:

1) A hot key (a null or otherwise abnormal value) appears in the join;

2) coalesce reduces the number of partitions, which concentrates the data and causes skew;

1 View the job log

2 Find the skewed primary keys

There are two ways to do this (a sketch follows):

1) Primary-key distribution of the source table: analyze the source table directly; the data has not exploded yet, so the query is fast.

2) Primary-key distribution after the join: the data is already skewed, so the query is slow.
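A minimal sketch of both checks, with hypothetical tables fact_table and dim_table joined on id (names are not from the source):

-- 1) Key distribution on the source table (fast, data not yet exploded)
select
    id,
    count(1) as cnt
from
    fact_table
where
    dt = '20230326'
group by
    id
order by
    cnt desc
limit 100;

-- 2) Key distribution after the join (already skewed, slow)
select
    aa.id,
    count(1) as cnt
from
    fact_table aa
join
    dim_table bb
    on aa.id = bb.id
group by
    aa.id
order by
    cnt desc
limit 100;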

3 Optimization measures

1 Filter out invalid and abnormal primary keys, for example (a sketch follows the list):

1) id = 'abnormal primary key'
2) id is null
3) id = ''
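A sketch of applying those filters before the join, again with the hypothetical fact_table / dim_table names:

select
    aa.id
from
    fact_table aa
join
    dim_table bb
    on aa.id = bb.id
where
    aa.id is not null
    and aa.id != ''
    and aa.id != 'abnormal primary key';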

5. Error in write phase

1) Increase the write parallelism (a sketch follows the list)

2) Increase memory

3) Let the data spill to disk instead of holding it all in memory
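A sketch of 1): force a shuffle to a chosen number of partitions right before writing with the REPARTITION hint (Spark 2.4+); target_table and source_table are hypothetical names.

-- 800 write tasks instead of whatever the upstream stage produced
insert overwrite table target_table partition (dt = '20230326')
select /*+ REPARTITION(800) */
    id,
    pkg
from
    source_table
where
    dt = '20230326';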

6. Stack overflow error

java.lang.StackOverflowError

`java.lang.StackOverflowError` is usually caused by too deep recursive calls. In Spark, this can happen especially when you do complex joins or nest multiple data processing operations.

Scenarios that cause this problem:

Unioning too many tables together (a long chain of unions builds a very deep query plan, and the recursive plan traversal overflows the stack)

To avoid this situation, you can try the following solutions:

1. Increase the stack size of the JVM: setting the `-Xss` parameter when starting the application increases the JVM stack size. For example, `spark-submit --conf spark.driver.extraJavaOptions=-Xss4m yourApp.jar` sets the driver stack size to 4MB. However, this approach may cause the JVM to use more memory, so it needs to be used with caution.

2. Use the `repartition` method: The `repartition` method can repartition the data, thereby reducing the load when joining. For example, if you have 100 partitions in your dataset, you can use `df.repartition(10)` to divide it into 10 partitions to reduce the load.

3. Use the `broadcast` variable: The `broadcast` variable can broadcast the variable to all working nodes. This approach is generally suitable for small datasets and can reduce the load on each worker node. For example, if you need to join a very large dataset with a very small dataset, you can broadcast the small dataset to all worker nodes to reduce load.
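For the "too many unions" case specifically, a further SQL-side workaround (not listed in the source, and using hypothetical table names) is to materialize part of the union chain with the same cache table syntax shown earlier, so the remaining plan stays shallow:

-- Materialize the first part of a long union chain
cache table union_part1 as
select id, pkg from table_a
union all
select id, pkg from table_b
union all
select id, pkg from table_c;

-- The final statement then unions the cached result with the rest
select id, pkg from union_part1
union all
select id, pkg from table_d;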


10 Other Summary

1) Cannot read the table

Error in query: java.lang.IllegalArgumentException: Can not create a Path from an empty string;

Solution: recreate the view in Hive.

2) Cannot read a column

catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree

Cause: a column that does not exist was read.

3) Data in other partitions is deleted

Dynamic partition overwrite problem; see "Apache Spark dynamic partition OverWrite problem – past memory".

4) Cannot read and write the same table

Error in query: Cannot overwrite a path that is also being read from.

Solution: turn off the Hive metastore table conversion:

set spark.sql.hive.convertMetastoreParquet=false;
set spark.sql.hive.convertMetastoreOrc=false;

5) A set parameter does not take effect

1. Confirm it in the Spark environment (e.g. the Environment tab of the Spark UI);

2. Check whether it is overwritten again by an intermediate component (for example, Kyuubi has its own default parameters).

20 Open issues

1) select * from table_name where dt = '20220423': after filtering the data and writing it out to a table, how do we control the write parallelism and increase the processing parallelism?

2) How to handle exceptions in spark-sql?

3) When multiple tables are joined, how to load each table serially?

Origin blog.csdn.net/weixin_40829577/article/details/120428653