[Problem] Troubleshooting time-consuming batch processing of Spark Streaming

To troubleshoot time-consuming batch processing in Spark Streaming, first raise the driver and executor logs to DEBUG level, then check both logs directly for exceptions and WARN-level messages.
For example:
20/11/10 19:45:39 DEBUG NetworkClient: Disconnecting from node 0 due to request timeout.
20/11/10 19:45:39 DEBUG ConsumerNetworkClient: Cancelled FETCH request ClientRequest(expectResponse=true, callback=org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler@1f913c9d, request=RequestSend(header={api_key=1,api_version=2,correlation_id=21,client_id=consumer-1}, body={replica_id=-1, max_wait_time=500,min_bytes=1,topics=[{topic=testtopic,partitions=[{partition=1,fetch_offset=23317421,max_bytes=1048576}]}]}), createdTimeMs=1605008698482, sendTimeMs=1605008698483) with correlation id 21 due to node 0 being disconnected
20/11/10 19:45:39 DEBUG Fetcher: Fetch failed
org.apache.kafka.common.errors.DisconnectException
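
A minimal sketch of a job skeleton with the log level raised, assuming a standard StreamingContext setup (the app name and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical job skeleton; the app name and batch interval are placeholders.
val conf = new SparkConf().setAppName("streaming-troubleshooting")
val ssc = new StreamingContext(conf, Seconds(10))

// Raise the driver-side Spark log level to DEBUG while troubleshooting.
// Executor log level is normally raised by shipping a custom log4j.properties
// (--files log4j.properties plus spark.executor.extraJavaOptions).
ssc.sparkContext.setLogLevel("DEBUG")

Remember to drop back to INFO or WARN afterwards, since DEBUG logging is very verbose.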

Once an abnormal entry like this shows up in the logs, analyze the problem from the Spark side. The basic approach splits into three steps: the data reading phase, the data processing phase, and the data output phase.

1. In the data reading phase, for example when reading from upstream Kafka, some situations commonly encountered in practice are: a bandwidth limit on the Kafka servers slows down reads, or a disk performance problem on a single Kafka node slows down pulls from that node.
These situations can be checked through Kafka component monitoring and server monitoring.
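
For reference, a hedged sketch of reading from Kafka with the direct stream API; the consumer timeout settings below are the ones most relevant to the DisconnectException shown earlier (broker addresses, group id and timeout values are placeholders, and ssc is the StreamingContext from the sketch above):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-node1:9092,kafka-node2:9092",   // placeholder brokers
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "streaming-troubleshooting",                    // placeholder group id
  // How long the consumer waits for a response before disconnecting from a node
  // (the timeout behind "Disconnecting from node 0 due to request timeout").
  "request.timeout.ms" -> (60000: java.lang.Integer),
  // How long the broker may hold a fetch waiting for enough data
  // (the max_wait_time=500 seen in the log above).
  "fetch.max.wait.ms" -> (500: java.lang.Integer)
)

// ssc is the StreamingContext from the earlier sketch.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("testtopic"), kafkaParams)
)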

2. In the data processing stage, the processing pipeline is a chain of operators. Check two things here:
a. whether any operator depends on an external interface, for example an HTTP request that times out;
b. whether the tasks are skewed (see the Web UI): within the same batch, some tasks process 1 record while others process 100 (a sketch for measuring per-partition skew follows below).
Troubleshooting idea: add a wide-dependency operator (repartition, etc.) to split one stage into multiple stages and locate the time-consuming one.
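
A hedged sketch for quantifying the imbalance in item b: log how many records each partition holds in every batch, alongside what the Web UI shows:

import org.apache.spark.streaming.dstream.DStream

// Logs the ten most loaded partitions of every batch so skew is easy to spot.
def logPartitionSkew[T](stream: DStream[T]): Unit =
  stream.foreachRDD { rdd =>
    val sizes = rdd
      .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
      .collect()
    sizes.sortBy(-_._2).take(10).foreach { case (idx, n) =>
      println(s"partition $idx -> $n records")
    }
  }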

For example, the following DAG shows that Stage 0 takes 30s in total, and it is hard to tell which step inside it is the problem. In this situation, add wide-dependency operators to split the DAG into multiple stages so that each stage contains only one operator; the time-consuming stage can then be found quickly and optimized.

As shown in the figure (DAG of the original stage and of the split stages).
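
A sketch of this splitting trick, using stream from the Kafka sketch above; parse, isValid and enrich are hypothetical stand-ins for the real per-record logic:

// Hypothetical per-record steps, stand-ins for the real business logic.
def parse(v: String): String = v.trim
def isValid(v: String): Boolean = v.nonEmpty
def enrich(v: String): String = v.toUpperCase

// Original chain: narrow transformations are fused into a single stage, so the
// Web UI reports one 30s stage and cannot say which step is slow.
val enriched = stream
  .map(record => parse(record.value()))
  .filter(isValid)
  .map(enrich)

// While debugging, insert wide dependencies (repartition) between the steps.
// Each shuffle closes a stage, so the Web UI now reports the steps in separate
// stages and the slow one stands out. Remove the extra repartitions once the
// culprit is found, since each one adds shuffle cost.
val enrichedSplit = stream
  .map(record => parse(record.value()))
  .repartition(100)
  .filter(isValid)
  .repartition(100)
  .map(enrich)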

3. Data output stage
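
In this stage, slowness usually shows up as long tasks in the final stage of the Web UI, i.e. writes to the downstream sink. A minimal sketch of the common per-partition write pattern, with a hypothetical SinkClient standing in for the real database/HTTP/Kafka client and enriched taken from the sketch above:

// Hypothetical sink client, included only to keep the sketch self-contained.
class SinkClient {
  def write(v: String): Unit = ()   // stand-in for the real write call
  def close(): Unit = ()
}
object SinkClient {
  def connect(): SinkClient = new SinkClient
}

// Open one connection per partition instead of one per record; if the sink is
// the bottleneck, these tasks dominate the batch time in the Web UI.
enriched.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val sink = SinkClient.connect()
    try {
      records.foreach(sink.write)
    } finally {
      sink.close()
    }
  }
}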


Origin blog.51cto.com/10120275/2549797