Common errors when running Spark programs: solutions and optimization

1. org.apache.spark.shuffle.FetchFailedException
1. Problem description
This problem usually appears in jobs with many shuffle operations: tasks keep failing, get re-executed, fail again, and the cycle repeats, which is very time-consuming.
2. Error message
(1) missing output location
  org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
(2) shuffle fetch failed
  org.apache.spark.shuffle.FetchFailedException: Failed to connect to spark047215/192.168.47.215:50268
The configuration at the time was 1 CPU core and 5 GB RAM per executor, with 20 executors started.
3. Solutions
When you hit this kind of problem, increase the executor memory and, at the same time, the number of cores per executor, so that task parallelism is not reduced.
  • spark.executor.memory 15G
  • spark.executor.cores 3
  • spark.cores.max 21
The number of executors is 7:
  executorNum = spark.cores.max / spark.executor.cores = 21 / 3 = 7
Configuration of each executor:
  3 cores, 15 GB RAM
Total memory consumed: 105 GB
  15 GB * 7 = 105 GB
Note that the total resources used have not actually increased, yet the same task that stayed stuck for hours under the original configuration finishes within a few minutes after the change.
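As a rough sketch of how those values might be applied (assuming standalone mode; the application name is illustrative), the same settings can be put on the SparkConf before the context is created:

  import org.apache.spark.{SparkConf, SparkContext}

  // Sketch only: the settings discussed above, applied programmatically.
  val conf = new SparkConf()
    .setAppName("shuffle-heavy-job")       // illustrative name
    .set("spark.executor.memory", "15g")   // more heap per executor
    .set("spark.executor.cores", "3")      // more cores per executor, so parallelism is kept
    .set("spark.cores.max", "21")          // 21 / 3 = 7 executors in total
  val sc = new SparkContext(conf)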
2. Executor & Task Lost
1. Problem description
Because of network problems or GC pauses, the worker or executor does not receive heartbeat feedback from the executor or task.
2. Error message
(1) executor lost
  WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, aa.local): ExecutorLostFailure (executor lost)
(2) task lost
  WARN TaskSetManager: Lost task 69.2 in stage 7.0 (TID 1145, 192.168.47.217): java.io.IOException: Connection from /192.168.47.217:55483 closed
(3) Various timeouts
  java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
  ERROR TransportChannelHandler: Connection to /192.168.47.212:35409 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong
3. Solutions
Increase spark.network.timeout to 300 (5 min) or higher, depending on the situation.
The default is 120 (120 s); it is the default timeout for all network interactions. If the following parameters are not set explicitly, they fall back to this value (a configuration sketch follows the list):
  • spark.core.connection.ack.wait.timeout
  • spark.akka.timeout
  • spark.storage.blockManagerSlaveTimeoutMs
  • spark.shuffle.io.connectionTimeout
  • spark.rpc.askTimeout or spark.rpc.lookupTimeout
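A minimal sketch of raising the umbrella timeout (the value is illustrative; adjust it per the advice above):

  import org.apache.spark.SparkConf

  // The individual timeouts listed above inherit this value unless they are set explicitly.
  val conf = new SparkConf().set("spark.network.timeout", "300s")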
3. Skew
1. Problem description
Most tasks complete, but one or two tasks either never finish or run extremely slowly.
There are two types: data skew and task skew.
2. Error message
(1) Data skew
(2) Task skew
Some tasks that process a similar amount of data as the others run far more slowly.
3. Solutions
(1) Data skew
In most cases, data skew is caused by a large number of null values or empty strings (""), which can be filtered out before the calculation. For example:
  sqlContext.sql("...where col is not null and col != ''")
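The same filtering can also be done on a DataFrame before the shuffle; a small sketch, assuming a DataFrame df whose shuffle key column is named key (both names are illustrative):

  // Drop null and empty keys before the aggregation/join that shuffles on "key".
  val cleaned = df.filter("key is not null and key != ''")
  val result  = cleaned.groupBy("key").count()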
(2) Task skew
Task skew has many causes: network I/O, CPU, or memory pressure can make the tasks on one node run slowly, and that node's performance monitoring is the place to look for the cause. I once ran into a colleague running R jobs on one of the Spark workers, which slowed down the Spark tasks on that node.
Alternatively, enable Spark's speculation mechanism: when a few tasks on some machine are especially slow, speculation re-launches them on other machines, and Spark takes whichever copy finishes first as the final result (see the sketch after the settings below).
spark.speculation true
spark.speculation.interval 100 - how often to check for tasks to speculate, in milliseconds
spark.speculation.quantile 0.75 - the fraction of tasks in a stage that must finish before speculation starts
spark.speculation.multiplier 1.5 - how many times slower than the median a task must be before it is speculated
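As a sketch, the same speculation settings expressed on a SparkConf (values as discussed above; whether they suit your job depends on the workload):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.speculation", "true")
    .set("spark.speculation.interval", "100ms") // how often to check for slow tasks
    .set("spark.speculation.quantile", "0.75")  // start once 75% of tasks in a stage have finished
    .set("spark.speculation.multiplier", "1.5") // tasks 1.5x slower than the median are re-launched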
4. OOM (out of memory)
1. Problem description
When memory is insufficient or there is too much data, an OOM exception is thrown.
The error message is self-explanatory, so it is not reproduced here.
2. Solutions
There are two main types: driver OOM and executor OOM.
(1) driver OOM
Driver OOM is usually caused by collect, which pulls data from all executors back to the driver; avoid collect wherever possible (see the sketch after this list).
(2) executor OOM
1. Reduce the memory used by the code, following the memory optimization methods below
2. Increase the total executor memory, i.e. raise spark.executor.memory
3. Increase task parallelism (split large tasks into smaller ones), following the parallelism optimization below
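A small sketch of the advice above, with illustrative names (bigRdd, the output path, and the partition count are placeholders):

  // Driver OOM: avoid collect(); inspect a bounded sample or keep results on the cluster.
  val sample = bigRdd.take(100)                    // only 100 records reach the driver
  bigRdd.saveAsTextFile("hdfs:///tmp/job-output")  // full result stays distributed

  // Executor OOM: split large tasks into smaller ones by raising the partition count.
  val smallerTasks = bigRdd.repartition(1000)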
Optimization
1. Memory
If your job shuffles a lot of data and caches relatively few RDDs, you can change the following parameters to further improve running speed.
spark.storage.memoryFraction - the fraction of memory allocated to the RDD cache, default 0.6 (60%); if little data is cached, this value can be lowered.
spark.shuffle.memoryFraction - the fraction of memory allocated to shuffle data, default 0.2 (20%).
The remaining 20% of memory is used for objects created by the code, and so on.
If the task runs slowly because the JVM GCs frequently or memory is insufficient, you can also lower these two values.
"spark.rdd.compress","true" - default is false; compresses serialized RDD partitions, trading some CPU for lower space usage.
If the data is used only once, do not cache it: caching will not improve the running speed and only wastes memory.
2. Parallelism
  spark.default.parallelism
The degree of parallelism when a shuffle occurs. In standalone mode it defaults to the number of cores; it can also be set manually. If set too high, it produces many tiny tasks and the overhead of launching them slows things down; if set too low, tasks run slowly.
  spark.sql.shuffle.partitions
The parallelism of SQL aggregation operations (which cause a shuffle); the default is 200. If the task runs slowly, increase this value.
The same task with two different settings:
  spark.sql.shuffle.partitions=300:
  spark.sql.shuffle.partitions=500:
The higher setting is faster mainly because GC time drops sharply.
The parallelism of the map stage is adjusted mainly by calling rdd.repartition(partitionNum) in the code, as sketched below.
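A sketch combining the two settings and the repartition call (someRdd and the partition counts are illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.default.parallelism", "300")    // parallelism of RDD shuffles
    .set("spark.sql.shuffle.partitions", "500") // parallelism of Spark SQL shuffles

  // Raise the parallelism of the map stage for a specific RDD.
  val wider = someRdd.repartition(300)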
 
 
 
========================================================================================
 
5. Stepping on the pits of Spark's fault-tolerance mechanism
2018-04-15
  • Fault Tolerance in Spark
  • The problem
  • Spark task restart
  • The pit in Spark 2.1.0
Fault Tolerance in Spark
To be honest, I am not entirely clear on this mechanism; much of what follows comes from another blogger's post. Here I record the problems I ran into today and the pits I stepped in.
The problem
I have recently been tuning a Spark program. Because the data volume is very large, there are some performance obstacles. The earlier join problem has been solved (I will add the solution in a day or two). I thought that would be the end of it, but tests on new data showed the job may still be painfully slow. One problem is that the GC time of a few tasks is far too long, which stretches the overall running time (this has not been reproduced; if it comes up again it can only be attacked with the usual GC tuning approaches). The other problem is that even without heavy GC, the compute time is still considerable (a task takes about 4~5 h, which is not acceptable).
To speed up the job, and since queue resources were not particularly tight, I decided to add machines: increase num-executors, executor-memory and executor-cores, and raise default-parallelism accordingly. At first the job did seem faster, roughly twice as fast by my estimate (as it should be, since the resources had been doubled). But when the job was about a quarter of the way through, an accident happened: an executor suddenly died!
Spark task restart
I had never looked carefully at what Spark does when an executor dies. After all, part of the job had already produced results; surely it would not have to start over from scratch and run everything again.
First, let's take a look at the problem reported by the driver log:
FetchFailed(BlockManagerId(301, some_port, 7391, None), shuffleId=4, mapId=69, reduceId=1579, message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to some_port
So an executor wanted to fetch data (presumably a shuffle read), but the executor holding that data had died, so the fetch failed. How did I know the executor was dead? I saw it in the Spark UI.
Think about the consequences of an executor dying: it holds the data computed in the previous stage, and the tasks of the current stage depend on that data, so many tasks of the current stage are affected.
Consider the tasks of this stage in three groups: tasks that have already finished should not need to be recomputed; tasks that have not started are not affected for the moment; and what about tasks that have started but not finished? We will come back to that.
First, what does Spark do once it knows the executor has died? Suppose the current stage is stage 9, whose first attempt is called 9.0. Because the executor has died, this attempt cannot proceed smoothly, so Spark launches a new attempt, 9.1. What has already been computed is not recomputed, so the number of tasks in 9.1 is the previous total minus the number already finished. Tasks that 9.0 had started but not finished are still relaunched by 9.1; the two attempts apparently do not coordinate on those.
Next, the data that the dead executor had computed is now lost; what does Spark do? Thanks to RDD lineage ("blood relationship" between RDDs), Spark can recompute the RDD partitions that lived on that executor. Only the data on that executor is involved, so the overhead should be small, but the recomputation may span several stages (I saw recomputations on the order of minutes).
In principle, with this fault-tolerant task-restart mechanism, a minute-level recomputation should not add much overhead. But the Spark UI showed that while 9.0 had nearly 1000 tasks running in parallel, 9.1 had only 300~400 left, and it became very slow; adding resources made no difference. Why? That was hard to accept.
Going back to the executor logs, after piecing things together for a while I found that many executors kept reporting this:
java.io.IOException: Connection from some_port closed
18/04/15 08:15:09 INFO RetryingBlockFetcher: Retrying fetch (1/30) for 20 outstanding blocks after 10000 ms
18/04/15 08:15:09 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from some_port closed
18/04/15 08:15:09 INFO RetryingBlockFetcher: Retrying fetch (1/30) for 20 outstanding blocks after 10000 ms
18/04/15 08:15:09 ERROR OneForOneBlockFetcher: Failed while starting block fetches
java.io.IOException: Connection from some_port closed
18/04/15 08:15:09 INFO RetryingBlockFetcher: Retrying fetch (1/30) for 20 outstanding blocks after 10000 ms
18/04/15 08:15:19 INFO TransportClientFactory: Found inactive connection to some_port, creating a new one.
18/04/15 08:17:26 INFO TransportClientFactory: Found inactive connection to some_port, creating a new one.
18/04/15 08:17:26 ERROR RetryingBlockFetcher: Exception while beginning fetch of 20 outstanding blocks (after 1 retries)
Having nothing better to do, I simply watched: it took two hours before these errors stopped. Looking closely, each fetch is retried 30 times, but the source of the fetch is the executor that has already died, so Spark just keeps retrying against a dead node. Something is clearly wrong here.
The next question: why so many retries? This smelled like configuration, so I searched for 30 in the Environment tab of the Spark UI. Sure enough, there it was:
spark.shuffle.io.maxRetries: 30
Judging by the name, this is the one: Spark kept trying to fetch the data on the dead executor 30 times! There is also a companion parameter, spark.shuffle.io.retryWait=10s, the interval between two retries. Checking the documentation, the official default number of retries is 3; I have no idea which ops engineer changed it to 30, and retryWait had likewise been changed from the default 5 s to 10 s. The reason for the slowness is now obvious: these two parameters made many executors struggle pointlessly to fetch data from a dead executor, and for those two hours more than half of the executor resources were wasted.
But wait: 30 retries times 10 s is at most 300 s of waiting, i.e. 5 min. How could two hours be wasted? Here is my guess:
18/04/15 08:15:19 INFO TransportClientFactory: Found inactive connection to some_port, creating a new one.
This line is the hint. When an executor cannot be reached, Spark tries to establish a new connection. But that node is dead and never responds, so every attempt has to wait for the connection to time out. If the connection timeout is long, say 5 min, the total easily adds up to about two hours.
The pit in Spark 2.1.0
Which raises the question: why is Spark so dumb here? The executor is obviously dead, yet everyone keeps trying it. Couldn't the driver tell each executor: that one is gone, stop fetching from it, and abort the tasks already started? After consulting various sources, it seems the original design simply did not cover this case. Here are some JIRA tickets and a GitHub pull request complaining about exactly this problem:
https://issues.apache.org/jira/browse/SPARK-20178
https://issues.apache.org/jira/browse/SPARK-20230
https://github.com/apache/spark/pull/17088
The issue seems to have been closed in May 2017, so the fix probably did not land until Spark 2.3.0. So this counts as a pit of Spark 2.1.0; since the company's Spark cannot be upgraded at will, for now it has to be worked around manually, for example by setting the retry count yourself. One thing to keep in mind is the description of this parameter:
This retry logic helps stabilize large shuffles in the face of long GC pauses or transient network connectivity issues.
So if long GC pauses are your actual problem, the value may reasonably be tuned upwards.
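A sketch of putting the two retry parameters back near the documented defaults at the application level, overriding the cluster-wide values mentioned above:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.shuffle.io.maxRetries", "3") // default is 3; ops had raised it to 30
    .set("spark.shuffle.io.retryWait", "5s") // default is 5s; ops had raised it to 10s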
Another takeaway is that I now understand what "(Netty only)" means in the documentation; it seems we rely on this network communication library more than we realize.
 
 
 
6. Spark performance tuning: setting parallelism reasonably
1. What is the parallelism of Spark?
    In a spark job, the number of tasks in each stage represents the parallelism of the spark job in each stage!
    Once you have allocated as many resources as you can get, tune the program's parallelism to match those resources. If parallelism does not match the resources, what you allocated goes to waste. Running more tasks in parallel also reduces the amount of data each task must process (a very simple principle: a reasonable degree of parallelism makes full use of cluster resources and reduces per-task data volume, which improves performance and speeds up the job).
 
    Example:
        Suppose we have allocated ample resources to our Spark job in the spark-submit script: 50 executors, each with 10 GB of memory and 3 CPU cores, which basically reaches the resource limit of the cluster or YARN queue.
Now suppose the number of tasks is not set, or is set far too low, say 100 tasks. With 50 executors of 3 cores each,
any stage of the Application can run 150 tasks in parallel. But there are only 100 tasks: spread evenly, each executor gets 2 tasks, so only 100 tasks run at a time and one core per executor sits idle and is wasted! The resources are fully allocated, but the parallelism does not match them, so part of what was allocated goes to waste. A reasonable degree of parallelism should be set large enough to make full use of the cluster: in this example the cluster has 150 CPU cores in total and can run 150 tasks in parallel, so the Application's parallelism should be set to at least 150 for all cores to be busy. Raising the task count to 150 not only lets 150 tasks run at once, it also reduces the data each task must handle: with 150 GB of data and 100 tasks, each task processes 1.5 GB; with 150 tasks, each task processes only 1 GB.
2. How to improve parallelism?
   1. The number of tasks should be at least equal to the total number of CPU cores of the Spark Application (the ideal case: 150 cores, 150 tasks, all running in parallel and finishing at roughly the same time). Officially, the recommendation is to set the number of tasks to 2~3 times the total number of cores, e.g. around 300~500 tasks for 150 cores. In practice, unlike the ideal case, some tasks finish faster, say in 50 s, while others are slower and only half done at that point. If the task count merely equals the core count, resources are wasted: with 150 tasks, once 10 finish early the remaining 140 keep running while 10 cores sit idle. With 2~3 times as many tasks, as soon as one task finishes another is scheduled onto the freed core, keeping the cores busy as much as possible and improving Spark's efficiency and performance.
    2. How to set the parallelism of a Spark Application?
      spark.default.parallelism has no value by default. If it is set to, say, 10, it only takes effect during shuffle operations (val rdd2 = rdd1.reduceByKey(_+_) // rdd2 has 10 partitions; the number of partitions of rdd1 is not affected by this parameter)
      new SparkConf().set("spark.default.parallelism", "500")
 
    3. If the input data is on HDFS, increase the number of blocks. By default one split corresponds to one block, and one split corresponds to one RDD partition, so more blocks means higher parallelism.
    4. RDD.repartition, reset the number of partitions for RDD
    5. The reduceByKey operator specifies the number of partitions
                 val rdd2 = rdd1.reduceByKey(_ + _, 10)
                 val rdd3 = rdd2.map(identity).filter(_._2 > 0).reduceByKey(_ + _)  // rdd3 keeps rdd2's 10 partitions
    6. val rdd3 = rdd1.join(rdd2): the number of partitions of rdd3 is determined by the parent RDD with more partitions, so when using the join operator, increase the number of partitions of the parent RDDs.
    7. spark.sql.shuffle.partitions // the number of partitions used for shuffles in Spark SQL (items 4~6 are sketched in code below)
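A small sketch pulling together items 4~6 above; sc, the input path, and the partition counts are illustrative:

  import org.apache.spark.rdd.RDD

  val pairs: RDD[(String, Int)] = sc.textFile("hdfs:///tmp/input").map(line => (line, 1))

  val repartitioned = pairs.repartition(300)   // item 4: reset the partition count
  val reduced = pairs.reduceByKey(_ + _, 300)  // item 5: parallelism of this shuffle
  val joined = reduced.join(repartitioned)     // item 6: partitions = max of the two parents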
 
 
 
7. My own summary:
 
 
 
1. Problems encountered in the project and solutions:
At first the parallelism of the split MySQL query was set to 8. As a result, the program kept removing and re-creating executors, the tasks could not finish, and the job still had not completed after running overnight. Later I checked the YARN monitoring in Cloudera Manager and found that the executors allocated to each NodeManager were very uneven, and that the total number of executors was only 8.
 
In the picture below, the number of executor instances set in the spark-submit parameters is 9, but that setting was not honored.
In the end I set the MySQL query parallelism to 30, and the job finished quickly (presumably the higher parallelism meant less data per executor, so there was no heavy GC, hence no repeated retries and no executor restarts). In YARN I then saw 30 containers started, 15 of them running at the same time; each container (and its executor) had only one CPU core, and the memory per executor was smaller than what I had set.
See the figure below for my settings. It shows that resource allocation did not follow my settings but instead followed the parallelism of the MySQL query: apparently tasks and data partitions are generated from the MySQL parallelism and then computed. I had assumed the default parallelism of 64 would be used, so I had not set spark.default.parallelism; unexpectedly, the job was parallelized according to the MySQL query parallelism.
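For reference, a hedged sketch of how such a partitioned MySQL read typically looks with Spark's JDBC data source (the connection details, table, split column, bounds, and the SparkSession spark are all placeholders, not the project's actual code):

  val jdbcDF = spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/mydb")
    .option("dbtable", "my_table")
    .option("user", "user")
    .option("password", "password")
    .option("partitionColumn", "id")  // numeric column the reads are split on
    .option("lowerBound", "1")
    .option("upperBound", "3000000")
    .option("numPartitions", "30")    // 30 parallel queries -> 30 partitions / tasks
    .load()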
 
 
 
2. Analysis of the cause of the problem
Combining section 2 (Executor & Task Lost) and section 5 (the pits of Spark's fault-tolerance mechanism),
the chain of events can be summarized as follows:
A. When the data set is too large (or a single executor's share is too big because of uneven distribution, too few partitions, or insufficient parallelism), executor memory runs short and GC becomes frequent.
 
B. Frequent GC or network jitter then causes data-transfer timeouts, heartbeat timeouts, and similar problems.
 
C. Because of Spark's retry mechanism, the data fetch is first retried at the configured interval.
 
D. Once the retry limit is exceeded, the executor is killed and a new one is launched to re-execute. The result is executors being removed and recreated over and over, while the task still never completes.
 
 
3. Tune the parameters of Spark's retry mechanism (the number of attempts, the interval between attempts, and the various communication timeouts)
Every failed attempt has to wait for the communication timeout, and over repeated retries these waits add up to a very long time.
A、
spark.shuffle.io.maxRetries: 30  # number of fetch retry attempts (the default is 3)
With this setting, Spark kept trying to fetch the data on the dead executor 30 times!
B、
spark.shuffle.io.retryWait=10s  # the interval between two retries (the default is 5s)
C、
spark.network.timeout=300  # default timeout for all network interactions
If the following parameters are not set explicitly, they fall back to this value:
  • spark.core.connection.ack.wait.timeout
  • spark.akka.timeout
  • spark.storage.blockManagerSlaveTimeoutMs
  • spark.shuffle.io.connectionTimeout
  • spark.rpc.askTimeout or spark.rpc.lookupTimeout
 
 
 
 
 
 
