Notes on two small problems in Spark jobs



Today, while processing data with Spark, I ran into two small problems, so I'm jotting down some notes.

Both problems are related to network interaction. The general scenario is this: a batch of data is acquired and assembled in advance on the driver side, and then sent to the executors for subsequent processing.




Problem 1: a serialization exception

The driver has a case class that encapsulates some data and sends it to the executors. It was originally a pure Scala case class, and it could be shipped to the executors without any explicit serialization declaration: case classes are serializable by default, and the closure that references this class is serialized automatically, so everything worked. Today, however, I added a Java bean as a field of this class, and an exception occurred:
````
java.io.NotSerializableException
````
The cause is that the newly added Java bean is not serializable. When the Scala closure is serialized, every object it references must itself be serializable, and a plain Java bean that does not implement Serializable breaks the chain. The solution is to have the Java bean implement Java's Serializable interface:
````
public class Bean implements Serializable { ... }
````
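
For context, here is a minimal sketch of the scenario (class and field names such as UserBean and Record are made up for illustration, not the real code): a case class assembled on the driver holds a Java-style bean and is referenced inside a transformation that runs on the executors. Once the bean implements Serializable, the closure can be shipped without the NotSerializableException.
````
import org.apache.spark.sql.SparkSession

// Hypothetical Java-style bean; making it implement Serializable is the fix.
// In real code this would be a Java class:
//   public class UserBean implements java.io.Serializable { ... }
class UserBean(var id: Long, var name: String) extends java.io.Serializable

// Case class assembled on the driver; since it now carries the bean as a
// field, the bean itself must also be serializable.
case class Record(key: String, bean: UserBean)

object SerializationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("serialization-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Data acquired and assembled in advance on the driver side
    val records = Seq(Record("a", new UserBean(1L, "foo")),
                      Record("b", new UserBean(2L, "bar")))

    // The records are captured by the task closure and sent to the executors;
    // every object referenced here (including UserBean) must be serializable.
    val names = sc.parallelize(records).map(r => r.bean.name).collect()
    names.foreach(println)

    spark.stop()
  }
}
````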



Problem 2: the data sent by the driver is too large and exceeds Spark's default transmission limit

The exception is as follows:
````
User class threw exception: java.util.concurrent.ExecutionException:
org.apache.spark.SparkException: Job aborted due to stage failure:
Serialized task 523:49 was 146289487 bytes, which exceeds max allowed:
spark.rpc.message.maxSize (134217728 bytes).
Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values.
````


The exception message makes it clear that, by default, when the driver sends a task to an executor, the serialized task data cannot exceed 128 MB (here the task was 146289487 bytes, roughly 139 MB, above the 134217728-byte limit). If it exceeds the limit, the above exception is thrown.


How to solve:


Method 1: Use a broadcast variable to transfer the data
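
Here is a minimal sketch of the broadcast approach (the data and names are illustrative): instead of letting the large collection be captured by the task closure and copied into every serialized task, broadcast it once and read it on the executors via .value.
````
import org.apache.spark.sql.SparkSession

object BroadcastDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Stand-in for the large batch of data assembled on the driver side
    val bigLookup: Map[String, String] = Map("a" -> "1", "b" -> "2")

    // Broadcast it once; each executor fetches a single copy instead of
    // receiving the data inside every serialized task.
    val bcLookup = sc.broadcast(bigLookup)

    val enriched = sc.parallelize(Seq("a", "b", "c"))
      .map(k => k -> bcLookup.value.getOrElse(k, "unknown"))
      .collect()

    enriched.foreach(println)
    spark.stop()
  }
}
````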

Method 2: Increase the value of spark.rpc.message.maxSize. The default is 128 MB; adjust it as needed.

When submitting the job with spark-submit, add the following configuration:
````
--conf "spark.rpc.message.maxSize=512" // allows each serialized task / RPC message to be up to 512 MB
````
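
If the SparkSession is built in code, the same setting can also be applied on the builder (a minimal sketch; the value is in MB, must be set before the SparkContext is created, and whether 512 is appropriate depends on your data):
````
import org.apache.spark.sql.SparkSession

// Equivalent to the --conf flag above: raise the RPC message limit to 512 MB.
val spark = SparkSession.builder()
  .appName("large-task-demo")
  .config("spark.rpc.message.maxSize", "512")
  .getOrCreate()
````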


If you have any questions, you can scan the QR code and follow the WeChat public account "I am the siege division" (woshigcs) and leave a message for consultation. Technical debt cannot be owed, and neither can health debt. On the road of seeking the Tao, I walk with you.
