Big Data Course K9 - Spark Tuning Methods

Author's email: [email protected] Address: Huizhou, Guangdong

▲ This chapter's objectives

⚪ Master Spark - a better serialization implementation;

⚪ Master Spark - using Kryo through code;

⚪ Master Spark - configuring multiple temporary file directories;

⚪ Master Spark - enabling the speculative execution mechanism;

⚪ Master Spark - avoiding collect;

⚪ Master Spark - using mapPartitions instead of map for RDD operations;

⚪ Master Spark - Spark shared variables.

1. Spark Tuning - Part 1

1. Better serialization implementation

Where Spark uses serialization

1. During a shuffle, objects must be written to external temporary files.

2. The data in each partition must be sent to a worker: Spark packages the RDD processing logic into a task object and sends the task to the worker over the network.

3. If an RDD's storage level is memory plus disk, writing data to disk also involves serialization.

The default is Java serialization, but Java serialization has two problems:

1. Its performance is relatively low.

2. The serialized binary output is relatively large, which lengthens network transmission time.
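To make the size problem concrete, here is a minimal sketch in plain Scala (no Spark required) that measures the Java-serialized size of a tiny object. The `Point` class and the helper are illustrative assumptions, not from the original article; the point is that two ints of payload carry tens of bytes of class-descriptor and stream-header overhead.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative payload: in Spark this could be one record inside a partition.
case class Point(x: Int, y: Int)

object SerDemo {
  // Serialize with plain Java serialization and return the byte length.
  def javaSerializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.toByteArray.length
  }

  def main(args: Array[String]): Unit = {
    // Two ints are only 8 bytes of payload, but Java serialization adds
    // the full class name, field descriptors, and stream headers on top.
    val size = javaSerializedSize(Point(1, 2))
    println(s"Java-serialized size of Point(1, 2): $size bytes")
  }
}
```

Running this shows a serialized size far larger than the 8 bytes of actual data, which is exactly the network-transmission overhead the text describes.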

The industry has largely settled on a better implementation, Kryo, which is often around ten times faster than Java serialization and also produces shorter output. Faster in time and smaller in space, it is the natural choice.

Method 1: Modify the spark-defaults.conf configuration file.

Setting:

spark.serializer  org.apache.spark.serializer.KryoSerializer

Note: separate the key and the value with whitespace.

Method 2: Configure when starting spark-shell or spark-submit.

--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
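A third option, named in this chapter's objectives as "using Kryo through code", is to set the serializer on a SparkConf programmatically. The sketch below uses the standard `SparkConf.set` and `registerKryoClasses` API (available since Spark 1.2); the application name, local master, and the `Point` class being registered are illustrative assumptions. This is a configuration sketch, not a definitive implementation.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative class whose instances will be serialized with Kryo.
case class Point(x: Int, y: Int)

object KryoConfigDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-demo")   // illustrative app name
      .setMaster("local[*]")     // local mode, for the sketch only
      // The same setting as in spark-defaults.conf, applied in code:
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Registering classes up front lets Kryo write a small numeric ID
      // instead of the full class name, shrinking the output further.
      .registerKryoClasses(Array(classOf[Point]))

    val sc = new SparkContext(conf)
    try {
      // Any shuffle or serialized persist in this job now goes through Kryo.
      val sum = sc.parallelize(Seq(Point(1, 2), Point(3, 4)))
        .map(p => p.x + p.y)
        .sum()
      println(s"sum = $sum")
    } finally {
      sc.stop()
    }
  }
}
```

Registering classes with `registerKryoClasses` is optional but recommended: unregistered classes still work, yet Kryo must then store each class's full name alongside every object.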


Origin blog.csdn.net/u013955758/article/details/132438261