Spark learning from 0 to 1 (10)-Spark tuning (1)-system and code tuning

1. Resource Tuning

1.1 Configure CPU and memory when building a Spark cluster

Configure spark-env.sh in the conf directory of the Spark installation.

The main options, their meanings, and their defaults:

  • SPARK_WORKER_CORES: number of CPU cores available to jobs on each worker. Default: all available CPU cores.
  • SPARK_WORKER_INSTANCES: number of worker instances running on each machine. Default: 1.
  • SPARK_WORKER_CORES × SPARK_WORKER_INSTANCES: total cores available per machine.
  • SPARK_WORKER_MEMORY: memory available to jobs. Default: 1G.
  • SPARK_WORKER_INSTANCES × SPARK_WORKER_MEMORY: maximum amount of memory used by each node.
  • SPARK_DAEMON_MEMORY: memory allocated to the Spark master and worker daemons. Default: 512M.
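For example, these options could be set in conf/spark-env.sh like this (the values below are only illustrative, not recommendations):

export SPARK_WORKER_CORES=8        # CPU cores available to jobs on this worker
export SPARK_WORKER_INSTANCES=1    # number of worker processes on this machine
export SPARK_WORKER_MEMORY=16g     # memory available to jobs on this worker
export SPARK_DAEMON_MEMORY=1g      # memory for the master and worker daemons themselves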

1.2 Assign more resources to the application when submitting it

Submit command options:

  • --executor-cores: number of cores used by each executor. On Spark on YARN the default is 1; in standalone mode the default is all available cores on the worker.

  • --executor-memory: memory size of each executor (for example: 2G). Default: 1G.

  • --total-executor-cores: total number of cores used by all executors of the application. Only applies to Spark Standalone and Spark on Mesos modes.
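As an illustration only (the master URL, class, jar name, and values are placeholders), a standalone-mode submit command could look like:

spark-submit --master spark://node1:7077 \
  --class com.example.MyApp \
  --executor-cores 2 \
  --executor-memory 2G \
  --total-executor-cores 12 \
  my-app.jar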

2. Parallel tuning

Rule of thumb: each core should generally be assigned 2 to 3 tasks, and each task should generally process about 1 GB of data.

Ways to improve parallelism:

  1. If the input data is read from HDFS, reduce the block size (more blocks means more partitions).

  2. sc.textFile(path, numPartitions): specify a minimum number of partitions when reading a file.

  3. sc.parallelize(list, numPartitions): generally used for testing.

  4. coalesce and repartition can increase the number of RDD partitions (coalesce only increases it when shuffle is enabled).

  5. Modify configuration information:

    spark.default.parallelism: if not set, it defaults to the total number of executor cores.

    spark.sql.shuffle.partitions: number of partitions used for Spark SQL shuffles. Default: 200.

  6. Custom partitioner
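A short Scala sketch of a few of these techniques (the HDFS path and partition counts are placeholders), for illustration only:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-demo")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "12")   // default partition count for RDD shuffles
val sc = new SparkContext(conf)

// Ask for at least 12 partitions when reading the file
val lines = sc.textFile("hdfs://node1:9000/input/data.txt", 12)

// Increase the partition count of an existing RDD (repartition always shuffles)
val repartitioned = lines.repartition(24)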

3. Code tuning

3.1 Avoid creating duplicate RDDs

val rdd1 = sc.textFile("xxx")
val rdd2 = sc.textFile("xxx")

There is no difference in execution efficiency (without persistence the file is read twice either way), but the code is messy.

  1. A persistence operator should be used for any RDD that is reused multiple times:

    • cache: MEMORY_ONLY (cache is shorthand for persist with this level)

    • persist:

      MEMORY_ONLY

      MEMORY_ONLY_SER

      MEMORY_AND_DISK_SER

      In general, do not choose a replicated persistence level (one whose name ends in _2).

    • checkpoint

      If computing an RDD takes a long time or the computation is complex, its result is generally saved to HDFS, so that the data is more secure.

      If an RDD has a very long lineage (a long chain of dependencies), checkpoint is also used: it cuts off the lineage and improves the efficiency of fault tolerance.
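A minimal Scala sketch of these persistence options (the paths are placeholders; an existing SparkContext sc is assumed):

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("hdfs://node1:9000/input/data.txt")

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); a serialized level is chosen here instead
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

sc.setCheckpointDir("hdfs://node1:9000/checkpoint")   // checkpoint data is written to HDFS
rdd.checkpoint()                                      // cuts the lineage; written out after the next action runs
rdd.count()                                           // action that materializes the persisted data and the checkpoint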

3.2 Try to use broadcast variables

During development you will encounter scenarios where an external variable needs to be used inside an operator function (especially a large variable, such as a collection of 100 MB or more). In such cases you should use Spark's broadcast variable (Broadcast) feature to improve performance.

By default, when an external variable is used inside a function, Spark makes a copy of the variable for every task and ships it to the tasks over the network, so each task holds its own copy. If the variable is large (say 100 MB, or even 1 GB), the network overhead of transferring a large number of copies, plus the extra memory they occupy in each Executor, causes frequent GC and greatly hurts performance. If the external variable is large, it is therefore recommended to broadcast it with Spark's broadcast feature: after broadcasting, only one copy of the variable resides in the memory of each Executor, and the tasks running in that Executor share this copy. This greatly reduces the number of variable copies, which reduces the network transmission overhead, lowers Executor memory usage, and reduces GC frequency.

How broadcast variables are transmitted: an Executor does not hold the broadcast variable at the start. When a task needs to use it, the task asks the Executor's BlockManager for it, and the BlockManager in turn contacts the BlockManagerMaster on the Driver to fetch it.

Using broadcast variables can greatly reduce the number of copies of variables in the cluster.

Do not use broadcast variables: The number of copies of variables is the same as the number of tasks.

Use broadcast variable: The number of copies of the variable is the same as the number of Executor.

The maximum memory occupied by broadcast variables: ExecutorMemory * 60% * 90% * 80%
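A minimal Scala sketch of using a broadcast variable (the map here is a tiny placeholder for a large collection; an existing SparkContext sc is assumed):

val bigMap: Map[String, Int] = Map("a" -> 1, "b" -> 2)   // imagine this being 100 MB or more
val bcMap = sc.broadcast(bigMap)                         // one copy per Executor instead of one per task

val rdd = sc.parallelize(Seq("a", "b", "c"))
val result = rdd.map(k => (k, bcMap.value.getOrElse(k, 0))).collect()

bcMap.unpersist()   // release the copies on the Executors once they are no longer needed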

3.3 Try to avoid using shuffle operators

Use broadcast variables to simulate a join. Usage scenario: one RDD is relatively large and the other is relatively small.

join operator = broadcast variable + filter, broadcast variable + map, or broadcast variable + flatMap.
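For example, a join between a large RDD and a small RDD can be rewritten as broadcast variable + flatMap (a sketch with placeholder data; an existing SparkContext sc is assumed):

// Small RDD: collect it to the Driver and broadcast it
val smallRdd = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val smallMap = sc.broadcast(smallRdd.collectAsMap())

// Large RDD: look values up in the broadcast map instead of shuffling for a join
val bigRdd = sc.parallelize(Seq((1, 100.0), (2, 200.0), (3, 300.0)))
val joined = bigRdd.flatMap { case (id, value) =>
  smallMap.value.get(id).map(name => (id, (value, name)))   // keys missing on the small side are dropped, like an inner join
}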

3.4 shuffle operation using map-side pre-aggregation

Try to use shuffle operators that have a combiner (map-side pre-aggregation).

Combiner concept: local aggregation performed on the map side after each map task finishes computing.

Benefits of combiner:

  • Reduce the amount of data written to disk by shuffle write.
  • Reduce the amount of data pulled by shuffle read.
  • Reduce the number of aggregations on the reduce side.

Shuffle operators that have a combiner:

  • reduceByKey: This operator has a combiner on the map side. In some scenarios, reduceByKey can be used instead of groupByKey.
  • aggregateByKey
  • combineByKey
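A short Scala sketch of these map-side combining operators (placeholder data; an existing SparkContext sc is assumed):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// reduceByKey combines values per key on the map side before the shuffle
val sums = pairs.reduceByKey(_ + _)

// aggregateByKey: zero value, then within-partition merge and cross-partition merge functions
val sums2 = pairs.aggregateByKey(0)(_ + _, _ + _)

// combineByKey: the most general form (createCombiner, mergeValue, mergeCombiners)
val sums3 = pairs.combineByKey((v: Int) => v, (c: Int, v: Int) => c + v, (c1: Int, c2: Int) => c1 + c2)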

3.5 Try to use high-performance operators

  • Use reduceByKey instead of groupByKey.
  • Use mapPartitions instead of map.
  • Use foreachPartition instead of foreach.
  • After filter, use coalesce to reduce the number of partitions.
  • Use repartitionAndSortWithinPartitions instead of repartition followed by a sort.
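A brief Scala sketch of a few of these replacements (placeholder data and logic; an existing SparkContext sc is assumed):

val nums = sc.parallelize(1 to 1000, 10)

// mapPartitions: per-partition setup (e.g. a parser or connection) happens once per partition, not once per record
val processed = nums.mapPartitions { iter =>
  iter.map(_ * 2)
}

// After filtering away most of the data, shrink the number of partitions
val few = nums.filter(_ % 100 == 0).coalesce(2)

// foreachPartition: open one "connection" per partition rather than one per record
few.foreachPartition { iter =>
  // open a database connection here (placeholder)
  iter.foreach(println)
  // close the connection here
}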

3.6 Use Kryo to optimize serialization performance

3.6.1 In Spark, serialization is mainly involved in three places:

  1. When an external variable is used in an operator function, the variable will be serialized for network transmission.
  2. When using a custom type as the generic type of an RDD (for example JavaRDD<User>, where User is a custom type), all objects of the custom type are serialized. In this case the custom class must implement the Serializable interface.
  3. When using a serializable persistence strategy (such as MEMORY_ONLY_SER), Spark serializes each partition of the RDD into a large byte array.

3.6.2 Introduction to Kryo Serializer

Spark supports the Kryo serialization mechanism. Kryo serialization is faster than the default Java serialization mechanism and produces smaller serialized data, roughly 1/10 the size of Java-serialized data. Therefore, with Kryo serialization, both the amount of data transmitted over the network and the memory consumed in the cluster are greatly reduced.

For these three places where serialization occurs, we can improve serialization and deserialization performance by using the Kryo serialization library. By default Spark uses the Java serialization mechanism, that is, the ObjectOutputStream/ObjectInputStream API, to serialize and deserialize objects. However, Spark also supports the Kryo serialization library, whose performance is much higher than that of the Java serialization library; according to the official documentation, Kryo serialization is roughly 10 times faster than Java serialization. The reason Spark does not use Kryo by default is that Kryo works best when you register the custom types that need to be serialized, which makes it more troublesome for developers.

3.6.3 Using Kryo in Spark

SparkConf conf = new SparkConf();
conf.setMaster("local");
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
conf.registerKryoClasses(new Class[]{CameraAggInfo.class});

3.7 Optimize data structure

There are three types of memory consumption in Java:

  1. Object, each Java object has additional information such as object header and reference, so it takes up more memory space.
  2. String, each string contains additional information such as the length of a character array.
  3. Collection types, such as HashMap, List, etc., because some internal classes are usually used to encapsulate collection elements, such as Map.Entry.

Therefore, Spark officially recommends that in Spark code, and especially in operator functions, you avoid the three data structures above as much as possible. Try to use strings instead of objects, primitive types (such as Int, Long) instead of strings, and arrays instead of collection types, in order to reduce memory usage, thereby lowering GC frequency and improving performance.
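A small Scala illustration of the idea (the record layout is hypothetical):

// Instead of a collection of objects per record...
case class Point(x: Int, y: Int)
val objects: List[Point] = List(Point(1, 2), Point(3, 4))

// ...prefer primitive arrays, which avoid per-object headers and wrapper overhead
val xs: Array[Int] = Array(1, 3)
val ys: Array[Int] = Array(2, 4)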

3.8 Use the high-performance library fastutil

fastutil is a library that extends the Java standard collection framework (Map, List, Set) and provides type-specific maps, sets, lists, and queues. fastutil offers a smaller memory footprint and faster access speed. Using the collection classes provided by fastutil instead of the JDK's native collections reduces memory usage and provides faster access when traversing a collection, getting an element's value by index (or key), or setting an element's value. Each fastutil collection type implements the corresponding standard Java interface (for example, fastutil's maps implement Java's Map interface), so they can be dropped directly into any existing code.

The latest version of fastutil requires Java 7 and above.
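For example, an int-to-int counter could use fastutil's Int2IntOpenHashMap instead of a boxed HashMap[Integer, Integer] (a minimal Scala sketch):

import it.unimi.dsi.fastutil.ints.Int2IntOpenHashMap

val counts = new Int2IntOpenHashMap()   // stores primitive ints, no boxing
counts.put(1, 10)
counts.put(2, 20)
println(counts.get(1))                  // 10; get returns a primitive int as well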
