Big data: Spark performance optimization

Spark performance optimization

The tuning described here focuses mainly on memory usage.

Spark performance optimization techniques:

  1. Use a high-performance serialization library
  2. Optimize data structures
  3. Persist and checkpoint RDDs that are reused multiple times
  4. Persistence level: MEMORY_ONLY -> MEMORY_ONLY_SER (serialized)
  5. Java virtual machine garbage collection tuning
  6. Shuffle tuning

I. Determining Spark memory usage

First of all, we need to see how memory is actually being used before we can optimize in a targeted way.

(1) Memory overhead

  1. Every Java object has an object header of about 16 bytes that holds meta information such as a pointer to the object's class. If the object itself is small, for example a single int field, the header can be larger than the object's own data.
  2. A Java String costs roughly 40 bytes on top of the raw character data: the char array inside the String holds the character sequence, and the object also keeps bookkeeping such as the length. Because chars use UTF-16 encoding, each character takes 2 bytes, so a String of 10 characters occupies about 2 * 10 + 40 = 60 bytes.
  3. Java collection types such as HashMap and LinkedList use linked data structures internally, and every element is wrapped in an Entry object. Each Entry carries not only its own object header but also a pointer to the next Entry, which takes another 8 bytes.
  4. Even when the element type is a primitive (int), collections usually store the boxed wrapper type (Integer) internally, adding further overhead.
(2) Estimating memory consumption

  1. Set the RDD's parallelism: both parallelize and textFile accept a second argument that sets the number of partitions. Alternatively, set spark.default.parallelism in the SparkConf to give a uniform default partition count for the RDDs in the application.
  2. Cache the RDD.
  3. Check the driver log (under Spark's work folder).
  4. Sum the sizes reported in the MemoryStore log lines; that total is the memory occupied by the RDD (see the sketch below).
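
As a minimal sketch of the steps above (the application name, file path and partition count are made up for illustration), the following Scala code sets the partition count, caches an RDD and triggers a job so that the MemoryStore lines show up in the logs:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MemoryUsageCheck")
  .set("spark.default.parallelism", "8") // uniform default partition count for this application
  // the master is normally supplied by spark-submit; add .setMaster("local[*]") for a local test

val sc = new SparkContext(conf)

// The second argument sets the number of partitions explicitly for this RDD.
val rdd = sc.textFile("data/input.txt", 8)
rdd.cache()   // MEMORY_ONLY persistence
rdd.count()   // trigger a job so the partitions are actually materialized and cached

// Then look in the driver/worker logs for lines similar to:
//   INFO MemoryStore: Block rdd_0_0 stored as values in memory (estimated size ..., free ...)
// Summing the estimated sizes over all partitions gives the RDD's total memory footprint.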
II. Use a high-performance serialization library

1. Data serialization

Serialization converts an object or data structure into a specific format so that it can be transmitted over the network or stored in memory or in a file.

Deserialization is the reverse operation: it restores the object from the serialized data.

The serialized format can be binary, XML, JSON, or any other format.

The point of serializing objects and data is data exchange and transmission.

Serialization plays an important role in any distributed system.

If the serialization technology in use is slow, or the serialized data is still large, the performance of the distributed application suffers.

Spark itself serializes data in some places by default, such as during the shuffle. In addition, any custom types we use in our own code must also be made serializable.

By default, Spark favors convenience and uses Java's built-in serialization mechanism, which is very easy to use. However, Java serialization does not perform well: it is slow, and the serialized data is large and takes up a lot of memory.

2. Kryo

Spark also supports serialization with the kryo library.

Kryo is faster and has a smaller footprint: the serialized data can be up to 10 times smaller than with Java serialization.

3. Using the kryo serialization mechanism

(1) Set the serializer in the Spark configuration (spark-defaults.conf style):

spark.master spark://6.7.8.9:7077
spark.executor.memory 4g
spark.eventLog.enabled true
spark.serializer org.apache.spark.serializer.KryoSerializer

(2) To get the best performance out of kryo, register the classes to be serialized in advance:

conf.registerKryoClasses(Array(classOf[Count], ...))
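
The same settings can also be made programmatically on the SparkConf. A minimal sketch, assuming a custom class Count (here just an illustrative case class) that we want kryo to serialize:

import org.apache.spark.SparkConf

// An example custom type that will be shipped between nodes or cached.
case class Count(word: String, n: Int)

val conf = new SparkConf()
  .setAppName("KryoExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Registering the classes up front lets kryo write a compact numeric ID
// instead of the fully qualified class name with every serialized object.
conf.registerKryoClasses(Array(classOf[Count]))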

4. Tuning the kryo library
1. Increase the buffer size: if a registered custom type is extremely large (say, 100 fields), the serialized objects may be too big for kryo's internal buffer to hold. In that case tune kryo itself by raising the spark.kryoserializer.buffer.max parameter (see the example below).
2. Register custom types in advance: kryo still works if custom types are not registered, but it then has to keep the fully qualified class name with every object, which wastes memory. It is recommended to pre-register the custom types that will be serialized.
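
A hedged example of raising the buffer limit (512m is an arbitrary value; the default maximum is 64m):

// If kryo reports buffer overflow errors when serializing very large objects,
// raise the maximum buffer size. The value must still fit inside executor memory.
conf.set("spark.kryoserializer.buffer.max", "512m")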

III. Optimize data structures

Overview

Besides using an efficient serialization library, reducing memory consumption also requires optimizing our data structures, so as to avoid the extra memory overhead introduced by Java syntax features.

Core idea: optimize the local data used inside operator functions and the data referenced from outside the operator functions.

Purpose: reduce memory consumption and footprint.

Practice (a small sketch follows the list):

1. Prefer arrays and strings over collection classes (use Array instead of ArrayList, LinkedList, or HashMap); an int[] uses less memory than a List<Integer>.
2. Where possible, encode objects as strings instead of object structures.
3. Avoid deeply nested object structures.
4. Where it can be avoided, use int instead of String (for example, numeric IDs rather than string keys).
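
A small Scala sketch contrasting the heavier and lighter styles (the types and values are only illustrative):

// Heavier: boxed Integers wrapped in per-element Entry objects inside a linked structure.
val heavy = new java.util.LinkedList[Integer]()
heavy.add(Integer.valueOf(1))

// Lighter: a primitive array, no boxing and no per-element wrapper objects.
val light: Array[Int] = Array(1, 2, 3)

// Instead of a nested object such as case class Key(userId: Int, itemId: Int),
// encode the fields into a single string key.
val key: String = s"${10}:${20}"

// And where a value is really numeric, keep it as an Int rather than a String.
val userId: Int = 10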

IV. Java virtual machine tuning

1. Overview

If a large amount of data is persisted when RDDs are cached, garbage collection in the Java virtual machine can become a bottleneck.

The JVM performs garbage collection periodically: it traces all Java objects, finds the ones that are no longer in use, cleans up those old objects, and frees space for new ones.

The performance overhead of garbage collection is proportional to the number of objects in memory.

JVM tuning only makes sense after the previous tuning steps have already been done.

2. Spark GC principle

[Figure: Spark GC principle]

3. Monitoring garbage collection

Monitor how often garbage collection happens, how much time it takes at each level, and so on.

In the spark-submit script, add:
--conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

Worker logs: the logs folder under the Spark installation directory.

Driver log: the work folder under the Spark installation directory.
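
For context, a complete spark-submit invocation with this option might look like the following (the class name and jar path are placeholders; the master URL is the one from the configuration example above):

spark-submit \
  --class com.example.WordCount \
  --master spark://6.7.8.9:7077 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar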

4. Optimize the Executor memory ratio

Purpose: reduce the number of GC runs.

For GC tuning, the most important knob is the ratio between the memory used to cache RDDs and the memory used for objects created while operators (tasks) execute.

By default, Spark uses 60% of each Executor's memory to cache RDDs, leaving only 40% for the objects created during task execution.

Configuration: conf.set("spark.storage.memoryFraction", "0.5")
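
A minimal sketch of applying the setting when building the SparkConf. Note that spark.storage.memoryFraction belongs to Spark's legacy (static) memory manager; on Spark 1.6+ with unified memory management, the corresponding knobs are spark.memory.fraction and spark.memory.storageFraction. The 0.5 value is just an example:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("GCFriendlyApp")
  // Shrink the RDD cache share from the default 60% to 50%, leaving more room
  // for objects created while tasks execute and reducing GC pressure.
  .set("spark.storage.memoryFraction", "0.5")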



Source: blog.csdn.net/JavaDestiny/article/details/97962703