Spark Optimization Summary (Part 2) - Coding

1. Introduction

  • Writing a high-performance Spark application requires the ability to write good code. Bad code usually leads to bugs, poor efficiency, and other problems, and this is especially true for frequently executed code blocks. Well-designed code avoids bugs as much as possible, effectively improves runtime efficiency, and is also easier to read.
  • In addition, most of the advice in this part applies to programs in general, not just Spark.

2. Choose a reasonable data structure

  • Typically, choosing the wrong data structure leads to extra memory overhead, performance problems, and other issues. We should pick the data structure that fits the situation.
  • Examples:
    • Use nested data structures as little as possible
      • For example, for a PairRDD, prefer RDD[(String, (String, Long, String, String, String))] over RDD[(String, ((String, Long), (String, String, String)))] (see the sketch after this list)
    • Prefer types with a small memory footprint and high efficiency
      • Consider replacing String with Int, for example replacing a name with an id (especially if the name strings are long)
      • Consider introducing Koloboke's Set and Map to replace the standard Set and Map
      • Consider introducing FastUtil types to replace the conventional collection types
    • Choosing between ListBuffer (linked list) and ArrayBuffer (array)
      • When you frequently look up elements by index, use an array, not a linked list
      • When the container size cannot be determined in advance and memory must be saved, use a linked list
      • When you only need to traverse the whole container or append to it, use a linked list
    • To check whether an object exists in a container, use a hash-based container such as HashSet
    • When fast range queries are required, use a tree-based container such as TreeSet
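  • A minimal sketch of the PairRDD flattening above; the field names and sample data are invented for illustration:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("flatten-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Nested value: ((name, age), (province, city, gender)) keyed by user id
    val nested = sc.parallelize(Seq(
        ("u1", (("Tom", 18L), ("Beijing", "Beijing", "M")))
    ))

    // Flattened value: one tuple per user, no inner nesting
    val flat = nested.mapValues { case ((name, age), (province, city, gender)) =>
        (name, age, province, city, gender)
    }

    flat.collect().foreach(println)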

3. Converting between Java and Scala containers

  • When developing Spark applications you are likely to use Java code as well, so containers have to be converted when they cross between the two languages.
  • Try to traverse containers with iterators, and use asJava/asScala conversions (which allocate new wrapper containers) as little as possible. In code blocks that are called repeatedly, this has a big impact on efficiency; a sketch follows.
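  • A minimal sketch of the iterator approach, assuming we are handed a java.util.ArrayList from Java code:
    import java.util.{ArrayList => JArrayList}

    // A Java list produced by Java code (hypothetical example)
    val javaList = new JArrayList[String]()
    javaList.add("spark")
    javaList.add("scala")

    // Traverse via the Java iterator directly instead of javaList.asScala.map(...),
    // which would allocate a new wrapper on every call in a hot code path
    var totalLength = 0
    val it = javaList.iterator()
    while (it.hasNext) {
        totalLength += it.next().length
    }
    println(totalLength)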

4. Frequently executed code blocks

  • Frequently executed code blocks account for a large share of a Spark application's running time; optimizing this part of the code can noticeably improve efficiency.
  • Examples:
    • Operators
      • For example, a map operation in a code block clearly runs once per record, which means it is executed many times. If a time-consuming operation sits inside it, performance suffers badly, e.g. creating a new container, making a remote request, or opening a JDBC connection inside every map call (a sketch follows this list)
    • while and for loops
      • Similarly, the body of a while or for loop is executed frequently
    • Common code blocks
      • Commonly used code blocks, such as utility classes that are called everywhere, also deserve extra attention to their performance
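  • A small sketch of hoisting expensive per-record work out of a hot map operator; the regex here stands in for any per-record cost (for non-serializable resources such as JDBC connections, see the mapPartitions/foreachPartition sketch in the next section), and the data is invented:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hot-path-demo").master("local[*]").getOrCreate()
    val lines = spark.sparkContext.parallelize(Seq("2020-01-01 ok", "2020-01-02 fail"))

    // Bad: the regex would be recompiled for every record inside the hot map operator
    // lines.map(line => "\\d{4}-\\d{2}-\\d{2}".r.findFirstIn(line))

    // Better: build the expensive object once; it is serialized with the closure
    val datePattern = "\\d{4}-\\d{2}-\\d{2}".r
    val dates = lines.map(line => datePattern.findFirstIn(line))
    dates.collect().foreach(println)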

5. Spark API

  • Choosing the appropriate Spark API for the data at hand gives good results. Some examples are listed below:
  • Examples:
    • If you want the stored output to consist of a small number of files, you can use repartition or coalesce. Note that if the stage before the write takes a long time to process, repartition is recommended: coalesce avoids a shuffle, but it reduces the parallelism of that last stage
    • Use reduceByKey, aggregateByKey, or combineByKey instead of groupByKey (see the first sketch after this list)
    • When joining, look for opportunities to use a broadcast variable for a map-side join (see the second sketch after this list)
    • When reading a table, supply the schema explicitly; this speeds up the read
    • Filter the data as early as possible, before the main processing; it is more efficient
    • Try to write the logic inside a single operator instead of chaining operations like map.map. From Spark's execution point of view chaining is not a problem, but writing several operators in a row is more error-prone, since each one re-defines how part of the data is handled. For example, a filter.map chain can be replaced by a single flatMap, and mapValues.map by a single map
    • For streaming jobs where the business requires event-time windows, use Structured Streaming
    • System.out.println inside an operator runs on each executor node, so you have to look at the executor logs to see the output
    • Depending on the situation, use collect sparingly and prefer take; collect pulls all the data back to the driver and can exhaust its memory
    • Do not create the same RDD repeatedly; reuse RDDs wherever possible
    • RDDs or DataFrames that are used repeatedly should be cached (choose an appropriate persistence level), and the cache should be released after use (call unpersist at a reasonable point)
    • If you need to sort immediately after a repartition, use repartitionAndSortWithinPartitions instead of 'repartition + sort'
    • Operations that only need to be performed once per batch of data (e.g. creating a JDBC connection) should use mapPartitions or foreachPartition, or consider the singleton pattern so that only one instance is created per JVM (see the last sketch after this list)
    • I will slowly think of others and add them later...
    • Also, for examples of handling data skew, see Spark code readability and performance optimization - examples five, six, seven, and eight
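  • A minimal word-count sketch of replacing groupByKey with reduceByKey, as suggested above; the input data is invented:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("reduce-by-key-demo").master("local[*]").getOrCreate()
    val words = spark.sparkContext.parallelize(Seq("spark", "scala", "spark"))

    // groupByKey shuffles every record before aggregating:
    // words.map((_, 1)).groupByKey().mapValues(_.sum)

    // reduceByKey combines values on the map side first, so far less data is shuffled
    val counts = words.map((_, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)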
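  • A minimal sketch of a broadcast map-side join; the order/user data and field meanings are invented for illustration:
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-join-demo").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Large side: (userId, orderAmount); small side: userId -> city
    val orders = sc.parallelize(Seq(("u1", 100L), ("u2", 30L), ("u1", 80L)))
    val users = Map("u1" -> "Beijing", "u2" -> "Chengdu")

    // Broadcast the small table and join inside map -- no shuffle,
    // unlike orders.join(usersRdd), which would shuffle both sides
    val usersBc = sc.broadcast(users)
    val joined = orders.map { case (userId, amount) =>
        (userId, amount, usersBc.value.getOrElse(userId, "unknown"))
    }
    joined.collect().foreach(println)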
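  • A hedged sketch of foreachPartition with one JDBC connection per partition; the JDBC URL, credentials, table, and data are placeholders, not a real setup:
    import java.sql.DriverManager
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("foreach-partition-demo").master("local[*]").getOrCreate()
    val records = spark.sparkContext.parallelize(Seq(("Tom", 18), ("Jack", 34)))

    // One connection per partition instead of one per record
    records.foreachPartition { iter =>
        val conn = DriverManager.getConnection("jdbc:mysql://db-host:3306/demo", "user", "password")
        val stmt = conn.prepareStatement("INSERT INTO person(name, age) VALUES (?, ?)")
        iter.foreach { case (name, age) =>
            stmt.setString(1, name)
            stmt.setInt(2, age)
            stmt.executeUpdate()
        }
        stmt.close()
        conn.close()
    }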

6. Broadcast problems

7. Data transfer and analysis

8. Handling abnormal data

  • During development, it is normal for a data source to contain dirty data.
  • Normally, when dirty data is hit and parsing fails, the Spark application errors out and stops; you then locate the abnormal sample and adjust the code logic. You may repeat this process dozens of times, and dirty data will still occasionally cause failures, because you cannot predict every kind of dirty data the source may contain. In that situation, I recommend doing the following:
  • Example of handling dirty data
    // A local SparkSession for running the example (added here for completeness)
    val spark = org.apache.spark.sql.SparkSession.builder()
        .appName("dirty-data-demo")
        .master("local[*]")
        .getOrCreate()

    val data = List(
        "小明,18,北京,男",
        "小李,34,四川,女",
        "小王,!@#,重庆,男"
    )
    spark.sparkContext.parallelize(data)
        // Use flatMap so that dirty records can simply be dropped
        .flatMap { line =>
            val fields = line.split(',')
            val name = fields(0)
            val ageStr = fields(1)
            val address = fields(2)
            val gender = fields(3)
    
            try {
                // age may contain dirty data that cannot be parsed as an Int
                val age = Integer.parseInt(ageStr)
                
                Some((name, age, address, gender))
            } catch {
                // If parsing fails, just drop this record
                case _: Throwable => None
            }
        }
    

9. Data synchronization and lock problems

  • When processing large amounts of data, if there is a single shared access point (target), you run into a lock-synchronization problem. Synchronizing a large volume of data through locks hurts performance badly, so we need to reduce the use of locks as much as possible.
  • Here are a few suggestions:
    • Multiple reader threads and multiple writer threads -> lock the read/write methods with synchronized, ReentrantLock, etc., or use a concurrent container type
    • Multiple reader threads and a small number of writer threads -> use the read-write lock ReentrantReadWriteLock
    • Multiple reader threads and a single writer thread -> mark the shared object reference with the volatile keyword and publish updates by swapping the reference
      • For example, first mark the reference to the original object A as volatile. When the writer thread needs to update it, it creates a new object B, copies the fields of A into B, updates B, and then points the volatile reference at B. That is all it takes (see the sketch below).
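  • A minimal sketch of this volatile copy-on-write pattern; the Config class and its fields are invented for illustration:
    // Immutable snapshot of the shared state
    final case class Config(threshold: Int, whitelist: Set[String])

    object ConfigHolder {
        // volatile guarantees readers immediately see the newly published reference
        @volatile private var current: Config = Config(10, Set("u1"))

        // Many reader threads: just read the reference, no lock needed
        def get: Config = current

        // Single writer thread: build a fresh copy, modify it, then swap the reference
        def update(newThreshold: Int): Unit = {
            val updated = current.copy(threshold = newThreshold)
            current = updated // a reference assignment publishes the new object atomically
        }
    }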

10. Design a suitable project structure

Origin blog.csdn.net/alionsss/article/details/103821860