Spark Knowledge Summary - Tuning (1)

Cluster configuration: SPARK_WORKER_CORES: if a machine has 32 cores with hyper-threading (two threads per core), SPARK_WORKER_CORES should be set to 64.

SPARK_WORKER_MEMORY: the total amount of memory a worker can allocate to executors.

Job submission:

./spark-submit --master node:port --executor-cores <num> --class <mainClass> xx.jar <args>

--executor-cores: the number of cores used by each executor

--executor-memory: the amount of memory used by each executor

--total-executor-cores: the total number of cores used by the Spark application on a standalone cluster

--num-executors: the number of executors started for the Spark application on YARN

--driver-cores: the number of cores used by the driver

--driver-memory: the amount of memory used by the driver

The parameters above can be specified when submitting a task with spark-submit, and can also be configured in spark-defaults.conf.
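For example, a standalone-mode submission might look like the following (the master URL, class name, jar, and sizes are placeholders, not values from this article):

./spark-submit --master spark://node1:7077 --class com.example.MyApp --executor-cores 4 --executor-memory 4g --total-executor-cores 40 --driver-cores 2 --driver-memory 2g myapp.jar arg1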

 

Tuning Spark parallelism (the collection-based methods immediately below are generally only used for testing):

sc.textFile(path, minPartitions)

sc.parallelize(seq, num)

sc.makeRDD(seq, num)

sc.parallelizePairs(list, num)   (Java API)

Increase parallelism at the operator level:

reduceByKey(func, num), join(otherRDD, num), distinct(num), groupByKey(num)

Repartitioning can increase parallelism:

repartition(num) / coalesce(num, shuffle); repartition(num) is equivalent to coalesce(num, shuffle = true)

spark.default.parallelism: in local mode the default parallelism is the N in local[N]; in standalone/yarn mode it is the total number of cores across all executors currently in use

spark.sql.shuffle.partitions: the number of partitions used for Spark SQL shuffles, default 200

Custom partitioner (see the sketch at the end of this section)

sparkStreaming:

receiver 模式:  spark.streaming.blockInterval = 200ms 

direct mode (spark2.3 +): consistent with the partition number read topic
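A minimal sketch of a custom partitioner (the class name and partitioning rule are illustrative, not from this article):

import org.apache.spark.Partitioner

class MyPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  // route each key to a partition; here a simple non-negative hash, but any business rule works
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % parts
    if (h < 0) h + parts else h
  }
}

// usage on a pair RDD: pairRdd.partitionBy(new MyPartitioner(16))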

 

Code Tuning: 

1. Avoid creating duplicate RDDs; try to reuse the same RDD

2. Persist RDDs that are used multiple times (a usage sketch follows this list)

The specific persistence operators:

cache(): stores the data in memory by default; when an RDD is used across multiple jobs, its data can be cached

persist():

MEMORY_ONLY: store the data directly in memory

MEMORY_ONLY_SER: when the amount of data is large, store it in memory in serialized form

MEMORY_AND_DISK: store the data in memory, spilling to disk when memory is insufficient

MEMORY_AND_DISK_SER: the same as MEMORY_AND_DISK, but the data is stored in serialized form

checkpoint(): persist the RDD to a reliable file system (such as HDFS) and cut its lineage
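A minimal sketch of persisting an RDD that is reused by more than one job (assuming sc is an existing SparkContext; the data is illustrative):

import org.apache.spark.storage.StorageLevel

val lines = sc.parallelize(Seq("1,ERROR,a", "2,WARN,b", "3,ERROR,c"))
val parsed = lines.map(_.split(",")).persist(StorageLevel.MEMORY_AND_DISK_SER)

val errorCount = parsed.filter(arr => arr(1) == "ERROR").count()  // first job: computes and caches parsed
val warnCount  = parsed.filter(arr => arr(1) == "WARN").count()   // second job: reads parsed from the cache

parsed.unpersist()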

3. Avoid shuffle-class operators:

Use broadcast variables plus map-class operators instead of join (see the map join sketch in the data skew section below)

4. Use shuffle-class operators that perform map-side pre-aggregation:

reduceByKey: combines the values of each key with a single function, with map-side combining before the shuffle

aggregateByKey: like reduceByKey, but with an initial (zero) value and separate within-partition and cross-partition functions

Code example:

package com.optimize.study.spark

import org.apache.spark.{SparkConf, SparkContext}

object aggregateByKey {

  def main(args: Array[String]): Unit = {

    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)

    val unit = sc.parallelize(Array[(String, Int)](
      ("zhangsan", 18),
      ("zhangsan", 19),
      ("lisi", 20),
      ("wangwu", 21),
      ("zhangsan", 22),
      ("lisi", 23),
      ("wangwu", 24),
      ("wangwu", 25)
    ), 2)

    // zero value " " per key per partition; the first function concatenates values within a partition
    // with "$", the second function concatenates the per-partition results with "#"
    val result = unit.aggregateByKey(" ")((s: String, i: Int) => { s + "$" + i }, (s1: String, s2: String) => { s1 + "#" + s2 })

    result.foreach(println)
  }
}

 

combineByKey: the most general aggregation operator; takes a createCombiner function, a mergeValue function (within a partition), and a mergeCombiners function (across partitions)

Code example:

package com.bjsxt.myscalacode

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MyCombineByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(Array[(String, Int)](
      ("zhangsan", 18),
      ("zhangsan", 19),
      ("lisi", 20),
      ("wangwu", 21),
      ("zhangsan", 22),
      ("lisi", 23),
      ("wangwu", 24),
      ("wangwu", 25)
    ),2)

    /**
      * partition index = 0: (zhangsan,18), (zhangsan,19), (lisi,20), (wangwu,21)
      *   (zhangsan,18) => (zhangsan,hello18) => (zhangsan,hello18#19)
      *   (lisi,20)     => (lisi,hello20)
      *   (wangwu,21)   => (wangwu,hello21)
      * partition index = 1: (zhangsan,22), (lisi,23), (wangwu,24), (wangwu,25)
      *   (zhangsan,22) => (zhangsan,hello22)
      *   (lisi,23)     => (lisi,hello23)
      *   (wangwu,24)   => (wangwu,hello24) => (wangwu,hello24#25)
      * merged across partitions:
      *   (zhangsan, hello18#19@hello22)
      *   (lisi, hello20@hello23)
      *   (wangwu, hello21@hello24#25)
      */
    val unit: RDD[(String, String)] = rdd1.combineByKey((i:Int)=>{"hello"+i}, (s:String, i:Int)=>{s+"#"+i}, (s1:String, s2:String)=>{s1+"@"+s2})

    unit.foreach(println)

//    rdd1.mapPartitionsWithIndex((index,iter)=>{
//      val transIter = iter.map(one => {
//        s"partition index = ${index},value = $one"
//      })
//      transIter
//    }).foreach(println)


  }
}

 

Advantages of map-side pre-aggregation (compared with a plain shuffle, where all the raw data is pulled to the reduce side and aggregated there, each map task first pre-aggregates its own data and the reduce side then merges the already-combined results it pulls):

Reduces the amount of shuffle data written on the map side

Reduces the amount of data pulled by the reduce side

Reduces the number of aggregation operations performed on the reduce side

 

5. Use high-performance operators where possible:

Use foreachPartition instead of foreach

Use mapPartitions instead of map

Use coalesce to reduce the number of partitions after filtering out a large amount of data

Use reduceByKey instead of groupByKey

Use repartitionAndSortWithinPartitions instead of repartition followed by a separate sort operation
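A minimal sketch of the foreachPartition / mapPartitions idea, assuming rdd is an existing RDD[Int] (the connection handling is only sketched in comments, since the external store is hypothetical):

// process a whole partition at a time instead of one record at a time
rdd.foreachPartition { iter =>
  // val conn = createConnection()        // hypothetical: one connection per partition, not per record
  iter.foreach(record => println(record)) // e.g. conn.insert(record)
  // conn.close()
}

// mapPartitions: the function is invoked once per partition rather than once per element
val doubled = rdd.mapPartitions(iter => iter.map(x => x * 2))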

6. Use broadcast variables

Broadcast variables can reduce executor-side memory usage: each executor keeps a single copy of the broadcast data instead of one copy per task
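A minimal sketch of a broadcast variable, assuming sc is an existing SparkContext (the dictionary and data are illustrative; without the broadcast, dict would be shipped with every task):

val dict = Map("a" -> 1, "b" -> 2)
val bc = sc.broadcast(dict)
val words = sc.parallelize(Seq("a", "b", "c"))
val codes = words.map(w => bc.value.getOrElse(w, 0))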

7. Use Kryo serialization to optimize serialization performance. The places where Spark uses serialization:

a. RDDs of custom types

b. Task serialization

c. Serialized RDD persistence, e.g. MEMORY_AND_DISK_SER

Spark can use the Kryo serialization mechanism. Kryo is much faster than the default Java serialization, and the serialized data occupies less memory, roughly 1/10 of what Java serialization uses, so with Kryo less data is transmitted over the network and less memory is consumed in the cluster.

To use Kryo in Spark, set the serializer and register the classes to be serialized:

new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").registerKryoClasses(Array(classOf[SpeedSortKey]))

8. Optimize data structures:

Try to use primitive data types instead of strings in Spark code

Try to use strings instead of objects

Try to use arrays instead of collections

9. The overall goals of code optimization:

Reduce memory usage

Reduce the data transmission between nodes.

Reduce disk IO

10. Data locality tuning. The data locality levels:

a. PROCESS_LOCAL: the task processes data held in the memory of its own executor

b. NODE_LOCAL: the task processes data on the disk of its node, or in the memory of another executor on the same node

c. NO_PREF: the task processes data in an external system such as a database, so there is no locality preference

d. RACK_LOCAL: the task processes data in the memory or on the disk of an executor on another node in the same rack

e. ANY: the task processes data on a node in another rack

Tuning parameters:

spark.locality.wait.process (default 3s): how long to wait before downgrading from the PROCESS_LOCAL level to the NODE_LOCAL level

spark.locality.wait.node (default 3s): how long to wait before downgrading from the NODE_LOCAL level to the next lower level

spark.locality.wait.rack (default 3s): how long to wait before downgrading from the RACK_LOCAL level to ANY

The driver first sends tasks at the highest locality level; if a task has waited 3s and been retried 5 times and still cannot run at that level, the driver downgrades the locality level and sends the task again, and so on, one level at a time.

11. Memory tuning

Give tasks enough memory to run and avoid frequent GC; frequent minor GC or full GC pauses the JVM and stalls the tasks.

Reduce the memory used for shuffle aggregation and the memory used for RDD caching and broadcast variable storage, leaving more memory for task execution.

Parameters:

Static memory management:

Reduce spark.shuffle.memoryFraction (default 0.2)

Reduce spark.storage.memoryFraction (default 0.6)

Unified memory management:

Reduce spark.memory.fraction (default 0.6)

12. Shuffle tuning

spark.reducer.maxSizeInFlight (default 48m): the maximum amount of data pulled in each fetch

spark.shuffle.io.maxRetries (default 3): the number of retries when pulling shuffle data fails

spark.shuffle.io.retryWait (default 5s): the wait time between retries

spark.shuffle.sort.bypassMergeThreshold (default 200): when the number of reduce partitions is below this value and the operator needs no map-side aggregation, the bypass mechanism of the sort shuffle is used
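For example, these can be raised at submit time (the values below are purely illustrative, not recommendations):

./spark-submit ... --conf spark.reducer.maxSizeInFlight=96m --conf spark.shuffle.io.maxRetries=6 --conf spark.shuffle.io.retryWait=10s ...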

13. Off-heap memory tuning

Increase the connection ack wait time between nodes: --conf spark.core.connection.ack.wait.timeout=300

Normally, when a reduce task pulls data from a map task, the data is first loaded into JVM memory on the map side, copied from the JVM into the network card buffer, and then transmitted over the network.

With off-heap memory enabled, the JVM step is skipped: the data is moved from disk directly into the network card buffer and then sent out.

Spark's off-heap (overhead) memory for each executor defaults to 1/10 of the executor memory; in most cases it needs to be increased to more than 2G.

Parameters for adjusting off-heap memory:

On YARN:

--conf spark.yarn.executor.memoryOverhead=2048 (in MB)

On standalone:

--conf spark.executor.memoryOverhead=2048 (in MB)

14. Handling data skew

What data skew looks like:

MR: one task processes far more data than the other tasks

Hive: one key of a field in a table has a very large amount of data, while the other keys have very little

Spark: one partition of an RDD holds far more data than the other partitions

Solutions to data skew:

Hive ETL preprocessing:

Scenario: Spark frequently operates on a skewed Hive table, joining on the skewed field each time

Solution: discuss with the business side whether the processing of the skewed data can be moved forward into the Hive ETL, so that Spark no longer sees the skew (this treats the symptom rather than the root cause)

Filter out the few skewed keys:

Scenario: estimate whether the few skewed keys matter to the business; if they have little impact, they can simply be filtered out before the analysis

Solution: use the filter operator to remove the skewed keys directly
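A minimal sketch, assuming pairRdd: RDD[(String, Int)] and that the skewed keys have already been identified (all names are illustrative):

val skewedKeys = Set("hotKey1", "hotKey2")
val filtered = pairRdd.filter { case (k, _) => !skewedKeys.contains(k) }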

Increase the degree of parallelism:

Scenario: a large amount of data, few partitions, and many distinct keys; the parallelism can simply be increased

Solution: pass a larger partition number directly to the operator

Two-stage (double) aggregation:

Scenario: few partitions, many records sharing the same keys, and a large volume of data

Solution: add a random prefix to the identical keys, aggregate once, then remove the prefix and aggregate again to obtain the final result
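A minimal sketch of two-stage aggregation for a sum, assuming pairRdd: RDD[(String, Int)] (the names and the prefix range are illustrative):

import scala.util.Random

val salted   = pairRdd.map { case (k, v) => (Random.nextInt(10) + "_" + k, v) } // 1. add a random prefix
val partial  = salted.reduceByKey(_ + _)                                        // 2. first aggregation on the salted keys
val unsalted = partial.map { case (k, v) => (k.split("_", 2)(1), v) }           // 3. strip the prefix
val result   = unsalted.reduceByKey(_ + _)                                      // 4. final aggregation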

Convert reduce join to map join:

Scenario: two RDDs, one large and one small, with skewed data, need to be joined

Solution: collect the small RDD back to the driver, broadcast it, and perform the join on the large RDD with map-class operators; this avoids the shuffle entirely, so no data skew can occur
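A minimal sketch of turning a reduce join into a map join, assuming sc is an existing SparkContext, bigRdd: RDD[(String, Int)] and smallRdd: RDD[(String, String)] (hypothetical names):

val smallMap = smallRdd.collectAsMap()        // pull the small RDD back to the driver
val bc = sc.broadcast(smallMap)               // broadcast one copy to each executor
val joined = bigRdd.flatMap { case (k, v) =>  // join with a map-class operator, no shuffle
  bc.value.get(k).map(s => (k, (v, s)))       // keep only keys present on the small side (inner join)
}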

Sample, split out the skewed keys, and join them separately:

Scenario: two RDDs, one relatively large with skewed data and the other also relatively large, need to be joined, and none of the approaches above can be applied

Solution: sample to find the skewed keys and split them out, add random prefixes to them and expand the matching part of the other RDD, join the two parts separately, and then union the results

Join using random prefixes and an expanded RDD:

Scenario: two RDDs, one large with many skewed keys and the other also relatively large, need to be joined

Solution: add random prefixes to one RDD and expand the other RDD by the same factor, then join; the drawback is that this requires a large amount of memory
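A minimal sketch of the random-prefix + expansion join, assuming skewedRdd and otherRdd are both RDD[(String, Int)] and an expansion factor of 10 (all names and the factor are illustrative):

import scala.util.Random

val n = 10
val prefixed = skewedRdd.map { case (k, v) => (Random.nextInt(n) + "_" + k, v) }           // salt the skewed side
val expanded = otherRdd.flatMap { case (k, v) => (0 until n).map(i => (i + "_" + k, v)) }  // replicate the other side n times
val joined = prefixed.join(expanded)
  .map { case (saltedKey, value) => (saltedKey.split("_", 2)(1), value) }                  // strip the salt afterwards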

 


Origin www.cnblogs.com/wcgstudy/p/11403487.html