Key points of Spark optimization

Background
Why do we need tuning?
For the same program running on the same cluster, the way one person tunes the parameters and writes the code versus another can make the job run several times or even dozens of times faster.

1. Development Tuning

1.1 Principle One: avoid creating duplicate RDDs
Suppose we have one copy of the data, student.txt
First requirement: word count
val stuRDD = sc.textFile("e://sparkData//student.txt")
Second requirement: count how many students there are
val stuRDD01 = sc.textFile("e://sparkData//student.txt")
If you create duplicate RDDs like this, the same file is loaded twice, which wastes performance. But our requirements call for using the same RDD twice, so what should we do?
Use persistence:

sc.textFile("e://sparkData//student.txt").cache()

1.2 Principle Two: reuse the same RDD as much as possible
This is something everyone reads about during development and then promptly forgets.
Example:
val namesRDD = starsRDD.map(_._1)
val name2LengthRDD = namesRDD.map(name => (name, name.length))

// These two maps can be merged into one
val name2LengthRDD01 = starsRDD.map(tuple => (tuple._1, tuple._1.length))

The second way of writing performs better, because it saves one RDD computation.

1.3 Principle Three: persist RDDs that are used multiple times

Notes on choosing a persistence level:
1. Prefer MEMORY_ONLY, but only if your memory is large enough; otherwise it may cause OOM (out of memory) exceptions.
2. If memory is not enough for MEMORY_ONLY, use the MEMORY_ONLY_SER persistence level. After serialization the data occupies less memory, but the extra serialization and deserialization consume CPU.
3. The levels above are pure in-memory persistence and very fast, but if memory is not enough even for MEMORY_ONLY_SER, use MEMORY_AND_DISK_SER. With this strategy, data is kept in memory first, and whatever does not fit spills to disk.
4. Pure disk-only persistence is not recommended; it is very slow. The _2 (replicated) levels are only worth using in some special scenarios (e.g. Spark Streaming jobs with strict fault-tolerance requirements) and are generally not recommended.
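
A minimal sketch of choosing a persistence level explicitly (the RDD and the chosen level are illustrative):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY)
val stuRDD = sc.textFile("e://sparkData//student.txt").persist(StorageLevel.MEMORY_ONLY_SER)

stuRDD.count()      // the first action materializes and persists the RDD
stuRDD.first()      // later actions reuse the persisted data

stuRDD.unpersist()  // release the storage once the RDD is no longer needed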

1.4 Principle Four: try to avoid shuffle-class operators

Reducing the number of partitions
Replacing join with broadcast + map + filter

For joins where a large table joins a small table, consider broadcasting the small table's data to the executors and completing the join with a map + filter operation.
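
A minimal sketch of the idea, assuming largeRDD and smallRDD are both RDD[(Long, String)] and the small table fits in driver and executor memory (a fuller version appears at the end of this post):

val smallData = smallRDD.collect().toMap        // pull the small table to the driver
val smallBroadcast = sc.broadcast(smallData)    // one copy per executor, not per task

// map + filter on the large RDD replaces the shuffle-based join
val joined = largeRDD.flatMap { case (key, value) =>
  smallBroadcast.value.get(key).map(other => (key, (value, other)))
}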

1.5 Principle Five: use shuffle operators that do map-side pre-aggregation
If a shuffle operation has to be used and cannot be replaced with a map-class operator, then try to use an operator that pre-aggregates on the map side.
That is, use reduceByKey instead of groupByKey.
For such requirements, reduceByKey performs much better than groupByKey, because it can greatly reduce the amount of data transferred over the network.
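
A minimal sketch of the difference, assuming pairs is an RDD[(String, Int)]:

// groupByKey ships every value across the network and sums on the reduce side
val sumsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey pre-aggregates on the map side of each partition first,
// so far less data is shuffled
val sumsFast = pairs.reduceByKey(_ + _)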

1.6 Principle Six: use high-performance operators
For some requirements, several operators could do the job, but their performance differs; solve the problem with the higher-performance operator.
For example:
use reduceByKey / aggregateByKey instead of groupByKey
use mapPartitions instead of ordinary map, but note that mapPartitions can cause OOM (out of memory) problems when a partition is too large to process in one go
use foreachPartition instead of ordinary foreach; similar to mapPartitions vs map, except that it is an action operator, e.g. for writing to a database (see the sketch after this list)
use coalesce after a filter operation

NOTE: when to use repartition vs coalesce
repartition triggers a shuffle and generally increases the number of partitions, in order to raise parallelism
val rdd02 = rdd01.filter(xxx) -> after filtering, some partitions may keep a lot of data while others keep very little
coalesce generally reduces the number of partitions by merging them;
rdd02.coalesce(): although parallelism drops, resource utilization is higher, which may improve performance

If very little data remains after filtering and you need to reduce the number of partitions,
e.g. rdd01 has 20 partitions -> rdd02 with 5: val rdd02 = rdd01.coalesce(5, true) or rdd01.repartition(5); fewer, fuller partitions work better
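
A minimal sketch of two of these operators; the connection helpers and the filter predicate are placeholders:

// Assume records is an RDD[String]

// foreachPartition: open one connection per partition instead of one per record
records.foreachPartition { iter =>
  // val conn = createConnection()   // placeholder: open a DB connection here
  iter.foreach { record =>
    // conn.insert(record)           // placeholder: write one record
  }
  // conn.close()
}

// coalesce after a heavy filter: fewer, fuller partitions instead of many near-empty ones
val filtered = records.filter(_.nonEmpty).coalesce(10)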

1.7 Principle Seven: broadcast large external variables
If an operator uses an external variable that is relatively large, it is recommended to use Spark's broadcast functionality to broadcast that variable. After broadcasting, only one copy of the variable resides in each Executor's memory,
and that copy is shared by all tasks running in that Executor, instead of each task holding its own copy. In this case,
you greatly reduce the number of copies of the variable, which reduces the network transmission overhead, reduces the memory occupied in each Executor, and lowers GC frequency.
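
A minimal sketch, assuming bigLookup is a large read-only Map[String, Int] built on the driver and someRDD is an RDD[String]:

val bigLookupBroadcast = sc.broadcast(bigLookup)

// every task in an Executor reads the same single copy via .value
val enriched = someRDD.map { key =>
  (key, bigLookupBroadcast.value.getOrElse(key, -1))
}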

1.8 Principle Eight: use Kryo to optimize serialization performance

Which serializer does Spark use? Java serialization by default; Kryo is faster and more compact.
Do the following configuration:

// Create the SparkConf object.
val conf = new SparkConf().setMaster(...).setAppName(...)
// Set the serializer to KryoSerializer.
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")
// Register the custom types to be serialized.
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))

1.9 Principle Nine: optimize data structures

Objects, strings and collections are all memory-hungry; where possible:
use strings instead of objects,
use arrays instead of collections,
use primitive types (such as Int, Long) instead of strings.

In practice this is hard to follow strictly, so it is not very practical.

2. Resource Tuning

Inside an executor, memory is divided into several parts:
The first part is used to execute the code of our own tasks; by default it takes 20% of the Executor's total memory.
The second part is used when a task pulls the output of tasks from the previous stage during shuffle, for aggregation and similar operations; by default it also takes 20% of the Executor's total memory.
spark.shuffle.memoryFraction
adjusts the fraction of executor memory occupied by shuffle data; the default is 0.2.

The third part is used for RDD persistence; by default it takes 60% of the Executor's total memory.
spark.storage.memoryFraction
adjusts the fraction of executor memory occupied by persisted data; the default is 0.6.
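
A minimal sketch of adjusting these two fractions (they belong to the legacy static memory manager; the values are illustrative):

val conf = new SparkConf()
  .setAppName("tuning-example")
  .set("spark.shuffle.memoryFraction", "0.3")   // more memory for shuffle aggregation
  .set("spark.storage.memoryFraction", "0.5")   // less memory for RDD caching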

2.1 Understanding the parameters

NOTE: how to configure Spark parameters
1. Configure parameters in the code:

conf.set(key,value)
conf.set("spark.serializer","org.apache.spark.serializer.KryoSerializer")

The resource tuning parameters are the following:
num-executors
The total number of executors the job runs with.
Parameter tuning recommendation: for each Spark job, roughly 50 to 100 Executor processes is generally appropriate; setting too few or too many Executor processes is not good.
Too few, and cluster resources cannot be fully used; too many, and the queue may not be able to provide sufficient resources.

== executor-memory ==
Parameter description: this parameter sets the memory of each Executor process. The size of Executor memory often directly determines Spark job performance,
and it is also directly related to the common JVM OOM exceptions.
Example: a queue with 640G of memory can run 20 executors at 32g each (32g * 20 = 640G),
i.e. 20 executors.
Check the maximum memory limit of your team's resource queue: num-executors multiplied by executor-memory represents the total amount of memory your Spark job is requesting (that is, the sum of the memory of all Executor processes),
and this total cannot exceed the queue's maximum memory. In addition, if you share the resource queue with others on the team, the requested total should not exceed 1/3 to 1/2 of the queue's maximum total memory, to avoid your Spark job
taking up all of the queue's resources and leaving other people's jobs unable to run.

== executor-cores ==
How many CPU cores each executor gets.
The core here is not a physical core but a logical core;
for example an i7 with 4 physical cores has 8 logical threads.

Parameter tuning recommendation: 2 to 4 CPU cores per Executor is usually appropriate. Set it according to the resource queue shared by the different departments: check the queue's maximum CPU core limit, and based on the number of Executors,
decide how many CPU cores each Executor process can be assigned. As before, if the queue is shared with others, num-executors * executor-cores should stay around 1/3 to 1/2 of the queue's total CPU cores,
again to avoid affecting other people's jobs.

driver-memory
The memory allocated to the Driver program; when there is a collect operation, the driver needs more memory.

spark.default.parallelism
Parameter description: this parameter sets the default number of tasks for each stage. It is extremely important; if it is not set, it may directly hurt your Spark job's performance.

spark.default.parallelism = num-executors * executor-cores * (2 to 3)
With this setting, each CPU core runs 2 to 3 tasks over the course of a stage.

Example: 10 executors, each with 4 cores, and spark.default.parallelism = 120:
the job runs on 40 cores, so each core runs 120 / 40 = 3 tasks.

The total number of tasks should definitely be larger than the number of CPU cores assigned, otherwise resources are wasted; 2 to 3 times the core count is generally appropriate.
How to set the parallelism of a Spark application:

 1. spark.default.parallelism
    It has no value by default. If a value is set, say 10, it only takes effect during a shuffle (val rdd2 = rdd1.reduceByKey(_+_)
 	// rdd2 will have 10 partitions; rdd1's partition count is not affected by this parameter)

      new SparkConf().set("spark.default.parallelism","500")

 2. If the data is read from HDFS, increase the number of blocks. By default a split maps one-to-one to a block, and a split corresponds to a partition of the RDD, so increasing the number of blocks also raises the parallelism.
 3. RDD.repartition: reset the number of partitions of the RDD.
 4. Pass the number of partitions to shuffle operators such as reduceByKey:
      val rdd2 = rdd1.reduceByKey(_+_, 10);  val rdd3 = rdd2.map(...).filter(...).reduceByKey(_+_)
 5. val rdd3 = rdd1.join(rdd2)
 	The number of partitions in rdd3 is determined by the parent RDD with the most partitions, so when using the join operator, increase the number of partitions of the parent RDDs.
 6. spark.sql.shuffle.partitions
 	// the number of partitions used during the shuffle phase in Spark SQL

3. Data skew tuning

Can data skew happen with operators like map and filter? No: skew happens during a shuffle, when some keys carry far more data than others.

Question: data skew is always caused by certain keys being skewed, so how do you find out which keys are skewed?
Sample the data, for example take 20,000 records out of 10 million, and count by key:

val sampledPairs = pairs.sample(false, 0.1)
val sampledWordCounts = sampledPairs.countByKey()
sampledWordCounts.foreach(println(_))

Solutions:
1. Break up (salt) the skewed keys
2. Filter out the skewed keys
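
A minimal sketch of both ideas, assuming pairs is an RDD[(String, Int)] and "hotKey" has already been identified as the skewed key; the salting factor of 10 is illustrative:

import scala.util.Random

// 1. Break up (salt) the hot key: add a random prefix, aggregate, strip the prefix, aggregate again
val salted = pairs.map { case (k, v) =>
  if (k == "hotKey") (s"${Random.nextInt(10)}_$k", v) else (k, v)
}
val partial = salted.reduceByKey(_ + _)
val result = partial
  .map { case (k, v) => (if (k.endsWith("_hotKey")) "hotKey" else k, v) }
  .reduceByKey(_ + _)

// 2. Or filter the skewed key out and handle it separately
val withoutHotKey = pairs.filter { case (k, _) => k != "hotKey" }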

2. Parameters can be configured in the submit script
Usage format:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  

For example:

./bin/spark-submit \
  --master yarn-cluster \
  --num-executors 100 \
  --executor-memory 6G \
  --executor-cores 4 \
  --driver-memory 1G \
  --conf spark.default.parallelism=1000 \
  --conf spark.storage.memoryFraction=0.5 \
  --conf spark.shuffle.memoryFraction=0.3 \

3. Parameters can be adjusted in the configuration file
Spark reads configuration options from the conf/spark-defaults.conf file. In conf/spark-defaults.conf,
each row is a key-value pair; the key and value can be separated by whitespace or directly by an equals sign.

Question: parameters can be configured in these three places. If the same parameter is configured in all three places with different values,
which one actually takes effect?

Priority:
Code has the highest priority -> once written it cannot be changed elsewhere; changing it means modifying the code and repackaging, so it is usually not desirable, except for parameters that are written once and never changed, which are fine to configure here.
The script has the second highest priority -> very flexible; in general, most parameters are best written here.
The configuration file has the lowest priority -> the two places above hold application-specific parameters, while the configuration file holds global parameters with the lowest priority, better suited to parameters that all applications need.

2. Resource tuning
3. Data skew tuning
4. Shuffle tuning

//////////////////////////////////
Spark optimization principles

1. Try to keep computations within a single RDD
// The wrong approach.
// There is an RDD of <Long, String> format, namely rdd1.
// Because of business needs, a map operation is performed on rdd1 to create rdd2, and the data in rdd2 is only
// the value part of rdd1; in other words, rdd2 is a subset of rdd1.
JavaPairRDD<Long, String> rdd1 = ...
JavaRDD<String> rdd2 = rdd1.map(...)
// Then different operators are executed on rdd1 and rdd2 separately.
rdd1.reduceByKey(...)
rdd2.map(...)

Note: rdd2 is obtained from rdd1 by a kv -> v transformation.

The right approach:

JavaPairRDD<Long, String> rdd1 = ...  .cache()
rdd1.reduceByKey(...)
rdd1.map(tuple._2...)

This reduces the number of RDDs that have to be created.

2. Minimize shuffles

// A traditional join operation results in a shuffle.
// The records with the same key from the two RDDs have to be pulled over the network to the same node,
// where a task performs the join.
val rdd3 = rdd1.join(rdd2)

// A Broadcast + map join does not cause a shuffle.
// Broadcast the data of the small RDD as a broadcast variable.
val rdd2Data = rdd2.collect()
val rdd2DataBroadcast = sc.broadcast(rdd2Data)
// Inside rdd1.map, the operator can obtain all of rdd2's data from rdd2DataBroadcast.
// Then iterate: if the key of the current rdd1 record is the same as the key of some rdd2 record, the two
// records can be joined.
// At this point the rdd1 record and the rdd2 record can be connected and spliced together in whatever form
// is needed (String or Tuple).
val rdd3 = rdd1.map(rdd2DataBroadcast...)

// Note: the above approach is only recommended when the data in rdd2 is relatively small (say a few hundred MB,
// or one to two GB),
// because every Executor will hold a full copy of rdd2's data in memory.

