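Cluster setup:
SPARK_WORKER_CORES: the number of cores a worker offers to executors. On a machine with 32 physical cores and two hardware threads per core, set SPARK_WORKER_CORES to 64.
SPARK_WORKER_MEMORY: the total memory a worker may hand out to executors on that machine.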
Job submission:
./spark-submit --master node:port --executor-cores --class ..jar xxx
--executor-cores: the number of cores used by each executor
--executor-memory: the maximum memory available to each executor
--total-executor-cores: the total number of cores the Spark application uses on a standalone cluster
--num-executors: the number of executors started for the Spark application on YARN
--driver-cores: the number of cores used by the driver
--driver-memory: the memory used by the driver
The parameters above are given on the spark-submit command line when a job is submitted; they can also be configured in spark-defaults.conf.
Tuning Spark parallelism:
Set the partition count when the RDD is created (parallelize and makeRDD are generally used only for testing):
sc.textFile(path, minPartitions)
sc.parallelize(seq, num)
sc.makeRDD(seq, num)
sc.parallelizePairs(list, num) (Java API)
Raise the parallelism at the operator level:
reduceByKey(func, num), join(other, num), distinct(num), groupByKey(num)
Repartitioning can also raise the parallelism:
repartition(num) / coalesce(num); repartition(num) is equivalent to coalesce(num, shuffle = true)
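The three ways of setting the partition count listed above can be compared in a small local run. This is a minimal sketch: the object name, the input data and the partition numbers are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismSketch {
  // Returns the partition counts observed after each step.
  def partitionCounts(sc: SparkContext): Seq[Int] = {
    // Partition count fixed when the RDD is created.
    val pairs = sc.parallelize(1 to 1000, 4).map(i => (i % 10, 1))
    // Partition count passed to a shuffle operator.
    val counts = pairs.reduceByKey(_ + _, 8)
    Seq(
      pairs.getNumPartitions,                  // 4
      counts.getNumPartitions,                 // 8
      counts.repartition(16).getNumPartitions, // 16: repartition always shuffles
      counts.coalesce(2).getNumPartitions,     // 2: shrinking works without a shuffle
      counts.coalesce(32).getNumPartitions     // 8: growing needs shuffle = true
    )
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("parallelism-sketch")
    val sc = SparkContext.getOrCreate(conf)
    println(partitionCounts(sc))
    sc.stop()
  }
}
```

The last entry shows why repartition is the operator to use when the goal is more partitions: coalesce without a shuffle silently keeps the current count.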
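spark.default.parallelism: in local mode the default parallelism is the N in local[N]; on standalone / YARN it is the total number of cores of all executors the application currently uses
spark.sql.shuffle.partitions: default 200
Custom partitioner: extend Partitioner to control both the partition count and the partition each key goes to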
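A custom partitioner could look roughly like the sketch below. HotKeyPartitioner is a hypothetical class, not something from these notes: it assumes one known hot key that gets a partition of its own, while all other (non-null) keys are hashed over the remaining partitions.

```scala
import org.apache.spark.Partitioner

// Hypothetical partitioner: partition 0 is reserved for one known hot key,
// every other key is hashed over partitions 1 .. parts - 1.
// A case class is used so that equals/hashCode are defined, which Spark
// relies on when it compares partitioners.
case class HotKeyPartitioner(hotKey: String, parts: Int) extends Partitioner {
  require(parts >= 2, "need one partition for the hot key and at least one for the rest")

  override def numPartitions: Int = parts

  override def getPartition(key: Any): Int = key match {
    case k: String if k == hotKey => 0
    case other =>
      // hashCode may be negative, so normalise the remainder first.
      val mod = other.hashCode % (parts - 1)
      1 + (if (mod < 0) mod + (parts - 1) else mod)
  }
}
```

On a pair RDD it would be applied with rdd.partitionBy(HotKeyPartitioner("zhangsan", 4)) or passed to operators such as reduceByKey that accept a Partitioner.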
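Spark Streaming:
receiver mode: spark.streaming.blockInterval = 200ms; a smaller interval produces more blocks, and therefore more partitions, per batch
direct mode (Spark 2.3+): the parallelism matches the number of partitions of the topic being read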
Code tuning:
1, avoid creating duplicate RDDs; reuse the same RDD wherever possible
2, persist an RDD that is used more than once
The three persistence operators:
cache(): stores the data in memory by default, the same as persist(MEMORY_ONLY); use it when an RDD is reused across jobs
persist():
MEMORY_ONLY: keeps the data in memory as deserialized objects
MEMORY_ONLY_SER: keeps the data in memory in serialized form; use it when the data set is large
MEMORY_AND_DISK: keeps the data in memory and spills the partitions that do not fit to disk
MEMORY_AND_DISK_SER: like MEMORY_AND_DISK, but the data is stored serialized
checkpoint(): writes the RDD to the checkpoint directory and cuts its lineage
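Persisting an RDD that feeds two jobs can be sketched as follows. The object name and the sample data are made up; the storage level is one of those listed above.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Runs two jobs on the same RDD; the second one reads the persisted data.
  def run(sc: SparkContext): (Long, Long) = {
    val words = sc.parallelize(Seq("a", "b", "a", "c"), 2).map(w => (w, 1))
    words.persist(StorageLevel.MEMORY_AND_DISK_SER)

    val total = words.count()                        // job 1: computes and stores the RDD
    val distinctKeys = words.keys.distinct().count() // job 2: reuses the stored partitions

    words.unpersist() // release the storage once the RDD is no longer needed
    (total, distinctKeys)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-sketch")
    val sc = SparkContext.getOrCreate(conf)
    println(run(sc))
    sc.stop()
  }
}
```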
3, avoid shuffle operators where possible:
use a broadcast variable plus a map-type operator instead of join
4, when a shuffle is unavoidable, use shuffle operators that pre-aggregate on the map side:
reduceByKey:
aggregateByKey:
Code demonstration:
```scala
package com.optimize.study.spark

import org.apache.spark.{SparkConf, SparkContext}

object aggregateByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)
    val unit = sc.parallelize(Array[(String, Int)](
      ("zhangsan", 18),
      ("zhangsan", 19),
      ("lisi", 20),
      ("wangwu", 21),
      ("zhangsan", 22),
      ("lisi", 23),
      ("wangwu", 24),
      ("wangwu", 25)
    ), 2)
    // The zero value " " starts the accumulator of each key inside a partition.
    // The first function appends "$value" within a partition (map side),
    // the second joins the per-partition results with "#" (reduce side).
    val result = unit.aggregateByKey(" ")(
      (s: String, i: Int) => s + "$" + i,
      (s1: String, s2: String) => s1 + "#" + s2
    )
    result.foreach(println)
  }
}
```
combineByKey:
Code demonstration:
```scala
package com.bjsxt.myscalacode

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object MyCombineByKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("test")
    val sc = new SparkContext(conf)
    val rdd1 = sc.parallelize(Array[(String, Int)](
      ("zhangsan", 18),
      ("zhangsan", 19),
      ("lisi", 20),
      ("wangwu", 21),
      ("zhangsan", 22),
      ("lisi", 23),
      ("wangwu", 24),
      ("wangwu", 25)
    ), 2)

    /**
      * Inside each partition (createCombiner, then mergeValue):
      * partition index = 0, value = (zhangsan,18) => (zhangsan,hello18) => (zhangsan,hello18#19)
      * partition index = 0, value = (zhangsan,19)
      * partition index = 0, value = (lisi,20)     => (lisi,hello20)     => (lisi,hello20)
      * partition index = 0, value = (wangwu,21)   => (wangwu,hello21)   => (wangwu,hello21)
      *
      * partition index = 1, value = (zhangsan,22) => (zhangsan,hello22) => (zhangsan,hello22)
      * partition index = 1, value = (lisi,23)     => (lisi,hello23)     => (lisi,hello23)
      * partition index = 1, value = (wangwu,24)   => (wangwu,hello24)   => (wangwu,hello24#25)
      * partition index = 1, value = (wangwu,25)
      *
      * Across partitions (mergeCombiners):
      * => (zhangsan,hello18#19@hello22)
      * => (lisi,hello20@hello23)
      * => (wangwu,hello21@hello24#25)
      */
    val unit: RDD[(String, String)] = rdd1.combineByKey(
      (i: Int) => "hello" + i,                 // createCombiner: first value of a key in a partition
      (s: String, i: Int) => s + "#" + i,      // mergeValue: further values in the same partition
      (s1: String, s2: String) => s1 + "@" + s2 // mergeCombiners: results of different partitions
    )
    unit.foreach(println)

    // rdd1.mapPartitionsWithIndex((index, iter) => {
    //   val transIter = iter.map(one => {
    //     s"partition index = ${index},value = $one"
    //   })
    //   transIter
    // }).foreach(println)
  }
}
```
Advantages of map-side pre-aggregation (without it, all records are pulled first and aggregated only on the reduce side; with it, each map task first aggregates its own output, and the reduce side then pulls and merges the partial results):
reduces the amount of shuffle data written on the map side
reduces the amount of data pulled by the reduce side
reduces the number of aggregation steps on the reduce side
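The difference can be illustrated by computing the same per-key sum twice, once with an operator that pre-aggregates on the map side and once with one that does not. This is a sketch; the object and method names are made up, and the data is the sample data used in these notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapSideCombineSketch {
  // reduceByKey sums the values of a key inside each map task before the shuffle,
  // so at most one record per key and map task crosses the network.
  def withPreAggregation(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.reduceByKey(_ + _).collect().toMap

  // groupByKey ships every record through the shuffle and sums only afterwards.
  def withoutPreAggregation(pairs: RDD[(String, Int)]): Map[String, Int] =
    pairs.groupByKey().mapValues(_.sum).collect().toMap

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("combine-sketch")
    val sc = SparkContext.getOrCreate(conf)
    val pairs = sc.parallelize(Seq(
      ("zhangsan", 18), ("zhangsan", 19), ("lisi", 20), ("wangwu", 21),
      ("zhangsan", 22), ("lisi", 23), ("wangwu", 24), ("wangwu", 25)
    ), 2)
    println(withPreAggregation(pairs))
    println(withoutPreAggregation(pairs))
    sc.stop()
  }
}
```

Both methods return the same map; they differ only in how much data is shuffled, which can be compared in the shuffle read/write columns of the Spark UI.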
4, prefer the high-performance operators:
use foreachPartition instead of foreach
use mapPartitions instead of map
use coalesce to reduce the number of partitions after a filter has removed a large amount of data
use reduceByKey instead of groupByKey
use repartitionAndSortWithinPartitions instead of repartition followed by a sort
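The gain from the per-partition operators comes from doing expensive set-up once per partition rather than once per record. A minimal sketch, using a MessageDigest as a stand-in for any costly resource such as a database connection (the object and method names are made up):

```scala
import java.security.MessageDigest
import org.apache.spark.{SparkConf, SparkContext}

object PartitionOpsSketch {
  // Hashes every string; the digest object is created once per partition.
  def md5All(sc: SparkContext, data: Seq[String]): Array[String] =
    sc.parallelize(data, 2).mapPartitions { iter =>
      val md = MessageDigest.getInstance("MD5") // set-up cost paid once per partition
      iter.map { s =>
        // digest() resets the object, so it can be reused for the next record
        md.digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString
      }
    }.collect()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("partition-ops-sketch")
    val sc = SparkContext.getOrCreate(conf)
    md5All(sc, Seq("zhangsan", "lisi", "wangwu")).foreach(println)
    sc.stop()
  }
}
```

foreachPartition follows the same pattern for output: open the connection at the top of the closure, write all records of the iterator, then close it.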
5, use broadcast variables
Broadcast variables reduce memory use on the executor side, because each executor keeps one copy instead of one copy per task
6, use Kryo to optimize serialization performance. Spark serializes data in these places:
a, RDD[custom type]
b, task serialization
c, serialized RDD persistence levels such as MEMORY_AND_DISK_SER
Kryo serialization is much faster than the default Java serialization, and the serialized data is smaller, about 1/10 of the size produced by Java serialization. With Kryo, less data is sent over the network and less cluster memory is consumed.
To use Kryo in Spark, the classes need to be registered:
new SparkConf().set("spark.serializer", "org.apache.spark.serializer.KryoSerializer").registerKryoClasses(new Class[]{SpeedSortKey.class})
7, optimize data structures:
prefer primitive data types over strings
prefer strings over objects
prefer arrays over collections
8, code optimization:
Reduce memory usage
Reduce the data transmission between nodes.
Reduce disk IO
9, tune data locality (data locality levels):
a, PROCESS_LOCAL: the data is in the memory of the executor that runs the task
b, NODE_LOCAL: the data is on the disk of the node that runs the task, or in the memory of another executor on the same node
c, NO_PREF: the data is in an external database, so no location is preferred
d, RACK_LOCAL: the data is on the disk or in executor memory of another node in the same rack
e, ANY: the data is on a different rack
Tuning parameters:
spark.locality.wait.process (default 3s): how long to wait before downgrading from the process level to the node level
spark.locality.wait.node (default 3s): how long to wait before downgrading from the node level to the next level
spark.locality.wait.rack (default 3s): how long to wait before downgrading from the rack level
The driver first sends a task at the highest locality level. If the task still cannot run after waiting 3s and retrying 5 times, the driver downgrades the locality level and sends the task again, and so on down the levels.
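The locality waits can be set on the SparkConf as well as with --conf. A minimal sketch; the values are hypothetical and only show which keys exist, they are not recommendations.

```scala
import org.apache.spark.SparkConf

object LocalityConfSketch {
  // Hypothetical values: wait longer for process and node locality,
  // give up on rack locality sooner.
  def build(): SparkConf =
    new SparkConf()
      .setAppName("locality-sketch")
      .set("spark.locality.wait", "3s")         // fallback for the three keys below
      .set("spark.locality.wait.process", "6s")
      .set("spark.locality.wait.node", "6s")
      .set("spark.locality.wait.rack", "1s")
}
```

Whether longer waits help depends on task length: for short tasks the wait itself can cost more than reading the data remotely.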
10, memory tuning
Give tasks enough memory to run, so that frequent GC (minor GC or full GC, during which the JVM stops working) is avoided
To do so, lower the memory reserved for shuffle aggregation and for storing RDDs and broadcast variables
Parameters:
Static memory management:
lower spark.shuffle.memoryFraction (default 0.2)
lower spark.storage.memoryFraction (default 0.6)
Unified memory management:
spark.memory.fraction (default 0.6)
11, shuffle tuning
spark.reducer.maxSizeInFlight: the maximum amount of data pulled per fetch, default 48m
spark.shuffle.io.maxRetries: the number of retries when pulling data fails, default 3
spark.shuffle.io.retryWait: the wait between retries, default 5s
spark.shuffle.sort.bypassMergeThreshold: default 200
12, off-heap memory tuning
Long connection waits between nodes: --conf spark.core.connection.ack.wait.timeout=300
Normally, when a reduce task pulls data from a map task:
the data is first read into JVM heap memory, copied from there to the network card buffer, and then sent over the network
With off-heap memory the JVM heap is skipped: the data goes from disk directly to the network card buffer and is then sent out
By default the off-heap memory of each executor is 1/10 of the executor memory; in most cases it needs to be raised to 2G or more
Off-heap memory parameters:
On YARN:
--conf spark.yarn.executor.memoryOverhead=2048 (in MB)
On standalone:
--conf spark.executor.memoryOverhead=2048 (in MB)
13, handling data skew
What data skew looks like:
MR: one task processes far more data than the other tasks
Hive: in a table, one field has a very large number of rows for the same key, while the other keys have very few rows
Spark: one partition of an RDD holds far more data than the other partitions
Solutions to data skew:
Hive ETL preprocessing:
Scenario: Spark frequently operates on a skewed Hive table, and every run joins or aggregates on the skewed field
Solution: consider whether the skewed step can be moved upstream into Hive, so that the skew no longer appears in Spark. This treats the symptom rather than the cause, because the skew still occurs in Hive.
Filter out the few skewed keys:
Scenario: assess how much the few skewed keys matter to the business; if the impact is small, drop them and analyse the rest
Solution: use the filter operator to remove the skewed keys
Increase the parallelism:
Scenario: large data volume, few partitions, many distinct keys; raising the parallelism helps directly
Solution: pass a larger partition count to the shuffle operator
Two-stage aggregation:
Scenario: few partitions, many records sharing the same key, large data volume
Solution: add a random prefix to each key and aggregate, then remove the prefix and aggregate again to get the final result
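The two aggregation rounds can be sketched for a per-key sum as follows. The object name, the "_" separator and the salts parameter are assumptions made for illustration; the approach only applies to aggregations that can be merged from partial results, such as sum, count, max or min.

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TwoStageAggSketch {
  // Sums values per key in two rounds so that a hot key is spread over `salts` reducers.
  def sumByKey(pairs: RDD[(String, Long)], salts: Int): RDD[(String, Long)] =
    pairs
      .map { case (k, v) => (s"${Random.nextInt(salts)}_$k", v) } // 1. add a random prefix
      .reduceByKey(_ + _)                                         // 2. partial aggregation
      .map { case (salted, v) =>                                  // 3. remove the prefix
        (salted.substring(salted.indexOf('_') + 1), v)
      }
      .reduceByKey(_ + _)                                         // 4. final aggregation

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("two-stage-sketch")
    val sc = SparkContext.getOrCreate(conf)
    val pairs = sc.parallelize(Seq(("a", 1L), ("a", 2L), ("b", 3L), ("a", 4L)), 2)
    sumByKey(pairs, salts = 4).collect().foreach(println)
    sc.stop()
  }
}
```

The prefix is always digits followed by the first "_", so keys that themselves contain "_" are restored correctly.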
Convert reduce join to map join:
Scenario: two RDDs, one large and one small, are joined, and the join key is skewed
Solution: collect the small RDD to the driver, broadcast it, and do the join inside a map-type operator on the large RDD. No shuffle takes place, so there is no data skew
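A map-side join along these lines could look like the sketch below. It assumes that the small RDD fits in driver and executor memory and that its keys are unique (collectAsMap keeps only one value per key); the result has inner-join semantics. The object and method names are made up.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object MapJoinSketch {
  // Inner join of `large` with `small` without a shuffle.
  def join(sc: SparkContext,
           large: RDD[(String, Int)],
           small: RDD[(String, String)]): RDD[(String, (Int, String))] = {
    // Collect the small side to the driver and broadcast it to every executor.
    val lookup = sc.broadcast(small.collectAsMap().toMap)
    // Probe the broadcast map for each record of the large side.
    large.flatMap { case (k, v) =>
      lookup.value.get(k).map(s => (k, (v, s))).iterator
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("map-join-sketch")
    val sc = SparkContext.getOrCreate(conf)
    val large = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4)), 2)
    val small = sc.parallelize(Seq(("a", "x"), ("b", "y")), 1)
    join(sc, large, small).collect().foreach(println)
    sc.stop()
  }
}
```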
Sample the skewed keys and split the join:
Scenario: two RDDs are joined; one is large and skewed on a few keys, the other is also large, so the map join above cannot be used
Solution: sample to find the skewed keys and split both RDDs on them; for the skewed part, add a random prefix on one side and expand the other side, then join; join the remaining part normally and union the two results
Join with a random prefix and an expanded RDD:
Scenario: two RDDs are joined; one is large with many skewed keys, and the other is also large
Solution: add a random prefix to every key of the skewed RDD and expand the other RDD by every prefix, then join. This requires a large amount of memory
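The prefix-and-expand join can be sketched as below. The object name, the "_" separator and the factor n are assumptions for illustration. Every record of the skewed side gets one of n random prefixes, and every record of the other side is replicated n times, once per prefix, which is where the memory cost comes from.

```scala
import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SaltedJoinSketch {
  // Inner join that spreads each key of `skewed` over n salted keys.
  def join(skewed: RDD[(String, Int)],
           other: RDD[(String, String)],
           n: Int): RDD[(String, (Int, String))] = {
    // Skewed side: one random prefix per record.
    val salted = skewed.map { case (k, v) => (s"${Random.nextInt(n)}_$k", v) }
    // Other side: one copy of each record for every possible prefix.
    val expanded = other.flatMap { case (k, w) => (0 until n).map(i => (s"${i}_$k", w)) }
    // Each salted record meets exactly one copy of its partner; strip the prefix afterwards.
    salted.join(expanded).map { case (sk, pair) =>
      (sk.substring(sk.indexOf('_') + 1), pair)
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("salted-join-sketch")
    val sc = SparkContext.getOrCreate(conf)
    val skewed = sc.parallelize(Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4)), 2)
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")), 2)
    join(skewed, other, n = 3).collect().foreach(println)
    sc.stop()
  }
}
```

The result matches a plain join; only the distribution of the work over the reduce tasks changes.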