A summary of basic Spark concepts

app

1 A Spark user program consists of one driver program and multiple executors spread across the cluster
2 The driver and the executors use a heartbeat mechanism to confirm that each side is still alive
3 --conf spark.executor.instances=5 --conf spark.executor.cores=8 --conf spark.executor.memory=80G (a minimal driver sketch using these settings follows this list)
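Below is a minimal sketch of such a driver program, purely for illustration; the class name, input path, and hard-coded executor settings are assumptions (in practice the --conf flags above would usually be passed to spark-submit instead).

```scala
import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // The driver builds the session; these settings mirror the --conf flags above.
    val spark = SparkSession.builder()
      .appName("WordCountApp")
      .config("spark.executor.instances", "5")  // 5 executors in the cluster
      .config("spark.executor.cores", "8")      // 8 cores per executor
      .config("spark.executor.memory", "80g")   // 80 GB of heap per executor
      .getOrCreate()
    val sc = spark.sparkContext

    // The logic below is defined on the driver but runs on the executors as tasks.
    val counts = sc.textFile("hdfs:///tmp/input.txt")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
    counts.take(10).foreach(println)

    spark.stop()
  }
}
```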

rdd

1 A resilient distributed dataset
2 A read-only collection of partitioned (Partition) records
3 The first-generation RDD sits at the top of the lineage; it records the partition information of the data a task needs and how to read each partition's data
4 Descendant RDDs do not really store data; they only record lineage information
5 Data is actually read only when a concrete task executes, and that happens only when an action is triggered (see the sketch after this list)
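A small sketch of that laziness, assuming an existing SparkContext `sc` and a placeholder input path:

```scala
val lines   = sc.textFile("hdfs:///tmp/app.log")   // only records how to read each partition
val errors  = lines.filter(_.contains("ERROR"))    // child RDD: lineage recorded, no data read yet
val lengths = errors.map(_.length)                 // still nothing has executed

// Prints the recorded lineage (the chain of parent RDDs) without running anything.
println(lengths.toDebugString)

// Only this action triggers the actual read and computation on the executors.
val total = lengths.reduce(_ + _)
```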

operator

1 Operators are divided into transformations and actions (a small example follows this list)
2 transformation: map filter flatMap union groupByKey reduceByKey sortByKey join
3 action: reduce collect count first saveAsTextFile countByKey foreach
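A brief illustration of the difference, assuming an existing SparkContext `sc`; the sample data is made up:

```scala
val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))

// Transformations only build new RDDs; nothing runs yet.
val summed  = pairs.reduceByKey(_ + _)
val sorted  = summed.sortByKey()
val labeled = sorted.map { case (k, v) => s"$k=$v" }

// Actions trigger execution and return a result (or write output).
println(labeled.count())            // 2
labeled.collect().foreach(println)  // a=4, b=2
```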

partition

1 An RDD's storage mechanism is similar to HDFS: distributed storage
2 An HDFS file is cut into multiple blocks (128 MB by default) for storage; likewise, an RDD is cut into multiple partitions
3 Different partitions may sit on different nodes
4 In the scenario where Spark reads from HDFS, Spark reads each HDFS block into memory and abstracts it as a Spark partition
5 When an RDD is persisted to HDFS, each partition of the RDD is saved as one file; if that file is smaller than 128 MB, the partition can be understood as corresponding to one HDFS block; conversely, if it is larger than 128 MB, it is split into multiple blocks, so the partition corresponds to multiple blocks (see the sketch after this list)
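A sketch of the block/partition correspondence; the paths and the partition count are placeholders, and an existing SparkContext `sc` is assumed:

```scala
// Roughly one partition per 128 MB HDFS block of the input file.
val rdd = sc.textFile("hdfs:///data/large_file.txt")
println(rdd.getNumPartitions)   // how many partitions were actually created

// Each partition becomes one part-xxxxx file under the output directory,
// so repartitioning changes how many files saveAsTextFile produces.
rdd.repartition(4).saveAsTextFile("hdfs:///data/output")
```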

job

1 One action operator triggers one job
2 A job contains many tasks; the task is the logical unit of job execution (my guess is that tasks are split according to partitions)
3 A job can be divided into multiple stages according to whether a shuffle occurs (see the sketch after this list)
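A sketch of the action-to-job relationship, assuming an existing SparkContext `sc` and placeholder paths:

```scala
val wordCounts = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)   // introduces a shuffle, so each job over this RDD has two stages

wordCounts.count()                                   // action #1 -> job 1
wordCounts.saveAsTextFile("hdfs:///tmp/wordcounts")  // action #2 -> job 2
// In the Spark UI, each job lists its stages, and each stage runs one task per partition.
```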

stage

1 The dependencies (lineage) between RDDs are divided into wide dependencies and narrow dependencies
2 Narrow dependency: a partition of the parent RDD is used by at most one partition of the child RDD; no shuffle occurs; the parent-child relationship is "one to one" or "many to one"
3 Wide dependency: a shuffle occurs; the parent-child relationship is "one to many" or "many to many"
4 Spark forms a DAG (directed acyclic graph) from the dependencies between RDDs and submits the DAG to the DAGScheduler; the DAGScheduler divides the DAG into multiple interdependent stages, and the division is based on whether the dependencies between RDDs are wide or narrow
5 A new stage is split off whenever a wide dependency is encountered
6 Each stage contains one or more tasks
7 These tasks are submitted to the TaskScheduler to run as a TaskSet
8 A stage consists of a set of parallel tasks
9 Stage cutting rule: go from back to front, and cut a stage whenever a wide dependency is encountered (see the sketch after this list)
10 A stage starts from an external file or the result of a shuffle, and ends by producing a shuffle or the final result
11 My guess is that stages have a one-to-one relationship with TaskSets
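A sketch of a two-stage DAG, assuming an existing SparkContext `sc`; the data is made up:

```scala
val nums = sc.parallelize(1 to 1000, 8)   // 8 partitions

// Narrow dependencies: each child partition depends on exactly one parent partition,
// so map and filter pipeline together inside a single stage.
val pairs = nums.map(n => (n % 10, n)).filter(_._2 > 100)

// Wide dependency: reduceByKey needs records from many parent partitions for each key,
// which forces a shuffle; the DAGScheduler cuts a new stage at this boundary.
val sums = pairs.reduceByKey(_ + _)

sums.collect()   // the action submits a DAG of two interdependent stages
```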

task

1 Tasks are divided into two types: shuffleMapTask and resultTask
2 My guess is that tasks are split according to partitions
3 --conf spark.default.parallelism=1000 sets the number of parallel tasks (see the sketch after this list)
4 The concepts above are abstractions, as I personally understand them; put simply, they all exist on the driver side, and only the serialized task-related information is sent to the executors for execution
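A sketch of how that setting shows up in task counts; it assumes spark.default.parallelism was set before the SparkContext `sc` was created (for example via the --conf flag above) and that the input RDD has no partitioner of its own:

```scala
val pairs = sc.textFile("hdfs:///tmp/input.txt")
  .flatMap(_.split("\\s+"))
  .map((_, 1))

// First stage: shuffleMapTasks, one per input partition, writing shuffle output.
// Final stage: resultTasks, one per result partition; with no explicit partitioner,
// reduceByKey falls back to spark.default.parallelism for the result partition count.
val counts = pairs.reduceByKey(_ + _)
println(counts.getNumPartitions)   // 1000 under the assumed setting
counts.collect()
```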

 

 


 
