app
1. A Spark application is a user program consisting of one driver and multiple executors spread across the cluster.
2. The driver and the executors use a heartbeat mechanism to confirm that each executor is still alive.
3. --conf spark.executor.instances=5 --conf spark.executor.cores=8 --conf spark.executor.memory=80G
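The flags in point 3 would normally appear inside a full `spark-submit` invocation. A minimal sketch, assuming YARN cluster mode; the master, class name, and jar path are placeholders, not from the original notes:

```shell
# Hypothetical submission: 5 executors, 8 cores and 80G of memory each.
# --master, --class, and the jar path are placeholder values.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --conf spark.executor.instances=5 \
  --conf spark.executor.cores=8 \
  --conf spark.executor.memory=80G \
  my-app.jar
```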
rdd
1. RDD: Resilient Distributed Dataset.
2. A read-only collection of records organized into partitions.
3. The first-generation RDD sits at the top of the lineage: it records the partition information a task needs and how to read each partition's data.
4. Descendant RDDs do not actually store data; they only record lineage information.
5. Data is actually read only when a concrete task executes, i.e. only when an action triggers the computation.
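The lazy-evaluation idea in points 3-5 can be sketched as a toy model in plain Python (this is an illustration of the concept, not the Spark API): a child "RDD" only records its parent and a function, and data is read only when an "action" is called.

```python
class ToyRDD:
    """Toy model of an RDD: records lineage, computes nothing until an action."""

    def __init__(self, read_fn=None, parent=None, transform=None):
        self.read_fn = read_fn      # first-generation RDD: how to read source data
        self.parent = parent        # lineage: link back to the parent RDD
        self.transform = transform  # function to apply to the parent's records

    def map(self, f):
        # A "transformation": returns a new RDD, reads no data.
        return ToyRDD(parent=self, transform=f)

    def collect(self):
        # An "action": walks the lineage back to the source and actually reads data.
        if self.parent is None:
            return self.read_fn()
        return [self.transform(x) for x in self.parent.collect()]


source = ToyRDD(read_fn=lambda: [1, 2, 3])  # top-level RDD: knows how to read data
child = source.map(lambda x: x * 10)        # only lineage is recorded, nothing computed
print(child.collect())                      # data is read only now -> [10, 20, 30]
```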
operator
1. Operators are divided into transformations and actions.
2. Transformations: map, filter, flatMap, union, groupByKey, reduceByKey, sortByKey, join.
3. Actions: reduce, collect, count, first, saveAsTextFile, countByKey, foreach.
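The transformation/action split has a rough analogy in plain Python's lazy iterators (an analogy only, not the Spark API): `map` builds a pipeline without touching data, the way a Spark transformation builds lineage, while `list` forces evaluation, the way a Spark action triggers a job.

```python
seen = []

def trace(x):
    seen.append(x)  # record when an element is actually processed
    return x * 2

pipeline = map(trace, [1, 2, 3])  # "transformation": nothing runs yet
assert seen == []                 # no data has been touched so far
result = list(pipeline)           # "action": forces the pipeline to evaluate
print(result)                     # [2, 4, 6]
```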
partition
1. RDD storage resembles HDFS: both are distributed.
2. HDFS splits a file into blocks (128M by default) for storage; an RDD is split into partitions.
3. Different partitions may live on different nodes.
4. In the scenario where Spark reads from HDFS, each HDFS block is read into memory as one partition, Spark's abstraction over the RDD's data.
5. When Spark persists an RDD to HDFS, each partition of the RDD is saved as a file. If the file is smaller than 128M, the partition can be understood as corresponding to one HDFS block; conversely, if it exceeds 128M it is split into multiple blocks, so that partition corresponds to multiple blocks.
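The size arithmetic in point 5 can be checked directly (plain arithmetic with the 128M default block size, not a Spark API call):

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size

def blocks_for_partition(partition_size_mb):
    """Number of HDFS blocks one persisted partition file occupies."""
    return max(1, math.ceil(partition_size_mb / BLOCK_SIZE_MB))

print(blocks_for_partition(100))  # under 128M -> one block per partition
print(blocks_for_partition(300))  # over 128M  -> split across 3 blocks
```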
job
1. One action operator triggers one job.
2. A job contains many tasks; the task is the logical execution unit of a job (my guess is that tasks are split according to partitions).
3. Depending on whether a shuffle occurs, a job can be divided into multiple stages.
stage
1. Dependencies (lineage) between RDDs are divided into narrow dependencies and wide dependencies.
2. Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD; no shuffle; the parent-child relationship is "one to one" or "many to one".
3. Wide dependency: produces a shuffle; the parent-child relationship is "one to many" or "many to many".
4. Spark forms a DAG (directed acyclic graph) from the dependencies between RDDs and submits it to the DAGScheduler, which splits the DAG into multiple interdependent stages; the split is made according to whether the dependencies between RDDs are wide or narrow.
5. A stage boundary is cut whenever a wide dependency is encountered.
6. Each stage contains one or more tasks.
7. These tasks are submitted to the TaskScheduler as a TaskSet to run.
8. A stage is composed of a set of parallel tasks.
9. Stage-cutting rule: walk from back to front and cut a stage at every wide dependency.
10. A stage starts either from an external file or from a shuffle result, and ends either by producing a shuffle or by producing the final result.
11. My guess is that stages and TaskSets are in one-to-one correspondence.
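The cutting rule in points 5 and 9 can be sketched on a linear lineage chain (a toy model; real lineages are DAGs and this is not the DAGScheduler API): every wide dependency closes the current stage, so a chain with k wide dependencies yields k + 1 stages.

```python
def cut_stages(dependencies):
    """Split a linear lineage chain into stages, cutting at each wide dependency.

    dependencies: list of "narrow" / "wide" edges ordered from the source RDD
    to the final RDD. Each wide dependency (shuffle) ends the current stage.
    """
    stages = [[]]
    for dep in dependencies:
        stages[-1].append(dep)
        if dep == "wide":   # shuffle boundary: start a new stage
            stages.append([])
    return stages

# e.g. map -> narrow, groupByKey -> wide, filter -> narrow, reduceByKey -> wide
chain = ["narrow", "wide", "narrow", "wide"]
print(len(cut_stages(chain)))  # 2 wide dependencies -> 3 stages
```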
task
1. Tasks are divided into two types: ShuffleMapTask and ResultTask.
2. My guess is that tasks are split according to partitions.
3. --conf spark.default.parallelism=1000 sets the number of parallel tasks.
4. The concepts above are my personal understanding of the abstractions; simply put, everything happens on the driver side, and only the serialized task-related information is sent to the executors for execution.
Reference links:
https://www.cnblogs.com/jechedo/p/5732951.html
https://www.2cto.com/net/201802/719956.html
https://blog.csdn.net/fortuna_i/article/details/81170565
https://www.2cto.com/net/201712/703261.html
https://blog.csdn.net/zhangzeyuan56/article/details/80935034
https://www.jianshu.com/p/3e79db80c43c?from=timeline&isappinstalled=0