Spark RDD

2.5 Passing functions in RDD operations

  In real development we often need to define our own operations on RDDs. The key point is that initialization happens on the Driver side, while the program actually runs on the Executor side. This involves cross-process communication, so the objects referenced by the function must be serializable. Let's look at a few examples:
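The original examples are not preserved here, but the classic pitfall can be sketched in plain Python with `pickle` (Spark's JVM serialization behaves analogously; the `Search` class and its fields below are hypothetical, not Spark API):

```python
import pickle
import threading

# Sketch: a driver-side object holding a non-serializable resource cannot
# be shipped to executors; copy just the field the task needs instead.
class Search:
    def __init__(self, query):
        self.query = query
        self.lock = threading.Lock()  # stands in for a non-serializable driver resource

    def is_match(self, s):
        # referencing self inside a task would drag the whole object
        # (including the lock) into the serialized closure
        return self.query in s

def serializable(obj):
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False

s = Search("spark")
print(serializable(s))   # False: the whole object cannot cross processes
q = s.query              # copy only the needed field into a local variable
print(serializable(q))   # True: a plain string serializes fine
```

The usual fix in Spark follows the same pattern: assign the needed field to a local variable before using it inside the closure, or make the enclosing class serializable.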
 
 

 

 

2.6 RDD dependencies

2.6.1 Lineage

  RDDs support only coarse-grained transformations, i.e., a single operation applied to a large number of records. The series of transformations used to create an RDD is recorded as its Lineage, so that lost partitions can be recovered. An RDD's Lineage records the RDD's metadata and transformation behavior; when part of an RDD's partition data is lost, the lost partitions can be recomputed and recovered from this information.
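The recovery idea can be sketched in plain Python rather than Spark itself (the `MiniRDD` class below is a hypothetical toy, not the Spark API): each dataset records its parent and transformation, so a missing partition is recomputed on demand:

```python
class MiniRDD:
    """Toy lineage sketch: each RDD remembers its parent and the
    transformation, so a lost partition can be recomputed."""
    def __init__(self, partitions, parent=None, fn=None):
        self._partitions = partitions  # entries may be None ("lost")
        self.parent = parent
        self.fn = fn

    def map(self, fn):
        # transformations are lazy: record lineage, compute nothing yet
        return MiniRDD([None] * len(self._partitions), parent=self, fn=fn)

    def compute(self, i):
        if self._partitions[i] is not None:
            return self._partitions[i]
        # partition missing: recompute it from the parent via the lineage
        return [self.fn(x) for x in self.parent.compute(i)]

base = MiniRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
print(doubled.compute(1))  # [6, 8], recomputed from lineage
```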
 

 

 

 

 

 

2.6.2 Narrow dependency

  A narrow dependency means that each Partition of the parent RDD is used by at most one Partition of the child RDD.
A vivid metaphor for a narrow dependency is an only child.

 

 

2.6.3 Wide dependency

  A wide dependency means that multiple Partitions of the child RDD depend on the same Partition of the parent RDD, which causes a shuffle.
Summary: a vivid metaphor for a wide dependency is having many children (超生, exceeding the one-child limit).
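The two dependency shapes can be sketched in plain Python (not the Spark API; the partition contents below are made up):

```python
# Narrow: each child partition is computed from exactly one parent
# partition, so no data moves between partitions.
parent = [[1, 2], [3, 4]]
child = [[x * 10 for x in p] for p in parent]  # a map: partition-local

# Wide: a child partition may need records from every parent partition,
# so records must be redistributed by key -- this is the shuffle.
pairs = [[("a", 1), ("b", 1)], [("a", 2), ("b", 3)]]
n_out = 2
buckets = [[] for _ in range(n_out)]
for part in pairs:
    for k, v in part:
        buckets[hash(k) % n_out].append((k, v))  # records cross partitions

print(child)  # [[10, 20], [30, 40]]
```

After the shuffle, all records with the same key sit in the same output partition, which is what key-based operations like `reduceByKey` require.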

 

 

 

 

2.6.4 DAG

  A DAG (Directed Acyclic Graph) is formed by the series of transformations applied to the original RDDs. The DAG is divided into different Stages according to the dependencies between RDDs. For narrow dependencies, the partition transformations are completed within a single Stage. For wide dependencies, because of the shuffle, the next computation can only start after the parent RDD has been fully processed; therefore wide dependencies are the basis for dividing Stages.
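A toy sketch of the Stage cut (not Spark's actual scheduler; the lineage list below, a typical word-count chain, is made up for illustration):

```python
# Each entry is (operator, kind of dependency on its parent).
lineage = [("textFile", "narrow"), ("flatMap", "narrow"),
           ("map", "narrow"), ("reduceByKey", "wide"),
           ("map", "narrow")]

# Cut a new Stage at every wide (shuffle) dependency; narrow
# dependencies stay pipelined inside the current Stage.
stages = [[]]
for op, dep in lineage:
    if dep == "wide":
        stages.append([])  # shuffle boundary: start a new Stage
    stages[-1].append(op)

print(stages)
# [['textFile', 'flatMap', 'map'], ['reduceByKey', 'map']]
```

The number of Stages is the number of wide dependencies plus one.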
 
 

 

 

2.6.5 Task division (important for interviews)

RDD task division has four levels: Application, Job, Stage, and Task.
1) Application: initializing a SparkContext creates one Application (one jar package roughly corresponds to one Application).
  An Application can contain multiple Jobs.
 
2) Job: each Action operator triggers one Job.
  A Job can contain multiple Stages.
 
3) Stage: a Job is split into Stages according to the dependencies between RDDs; every wide dependency marks the boundary of a new Stage.
 
4) Task: a Stage corresponds to a TaskSet; each piece of the Stage sent to a different Executor for execution is one Task.
  One Task corresponds to one unit of parallelism, and the degree of parallelism is determined by the number of data partitions.
 
Note: Application -> Job -> Stage -> Task is a 1-to-n relationship at every level.
 

 

 

Stages are divided from back to front: every time a wide dependency is encountered, a new Stage is cut off and pushed onto a stack.

Execution then proceeds from front to back.
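The 1-to-n relationship can be illustrated with a hypothetical count (the partition numbers below are assumed, not from the source):

```python
# One Action -> one Job; one wide dependency splits it into two Stages;
# each Stage launches one Task per partition of its final RDD.
stage_partitions = [3, 2]        # assumed: Stage 0 has 3 partitions, Stage 1 has 2
tasks_per_stage = stage_partitions  # one Task per partition
total_tasks = sum(stage_partitions)
print(total_tasks)  # 5 Tasks in total for this single Job
```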

 
 
 
 

2.7 RDD caching

  An RDD can cache the results of earlier computations through the persist or cache method. By default, persist() stores the data as deserialized objects in the JVM heap (the MEMORY_ONLY storage level).
 
  However, the data is not cached the moment these methods are called. Only when a later action is triggered is the RDD cached in the memory of the compute nodes, where it is then reused by subsequent operations.
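The lazy-caching behavior can be sketched in plain Python (the `Cached` class below is a hypothetical toy, not the Spark API):

```python
class Cached:
    """Toy sketch: marking something as cached does no work by itself;
    the first action materializes the data, later actions reuse it."""
    def __init__(self, compute):
        self._compute = compute
        self._data = None
        self.times_computed = 0

    def action(self):
        if self._data is None:          # first action triggers the work
            self.times_computed += 1
            self._data = self._compute()
        return self._data               # later actions hit the cache

c = Cached(lambda: [x * 2 for x in range(3)])
c.action()
c.action()
print(c.times_computed)  # 1: computed once, reused afterwards
```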
Origin www.cnblogs.com/LXL616/p/11144953.html