Some basic Spark concepts: Job, Stage, Task, and more

Spark splits a job into multiple Stages, using shuffle dependencies (wide dependencies) as the boundaries. The final stage is called the ResultStage; every other stage is a ShuffleMapStage. The division rules are listed below, followed by a short code sketch.

  • 1. Working backward from the final RDD, the scheduler breaks the lineage whenever it meets a wide dependency; when it meets a narrow dependency, it adds that RDD to the current Stage.
  • 2. The number of Tasks in each Stage is determined by the number of partitions of the last RDD in that Stage.
  • 3. The tasks in the last Stage are ResultTasks; the tasks in all earlier Stages are ShuffleMapTasks.
  • 4. The operator that represents the current Stage must be the last computation step of that Stage.
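
To make the division concrete, here is a minimal sketch (the object name, app name, and data are invented, not from the original post): reduceByKey introduces a shuffle dependency, so everything before it forms a ShuffleMapStage and the final collect runs in the ResultStage.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    // Local master used only for this sketch; any cluster manager behaves the same way.
    val sc = new SparkContext(new SparkConf().setAppName("StageDemo").setMaster("local[*]"))

    val words  = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2) // last RDD of Stage 0 has 2 partitions
    val pairs  = words.map(w => (w, 1))      // narrow dependency: stays in the same Stage
    val counts = pairs.reduceByKey(_ + _)    // wide (shuffle) dependency: Stage boundary

    // collect() is the action that triggers the Job:
    //   Stage 0 (ShuffleMapStage) runs 2 ShuffleMapTasks, one per partition of `words`;
    //   Stage 1 (ResultStage) runs ResultTasks that return results to the driver.
    println(counts.collect().mkString(", "))
    sc.stop()
  }
}
```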

Calculation principles

 

Some Spark concepts:

Driver: The notion of a Driver is used by many distributed frameworks, such as Hive. In Spark, the Driver runs the Application's main() function and creates the SparkContext; the purpose of creating the SparkContext is to prepare the runtime environment for the Spark application. The SparkContext is responsible for communicating with the ClusterManager, applying for resources, and allocating and monitoring tasks. When the Executors have finished their work, the Driver is also responsible for closing the SparkContext. The SparkContext is usually taken to represent the Driver.
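
A minimal sketch of the Driver's role (the object and app names are invented): the process that runs this main() is the Driver, and it owns the SparkContext from creation to shutdown.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MyApp {
  def main(args: Array[String]): Unit = {
    // The JVM that executes this main() is the Driver.
    val conf = new SparkConf().setAppName("MyApp") // the master URL is usually supplied at submit time
    val sc   = new SparkContext(conf)              // prepares the application's runtime environment

    val total = sc.parallelize(1 to 100).reduce(_ + _) // the actual work runs on Executors as Tasks
    println(s"sum = $total")

    sc.stop() // the Driver closes the SparkContext once the Executors have finished
  }
}
```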
       
Executor: A process started for an Application on a Worker node. It runs Tasks and is responsible for keeping data in memory or on disk; each Application has its own independent set of Executors. In Spark on YARN mode, the Executor wraps each Task into a TaskRunner and takes an idle thread from its thread pool to run it.
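
How many Executors an Application gets, and how large their thread pools and memory are, is set through configuration. The following is a hedged sketch (the numbers are arbitrary), using standard Spark properties:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("ExecutorDemo")
  .set("spark.executor.instances", "4") // number of Executor processes (Spark on YARN)
  .set("spark.executor.cores", "2")     // each Executor's thread pool can run 2 Tasks concurrently
  .set("spark.executor.memory", "4g")   // memory per Executor for caching and execution
```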
 
Worker: Any node in the cluster that can run Application code. In Standalone mode these are the nodes listed in the slave configuration file; in Spark on YARN mode they are the NodeManager nodes.
 
Task: A unit of work sent to an Executor, analogous to the MapTask and ReduceTask concepts in Hadoop MapReduce. It is the basic unit of execution of an Application and represents the smallest processing unit over a single data partition. Tasks come in two kinds: ShuffleMapTask and ResultTask. A ShuffleMapTask runs its computation and divides its output into buckets (an ArrayBuffer), while a ResultTask runs its computation and sends its output (for the data partition the task corresponds to) back to the driver. Multiple Tasks make up a Stage, and Task scheduling and management are handled by the TaskScheduler described below.
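
A small sketch of the one-Task-per-partition idea (assumes an existing SparkContext `sc`, for example in spark-shell):

```scala
// 8 partitions means the Stage that computes this RDD runs 8 Tasks.
val data = sc.parallelize(1 to 1000, numSlices = 8)
println(data.getNumPartitions) // 8

// Each ResultTask processes exactly one partition and ships its result back to the driver.
val partitionSums = data.mapPartitions(it => Iterator(it.sum))
println(partitionSums.collect().toSeq) // one element per partition, i.e. per Task
```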
 
TaskSet: Represents a set of related tasks that have no shuffle dependencies among them. A TaskSet is submitted to the lower-level TaskScheduler for management.
 
Stage: Once a Job is determined, the Spark scheduler (DAGScheduler) divides the job into multiple Stages according to its computation steps. Stages are divided into ShuffleMapStages and ResultStages, and each Stage contains one TaskSet.
 
Job: Spark evaluates lazily; real computation is triggered only when an action operator is reached. A Job is the computation generated by one action operator and consists of one or more Stages.
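
For example (a minimal sketch assuming an existing SparkContext `sc`), nothing runs on the cluster until the action is called:

```scala
// Transformations only build the RDD lineage; no Job is submitted yet.
val squares = sc.parallelize(1 to 10).map(n => n * n) // lazy
val evens   = squares.filter(_ % 2 == 0)              // still lazy

// The action triggers a Job, which the DAGScheduler then splits into Stages.
println(evens.collect().mkString(", "))
```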
 
Cluster Manager: An external service that acquires resources on the cluster. There are three types (a short code sketch follows the list):
Standalone: Spark's native resource management, where the Master is responsible for allocating resources.

Apache Mesos: A resource-scheduling framework with good compatibility with Hadoop MapReduce.

Hadoop YARN: Here YARN mainly refers to the ResourceManager.
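
As a hedged illustration (the host names and ports are placeholders), the choice of cluster manager is expressed through the master URL passed to SparkConf or to spark-submit:

```scala
import org.apache.spark.SparkConf

// Pick exactly one master URL; the hosts and ports below are placeholders.
val standalone = new SparkConf().setMaster("spark://master-host:7077") // Standalone: the Master allocates resources
val mesos      = new SparkConf().setMaster("mesos://mesos-host:5050")  // Apache Mesos
val yarn       = new SparkConf().setMaster("yarn")                     // Hadoop YARN: talks to the ResourceManager
```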

 

DAGScheduler: Builds the DAG of Stages for a Job and submits the Stages to the TaskScheduler. Stages are divided according to the dependencies between RDDs; the DAGScheduler looks for the lowest-overhead scheduling based on the relationships between RDDs and Stages, and then submits each Stage to the TaskScheduler in the form of a TaskSet. In addition, the DAGScheduler handles failures caused by the loss of shuffle output data, which may require resubmitting Stages that ran earlier (task failures not caused by lost shuffle data are handled by the TaskScheduler).
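
One way to see where the DAGScheduler will place Stage boundaries is to print the RDD lineage (a sketch assuming an existing SparkContext `sc`); the shuffle shows up as a ShuffledRDD at a new indentation level:

```scala
val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _) // shuffle dependency: this is where a new Stage begins

// toDebugString prints the lineage the DAGScheduler uses to build the Stage DAG.
println(counts.toDebugString)
```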
 
TaskScheduler: Submits TaskSets to the Workers (the cluster) to run; which Task each Executor runs is assigned here. The TaskScheduler also maintains the running state of all Tasks and retries failed Tasks.
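
The retry behaviour is configurable; for instance (a hedged sketch), spark.task.maxFailures controls how many attempts the TaskScheduler makes before giving up on a Task and failing the Job:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("RetryDemo")
  .set("spark.task.maxFailures", "4") // give up on a Task after 4 failed attempts (Spark's default)
```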

Origin www.cnblogs.com/tansuiqing/p/11360341.html