Spark Scheduling Principle Analysis

[TOC]

1. Execution of the WordCount program

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    //Create the SparkConf object and set the app name. The master address could also be set here
    //(local means local mode), but when submitting to a cluster it is usually left out, because the
    //same jar may run on several clusters and hard-coding the master is inconvenient.
    val conf = new SparkConf().setAppName("wordCount")

    //Create the SparkContext object
    val sc = new SparkContext(conf)

    sc.textFile(args(0)).flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))

    sc.stop()
  }
}

The core code is very simple. Let's first look at the textFile function.

SparkContext.scala

  def textFile(
      path: String,
      minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
    assertNotStopped()
    //Specify the file path; the input format class is TextInputFormat, the record key type is LongWritable and the value type is Text
    //map(pair => pair._2.toString) takes the value of each pair and converts it to a String
    //so what is returned is a new collection of the processed values, i.e. an RDD[String]
    //setName(path) names the resulting RDD after its path
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString).setName(path)
  }

The key operation is that a hadoopFile is returned, and it takes several parameters:
path: the file path
classOf[TextInputFormat]: the class that processes the input file, the same TextInputFormat we analysed in MapReduce; don't doubt it, it really is reused here directly.
classOf[LongWritable], classOf[Text]: as you can probably guess, these are the types of the input key and value.

Then a map(pair => pair._2.toString) is executed, converting the value of each KV pair to a String, which the sketch below spells out.
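
For reference, here is a minimal sketch of what the textFile call expands to, calling hadoopFile directly with the same classes. It assumes an existing SparkContext sc, and the input path is just a hypothetical placeholder:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.rdd.RDD

//Read (LongWritable, Text) records with TextInputFormat, keep only the value (the text of the line)
//and convert it to a String -- exactly what textFile does internally.
val lines: RDD[String] =
  sc.hadoopFile("/tmp/input.txt", classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], minPartitions = 2)
    .map(pair => pair._2.toString)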

Now let's take a look at the hadoopFile method.

 def hadoopFile[K, V](
      path: String,
      inputFormatClass: Class[_ <: InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
    assertNotStopped()

    // This is a hack to enforce loading hdfs-site.xml.
    // See SPARK-11227 for details.
    FileSystem.getLocal(hadoopConfiguration)

    // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
    val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
    val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)

    //As you can see, what is finally returned is a HadoopRDD object,
    //constructed with the sc object, the configuration, the input format class, the K/V types and the number of partitions
    new HadoopRDD(
      this,
      confBroadcast,
      Some(setInputPathsFunc),
      inputFormatClass,
      keyClass,
      valueClass,
      minPartitions).setName(path)
  }

Finally, a HadoopRDD object is returned.

Next come flatMap(_.split(" ")) and map((_, 1)), which are relatively simple.

flatMap(_.split(" "))
splits each input line on spaces, turning the line into an array of words,
and then flattens the per-line arrays into one big collection.

map((_, 1))
maps each word to a KV pair that counts it once: K is the word, V is 1.

Then look at .reduceByKey(_ + _).

This groups the KV pairs by key and adds up the values that share the same key, so the value finally obtained for each key is the number of occurrences of that word; the local-mode sketch below walks the whole pipeline through.
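
To make the data flow concrete, here is a separate, minimal local-mode sketch with made-up sample data that runs the same pipeline end to end:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("wordCountDemo").setMaster("local[2]")
val sc = new SparkContext(conf)

val lines  = sc.parallelize(Seq("hello spark", "hello world"))
val words  = lines.flatMap(_.split(" "))   // "hello", "spark", "hello", "world"
val pairs  = words.map((_, 1))             // ("hello",1), ("spark",1), ("hello",1), ("world",1)
val counts = pairs.reduceByKey(_ + _)      // ("hello",2), ("spark",1), ("world",1)

counts.collect().foreach(println)
sc.stop()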

Take a look at this function:
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
    reduceByKey(defaultPartitioner(self), func)
  }
Partitioning happens during this step. By default the number of partitions is 2, a HashPartitioner is used, and a minimum number of partitions can be specified; see the sketch below.
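
If you want to control the partitioning yourself, reduceByKey also has overloads that take an explicit partition count or Partitioner. A small sketch, assuming an existing SparkContext sc (the data and the number 4 are just examples):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("hello", 1), ("spark", 1), ("hello", 1)))

val counts1 = pairs.reduceByKey(_ + _, 4)                      //explicit number of output partitions
val counts2 = pairs.reduceByKey(new HashPartitioner(4), _ + _) //equivalent, with an explicit partitioner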

2. Spark resource scheduling

2.1 Resource Scheduling Process

[Figure 2.1: Spark resource scheduling]

1. After the submit command is issued, a spark-submit process is started on the client (it is used to apply for resources for the Driver).
2. The client applies to the Master for resources for the Driver. The Master adds this Driver's request information to its waitingDrivers collection, then looks through its works collection and picks a suitable Worker node.
3. The Driver process is started on the selected Worker node (once the Driver is up, the spark-submit process has completed its mission and is closed). So the Driver also needs resources of its own; it simply runs on one of the Worker nodes.
4. The Driver process applies for the resources the Application needs to run (here "resources" means Executor processes). The Master adds this Application's request to its waitingApps collection, then works out from the requested resources which Worker nodes to use (and how much of each node's resources), and starts Executor processes on those nodes.
(Note: Executors are started in a round-robin fashion. By default an Executor occupies 1 GB of memory on its node and all the cores that Worker can manage.)
5. The Driver can then distribute tasks to the Executor processes on the Worker nodes to run.

The three collections maintained by the Master

val works = new HashSet[WorkInfo]()
  The works collection stores Worker node information in a HashSet, which avoids storing duplicate worker entries. Why do duplicates need to be avoided? A Worker node may die for some reason; at the next communication with the Master this is reported, the Master removes that node from works, and when the node is usable again it is added back. In theory, then, there should never be duplicate worker entries. But there is a special case: the Worker dies and restarts itself before the next communication, and then works could contain duplicate information for the same worker.

  val waitingDrivers = new ArrayBuffer[DriverInfo]()
  When a client applies to the Master for resources for a Driver, the request information for that Driver is wrapped in a DriverInfo on the Master side and added to waitingDrivers. The Master watches the waitingDrivers object; when the collection is not empty, it means some client has applied to the Master for resources. The Master then looks through the works collection, finds a Worker node that satisfies the requirements, and starts the Driver there. Once the Driver has started successfully, the request is removed from waitingDrivers.

   val waitingApps = new ArrayBuffer[ApplicationInfo]()
  After the Driver starts successfully, it applies to the Master for resources for the Application; the request is stored in the Master's waitingApps object. Likewise, when waitingApps is not empty it means some Driver is applying to the Master for resources for the current Application. The Master then searches the workers collection, finds suitable Worker nodes and starts Executor processes on them. By default each Worker starts only one Executor per Application, and that Executor uses 1 GB of memory and all of the Worker's cores. After the Executors are started, the request is removed from waitingApps.

  Note: it was said above that the Master "monitors" these three collections, so how exactly is that done?
  The Master does not dedicate separate threads to monitoring the three collections; comparatively speaking that would be a waste of resources. Instead, the Master reacts to changes to these collections: whenever one of them changes (an element is added or removed), the schedule() method is called. The schedule() method encapsulates the handling logic described above; a simplified sketch of this pattern follows.
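
The following is not Spark's actual Master code, just a minimal sketch of that "mutate a collection, then call schedule()" pattern, using the simplified names from this article:

import scala.collection.mutable.{ArrayBuffer, HashSet}

//Simplified stand-ins for the information the Master keeps about the cluster
case class WorkInfo(host: String, freeCores: Int, freeMemoryMb: Int)
case class DriverInfo(id: String)
case class ApplicationInfo(id: String)

class SimplifiedMaster {
  val works          = new HashSet[WorkInfo]()
  val waitingDrivers = new ArrayBuffer[DriverInfo]()
  val waitingApps    = new ArrayBuffer[ApplicationInfo]()

  //Every change to one of the three collections is immediately followed by schedule();
  //there is no dedicated thread polling the collections.
  def registerWorker(w: WorkInfo): Unit          = { works += w;          schedule() }
  def requestDriver(d: DriverInfo): Unit         = { waitingDrivers += d; schedule() }
  def requestExecutors(a: ApplicationInfo): Unit = { waitingApps += a;    schedule() }

  private def schedule(): Unit = {
    //1) if waitingDrivers is non-empty, pick suitable workers from works and launch Drivers
    //2) if waitingApps is non-empty, pick suitable workers and launch Executors (round-robin)
    //(the real selection logic is omitted in this sketch)
  }
}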

2.2 The relationship between Application and Executor

1. By default, each Worker starts one Executor for each Application, and by default each Executor uses 1 GB of memory and all the cores that Worker can manage.
2. If you want to start more than one Executor on a Worker, specify the number of cores each Executor uses when submitting the Application (so it does not take all of the worker's cores). Submit option: spark-submit --executor-cores
3. By default, Executors are started in a round-robin fashion, which to some extent favours data locality.

What is round-robin start-up? And why start Executors in a round-robin fashion?

  Round-robin start-up: start one at a time, going around in turn. For example, five people each need to be handed an apple and a banana. The round-robin way of distributing is: first give each of the five people an apple, and only after all the apples have been handed out start handing out the bananas.

  Why use round-robin start-up? We obviously want to move the computation to the data, not the data to the computation. Data that is stored locally is computed on directly, rather than being transferred somewhere else and then computed. Suppose we have n Worker nodes; if we only computed on the nodes that store the data, only a few Workers would be computing and most would sit idle, which is clearly not acceptable. So we start Executors in a round-robin fashion, so that tasks can run on every node.

  Nodes that do store the data have no network transfer, so they are bound to be faster and will therefore execute more tasks. In this way cluster resources are not wasted, computation still happens on the data-storing nodes, and to some extent data locality is preserved. A small sketch of this round-robin allocation follows.
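
Here is a small, self-contained sketch of the round-robin idea itself (not Spark's actual allocation code): cores are handed out to workers one at a time, going around the list, until the application's demand is met. The hosts and numbers are made up:

//Hypothetical example: 3 workers with 4 free cores each, and an application asking for 6 cores
case class WorkerSlot(host: String, var freeCores: Int, var assignedCores: Int = 0)

def assignCoresRoundRobin(workers: Seq[WorkerSlot], coresWanted: Int): Unit = {
  var remaining = coresWanted
  var i = 0
  //One core per worker per pass, like handing out one apple per person per round
  while (remaining > 0 && workers.exists(_.freeCores > 0)) {
    val w = workers(i % workers.length)
    if (w.freeCores > 0) {
      w.freeCores -= 1
      w.assignedCores += 1
      remaining -= 1
    }
    i += 1
  }
}

val workers = Seq(WorkerSlot("node1", 4), WorkerSlot("node2", 4), WorkerSlot("node3", 4))
assignCoresRoundRobin(workers, 6)
workers.foreach(w => println(s"${w.host}: ${w.assignedCores} cores")) //2 cores on each node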

2.3 Spark's coarse-grained resource scheduling

Coarse-grained ("born rich"):

Before the tasks run, all the resources are applied for up front, and they are only released after all the tasks have finished.
Advantage: no task has to apply for resources itself before running, which saves start-up time.
Disadvantage: the resources are only released after all the tasks have finished (i.e. after the whole job completes), so the cluster's resources cannot be fully utilised.

This is the scheduling granularity Spark uses, mainly so that stages, jobs and tasks execute more efficiently.

Fine-grained ("born poor"):

When the Application is submitted, each task applies for resources itself, runs only once it has obtained them, and releases them immediately when it finishes.
Advantage: every task releases its resources as soon as it finishes, which helps make full use of resources.
Disadvantage: because each task has to apply for resources itself, task start-up takes too long, which in turn lengthens the start-up of the stage, the job and the application.

2.4 Limiting resources when submitting with spark-submit

When submitting a task, we can pass parameters that constrain the resources used (a programmatic equivalent is sketched after the list):

--executor-cores : number of cores used by a single executor; if not specified, it defaults to all the cores that worker can offer
--executor-memory : amount of memory used by a single executor, e.g. 1G. The default is 1G
--total-executor-cores : maximum number of cores the whole application may use, to keep it from monopolising the entire cluster
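
These command-line options also have SparkConf equivalents. A sketch of setting the same limits in code, assuming a standalone cluster; the concrete numbers are only examples:

import org.apache.spark.SparkConf

//Roughly equivalent to: --executor-cores 2 --executor-memory 1g --total-executor-cores 6
val conf = new SparkConf()
  .setAppName("wordCount")
  .set("spark.executor.cores", "2")    //cores used by a single executor
  .set("spark.executor.memory", "1g")  //memory used by a single executor
  .set("spark.cores.max", "6")         //total cores the whole application may use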

3. The overall Spark resource scheduling + task scheduling process

3.1 The overall scheduling process

https://blog.csdn.net/qq_33247435/article/details/83653584#3Spark_51

To complete the scheduling of an application, the following stages are gone through:
application -> resource scheduling -> task scheduling (tasks) -> parallel computation -> finished
[Figure 3.1: Spark scheduling process]

As can be seen, after the Driver starts, the following two objects exist:

DAGScheduler:
Based on the wide/narrow dependencies between RDDs, it cuts the DAG (directed acyclic graph) into stages, wraps each stage into a TaskSet object (taskSet = stage), and hands the TaskSets one by one to the TaskScheduler.

taskScheduler:
After the TaskScheduler gets a TaskSet, it iterates over it, takes each task, calls HDFS methods to get the location of the data, and according to that location dispatches the task to the thread pool of the Executor process on the corresponding Worker node. It also monitors how each task runs, and once all the tasks have finished it tells the Master to kill all the executors.

Task scheduling mainly involves the following steps:

1) DAGScheduler: based on the wide/narrow dependencies between RDDs, it cuts the DAG into stages, wraps each stage into a TaskSet object (taskSet = stage), and hands the TaskSets one by one to the TaskScheduler.

2) taskScheduler: after getting a TaskSet, it iterates over it, takes each task, calls HDFS methods to get the location of the data, and according to that location dispatches the task to the thread pool of the Executor process on the corresponding Worker node for execution.

3) taskScheduler: the TaskScheduler tracks the execution of every task. If a task fails, the TaskScheduler tries to resubmit it; by default it retries three times. If it still fails after three retries, the stage this task belongs to has failed, and the TaskScheduler reports this to the DAGScheduler.

4) DAGScheduler: on receiving the report that a stage failed, the DAGScheduler resubmits that failed stage. Stages that have already succeeded are not resubmitted; only the failed stage is retried.
(Note: if the DAGScheduler has retried four times and the stage still fails, the job fails; the job itself is not retried.)

The concept of a straggling task:

Once more than 75% of all the tasks have finished successfully, the scheduler periodically (roughly every 100 ms by default) computes the median execution time of all the tasks that have not yet finished, multiplied by 1.5; any task that has already been running longer than that value counts as a straggling task, as the sketch below illustrates.
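
A small self-contained sketch of that threshold rule (purely illustrative, not Spark's speculation code; the durations are made up):

//Hypothetical running times, in seconds, of the tasks that have not finished yet
val runningTaskDurations = Seq(12.0, 15.0, 18.0, 20.0, 55.0)

//The straggler threshold is the median of the unfinished tasks' durations times 1.5
val sorted = runningTaskDurations.sorted
val median =
  if (sorted.length % 2 == 1) sorted(sorted.length / 2)
  else (sorted(sorted.length / 2 - 1) + sorted(sorted.length / 2)) / 2.0
val threshold = median * 1.5

//Any task running longer than the threshold counts as a straggling task
val stragglers = runningTaskDurations.filter(_ > threshold)
println(s"median = $median, threshold = $threshold, stragglers = $stragglers")
//median = 18.0, threshold = 27.0, stragglers = List(55.0)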

The overall scheduling process:

=======================================Resource scheduling=========================================
1. Start the Master and the standby Master (a standby Master is only needed in a high-availability cluster; otherwise there is none).
2. Start the Worker nodes. After a Worker starts successfully it registers with the Master, and its information is added to the works collection.
3. Submit the Application on the client, which starts a spark-submit process. Pseudo-command: spark-submit --master --deploy-mode cluster --class jarPath
4. The client applies to the Master for resources for the Driver. When the request reaches the Master, the Driver's request information is added to the Master's waitingDrivers collection.
5. When waitingDrivers is not empty, schedule() is called; the Master searches the works collection and starts the Driver on a Worker node that meets the requirements. Once the Driver has started successfully, the request is removed from waitingDrivers and the spark-submit process on the client is closed.
(After the Driver starts successfully, it creates the DAGScheduler object and the TaskScheduler object.)
6. Once the TaskScheduler has been created, it applies to the Master for resources for the Application. When the request reaches the Master it is added to the waitingApps collection.
7. When the elements of waitingApps change, schedule() is called; the Master searches the works collection and starts Executor processes on Worker nodes that meet the requirements.
8. Once the Executor processes have started successfully, the request is removed from waitingApps and the Executors register back with the TaskScheduler, which from then on holds a list of these Executors.

=======================================Task scheduling=========================================
9. Based on the wide/narrow dependencies between RDDs, the job is cut into stages. Each stage consists of a group of tasks, and each task is a pipeline computation.
10. The TaskScheduler dispatches tasks according to where the data is. (How does the TaskScheduler know where the data is? It calls the HDFS API to get the data's blocks and the locations of those blocks.)
11. The TaskScheduler dispatches the tasks and monitors how they run.
12. If a task fails or straggles, it is retried, three times by default.
13. If it still fails after three retries, the task is handed back to the DAGScheduler, which retries the failed stage (and only that failed stage), four times by default.
14. Finally, the Master is told to kill the executors in the cluster and release the resources.


Origin: blog.51cto.com/kinglab/2450771