Spark Knowledge Summary

**1. Spark Introduction**

 1) Spark's history: the first public release, version 0.6, came out in 2012, so Spark has been around for about 6 years (as of this writing).
 2) Spark's origins: the AMP Lab at the University of California, Berkeley.
 3) Why Spark is faster than MapReduce (MR):
   ① Spark uses coarse-grained resource scheduling and reuses the resources it acquires.
   ② Spark supports in-memory iteration; MR does not.
   ③ Spark supports pipelining tasks along a DAG (directed acyclic graph).
   ④ Spark can choose among different shuffle implementations for different scenarios, and Spark's shuffle (e.g. SortShuffle) performs better than MR's.
 5) Spark run modes: local, standalone, yarn, mesos.
 6) Spark development languages: Scala, Java, Python, R. (Scala and Java offer the same compatibility and efficiency.)

2. RDD (Resilient Distributed Dataset) (key topic)

1) The five key properties of an RDD (important):

     1. An RDD is made up of a series of partitions. (Number of partitions = number of input splits ≈ number of blocks; Spark has no file-reading logic of its own and relies on MR's file-reading (InputFormat) mechanism.)
     2. Every operator an RDD provides is actually applied to each partition.
     3. An RDD has a series of dependencies on other RDDs. (This gives computation fault tolerance and reflects the "resilient" in RDD; a parent RDD does not necessarily know who its child RDDs are, but a child RDD always knows its parent RDDs.)
     4. Optional: a partitioner is applied to RDDs whose elements are key-value pairs.
     5. Optional: an RDD provides a list of preferred locations for computation. (Move the computation to the data.)
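
A minimal sketch (assuming a local master and a purely illustrative HDFS path) of how these five properties surface in the public RDD API:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddProperties {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("rdd-properties"))

    // Illustrative input path; any text file would do.
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
    val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    println(pairs.getNumPartitions)                         // 1. a series of partitions
    val perPartition =                                      // 2. operators act on each partition
      pairs.mapPartitions(iter => iter.map { case (k, v) => s"$k=$v" })
    println(perPartition.count())
    println(pairs.dependencies)                             // 3. lineage: dependencies on parent RDDs
    println(pairs.partitioner)                              // 4. optional partitioner for key-value RDDs
    println(pairs.preferredLocations(pairs.partitions(0)))  // 5. optional preferred locations

    sc.stop()
  }
}
```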

2) Operators

     1. Transformation operators
        map (one-to-one), flatMap (one-to-many), filter (one-to-0-or-1), join, leftOuterJoin, rightOuterJoin, fullOuterJoin, sortBy, sortByKey, groupBy, groupByKey, reduceByKey, sample, union, mapPartitions, mapPartitionsWithIndex, zip, zipWithIndex.
     2. Action operators
        count, collect (pulls each task's result back to the Driver), foreach (does not collect task results; it pushes the user's function to each node for execution, so results can only be seen on the compute nodes), saveAsTextFile(path), reduce, foreachPartition, take, first.

        (Ways to view results: the Web UI, or the Worker working directory on each node.)
     3. Caching (control) operators
        cache (equivalent to persist with MEMORY_ONLY),
        persist (MEMORY_ONLY, DISK_ONLY, MEMORY_AND_DISK, ...)
        Notes on caching operators:
        1) A caching operator cannot be immediately followed by an action operator.
        2) The unit of caching is the partition.
        3) Caching is lazy; an action operator is needed to trigger it. (If the application has only one job, there is no need to use caching operators.)
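
Below is a small sketch, with an illustrative input path, that puts the three categories together: the transformations only build the lineage, persist marks the RDD to be cached partition by partition, and nothing runs until the actions at the end:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object OperatorCategories {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("operators"))
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")   // illustrative path

    // Transformations: lazy, only build the lineage.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Caching operator: also lazy; the data is cached partition by partition
    // the first time an action computes this RDD.
    wordCounts.persist(StorageLevel.MEMORY_AND_DISK)

    // Actions: trigger jobs. The second action reuses the cached partitions.
    println(wordCounts.count())
    wordCounts.take(10).foreach(println)

    sc.stop()
  }
}
```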

3. Spark's rough execution flow on a cluster

 1. The Driver distributes tasks to cluster nodes for execution (the computation goes to the data).
 2. Task results are pulled back to the Driver (which may cause an OOM on the Driver).
 The Driver's responsibilities:
     1) Distribute tasks to the compute nodes.
     2) Monitor the execution of tasks (threads).
     3) Resend a task if it fails (with a retry limit).
     4) Pull results back into the Driver process.
 Conclusion: the Driver process communicates with the cluster frequently.

4. Submitting an Application

1. Client mode
    Submit command: spark-submit --deploy-mode client --class <mainClass> <jarPath> <args>
    Characteristic: the Driver process starts on the client node.
    Typical use: test environments.
    Rough execution flow:
        1) The Driver process starts locally on the client.
        2) The Driver asks the Master for resources for the current Application.
        3) The Master receives the request and starts Executor processes on nodes with sufficient resources.
        4) The Driver distributes tasks to the Executors for execution.
2. Cluster mode
    Submit command: spark-submit --deploy-mode cluster --class <mainClass> <jarPath> <args>
    Characteristic: each time the application is launched, the Driver process starts on a randomly chosen node.
    Typical use: production environments.
    Rough execution flow:
        1) The client runs spark-submit --deploy-mode cluster --class <mainClass> <jarPath> <args>, which starts a spark-submit process.
        2) That process asks the Master for resources for the Driver. By default the Driver process needs 1G of memory and 1 core.
        3) The Master picks a random Worker node and starts the Driver process on it.
        4) Once the Driver has started successfully, the spark-submit process exits, and the Driver then asks the Master for resources for the current Application.
        5) The Master receives the request and starts Executor processes on nodes with sufficient resources.
        6) The Driver distributes tasks to the Executors for execution.
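
For reference, a minimal sketch of the kind of application class these spark-submit commands launch (the object name and paths are placeholders); the master URL and deploy mode come from spark-submit rather than being hard-coded:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical main class passed to --class, e.g. com.example.WordCount
object WordCount {
  def main(args: Array[String]): Unit = {
    // Only the app name is set here; master and deploy mode are supplied at submit time.
    val conf = new SparkConf().setAppName("word-count")
    val sc = new SparkContext(conf)

    val input  = args(0)   // e.g. an HDFS input path passed as an application argument
    val output = args(1)   // e.g. an HDFS output path
    sc.textFile(input)
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(output)

    sc.stop()
  }
}
```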

1. Things to know before learning task scheduling

1.1. Some Spark terminology
  1.1.1. Task-related terms
    Application: a user-written program (Driver program + Executor program).
    Job: triggered by an action operator.
    Stage: a set of tasks, e.g. a set of map tasks.
    Task: a thread running in the cluster; the smallest unit of execution.

  1.1.2. Resource-related terms
    Master: the resource-management master node.
    Worker: a resource-management slave node.
    Executor: the process that executes tasks.
    ThreadPool: the thread pool (inside the Executor process).

 1.2. Dependencies between RDDs
      1.2.1. Wide dependencies
              Between a parent RDD and a child RDD, the relationship between partitions is one-to-many. Generally speaking, a wide dependency causes a shuffle. (By default, the RDD returned by groupByKey has the same number of partitions as its parent RDD; if you pass an Int value to groupByKey, the number of partitions becomes that value.)

      1.2.2. Narrow dependencies
              Between a parent RDD and a child RDD, the relationship between partitions is one-to-one; this kind of dependency does not involve a shuffle.

      1.2.3. What wide and narrow dependencies are for
              Their purpose is to cut a job into individual stages.
              How stages are cut: dependencies between stages are wide; dependencies inside a stage are narrow.

              So the next question is: why do we need to cut a job into stages?
              Answer: once a job is cut into stages, it is easy to divide each stage into individual tasks (draw a line linking the related child and parent RDD partitions inside the stage), and each task can then be run as a pipeline.
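
A small local-mode sketch of how narrow and wide dependencies look in code; the shuffle introduced by reduceByKey is exactly where a new stage begins, which toDebugString makes visible:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Dependencies {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("deps"))

    val words    = sc.parallelize(Seq("a", "b", "a", "c"), numSlices = 2)
    val pairs    = words.map((_, 1))          // narrow: one parent partition -> one child partition
    val counts   = pairs.reduceByKey(_ + _)   // wide: causes a shuffle, so a new stage begins
    val grouped  = pairs.groupByKey(10)       // passing an Int fixes the number of result partitions

    println(pairs.dependencies)               // OneToOneDependency (narrow)
    println(counts.dependencies)              // ShuffleDependency (wide)
    println(grouped.getNumPartitions)         // 10
    println(counts.toDebugString)             // the indentation marks the shuffle/stage boundary

    sc.stop()
  }
}
```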

              Next question: what exactly is the computation logic that an RDD stores? An example:

              In this Application there is one job, one stage, and two tasks.
              task0: this line threads through the computation logic of all the partitions it covers and composes it like an expanded recursive function, fun2(fun1(textFile(b1))); ideally this computation logic is sent to the node that holds block b1 or one of its replicas. task1 follows the same logic. Also note: tasks compute in a pipelined fashion.
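
To illustrate that pipelined, record-at-a-time composition (fun1 and fun2 below are just stand-ins for the functions in the example), a task effectively applies the fused function to each record of its partition without materializing intermediate results:

```scala
object PipelineSketch {
  def main(args: Array[String]): Unit = {
    // Stand-ins for the functions in the example above.
    val fun1: String => String = _.toUpperCase
    val fun2: String => Int    = _.length

    // One partition's records, as a task's input iterator would provide them.
    val partitionIterator: Iterator[String] = Iterator("spark", "rdd", "pipeline")

    // Pipelined computation: each record flows through fun2(fun1(record)) one at a time;
    // no intermediate collection of fun1 results is ever built.
    val result = partitionIterator.map(fun1).map(fun2)

    result.foreach(println)   // 5, 3, 8
  }
}
```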

 1.3. Questions to understand before learning task scheduling

       1.3.1. When does a task in a stage (pipelined computation) land its results on disk or elsewhere?
              1) If the stage is followed by an action operator:
                  saveAsTextFile: writes each pipeline's result to the specified directory.
                  collect: pulls each pipeline's result back into the Driver's memory.
                  count: counts the records in each pipeline's result and returns the count to the Driver.
              2) If the stage is followed by another stage:
                  Disk is written during the shuffle-write phase. (Why write to disk during shuffle write? So that if a reduce task fails to fetch its data, the shuffled data can be fetched again directly from disk.)

       1.3.2. Is Spark especially memory-hungry while computing?
              No. Spark computes in pipelines, and pipelines are not especially memory-hungry, even when many pipelines run at the same time.

       1.3.3. Which scenario consumes the most memory?
              Using caching operators consumes the most memory, especially cache.

       1.3.4. If a pipeline contains cache logic, how is the data cached?
              With cache, when a task runs successfully (i.e. when an action operator is reached), that task's result is cached in memory.

       1.3.5. RDD stands for "resilient distributed dataset"; why is it still called a dataset if it does not store data?
              Although an RDD cannot store data, it can operate on data.

2. Task scheduling
2.1. The task scheduling process

          1) DAGScheduler: cuts the DAG (directed acyclic graph) into stages according to the RDDs' wide/narrow dependencies, wraps each stage into another object, a TaskSet (TaskSet = stage), and hands the TaskSets one by one to the TaskScheduler.

          2) TaskScheduler: after receiving a TaskSet, the TaskScheduler iterates over it to get each task, calls into HDFS to find out where the data is located, and based on those locations dispatches each task to the thread pool of the Executor process on the corresponding Worker node.

          3) TaskScheduler: the TaskScheduler also tracks the execution of each task. If a task fails, the TaskScheduler retries it, by default up to three times. If it still fails after three retries, the stage that the task belongs to fails, and the TaskScheduler reports this to the DAGScheduler.

          4) DAGScheduler: on receiving the stage-failure report, the DAGScheduler resubmits the failed stage; stages that already succeeded are not resubmitted, only the failed stage is retried.
          (Note: if the DAGScheduler has retried four times and the stage still fails, the job fails, and a job is not retried.)

 2.2. Three ways to supply configuration
          1) In code, using SparkConf.

          2) At submit time, using --conf:
               spark-submit --master ... --conf k=v   (to set several configuration values, use multiple --conf flags)

          3) In Spark's configuration file, spark-defaults.conf.
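
A brief sketch of option 1; spark.task.maxFailures and spark.serializer are real Spark properties, the values are only illustrative, and the comments show the equivalent forms for options 2 and 3:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConfExample {
  def main(args: Array[String]): Unit = {
    // 1) Configure in code with SparkConf.
    val conf = new SparkConf()
      .setAppName("conf-example")
      .set("spark.task.maxFailures", "4")   // task retry limit
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    // 2) The same settings could instead be passed at submit time, e.g.
    //    spark-submit --conf spark.task.maxFailures=4 --conf spark.serializer=... <jar>
    // 3) Or placed in conf/spark-defaults.conf.
    // Settings in code override --conf, which overrides spark-defaults.conf.

    val sc = new SparkContext(conf)
    // ... application logic ...
    sc.stop()
  }
}
```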

 2.3. What is a straggling (lagging) task?

          Once more than 75% of all tasks have completed successfully, Spark checks at a regular interval (100 ms by default): it computes 1.5 times the median execution time of the successfully completed tasks, and any task that has been running longer than that is considered a straggler.
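
This straggler detection corresponds to Spark's speculative-execution settings. A sketch of how they might be enabled (the property names are real configuration keys; the values shown are Spark's defaults, given here only for illustration):

```scala
import org.apache.spark.SparkConf

object SpeculationConf {
  // Illustrative configuration for speculative execution (straggler re-launch).
  val conf = new SparkConf()
    .setAppName("speculation-example")
    .set("spark.speculation", "true")            // enable speculative execution (off by default)
    .set("spark.speculation.quantile", "0.75")   // start checking once 75% of tasks have finished
    .set("spark.speculation.multiplier", "1.5")  // straggler = runs 1.5x longer than the median
    .set("spark.speculation.interval", "100ms")  // how often Spark checks for stragglers
}
```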


 2.4. A few small questions about task scheduling

       2.4.1. If 1 TB of data takes 30 minutes to process on a single machine but two hours with Spark (4 nodes), why?
          1) Computation skew occurred: a large amount of data was given to a small number of tasks, while a small amount of data was spread over a large number of tasks.
          2) Speculative execution was enabled (re-launching slow tasks adds overhead).

       2.4.2. For ETL (data-cleaning) workloads, does enabling speculative execution and the retry mechanisms affect the final result?
         Yes: the target database may end up with duplicate data.
         Solutions:
             1) Turn off the various speculation and retry mechanisms.
             2) Use a transactional table.

1. Introduction
  
  When we run a Spark application, the first step is of course to write the Spark Application itself. The resource scheduler is then called to request resources for the Driver; once that succeeds, resources are requested from the Master for the Application, and after they are granted the task scheduler distributes tasks to the nodes for execution, computing in parallel and distributed across the nodes.

2. Prerequisite knowledge
  For an Application, the resource is the Executor; for an Executor, the resources are memory and cores.

The Master holds several collections: workers, waitingDrivers, and waitingApps. Below is a brief introduction to each; they are declared in the source code. For a more detailed understanding, read the source yourself (it is written in Scala; for a quick introduction to the language, see "Scala fast learning").

val workers = new HashSet[WorkerInfo]
val waitingDrivers = new ArrayBuffer[DriverInfo]
val waitingApps = new ArrayBuffer[ApplicationInfo]

In the code above, WorkerInfo represents the information of a Worker node, DriverInfo the information sent with a Driver request, and ApplicationInfo the information sent with an Application request.

val workers = new HashSet[WorkerInfo]
  The workers collection stores the Worker node information in a HashSet, which avoids storing duplicate Worker nodes. Why avoid duplicates? A Worker node may go down for some reason; at the next heartbeat the Master notices that the node is down, removes that node's object from workers, and adds it back the next time the node reports in again. In theory, then, there are no duplicate Worker nodes. There is one special case, though: if a Worker goes down and restarts before the next heartbeat, workers will temporarily contain duplicate information for that Worker.

val waitingDrivers = new ArrayBuffer[DriverInfo]
  When a client requests resources for a Driver from the Master, the information of that Driver request is wrapped on the Master node into a DriverInfo and added to waitingDrivers. The Master monitors this collection; when it is non-empty, it means a client has requested resources from the Master. The Master then looks through the workers collection, finds a Worker node that satisfies the requirements, and starts the Driver there. Once the Driver has started successfully, the request is removed from waitingDrivers.

val waitingApps = new ArrayBuffer[ApplicationInfo]
  After the Driver starts successfully, it requests resources from the Master for the Application, and the request information is stored in the waitingApps collection on the Master node. Similarly, when waitingApps is non-empty, it means a Driver has requested resources for its Application. The Master then looks through the workers collection and finds suitable Worker nodes on which to start Executor processes. By default, each Worker starts only one Executor for each Application, and that Executor uses 1G of memory and all of the Worker's cores. Once the Executors have started, the request is removed from waitingApps.

Note: we said above that the Master "monitors" these three collections. How exactly does it do that?
  The Master does not dedicate threads to watching the three collections; that would be a waste of resources. Instead, the Master "monitors" changes to them: whenever one of the three collections changes (an element is added or removed), the schedule() method is called, and schedule() encapsulates the handling logic described above.

4. Detailed steps
  1. The submit command starts a spark-submit process on the client (used to request resources for the Driver).
  2. Resources are requested from the Master for the Driver; the Driver request information is added to the Master's waitingDrivers collection. The Master looks through the workers collection and picks a suitable Worker node.
  3. The Driver process is started on the chosen Worker node (once the Driver has started, spark-submit's job is done and the process exits).
  4. The Driver requests the resources the Application needs to run (here "resources" means Executor processes). The Application's resource request is added to the Master's waitingApps collection. Based on the requested resources, the Master works out which Worker nodes to use (and how much of each node's resources), and starts Executor processes on those nodes.
  (Note: Executors are started in a round-robin fashion. Each Executor occupies 1G of memory and all of the cores managed by its Worker.)
  5. The Driver can then distribute tasks to the Executor processes on the Worker nodes for execution.

5. Resource scheduling conclusions
  1. By default, each Worker starts one Executor for each Application, and by default each Executor uses 1G of memory and all of the cores managed by that Worker.
  2. If you want to start more than one Executor on a Worker, specify the number of cores each Executor uses when submitting the Application: spark-submit --executor-cores ...
  3. By default, Executors are started in a round-robin fashion, which to some extent helps data locality.

What is round-robin startup, and why start Executors in a round-robin fashion?

Round-robin startup: start them one at a time, going around the nodes. For example, suppose there are five people and each should receive an apple and a banana. Distributing round-robin means first giving each of the five people an apple, and only once the apples are handed out, starting on the bananas.

Why start Executors round-robin? We certainly want computation to go to the data: where the data is stored locally, compute on it directly, instead of transferring the data over the network and then computing. Suppose we have n Worker nodes and only started Executors on the nodes that store the data; then only a few Workers would compute while most sit idle, which is clearly unacceptable. So we start Executors round-robin, so that tasks can run on every node.

Because computing on the nodes that store the data involves no network transfer, it is bound to be faster, and those nodes will execute more tasks. This way cluster resources are not wasted, computation still happens on the nodes that store the data, and to some extent data locality is preserved.
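
Coming back to conclusion 2 above, a sketch of how the per-Executor resources might also be set in code for standalone mode (these are real Spark properties; the values are illustrative and correspond to the --executor-cores, --executor-memory and --total-executor-cores flags):

```scala
import org.apache.spark.SparkConf

object ExecutorResources {
  // Illustrative standalone-mode resource settings.
  val conf = new SparkConf()
    .setAppName("resource-example")
    .set("spark.executor.cores", "2")     // cores per Executor (--executor-cores)
    .set("spark.executor.memory", "2g")   // memory per Executor (--executor-memory)
    .set("spark.cores.max", "10")         // total cores for the Application (--total-executor-cores)
}
```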

6. Coarse-grained and fine-grained resource scheduling in Spark
  6.1. Coarse-grained (the "rich kid")
    All resources are requested before any task runs, and they are released only after all tasks have finished.
    Pros: tasks do not have to request their own resources before running, which saves startup time.
    Cons: resources are released only after all tasks have finished, so the cluster cannot be fully utilized.

  6.2. Fine-grained (the "poor kid")
    When the Application is submitted, each task requests its own resources; a task runs once its request is granted and releases its resources as soon as it finishes.
    Pros: each task releases its resources immediately when it finishes, which helps make full use of the cluster.
    Cons: because every task must request its own resources, task startup takes longer, which in turn lengthens the startup time of the stage, the job, and the Application.

7. Deepening understanding
  Here are a few small questions; working them out by hand pays off...

Premise: suppose we have 5 Worker nodes, and each Worker node provides 10G of memory and 10 cores.

1. spark-submit --master ... --executor-cores 2 --executor-memory 2G ...  How many Executor processes will be started in the cluster?
  Answer: 25
  Analysis: each Executor process uses 2 cores + 2G of memory, so one Worker node can start 5 Executors. Since there are 5 Worker nodes, a total of 5 * 5 = 25 Executor processes can be started.

2. spark-submit --master ... --executor-cores 3 --executor-memory 4G ...  How many Executor processes will be started in the cluster?
  Answer: 10
  Analysis: with this command, one Executor process needs 3 cores + 4G of memory, and a Worker node has 10 cores and 10G of memory. Going by cores, 10 / 3 = 3, so 3 Executors could start; going by memory, 10 / 4 = 2, so only 2 can start. Memory is the bottleneck, so 3 Executors cannot be started: each Worker starts 2 Executors, and 5 * 2 = 10.

3. spark-submit --master ... --executor-cores 2 --executor-memory 2G --total-executor-cores 10 ...  How many Executor processes will be started in the cluster? (--total-executor-cores: the maximum number of cores the whole Application may use)
  Answer: 5
  Analysis: this question adds one constraint to the previous two: at most 10 cores may be used. Without that constraint it would be the same as question 1, and 25 Executors could start, using 50 cores in total. But the whole Application may use at most 10 cores and each Executor uses 2 cores, so 10 / 2 = 5: only 5 Executors can be started.

4. spark-submit --master ... --executor-cores 2 --executor-memory 2G --total-executor-cores 4 ...  How are the Executors distributed across the cluster? (--total-executor-cores: the maximum number of cores the whole Application may use)
  Answer: on two randomly chosen Worker nodes.
  Analysis: the reasoning is exactly the same as in the previous question. Because of the core limit, 25 Executors cannot be started; only 4 / 2 = 2 Executor processes can start. Since Spark starts Executors in a round-robin fashion, it will pick two Worker nodes at random and start one Executor on each.

5. Formula for the number of Executors started: min(min(wm / em, wc / ec) * wn, tec / ec)
  Notation:
    --executor-cores: ec
    --executor-memory: em
    --total-executor-cores: tec
    worker_num: wn
    worker_memory: wm
    worker_core: wc
  Analysis:
    min(wm / em, wc / ec): from the memory and core requirements we get two candidate counts of Executors a single node could start; take the smaller one.
    x1 = min(wm / em, wc / ec) * wn: Executors per Worker node * number of Worker nodes = total Executors the Workers can start.
    x2 = tec / ec: the total number of Executors allowed by the limit on total cores.
    min(min(wm / em, wc / ec) * wn, tec / ec) = min(x1, x2): take the smaller of the two counts above.
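
A small sketch that encodes this formula and reproduces the answers above (integer division models "how many fit"; the premise is the same 5 Workers with 10G and 10 cores each; for questions 1 and 2, where --total-executor-cores is not given, tec is set high enough not to bind):

```scala
object ExecutorCount {
  /** min(min(wm / em, wc / ec) * wn, tec / ec) */
  def executors(wn: Int, wm: Int, wc: Int, em: Int, ec: Int, tec: Int): Int = {
    val perWorker = math.min(wm / em, wc / ec)   // Executors one Worker can hold
    math.min(perWorker * wn, tec / ec)           // capped by the total-core limit
  }

  def main(args: Array[String]): Unit = {
    val (wn, wm, wc) = (5, 10, 10)                            // 5 Workers, 10G + 10 cores each
    println(executors(wn, wm, wc, em = 2, ec = 2, tec = 50))  // question 1 -> 25 (no real total-core cap)
    println(executors(wn, wm, wc, em = 4, ec = 3, tec = 50))  // question 2 -> 10
    println(executors(wn, wm, wc, em = 2, ec = 2, tec = 10))  // question 3 -> 5
    println(executors(wn, wm, wc, em = 2, ec = 2, tec = 4))   // question 4 -> 2 (on two Workers)
  }
}
```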
