Spark Source Code Walkthrough: The SparkSubmit Submission Flow

Environment and Versions

Component   Version
JDK         java version "1.8.0_231" (HotSpot)
Scala       Scala-2.11.12
Spark       spark-2.4.4

Preface

  • To run a Spark application, we usually submit it with ./bin/spark-submit, for example (a programmatic alternative using the launcher API is sketched at the end of this preface):
    spark-submit \
    --master yarn --deploy-mode cluster \
    --num-executors 10 --executor-memory 8G --executor-cores 4 \
    --driver-memory 4G \
    --conf spark.network.timeout=300 \
    --class com.skey.spark.app.MyApp /home/jerry/spark-demo.jar
    
  • After spark-submit runs, it parses the arguments and, depending on the deploy mode, submits the Spark application to the cluster in different ways, for example
    • Standalone
      • client -> runs the main method of the user's class locally, i.e. the Driver is started on the local machine
      • cluster -> uses ClientApp to request a node from the cluster, on which the Driver is then started
    • ON YARN
      • client -> runs locally, same as Standalone
      • cluster -> uses YarnClusterApplication to request a node from the cluster, on which the Driver is then started
  • The overall SparkSubmit submission flow is shown below
    [Figure: SparkSubmit flow diagram]
  • Now let's walk through the source code of the SparkSubmit submission flow
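  • As an aside, the same submission can also be driven programmatically through the org.apache.spark.launcher API (the package whose Main class appears below). This is a minimal sketch, assuming SPARK_HOME is configured and reusing the hypothetical jar path and main class from the example above:
    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}
    
    object SubmitProgrammatically {
      def main(args: Array[String]): Unit = {
        // Roughly equivalent to the spark-submit command shown above
        val handle: SparkAppHandle = new SparkLauncher()
          .setAppResource("/home/jerry/spark-demo.jar")
          .setMainClass("com.skey.spark.app.MyApp")
          .setMaster("yarn")
          .setDeployMode("cluster")
          .setConf(SparkLauncher.EXECUTOR_MEMORY, "8g")
          .setConf("spark.network.timeout", "300")
          .startApplication() // forks a spark-submit process under the hood
    
        // Poll the handle until the application reaches a terminal state
        while (!handle.getState.isFinal) Thread.sleep(1000)
        println(s"final state: ${handle.getState}")
      }
    }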

The Shell Script Part

  • First, we invoke the ./bin/spark-submit shell script with our arguments to submit the application. It in turn calls ./bin/spark-class; the key line is
    exec "${SPARK_HOME}"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
    
  • Note that org.apache.spark.deploy.SparkSubmit is passed to spark-class as its first argument
  • ./bin/spark-class first loads environment variables via ./bin/load-spark-env.sh:
    . "${SPARK_HOME}"/bin/load-spark-env.sh
    
  • The important part is that load-spark-env.sh sources ./conf/spark-env.sh, the file where we normally configure default environment settings such as SPARK_MASTER_HOST, SPARK_WORKER_MEMORY, HADOOP_CONF_DIR and so on. This means the configuration file is re-read every time an application is submitted.
  • Next, ./bin/spark-class locates the java command and the Spark jars and starts a Java process; the most important part of the script is
    build_command() {
      # RUNNER is the java command resolved earlier in the script
      # LAUNCH_CLASSPATH is usually SPARK_HOME/jars/*
      # org.apache.spark.launcher.Main prints the parsed arguments separated by NUL characters ('\0')
      # "$@" are the arguments passed to spark-submit; note that the first one is org.apache.spark.deploy.SparkSubmit
      "$RUNNER" -Xmx128m -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
      printf "%d\0" $?
    }
    
    set +o posix
    CMD=()
    # Run (build_command "$@") and redirect its output into the while loop below
    # read consumes the received stream, splitting on the NUL character,
    # and each token is appended to the CMD array
    while IFS= read -d '' -r ARG; do
      CMD+=("$ARG")
    done < <(build_command "$@")
    
    # some code omitted ...
    # execute the final command
    CMD=("${CMD[@]:0:$LAST}")
    exec "${CMD[@]}"
    

Argument Parsing: Main

  • org.apache.spark.launcher.Main
  • This class mainly parses the arguments and prints the command to run for the different modes. There is not much code:
    class Main {
    
      public static void main(String[] argsArray) throws Exception {
        checkArgument(argsArray.length > 0, "Not enough arguments: missing class name.");
    
        List<String> args = new ArrayList<>(Arrays.asList(argsArray));
        // Take the first argument, i.e. the org.apache.spark.deploy.SparkSubmit passed in earlier
        String className = args.remove(0);
    
        boolean printLaunchCommand = !isEmpty(System.getenv("SPARK_PRINT_LAUNCH_COMMAND"));
        Map<String, String> env = new HashMap<>();
        List<String> cmd;
        if (className.equals("org.apache.spark.deploy.SparkSubmit")) {
          try {
        // Parse arguments such as --class, --conf, etc.
        // and build the command
            AbstractCommandBuilder builder = new SparkSubmitCommandBuilder(args);
        // cmd essentially becomes: java -cp <classpath> org.apache.spark.deploy.SparkSubmit plus the remaining arguments
            cmd = buildCommand(builder, env, printLaunchCommand);
          } catch (IllegalArgumentException e) {
            // some code omitted
          }
        } else {
          // Taken when spark-class launches a class other than SparkSubmit (e.g. Master, Worker), using SparkClassCommandBuilder
          AbstractCommandBuilder builder = new SparkClassCommandBuilder(className, args);
          cmd = buildCommand(builder, env, printLaunchCommand);
        }
    
        // Print the command differently depending on the operating system
        if (isWindows()) {
          System.out.println(prepareWindowsCommand(cmd, env));
        } else {
          List<String> bashCmd = prepareBashCommand(cmd, env);
          for (String c : bashCmd) {
            System.out.print(c);
            System.out.print('\0'); // tokens are separated by the NUL character
          }
        }
      }
      
      // some code omitted
    }
    
  • Finally, the printed tokens (essentially java -cp <classpath> org.apache.spark.deploy.SparkSubmit plus its arguments) are collected into the CMD array in ./bin/spark-class and executed with exec, as the toy demo below illustrates.
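  • That hand-off is easy to reproduce in isolation. The toy snippet below (not Spark code; the command tokens are made up) mimics what launcher.Main prints and what the while/read loop in spark-class parses back:
    object NullDelimitedDemo {
      def main(args: Array[String]): Unit = {
        // A made-up final command, roughly the shape of what buildCommand() produces
        val cmd = Seq("/usr/bin/java", "-cp", "/opt/spark/jars/*",
          "org.apache.spark.deploy.SparkSubmit", "--class", "com.skey.spark.app.MyApp")
    
        // What launcher.Main does: print every token followed by a NUL ('\0') character
        val wire = cmd.map(_ + "\u0000").mkString
    
        // What `while IFS= read -d '' -r ARG` effectively does: split on NUL
        val parsed = wire.split("\u0000").toSeq
        assert(parsed == cmd)
        parsed.foreach(println)
      }
    }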

SparkSubmit

  • org.apache.spark.deploy.SparkSubmit

  • This is the class that actually performs the Spark application submission. As outlined above, it parses the received arguments and submits the application according to the chosen mode.

  • SparkSubmit has a class and a companion object. We start with the companion object's main method, which is the entry point of the Java process

    override def main(args: Array[String]): Unit = {
    // Instantiate SparkSubmit and override a few of its methods
    val submit = new SparkSubmit() {
      self => // alias for this, so the anonymous SparkSubmitArguments below can reference it
    
      override protected def parseArguments(args: Array[String]): SparkSubmitArguments = {
        // Override SparkSubmitArguments' logging methods
        // so that they delegate to SparkSubmit's logInfo / logWarning
        new SparkSubmitArguments(args) {
          override protected def logInfo(msg: => String): Unit = self.logInfo(msg)
    
          override protected def logWarning(msg: => String): Unit = self.logWarning(msg)
        }
      }
    
      override protected def logInfo(msg: => String): Unit = printMessage(msg)
    
      override protected def logWarning(msg: => String): Unit = printMessage(s"Warning: $msg")
    
      override def doSubmit(args: Array[String]): Unit = {
        try {
          // Still calls the parent's doSubmit, only adding exception handling around it
          super.doSubmit(args)
        } catch {
          case e: SparkUserAppException =>
            exitFn(e.exitCode)
        }
      }
    
    }
    // Call SparkSubmit's doSubmit to submit the application
    submit.doSubmit(args)
    }
    
  • This ultimately calls SparkSubmit's doSubmit, whose code is as follows

    def doSubmit(args: Array[String]): Unit = {
    // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to
    // be reset before the application starts.
    val uninitLog = initializeLogIfNecessary(true, silent = true)
    
    // parseArguments instantiates SparkSubmitArguments
    // note that the companion object above has already overridden its logging methods
    val appArgs = parseArguments(args)
    if (appArgs.verbose) {
      logInfo(appArgs.toString)
    }
    // When submitting an application, action resolves to SparkSubmitAction.SUBMIT
    // (it is derived in SparkSubmitArguments.loadEnvironmentArguments, if you want to dig deeper)
    appArgs.action match {
      case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog)
      case SparkSubmitAction.KILL => kill(appArgs)
      case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
      case SparkSubmitAction.PRINT_VERSION => printVersion()
    }
    }
    
  • Since we are submitting an application, the code then calls submit, whose code is as follows

    private def submit(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
      // Defines doRunMain, which is invoked further below
      // and ultimately calls runMain(...)
      def doRunMain(): Unit = {
        // proxyUser is the proxy user specified with --proxy-user
        // it lets you submit as another user (e.g. you are jerry but submit as tom and work on tom's files), provided the cluster's proxy-user configuration allows it
        if (args.proxyUser != null) {
          val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
            UserGroupInformation.getCurrentUser())
          try {
            proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
              override def run(): Unit = {
                runMain(args, uninitLog)
              }
            })
          } catch {
            // some code omitted
          }
        } else {
          runMain(args, uninitLog)
        }
      }
    
      // Checks the launch mode, but both branches end up calling doRunMain()
      if (args.isStandaloneCluster && args.useRest) {
        try {
          logInfo("Running Spark using the REST application submission protocol.")
          doRunMain()
        } catch {
          // some code omitted
        }
      } else {
        doRunMain()
      }
    }
    
  • As we can see, submit mainly deals with whether a proxy user is in play, and ultimately calls runMain(...), which is the core of SparkSubmit. Its code is as follows

    private def runMain(args: SparkSubmitArguments, uninitLog: Boolean): Unit = {
      // prepareSubmitEnvironment parses the arguments and, crucially, decides how the application will be launched
      val (childArgs, childClasspath, sparkConf, childMainClass) = prepareSubmitEnvironment(args)
      
      // some code omitted
    
      // Decide which ClassLoader to use, controlled by spark.driver.userClassPathFirst (default false)
      val loader =
        if (sparkConf.get(DRIVER_USER_CLASS_PATH_FIRST)) {
          // this ClassLoader gives priority to the user-provided jars
          new ChildFirstURLClassLoader(new Array[URL](0),
            Thread.currentThread.getContextClassLoader)
        } else {
            // the default ClassLoader
            new MutableURLClassLoader(new Array[URL](0),
            Thread.currentThread.getContextClassLoader)
        }
      Thread.currentThread.setContextClassLoader(loader)
    
      for (jar <- childClasspath) {
        addJarToClasspath(jar, loader)
      }
    
      var mainClass: Class[_] = null
    
      try {
        // Load the Class object named by childMainClass
        mainClass = Utils.classForName(childMainClass)
      } catch {
        // some code omitted
      }
     
      val app: SparkApplication = if (classOf[SparkApplication].isAssignableFrom(mainClass)) {
        // If mainClass implements SparkApplication, instantiate it directly
        mainClass.newInstance().asInstanceOf[SparkApplication]
      } else {
        // otherwise wrap it in a JavaMainApplication
        if (classOf[scala.App].isAssignableFrom(mainClass)) {
          logWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
        }
        new JavaMainApplication(mainClass)
      }
    
      // some code omitted
    
      try {
        // Call start()
        // for JavaMainApplication, start() reflectively invokes the class's main method (sketched after this code block)
        app.start(childArgs.toArray, sparkConf)
      } catch {
        case t: Throwable =>
          throw findCause(t)
      }
    }
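
  • For reference, the SparkApplication trait that runMain checks against is tiny, and JavaMainApplication is little more than a reflective call to main. The sketch below is a simplified paraphrase of the 2.4 source, not a verbatim copy:
    import java.lang.reflect.Modifier
    import org.apache.spark.SparkConf
    
    // The contract every "child main class" ultimately satisfies
    trait SparkApplication {
      def start(args: Array[String], conf: SparkConf): Unit
    }
    
    // Wrapper used when the class only exposes a plain main() method
    class JavaMainApplication(klass: Class[_]) extends SparkApplication {
      override def start(args: Array[String], conf: SparkConf): Unit = {
        val mainMethod = klass.getMethod("main", classOf[Array[String]])
        if (!Modifier.isStatic(mainMethod.getModifiers)) {
          throw new IllegalStateException("The main method must be static")
        }
        // The real implementation also copies the SparkConf entries into system properties
        conf.getAll.foreach { case (k, v) => sys.props(k) = v }
        mainMethod.invoke(null, args)
      }
    }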
    
  • Clearly, the most important variable in runMain(...) is childMainClass, because it determines which class runs next. To see what it actually is, we step into prepareSubmitEnvironment(...), which has a variable of the same name that it eventually returns. The analysis below revolves around that childMainClass.

    • Client mode
      if (deployMode == CLIENT) {
        // In CLIENT mode, args.mainClass is assigned to childMainClass directly
        // args.mainClass is the class specified with --class at submit time
        childMainClass = args.mainClass
        if (localPrimaryResource != null && isUserJar(localPrimaryResource)) {
          childClasspath += localPrimaryResource
        }
        if (localJars != null) { childClasspath ++= localJars.split(",") }
      }
      
    • Standalone cluster mode
      // First check whether this is standalone cluster mode
      if (args.isStandaloneCluster) {
        if (args.useRest) {
          // With REST enabled, use org.apache.spark.deploy.rest.RestSubmissionClientApp
          childMainClass = REST_CLUSTER_SUBMIT_CLASS
          // pass args.mainClass along
          childArgs += (args.primaryResource, args.mainClass)
        } else {
          // otherwise, use org.apache.spark.deploy.ClientApp
          childMainClass = STANDALONE_CLUSTER_SUBMIT_CLASS
          if (args.supervise) { childArgs += "--supervise" }
          Option(args.driverMemory).foreach { m => childArgs += ("--memory", m) }
          Option(args.driverCores).foreach { c => childArgs += ("--cores", c) }
          childArgs += "launch"
          // pass args.mainClass along
          childArgs += (args.master, args.primaryResource, args.mainClass)
        }
        if (args.childArgs != null) {
          childArgs ++= args.childArgs
        }
      }
      
    • YARN cluster mode
       if (isYarnCluster) {
         // In YARN cluster mode, use org.apache.spark.deploy.yarn.YarnClusterApplication
         childMainClass = YARN_CLUSTER_SUBMIT_CLASS
         if (args.isPython) {
           childArgs += ("--primary-py-file", args.primaryResource)
           childArgs += ("--class", "org.apache.spark.deploy.PythonRunner")
         } else if (args.isR) {
           val mainFile = new Path(args.primaryResource).getName
           childArgs += ("--primary-r-file", mainFile)
           childArgs += ("--class", "org.apache.spark.deploy.RRunner")
         } else {
           if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
             childArgs += ("--jar", args.primaryResource)
           }
            // pass args.mainClass along
           childArgs += ("--class", args.mainClass)
         }
         if (args.childArgs != null) {
           args.childArgs.foreach { arg => childArgs += ("--arg", arg) }
         }
       }
      
    • The remaining modes (MesosCluster, KubernetesCluster) follow the same pattern and can be read on your own
  • To summarize: in client mode, the main method of the user's class is invoked directly. In cluster mode, depending on the deployment, RestSubmissionClientApp, ClientApp, YarnClusterApplication or KubernetesClientApplication takes over the next step (see the sketch below).

  • In cluster mode these SparkApplications run locally; each of them requests a node from the cluster and starts the Driver there (which then runs the main method of the user's class). Let's now look at the source of these SparkApplications.
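  • The branches above boil down to the mapping below. This is an illustrative summary only (the helper function is not part of Spark); the class names are the string values of the constants mentioned above:
    // Illustrative summary of how prepareSubmitEnvironment chooses childMainClass
    def chooseChildMainClass(deployMode: String, clusterManager: String,
                             useRest: Boolean, userMainClass: String): String =
      (deployMode, clusterManager) match {
        // client mode: the Driver (the user's main) runs in the current JVM
        case ("client", _) => userMainClass
        // standalone cluster, REST submission protocol
        case ("cluster", "standalone") if useRest => "org.apache.spark.deploy.rest.RestSubmissionClientApp"
        // standalone cluster, legacy RPC submission
        case ("cluster", "standalone") => "org.apache.spark.deploy.ClientApp"
        // YARN cluster mode
        case ("cluster", "yarn") => "org.apache.spark.deploy.yarn.YarnClusterApplication"
        // Mesos / Kubernetes cluster modes use their own SparkApplication implementations
        case other => throw new IllegalArgumentException(s"not covered here: $other")
      }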

ClientApp in Standalone Mode

  • org.apache.spark.deploy.ClientApp
  • Its code is as follows
    private[spark] class ClientApp extends SparkApplication {
    
      override def start(args: Array[String], conf: SparkConf): Unit = {
        // ClientArguments internally calls parse(args.toList) to parse the arguments
        val driverArgs = new ClientArguments(args)
    
        if (!conf.contains("spark.rpc.askTimeout")) {
          conf.set("spark.rpc.askTimeout", "10s")
        }
        Logger.getRootLogger.setLevel(driverArgs.logLevel)
        // Create the NettyRpcEnv
        val rpcEnv =
          RpcEnv.create("driverClient", Utils.localHostName(), 0, conf, new SecurityManager(conf))
        // Resolve the Master's RpcEndpointRef from its URL
        val masterEndpoints = driverArgs.masters.map(RpcAddress.fromSparkURL).
          map(rpcEnv.setupEndpointRef(_, Master.ENDPOINT_NAME))
        // Instantiate ClientEndpoint and register it
        rpcEnv.setupEndpoint("client", new ClientEndpoint(rpcEnv, driverArgs, masterEndpoints, conf))
    
        rpcEnv.awaitTermination()
      }
    
    }
    
  • This code relies on the RPC machinery (RpcEndpoint, RpcEnv) covered in an earlier article of this series; if you are not familiar with it, read that first.
  • Once ClientEndpoint is instantiated, its onStart method is invoked, as follows
    override def onStart(): Unit = {
      driverArgs.cmd match {
        case "launch" =>
          // Remember this class, DriverWrapper: it will be launched later and will call the main method of the user's class
          val mainClass = "org.apache.spark.deploy.worker.DriverWrapper"
    
          // some code omitted
    
          // Build the Command
          // the driverArgs.mainClass passed here is the user's class
          val command = new Command(mainClass,
            Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
            sys.env, classPathEntries, libraryPathEntries, javaOpts)
    
          val driverDescription = new DriverDescription(
            driverArgs.jarUrl,
            driverArgs.memory,
            driverArgs.cores,
            driverArgs.supervise,
            command)
          // Send a RequestSubmitDriver message to the Master
          asyncSendToMasterAndForwardReply[SubmitDriverResponse](
            RequestSubmitDriver(driverDescription))
    
        case "kill" =>
          val driverId = driverArgs.driverId
          asyncSendToMasterAndForwardReply[KillDriverResponse](RequestKillDriver(driverId))
      }
    }
    
  • Next, let's see what the Master does when its receiveAndReply receives the RequestSubmitDriver message
    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case RequestSubmitDriver(description) =>
        // In high-availability mode there are multiple Masters, hence different states
        if (state != RecoveryState.ALIVE) {
          // If this Master is not ALIVE, reply with a SubmitDriverResponse carrying a failure message
          val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
            "Can only accept driver submissions in ALIVE state."
          context.reply(SubmitDriverResponse(self, false, None, msg))
        } else {
          logInfo("Driver submitted " + description.command.mainClass)
          // Create the Driver from the received description
          val driver = createDriver(description) // this mainly builds a DriverInfo
          persistenceEngine.addDriver(driver)
          waitingDrivers += driver
          drivers.add(driver)
          // this is what actually gets the Driver launched
          schedule()
    
          // Reply with a SubmitDriverResponse containing driver.id
          context.reply(SubmitDriverResponse(self, true, Some(driver.id),
            s"Driver successfully submitted as ${driver.id}"))
        }
    
        // some code omitted
    }
    
  • Upon receiving RequestSubmitDriver, receiveAndReply builds a DriverInfo and calls schedule(), which actually gets the Driver launched. Let's look at schedule() next
    private def schedule(): Unit = {
      if (state != RecoveryState.ALIVE) {
        return
      }
      // Take the ALIVE workers and shuffle them randomly
      // shuffling prevents too many drivers from piling onto a few workers, so drivers spread evenly across the cluster
      val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
      val numWorkersAlive = shuffledAliveWorkers.size
      var curPos = 0
      // Iterate over waitingDrivers, i.e. the DriverInfo objects built earlier
      for (driver <- waitingDrivers.toList) {
        var launched = false
        var numWorkersVisited = 0
        // keep trying while not all alive workers have been visited and the driver has not been launched yet
        while (numWorkersVisited < numWorkersAlive && !launched) {
          // take one worker from the shuffled sequence
          val worker = shuffledAliveWorkers(curPos)
          numWorkersVisited += 1
          // the worker's free memory must be at least the requested memory
          // and its free cores at least the requested cores
          if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
            // if both conditions hold, launch the driver on this worker
            launchDriver(worker, driver)
            // remove the launched driver from the waiting list so it is not launched twice
            waitingDrivers -= driver
            launched = true
          }
          curPos = (curPos + 1) % numWorkersAlive
        }
      }
      // this launches Executors on the workers; not our concern here
      startExecutorsOnWorkers()
    }
    
  • This code first collects the alive workers and shuffles them randomly. It then iterates over the DriverInfos built earlier, walking through the workers to find one with enough free memory and cores. When one is found, launchDriver(...) is called to start the Driver on that worker (a self-contained sketch of this loop follows the code below).
    private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
      logInfo("Launching driver " + driver.id + " on worker " + worker.id)
      worker.addDriver(driver)
      driver.worker = Some(worker)
      // Send a LaunchDriver message to the worker
      worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
      driver.state = DriverState.RUNNING
    }
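
  • Before moving on to the Worker side, here is a small self-contained sketch of the placement loop inside schedule() shown above. The WorkerSlot/DriverReq names are hypothetical, used only to isolate the algorithm:
    import scala.collection.mutable
    import scala.util.Random
    
    case class WorkerSlot(id: String, var memFree: Int, var coresFree: Int)
    case class DriverReq(id: String, mem: Int, cores: Int)
    
    // Returns driverId -> workerId assignments, mimicking Master.schedule():
    // shuffle the alive workers, then round-robin each waiting driver until a
    // worker with enough free memory and cores is found.
    def placeDrivers(workers: Seq[WorkerSlot], waiting: Seq[DriverReq]): Map[String, String] = {
      val shuffled = Random.shuffle(workers)
      val assigned = mutable.Map[String, String]()
      var curPos = 0
      for (driver <- waiting) {
        var visited = 0
        var launched = false
        while (visited < shuffled.size && !launched) {
          val worker = shuffled(curPos)
          visited += 1
          if (worker.memFree >= driver.mem && worker.coresFree >= driver.cores) {
            worker.memFree -= driver.mem // book-keep the resources this driver takes
            worker.coresFree -= driver.cores
            assigned(driver.id) = worker.id
            launched = true
          }
          curPos = (curPos + 1) % shuffled.size
        }
      }
      assigned.toMap
    }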
    
  • By calling launchDriver, the Master sends a LaunchDriver message to the worker. Let's see what the Worker does when it receives that message.
    override def receive: PartialFunction[Any, Unit] = synchronized {
      // some code omitted
    
      case LaunchDriver(driverId, driverDesc) =>
        logInfo(s"Asked to launch driver $driverId")
        // Build a DriverRunner and call start() to launch it
        val driver = new DriverRunner(
          conf,
          driverId,
          workDir,
          sparkHome,
          driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
          self,
          workerUri,
          securityMgr)
        drivers(driverId) = driver
        driver.start()
    
        // Update the resources used on this worker
        coresUsed += driverDesc.cores
        memoryUsed += driverDesc.mem
    
      // some code omitted
    }
    
  • On the Worker, receiving LaunchDriver builds a DriverRunner and calls its start method. start creates and starts a new Thread whose key call is prepareAndRunDriver(), which launches a new process with a ProcessBuilder and blocks until it exits.
  • The command used to start that process is the command inside the driverDesc the DriverRunner was constructed with. That command was originally sent by the ClientEndpoint (in ClientApp) to the Master and then forwarded by the Master to the Worker; it is exactly the org.apache.spark.deploy.worker.DriverWrapper we asked you to remember (look back at ClientEndpoint's onStart), and it also carries the user's class. A minimal sketch of this launch pattern follows.
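  • A minimal sketch of that "build a java command, start it with ProcessBuilder and block until it exits" pattern (not the actual DriverRunner code; the I/O handling is simplified and the usage values are hypothetical):
    import java.io.File
    import scala.collection.JavaConverters._
    
    // Launch an external command in workDir and block until it finishes,
    // which is roughly what the DriverRunner thread does in prepareAndRunDriver()
    def runAndWait(command: Seq[String], workDir: File): Int = {
      val builder = new ProcessBuilder(command.asJava)
        .directory(workDir)
        .inheritIO() // the real DriverRunner redirects stdout/stderr to files instead
      val process = builder.start()
      process.waitFor() // <- this is where the thread blocks
    }
    
    // Hypothetical usage, resembling the DriverWrapper command assembled earlier:
    // runAndWait(Seq("java", "-cp", "/path/user.jar:...", "org.apache.spark.deploy.worker.DriverWrapper",
    //   "spark://Worker@host:port", "/path/user.jar", "com.skey.spark.app.MyApp"), new File("/tmp/driver-dir"))
    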
  • Finally, we arrive at DriverWrapper's main method
    def main(args: Array[String]) {
      args.toList match {
        case workerUrl :: userJar :: mainClass :: extraArgs =>
          val conf = new SparkConf()
          val host: String = Utils.localHostName()
          val port: Int = sys.props.getOrElse("spark.driver.port", "0").toInt
          // Create the NettyRpcEnv
          val rpcEnv = RpcEnv.create("Driver", host, port, conf, new SecurityManager(conf))
          logInfo(s"Driver address: ${rpcEnv.address}")
          // Instantiate a WorkerWatcher and register it
          rpcEnv.setupEndpoint("workerWatcher", new WorkerWatcher(rpcEnv, workerUrl))
    
          val currentLoader = Thread.currentThread.getContextClassLoader
          val userJarUrl = new File(userJar).toURI().toURL()
          // Same as in SparkSubmit's runMain():
          // choose the ClassLoader according to spark.driver.userClassPathFirst
          val loader =
            if (sys.props.getOrElse("spark.driver.userClassPathFirst", "false").toBoolean) {
              // this ClassLoader gives priority to the user-provided jars
              new ChildFirstURLClassLoader(Array(userJarUrl), currentLoader)
            } else {
              new MutableURLClassLoader(Array(userJarUrl), currentLoader)
            }
          Thread.currentThread.setContextClassLoader(loader)
          setupDependencies(loader, userJar)
    
          // Reflectively invoke the main method of the user's class
          val clazz = Utils.classForName(mainClass)
          val mainMethod = clazz.getMethod("main", classOf[Array[String]])
          mainMethod.invoke(null, extraArgs.toArray[String])
    
          rpcEnv.shutdown()
    
        case _ =>
          // scalastyle:off println
          System.err.println("Usage: DriverWrapper <workerUrl> <userJar> <driverMainClass> [options]")
          // scalastyle:on println
          System.exit(-1)
      }
    }
    
  • At this point DriverWrapper has been launched on the Worker, and at the end it invokes the main method of the user's class via reflection. From here on everything proceeds just as in client mode.

YarnClusterApplication in ON YARN Mode

  • org.apache.spark.deploy.yarn.YarnClusterApplication
  • The start method of this class is very simple: it instantiates a Client locally and calls its run method (the whole method is sketched right below).
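  • Roughly, paraphrased from the 2.4 source (comments mine), the whole class is just:
    private[spark] class YarnClusterApplication extends SparkApplication {
      override def start(args: Array[String], conf: SparkConf): Unit = {
        // In yarn cluster mode jars/files are shipped through the YARN distributed cache,
        // so these entries are dropped from the conf here
        conf.remove("spark.jars")
        conf.remove("spark.files")
        // Parse the childArgs built by SparkSubmit (--jar, --class, --arg ...) and hand off to Client.run()
        new Client(new ClientArguments(args), conf).run()
      }
    }
    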
  • Let's look at the code of org.apache.spark.deploy.yarn.Client
    def run(): Unit = {
      // Submit the application to the ResourceManager
      this.appId = submitApplication()
      
      if (!launcherBackend.isConnected() && fireAndForget) {
        // fire-and-forget: fetch the report once and throw a SparkException if the application failed or was killed
        val report = getApplicationReport(appId)
        val state = report.getYarnApplicationState
        logInfo(s"Application report for $appId (state: $state)")
        logInfo(formatReportDetails(report))
        if (state == YarnApplicationState.FAILED || state == YarnApplicationState.KILLED) {
          throw new SparkException(s"Application $appId finished with status: $state")
        }
      } else {
        // otherwise, keep monitoring the application's state until it finishes
        val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
        if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
          diags.foreach { err =>
            logError(s"Application diagnostics message: $err")
          }
          throw new SparkException(s"Application $appId finished with failed status")
        }
        if (appState == YarnApplicationState.KILLED || finalState == FinalApplicationStatus.KILLED) {
          throw new SparkException(s"Application $appId is killed")
        }
        if (finalState == FinalApplicationStatus.UNDEFINED) {
          throw new SparkException(s"The final status of application $appId is undefined")
        }
      }
    }
    
  • The most important call here is submitApplication, which asks YARN's ResourceManager for resources to run the ApplicationMaster. Its code is as follows
    def submitApplication(): ApplicationId = {
      var appId: ApplicationId = null
      try {
        // launcherBackend is created when the Client is instantiated
        launcherBackend.connect()
        // yarnClient is created via YarnClient.createYarnClient when the Client is instantiated
        yarnClient.init(hadoopConf)
        yarnClient.start()
    
        logInfo("Requesting a new application from cluster with %d NodeManagers"
          .format(yarnClient.getYarnClusterMetrics.getNumNodeManagers))
    
        // Ask the ResourceManager to create a new application
        val newApp = yarnClient.createApplication()
        val newAppResponse = newApp.getNewApplicationResponse()
        appId = newAppResponse.getApplicationId()
    
        new CallerContext("CLIENT", sparkConf.get(APP_CALLER_CONTEXT),
          Option(appId.toString)).setCurrentContext()
    
        // Verify that the YARN cluster has enough resources to run the ApplicationMaster
        verifyClusterResources(newAppResponse)
    
        // Build the context used to launch the ApplicationMaster
        // It mainly contains:
        // 1. parsing of some of the arguments, e.g. --class, --jar, etc.
        // 2. the java command list (set via amContainer.setCommands) that the allocated container runs to start the ApplicationMaster, which in turn runs the user's class
        // 3. the resource requests and configuration for the ApplicationMaster
        val containerContext = createContainerLaunchContext(newAppResponse)
        val appContext = createApplicationSubmissionContext(newApp, containerContext)
    
        logInfo(s"Submitting application $appId to ResourceManager")
        // Submit the application, which gets the ApplicationMaster started
        yarnClient.submitApplication(appContext)
        launcherBackend.setAppId(appId.toString)
        reportLauncherState(SparkAppHandle.State.SUBMITTED)
    
        appId
      } catch {
        // some code omitted
      }
    }
    
  • So submitApplication() connects to YARN, then requests and creates the application that runs the ApplicationMaster. The launch context it passes along includes the java-based command for starting the user's class, which is executed once the ApplicationMaster is up; from there the flow is the same as in client mode. (A minimal sketch of the kind of status polling done by monitorApplication follows.)
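  • The monitoring half of run() is plain YARN client API. Below is a minimal standalone sketch of the same polling idea (not Spark's monitorApplication; the sleep interval is made up):
    import org.apache.hadoop.yarn.api.records.{ApplicationId, YarnApplicationState}
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    
    // Poll the ResourceManager until the application reaches a terminal state
    def waitForCompletion(appId: ApplicationId): YarnApplicationState = {
      val yarnClient = YarnClient.createYarnClient()
      yarnClient.init(new YarnConfiguration())
      yarnClient.start()
      try {
        var state = yarnClient.getApplicationReport(appId).getYarnApplicationState
        while (state != YarnApplicationState.FINISHED &&
               state != YarnApplicationState.FAILED &&
               state != YarnApplicationState.KILLED) {
          Thread.sleep(3000) // made-up interval; Spark's monitorApplication polls on a configurable interval
          state = yarnClient.getApplicationReport(appId).getYarnApplicationState
        }
        state
      } finally {
        yarnClient.stop()
      }
    }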

Reposted from blog.csdn.net/alionsss/article/details/104798917