Spark 2.1.0: Updating the Spark Environment

Reading note: this article analyzes how SparkContext handles the external resources that users add through the --jars (or spark.jars) and --files (or spark.files) options.

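When submitting a job, users often need to supply additional jar packages or other files that the job depends on at run time. How are these files specified? And how do the tasks running on each node obtain them? We start with the first question.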

While SparkContext is being initialized, it reads the jar files and other files specified by the user, as the following code shows:

    _jars = Utils.getUserJars(_conf)
    _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
      .toSeq.flatten

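The code above first reads the jar files configured by the user and then the other files. When the application is deployed on yarn, _jars is the union of the jar files specified by the spark.jars property and those specified by the spark.yarn.dist.jars property. In all other modes only the jar files specified by spark.jars are used. This work is done by the getUserJars method of Utils, which is described in 《附录A Spark2.1核心工具类Utils》 (Appendix A: Utils, the core utility class of Spark 2.1). Other files are specified through the spark.files property, a comma-separated list from which empty entries are dropped.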
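
The parsing described above can be sketched without a Spark dependency. This is only an illustration: UserResources, splitList, userJars, and the onYarn flag are hypothetical names, a plain Map stands in for SparkConf, and the real logic lives in Utils.getUserJars.

```scala
object UserResources {
  /** Split a comma-separated property value, dropping empty entries. */
  def splitList(value: Option[String]): Seq[String] =
    value.toSeq.flatMap(_.split(",").toSeq).filter(_.nonEmpty)

  /**
   * Hypothetical stand-in for the jar lookup described in the text:
   * on YARN the two properties are merged, otherwise only spark.jars is used.
   */
  def userJars(conf: Map[String, String], onYarn: Boolean): Seq[String] = {
    val sparkJars = splitList(conf.get("spark.jars"))
    if (onYarn) {
      // Union of both lists; distinct keeps the first occurrence of each entry
      (sparkJars ++ splitList(conf.get("spark.yarn.dist.jars"))).distinct
    } else {
      sparkJars
    }
  }
}
```

For example, a value such as "a.jar,,b.jar" yields two entries, because the empty element between the commas is filtered out.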

That answers the first question. What about the second?

SparkContext initialization also contains the following code:

  def jars: Seq[String] = _jars
  def files: Seq[String] = _files

    // Add each JAR given through the constructor
    if (jars != null) {
      jars.foreach(addJar)
    }

    if (files != null) {
      files.foreach(addFile)
    }

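In this code, jars and files are two simple accessor methods that return the sequence of jar files and the sequence of other files, respectively. The code then iterates over the jar files, calling addJar on each one, and over the other files, calling addFile on each one.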

What does addJar do? It adds a jar file to the Driver's RPC environment. Its implementation is shown in Listing 1.

Listing 1: Implementation of addJar

  def addJar(path: String) {
    if (path == null) {
      logWarning("null specified as parameter to addJar")
    } else {
      var key = ""
      if (path.contains("\\")) {
        key = env.rpcEnv.fileServer.addJar(new File(path))
      } else {
        val uri = new URI(path)
        Utils.validateURL(uri)
        key = uri.getScheme match {
          case null | "file" =>
            try {
              env.rpcEnv.fileServer.addJar(new File(uri.getPath))
            } catch {
              case exc: FileNotFoundException =>
                logError(s"Jar not found at $path")
                null
            }
          case "local" =>
            "file:" + uri.getPath
          case _ =>
            path
        }
      }
      if (key != null) {
        val timestamp = System.currentTimeMillis
        if (addedJars.putIfAbsent(key, timestamp).isEmpty) {
          logInfo(s"Added JAR $path at $key with timestamp $timestamp")
          postEnvironmentUpdate()
        }
      }
    }
  }

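As Listing 1 shows, for a jar on the Driver's local file system (a path with no scheme or with the file scheme, or any path containing a backslash), addJar calls the addJar method of the fileServer held by SparkEnv's RpcEnv, which registers the jar with the NettyStreamManager of the Driver's local RpcEnv (fileServer is in fact the NettyStreamManager described in section 5.3.5 of the book 《Spark内核设计的艺术》). A path with the local scheme is rewritten to a file: URI, and a path with any other scheme is used unchanged. In every case the resulting key is cached in addedJars together with the timestamp at which the jar was added. SparkEnv and fileServer are covered in detail in chapter 5 of the same book.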
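
The scheme handling and the "first registration wins" bookkeeping can be tried out in isolation with the rough sketch below. It makes several assumptions: JarRegistry, serve, keyFor, and register are hypothetical names, serve merely imitates fileServer.addJar by returning a placeholder URL, and a TrieMap stands in for addedJars.

```scala
import java.io.File
import java.net.URI
import scala.collection.concurrent.TrieMap

object JarRegistry {
  // Stand-in for addedJars: key -> timestamp of the first registration
  val addedJars = TrieMap.empty[String, Long]

  // Hypothetical replacement for fileServer.addJar; the URL is a placeholder
  def serve(file: File): String = s"spark://driver-host:0/jars/${file.getName}"

  /** Choose the key by URI scheme, following the match in Listing 1. */
  def keyFor(path: String): String = {
    val uri = new URI(path)
    uri.getScheme match {
      case null | "file" => serve(new File(uri.getPath)) // served by the driver
      case "local"       => "file:" + uri.getPath         // already on every node
      case _             => path                          // e.g. hdfs:// or http://
    }
  }

  /** Returns true only the first time a given key is registered. */
  def register(path: String, now: Long): Boolean =
    addedJars.putIfAbsent(keyFor(path), now).isEmpty
}
```

Because putIfAbsent returns the previous value as an Option, a second registration of the same jar leaves the original timestamp in place, which is why Listing 1 only logs and posts an environment update when the result is empty.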

addFile is similar to addJar; its implementation is shown in Listing 2.

Listing 2: Implementation of addFile

  def addFile(path: String): Unit = {
    addFile(path, false)
  }
  def addFile(path: String, recursive: Boolean): Unit = {
    val uri = new Path(path).toUri
    val schemeCorrectedPath = uri.getScheme match {
      case null | "local" => new File(path).getCanonicalFile.toURI.toString
      case _ => path
    }

    val hadoopPath = new Path(schemeCorrectedPath)
    val scheme = new URI(schemeCorrectedPath).getScheme
    if (!Array("http", "https", "ftp").contains(scheme)) {
      val fs = hadoopPath.getFileSystem(hadoopConfiguration)
      val isDir = fs.getFileStatus(hadoopPath).isDirectory
      if (!isLocal && scheme == "file" && isDir) {
        throw new SparkException(s"addFile does not support local directories when not running " +
          "local mode.")
      }
      if (!recursive && isDir) {
        throw new SparkException(s"Added file $hadoopPath is a directory and recursive is not " +
          "turned on.")
      }
    } else {
      Utils.validateURL(uri)
    }

    val key = if (!isLocal && scheme == "file") {
      env.rpcEnv.fileServer.addFile(new File(uri.getPath))
    } else {
      schemeCorrectedPath
    }
    val timestamp = System.currentTimeMillis
    if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
      logInfo(s"Added file $path at $key with timestamp $timestamp")
      Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
        env.securityManager, hadoopConfiguration, timestamp, useCache = false)
      postEnvironmentUpdate()
    }
  }

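As Listing 2 shows, when the application is not running in local mode and the file has the file scheme, addFile calls the addFile method of the fileServer held by SparkEnv's RpcEnv, which registers the file with the NettyStreamManager of the Driver's local RpcEnv; otherwise the scheme-corrected path itself serves as the key. The key is cached in addedFiles together with the timestamp at which the file was added, and the first time a key is added the file is also fetched into the directory returned by SparkFiles.getRootDirectory(). SparkEnv and fileServer (that is, NettyStreamManager) are covered in detail in chapter 5 of the book 《Spark内核设计的艺术》.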
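
The two decisions at the heart of addFile, correcting the scheme and deciding whether the Driver serves the file, are sketched below. This is an approximation under stated assumptions: FilePaths, schemeCorrected, and servedByDriver are hypothetical names, and java.net.URI is used in place of Hadoop's Path, so paths that are not valid URIs (for example, ones containing spaces) are not handled the way the real code handles them.

```scala
import java.io.File
import java.net.URI

object FilePaths {
  /** Paths with no scheme or the "local" scheme become absolute file: URIs. */
  def schemeCorrected(path: String): String =
    new URI(path).getScheme match {
      case null | "local" => new File(path).getCanonicalFile.toURI.toString
      case _              => path
    }

  /**
   * Per Listing 2, the driver's file server is used only outside local mode
   * and only for files with the "file" scheme.
   */
  def servedByDriver(path: String, isLocal: Boolean): Boolean =
    !isLocal && new URI(schemeCorrected(path)).getScheme == "file"
}
```

In this sketch a relative path such as data.txt is served by the Driver on a cluster, whereas an hdfs:// path is left for Executors to read from its original location.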

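Through addJar and addFile, the files that tasks depend on are added to the Driver's RPC environment, so that each Executor node can download them from the Driver over RPC and make them available locally for task execution.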

Both addJar and addFile finish by calling postEnvironmentUpdate when they record a new entry, and the SparkContext initialization process also calls postEnvironmentUpdate at its very end, as follows.

    postEnvironmentUpdate()

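Because addJar and addFile can change the application's environment, postEnvironmentUpdate must be called at the end of SparkContext initialization to refresh the environment. Its implementation is shown in Listing 3.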

Listing 3: Implementation of postEnvironmentUpdate

  private def postEnvironmentUpdate() {
    if (taskScheduler != null) {
      val schedulingMode = getSchedulingMode.toString
      val addedJarPaths = addedJars.keys.toSeq
      val addedFilePaths = addedFiles.keys.toSeq
      // Collect the JVM parameters, Spark properties, system properties, classpath, etc. as the environment details
      val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
        addedFilePaths)
      // Create a SparkListenerEnvironmentUpdate event and post it to the event bus
      val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
      listenerBus.post(environmentUpdate)
    }
  }

According to Listing 3, once taskScheduler has been created, postEnvironmentUpdate proceeds as follows.

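1. Call the environmentDetails method of SparkEnv (see Listing 4) to gather the JVM parameters, Spark properties, system properties, classpath, and other information into the environment details.
2. Create a SparkListenerEnvironmentUpdate event, which carries the environment details, and post it to the event bus listenerBus. The event is eventually received by EnvironmentListener and determines what the EnvironmentPage page displays.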
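
The post-and-listen pattern in these steps can be modelled with a very small synchronous bus. Everything in this sketch is hypothetical (EnvBusSketch, Bus, EnvListener, and the event class are illustrative names), and it simplifies heavily: it only shows how a listener ends up holding the details that a page could later render.

```scala
import scala.collection.mutable.ArrayBuffer

object EnvBusSketch {
  type Details = Map[String, Seq[(String, String)]]

  // Simplified counterpart of SparkListenerEnvironmentUpdate
  final case class EnvironmentUpdate(details: Details)

  trait Listener {
    def onEnvironmentUpdate(event: EnvironmentUpdate): Unit
  }

  // Minimal synchronous bus: post delivers the event to every listener in order
  class Bus {
    private val listeners = ArrayBuffer.empty[Listener]
    def addListener(listener: Listener): Unit = listeners += listener
    def post(event: EnvironmentUpdate): Unit =
      listeners.foreach(_.onEnvironmentUpdate(event))
  }

  // Keeps the latest classpath entries, as a UI-backing listener might
  class EnvListener extends Listener {
    var classpath: Seq[(String, String)] = Seq.empty
    def onEnvironmentUpdate(event: EnvironmentUpdate): Unit =
      classpath = event.details.getOrElse("Classpath Entries", Seq.empty)
  }
}
```

Each call to post replaces the listener's snapshot, which mirrors why every successful addJar or addFile triggers a fresh event rather than an incremental change.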

Listing 4: Implementation of SparkEnv.environmentDetails

  private[spark]
  def environmentDetails(
      conf: SparkConf,
      schedulingMode: String,
      addedJars: Seq[String],
      addedFiles: Seq[String]): Map[String, Seq[(String, String)]] = {

    import Properties._
    val jvmInformation = Seq(
      ("Java Version", s"$javaVersion ($javaVendor)"),
      ("Java Home", javaHome),
      ("Scala Version", versionString)
    ).sorted

    val schedulerMode =
      if (!conf.contains("spark.scheduler.mode")) {
        Seq(("spark.scheduler.mode", schedulingMode))
      } else {
        Seq[(String, String)]()
      }
    val sparkProperties = (conf.getAll ++ schedulerMode).sorted

    val systemProperties = Utils.getSystemProperties.toSeq
    val otherProperties = systemProperties.filter { case (k, _) =>
      k != "java.class.path" && !k.startsWith("spark.")
    }.sorted

    val classPathEntries = javaClassPath
      .split(File.pathSeparator)
      .filterNot(_.isEmpty)
      .map((_, "System Classpath"))
    val addedJarsAndFiles = (addedJars ++ addedFiles).map((_, "Added By User"))
    val classPaths = (addedJarsAndFiles ++ classPathEntries).sorted

    Map[String, Seq[(String, String)]](
      "JVM Information" -> jvmInformation,
      "Spark Properties" -> sparkProperties,
      "System Properties" -> otherProperties,
      "Classpath Entries" -> classPaths)
  }
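
To see how the "Classpath Entries" section of Listing 4 is assembled, the sketch below reproduces just that part as a standalone function. ClasspathSketch and classPaths are hypothetical names, and the classpath string is passed in as a parameter (rather than read from the JVM) so the behaviour is easy to check.

```scala
import java.io.File

object ClasspathSketch {
  /** Tag each entry with its origin, then sort, as the listing does. */
  def classPaths(
      javaClassPath: String,
      addedJars: Seq[String],
      addedFiles: Seq[String]): Seq[(String, String)] = {
    val systemEntries = javaClassPath
      .split(File.pathSeparator)
      .filterNot(_.isEmpty) // skip empty segments such as "a::b"
      .map((_, "System Classpath"))
    val userEntries = (addedJars ++ addedFiles).map((_, "Added By User"))
    (userEntries ++ systemEntries).sorted
  }
}
```

Since the tuples are sorted by path first, user-added and system entries are interleaved alphabetically in the result instead of being grouped by origin.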
