Spark source code: SparkContext

SparkContext can be considered the engine of a Spark application; the initialization of the Spark Driver revolves around the initialization of this SparkContext.

SparkContext Overview

The main components of SparkContext (a minimal construction sketch follows this list):

  • SparkEnv: the Spark runtime environment. Executors depend on SparkEnv to process tasks. The Driver also contains a SparkEnv, which guarantees that tasks can be executed in local mode. In addition, SparkEnv contains components such as SerializerManager, RpcEnv, BlockManager, and MapOutputTracker.
  • LiveListenerBus: SparkContext's event bus. It receives events posted by the various components and, after matching, asynchronously invokes the corresponding methods of the registered SparkListeners.
  • SparkUI: indirectly depends on the computing engine, the scheduling engine, and the storage system. Information about Jobs, stages, storage, Executors, and so on is delivered to the LiveListenerBus as SparkListener events; SparkUI reads data from the corresponding SparkListeners and displays it on the web page.
  • SparkStatusTracker: provides monitoring information about jobs, stages, and so on. It is a low-level API and only offers eventual consistency.
  • ConsoleProgressBar: uses the SparkStatusTracker API to show stage progress in the console. Because SparkStatusTracker is only eventually consistent, the display generally lags a little.
  • DAGScheduler (very important): the DAG scheduler, responsible for creating Jobs, dividing the RDD DAG into stages with its partitioning algorithm, submitting stages, and so on.
  • TaskScheduler (very important): the task scheduler. It allocates the resources that the cluster manager has already assigned to the application (first-level scheduling) to individual Tasks according to its scheduling algorithm (second-level scheduling). The Tasks scheduled by TaskScheduler are created by DAGScheduler.
  • HeartbeatReceiver: the heartbeat receiver. All Executors send their heartbeat information to HeartbeatReceiver; after receiving a heartbeat, it updates the Executor's last-seen time and then hands the information to TaskScheduler for further processing.
  • ContextCleaner: asynchronously cleans up RDDs, ShuffleDependencys, and Broadcasts that have gone out of the application's scope.
  • JobProgressListener: the job progress listener.
  • EventLoggingListener (optional): a listener that persists events to storage; used when spark.eventLog.enabled is true.
  • ExecutorAllocationManager: the Executor dynamic allocation manager.
  • ShutdownHookManager: the manager of shutdown hook functions, which perform cleanup work when the JVM exits.
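For orientation, here is a minimal sketch of the user-facing side (the application name and master are placeholders): constructing a SparkContext is what triggers all of the initialization steps analyzed below.

import org.apache.spark.{SparkConf, SparkContext}

object SparkContextInitDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sparkcontext-init-demo") // spark.app.name must be set, or initialization throws
      .setMaster("local[2]")                // spark.master must be set, or initialization throws
    val sc = new SparkContext(conf)         // SparkEnv, LiveListenerBus, SparkUI, schedulers, etc. are created here
    try {
      val sum = sc.parallelize(1 to 100).reduce(_ + _) // actions eventually go through sc.runJob
      println(s"sum = $sum")
    } finally {
      sc.stop() // the shutdown hook added at the end of initialization would also call this on JVM exit
    }
  }
}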

 

The following sections walk through the SparkContext initialization process.

Creating SparkEnv

// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
      conf: SparkConf,
      isLocal: Boolean,
      listenerBus: LiveListenerBus): SparkEnv = {
    SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master, conf))
  }

private[spark] def env: SparkEnv = _env

The createSparkEnv() method is defined first; it calls SparkEnv.createDriverEnv() to create the driver-side SparkEnv.

/* ------------------------------------------------------------------------------------- *
 | Initialization. This code initializes the context in a manner that is exception-safe. |
 | All internal fields holding state are initialized here, and any error prompts the     |
 | stop() method to be called.                                                           |
 * ------------------------------------------------------------------------------------- */

private def warnSparkMem(value: String): String = {
  logWarning("Using SPARK_MEM to set amount of memory to use per executor process is " +
    "deprecated, please use spark.executor.memory instead.")
  value
}

/** Control our logLevel. This overrides any user-defined log settings.
 * @param logLevel The desired log level as a string.
 * Valid log levels include: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
 */
def setLogLevel(logLevel: String) {
  // let's allow lowercase or mixed case too
  val upperCased = logLevel.toUpperCase(Locale.ROOT)
  require(SparkContext.VALID_LOG_LEVELS.contains(upperCased),
    s"Supplied level $logLevel did not match one of:" +
      s" ${SparkContext.VALID_LOG_LEVELS.mkString(",")}")
  Utils.setLogLevel(org.apache.log4j.Level.toLevel(upperCased))
}

try {
  _conf = config.clone()
  _conf.validateSettings()

  if (!_conf.contains("spark.master")) {
    throw new SparkException("A master URL must be set in your configuration")
  }
  if (!_conf.contains("spark.app.name")) {
    throw new SparkException("An application name must be set in your configuration")
  }

  // log out spark.app.name in the Spark driver logs
  logInfo(s"Submitted application: $appName")

  // System property spark.yarn.app.id must be set if user code ran by AM on a YARN cluster
  if (master == "yarn" && deployMode == "cluster" && !_conf.contains("spark.yarn.app.id")) {
    throw new SparkException("Detected yarn cluster mode, but isn't running on a cluster. " +
      "Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.")
  }

  if (_conf.getBoolean("spark.logConf", false)) {
    logInfo("Spark configuration:\n" + _conf.toDebugString)
  }

  // Set Spark driver host and port system properties. This explicitly sets the configuration
  // instead of relying on the default value of the config constant.
  _conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
  _conf.setIfMissing("spark.driver.port", "0")

  _conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)

  _jars = Utils.getUserJars(_conf)
  _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
    .toSeq.flatten

  _eventLogDir =
    if (isEventLogEnabled) {
      val unresolvedDir = conf.get("spark.eventLog.dir", EventLoggingListener.DEFAULT_LOG_DIR)
        .stripSuffix("/")
      Some(Utils.resolveURI(unresolvedDir))
    } else {
      None
    }

  _eventLogCodec = {
    val compress = _conf.getBoolean("spark.eventLog.compress", false)
    if (compress && isEventLogEnabled) {
      Some(CompressionCodec.getCodecName(_conf)).map(CompressionCodec.getShortName)
    } else {
      None
    }
  }

  _listenerBus = new LiveListenerBus(_conf)

  // Initialize the app status store and listener before SparkEnv is created so that it gets
  // all events.
  _statusStore = AppStatusStore.createLiveStore(conf)
  listenerBus.addToStatusQueue(_statusStore.listener.get)

  // Create the Spark execution environment (cache, map output tracker, etc)
  _env = createSparkEnv(_conf, isLocal, listenerBus)
  SparkEnv.set(_env)

Because many components of SparkEnv deliver events to the LiveListenerBus event queues, the LiveListenerBus is created first. Its main functions are as follows (a minimal sketch of this listener pattern follows the list):

  • It holds the message queues and is responsible for caching posted events.
  • It holds the registered listeners and is responsible for dispatching events to them.

This is a simple listener pattern.
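Below is a minimal, illustrative sketch of that pattern; the class and method names are invented for illustration, and Spark's real LiveListenerBus is more elaborate (multiple named queues, metrics, graceful shutdown).

import java.util.concurrent.LinkedBlockingQueue

trait Listener[E] { def onEvent(event: E): Unit }

class SimpleListenerBus[E] {
  private val queue = new LinkedBlockingQueue[E]()          // caches posted events
  @volatile private var listeners = List.empty[Listener[E]] // registered listeners

  def addListener(l: Listener[E]): Unit = synchronized { listeners = l :: listeners }

  def post(event: E): Unit = queue.put(event) // producers only enqueue; they never block on listeners

  // A single daemon thread drains the queue and dispatches events asynchronously.
  private val dispatcher = new Thread(new Runnable {
    override def run(): Unit = while (true) {
      val event = queue.take()
      listeners.foreach(_.onEvent(event))
    }
  })
  dispatcher.setDaemon(true)
  dispatcher.start()
}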

Creating the SparkUI

SparkUI involves too many components to analyze in depth here; it will be analyzed separately later. Here is the code that creates the SparkUI:

_statusTracker = new SparkStatusTracker(this, _statusStore)
 
 _progressBar =
      if (_conf.get(UI_SHOW_CONSOLE_PROGRESS) && !log.isInfoEnabled) {
        Some(new ConsoleProgressBar(this))
      } else {
        None
      }

    _ui =
      if (conf.getBoolean("spark.ui.enabled", true)) {
        Some(SparkUI.create(Some(this), _statusStore, _conf, _env.securityManager, appName, "",
          startTime))
      } else {
        // For tests, do not enable the UI
        None
      }
    // Bind the UI before starting the task scheduler to communicate
    // the bound port to the cluster manager properly
    _ui.foreach(_.bind())

Creating the heartbeat receiver

In local mode, the Driver and Executors run on the same node and can interact locally, so anomalies are easy to detect.

In a production environment, the Executors and the Driver are usually started on different nodes. Therefore, so that the Driver can keep track of the Executors, a heartbeat receiver is created on the Driver.

// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
    // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
      HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

The setupEndpoint() method called here belongs to NettyRpcEnv, a sub-component of SparkEnv.

The role of this method is to register the HeartbeatReceiver with the Dispatcher of the RpcEnv and return a NettyRpcEndpointRef that references the HeartbeatReceiver.
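Conceptually, the Dispatcher maintains a name-to-endpoint registry and hands back a reference through which callers send messages. The following is a rough, illustrative sketch of that idea only (these are not Spark's actual RpcEnv classes):

import scala.collection.concurrent.TrieMap

trait Endpoint { def receive(msg: Any): Unit }

// An opaque handle callers use instead of the endpoint object itself.
class EndpointRef(name: String, registry: TrieMap[String, Endpoint]) {
  def send(msg: Any): Unit = registry.get(name).foreach(_.receive(msg))
}

class SimpleDispatcher {
  private val registry = TrieMap.empty[String, Endpoint]

  // Register an endpoint under a well-known name and return a reference to it,
  // loosely analogous to rpcEnv.setupEndpoint(HeartbeatReceiver.ENDPOINT_NAME, ...).
  def setupEndpoint(name: String, endpoint: Endpoint): EndpointRef = {
    registry.put(name, endpoint)
    new EndpointRef(name, registry)
  }
}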

Create and start scheduling system

TaskScheduler is responsible for requesting the cluster manager to allocate resources to the application and launch Executors (first-level scheduling), and for assigning tasks to those Executors and running them (second-level scheduling). It can be regarded as the task scheduler.

DAGScheduler mainly performs the preparatory work before tasks are formally submitted to TaskSchedulerImpl, including creating Jobs, dividing the RDD DAG into different stages, submitting stages, and so on.

// Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

The createTaskScheduler() method returns a 2-tuple of SchedulerBackend and TaskScheduler (a bit of Scala knowledge, illustrated below), so SparkContext's _taskScheduler now references the TaskScheduler. When HeartbeatReceiver receives the TaskSchedulerIsSet message, it fetches SparkContext's _taskScheduler and assigns it to its own scheduler field.
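The Scala detail being referenced is simply that a method can return a 2-tuple and the caller can destructure it into two vals in one statement, for example:

def makePair(): (String, Int) = ("backend", 42)

val (name, value) = makePair() // name = "backend", value = 42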
/**
   * Create a task scheduler based on a given master URL.
   * Return a 2-tuple of the scheduler backend and the task scheduler.
   */
  private def createTaskScheduler(
      sc: SparkContext,
      master: String,
      deployMode: String): (SchedulerBackend, TaskScheduler) = {
    import SparkMasterRegex._

    // When running locally, don't try to re-execute tasks on failure.
    val MAX_LOCAL_TASK_FAILURES = 1

    master match {
      case "local" =>
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, 1)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_REGEX(threads) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*] estimates the number of cores on the machine; local[N] uses exactly N threads.
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        if (threadCount <= 0) {
          throw new SparkException(s"Asked to run locally with $threadCount threads")
        }
        val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
        def localCpuCount: Int = Runtime.getRuntime.availableProcessors()
        // local[*, M] means the number of cores on the computer with M failures
        // local[N, M] means exactly N threads with M failures
        val threadCount = if (threads == "*") localCpuCount else threads.toInt
        val scheduler = new TaskSchedulerImpl(sc, maxFailures.toInt, isLocal = true)
        val backend = new LocalSchedulerBackend(sc.getConf, scheduler, threadCount)
        scheduler.initialize(backend)
        (backend, scheduler)

      case SPARK_REGEX(sparkUrl) =>
        val scheduler = new TaskSchedulerImpl(sc)
        val masterUrls = sparkUrl.split(",").map("spark://" + _)
        val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
        scheduler.initialize(backend)
        (backend, scheduler)

      case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
        // Check to make sure memory requested <= memoryPerSlave. Otherwise Spark will just hang.
        val memoryPerSlaveInt = memoryPerSlave.toInt
        if (sc.executorMemory > memoryPerSlaveInt) {
          throw new SparkException(
            "Asked to launch cluster with %d MB RAM / worker but requested %d MB/worker".format(
              memoryPerSlaveInt, sc.executorMemory))
        }

        val scheduler = new TaskSchedulerImpl(sc)
        val localCluster = new LocalSparkCluster(
          numSlaves.toInt, coresPerSlave.toInt, memoryPerSlaveInt, sc.conf)
        val masterUrls = localCluster.start()
        val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
        scheduler.initialize(backend)
        backend.shutdownCallback = (backend: StandaloneSchedulerBackend) => {
          localCluster.stop()
        }
        (backend, scheduler)

      case masterUrl =>
        val cm = getClusterManager(masterUrl) match {
          case Some(clusterMgr) => clusterMgr
          case None => throw new SparkException("Could not parse Master URL: '" + master + "'")
        }
        try {
          val scheduler = cm.createTaskScheduler(sc, masterUrl)
          val backend = cm.createSchedulerBackend(sc, masterUrl, scheduler)
          cm.initialize(scheduler, backend)
          (backend, scheduler)
        } catch {
          case se: SparkException => throw se
          case NonFatal(e) =>
            throw new SparkException("External scheduler cannot be instantiated", e)
        }
    }
  }

Initializing the BlockManager

BlockManager is one of the components of SparkEnv. It encompasses all the components and functions of the Spark storage system and is the most important component of that storage system. The storage system will be studied in a follow-up post.

_applicationId = _taskScheduler.applicationId()
_env.blockManager.initialize(_applicationId)

  

Start the Metric System

Spark has its own monitoring (metrics) system, which supports testability, performance optimization, operations assessment, statistics collection, and so on. The Spark metrics system is built on the third-party Codahale Metrics library.

There are three important concepts in the Spark metrics system:

  • Instance: the name of the instance using the metrics system, divided into Master, Worker, Application, Driver, and Executor.
  • Source: the source of metrics data, for example the application metrics source (ApplicationSource), the Worker metrics source (WorkerSource), the DAGScheduler metrics source (DAGSchedulerSource), and the BlockManager metrics source (BlockManagerSource).
  • Sink: the output destination of metrics data. The default is the metrics servlet (MetricsServlet); ConsoleSink, CsvSink, JmxSink, GraphiteSink, and others are also provided.
The MetricsSystem encapsulates Sources and Sinks and outputs data from the Sources to the various Sinks.
MetricsSystem is one of the internal components of SparkEnv and is the metrics system for the entire Spark application.
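Sinks and extra Sources are normally wired up through the metrics configuration file (conf/metrics.properties by default, or the file pointed to by spark.metrics.conf). A small example of the [instance].sink.[name].[option] syntax, adapted from the metrics.properties.template shipped with Spark (treat the exact values as illustrative):

# Report metrics from every instance to the console every 10 seconds.
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds

# Additionally expose JVM metrics for the driver instance.
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource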
// The metrics system for Driver need to be set spark.app.id to app ID.
    // So it should start after we get app ID from the task scheduler and set spark.app.id.
    _env.metricsSystem.start()
    // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
    _env.metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))

This attaches the metrics system's ServletContextHandlers to the SparkUI.

Create an event log listener (optional)

 _eventLogger =
      if (isEventLogEnabled) {
        val logger =
          new EventLoggingListener(_applicationId, _applicationAttemptId, _eventLogDir.get,
            _conf, _hadoopConfiguration)
        logger.start()
        listenerBus.addToEventLogQueue(logger)
        Some(logger)
      } else {
        None
      }

Create and start ExecutorAllocationManager

ExecutorAllocationManager is the agent responsible for dynamically allocating and removing Executors based on workload.

Internally, it periodically computes the number of Executors required by the current workload.

If the required number is greater than the number of Executors already requested from the cluster manager, it requests additional Executors from the cluster manager; conversely, it asks the cluster manager to release some of the Executors.

In addition, it periodically asks the cluster manager to kill Executors that have been idle past their expiry time.
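Dynamic allocation is disabled by default; it is typically switched on with configuration along these lines (the numeric values are arbitrary examples, and on YARN the external shuffle service is usually also enabled so that shuffle files survive Executor removal):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")          // lower bound kept alive
  .set("spark.dynamicAllocation.maxExecutors", "20")         // upper bound requested under load
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // idle executors are released after this
  .set("spark.shuffle.service.enabled", "true")              // usually required so shuffle data outlives executors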

// Optionally scale number of executors dynamically based on workload. Exposed for testing.
    val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
    _executorAllocationManager =
      if (dynamicAllocationEnabled) {
        schedulerBackend match {
          case b: ExecutorAllocationClient =>
            Some(new ExecutorAllocationManager(
              schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf,
              _env.blockManager.master))
          case _ =>
            None
        }
      } else {
        None
      }
    _executorAllocationManager.foreach(_.start())

Creating and starting the ContextCleaner

ContextCleaner cleans up RDDs that have gone out of the application's scope, the map task statuses corresponding to shuffles, shuffle metadata, Broadcast objects, and RDD checkpoint data.

  • Creating the ContextCleaner
_cleaner =
      if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
        Some(new ContextCleaner(this))
      } else {
        None
      }
    _cleaner.foreach(_.start())
  • Starting the ContextCleaner
/** Start the cleaner. */
  def start(): Unit = {
    cleaningThread.setDaemon(true)
    cleaningThread.setName("Spark Context Cleaner")
    cleaningThread.start()
    periodicGCService.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = System.gc()
    }, periodicGCInterval, periodicGCInterval, TimeUnit.SECONDS)
  }

Apart from the periodic GC timer, the rest of ContextCleaner works in the same way as the listenerBus (using the listener pattern, with an asynchronous thread doing the processing).
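Under the hood, the cleaning thread relies on the standard weak-reference idiom: each tracked object is wrapped in a weak reference registered with a reference queue, and once the object is garbage-collected the reference appears on the queue and the associated cleanup task runs. A minimal illustrative sketch of that idiom (not ContextCleaner's actual classes):

import java.lang.ref.{ReferenceQueue, WeakReference}

// Pair a weak reference with the cleanup action to run once the referent is gone.
class TrackedRef[T <: AnyRef](obj: T, queue: ReferenceQueue[T], val cleanup: () => Unit)
  extends WeakReference[T](obj, queue)

object WeakCleanupDemo {
  def main(args: Array[String]): Unit = {
    val queue = new ReferenceQueue[AnyRef]()
    var payload: AnyRef = new Array[Byte](1 << 20)
    val ref = new TrackedRef(payload, queue, () => println("payload collected, running cleanup"))

    payload = null // drop the only strong reference
    System.gc()    // request a GC, like ContextCleaner's periodic System.gc()

    // ContextCleaner's cleaning thread does the equivalent of this poll in a loop.
    Option(queue.remove(1000)) match {
      case Some(r: TrackedRef[_]) => r.cleanup()
      case _                      => println("referent not collected yet")
    }
  }
}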

Updating the Spark environment

  • How are the extra JAR packages or other files that the user specifies when submitting a job handled?

When SparkContext is initialized, it reads the JAR files or other files specified by the user:

_jars = Utils.getUserJars(_conf)
    _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.nonEmpty))
      .toSeq.flatten

The JAR files are read first, followed by the other files specified by the user.

In YARN mode, _jars is the union of the JAR files specified by spark.jars and spark.yarn.dist.jars.

In other modes, only the JAR files specified by spark.jars are used.

  • How does a task get these JARs and files?
def jars: Seq[String] = _jars
def files: Seq[String] = _files

// Add each JAR given through the constructor
if (jars != null) {
  jars.foreach(addJar)
}

if (files != null) {
  files.foreach(addFile)
}

addJar adds the JAR file to the Driver's RPC environment (its file server), from which Executors can fetch it.
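The same methods are also public and can be called after the context has been created; files shipped with addFile are later resolved on the executors with SparkFiles.get. A small usage sketch (the paths are placeholders):

import org.apache.spark.SparkFiles

sc.addJar("/path/to/extra-dependency.jar") // served to executors through the driver's RPC/file server
sc.addFile("/path/to/lookup.csv")          // downloaded into each executor's working directory

val firstLines = sc.parallelize(1 to 4, 4).map { _ =>
  // On the executor side, resolve the local copy of the distributed file.
  val localPath = SparkFiles.get("lookup.csv")
  scala.io.Source.fromFile(localPath).getLines().next()
}.collect()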

Because addJar and addFile may affect the application environment, SparkContext updates the environment at the very end of its initialization by calling postEnvironmentUpdate().

Finishing SparkContext initialization

postEnvironmentUpdate()
  postApplicationStart()

  // Post init
  _taskScheduler.postStartHook() // wait for the SchedulerBackend to be ready
  // register Sources with the metrics system
  _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
  _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
  _executorAllocationManager.foreach { e =>
    _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
  }

  // Make sure the context is stopped if the user forgets about it. This avoids leaving
  // unfinished event logs around after the JVM exits cleanly. It doesn't help if the JVM
  // is killed, though.
  // add SparkContext's shutdown hook
  logDebug("Adding shutdown hook") // force eager creation of logger
  _shutdownHookRef = ShutdownHookManager.addShutdownHook(
    ShutdownHookManager.SPARK_CONTEXT_SHUTDOWN_PRIORITY) { () =>
    logInfo("Invoking stop() from shutdown hook")
    try {
      stop()
    } catch {
      case e: Throwable =>
        logWarning("Ignoring Exception while stopping SparkContext from shutdown hook", e)
    }
  }
} catch {
  case NonFatal(e) =>
    logError("Error initializing SparkContext.", e)
    try {
      stop()
    } catch {
      case NonFatal(inner) =>
        logError("Error stopping SparkContext after init error.", inner)
    } finally {
      throw e
    }
}


// In order to prevent multiple SparkContexts from being active at the same time, mark this
// context as having finished construction. 
// NOTE: this must be placed at the end of the SparkContext constructor.
SparkContext.setActiveContext(this, allowMultipleContexts)

  

Common methods provided by SparkContext

broadcast

/**
   * Broadcast a read-only variable to the cluster, returning a
   * [[org.apache.spark.broadcast.Broadcast]] object for reading it in distributed functions.
   * The variable will be sent to each cluster only once.
   *
   * @param value value to broadcast to the Spark nodes
   * @return `Broadcast` object, a read-only variable cached on each machine
   */
  def broadcast[T: ClassTag](value: T): Broadcast[T] = {
    assertNotStopped()
    require(!classOf[RDD[_]].isAssignableFrom(classTag[T].runtimeClass),
      "Can not directly broadcast RDDs; instead, call collect() and broadcast the result.")
    val bc = env.broadcastManager.newBroadcast[T](value, isLocal)
    val callSite = getCallSite
    logInfo("Created broadcast " + bc.id + " from " + callSite.shortForm)
    cleaner.foreach(_.registerBroadcastForCleanup(bc))
    bc
  }

Essentially, it calls the newBroadcast() method of SparkEnv's BroadcastManager to create the broadcast object.
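A typical usage sketch: broadcast a small lookup table once and read it inside tasks through .value (the data here is purely illustrative):

val lookup = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))

val labelled = sc.parallelize(Seq(1, 2, 3, 4))
  .map(k => k -> lookup.value.getOrElse(k, "unknown")) // each executor reads its cached copy
  .collect()
// labelled: Array[(Int, String)] = Array((1,one), (2,two), (3,three), (4,unknown))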

addSparkListener

It is used to add a listener that implements the SparkListenerInterface trait to the LiveListenerBus.

/**
   * :: DeveloperApi ::
   * Register a listener to receive up-calls from events that happen during execution.
   */
  @DeveloperApi
  def addSparkListener(listener: SparkListenerInterface) {
    listenerBus.addToSharedQueue(listener)
  }
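A usage sketch: a custom listener that overrides a couple of SparkListener callbacks and is then registered on the bus (the log output is illustrative):

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobLoggingListener extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    println(s"Job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    println(s"Job ${jobEnd.jobId} finished with result ${jobEnd.jobResult}")
}

sc.addSparkListener(new JobLoggingListener)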

runJob

SparkContext provides many overloaded runJob methods; they all eventually call the runJob below.

/**
   * Run a function on a given set of partitions in an RDD and pass the results to the given
   * handler function. This is the main entry point for all actions in Spark.
   *
   * @param rdd target RDD to run tasks on
   * @param func a function to run on each partition of the RDD
   * @param partitions set of partitions to run on; some jobs may not want to compute on all
   * partitions of the target RDD, e.g. for operations like `first()`
   * @param resultHandler callback to pass each result to
   */
  def runJob[T, U: ClassTag](
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit = {
    if (stopped.get()) {
      throw new IllegalStateException("SparkContext has been shutdown")
    }
    val callSite = getCallSite
    val cleanedFunc = clean(func)
    logInfo("Starting job: " + callSite.shortForm)
    if (conf.getBoolean("spark.logLineage", false)) {
      logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
    }
    // call the runJob() method of the DAGScheduler created earlier during SparkContext initialization
    dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
    progressBar.foreach(_.finishAll())
    rdd.doCheckpoint() // save the checkpoint
  }
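For reference, an action such as count() ends up as a call of this shape; one of the simpler public overloads can also be invoked directly (the summing function is just an example):

val rdd = sc.parallelize(1 to 1000, numSlices = 4)

// Run one task per partition and collect the per-partition results on the driver.
val partitionSums: Array[Int] = sc.runJob(rdd, (iter: Iterator[Int]) => iter.sum)

val total = partitionSums.sum // 500500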

setCheckpointDir

Sets the directory where RDD checkpoint data is saved; setting it is a prerequisite for enabling the checkpoint mechanism.

/**
   * Set the directory under which RDDs are going to be checkpointed.
   * @param directory path to the directory where checkpoint files will be stored
   * (must be HDFS path if running in cluster)
   */
  def setCheckpointDir(directory: String) {

    // If we are running on a cluster, log a warning if the directory is local.
    // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
    // its own local file system, which is incorrect because the checkpoint files
    // are actually on the executor machines.
    if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
      logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
        s"must not be on the local filesystem. Directory '$directory' " +
        "appears to be on the local filesystem.")
    }

    // Create a unique subdirectory under the given directory for this application's checkpoints.
    checkpointDir = Option(directory).map { dir =>
      val path = new Path(dir, UUID.randomUUID().toString)
      val fs = path.getFileSystem(hadoopConfiguration)
      fs.mkdirs(path)
      fs.getFileStatus(path).getPath.toString
    }
  }
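A usage sketch (the HDFS path is a placeholder): set the checkpoint directory first, mark an RDD for checkpointing, and then run an action; the doCheckpoint() call at the end of runJob is what actually writes the data.

sc.setCheckpointDir("hdfs://namenode:8020/user/spark/checkpoints") // placeholder path

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.checkpoint() // only marks the RDD; nothing is written yet
rdd.count()      // the action runs the job, then doCheckpoint() materializes the checkpoint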

  

References

1. "The Art of Spark Kernel Design: Architecture Design and Implementation"

2. Spark 2.4.3 source code

Origin www.cnblogs.com/qinglanmei/p/11209281.html