Spark Source Code Reading: Job Generation and Submission in the Streaming Module

We usually develop a Spark Streaming application with code like the following:

val sparkConf = new SparkConf()
    .set("xxx", "")
    ...
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(5))

// Then we use ssc to create an InputDStream
val stream = ssc.socketTextStream("localhost", 9090)
stream.map(...)
    .filter(...)
    .reduce(...)
    .foreachRDD(...)
ssc.start()
ssc.awaitTermination()
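
For concreteness, here is an illustrative version of the skeleton above (my example, not from the original post): it counts the non-empty lines received in each 5-second batch.

// Illustrative only: count the non-empty lines in each batch
val lines = ssc.socketTextStream("localhost", 9090)
lines.map(_.trim)
  .filter(_.nonEmpty)
  .count()
  .foreachRDD { rdd => rdd.collect().foreach(println) }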

We know that this chain of calls on stream is really the process of building a DAG. Let's step into the foreachRDD code and take a look:

  private def foreachRDD(
      foreachFunc: (RDD[T], Time) => Unit,
      displayInnerRDDOps: Boolean): Unit = {
    new ForEachDStream(this,
      context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
  }

DStream::register

  private[streaming] def register(): DStream[T] = {
    ssc.graph.addOutputStream(this)
    this
  }

DStreamGraph::addOutputStream

  def addOutputStream(outputStream: DStream[_]) {
    this.synchronized {
      outputStream.setGraph(this)
      outputStreams += outputStream
    }
  }

Here the DStream registers itself with the DStreamGraph; subsequent job generation will traverse this outputStreams collection. Note that every output operation registers its own output stream, as the illustrative snippet below shows (my example, not from the Spark source):
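
// Each output operation creates and registers its own ForEachDStream, so
// outputStreams gains two entries here and two jobs will be generated per
// batch interval.
stream.map(_.length).foreachRDD(rdd => println(rdd.count()))
stream.filter(_.nonEmpty).foreachRDD(rdd => println(rdd.count()))

At this point, the DAG has been constructed and registered. Now let's see how the StreamingContext generates a job at each batch interval, starting from its start method: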

  def start(): Unit = synchronized {
    state match {
      case INITIALIZED =>
        startSite.set(DStream.getCreationSite())
        StreamingContext.ACTIVATION_LOCK.synchronized {
          StreamingContext.assertNoOtherContextIsActive()
          try {
            validate()

            // Start the streaming scheduler in a new thread, so that thread local properties
            // like call sites and job groups can be reset without affecting those of the
            // current thread.
            ThreadUtils.runInNewThread("streaming-start") {
              sparkContext.setCallSite(startSite.get)
              sparkContext.clearJobGroup()
              sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
              savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
              scheduler.start()
            }
            state = StreamingContextState.ACTIVE
          } catch {
            case NonFatal(e) =>
              logError("Error starting the context, marking it as stopped", e)
              scheduler.stop(false)
              state = StreamingContextState.STOPPED
              throw e
          }
          StreamingContext.setActiveContext(this)
        }
        shutdownHookRef = ShutdownHookManager.addShutdownHook(
          StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
        // Registering Streaming Metrics at the start of the StreamingContext
        assert(env.metricsSystem != null)
        env.metricsSystem.registerSource(streamingSource)
        uiTab.foreach(_.attach())
        logInfo("StreamingContext started")
      case ACTIVE =>
        logWarning("StreamingContext has already been started")
      case STOPPED =>
        throw new IllegalStateException("StreamingContext has already been stopped")
    }
  }

The core of this method is the call to scheduler.start(); the scheduler (a JobScheduler) is created when the StreamingContext is instantiated.

  //JobScheduler::start
  def start(): Unit = synchronized {
    if (eventLoop != null) return // scheduler has already been started

    logDebug("Starting JobScheduler")
    eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
      override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
    }
    eventLoop.start()

    // attach rate controllers of input streams to receive batch completion updates
    for {
      inputDStream <- ssc.graph.getInputStreams
      rateController <- inputDStream.rateController
    } ssc.addStreamingListener(rateController)

    listenerBus.start()
    receiverTracker = new ReceiverTracker(ssc)
    inputInfoTracker = new InputInfoTracker(ssc)
    executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
      ssc.sparkContext,
      receiverTracker,
      ssc.conf,
      ssc.graph.batchDuration.milliseconds,
      clock)
    executorAllocationManager.foreach(ssc.addStreamingListener)
    receiverTracker.start()
    jobGenerator.start()
    executorAllocationManager.foreach(_.start())
    logInfo("Started JobScheduler")
  }
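
Both the JobScheduler here and the JobGenerator below are driven by Spark's internal EventLoop utility. As a rough sketch of the pattern (simplified; not the actual org.apache.spark.util.EventLoop source), it is a daemon thread draining a blocking queue and dispatching each event to onReceive:

import java.util.concurrent.LinkedBlockingDeque

abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingDeque[E]()
  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (true) {
          val event = queue.take() // blocks until an event is posted
          try onReceive(event) catch { case t: Throwable => onError(t) }
        }
      } catch {
        case _: InterruptedException => // stop() interrupts the thread
      }
    }
  }
  def start(): Unit = thread.start()
  def stop(): Unit = thread.interrupt()
  def post(event: E): Unit = queue.put(event)
  protected def onReceive(event: E): Unit
  protected def onError(e: Throwable): Unit
}

The eventLoop created in JobScheduler.start above simply overrides onReceive to call processEvent.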

Here we focus on the startup of the jobGenerator. Take a look at JobGenerator.start:

  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }

We see that JobGenerator also creates and starts an event loop thread, this one accepting events of type JobGeneratorEvent. If a checkpoint is present, restart() recovers from it; on a first start, startFirstTime() is called. Whether a checkpoint is present depends on how the context was constructed; the usual recovery pattern is sketched below (my example with hypothetical paths, not from the original post):
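
// Hypothetical checkpoint path. If the checkpoint directory exists, the
// context is rebuilt from it and ssc.isCheckpointPresent is true; otherwise
// the creating function runs and a fresh context starts.
val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", () => {
  val newSsc = new StreamingContext(sparkConf, Seconds(5))
  newSsc.checkpoint("hdfs:///checkpoints/app")
  // ... build the DStream graph here ...
  newSsc
})

On a first start, startFirstTime does the following: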

  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }

This method starts both the DStreamGraph and the timer. Let's see what the timer is:

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

So it is simply a recurring timer: at a fixed interval it posts a GenerateJobs event to the event loop thread, and that interval is exactly the batch duration we passed in when creating the StreamingContext. Below is a rough sketch of how such a timer works (simplified; not the actual RecurringTimer source):
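
class SimpleRecurringTimer(period: Long, callback: Long => Unit, name: String) {
  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      // Align the first firing to the next multiple of the period; this is
      // what timer.getStartTime() computes in startFirstTime above.
      var nextTime = (System.currentTimeMillis() / period + 1) * period
      while (!Thread.currentThread().isInterrupted) {
        val sleepMs = nextTime - System.currentTimeMillis()
        if (sleepMs > 0) {
          try Thread.sleep(sleepMs)
          catch { case _: InterruptedException => return }
        }
        callback(nextTime) // here: eventLoop.post(GenerateJobs(new Time(nextTime)))
        nextTime += period // the next batch boundary
      }
    }
  }
  def start(): Unit = thread.start()
  def stop(): Unit = thread.interrupt()
}

Now let's see how the event loop thread handles the GenerateJobs event: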

  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }
  
  private def generateJobs(time: Time) {
    // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
    // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
    ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
        PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

The focus here is the call to DStreamGraph.generateJobs, which returns the job set for this batch; the resulting jobs are then handed to JobScheduler.submitJobSet.

  //DStreamGraph::generateJobs
  def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }

We can see that the outputStreams registered earlier are traversed here, and generateJob is called on each registered output DStream to produce a job:

  //DStream::generateJob
  private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) =>
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      case None => None
    }
  }

getOrCompute calls the compute method to obtain the RDD for this batch time. compute is an abstract method that each DStream subclass must implement; if you are interested, take a look at how DirectKafkaInputDStream in the Kafka integration module implements it. Once an RDD is returned, generateJob wraps it in a closure that submits it through SparkContext.runJob, and packages that closure into a Job object. In other words, the actual submission to Spark only happens when that closure runs; we will see below when that is.
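
As a condensed sketch of getOrCompute (the real method also handles checkpointing, storage levels, and call-site bookkeeping):

  private[streaming] def getOrCompute(time: Time): Option[RDD[T]] = {
    // RDDs are memoized per batch time in the generatedRDDs map
    generatedRDDs.get(time).orElse {
      if (isTimeValid(time)) {
        val rddOption = compute(time) // abstract; implemented by each subclass
        rddOption.foreach(rdd => generatedRDDs.put(time, rdd))
        rddOption
      } else {
        None
      }
    }
  }

Now let's unwind the call stack from DStream.generateJob back to JobScheduler.submitJobSet, which JobGenerator.generateJobs invoked: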

  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }

In submitJobSet, each job is wrapped in a JobHandler and handed to the jobExecutor thread pool for execution. JobHandler implements the Runnable interface; the core logic of its run method is as follows:

var _eventLoop = eventLoop
if (_eventLoop != null) {
  // notify the scheduler that the job has started
  _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
  // Disable output-spec validation so that jobs replayed during recovery can
  // overwrite existing output directories
  PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
    job.run() // runs the closure created in DStream.generateJob
  }
  // re-read eventLoop: it is set to null when the scheduler is stopped
  _eventLoop = eventLoop
  if (_eventLoop != null) {
    _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
  }
}

job.run() is called here, and what run actually executes is the closure we handed to the Job object earlier. Running a job is a blocking operation that occupies one thread of the jobExecutor; the pool's size is taken from the spark.streaming.concurrentJobs parameter when the JobScheduler is initialized, and it defaults to 1. With that, we have roughly traced how the streaming module generates and submits jobs.
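
If you want multiple jobs (from different output operations, or from successive batches) to run concurrently, you can raise that parameter, keeping in mind that in-order execution across batches is then no longer guaranteed:

// spark.streaming.concurrentJobs sets the size of the jobExecutor pool
// (default 1, i.e. strictly one job at a time)
val sparkConf = new SparkConf()
  .set("spark.streaming.concurrentJobs", "2")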
