1. NetworkWordCount
The classic word-count example below receives text over a TCP socket and counts words in one-second batches:

package yk.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Batch interval of 1 second
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // Receive text lines from a TCP socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
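To try the example, start a simple data server in one terminal with netcat (nc -lk 9999), type some words into it, and run the application in another terminal. Alternatively, the same pipeline can be exercised without a socket by feeding RDDs through queueStream; the following self-contained sketch runs locally (QueueWordCount is a name introduced here for illustration):

package yk.streaming

import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("QueueWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // A queue of RDDs stands in for the socket source during local testing
    val rddQueue = new Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue)

    // The same word-count pipeline as NetworkWordCount
    val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    rddQueue += ssc.sparkContext.makeRDD(Seq("hello spark", "hello streaming"))
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}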
2. Initializing the StreamingContext
For any Spark Streaming application, the first step is to initialize a StreamingContext. Here a SparkConf object and a Duration (the batch interval) are passed to StreamingContext's auxiliary constructor. Note that the Checkpoint argument is set to null by default, as the following code shows:
/**
 * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
 * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
 * @param batchDuration the time interval at which streaming data will be divided into batches
 */
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
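The null passed here is the Checkpoint parameter of the primary constructor; it is non-null only when a context is being recovered from checkpoint data. At the API level that recovery path usually goes through StreamingContext.getOrCreate. A minimal sketch, assuming /tmp/streaming-checkpoint as a placeholder checkpoint directory:

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1)) // auxiliary constructor, checkpoint = null
  ssc.checkpoint("/tmp/streaming-checkpoint")      // placeholder directory
  // ... build the DStream pipeline here ...
  ssc
}

// Rebuilds the context (including its DStreamGraph) from checkpoint data if present;
// otherwise falls back to createContext
val ssc = StreamingContext.getOrCreate("/tmp/streaming-checkpoint", createContext _)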
The auxiliary constructor then invokes the primary constructor, which executes every statement in the class body (that is, it initializes all fields; methods run only when explicitly called). It is during this process, the initialization of StreamingContext, that the member variables are set up. The most important of these are DStreamGraph, JobScheduler, and StreamingTab. DStreamGraph is analogous to the DAG of RDDs: it holds the DStreams and the dependencies between them. JobScheduler periodically inspects the DStreamGraph and generates jobs to run over the data that has arrived. StreamingTab provides monitoring of the stream processing while a Spark Streaming job is running. The relevant code is as follows:
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    _cp.graph.setContext(this)
    _cp.graph.restoreCheckpointData()
    _cp.graph
  } else {
    require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(_batchDur)
    newGraph
  }
}

private[streaming] val scheduler = new JobScheduler(this)

private[streaming] val uiTab: Option[StreamingTab] =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(new StreamingTab(this))
  } else {
    None
  }
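Note that uiTab is created only when spark.ui.enabled is true (the default), so the Streaming monitoring tab can be switched off from the application's own configuration, for example:

// Disables the web UI, and with it the Streaming tab created above
val sparkConf = new SparkConf()
  .setAppName("NetworkWordCount")
  .setMaster("local[2]")
  .set("spark.ui.enabled", "false")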
3. Creating the InputDStream
The example then calls StreamingContext's socketTextStream method to create the concrete InputDStream. socketTextStream takes three parameters: hostname and port identify the server to connect to, and StorageLevel.MEMORY_AND_DISK_SER is the storage level for the received data.
/**
 * Creates an input stream from TCP source hostname:port. Data is received using
 * a TCP socket and the received bytes are interpreted as UTF8 encoded `\n` delimited
 * lines.
 * @param hostname      Hostname to connect to for receiving data
 * @param port          Port to connect to for receiving data
 * @param storageLevel  Storage level to use for storing the received objects
 *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
 * @see [[socketStream]]
 */
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
Tracing into socketStream, we find that it creates a SocketInputDStream. As the DStream class diagram in SparkStreaming之images shows, SocketInputDStream extends ReceiverInputDStream, which extends InputDStream, which in turn extends DStream.
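For reference, the body of socketStream is essentially a one-line construction (abridged from the Spark source; details may vary slightly between versions):

def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}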
In addition, SocketInputDStream overrides the getReceiver method of ReceiverInputDStream, which is the method used to create the receiver.
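Abridged from the Spark source, the class looks roughly as follows; getReceiver simply constructs a SocketReceiver from the connection parameters:

private[streaming]
class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}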
4. Starting the Spark Streaming Application
Once the InputDStream has been created, StreamingContext's start method is called to start the Spark Streaming application. The most important step inside start is starting the JobScheduler. The code is as follows:
/**
 * Start the execution of the streams.
 *
 * @throws IllegalStateException if the StreamingContext is already stopped.
 */
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()

          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
            // Start the JobScheduler (the key step)
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      logDebug("Adding shutdown hook") // force eager creation of logger
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
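The match on state encodes the context's lifecycle: INITIALIZED, ACTIVE, STOPPED. Starting an already-active context only logs a warning, while starting a stopped one throws an IllegalStateException. A minimal sketch of the observable behavior (imports as in the QueueWordCount sketch above; queueStream gives validate() the output operation it requires):

val conf = new SparkConf().setAppName("lifecycle").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.queueStream(new Queue[RDD[Int]]()).print() // validate() requires at least one output operation

ssc.start()    // INITIALIZED -> ACTIVE: JobScheduler starts on the "streaming-start" thread
ssc.start()    // ACTIVE: only logs "StreamingContext has already been started"
ssc.stop()     // ACTIVE -> STOPPED
// ssc.start() // STOPPED: would throw IllegalStateException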