1. NetworkWordCount
The classic word-count example below receives text over a TCP socket and counts words in one-second batches:

package yk.streaming

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
    // Batch interval of 1 second
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // Receive text lines from a TCP socket on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
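To try the example, start a simple data server in one terminal with netcat (nc -lk 9999), type some words into it, and run the application in another terminal. Alternatively, the same pipeline can be exercised without a socket by feeding RDDs through queueStream; the following self-contained sketch runs locally (QueueWordCount is a name introduced here for illustration):

package yk.streaming

import scala.collection.mutable.Queue

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueWordCount {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("QueueWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // A queue of RDDs stands in for the socket source during local testing
    val rddQueue = new Queue[RDD[String]]()
    val lines = ssc.queueStream(rddQueue)

    // The same word-count pipeline as NetworkWordCount
    val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    rddQueue += ssc.sparkContext.makeRDD(Seq("hello spark", "hello streaming"))
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()
  }
}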
2. Initializing the StreamingContext
For any Spark Streaming application, the first step is to initialize a StreamingContext. Here a SparkConf object and a Duration (the batch interval) are passed to StreamingContext's auxiliary constructor. Note that the Checkpoint argument is set to null by default, as the following code shows:
/**
 * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
 * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
 * @param batchDuration the time interval at which streaming data will be divided into batches
 */
def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}
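The null passed here is the Checkpoint parameter of the primary constructor; it is non-null only when a context is being recovered from checkpoint data. At the API level that recovery path usually goes through StreamingContext.getOrCreate. A minimal sketch, assuming /tmp/streaming-checkpoint as a placeholder checkpoint directory:

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("NetworkWordCount").setMaster("local[2]")
  val ssc = new StreamingContext(conf, Seconds(1)) // auxiliary constructor, checkpoint = null
  ssc.checkpoint("/tmp/streaming-checkpoint")      // placeholder directory
  // ... build the DStream pipeline here ...
  ssc
}

// Rebuilds the context (including its DStreamGraph) from checkpoint data if present;
// otherwise falls back to createContext
val ssc = StreamingContext.getOrCreate("/tmp/streaming-checkpoint", createContext _)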
The auxiliary constructor then invokes the primary constructor, which executes every statement in the class body (that is, it initializes all fields; methods run only when explicitly called). It is during this process, the initialization of StreamingContext, that the member variables are set up. The most important of these are DStreamGraph, JobScheduler, and StreamingTab. DStreamGraph is analogous to the DAG of RDDs: it holds the DStreams and the dependencies between them. JobScheduler periodically inspects the DStreamGraph and generates jobs to run over the data that has arrived. StreamingTab provides monitoring of the stream processing while a Spark Streaming job is running. The relevant code is as follows:
private[streaming] val graph: DStreamGraph = {
  if (isCheckpointPresent) {
    _cp.graph.setContext(this)
    _cp.graph.restoreCheckpointData()
    _cp.graph
  } else {
    require(_batchDur != null, "Batch duration for StreamingContext cannot be null")
    val newGraph = new DStreamGraph()
    newGraph.setBatchDuration(_batchDur)
    newGraph
  }
}

private[streaming] val scheduler = new JobScheduler(this)

private[streaming] val uiTab: Option[StreamingTab] =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(new StreamingTab(this))
  } else {
    None
  }
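Note that uiTab is created only when spark.ui.enabled is true (the default), so the Streaming monitoring tab can be switched off from the application's own configuration, for example:

// Disables the web UI, and with it the Streaming tab created above
val sparkConf = new SparkConf()
  .setAppName("NetworkWordCount")
  .setMaster("local[2]")
  .set("spark.ui.enabled", "false")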
3. Creating the InputDStream
The example then calls StreamingContext's socketTextStream method to create the concrete InputDStream. socketTextStream takes three parameters: hostname and port identify the server to connect to, and StorageLevel.MEMORY_AND_DISK_SER is the storage level for the received data.
/**
 * Creates an input stream from TCP source hostname:port. Data is received using
 * a TCP socket and the received bytes are interpreted as UTF8 encoded `\n` delimited
 * lines.
 * @param hostname      Hostname to connect to for receiving data
 * @param port          Port to connect to for receiving data
 * @param storageLevel  Storage level to use for storing the received objects
 *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
 * @see [[socketStream]]
 */
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
Tracing into socketStream, we find that it creates a SocketInputDStream. As the DStream class diagram in SparkStreaming之images shows, SocketInputDStream extends ReceiverInputDStream, which extends InputDStream, which in turn extends DStream.
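For reference, the body of socketStream is essentially a one-line construction (abridged from the Spark source; details may vary slightly between versions):

def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}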
In addition, SocketInputDStream overrides the getReceiver method of ReceiverInputDStream, which is the method used to create the receiver.
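Abridged from the Spark source, the class looks roughly as follows; getReceiver simply constructs a SocketReceiver from the connection parameters:

private[streaming]
class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}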
4. Starting the Spark Streaming Application
Once the InputDStream has been created, StreamingContext's start method is called to start the Spark Streaming application. The most important step inside start is starting the JobScheduler. The code is as follows:
/**
 * Start the execution of the streams.
 *
 * @throws IllegalStateException if the StreamingContext is already stopped.
 */
def start(): Unit = synchronized {
  state match {
    case INITIALIZED =>
      startSite.set(DStream.getCreationSite())
      StreamingContext.ACTIVATION_LOCK.synchronized {
        StreamingContext.assertNoOtherContextIsActive()
        try {
          validate()

          // Start the streaming scheduler in a new thread, so that thread local properties
          // like call sites and job groups can be reset without affecting those of the
          // current thread.
          ThreadUtils.runInNewThread("streaming-start") {
            sparkContext.setCallSite(startSite.get)
            sparkContext.clearJobGroup()
            sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false")
            savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get()))
            // Start the JobScheduler (the key step)
            scheduler.start()
          }
          state = StreamingContextState.ACTIVE
        } catch {
          case NonFatal(e) =>
            logError("Error starting the context, marking it as stopped", e)
            scheduler.stop(false)
            state = StreamingContextState.STOPPED
            throw e
        }
        StreamingContext.setActiveContext(this)
      }
      logDebug("Adding shutdown hook") // force eager creation of logger
      shutdownHookRef = ShutdownHookManager.addShutdownHook(
        StreamingContext.SHUTDOWN_HOOK_PRIORITY)(stopOnShutdown)
      // Registering Streaming Metrics at the start of the StreamingContext
      assert(env.metricsSystem != null)
      env.metricsSystem.registerSource(streamingSource)
      uiTab.foreach(_.attach())
      logInfo("StreamingContext started")
    case ACTIVE =>
      logWarning("StreamingContext has already been started")
    case STOPPED =>
      throw new IllegalStateException("StreamingContext has already been stopped")
  }
}
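The match on state encodes the context's lifecycle: INITIALIZED, ACTIVE, STOPPED. Starting an already-active context only logs a warning, while starting a stopped one throws an IllegalStateException. A minimal sketch of the observable behavior (imports as in the QueueWordCount sketch above; queueStream gives validate() the output operation it requires):

val conf = new SparkConf().setAppName("lifecycle").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.queueStream(new Queue[RDD[Int]]()).print() // validate() requires at least one output operation

ssc.start()    // INITIALIZED -> ACTIVE: JobScheduler starts on the "streaming-start" thread
ssc.start()    // ACTIVE: only logs "StreamingContext has already been started"
ssc.stop()     // ACTIVE -> STOPPED
// ssc.start() // STOPPED: would throw IllegalStateException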