Spark Streaming Part 2

Streaming Context

To initialize a Spark Streaming program, a StreamingContext object has to be created, which is the main entry point of all Spark Streaming functionality.

Constructors

  /**
   * Create a StreamingContext using an existing SparkContext.
   * @param sparkContext existing SparkContext
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(sparkContext: SparkContext, batchDuration: Duration) = {
    this(sparkContext, null, batchDuration)
  }
  /**
   * Create a StreamingContext by providing the configuration necessary for a new SparkContext.
   * @param conf a org.apache.spark.SparkConf object specifying Spark parameters
   * @param batchDuration the time interval at which streaming data will be divided into batches
   */
  def this(conf: SparkConf, batchDuration: Duration) = {
    this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
  }
The appName parameter is a name for your application to show on the cluster UI. master is a Spark, Mesos or YARN cluster URL, or a special “local[*]” string to run in local mode. In practice, when running on a cluster, you will not want to hardcode master in the program, but rather launch the application with spark-submit and receive it there. However, for local testing and unit tests, you can pass “local[*]” to run Spark Streaming in-process (detects the number of cores in the local system). Note that this internally creates a SparkContext (starting point of all Spark functionality) which can be accessed as ssc.sparkContext.

The batch interval must be set based on the latency requirements of your application and available cluster resources. See the Performance Tuning section for more details.

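As a minimal sketch of the two constructors above (the app name, master URL and 1-second batch interval are just example values; run inside a main method or the spark-shell), a StreamingContext can be created either from a SparkConf or from an existing SparkContext:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Option 1: build the StreamingContext from a SparkConf;
// a SparkContext is created internally and is reachable as ssc.sparkContext
val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[*]")
val ssc = new StreamingContext(conf, Seconds(1))
val sc: SparkContext = ssc.sparkContext

// Option 2: reuse an existing SparkContext
// val sc = new SparkContext(conf)
// val ssc = new StreamingContext(sc, Seconds(1))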

After a context is defined, you have to do the following.

Define the input sources by creating input DStreams.
Define the streaming computations by applying transformation and output operations to DStreams.
Start receiving data and processing it using streamingContext.start().
Wait for the processing to be stopped (manually or due to any error) using streamingContext.awaitTermination().
The processing can be manually stopped using streamingContext.stop().
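Put together, a minimal skeleton of these steps might look like the following (a sketch only; the object name, host, port and 5-second batch interval are placeholder choices):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSkeleton {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSkeleton").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // 1. Define the input source
    val lines = ssc.socketTextStream("localhost", 9999)

    // 2. Define the streaming computation (transformations plus an output operation)
    lines.flatMap(_.split(" ")).map((_, 1)).print()

    // 3. Start receiving and processing data
    ssc.start()

    // 4. Wait for the processing to be stopped, manually or due to an error
    ssc.awaitTermination()

    // ssc.stop() could be called from another thread to stop processing manually
  }
}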

DStreams (Discretized Streams)

A DStream is represented by a continuous series of RDDs.

Each RDD in a DStream contains data from a certain interval.

Applying an operator such as map or flatMap to a DStream is, under the hood, translated into applying the same operation to every RDD inside the DStream, because a DStream is made up of the RDDs of its successive batches, as the sketch below illustrates.
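For example (a sketch assuming an existing StreamingContext ssc and a placeholder socket source), a map on the DStream and a transform that maps each underlying RDD express the same computation:

val lines = ssc.socketTextStream("localhost", 9999)

// Operating on the DStream directly...
val upper1 = lines.map(_.toUpperCase)

// ...is translated into the same operation on each batch's RDD,
// which can also be written explicitly with transform
val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))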

Input DStreams and Receivers

Every input DStream (except file stream, discussed later in this section) is associated with a Receiver (Scala doc / Java doc) object which receives the data from a source and stores it in Spark’s memory for processing.

Note that file streams are the exception: they do not require a receiver, since new files in the monitored directory are read directly.

  /**
   * Creates an input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   * @see [[socketStream]]
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {

  /**
   * Create an input stream that monitors a Hadoop-compatible filesystem
   * for new files and reads them as text files (using key as LongWritable, value
   * as Text and input format as TextInputFormat). Files must be written to the
   * monitored directory by "moving" them from another location within the same
   * file system. File names starting with . are ignored.
   * @param directory HDFS directory to monitor for new file
   */
  def textFileStream(directory: String): DStream[String] = withNamedScope("text file stream") {
    fileStream[LongWritable, Text, TextInputFormat](directory).map(_._2.toString)
  }
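As a usage sketch (host, port and directory are placeholders; ssc is an existing StreamingContext), both basic sources are created directly on the StreamingContext:

// Receiver-based source: UTF-8, '\n'-delimited lines read from a TCP socket
val socketLines = ssc.socketTextStream("localhost", 9999)

// File-based source: no receiver; new files moved into the directory are read as text
val fileLines = ssc.textFileStream("hdfs://namenode:8020/data/streaming")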
Spark Streaming provides two categories of built-in streaming sources.

Basic sources: Sources directly available in the StreamingContext API. Examples: file systems, and socket connections.
Advanced sources: Sources like Kafka, Flume, Kinesis, etc. are available through extra utility classes. These require linking against extra dependencies as discussed in the linking section.

When running a Spark Streaming program locally, do not use “local” or “local[1]” as the master URL. Either of these means that only one thread will be used for running tasks locally. If you are using an input DStream based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single thread will be used to run the receiver, leaving no thread for processing the received data. Hence, when running locally, always use “local[n]” as the master URL, where n > number of receivers to run (see Spark Properties for information on how to set the master).

Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
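For instance (a sketch), a program with a single socket receiver needs at least two local threads, one to run the receiver and one to process the received batches:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// "local[2]": one thread for the receiver, one for task processing
val conf = new SparkConf().setAppName("ReceiverSketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))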

Transformations on DStreams

Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs. Some of the common ones are map, flatMap, filter, repartition, union, count, reduce, countByValue, reduceByKey, join, cogroup, transform, and updateStateByKey.
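As a sketch, assuming lines is a DStream[String] (for example from socketTextStream), a few of these look as follows:

val words  = lines.flatMap(_.split(" "))             // one element per word
val longer = words.filter(_.length > 3)              // keep only longer words
val pairs  = words.map((_, 1))                       // (word, 1) pairs
val counts = pairs.reduceByKey(_ + _)                // per-batch word counts
val dedup  = lines.transform(rdd => rdd.distinct())  // arbitrary RDD-to-RDD function per batch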

Output Operations on DStreams

Output operations allow a DStream’s data to be pushed out to external systems like a database or a file system. Common built-in output operations include print(), saveAsTextFiles, saveAsObjectFiles, saveAsHadoopFiles, and foreachRDD.
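A sketch, reusing the counts DStream from the previous fragment (the output path is a placeholder):

// Print the first elements of every batch on the driver (handy for development)
counts.print()

// Persist each batch as text files named from the given prefix and suffix
counts.saveAsTextFiles("hdfs://namenode:8020/output/wordcounts", "txt")

// foreachRDD is the generic escape hatch: run arbitrary RDD actions per batch,
// e.g. to push records to an external system
counts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    partition.foreach(record => println(record))  // replace with a real sink
  }
}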

Processing Socket Data with Spark Streaming

package com.rachel

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one for the socket receiver, one for processing
    val sparkConf = new SparkConf().setAppName("StatefulWordCount")
      .setMaster("local[2]")
    // 5-second batch interval
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Receive '\n'-delimited text lines from the given host and port
    val lines = ssc.socketTextStream("192.168.1.6", 1111)
    // Split each line into words and pair every word with a count of 1
    val result = lines.flatMap(_.split(" ")).map((_, 1))
    // Print a sample of each batch on the driver
    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
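To try it, first start a data server on the target host, for example with netcat (nc -lk 1111), then type lines of words; every 5-second batch is printed as (word, 1) pairs. Adding .reduceByKey(_ + _) after the map would turn it into a per-batch word count.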


Reposted from blog.csdn.net/sinat_37513998/article/details/82906647