【Spark Streaming】3. Getting Started with Spark Streaming

Getting Started with Spark Streaming

Data from different sources is processed by Spark Streaming and the results are written out to external file systems.

Features:

  • Low latency
  • Efficient recovery from failures: fault-tolerant
  • Batch processing, machine learning, graph computation and other Spark sub-frameworks can be combined with Spark Streaming

One stack to rule them all: a single unified stack

WordCount

Running the NetworkWordCount program with spark-submit

## This is the approach commonly used in production
./spark-submit --master local[2] \
--class org.apache.spark.examples.streaming.NetworkWordCount \
--name NetworkCount \
/home/hadoop/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/jars/spark-examples_2.11-2.2.0.jar hadoop000 9999

**Running the program with spark-shell (often used for testing)**

./spark-shell --master local[2]
## This launches an interactive shell; just type in the program line by line
import org.apache.spark.streaming.{Seconds,StreamingContext}
val ssc = new StreamingContext(sc, Seconds(1)) // one batch every second
val lines=ssc.socketTextStream("hadoop000",9999)
val words=lines.flatMap(_.split(" "))
val wordCounts=words.map(x=>(x,1)).reduceByKey(_+_)
wordCounts.print()
ssc.start()
ssc.awaitTermination()

How It Works

Coarse-grained view:

Spark Streaming receives a real-time data stream, slices it into small blocks according to a specified time interval, and then hands those small blocks to the Spark engine for processing.

Fine-grained view:

Spark Streaming Core

StreamingContext

StreamingContext is the main entry point for all streaming functionality.

Two commonly used constructors for creating a StreamingContext:

def this(sparkContext: SparkContext, batchDuration: Duration) = {
  this(sparkContext, null, batchDuration)
}

def this(conf: SparkConf, batchDuration: Duration) = {
  this(StreamingContext.createNewSparkContext(conf), null, batchDuration)
}

import org.apache.spark.streaming._

val sc = ...                // existing SparkContext
val ssc = new StreamingContext(sc, Seconds(1))

The batchDuration can be set according to the latency requirements of your application and the resources available on the cluster.
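
For reference, a batch interval is built with the Duration helpers in org.apache.spark.streaming. A minimal sketch; the interval values below are arbitrary examples, not recommendations:

import org.apache.spark.streaming.{Milliseconds, Minutes, Seconds}

// batchDuration is a Duration; these helpers build one from different units.
Seconds(1)          // 1-second batches, as in the example above
Milliseconds(500)   // sub-second batches for lower latency
Minutes(1)          // 1-minute batches when latency matters less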

Once a StreamingContext has been defined, you can do the following:

  • Define the input sources by creating input DStreams.
  • Define the streaming computation by applying transformations and output operations to the DStreams.
  • Start receiving data and processing it with streamingContext.start().
  • Wait for the processing to be stopped (manually or due to any error) with streamingContext.awaitTermination().
  • The processing can be stopped manually with streamingContext.stop().

Points to remember:

  • Once a context has been started, no new streaming computations can be set up or added to it.
  • Once a context has been stopped, it cannot be restarted.
  • Only one StreamingContext can be active in a JVM at the same time.
  • stop() on a StreamingContext also stops the SparkContext. To stop only the StreamingContext, set the optional parameter of stop() called stopSparkContext to false (see the sketch after this list).
  • A SparkContext can be reused to create multiple StreamingContexts, as long as the previous StreamingContext is stopped (without stopping the SparkContext) before the next one is created.
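
A minimal sketch of the last two points, assuming the existing sc and ssc from the spark-shell example above and the imports already shown:

// Stop only the streaming computation; the underlying SparkContext stays alive.
ssc.stop(stopSparkContext = false)

// The same SparkContext can then back a new StreamingContext,
// because the previous one has already been stopped.
val ssc2 = new StreamingContext(sc, Seconds(5))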

DStreams (Discretized Streams)

A discretized stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark's abstraction of an immutable, distributed dataset. Each RDD in a DStream contains data from a certain interval.

Operators on a DStream, such as map or flatMap, are translated under the hood into the same operation on every RDD in the DStream, because a DStream is made up of the RDDs of successive batches (as illustrated below).
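
To make that per-RDD view concrete, here is a minimal sketch assuming the lines DStream from the spark-shell example above; map is applied to every batch RDD, and transform exposes the same per-RDD function explicitly:

// map on a DStream runs on every RDD (every batch) in the stream
val upper1 = lines.map(_.toUpperCase)

// transform makes the per-RDD view explicit: the function receives each batch's RDD
val upper2 = lines.transform(rdd => rdd.map(_.toUpperCase))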

Input DStreams and Receivers

Input DStreams are DStreams representing the stream of input data received from streaming sources. Every input DStream (except file streams) is associated with a Receiver object (Scala doc, Java doc), which receives the data from a source and stores it in Spark's memory for processing.

Spark Streaming provides two categories of built-in streaming sources:

  • Basic sources: sources directly available in the StreamingContext API. Examples: file systems and socket connections.
  • Advanced sources: sources such as Kafka, Flume and Kinesis, which are available through extra utility classes. As described in the linking section, these require linking against extra dependencies (a hedged Kafka sketch follows this list).
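
As an illustration of an advanced source, the sketch below creates a direct Kafka stream. This is only a hedged example: it assumes the spark-streaming-kafka-0-10 artifact matching your Spark and Scala versions is on the classpath, and the broker address, group id and topic name are placeholders.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

// Placeholder Kafka consumer configuration
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "hadoop000:9092",          // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "spark-streaming-demo",    // placeholder group id
  "auto.offset.reset"  -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

// Subscribe to a placeholder topic and treat each record's value as a line of text
val kafkaStream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("wordcount"), kafkaParams))
val kafkaLines = kafkaStream.map(record => record.value)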

Note:

When running a Spark Streaming program locally, do not use "local" or "local[1]" as the master URL. Either of these means only one thread will be used for running tasks locally. If you use a receiver-based input DStream (e.g. sockets, Kafka, Flume, etc.), that single thread will be used to run the receiver, leaving no thread to process the received data. Hence, when running locally, always use "local[n]" as the master URL, where n > the number of receivers to run.
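
In code form, a minimal sketch of this rule (application name here is arbitrary):

import org.apache.spark.SparkConf

// BAD for receiver-based input: the single thread runs the receiver,
// leaving nothing to process the received data.
// val conf = new SparkConf().setAppName("Demo").setMaster("local[1]")

// GOOD: one thread for the receiver plus at least one for processing (n > number of receivers).
val conf = new SparkConf().setAppName("Demo").setMaster("local[2]")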

Transformations on DStreams

Similar to RDDs, transformations allow the data from the input DStream to be modified. DStreams support many of the transformations available on normal Spark RDDs.

| Transformation | Meaning |
| --- | --- |
| map(func) | Return a new DStream by passing each element of the source DStream through a function func. |
| flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| repartition(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream. |
| count() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| reduce(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| countByValue() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| reduceByKey(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: by default, this uses Spark's default number of parallel tasks (2 for local mode; in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| join(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| cogroup(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| transform(func) | Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| updateStateByKey(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |
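
As an example of a stateful transformation from the table above, here is a minimal updateStateByKey sketch for a running word count. It assumes the ssc and lines values from the WordCount example, and the checkpoint directory path is a placeholder; updateStateByKey requires checkpointing to be enabled:

// Checkpointing is required for stateful transformations; the path is a placeholder.
ssc.checkpoint("/tmp/spark-streaming-checkpoint")

val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Merge the new counts of a batch into the running count kept for each word.
val updateRunningCount = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateRunningCount)
runningCounts.print()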

Output Operations on DStreams

Output operations allow a DStream's data to be pushed out to external systems, such as a database or a file system. Since output operations are what actually let external systems consume the transformed data, they trigger the actual execution of all the DStream transformations (similar to actions on RDDs).

| Output Operation | Meaning |
| --- | --- |
| print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging. (Called pprint() in the Python API.) |
| saveAsTextFiles(prefix, [suffix]) | Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". |
| saveAsObjectFiles(prefix, [suffix]) | Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". (Not available in the Python API.) |
| saveAsHadoopFiles(prefix, [suffix]) | Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]". (Not available in the Python API.) |
| foreachRDD(func) | The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs. |
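
foreachRDD deserves special care: the function itself runs on the driver, but the RDD operations inside it run on the executors, so connections to external systems should be opened per partition rather than per record or on the driver. A minimal sketch of this pattern, assuming the wordCounts DStream from the examples; createNewConnection is a hypothetical helper standing in for your own database or client code:

wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // createNewConnection() is a hypothetical helper: open one connection per partition,
    // on the executor, instead of one per record (or one on the driver, which would
    // require the connection object to be serialized and shipped).
    val connection = createNewConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()
  }
}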

Hands-on Examples

  • Processing socket data with Spark Streaming

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object SparkStreamingDemo {
      def main(args: Array[String]): Unit = {
        val sparkConf = new SparkConf().setAppName("SparkStreamingDemo")
          .setMaster("local[2]")

        // one batch every 2 seconds
        val ssc = new StreamingContext(sparkConf, Seconds(2))

        // the port is an Int, not a String
        val lines = ssc.socketTextStream("localhost", 9999)

        val wordCounts = lines.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        wordCounts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }
    

    nc -lk 9999

    (input)

    a b c d a d c d a a

    Console output:

    Time: 1570849330000 ms

    (d,3)
    (b,1)
    (a,4)
    (c,2)

  • Processing file data with Spark Streaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingDemo {
  def main(args: Array[String]): Unit = {

    val sparkConf = new SparkConf().setAppName("SparkStreamingDemo")
      .setMaster("local[2]")
    // one batch every 5 seconds
    val ssc = new StreamingContext(sparkConf, Seconds(5))

    // monitor this directory for newly arriving text files
    val lines = ssc.textFileStream("/Users/wojiushiwo/spark")

    val wordCounts = lines.flatMap(line => line.split(" ")).map(word => (word, 1))
      .reduceByKey(_ + _)

    wordCounts.print()

    // start the computation
    ssc.start()
    // wait for the computation to terminate
    ssc.awaitTermination()
  }
}

Note the use of the textFileStream method, which is described as follows:

Creates an input stream that monitors a Hadoop-compatible filesystem for new files and reads them as text files (using keys as LongWritable, values as Text, and input format as TextInputFormat). Files must be written to the monitored directory by "moving" them from another location within the same filesystem. File names starting with . are ignored.

Using /Users/wojiushiwo/spark as the monitored directory, a file is only treated as new, and therefore processed, when it is cp'd or mv'd into that directory from another path. Files that already exist in the directory when the job starts are not processed, and modifying an existing file inside the directory does not trigger processing either (a sketch of moving a file in follows below).
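
A minimal sketch of dropping a new file into the monitored directory so textFileStream picks it up; the staging path and file name are placeholders, and an equivalent shell mv from the same filesystem works just as well:

import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the file somewhere else first, then move it into the monitored directory
// so it appears there atomically as a brand-new file.
val staged    = Paths.get("/Users/wojiushiwo/staging/words.txt")   // placeholder staging file
val monitored = Paths.get("/Users/wojiushiwo/spark/words.txt")
Files.move(staged, monitored, StandardCopyOption.ATOMIC_MOVE)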
