Spark Learning 9: Spark Streaming, the streaming data processing component

Spark Streaming related concepts

Outline

Spark Streaming is mainly used for real-time processing of streaming data, for example real-time web log analysis and real-time tracking of page views and other statistics.
The characteristics of streaming data are:

  • The data is always changing
  • The data cannot be rolled back
  • The data keeps pouring in continuously

Spark Streaming is a scalable, high-throughput framework built on Spark for real-time processing of streaming data. The data can come from many different sources, e.g. Kafka, Flume, Twitter, ZeroMQ, or TCP sockets. Within this framework the usual computations on streaming data are supported, such as map, reduce, and join. The processed data can then be stored in a file system or database.

The tutorial illustrates the Streaming framework with a figure like the one below: the stream data sources are on the left, and on the right are the storage targets for the processed output.

In other words, the HDFS file system used in earlier applications still works with Spark Streaming.

DStream, the basic data abstraction of Spark Streaming

The basic data abstraction of this component is the DStream, short for "discretized stream"; it is a concept specific to Spark Streaming, just as the RDD is to Spark itself.
It represents a continuous data stream over time, which may be either an input data stream received from a data source or a processed stream generated by transforming an input stream. Internally, a DStream is represented by a series of consecutive RDDs, each containing the data of a particular time interval.

Any transformation applied to a DStream is applied to the underlying RDDs. In the example image from the tutorial you can see that the "line" RDDs of each moment form a lines DStream; after applying flatMap to this lines DStream, the whole "lines" DStream is turned into a words DStream. In practice this is achieved by applying flatMap to the underlying RDD of each time interval, converting it into the words RDD of that interval.
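
As a minimal sketch of that idea (the full version appears in the wordCount practice section later in this post), assuming lines is a DStream[String] created earlier, e.g. with ssc.socketTextStream:

// flatMap on the DStream applies flatMap to the underlying RDD of every time interval
val words = lines.flatMap(line => line.split(" "))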

Processing mode

At the lowest level, Spark Streaming handles stream data by slicing it by time into small data segments and then processing those segments in a way similar to batch processing.
Taken as a whole, the idea behind Spark Streaming is: persist the continuous data, discretize it, and process it in batches.

  • Data persistence: the received data is temporarily stored, so that when an error occurs the source data is still available and can be processed again
  • Discretization: the data is stored by specific time intervals (my understanding is that this is where the DStream comes in; the DStream is the collection of discrete RDDs formed from the data stream received at each moment)
  • Batch processing: the data in the RDDs is processed in batch mode

A DStream is thus another layer of packaging on top of RDDs, and it provides two kinds of operations: transformation operations and output operations.

Details of the processing workflow

StreamingContext

This is the context object for "stream data". It plays the same role as the SparkContext object (sc) used earlier to manipulate RDDs: it provides the ability to create and transform DStreams and to receive and process data. Without a StreamingContext object none of these features can be used.

Creating a StreamingContext object

import org.apache.spark._
import org.apache.spark.streaming._

val conf = new SparkConf().setAppName(appName).setMaster(master) 
val ssc = new StreamingContext(conf, Seconds(1)) // pass the conf configuration object to the StreamingContext constructor

RDDs are manipulated with a SparkContext. If you need a SparkContext to manipulate RDDs here, note that the StreamingContext creates a SparkContext internally, and it can be accessed with the dot operator:

ssc.sparkContext

Main uses of a StreamingContext

  • Create DStreams from (possibly custom) data sources
  • Apply transformation and output operations to DStreams
  • Start receiving data: StreamingContext.start()
  • Wait for the processing to finish: StreamingContext.awaitTermination()
  • Stop the program: StreamingContext.stop()
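
Putting these steps together, here is a minimal sketch of the typical lifecycle of a StreamingContext; the socket source and the print() output below are placeholders for whatever logic the application actually needs:

val ssc = new StreamingContext(conf, Seconds(1))        // create the context with a 1-second batch interval
val lines = ssc.socketTextStream("localhost", 9999)     // create a DStream from a data source
lines.flatMap(_.split(" ")).print()                     // transformation and output operations
ssc.start()                                             // start receiving and processing data
ssc.awaitTermination()                                  // wait for the processing to finish
// ssc.stop()                                           // or stop the program explicitly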

Input Source

Every input DStream (data source), with the exception of file streams, is associated with a Receiver object; the receiver receives the data from the source and stores it in memory.

Spark Streaming provides two types of built-in data sources:

Basic data sources. These are available directly through the StreamingContext API, such as file systems, socket connections, and Akka actors. For simple files, the textFileStream method of StreamingContext can be used.
Advanced data sources. Tools such as Flume, Kafka, Kinesis, and Twitter can be used as data sources; using these data sources requires the corresponding dependencies.
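
For example, a basic file-system source can be created directly through the StreamingContext API with textFileStream; this is only a sketch and the directory path below is hypothetical:

// watch a directory for new files; every new file becomes part of the stream
val fileLines = ssc.textFileStream("hdfs://namenode:9000/user/spark/streaming-in")
fileLines.print()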

The two kinds of DStream transformations

DStream transformation operations come in two kinds: stateless and stateful.

Stateless transformation operations

The processing of each batch of data does not depend on the data of previous batches.

Transformation operation  Meaning
map(func)  Generates a new DStream by applying the function func to every element
flatMap(func)  Similar to map, but each input element can produce multiple output values; the return value of func is a collection
union(otherStream)  Takes the union of two DStreams and returns a new DStream
count()  Counts the number of elements in every RDD of the DStream
reduce(func)  Aggregates the elements of every RDD of the DStream using the function func
countByValue()  If the DStream has elements of type K, returns a new DStream whose elements are of type (K, Long), where K is an original value of the DStream and Long is the number of times that value occurs
reduceByKey(func, [numTasks])  For a DStream of key-value pairs (K, V), returns a new DStream in which K is the key and the values of each key are aggregated with the function func
join(otherStream, [numTasks])  For a DStream keyed as (K, V), performing a join with a DStream of (K, W) pairs produces a DStream keyed as (K, (V, W))
cogroup(otherStream, [numTasks])  Similar to join: for a DStream of (K, V) pairs cogrouped with a DStream of (K, W) pairs, produces a DStream of (K, (Seq[V], Seq[W]))
transform(func)  Calls the function func on every RDD of the DStream; func takes an RDD as its parameter and also returns an RDD
updateStateByKey(func)  For every key, calls the function func to combine all previous state with the new values
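
As a hedged sketch of how a few of these stateless operations are used (words is assumed to be a DStream[String], as in the wordCount example later in this post):

val pairs = words.map(word => (word, 1))                       // map: word -> (word, 1)
val counts = pairs.reduceByKey(_ + _)                          // reduceByKey: aggregate the values of each key
val nonEmpty = words.transform(rdd => rdd.filter(_.nonEmpty))  // transform: apply an ordinary RDD operation to every batch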

Stateful transformation operations

These need data or intermediate results from previous batches in order to compute the data of the current batch.
Stateful transformation operations include window operations (window-based transformations) and the updateStateByKey operation (a transformation that tracks state changes).

  • Window operations merge the DStreams of several batches into one DStream; in other words, the DStreams falling inside a window become a single DStream (a sketch with reduceByKeyAndWindow follows after this list).
    Every window operation requires 2 parameters:
    window length: as the name suggests, the length of the window, i.e. the number of batches each window covers (3 in the figure below: time1-time3 is one window, and time3-time5 is another window)
    sliding interval: as the name suggests, the interval by which the window slides each time, i.e. the gap between successive windows, such as between window1, window3, and window5 at the bottom of the figure; in the figure this value is 2

  • The updateStateByKey operation
    Using the updateStateByKey method requires the following two steps:
    Define the state: the state can be any data type
    Define the state update function: this function specifies how to combine the previous state with all the new values from the input stream
    Whether or not new data arrives, in every batch Spark calls the function func for every existing key; if func returns None, the key-value pair is not kept.
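
A minimal sketch of a window operation, assuming the pairs DStream of (word, 1) tuples from the wordCount example below and the 1-second batch interval used in this post (both the window length and the sliding interval must be multiples of the batch interval):

// count words over the last 3 seconds of data, recomputed every 2 seconds
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(3), Seconds(2))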

Let's explain the updateStateByKey method with an example that counts the occurrences of each word in a text input stream.
runningCount is the state and has type Int, so the state type is Int; runningCount is the previous state, and newValues is the collection of all new values. The function is as follows:

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
    // add the new values to the previous running count to get the new count
    val newCount = runningCount.getOrElse(0) + newValues.sum
    Some(newCount)
}

Calling the updateStateByKey method:

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
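
Note that stateful operations such as updateStateByKey require checkpointing to be enabled so the state can be saved and recovered; the checkpoint directory below is a placeholder:

// a checkpoint directory must be configured before using stateful operations
ssc.checkpoint("/tmp/spark-streaming-checkpoint")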

Output operations

The data in a DStream is usually output to external systems such as databases and file systems.

Output operation  Meaning
print()  Prints the first 10 elements of every batch of the DStream
saveAsTextFiles(prefix, [suffix])  Saves the data of the DStream to text files; the file name of each batch is generated from the prefix and suffix parameters: "prefix-TIME_IN_MS[.suffix]"
saveAsObjectFiles(prefix, [suffix])  Saves the data of the DStream to SequenceFiles of serialized Java objects; the file naming rule is the same as for saveAsTextFiles
saveAsHadoopFiles(prefix, [suffix])  Saves the data of the DStream to Hadoop files; the file naming rule is the same as for saveAsTextFiles
foreachRDD(func)  Iterates over every RDD in the DStream; during iteration the data in each RDD can be saved to an external system

Note:
The foreachRDD method iterates over every RDD in the DStream, and during iteration the data in each RDD can be saved to an external system. Writing data to an external system usually requires a connection object; a good approach is to use a ConnectionPool, which allows connection objects to be reused across multiple batches and RDDs. Example code:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  }
}

The output operations of a DStream are also lazily executed, just like RDD actions. RDD actions executed inside a DStream output operation force Spark Streaming to process the received data and produce output.

Practice (the simplest wordCount)

In this example we implement a Spark Streaming application that connects to a given TCP socket, receives string data, and computes word frequencies on that data in a MapReduce fashion. The example comes from the official documentation with some modifications.
Building an application on Spark Streaming is similar to building one on Spark: first create the context object, then operate on the abstract data objects; in Streaming, the data object being processed is the DStream.

Creating the StreamingContext object

// import the StreamingContext module from spark.streaming
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
// note: the last import is not required in Spark 1.3 and later

// The next line does not need to be entered in the Spark Shell, because the SparkContext object is already created when the Spark Shell starts up
// create a local SparkContext with 2 worker threads, with the app named StreamWordCount
// val conf = new SparkConf().setMaster("local[2]").setAppName("StreamWordCount")

// create a local StreamingContext directly from the SparkContext object sc; the second parameter is the batch interval, set here to 1 second
val ssc = new StreamingContext(sc, Seconds(1))

So the two key ingredients are the SparkContext object and the batch interval for processing the data. Here the batch interval is set to one second.

Creating the DStream object

Here the data comes from a socket, and the corresponding function for specifying the data source is socketTextStream. Note that this step can be entered without error even when no nc command has been run to make the machine listen on port 9999. The reason is that all of the business-logic code entered before ssc.start() can be regarded as lazy; only after start() is called does the StreamingContext run the previously entered business logic. In other words, with our code, the program runs correctly as long as nc starts listening before start() is called.

// create the DStream, specifying the data source as a socket: port 9999 on localhost
val lines = ssc.socketTextStream("localhost", 9999)

The lines data can come from many different kinds of data sources, but the principle is always similar: call the appropriate creation function to connect to data sources such as Kafka, Flume, HDFS/S3, Kinesis, or Twitter.

Operating on the DStream object

Split the DStream representing each line into a DStream of individual words.
This operation is basically no different from the RDD operations learned earlier; the biggest difference is that the object on which flatMap is called is no longer an RDD but a DStream.

// use flatMap and split to break up the strings received within each 1-second batch
val words = lines.flatMap(_.split(" "))

Then do the simplest aggregation:

// the map operation maps each individual word to a (word, 1) tuple
val pairs = words.map(word => (word, 1))

// the reduceByKey operation reduces the pairs to (word, count) tuples
val wordCounts = pairs.reduceByKey(_ + _)

Writing to the file system

// prefix of the output folders; Spark Streaming automatically uses the current timestamp to generate a different folder name for each batch
val outputFile = "/tmp/ss"

// write out the results
wordCounts.saveAsTextFiles(outputFile)

Note that up to this point none of the statements above have actually been executed; only after the StreamingContext object calls its start() method do they run repeatedly as the business logic, at the interval specified when the StreamingContext object was created, here one second. In other words, after start() is called, the data received on port 9999 is word-counted by Spark Streaming every second and the results are saved under the corresponding file system path.
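
For completeness, the statements that actually start the computation are the standard ones from the official example:

// start the streaming computation; from this point on the logic above runs every second
ssc.start()
// wait for the computation to terminate
ssc.awaitTermination()
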
Once every second is quite fast; I just went to the bathroom.


And that's it.

Origin www.cnblogs.com/ltl0501/p/12232332.html