Spark DStream related operations

Like RDDs, DStreams provide a series of operations of their own. These operations can be divided into three categories: normal transformation operations, window transformation operations, and output operations.

Normal transformation operations

The common normal transformation operations are shown in Table 1.

Table 1 Common transformation operations
Transformation Description
map(func) Returns a new DStream by passing each element of the source DStream through the function func
flatMap(func) Similar to map, except that each input element can be mapped to zero or more output elements
filter(func) Returns a new DStream containing only the elements of the source DStream for which the function func returns true
repartition(numPartitions) Changes the number of partitions of the DStream to the value given by the numPartitions parameter
union(otherStream) Returns a new DStream containing the union of the elements of the source DStream and the other DStream
count() Counts the number of elements in each RDD of the source DStream and returns a new DStream whose internal RDDs each contain a single element
reduce(func) Returns a new DStream by aggregating the elements in each RDD of the source DStream using the function func (which takes two arguments and returns one result); each internal RDD contains a single element
countByValue() Counts the frequency of each element within each RDD of the source DStream and returns a new DStream of <K, Long> pairs, where K is the element type of the RDD and Long is the frequency of the element
reduceByKey(func,[numTasks]) When called on a DStream of <K, V> key-value pairs, returns a new DStream of key-value pairs in which the values V of each key are aggregated using the aggregation function func. The number of concurrent tasks can be configured with the optional numTasks argument
join(otherStream,[numTasks]) When called on two DStreams of <K, V> and <K, W> key-value pairs, returns a new DStream of <K, <V, W>> key-value pairs
cogroup(otherStream,[numTasks]) When called on two DStreams containing <K, V> and <K, W> key-value pairs, returns a new DStream of type <K, Seq[V], Seq[W]>
transform(func) Returns a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream; this can be used to apply any RDD operation on the DStream
updateStateByKey(func) Returns a new "state" DStream, in which the new state of each key is computed by applying the function func to the previous state of the key and its new values. This method can be used to maintain arbitrary state data for each key
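
As a quick illustration of several of these normal transformations chained together, the following is a minimal sketch. It assumes a StreamingContext named ssc and a socket text source on localhost:9999; the source and all names are illustrative assumptions, not taken from the original text.

val lines = ssc.socketTextStream("localhost", 9999)   // one text line per record
val words = lines.flatMap(_.split(" "))                // flatMap: one line -> many words
val longWords = words.filter(_.length > 3)             // filter: keep words longer than 3 characters
val counts = longWords.map(word => (word, 1))          // map: word -> (word, 1)
                      .reduceByKey(_ + _)              // reduceByKey: sum the counts per word within each batch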

Among the operations listed in Table 1, the transform(func) method and the updateStateByKey(func) method are worth a closer look.

1. The transform(func) method

The transform(func) method, together with the similar transformWith(func) method, allows an arbitrary RDD-to-RDD function to be applied to a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API.

For example, joining each batch of a data stream with another dataset is not directly exposed in the DStream API, but it can easily be done with the transform(func) method, which makes DStreams very powerful.

For example, real-time data cleaning can be done by joining the input data stream with precomputed spam information and then filtering on the result. In fact, machine learning and graph computation algorithms can also be used in the transform(func) method.
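
The following is a minimal sketch of this spam-filtering idea, assuming a StreamingContext named ssc and an input DStream of (user, message) pairs named messages; the spam list and all names here are illustrative assumptions, not taken from the original text.

val spamUsers = ssc.sparkContext.parallelize(Seq(("spammer1", true), ("spammer2", true)))

val cleaned = messages.transform { rdd =>
  // join each batch against the precomputed spam list,
  // then keep only the records whose user does not appear on the list
  rdd.leftOuterJoin(spamUsers)
     .filter { case (_, (_, spamFlag)) => spamFlag.isEmpty }
     .map { case (user, (msg, _)) => (user, msg) }
}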

2. The updateStateByKey(func) method

The updateStateByKey(func) method can maintain arbitrary state while continuously updating it with new information. To use this feature, the following two steps must be performed.

1) Define the state: the state can be data of any type.

2) Define the state update function: specify with a function how to update the state using the previous state and the new values obtained from the input stream.

To illustrate with an example, suppose we want to maintain a running word count over a text data stream. Here, the running count is the state, and it is an integer. The update function is defined as follows.

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  val newCount = ...  // add the new values to the previous runningCount to obtain the new count
  Some(newCount)
}

This function is applied to a DStream containing key-value pairs (for example, in the earlier word count example, the DStream contains <word, 1> key-value pairs). The update function is called for each element (such as each word in the word count), where newValues is the sequence of new values and runningCount is the previous count.

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
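
Put together, a minimal runnable sketch of this stateful word count might look as follows. It assumes a local Spark installation and a socket text source on localhost:9999; the master setting, checkpoint directory, and source are illustrative assumptions.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/streaming-checkpoint")  // updateStateByKey requires a checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))  // add the new counts to the previous total

val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
runningCounts.print()

ssc.start()
ssc.awaitTermination()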

Window transformation operations

Spark Streaming also provides window computations, which allow transformations to be applied over a sliding window of data. The window transformation operations are shown in Table 2.

Table 2 Window transformation operations
Transformation Description
window(windowLength,slideInterval) Returns a new DStream computed from windowed batches of the source DStream
countByWindow(windowLength,slideInterval) Returns the number of elements in the DStream within a sliding window
reduceByWindow(func,windowLength,slideInterval) Returns a new DStream obtained by aggregating the elements of the source DStream over a sliding window using the function func
reduceByKeyAndWindow(func,windowLength,slideInterval,[numTasks]) When called on a DStream of <K, V> key-value pairs, aggregates the values of each key K over a sliding window using the aggregation function func and returns a new DStream
reduceByKeyAndWindow(func,invFunc,windowLength,slideInterval,[numTasks]) A more efficient version of the above, which first incrementally aggregates the new data entering the sliding window in each interval and then removes the statistics of the oldest data of the same time interval.
For example, when computing the WordCount of the last 5 seconds at time t+4, the statistics of the last 5 seconds at time t+3 can be reused: add the statistics of [t+3, t+4] and subtract the statistics of [t-2, t-1]. This method reuses the statistics of the middle 3 seconds and makes the computation more efficient
countByValueAndWindow(windowLength,slideInterval,[numTasks]) Computes the frequency of each element within the sliding window for each RDD of the source DStream and returns a new DStream of <K, Long> pairs, where K is the element type of the RDD and Long is the frequency of the element. The number of reduce tasks can be configured with an optional parameter

In Spark Streaming, data is processed in batches, while data is collected record by record. Therefore, Spark Streaming first sets a batch interval; whenever the batch interval elapses, the data collected during that interval is aggregated into a batch and handed to the system for processing.

For a window operation, there are N batches of data inside its window. The amount of batch data is determined by the window interval, and the window interval refers to the duration of the window.

In window operations, the processing of the batch data is triggered only when the window length has been reached. Besides the window length, window operations have another important parameter, the slide interval, which specifies how much time passes before the window slides once to form a new window. By default, the slide interval is the same as the batch interval, while the window interval is generally set larger than both of them. One point that must be noted here is that the slide interval and the window interval must both be set to integer multiples of the batch interval.

As shown in Figure 1, the batch interval is 1 time unit, the window interval is 3 time units, and the slide interval is 2 time units. For the initial window (time 1 to time 3), data processing is triggered only once the window interval has been satisfied.

Note that the initial window may not be completely filled by the incoming data, but as time goes on, the window will eventually be filled. Every 2 time units the window slides once, new data flows into the window, and the window discards the data of the earliest 2 time units and aggregates the data of the latest 2 time units to form a new window (time 3 to time 5).

Figure 1  Schematic of DStream batch intervals

For window operations, the batch interval, the window interval, and the slide interval are three very important time concepts and are the key to understanding window operations.
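
As a minimal sketch matching the proportions of Figure 1, the following assumes a StreamingContext created with a 1-second batch interval and a DStream of (word, 1) pairs named pairs; all names are illustrative assumptions. The window interval (3 seconds) and the slide interval (2 seconds) are both integer multiples of the batch interval, as required.

import org.apache.spark.streaming.Seconds

val windowedCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,  // aggregation function applied within the window
  Seconds(3),                 // window interval: the window covers 3 batches
  Seconds(2)                  // slide interval: a new window is formed every 2 batches
)
windowedCounts.print()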

Output operations

Spark Streaming allows the data of a DStream to be output to external systems, such as databases or file systems. Output operations are what make the data produced by transformation operations available to external systems; at the same time, output operations trigger the actual execution of all the DStream transformation operations (similar to actions on RDDs). Table 3 lists the main output operations currently available.

Table 3 Output operations
Transformation Description
print() Prints the first 10 elements of the data in the DStream on the Driver
saveAsTextFiles(prefix,[suffix]) Saves the contents of the DStream as text files, where the file generated in each batch interval is named prefix-TIME_IN_MS[.suffix]
saveAsObjectFiles(prefix,[suffix]) Serializes the contents of the DStream as objects and saves them in SequenceFile format, where the file generated in each batch interval is named prefix-TIME_IN_MS[.suffix]
saveAsHadoopFiles(prefix,[suffix]) Saves the contents of the DStream as Hadoop files in text form, where the file generated in each batch interval is named prefix-TIME_IN_MS[.suffix]
foreachRDD(func) The most basic output operation. Applies the function func to the RDDs in the DStream; this operation outputs data to an external system, for example saving the RDD to a file or to a database over the network. Note that the function func is executed in the Driver process of the Streaming application

dstream.foreachRDD is a very powerful output operation that allows data to be sent to external systems. However, it is important to use this operation correctly and efficiently; the following explains how to avoid some common mistakes.

Usually, writing data to an external system requires creating a connection object (such as a TCP connection to a remote server) and using it to send data to the remote system. For this purpose, a developer may inadvertently create the connection object on the Spark Driver and then try to use it on the Spark Workers to save the records in the RDD, as in the following code.

dstream.foreachRDD { rdd =>
  val connection = createNewConnection()  // executed on the Driver
  rdd.foreach { record =>
    connection.send(record)  // executed on a Worker
  }
}

This is incorrect, because it requires the connection object to be serialized and sent from the Driver to the Workers. Connection objects can rarely be transferred across machines like this; the mistake may show up as a serialization error (the connection object is not serializable), an initialization error (the connection object needs to be initialized on the Worker), and so on. The correct solution is to create the connection object on the Worker.

Usually, creating a connection object has time and resource overhead. Therefore, creating and destroying a connection object for every record can incur unnecessary overhead and significantly reduce the overall throughput of the system.

A better solution is to use the rdd.foreachPartition method to create a single connection object per partition and use that connection object to send all the data of the RDD partition to the external system, as shown in the sketch below.
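
The following is a minimal sketch of this per-partition pattern; createNewConnection(), connection.send(), and connection.close() stand for a hypothetical client API and are illustrative assumptions, not a real library.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = createNewConnection()  // one connection per partition, created on the Worker
    partitionOfRecords.foreach(record => connection.send(record))
    connection.close()  // release the connection once the partition has been written
  }
}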

This can be further optimized by reusing connection objects across multiple RDDs/batches. A static pool of connection objects can be maintained and reused across the RDDs of multiple batches, further reducing the overhead.

Note that the connections in the static pool should be created lazily, on demand, so that data can be sent to the external system more efficiently; a sketch of this pooled variant follows. Another point to note is that DStreams are executed lazily, just as RDD operations are triggered by actions. By default, output operations are executed one by one, in the order in which they are defined in the Streaming application.
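
Here is a minimal sketch of the pooled variant, assuming a hypothetical ConnectionPool object whose getConnection() and returnConnection() methods are backed by a lazily created static pool; this API is an illustrative assumption, not a standard Spark or third-party interface.

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    val connection = ConnectionPool.getConnection()  // lazily created and reused across batches
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return the connection to the pool instead of closing it
  }
}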

Persistence

Like RDDs, DStreams can also keep the data stream in memory through the persist() method. The default persistence level is MEMORY_ONLY_SER, which stores the data in memory in serialized form. The benefit of this is that the speed advantage is very noticeable for programs that require repeated iterative computation.
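
A minimal sketch, assuming an existing DStream named wordCounts; calling persist() with no argument uses the DStream default level described above, while an explicit StorageLevel can be passed instead.

import org.apache.spark.storage.StorageLevel

wordCounts.persist()  // use the default serialized in-memory level (MEMORY_ONLY_SER)
// an explicit level can be passed instead, for example to spill to disk when memory is insufficient:
// wordCounts.persist(StorageLevel.MEMORY_AND_DISK)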

For window-based operations such as reduceByWindow and reduceByKeyAndWindow, as well as state-based operations such as updateStateByKey, the default persistence strategy is to keep the data in memory.

For data sources coming over the network (Kafka, Flume, sockets, and so on), the default persistence strategy is to keep the data on two machines, which is designed for fault tolerance.

