[Spark 84] The Relationship Between DStreams and RDDs in Spark Streaming

Question: within one time interval (batch interval), how many RDDs does Spark Streaming generate from the data it receives?

Testing shows that one RDD is produced per batchInterval, but that conclusion was only an observation.

If the volume of data arriving within a given batchInterval is very large, it was initially unclear how many RDDs Spark Streaming would produce; that could only be settled by reading the source code.

The answer is definite: each batchInterval produces one and only one RDD. Even if no data at all is received during the interval, the DStream still yields one RDD for it, but that RDD has no elements, i.e. RDD.isEmpty returns true.
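
As a minimal sketch (assuming a dstream like the one in the example further below), the empty-batch case can be guarded against inside foreachRDD:

// Sketch: skip work for batches in which no data was received.
// Assumes `dstream` is a DStream[String], e.g. from socketTextStream as below.
dstream.foreachRDD(rdd => {
  if (rdd.isEmpty()) {
    println("Empty batch: this interval produced an RDD with no elements")
  } else {
    println("Non-empty batch, record count = " + rdd.count())
  }
})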

When a new time window (batchInterval) starts, an empty block is created, and the data received during that window is appended to it. When the window ends, accumulation stops, and the accumulated data is exactly the data contained in the RDD corresponding to that window.
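
A note and a rough sketch (this is my reading of the receiver design, not something the quoted documentation states): the receiver actually cuts a new block every spark.streaming.blockInterval (200 ms by default), and the blocks collected during one batch become the partitions of that batch's single RDD. Printing the partition count per batch therefore shows how many blocks the interval accumulated:

// Sketch: observe how many blocks (partitions) back each batch's RDD.
// Assumes `dstream` is the socketTextStream DStream from the example below.
dstream.foreachRDD(rdd => {
  println("batch RDD id=" + rdd.id + ", partitions(blocks)=" + rdd.partitions.length)
})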

The following code prints to the console once every five seconds; five seconds is the interval at which Spark Streaming batches up the received data for processing.

package spark.examples.streaming

import java.util.concurrent.atomic.AtomicInteger

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SparkStreamingForPartition {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("NetCatWordCount")
    conf.setMaster("local[3]")
    // 5-second batch interval: one RDD is produced per 5-second window
    val ssc = new StreamingContext(conf, Seconds(5))
    val dstream = ssc.socketTextStream("192.168.26.140", 9999)
    val count = new AtomicInteger(0)
    dstream.foreachRDD(rdd => {
      //      rdd.foreach(record => println("This is the content: " + record + "," + rdd.hashCode()))
      //      count.incrementAndGet()
      // Printed exactly once per batch: batch counter, identity of the batch's RDD, timestamp
      println(count.incrementAndGet() + "," + rdd.hashCode() + ", " + System.currentTimeMillis() + ", logFlag")
    })
    ssc.start()
    ssc.awaitTermination()
  }
}
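
To drive this example, a plain TCP text source such as netcat can be started on the host (for instance, nc -lk 9999 on 192.168.26.140); every line typed there becomes a record in the next batch's RDD. Even with nothing typed, the log line above still appears once every five seconds, typically with a different rdd.hashCode each time, which is the observation behind the one-RDD-per-batchInterval conclusion.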

The figure below was taken from the Spark official documentation (not reproduced here); one sentence accompanying it reads:

 Each RDD in a DStream contains data from a certain interval

The "certain interval" here refers to the period during which the data contained in the generated RDD was received, i.e. the five seconds in the example above.

This raises a question: if the batch interval is long, or data arrives very quickly, the amount of data collected within one interval will inevitably be very large. Will that put heavy pressure on memory and GC?

Memory Tuning

Tuning the memory usage and GC behavior of Spark applications has been discussed in great detail in the Tuning Guide. It is strongly recommended that you read that. In this section, we discuss a few tuning parameters specifically in the context of Spark Streaming applications.

The amount of cluster memory required by a Spark Streaming application depends heavily on the type of transformations used. For example, if you want to use a window operation on the last 10 minutes of data, then your cluster should have sufficient memory to hold 10 minutes' worth of data in memory. Or if you want to use updateStateByKey with a large number of keys, then the necessary memory will be high. On the contrary, if you want to do a simple map-filter-store operation, then the necessary memory will be low.

In general, since the data received through receivers is stored with StorageLevel.MEMORY_AND_DISK_SER_2, the data that does not fit in memory will spill over to the disk. This may reduce the performance of the streaming application, and hence it is advised to provide sufficient memory as required by your streaming application. It's best to try and see the memory usage on a small scale and estimate accordingly.

As noted above, received data is by default kept in memory as far as possible, but because the persistence level of received data is memory plus disk (MEMORY_AND_DISK_SER_2), data spills to disk when memory is insufficient. That spill causes a serious performance degradation, so a Spark Streaming application should be given enough memory, especially when using window operations or updateStateByKey with a large number of stateful keys/values.
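
For instance (a sketch, assuming the default replicated memory-plus-disk level is more than the job needs), the storage level of a receiver input stream can be chosen explicitly when the stream is created:

import org.apache.spark.storage.StorageLevel

// Sketch: create the socket stream with an explicit storage level instead of
// the default MEMORY_AND_DISK_SER_2 (serialized, replicated twice, spills to disk).
val lines = ssc.socketTextStream("192.168.26.140", 9999, StorageLevel.MEMORY_ONLY_SER)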

Another aspect of memory tuning is garbage collection. For a streaming application that requires low latency, it is undesirable to have large pauses caused by JVM Garbage Collection.

There are a few parameters that can help you tune the memory usage and GC overheads.

  • Persistence Level of DStreams: As mentioned earlier in the Data Serialization section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage can be achieved with compression (see the Spark configuration spark.rdd.compress), at the cost of CPU time.

  • Clearing old data: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data. Data can be retained for a longer duration (e.g. interactively querying older data) by setting streamingContext.remember.

  • CMS Garbage Collector: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the overall processing throughput of the system, its use is still recommended to achieve more consistent batch processing times. Make sure you set the CMS GC on both the driver (using --driver-java-options in spark-submit) and the executors (using Spark configuration spark.executor.extraJavaOptions).

  • Other tips: To further reduce GC overheads, here are some more tips to try.

    • Use Tachyon for off-heap storage of persisted RDDs. See more detail in the Spark Programming Guide.
    • Use more executors with smaller heap sizes. This will reduce the GC pressure within each JVM heap.

Unfortunately, the documentation quoted above does not explicitly say whether all of the data from one interval goes into a single RDD (it should), only that the data is kept in memory when possible (the storage level is StorageLevel.MEMORY_AND_DISK_SER_2, so once memory runs out the data spills to disk, at a considerable cost in performance).
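
The knobs mentioned in the quoted tips (serialized persistence with Kryo, spark.rdd.compress, and CMS GC on the executors) are ordinary Spark configuration. A sketch of setting them on the SparkConf from the example above follows; the concrete values are illustrative assumptions, not measured recommendations:

import org.apache.spark.SparkConf

// Sketch: configuration matching the memory/GC tips quoted above.
// The values here are illustrative, not tuned recommendations.
val conf = new SparkConf()
  .setAppName("NetCatWordCount")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // smaller serialized data
  .set("spark.rdd.compress", "true")                                     // trade CPU time for memory
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")     // CMS GC on executors
// The driver-side GC flag is passed at submit time, e.g. via --driver-java-options.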

Reposted from bit1129.iteye.com/blog/2198463