Spark Streaming: Applications and Internals

1. Application

(Diagram of the end-to-end application pipeline; image not reproduced.)

The diagram shows that the whole pipeline is a loop: the data source keeps sending data to Spark, Spark keeps processing it, the results keep being written to storage, and the page keeps pulling them back out for display.

2. Principles

(Two diagrams illustrating the execution internals; images not reproduced.)

3. Example

This walkthrough uses socketTextStream.

Creating a stream is simple:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds

object reduceByKeyAndWindowDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformaDemo")
    val ssc = new StreamingContext(conf, Seconds(2))  // 2s batch interval
    val fileDS = ssc.socketTextStream("192.168.32.110", 9999)
    val wordcountDS = fileDS.flatMap { line => line.split("\t") }
      .map { word => (word, 1) }
      .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(6), Seconds(4))  // the reduce function could also be written (_ + _)

    wordcountDS.print()
    ssc.start()
    ssc.awaitTermination()
    /*
     * Each line below is one 2s batch of input ("hadoop\thadoop").
     * With a 6s window sliding every 4s, each output covers the
     * last 3 batches, i.e. 6 words:
     *
     * hadoop   hadoop   <- already outside the window, not counted
     * hadoop   hadoop
     * hadoop   hadoop   -> (hadoop,6)
     * hadoop   hadoop   -> (hadoop,6)
     * hadoop   hadoop
     * hadoop   hadoop
     */
  }
}
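To drive the example, run nc -lk 9999 on 192.168.32.110 and type tab-separated words. One variation worth knowing (a sketch I am adding, not part of the original post): reduceByKeyAndWindow also has an incremental overload that takes an inverse reduce function, so each slide only folds in the new batches and subtracts the expired ones instead of recomputing the whole window. It requires checkpointing; the checkpoint path below is a placeholder.

ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")  // hypothetical checkpoint path
val incrementalDS = fileDS
  .flatMap(_.split("\t"))
  .map((_, 1))
  .reduceByKeyAndWindow(
    (x: Int, y: Int) => x + y,  // fold in counts from batches entering the window
    (x: Int, y: Int) => x - y,  // subtract counts from batches leaving the window
    Seconds(6),
    Seconds(4))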

That gives us a running stream, but where is the data actually received? For background, see:
Spark Streaming custom receivers

Spark Streaming ships with several built-in receivers. (They are worth studying, but receiver-based sources can lose or duplicate data on failure; use them with caution, or prefer a source such as Kafka.)
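For completeness, a minimal sketch of the Kafka direct approach, assuming the spark-streaming-kafka-0-10 artifact is on the classpath; the broker address, group id, and topic name are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Placeholder broker/topic/group values; adjust for your cluster.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "192.168.32.110:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "wordcount-demo")

// No receiver involved: executors read the Kafka partitions directly,
// which avoids the consistency issues mentioned above.
val kafkaDS = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("words"), kafkaParams))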

3.1 A first look at Receiver

Open the abstract class org.apache.spark.streaming.receiver.Receiver. It defines the core of data reception:
1. the startup hook, onStart
2. the shutdown hook, onStop
3. the method that hands received data to Spark, store
4. the restart method, restart
5. and, importantly, attachSupervisor, which attaches the ReceiverSupervisor that manages this receiver (a minimal implementation sketch follows this list)
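The contract is easiest to see in a minimal custom receiver. The sketch below is not from the original post: the class name LineReceiver and its line-per-record behavior are assumptions, but onStart/onStop/store/restart are the real Receiver API.

import java.net.Socket
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// A minimal custom receiver sketch: same idea as SocketReceiver below,
// reduced to the essentials of the Receiver contract.
class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    // onStart must return quickly, so receive on a separate thread.
    new Thread("Line Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  def onStop(): Unit = { /* resources are closed in receive() */ }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      for (line <- Source.fromInputStream(socket.getInputStream, "UTF-8").getLines()) {
        store(line)  // hand each record to the supervisor
      }
      restart("Connection closed, reconnecting")
    } catch {
      case e: Exception => restart("Error receiving data", e)
    } finally {
      if (socket != null) socket.close()
    }
  }
}

Such a receiver would be used via ssc.receiverStream(new LineReceiver(host, port)).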

3.2 Class diagram

(Class diagram; image not reproduced.)

Explanation:

BlockGenerator:
buffers incoming records in an ArrayBuffer and, on a configured interval (spark.streaming.blockInterval, 200 ms by default), turns the buffer into a block


Receiver:
the abstract base class for receivers that run on worker nodes to take in external data; responsible for receiving and storing it

ReceiverSupervisor:
the receiver's supervisor, responsible for starting, stopping, and restarting the receiver and for pushing its data onward; effectively the coordinator between the Receiver and the BlockGenerator

3.3 Sequence diagram

In my view this diagram still has some gaps.

(Sequence diagram; image not reproduced.)

3.4 Tracing the code

  /**
   * Creates an input stream from a TCP source hostname:port. Data is received
   * using a TCP socket and the received bytes are interpreted as UTF-8 encoded,
   * '\n'-delimited lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

This passes SocketReceiver.bytesToLines, which converts the data coming from an InputStream (e.g. from a socket) into '\n'-delimited strings and returns an iterator over them.
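bytesToLines behaves essentially like the following simplified equivalent (the real Spark source uses an internal NextIterator; this sketch is mine):

import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

// Simplified stand-in for SocketReceiver.bytesToLines: wrap the stream in a
// UTF-8 reader and expose its '\n'-delimited lines as an iterator.
def bytesToLines(inputStream: InputStream): Iterator[String] = {
  val reader = new BufferedReader(
    new InputStreamReader(inputStream, StandardCharsets.UTF_8))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}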

We can then see it calls the socketStream method:

def socketStream[T: ClassTag](
      hostname: String,
      port: Int,
      converter: (InputStream) => Iterator[T],
      storageLevel: StorageLevel
    ): ReceiverInputDStream[T] = {
    new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
  }

socketStream news up a SocketInputDStream. SocketInputDStream extends ReceiverInputDStream, which extends InputDStream (the abstract base class of all input streams), which in turn extends DStream.
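Stripped to its skeleton, the chain looks like this (type parameters kept, all members elided):

abstract class DStream[T]                                       // basic streaming abstraction
abstract class InputDStream[T] extends DStream[T]               // base of all input streams
abstract class ReceiverInputDStream[T] extends InputDStream[T]  // input streams backed by a Receiver
class SocketInputDStream[T] extends ReceiverInputDStream[T]     // the concrete socket stream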

DStream

A discretized stream (DStream), the basic abstraction in Spark Streaming: a continuous sequence of RDDs of the same type, representing a continuous stream of data.

InputDStream

The abstract base class of all input streams. It is still abstract and does not implement the start() and stop() methods.

ReceiverInputDStream

ReceiverInputDStream is still an abstract class. It has four notable members:

  1. getReceiver(), which subclasses implement to supply the receiver
  2. start(), an empty implementation
  3. stop(), an empty implementation
  4. compute, which turns the data blocks received by the stream's receiver into an RDD for each batch (sketched below)
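A simplified sketch of that compute method (error handling and write-ahead-log support omitted): it asks the ReceiverTracker for the blocks this receiver stored during the batch and wraps their IDs in one BlockRDD.

// Inside ReceiverInputDStream (simplified):
override def compute(validTime: Time): Option[RDD[T]] = {
  val receiverTracker = ssc.scheduler.receiverTracker
  val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
  val blockIds = blockInfos.map(_.blockId.asInstanceOf[BlockId]).toArray
  Some(new BlockRDD[T](ssc.sc, blockIds))  // one RDD per batch, one partition per block
}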

SocketInputDStream

SocketInputDStream is a concrete class; it finally implements getReceiver, creating a SocketReceiver.

SocketReceiver

SocketReceiver extends Receiver. Its onStart method creates the Socket and starts a thread whose run method calls receive(), which reads the data and stores it:

private[streaming]
class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}
private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  private var socket: Socket = _

  def onStart() {

    logInfo(s"Connecting to $host:$port")
    try {
      socket = new Socket(host, port)
    } catch {
      case e: ConnectException =>
        restart(s"Error connecting to $host:$port", e)
        return
    }
    logInfo(s"Connected to $host:$port")

    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // in case restart thread close it twice
    synchronized {
      if (socket != null) {
        socket.close()
        socket = null
        logInfo(s"Closed socket to $host:$port")
      }
    }
  }

  /** Create a socket connection and receive data until receiver is stopped. */
  def receive() {
    try {
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        // this is where the received data is actually stored
        store(iterator.next())
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      onStop()
    }
  }
}

Now look at how the data is stored, store(iterator.next()):

 /**
   * Store a single item of received data to Spark's memory.
   * These single items will be aggregated together into data blocks before
   * being pushed into Spark's memory.
   */
  def store(dataItem: T) {
    supervisor.pushSingle(dataItem)
  }
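store(dataItem: T) is the per-record path. Receiver also offers block-at-a-time overloads that skip per-record aggregation; their signatures (from the same Receiver class) are:

def store(dataBuffer: ArrayBuffer[T]): Unit  // store an ArrayBuffer as one block
def store(dataIterator: Iterator[T]): Unit   // store an iterator's contents as one block
def store(bytes: ByteBuffer): Unit           // store pre-serialized bytes as one block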

Jumping to the definition of supervisor:


  /** Get the attached supervisor. */
  private[streaming] def supervisor: ReceiverSupervisor = {
    assert(_supervisor != null,
      "A ReceiverSupervisor has not been attached to the receiver yet. Maybe you are starting " +
        "some computation in the receiver before the Receiver.onStart() has been called.")
    _supervisor
  }

supervisor.pushSingle(dataItem) leads to the abstract class ReceiverSupervisor:

 /** Push a single data item to backend data store. */
  def pushSingle(data: Any): Unit

So we look at its implementation class, ReceiverSupervisorImpl:

 /** Push a single record of received data into block generator. */
  def pushSingle(data: Any) {
    defaultBlockGenerator.addData(data)
  }

Here defaultBlockGenerator is:

private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)

and the defaultBlockGeneratorListener passed to createBlockGenerator is:

/** Divides received data records into data blocks for pushing in BlockManager. */
  private val defaultBlockGeneratorListener = new BlockGeneratorListener {
    def onAddData(data: Any, metadata: Any): Unit = { }

    def onGenerateBlock(blockId: StreamBlockId): Unit = { }

    def onError(message: String, throwable: Throwable) {
      reportError(message, throwable)
    }

    def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
      pushArrayBuffer(arrayBuffer, None, Some(blockId))
    }
  }

Finally, defaultBlockGenerator.addData(data) appends the record to an ArrayBuffer:

/**
   * Push a single data item into the buffer.
   */
  def addData(data: Any): Unit = {
    if (state == Active) {
      waitToPush()
      synchronized {
        if (state == Active) {
          currentBuffer += data     // append to the ArrayBuffer
        } else {
          throw new SparkException(
            "Cannot add data as BlockGenerator has not been started or has been stopped")
        }
      }
    } else {
      throw new SparkException(
        "Cannot add data as BlockGenerator has not been started or has been stopped")
    }
  }
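currentBuffer does not grow without bound. A RecurringTimer inside BlockGenerator fires every spark.streaming.blockInterval (200 ms by default), swaps the buffer out as a block, and queues it; a push thread then delivers it to the listener's onPushBlock, which is where pushArrayBuffer above finally hands the block to the BlockManager. A simplified sketch of that timer callback:

// Inside BlockGenerator (simplified): called by the RecurringTimer once per
// block interval.
private def updateCurrentBuffer(time: Long): Unit = synchronized {
  if (currentBuffer.nonEmpty) {
    val newBlockBuffer = currentBuffer
    currentBuffer = new ArrayBuffer[Any]          // start a fresh buffer
    val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
    listener.onGenerateBlock(blockId)
    blocksForPushing.put(new Block(blockId, newBlockBuffer))  // picked up by the push thread
  }
}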


Reposted from blog.csdn.net/qq_21383435/article/details/80530349