1. Application
From this diagram we can see that the whole pipeline is a loop: the data source keeps sending data to Spark, Spark keeps processing it, the results keep being stored, and finally the page keeps fetching the data and displaying it.
2. Principle
3. Example
Here we use socketTextStream to walk through the process.
- Creating a stream is simple:
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
object reduceByKeyAndWindowDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("TransformaDemo")
    val ssc = new StreamingContext(conf, Seconds(2))
    val fileDS = ssc.socketTextStream("192.168.32.110", 9999)
    val wordcountDS = fileDS.flatMap { line => line.split("\t") }
      .map { word => (word, 1) }
      .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(6), Seconds(4)) // could also be written (_ + _)
    wordcountDS.print()
    ssc.start()
    ssc.awaitTermination()
    /*
     * Input batches (2s each) vs. windowed output (window 6s, slide 4s):
     * hadoop hadoop            <- already outside the window, not counted
     * hadoop hadoop
     * hadoop hadoop   hadoop,6
     * hadoop hadoop   hadoop,6
     * hadoop hadoop
     * hadoop hadoop
     */
  }
}
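The window arithmetic in the example above (2-second batches, a 6-second window, a 4-second slide) can be sketched in plain Scala without Spark. `WindowCountSketch` and its parameter names are hypothetical, for illustration only:

```scala
object WindowCountSketch {
  // Each element of `batches` is one batch of words (one 2-second batch in the example).
  // Mirrors reduceByKeyAndWindow(_ + _, Seconds(6), Seconds(4)) with a 2s batch interval,
  // i.e. windowBatches = 3 and slideBatches = 2.
  def windowedCounts(batches: Seq[Seq[String]],
                     windowBatches: Int,
                     slideBatches: Int): Seq[Map[String, Int]] = {
    // Every `slideBatches` batches, count the words in the last `windowBatches` batches.
    (slideBatches to batches.length by slideBatches).map { end =>
      val window = batches.slice(math.max(0, end - windowBatches), end)
      window.flatten.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    }
  }
}
```

With four batches of "hadoop hadoop", the first trigger only has two batches inside the window (count 4), while later triggers see three full batches (count 6), matching the `hadoop,6` lines in the comment above.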
This creates a stream-processing job, but where does the data actually get received? For that, see:
Spark Streaming custom receivers
Spark Streaming already ships with several built-in receivers. (They are worth studying, but note that receivers can cause data inconsistency, so use them with caution or not at all; prefer sources such as Kafka.)
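As a sketch of that alternative, a receiver-less direct stream with the spark-streaming-kafka-0-10 integration looks roughly like this; the broker address, group id, and topic name are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaDirectSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KafkaDirectSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Placeholder connection settings
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "auto.offset.reset" -> "latest"
    )

    // Direct stream: the executors read from Kafka themselves, no Receiver is
    // involved, so there is no receiver-side buffering step of the kind traced
    // in this article.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("demo-topic"), kafkaParams))

    stream.map(record => record.value).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```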
2.1 First, a look at Receiver
Take a look at the abstract class org.apache.spark.streaming.receiver.Receiver yourself.
This abstract class mainly defines the receiving lifecycle:
1. onStart: the start method
2. onStop: the stop method
3. store: the method for storing received data
4. restart: the restart method
5. and, importantly, attachSupervisor, which attaches a ReceiverSupervisor to this receiver
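As a sketch of this lifecycle, a minimal custom receiver might look like the following. `ConstantReceiver` is a hypothetical name, and running it requires a Spark streaming context, so treat this as an illustration of the API shape rather than a standalone program:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical minimal receiver: emits the same string repeatedly.
class ConstantReceiver(value: String)
  extends Receiver[String](StorageLevel.MEMORY_ONLY) {

  def onStart(): Unit = {
    // onStart must return quickly, so receive on a separate daemon thread.
    new Thread("Constant Receiver") {
      setDaemon(true)
      override def run(): Unit = {
        while (!isStopped()) {
          store(value)        // hands each item to the attached ReceiverSupervisor
          Thread.sleep(100)
        }
      }
    }.start()
  }

  def onStop(): Unit = { }    // the receiving thread checks isStopped() and exits
}
```

It would be wired in with `ssc.receiverStream(new ConstantReceiver("tick"))`; the store call is exactly the entry point into the supervisor/BlockGenerator machinery this article traces.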
2.2 Class diagram
Explanation:
BlockGenerator:
This class mainly buffers incoming data in an ArrayBuffer and then, at the configured interval, turns the buffer into blocks.
Receiver:
An abstract class for receivers that can run on worker nodes to receive external data; it is responsible for receiving and storing data.
ReceiverSupervisor:
The supervisor of a Receiver: it schedules, starts, and stops the receiver and pushes its data onward; in effect it is the class that coordinates the BlockGenerator and the Receiver.
2.3 Sequence diagram
I think this diagram still has some gaps.
2.4 Tracing the code
/**
 * Creates an input stream from a TCP source hostname:port. Data is received
 * using a TCP socket and the received bytes are interpreted as UTF8-encoded,
 * '\n'-delimited lines.
 * @param hostname      Hostname to connect to for receiving data
 * @param port          Port to connect to for receiving data
 * @param storageLevel  Storage level to use for storing the received objects
 *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
 */
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
Note that this passes SocketReceiver.bytesToLines, which converts data from an InputStream (for example, from a socket) into '\n'-delimited strings and returns an iterator over those strings.
We can then see that it calls the socketStream method:
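The behavior of bytesToLines can be sketched in plain Scala. `BytesToLinesSketch` is a hypothetical stand-in, not the actual Spark implementation:

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

object BytesToLinesSketch {
  // Rough analogue of SocketReceiver.bytesToLines: wrap the InputStream in a
  // UTF-8 reader and expose its '\n'-delimited lines as an Iterator[String].
  def bytesToLines(in: InputStream): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }
}
```

Because the result is an iterator, lines are pulled lazily, one at a time, which is exactly how the receive() loop shown later consumes them.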
def socketStream[T: ClassTag](
    hostname: String,
    port: Int,
    converter: (InputStream) => Iterator[T],
    storageLevel: StorageLevel
  ): ReceiverInputDStream[T] = {
  new SocketInputDStream[T](this, hostname, port, converter, storageLevel)
}
The socketStream method news up a SocketInputDStream. SocketInputDStream extends ReceiverInputDStream, which extends InputDStream (the abstract base class of all input streams), which ultimately extends DStream.
DStream
A Discretized Stream (DStream), the basic abstraction in Spark Streaming: a continuous sequence of RDDs (of the same type) representing a continuous stream of data.
InputDStream
The abstract base class of all input streams. It is still abstract and does not implement the start() and stop() methods.
ReceiverInputDStream
ReceiverInputDStream is still an abstract class; its main members are:
- getReceiver()
- def start() {} (an empty implementation)
- def stop() {} (an empty implementation)
- compute, which generates RDDs from the data blocks received by the stream's receiver
SocketInputDStream
SocketInputDStream is a concrete class that finally implements the getReceiver method, creating a SocketReceiver inside it.
SocketReceiver
SocketReceiver extends Receiver. Its onStart method creates a Socket and starts a thread; that thread calls the receive() method, which receives and stores the data.
private[streaming]
class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}
private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

  private var socket: Socket = _

  def onStart() {
    logInfo(s"Connecting to $host:$port")
    try {
      socket = new Socket(host, port)
    } catch {
      case e: ConnectException =>
        restart(s"Error connecting to $host:$port", e)
        return
    }
    logInfo(s"Connected to $host:$port")
    // Start the thread that receives data over a connection
    new Thread("Socket Receiver") {
      setDaemon(true)
      override def run() { receive() }
    }.start()
  }

  def onStop() {
    // in case restart thread close it twice
    synchronized {
      if (socket != null) {
        socket.close()
        socket = null
        logInfo(s"Closed socket to $host:$port")
      }
    }
  }

  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    try {
      val iterator = bytesToObjects(socket.getInputStream())
      while (!isStopped && iterator.hasNext) {
        store(iterator.next())  // this is where the data is actually stored
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      onStop()
    }
  }
}
Now look at how the data is stored, store(iterator.next()):
/**
 * Store a single item of received data to Spark's memory.
 * These single items will be aggregated together into data blocks before
 * being pushed into Spark's memory.
 */
def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}
Clicking into supervisor here leads to:
/** Get the attached supervisor. */
private[streaming] def supervisor: ReceiverSupervisor = {
  assert(_supervisor != null,
    "A ReceiverSupervisor has not been attached to the receiver yet. Maybe you are starting " +
    "some computation in the receiver before the Receiver.onStart() has been called.")
  _supervisor
}
Stepping into supervisor.pushSingle(dataItem) leads to the abstract class ReceiverSupervisor:
/** Push a single data item to backend data store. */
def pushSingle(data: Any): Unit
So we should look at its implementation class, ReceiverSupervisorImpl:
/** Push a single record of received data into block generator. */
def pushSingle(data: Any) {
  defaultBlockGenerator.addData(data)
}
Here defaultBlockGenerator is:
private val defaultBlockGenerator = createBlockGenerator(defaultBlockGeneratorListener)
and the defaultBlockGeneratorListener passed to it is defined as follows:
/** Divides received data records into data blocks for pushing in BlockManager. */
private val defaultBlockGeneratorListener = new BlockGeneratorListener {
  def onAddData(data: Any, metadata: Any): Unit = { }

  def onGenerateBlock(blockId: StreamBlockId): Unit = { }

  def onError(message: String, throwable: Throwable) {
    reportError(message, throwable)
  }

  def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
  }
}
Finally, defaultBlockGenerator.addData(data) stores the data into the ArrayBuffer:
/**
 * Push a single data item into the buffer.
 */
def addData(data: Any): Unit = {
  if (state == Active) {
    waitToPush()
    synchronized {
      if (state == Active) {
        currentBuffer += data  // append to the ArrayBuffer
      } else {
        throw new SparkException(
          "Cannot add data as BlockGenerator has not been started or has been stopped")
      }
    } else {
      throw new SparkException(
        "Cannot add data as BlockGenerator has not been started or has been stopped")
    }
  }
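The complete path, single item into ArrayBuffer and then into a block, can be sketched without Spark as follows. `BlockGeneratorSketch` and its members are hypothetical names; the real BlockGenerator's generateBlock step is driven by a recurring timer at spark.streaming.blockInterval (200 ms by default):

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified sketch of the BlockGenerator pattern: addData appends single
// items to a current buffer; at each block interval the buffer is swapped
// out atomically and its contents become one block.
class BlockGeneratorSketch[T] {
  private var currentBuffer = new ArrayBuffer[T]
  private val blocks = new ArrayBuffer[Seq[T]]

  def addData(item: T): Unit = synchronized {
    currentBuffer += item            // same role as `currentBuffer += data` above
  }

  // In Spark this is called periodically by a timer, not by the caller.
  def generateBlock(): Unit = synchronized {
    if (currentBuffer.nonEmpty) {
      blocks += currentBuffer.toSeq  // the swapped-out buffer is one finished block
      currentBuffer = new ArrayBuffer[T]
    }
  }

  def generatedBlocks: Seq[Seq[T]] = synchronized { blocks.toSeq }
}
```

Items added between two timer ticks end up in the same block, which is why Spark stores blocks, not individual records, in the BlockManager.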