[Spark Streaming] (1) Structure and Working Principle

I. Introduction

Spark Streaming is a high-throughput, fault-tolerant system for processing real-time data streams. It can ingest data from a variety of sources (e.g., Kafka, Flume, Twitter, ZeroMQ and TCP sockets), apply complex operations such as map, reduce and join, and save the results to external file systems or databases, or push them to real-time dashboards.

It is a framework that extends the Spark core API and provides a high-throughput, fault-tolerant mechanism for processing real-time streaming data.

It supports obtaining data from multiple data sources:

Spark Streaming receives real-time input data from various sources such as Kafka, Flume and HDFS; after processing, the results are stored in HDFS, databases, and other destinations.

Dashboards: graphical monitoring interfaces; Spark Streaming output can be displayed on front-end monitoring pages.

II. Stream Processing Architecture

[Figure: stream processing architecture]

III. Micro-Batch Architecture

Spark processes data in bulk (offline data). Spark Streaming does not actually process the data record by record the way Storm does; instead, the incoming external data stream is cut into segments by time, and each segment is processed as a batch with the same logic as ordinary Spark.

Spark Streaming receives real-time streaming data, splits it at a fixed time interval, and hands each batch to the Spark engine, which finally produces the batch results.


DStream: Spark Streaming provides a high-level abstraction that represents a continuous data stream, called a discretized stream, or DStream.

If incoming external data is cut into one-minute segments, the data inside each one-minute segment is continuous (a continuous data stream), while the one-minute segments themselves are independent of each other (a discretized stream).

  • DStream is the specific data type of Spark Streaming

  • Starting with Spark 2.3.1, latency can be as low as 1 millisecond (it was about 100 milliseconds before)

  • Each micro-batch is an RDD, so code can be shared between batch and streaming (see the sketch below)
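
A minimal sketch of sharing code between a batch job and a streaming job, assuming the helper name, sample data, and port below (they are illustrative, not from the original article):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Word-count logic written against plain RDDs, so it can be reused in both modes.
def word_counts(rdd):
    return (rdd.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

sc = SparkContext("local[2]", "SharedLogicExample")
ssc = StreamingContext(sc, 2)  # 2-second micro-batches

# Batch: apply the logic to a static RDD.
batch_counts = word_counts(sc.parallelize(["hello spark", "hello streaming"]))
print(batch_counts.collect())

# Streaming: apply the same logic to every micro-batch RDD via transform().
lines = ssc.socketTextStream("localhost", 9999)
stream_counts = lines.transform(word_counts)
stream_counts.pprint()

ssc.start()
ssc.awaitTermination()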

IV. Working Principle

4.1 StreamingContext

A StreamingContext is the object that consumes a stream of data in Spark.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a local StreamingContext with two working threads and a batch interval of 2 seconds.
# local[2]: run locally with 2 threads (one for the receiver, one for processing).

ssc = StreamingContext(SparkContext("local[2]", "NetWordCount"), 2)

# ... define the DStream sources, transformations and output operations here ...
ssc.start()


  • Only one StreamingContext can be active in a JVM at the same time.
  • A StreamingContext cannot be restarted after it is stopped, but a new one can be created (see the sketch below).
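
For example (a minimal sketch, reusing the ssc created above):

sc = ssc.sparkContext              # keep a handle to the underlying SparkContext
ssc.stop(stopSparkContext=False)   # stop streaming only; the SparkContext stays alive
ssc = StreamingContext(sc, 2)      # the old context cannot be restarted, but a new one can be created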

4.2 DStream

A DStream is composed of a series of successive RDDs in a discretized stream; each RDD contains the data of a determined time interval:

An RDD in Spark can be understood as the spatial dimension; the RDDs inside a DStream can be understood as adding a time dimension on top of that spatial dimension.

For example, if the data stream is cut into four slices, the internal processing logic of each slice is the same; only the time interval differs.

# Create a local StreamingContext with two working threads and a batch interval of 2 seconds
ssc = StreamingContext(SparkContext("local[2]", "NetWordCount"), 2)

lines = ssc.socketTextStream("localhost", 3333)      # Create a DStream from a TCP source
words = lines.flatMap(lambda line: line.split(" "))  # Split each line into words
pairs = words.map(lambda word: (word, 1))            # Map each word to a (word, 1) pair
wordCounts = pairs.reduceByKey(lambda x, y: x + y)   # Count each word in each batch
wordCounts.pprint()                                  # Print the counts (at least one output operation is required)

ssc.start() # Start the computation
ssc.awaitTermination() # Wait for the computation to terminate

The difference between Spark and Spark Streaming:

Spark -> RDD: transformations + actions + RDD DAG

Spark Streaming -> DStream: transformations + output operations (data may not simply stop at an intermediate step; every input must be guaranteed an output) + DStreamGraph

Any operation on a DStream is converted, underneath, into operations on the underlying RDDs (through the operators):

Summary: the continuous data is persisted, discretized, and then processed in batches.

  • Persistence: the received data is stored temporarily.

Why persist: for fault tolerance. If something goes wrong with the data before it has been computed, the data would otherwise have to be traced back to the source; with a temporary copy stored, it can be recovered.

  • Discretization: slice by time to form the unit of processing.

  • Batch processing: process each time slice as a batch.

4.3 Input DStreams & Receivers

  • Input DStreams represent the stream of input data received from streaming sources.
  • Every Input DStream (except for the file stream) is associated with a Receiver object, which receives the data from the source and stores it in Spark's memory for processing.
  • You can create multiple Input DStreams in the same StreamingContext (a sketch follows).

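A minimal sketch of creating multiple Input DStreams in one StreamingContext (the ports below are assumptions). Each socket receiver occupies a core/thread, so the master must provide enough of them, e.g. local[3] for two receivers plus processing:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

ssc = StreamingContext(SparkContext("local[3]", "MultiReceiver"), 2)

stream1 = ssc.socketTextStream("localhost", 9998)  # first receiver
stream2 = ssc.socketTextStream("localhost", 9999)  # second receiver
merged = stream1.union(stream2)                    # process both sources together
merged.pprint()

ssc.start()
ssc.awaitTermination()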

V. DStream Operations

1. Transformation operations:

  1.1 Regular transformations: map, flatMap, filter, union, count, join, etc.

  1.2 transform(func): allows an arbitrary RDD-to-RDD function to be applied on a DStream.

  1.3 updateStateByKey: maintains and updates arbitrary per-key state across batches (see the sketch after this list).

  1.4 Window transformations: transform a sliding window of data, e.g. countByWindow, reduceByKeyAndWindow (parameterized by window length and sliding interval, both multiples of the batch interval).

2. Output operations: allow DStream data to be pushed out to external systems such as a database or file system, e.g. print() (pprint() in Python), foreachRDD(func), saveAsTextFiles(), saveAsHadoopFiles(), etc.

3. Persistence: the persist() method keeps the data stream in memory, enabling efficient iterative computation.
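
A minimal sketch of the operation types above, assuming the ssc and the (word, 1) DStream named pairs from section 4.2, plus a hypothetical checkpoint directory (checkpointing is required by updateStateByKey and by the inverse-reduce window form):

ssc.checkpoint("/tmp/spark-checkpoint")  # hypothetical checkpoint path

# 1.3 updateStateByKey: keep a running count per word across all batches.
def update_count(new_values, running_count):
    return sum(new_values) + (running_count or 0)

running_counts = pairs.updateStateByKey(update_count)

# 1.4 Window transformation: counts over the last 30 seconds, recomputed every 10 seconds.
windowed_counts = pairs.reduceByKeyAndWindow(
    lambda a, b: a + b,   # reduce values entering the window
    lambda a, b: a - b,   # "inverse reduce" values leaving the window
    windowDuration=30,
    slideDuration=10)

# 2. Output operations: print each batch, and push each batch to an external system.
running_counts.pprint()

def save_partition(records):
    for record in records:
        pass  # e.g. write each record through a per-partition database connection

windowed_counts.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))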

VI. Spark Streaming Architecture

Master: records the dependency (lineage) relationships between DStreams and is responsible for scheduling the tasks that generate new RDDs.

Worker: receives data from the network, stores it, and executes the RDD computations.

Client: responsible for feeding data into Spark Streaming.

Scheduling: triggered by time.

Master: maintains the DStream Graph. (This is at the task level, not the node level.)

Worker: executes according to that graph.

The Worker hosts an important role, the receiver: it accepts the external data stream, which is then passed along inside Spark Streaming (the receiver packages the data stream into a format that Spark Streaming can ultimately process).

Receiver: different receivers perform targeted acquisition from different data sources. Spark Streaming runs the receivers as processes distributed across different nodes; each receiver is specific to its source, and each node receives a part of the input. A receiver does not compute on the data immediately after accepting it; it first stores the data in an internal buffer. Because Streaming keeps slicing by time, it has to wait: once the timer expires, the buffer converts its data into a data block (the buffer's job is to cut at a user-defined interval), the block is put into a queue, and the block manager then takes blocks from the queue and turns them into blocks that Spark can process.

Why is it a process?

Because a container maps to an Executor, which is a process.
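
The "user-defined interval" at which the receiver's buffer is cut into blocks corresponds to the spark.streaming.blockInterval setting (its default is 200ms). A minimal sketch of setting it explicitly:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# Cut the receiver buffer into blocks every 200ms (which is also the default value).
conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("BlockIntervalExample")
        .set("spark.streaming.blockInterval", "200ms"))
ssc = StreamingContext(SparkContext(conf=conf), 2)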

Spark Streaming job submission:

• Network Input Tracker: tracks the data received by each network receiver and maps it to the corresponding input DStream

• Job Scheduler: periodically visits the DStream Graph and generates Spark jobs, handing them to the Job Manager to execute

• Job Manager: takes jobs from the job queue and runs the Spark tasks

Spark Streaming window operations:

• Spark provides a set of window operations for statistical analysis of large-scale, incrementally updated data using a sliding-window technique

• Window operation: periodically process the data of a certain recent period of time


Any window-based operation needs to specify two parameters:

  • Window length: how long a period of data to compute over

  • Slide interval: how often to update the result (see the example below)
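
For example, a minimal sketch that assumes an existing DStream named lines and an enabled checkpoint directory: count everything seen in the last 60 seconds, updated every 20 seconds (window length 60s, slide interval 20s; both must be multiples of the batch interval):

counts_last_minute = lines.countByWindow(windowDuration=60, slideDuration=20)
counts_last_minute.pprint()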

VII. What Spark Streaming Can Do

For now, Spark Streaming mainly supports the following three business scenarios:

1. Stateless operations: only the real-time data in the current batch matters, for example:

  • Headline and category classification: HTTP request end -> Kafka -> Spark Streaming -> HTTP request end map -> response result
  • Collection of the site's Nginx access logs: flume -> Kafka -> Spark Streaming -> Hive/HDFS
  • Data synchronization of the master-station data: "Master" -> Kafka -> Spark Streaming -> Hive/HDFS

2. Stateful operations: when a DStream performs a stateful operation, in addition to the small batch of newly generated data in the current interval, it also needs all of the historical data generated before; the newly generated data is merged with the historical data into a full table of data, for example:

  • Real-time statistics of the total number of visits to each site
  • Real-time statistics of the total views, order volume, and turnover of each product

3. Window operations: periodically operate on the DStream data within a specified range of time, for example:

  • Detecting malicious access and crawlers: every 10 minutes, count the users with the most visits within the last 30 minutes

Origin blog.csdn.net/BeiisBei/article/details/105053482