Spark Streaming operating principles

A Spark Streaming application is still a Spark application: the DStreams that Spark Streaming generates are ultimately converted into RDDs, and the computation is then performed on those RDDs, so what Spark Streaming finally computes is RDDs. The general principles of Spark applications therefore also apply to Spark Streaming. As a real-time computing technology, Spark Streaming differs from others (such as Storm): it can be understood as real-time computation in micro-batch mode. In essence, Spark Streaming is batch processing in which the interval between batches is very small; the minimum interval is 500ms, which is suitable for roughly 80% of enterprise real-time computing scenarios.
In terms of the real-time computation steps, Spark Streaming comprises three basic processes: receiving data in real time, transforming the data, and outputting the result data. Spark Streaming receives data in one of two modes: Receiver-based mode and Direct mode (Kafka Direct). The following explains Receiver-based Spark Streaming applications in detail.
When we use spark-submit to submit a Spark Streaming application, the cluster allocates the required resources and initializes the Executors. The execution of a Spark Streaming application then consists of two parts: one is the initialization of the StreamingContext, and the other is the Receiver receiving data in real time and the application computing on it in real time. These are introduced in turn below.
StreamingContext initialization:
When the StreamingContext is initialized, it initializes two modules, DStreamGraph and JobScheduler. DStreamGraph contains two DStreams, the InputDStream and the OutputDStream: the InputDStream carries the Receiver information, the OutputDStream carries the output information for the final result, and between the two DStreams sits a series of business Transformations. JobScheduler contains JobGenerator and ReceiverTracker. JobGenerator has a timer that fires periodically and generates the batch tasks; ReceiverTracker tracks the data received by the Receiver. When ReceiverTracker initializes, it obtains the InputDStream from DStreamGraph and then starts the Receiver on an Executor. At this point the initialization of the StreamingContext is complete.
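A minimal Scala sketch of this setup (the socket source on localhost:9999 and the word-count transformations are illustrative assumptions, not part of the original text):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingInitDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingInitDemo")
    // The batch interval passed here is what JobGenerator's timer fires on.
    val ssc = new StreamingContext(conf, Seconds(1))

    // socketTextStream creates a Receiver-based InputDStream; the Receiver
    // is the one ReceiverTracker later starts on an Executor.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Business Transformations between InputDStream and OutputDStream.
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)

    // print() is an output operation; it registers the OutputDStream in DStreamGraph.
    counts.print()

    ssc.start()            // JobScheduler starts; ReceiverTracker launches the Receiver
    ssc.awaitTermination()
  }
}
```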
Receiver real-time data reception and computation:
The Receiver stores the real-time data it receives in the Executor's memory, managed by the BlockManager. After a block is stored, the Receiver tells ReceiverTracker where the data block is stored, so that ReceiverTracker can track and locate it. When the batch interval we set elapses, JobGenerator's timer fires and generates a batch task: it asks ReceiverTracker for the locations of all the data blocks collected during that batch interval, assembles those blocks into a BlockRDD (the first RDD to be executed in the chain), applies the series of business Transformations between the InputDStream and the OutputDStream to produce the RDD chain, and finally generates the RDD DAG and submits the RDD computing tasks. At this point we have arrived at the principles of Spark RDD task submission, for which you can refer to the Spark Core content.
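To see that each batch really is one RDD job, here is a hedged sketch (again assuming a socket source; foreachRDD is a standard output operation that exposes the per-batch RDD):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PerBatchRdd {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("PerBatchRdd"), Seconds(1))

    // Receiver-based source: the blocks stored during each batch interval
    // become one BlockRDD at the head of the chain.
    val lines = ssc.socketTextStream("localhost", 9999)

    // foreachRDD exposes the RDD that JobGenerator builds for each batch.
    lines.foreachRDD { (rdd, time) =>
      println(s"Batch $time: RDD with ${rdd.getNumPartitions} partitions")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```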


The number of partitions of the RDD composed of the data received in each batch interval of a Spark Streaming application:
Number of BlockRDD partitions (i.e., its parallelism) = batch interval / block interval
where the batch interval is the batch time interval we specify when initializing the StreamingContext,
and the block interval is the time interval at which received data is grouped into data blocks; it can be configured via spark.streaming.blockInterval, defaults to 200ms, and may be set as low as 50ms.
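A quick illustrative sketch of the relationship (the 1-second batch interval is an assumed example value):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// With a 1s batch interval and the default 200ms block interval,
// each BlockRDD has 1000ms / 200ms = 5 partitions.
val conf = new SparkConf()
  .setAppName("BlockIntervalDemo")
  .set("spark.streaming.blockInterval", "200ms") // default; 50ms is the minimum
val ssc = new StreamingContext(conf, Seconds(1))
```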

