Before discussing Spark Streaming, we must first understand what batch processing and stream processing are.
- Batch processing: also called offline processing; typically runs in T+1 mode (today's data is processed the next day), mainly with Hive + Spark
- Stream processing: also called real-time processing, where data arrives continuously like flowing water; Spark Streaming (second-level latency, micro-batch) + Structured Streaming (millisecond-level) + Flink (millisecond-level)
- Stream computing: continuous processing, aggregation, and analysis of unbounded data.
Introduction to Spark Streaming:
- Similar to Apache Storm, it is used for streaming data processing, featuring high throughput and strong fault tolerance.
- Supports multiple data sources, such as Kafka, Flume, Twitter, and ZeroMQ; results can also be written to many sinks, such as HDFS and MySQL.
- Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX (graph processing).
- In the current version of Spark Streaming, the minimum Batch Size is typically chosen between 0.5 and 5 seconds, which is sufficient for quasi-real-time streaming computation; it is not suitable for scenarios with very strict real-time requirements, such as high-frequency real-time trading.
- Spark Streaming decomposes a streaming computation into a series of Spark Jobs; each batch of data goes through Spark's DAG decomposition and task-set scheduling.
- An important framework in the Spark ecosystem, built on top of Spark Core.
- Divides the streaming data into many parts according to a time interval (BatchInterval); each part is a Batch, and each Batch of data is treated as an RDD for rapid analysis and processing.
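The batch-division idea above can be sketched in plain Python. This is a hypothetical simulation, not the Spark API: timestamped records are grouped into batches of length `batch_interval`, mimicking how Spark Streaming turns a continuous stream into a sequence of per-interval RDDs.

```python
# Hypothetical sketch of micro-batching: group timestamped records
# into batches of length batch_interval, mimicking how Spark Streaming
# slices a stream into one batch (RDD) per interval.

def split_into_batches(records, batch_interval):
    """records: list of (timestamp, value) pairs, sorted by timestamp.
    Returns a list of batches, each a list of values."""
    if not records:
        return []
    start = records[0][0]          # first interval begins at the first event
    batches, current = [], []
    for ts, value in records:
        # Close out batches each time we cross an interval boundary.
        while ts >= start + batch_interval:
            batches.append(current)
            current = []
            start += batch_interval
        current.append(value)
    batches.append(current)        # final, possibly partial batch
    return batches

# Events arriving between 0.1s and 2.4s, batched with a 1-second interval.
events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (1.9, "d"), (2.4, "e")]
print(split_into_batches(events, 1.0))  # [['a', 'b'], ['c', 'd'], ['e']]
```

In real Spark Streaming each of these batches would become an RDD and be scheduled as a Spark Job; the sketch only shows the time-based slicing step.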
Stream computing modes
Stream processing is a very important branch of big data processing. Different stream processing frameworks use different computation models, which fall into two main types:
- Native stream processing (Native): Flink and Storm, for example, process each input record one at a time, as soon as it arrives
- Micro-batch processing (Micro-Batch): Spark Streaming and Structured Streaming, for example, divide the input data into many micro-batches at a fixed time interval T, then process each batch as a unit
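The difference between the two models can be illustrated with a small Python sketch (a simulation for explanation only, not any framework's real API): the native model applies the handler to each record the moment it arrives, while the micro-batch model buffers records into batches (standing in for the time interval T) and processes each batch together.

```python
# Illustrative sketch (not a real framework API): contrast
# record-at-a-time (native) processing with micro-batch processing.

def native_process(stream, handler):
    """Native model (Flink/Storm style): handle every record immediately,
    one output per input record."""
    return [handler(record) for record in stream]

def micro_batch_process(stream, handler, batch_size):
    """Micro-batch model (Spark Streaming style): buffer records into
    fixed-size batches (a stand-in for a time interval T), then apply
    the handler once to each whole batch."""
    results = []
    for i in range(0, len(stream), batch_size):
        batch = stream[i:i + batch_size]
        results.append(handler(batch))
    return results

stream = [1, 2, 3, 4, 5]
print(native_process(stream, lambda x: x * 10))        # [10, 20, 30, 40, 50]
print(micro_batch_process(stream, sum, batch_size=2))  # [3, 7, 5]
```

The trade-off mirrors the one described above: the native model gives per-record (millisecond-level) latency, while the micro-batch model trades latency for throughput, since each batch is scheduled and executed as one job.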