Introduction to Spark Streaming

Before discussing Spark Streaming, we must first understand what batch processing and stream processing are.

  • Batch processing: also called offline processing, the T+1 mode; mainly implemented with Hive + Spark
  • Stream processing: also called real-time processing, like a continuous flow of water; examples are Spark Streaming (second-level latency, micro-batch processing), Structured Streaming (millisecond level), and Flink (millisecond level)
  • Streaming computation: continuous processing, aggregation, and analysis of unbounded data.
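The difference between the two models above can be sketched in plain Python (no framework involved; this is a conceptual illustration, not Spark code): a batch job sees the whole bounded dataset before it starts, while a stream processor updates its result incrementally as unbounded records arrive.

```python
def batch_sum(dataset):
    """Batch model: the full, bounded dataset exists before processing starts."""
    return sum(dataset)

class StreamingSum:
    """Stream model: records arrive one by one from an unbounded source;
    the aggregate is updated incrementally and is readable at any time."""
    def __init__(self):
        self.total = 0

    def on_record(self, value):
        self.total += value
        return self.total  # the result is continuously available
```

For the input `[1, 2, 3]`, the batch job returns `6` once at the end, while the streaming aggregate emits the running values `1, 3, 6` as records arrive.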

Introduction to Spark Streaming:

  • Similar to Apache Storm, it is used for streaming data processing, and features high throughput and strong fault tolerance.

  • Supports multiple data sources, such as Kafka, Flume, Twitter, and ZeroMQ; results can likewise be written to many sinks, such as HDFS and MySQL.

  • Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX (graph processing).

  • In the current version of Spark Streaming, the minimum batch size is typically chosen between 0.5 and 5 seconds,
    which satisfies quasi-real-time streaming computation. It is not suitable for scenarios with extremely high real-time requirements, such as high-frequency trading.

  • Spark Streaming decomposes a streaming computation into multiple Spark jobs; each batch of data goes through Spark's DAG decomposition and task-set scheduling process.

  • It is an important framework in the Spark ecosystem, built on top of Spark Core.


  • It divides the streaming data into many parts according to a time interval (the batch interval); each part is a batch, and each batch of data is treated as an RDD for rapid analysis and processing.
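The batch-interval splitting described above can be sketched in plain Python (a conceptual simulation, not Spark's actual implementation): timestamped records are grouped into consecutive intervals, and each group stands in for the per-interval RDD that Spark Streaming would build.

```python
def split_into_batches(records, batch_interval):
    """records: iterable of (timestamp_seconds, value) pairs in arrival order.
    Groups them into consecutive batches of length `batch_interval`,
    mirroring how a micro-batch engine cuts the stream into per-interval units."""
    batches = {}
    for ts, value in records:
        batch_id = int(ts // batch_interval)  # which interval this record falls into
        batches.setdefault(batch_id, []).append(value)
    # each returned list plays the role of one batch / RDD
    return [batches[k] for k in sorted(batches)]
```

With a 1-second interval, records arriving at t = 0.1 s and 0.4 s land in the first batch, and a record at t = 1.2 s lands in the second.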

Streaming computation models

Stream processing is a very important branch of big data processing. Different stream processing frameworks use different computation models, which fall into two main types:

  • Native stream processing (Native): Flink and Storm, for example, process every input record one by one as it arrives
  • Micro-batch processing (Micro-Batch): Spark Streaming and Structured Streaming, for example, divide the input data into many small batches at a fixed time interval T, then process each batch as a unit
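The two models above can be contrasted with a minimal plain-Python sketch (a simulation of the idea only; real engines add scheduling, state, and fault tolerance):

```python
def native_process(stream, f):
    """Native model (Flink/Storm style): each record is handled
    as soon as it arrives -- one invocation per record."""
    return [f(record) for record in stream]

def micro_batch_process(stream, f, batch_size):
    """Micro-batch model (Spark Streaming style): records are buffered
    into small batches; each full batch is then processed as one job."""
    results, buffer = [], []
    for record in stream:
        buffer.append(record)
        if len(buffer) == batch_size:
            results.append([f(x) for x in buffer])  # one job per batch
            buffer = []
    if buffer:  # flush the final, possibly partial batch
        results.append([f(x) for x in buffer])
    return results
```

The native model yields one output per record, while the micro-batch model yields one output list per batch; this buffering is what gives micro-batch engines higher throughput at the cost of per-record latency.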


Origin: blog.csdn.net/m0_49834705/article/details/112861218