Spark Streaming Principles and Practice

Author: Zen and the Art of Computer Programming

1. Introduction

Apache Spark is an open-source distributed computing framework, maintained by the Apache Software Foundation, that is built around in-memory computation. It can process massive amounts of data quickly and perform real-time analysis. Because Spark can handle real-time streaming data, more and more people are using it to build streaming applications. Several stream processing tools already exist in this space, such as Storm, Flink, and Kafka Streams, but each has its own programming model, and the languages and ecosystems they support are inconsistent. Apache Spark Streaming (SS for short) emerged in this context: it is the module of Apache Spark that provides high-throughput, low-latency processing of real-time streaming data. This article elaborates on the background, architecture, and characteristics of SS, and shares knowledge about its usage, principles, and optimization techniques through practical cases.

2. What is Spark Streaming?

Spark Streaming is the module in Apache Spark used to process real-time streaming data. It leverages Spark's speed and fault tolerance to ingest data from multiple sources simultaneously and deliver results to target systems in batches or continuously. Spark Streaming provides high-throughput, low-latency processing for real-time data and suits application scenarios such as real-time data analysis, reporting, search engines, and recommendation engines. Its architecture consists of the components described below.

The Spark Streaming module consists of three main components, all of which appear in the code sketch after this list:

  1. Input data sources: Spark Streaming can read data from multiple data sources (such as Kafka, Flume, Kinesis, and TCP sockets).
  2. Data receiver (Receiver): a receiver reads data from the input data source and stores it in Spark's memory as blocks, ready to be processed.
  3. Processing engine: the Spark engine groups the received data into small batches (DStreams), runs the application's transformations on each batch, and pushes the results to external systems such as file systems, databases, or dashboards.
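
To make the three components concrete, here is a minimal sketch of the classic Spark Streaming word count using the DStream API. This example is not from the original article; the host `localhost`, port `9999`, and the 5-second batch interval are illustrative assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // Run locally with two threads: one for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    // Processing engine: group incoming data into 5-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(5))

    // Input source + receiver: read lines of text from a TCP socket
    // (host and port are assumptions; feed it with e.g. `nc -lk 9999`)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Transformations applied to each micro-batch
    val counts = lines.flatMap(_.split("\\s+"))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    // Output: print the first few results of each batch to stdout
    counts.print()

    ssc.start()             // start receiving and processing
    ssc.awaitTermination()  // block until the job is stopped
  }
}
```

Running this with text arriving on the socket prints per-batch word counts every 5 seconds, showing the source, receiver, and processing engine working together.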
