Getting Started with Big Data: A Practical Analysis of Spark Streaming

Author: Zen and the Art of Computer Programming

1. Introduction

In recent years, with the rapid growth of the Internet and the Internet of Things, the collection, processing, and storage of big data have posed enormous challenges. How to process massive data efficiently and respond to user requests quickly has become an unavoidable practical problem. Apache Spark is an open-source big data computing framework that combines distributed computing with in-memory storage to deliver high-performance parallel computation and real-time streaming analysis; it has become a de facto standard for big data processing. Spark Streaming extends Spark with stream-processing capabilities, allowing developers to perform real-time big data analysis more flexibly.

This article starts from the basics of Apache Spark Streaming: it first introduces the main concepts and architecture of Spark Streaming, then examines its principles and applications, and finally offers solutions for some common scenarios. I hope that after reading this article, readers will have a better understanding of the features and applications of Spark Streaming.

2. Explanation of Concepts and Terminology

2.1 Spark Streaming

Apache Spark Streaming is a sub-project of Apache Spark used for processing real-time data streams. Hadoop MapReduce is limited to batch processing of static data sets and cannot meet the need for low-latency processing of live data. Spark Streaming instead divides the incoming stream into small micro-batches and processes each one with Spark's highly optimized execution engine, achieving near-real-time data processing.
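
To make the micro-batch model concrete, below is a minimal word-count sketch in Scala using the classic DStream API. The host, port, and 5-second batch interval are placeholder values; the stream can be fed locally with `nc -lk 9999`:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MicroBatchWordCount {
  def main(args: Array[String]): Unit = {
    // One micro-batch every 5 seconds; each batch runs as a small Spark job.
    val conf = new SparkConf().setMaster("local[2]").setAppName("MicroBatchWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // DStream backed by a TCP text source (placeholder host/port).
    val lines = ssc.socketTextStream("localhost", 9999)

    // Standard RDD-style transformations, applied to every micro-batch.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)

    counts.print()          // Output operation: print the first elements of each batch.
    ssc.start()             // Start receiving and processing data.
    ssc.awaitTermination()  // Block until the job is stopped.
  }
}
```

Note that the batch interval passed to `StreamingContext` is the key tuning knob: a shorter interval lowers latency, while a longer one improves throughput per batch.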

The main components of Spark Streaming are as follows (a sketch wiring them together appears after the list):

  1. Input Sources: data sources such as Kafka, Flume, or Kinesis.
  2. Processing Logic: transformations applied to the discretized stream (DStream), such as map, filter, reduceByKey, and window operations.
  3. Output Sinks: destinations for the processed results, such as HDFS, databases, or live dashboards.
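
As a sketch of how these three components fit together, the following Scala example reads from Kafka with the spark-streaming-kafka-0-10 connector, counts records per key, and prints the results. The broker address, group id, and topic name ("events") are hypothetical placeholders:

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object KafkaPipeline {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setMaster("local[2]").setAppName("KafkaPipeline"),
      Seconds(10))

    // 1. Input source: subscribe to a Kafka topic (placeholder broker and topic).
    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG        -> "localhost:9092",
      ConsumerConfig.GROUP_ID_CONFIG                 -> "spark-streaming-demo",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG   -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer])
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    // 2. Processing logic: count records per key in each micro-batch.
    //    (record.key may be null if the producer sends unkeyed messages.)
    val counts = stream.map(record => (record.key, 1L)).reduceByKey(_ + _)

    // 3. Output sink: printed here for simplicity; in practice results are
    //    written to HDFS, a database, or a dashboard via foreachRDD.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```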
