Spark Series (13) - Spark Streaming and Stream Processing

1. Stream Processing

1.1 Static Data Processing

Before stream processing, data was typically stored in a database, file system, or other storage system, and applications queried or computed over that data as needed. This is the traditional static data processing architecture. Storing data in Hadoop HDFS and querying or analyzing it with MapReduce is a typical example of this architecture.

1.2 Stream Processing

Stream processing, by contrast, works directly on data in motion: computation happens as soon as the data is received.

Most data arrives as continuous streams: sensor events, user activity on a website, financial transactions, and so on; all of this data is created over time.

A system that receives and sends data streams and executes application or analysis logic is called a stream processor. The primary responsibility of a stream processor is to ensure that data flows efficiently, while also providing scalability and fault tolerance. Storm and Flink are representative implementations.

Stream processing brings many advantages that static data processing cannot offer:

  • Applications can respond to data immediately: reduced data latency makes results more timely and better reflects expectations about the future;
  • Stream processing can handle larger volumes of data: the data stream is processed directly, and only a meaningful subset is retained and forwarded to the next processing unit; filtering the data step by step reduces the amount that must be processed, which allows a larger overall volume of data to be accepted;
  • Stream processing better matches the real data model: in real environments all data is constantly changing; to infer future trends from past data, the model must be continuously revised against the continuously arriving input. Financial and stock markets are typical examples, where stream processing better satisfies the need for continuity and timeliness of data;
  • Stream processing decentralizes and decouples the infrastructure: streaming reduces the need for large central databases; instead, each stream processing program maintains its own data and state via the streaming framework, which makes stream processing a better fit for a microservice architecture.

2. Spark Streaming

2.1 Introduction

Spark Streaming is a sub-module of Spark used to quickly build scalable, high-throughput, fault-tolerant stream processing programs. It has the following characteristics:

  • Applications are built with a high-level API and are simple to use (see the sketch after this list);
  • Multiple languages are supported, such as Java, Scala, and Python;
  • Good fault tolerance: Spark Streaming supports fast recovery of lost operator state after failures;
  • Seamless integration with other Spark modules, combining stream processing with batch processing;
  • Spark Streaming can read data from HDFS, Flume, Kafka, Twitter, and ZeroMQ, and also supports custom data sources.
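To make the high-level API concrete, here is a minimal word-count sketch in Scala over a TCP text source. The hostname and port ("localhost", 9999) and the 5-second batch interval are illustrative assumptions, not values from this article:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Local mode needs at least 2 threads: one for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    // Split the incoming stream into 5-second micro-batches (assumed interval)
    val ssc = new StreamingContext(conf, Seconds(5))

    // DStream of text lines received from the TCP source (assumed host/port)
    val lines = ssc.socketTextStream("localhost", 9999)
    // Classic word count expressed with the high-level DStream API
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // block until the streaming job is stopped
  }
}
```

Feeding text into port 9999 (for example with `nc -lk 9999`) would produce a word count for each batch interval.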

2.2 DStream

Spark Streaming provides a high-level abstraction called a Discretized Stream (DStream) to represent a continuous stream of data. A DStream can be created from input data streams coming from sources such as Kafka, Flume, and Kinesis, or derived by transforming other DStreams. Internally, a DStream is represented as a sequence of RDDs.
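Continuing from the word-count sketch above (so `ssc` and `lines` are assumed to exist already), the fragment below illustrates both points: a new DStream is derived from another by transformation, and `foreachRDD` exposes the RDD that backs each micro-batch:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

// Derive a new DStream from `lines` by transformation
val upper = lines.map(_.toUpperCase)

// Internally, each batch interval of a DStream is backed by one RDD;
// foreachRDD hands that RDD (plus the batch time) to user code
upper.foreachRDD { (rdd: RDD[String], time: Time) =>
  println(s"Batch at $time contains ${rdd.count()} records")
}
```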

Storm and Flink are stream computing frameworks in the true sense, whereas Spark Streaming splits the data stream at a very fine granularity into multiple batches, achieving an effect close to stream processing; in essence, however, it is still batch processing (or micro-batch processing).


More articles in this big data series can be found in the GitHub open-source project Big Data Getting Started Guide.
