Structured Streaming Programming Guide

Introduction

  Structured Streaming is a scalable, fault-tolerant stream processing engine built on the Spark SQL engine. You express a streaming computation the same way you would express a batch computation on static data, and the Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. You can use the Dataset/DataFrame API in Scala, Java, Python, or R to express streaming aggregations, event-time windows, stream-to-batch joins, and so on. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and write-ahead logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
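  To make the exactly-once claim more concrete, here is a toy model in plain Python. It is not Spark code and does not reflect Spark's actual internals: the names OffsetLog, Sink, and run are hypothetical, invented for this sketch. It assumes the general recipe of recording each batch's offset range in a log before processing it, and of having the sink ignore a batch id it has already stored, so that replaying a batch after a crash does not double-count.

```python
from typing import Dict, List, Optional, Tuple


class OffsetLog:
    """Toy write-ahead log: records each batch's offset range before it runs."""

    def __init__(self) -> None:
        self.planned: Dict[int, Tuple[int, int]] = {}

    def plan(self, batch_id: int, start: int, end: int) -> Tuple[int, int]:
        # If this batch was already planned before a crash, reuse the recorded
        # range so the replay reads exactly the same records.
        return self.planned.setdefault(batch_id, (start, end))


class Sink:
    """Toy sink; when idempotent, it skips batch ids it has already stored."""

    def __init__(self, idempotent: bool) -> None:
        self.idempotent = idempotent
        self.writes: List[Tuple[int, int]] = []

    def write(self, batch_id: int, value: int) -> None:
        if self.idempotent and any(b == batch_id for b, _ in self.writes):
            return  # duplicate delivery of a replayed batch: ignore it
        self.writes.append((batch_id, value))

    def total(self) -> int:
        return sum(value for _, value in self.writes)


def run(source: List[int], log: OffsetLog, sink: Sink, batch_size: int,
        crash_after_batch: Optional[int] = None) -> None:
    """Sum the source batch by batch, optionally crashing right after one write."""
    # Resume from the last planned batch: it may or may not have reached the sink.
    batch_id = max(log.planned, default=0)
    start = log.planned[batch_id][0] if log.planned else 0
    while start < len(source):
        end = min(start + batch_size, len(source))
        start, end = log.plan(batch_id, start, end)   # 1. log the offsets first
        sink.write(batch_id, sum(source[start:end]))  # 2. process and write
        if crash_after_batch == batch_id:
            raise RuntimeError("simulated crash")
        batch_id, start = batch_id + 1, end


def run_with_one_crash(idempotent: bool) -> int:
    """Run once with a crash after batch 1, restart, and return the sink total."""
    source = list(range(1, 11))  # the true sum is 55
    log, sink = OffsetLog(), Sink(idempotent)
    try:
        run(source, log, sink, batch_size=4, crash_after_batch=1)
    except RuntimeError:
        pass  # the "process" died; the log and the sink survive
    run(source, log, sink, batch_size=4)  # restart and replay batch 1
    return sink.total()


exactly_once_total = run_with_one_crash(idempotent=True)
at_least_once_total = run_with_one_crash(idempotent=False)
```

  With the idempotent sink, the replayed batch is ignored and the total matches the true sum. With the plain sink, the replayed batch is stored twice, which is what an at-least-once guarantee permits.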

  Internally, by default, Structured Streaming queries are processed by a micro-batch processing engine, which treats a data stream as a series of small batch jobs. This achieves end-to-end latencies as low as 100 milliseconds together with exactly-once fault-tolerance guarantees.
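  The micro-batch idea can be sketched without Spark. The following plain-Python toy is an illustration only, not how the engine is implemented, and every function name in it is hypothetical. It runs the same batch word-count logic on each small batch, folds the result into running state, and ends up with the answer a single batch job over all the data would give.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def batch_word_count(lines: Iterable[str]) -> Counter:
    """Ordinary batch computation over a complete, static dataset."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(lines: List[str], batch_size: int) -> Iterator[List[str]]:
    """Cut the input into consecutive small batches, standing in for arriving data."""
    for start in range(0, len(lines), batch_size):
        yield lines[start:start + batch_size]


def incremental_word_count(lines: List[str], batch_size: int) -> Iterator[Counter]:
    """Apply the batch logic to each micro-batch and merge it into running state."""
    state: Counter = Counter()
    for batch in micro_batches(lines, batch_size):
        state.update(batch_word_count(batch))  # merge this batch's counts
        yield Counter(state)                   # snapshot of the result so far


stream = ["spark streaming", "structured streaming", "spark sql", "streaming"]
snapshots = list(incremental_word_count(stream, batch_size=2))
final_result = snapshots[-1]
```

  Each snapshot plays the role of the updated result after one micro-batch, and the last snapshot equals batch_word_count(stream).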

  Spark 2.3, however, introduced a new low-latency processing mode called Continuous Processing, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. You can choose between the two modes based on your application requirements, without changing the Dataset/DataFrame operations in your queries.
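  As a rough sketch of what choosing the mode can look like in PySpark: the query definition stays the same and only the trigger passed to the writer changes. The helper trigger_options, the mode names "micro-batch" and "continuous", the application name, and the filter on the rate source are assumptions made for this example. start_query needs a local PySpark installation (Spark 2.3 or later), so the import is deferred until it is called.

```python
from typing import Dict


def trigger_options(mode: str, interval: str = "1 second") -> Dict[str, str]:
    """Map a (hypothetical) mode name to keyword arguments for DataStreamWriter.trigger()."""
    if mode == "micro-batch":
        # Default engine: the interval is how often a micro-batch is started.
        return {"processingTime": interval}
    if mode == "continuous":
        # Continuous Processing: the interval is the checkpoint interval.
        return {"continuous": interval}
    raise ValueError(f"unknown mode: {mode!r}")


def start_query(mode: str):
    """Start the same query under either mode and return the running query."""
    # Imported lazily so trigger_options can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TriggerModeSketch").getOrCreate()
    events = (spark.readStream
              .format("rate")                  # built-in source intended for testing
              .option("rowsPerSecond", 5)
              .load())
    evens = events.where("value % 2 = 0")      # a simple filter, the same in both modes
    return (evens.writeStream
            .format("console")
            .trigger(**trigger_options(mode))  # the only line that depends on the mode
            .start())
```

  Calling start_query("micro-batch") or start_query("continuous") would then run the identical filter under either engine. A query that uses operations beyond simple projections and filters may not be accepted in continuous mode.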


Reposted from www.cnblogs.com/yy3b2007com/p/9463426.html