【Big Data: Stream Processing with Samza】

1、What is Samza?

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

2、Samza Features

Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based “process message” API comparable to MapReduce.

Managed state: Samza manages snapshotting and restoration of a stream processor’s state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).

Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.

Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.

Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.

Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.

Processor isolation: Samza works with Apache YARN, which supports Hadoop’s security model, and resource isolation through Linux CGroups.
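
To make the callback-based "process message" idea and the per-partition ordering more concrete, here is a minimal sketch in plain Python. It is not Samza's API (Samza's own task interface is Java, where a task implements a `process` callback). Every name below (`Envelope`, `PartitionedStream`, `WordCountTask`, `run`) is a hypothetical stand-in, and the in-memory lists only imitate what Kafka partitions provide.

```python
from collections import defaultdict
from dataclasses import dataclass
from zlib import crc32


@dataclass
class Envelope:
    """Hypothetical stand-in for an incoming message."""
    key: str
    value: str
    partition: int
    offset: int


class PartitionedStream:
    """Toy in-memory stream: each partition is an append-only, ordered log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, value):
        # Messages with the same key always land in the same partition.
        p = crc32(key.encode("utf-8")) % len(self.partitions)
        log = self.partitions[p]
        log.append(Envelope(key, value, p, len(log)))
        return p


class WordCountTask:
    """One task instance per partition; the runner calls process() per message."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, envelope, collector):
        # The only user code: a callback invoked once for each message.
        for word in envelope.value.split():
            self.counts[word] += 1
            collector.append((word, self.counts[word]))


def run(stream, task_factory):
    """Drive one task per partition, delivering messages in offset order."""
    outputs = []
    tasks = [task_factory() for _ in stream.partitions]
    for task, log in zip(tasks, stream.partitions):
        for envelope in log:
            task.process(envelope, outputs)
    return tasks, outputs


stream = PartitionedStream(num_partitions=2)
for key, text in [("u1", "a b"), ("u2", "b"), ("u1", "a")]:
    stream.send(key, text)
tasks, outputs = run(stream, WordCountTask)
```

The point of the design is that the task author writes only `process`; partition assignment, ordering, and delivery belong to the framework, which is what lets the same task code scale by adding partitions.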

In short, Samza is a stream processing framework with the characteristics listed above: a simple callback-based API, managed state with snapshot and restore, fault tolerance through YARN, durable and ordered messaging through Kafka, partitioning at every level, pluggable messaging and execution layers, and processor isolation.

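Today's popular open-source stream processing systems are all quite young, and no single system offers a comprehensive solution. Open problems in this area include: 1. how the state of a stream computation should be managed; 2. whether streams should be buffered to disk on remote machines; 3. what to do when duplicate messages are received or messages are lost; 4. how to build the underlying messaging system.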

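Samza differs from other systems mainly in the following respects:

Samza supports fault-tolerant local state. State is itself modeled as a stream. If local state is lost because a machine fails, the state stream is replayed to restore it.

Streams are ordered, partitioned, replayable, and fault tolerant.

YARN is used for processor isolation, security, and fault tolerance.

Tasks are decoupled: if one task is slow and builds up a backlog of messages, the rest of the system is not affected.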
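
The idea that state is itself a stream, replayed after a failure, can be sketched as follows. This is an illustration under stated assumptions, not Samza's implementation: `ChangelogStore` and its methods are hypothetical names, the Python list stands in for a durable, replayable stream, and the dictionary stands in for state kept on the local machine.

```python
class ChangelogStore:
    """Toy key-value store whose every write is also appended to a changelog."""

    def __init__(self, changelog):
        self.changelog = changelog  # stands in for a durable, replayable stream
        self.local = {}             # stands in for state on the local machine

    def put(self, key, value):
        # Record the write in the changelog first, then apply it locally.
        self.changelog.append((key, value))
        self.local[key] = value

    def get(self, key, default=None):
        return self.local.get(key, default)

    @classmethod
    def restore(cls, changelog):
        # Rebuild local state by replaying every change in order;
        # later writes to the same key overwrite earlier ones.
        store = cls(changelog)
        for key, value in changelog:
            store.local[key] = value
        return store


changelog = []
store = ChangelogStore(changelog)
for page in ["home", "cart", "home"]:
    store.put(page, store.get(page, 0) + 1)

# Simulate a machine failure: the local dictionary is gone, the changelog survives.
del store
recovered = ChangelogStore.restore(changelog)
```

Because the replay applies changes in their original order, the recovered store ends up with the same final value per key as the store that was lost.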



 


Reposted from gaojingsong.iteye.com/blog/2324055