【Apache之 Samza 介绍】

一、What is messaging?

Apache Samza是LinkedIn最近开源的一款流处理器。

Samza是由LinkedIn开源的一个技术,它是一个开源的分布式流处理系统,非常类似于Storm。不同的是它运行在Hadoop之上,并且使用了自己开发的Kafka分布式消息处理系统。

这是Linkin开发的一个小而美的项目,如何美呢?

1. 只有几千行代码,完成的功能就可以和Storm媲美,当然目前还有很多的不足

2. 和Kafka结合紧密,更方便的处理数据

3. 运行在Yarn上

Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream consumers read messages from these systems, and process them or take actions based on the message contents.

Suppose you have a website, and every time someone loads a page, you send a “user viewed page” event to a messaging system. You might then have consumers which do any of the following:

1)Store the message in Hadoop for future analysis

2)Count page views and update a dashboard

3)Trigger an alert if a page view fails

4)Send an email notification to another user

5)Join the page view event with the user’s profile, and send the message back to the messaging system

A messaging system lets you decouple all of this work from the actual web page serving.

二、What is stream processing?

A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.

Consider the counting example, above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.) What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it’s too much for a single machine to handle?

Stream processing is a higher level of abstraction on top of messaging systems, and it’s meant to address precisely this category of problems.

三、What is Samza?

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

1)Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based “process message” API comparable to MapReduce.

2)Managed state: Samza manages snapshotting and restoration of a stream processor’s state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).

3)Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.

4)Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.

5)Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.

6)Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.

7)Processor isolation: Samza works with Apache YARN, which supports Hadoop’s security model, and resource isolation through Linux CGroups.

四、Architecture

Samza is made up of three layers:

A streaming layer.

An execution layer.

A processing layer.

Samza主要包含三层,

1.流处理层 --> Kafka

2.执行层     --> YARN

3.处理层    --> Samza API

Samza provides out of the box support for all three layers.

Streaming: Kafka

Execution: YARN

Processing: Samza API

These three pieces fit together to form Samza:



 

猜你喜欢

转载自gaojingsong.iteye.com/blog/2372888