[Introduction to Apache Samza]

1. What is messaging?

Apache Samza is a distributed stream processing system open sourced by LinkedIn and very similar to Storm. The difference is that it runs on Hadoop YARN and uses Kafka, the distributed messaging system that was also developed at LinkedIn.

This is a small but elegant project developed by LinkedIn. What makes it elegant?

1. With only a few thousand lines of code, it offers functionality comparable to Storm, although it still has many shortcomings.

2. It is tightly integrated with Kafka, which makes processing data more convenient.

3. It runs on YARN.

Messaging systems are a popular way of implementing near-realtime asynchronous computation. Messages can be added to a message queue (ActiveMQ, RabbitMQ), pub-sub system (Kestrel, Kafka), or log aggregation system (Flume, Scribe) when something happens. Downstream consumers read messages from these systems, and process them or take actions based on the message contents.

 

Suppose you have a website, and every time someone loads a page, you send a “user viewed page” event to a messaging system. You might then have consumers which do any of the following:

 

1) Store the message in Hadoop for future analysis

2) Count page views and update a dashboard

3) Trigger an alert if a page view fails

4) Send an email notification to another user

5) Join the page view event with the user’s profile, and send the message back to the messaging system

A messaging system lets you decouple all of this work from the actual web page serving.
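To make the decoupling concrete, here is a minimal sketch of the page-view example using a toy in-memory topic. The `Topic` class, the consumer names, and the event fields are all hypothetical and only illustrate the idea; a real deployment would use a system such as Kafka rather than a Python list.

```python
from collections import defaultdict


class Topic:
    """A toy in-memory, append-only topic with independent consumer offsets."""

    def __init__(self):
        self.log = []                      # messages in arrival order
        self.offsets = defaultdict(int)    # consumer name -> next index to read

    def publish(self, message):
        self.log.append(message)

    def poll(self, consumer):
        """Return every message this consumer has not seen yet."""
        start = self.offsets[consumer]
        self.offsets[consumer] = len(self.log)
        return self.log[start:]


page_views = Topic()

# The web server only publishes; it knows nothing about the consumers.
page_views.publish({"user": "alice", "url": "/home", "status": 200})
page_views.publish({"user": "bob", "url": "/cart", "status": 500})
page_views.publish({"user": "alice", "url": "/cart", "status": 200})

# Consumer 1: count page views for a dashboard.
view_count = len(page_views.poll("dashboard"))

# Consumer 2: collect alerts for failed page views.
alerts = [m["url"] for m in page_views.poll("alerting") if m["status"] >= 500]
```

Each consumer keeps its own offset, so adding a new consumer (for example, an email notifier) would not require any change to the publishing code.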

 

2. What is stream processing?

A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer. Samza aims to help with these problems.

 

Consider the counting example, above (count page views and update a dashboard). What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover? Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.) What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it’s too much for a single machine to handle?
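Two of these questions, duplicate delivery and grouping by page URL across machines, can be sketched in a few lines. The example below is only an illustration under stated assumptions: every event carries a unique `id` field, there are two partitions, and the `PageViewCounter` class and `partition_for` function are hypothetical names, not part of any real framework.

```python
from collections import Counter
import zlib

NUM_PARTITIONS = 2  # assumed; a real deployment would have many more


def partition_for(url):
    """Route every event for the same URL to the same partition."""
    return zlib.crc32(url.encode("utf-8")) % NUM_PARTITIONS


class PageViewCounter:
    """Counts views per URL for one partition, ignoring redelivered messages."""

    def __init__(self):
        self.counts = Counter()
        self.seen_ids = set()   # unbounded here; a real job would expire old ids

    def process(self, event):
        if event["id"] in self.seen_ids:
            return              # duplicate delivery: do not count twice
        self.seen_ids.add(event["id"])
        self.counts[event["url"]] += 1


counters = [PageViewCounter() for _ in range(NUM_PARTITIONS)]

events = [
    {"id": 1, "url": "/home"},
    {"id": 2, "url": "/cart"},
    {"id": 2, "url": "/cart"},   # the messaging system delivered this twice
    {"id": 3, "url": "/home"},
]

for event in events:
    counters[partition_for(event["url"])].process(event)

# Merge the per-partition counts for the dashboard.
totals = Counter()
for counter in counters:
    totals.update(counter.counts)
```

Because all events for one URL land on the same partition, each counter can work independently, which is what allows the computation to be spread over several machines. The sketch does not address lost messages or recovery after a crash.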

 

Stream processing is a higher level of abstraction on top of messaging systems, and it’s meant to address precisely this category of problems.

 

 

3. What is Samza?

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.

 

1) Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based “process message” API comparable to MapReduce.

2) Managed state: Samza manages snapshotting and restoration of a stream processor’s state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).

3) Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.

4) Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.

5) Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.

6) Pluggable: Though Samza works out of the box with Kafka and YARN, Samza provides a pluggable API that lets you run Samza with other messaging systems and execution environments.

7) Processor isolation: Samza works with Apache YARN, which supports Hadoop’s security model, and resource isolation through Linux CGroups.
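The callback API and managed state described in the list above can be illustrated with a small simulation. This is not Samza's API: Samza itself is written for the JVM, where a task implements the `StreamTask` interface and its `process(IncomingMessageEnvelope, MessageCollector, TaskCoordinator)` method. The Python classes `CountingTask` and `ToyRunner` below are hypothetical stand-ins that only show the idea of checkpointing state together with an input offset and replaying from that point after a failure.

```python
import copy


class CountingTask:
    """Hypothetical task: the runner calls process() once per message."""

    def __init__(self):
        self.state = {}          # url -> number of views

    def process(self, message):
        url = message["url"]
        self.state[url] = self.state.get(url, 0) + 1


class ToyRunner:
    """Hypothetical runner that checkpoints task state with the input offset."""

    def __init__(self, log, checkpoint_every=2):
        self.log = log
        self.checkpoint_every = checkpoint_every
        self.checkpoint = {"offset": 0, "state": {}}

    def run(self, task, stop_before=None):
        # Restore the last consistent snapshot, then replay from its offset.
        task.state = copy.deepcopy(self.checkpoint["state"])
        offset = self.checkpoint["offset"]
        end = len(self.log) if stop_before is None else stop_before
        while offset < end:
            task.process(self.log[offset])
            offset += 1
            if offset % self.checkpoint_every == 0:
                self.checkpoint = {"offset": offset,
                                   "state": copy.deepcopy(task.state)}
        return task


log = [{"url": "/home"}, {"url": "/cart"}, {"url": "/home"}, {"url": "/home"}]
runner = ToyRunner(log)

# The first attempt "crashes" after three messages; the last checkpoint
# covers only the first two.
runner.run(CountingTask(), stop_before=3)

# A fresh task on "another machine" restores the snapshot and replays the rest.
recovered = runner.run(CountingTask())
```

The third message is processed twice overall, once before the simulated crash and once during replay, but because the state is rolled back to the snapshot the final counts are still correct. This relies on the input being replayable from a known offset, which is the property the list attributes to Kafka.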

 

 

4. Architecture

Samza is made up of three layers:

A streaming layer.

An execution layer.

A processing layer.

Samza provides out-of-the-box support for all three layers:

Streaming: Kafka

Execution: YARN

Processing: Samza API

These three pieces fit together to form Samza.
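As a rough sketch of how the layers relate, the toy code below keeps each layer behind its own small class or function. All names here (`StreamingLayer`, `ExecutionLayer`, `uppercase_task`) are hypothetical stand-ins for Kafka, YARN, and user code written against the Samza API; none of them correspond to real classes in those projects.

```python
class StreamingLayer:
    """Stands in for Kafka: an ordered list of messages per partition."""

    def __init__(self, partitions):
        self.partitions = partitions          # partition id -> list of messages

    def read(self, partition_id):
        return self.partitions[partition_id]


class ExecutionLayer:
    """Stands in for YARN: decides which container runs which partitions."""

    def __init__(self, num_containers):
        self.num_containers = num_containers

    def assign(self, partition_ids):
        assignment = {c: [] for c in range(self.num_containers)}
        for p in sorted(partition_ids):
            assignment[p % self.num_containers].append(p)
        return assignment


def uppercase_task(message):
    """Stands in for user code written against the processing API."""
    return message.upper()


streams = StreamingLayer({0: ["a", "b"], 1: ["c"], 2: ["d", "e"]})
executor = ExecutionLayer(num_containers=2)

# The execution layer places partitions; the task only ever sees messages.
assignment = executor.assign(streams.partitions.keys())
output = {
    container: [uppercase_task(m) for p in parts for m in streams.read(p)]
    for container, parts in assignment.items()
}
```

Keeping the task unaware of where it runs and where its messages come from is what would allow either of the other two layers to be swapped out.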



 
