Three Frameworks for Streaming Big Data Processing: Storm, Spark and Samza

Many distributed computing systems can process large data streams in real time or near real time. This article briefly introduces three such Apache frameworks, Storm, Spark Streaming, and Samza, and then attempts a quick, high-level overview of their similarities and differences.

Apache Storm

In Storm, we first design a graph-like structure for real-time computation, which we call a topology. This topology is submitted to the cluster, where the master node distributes the code and assigns tasks to the worker nodes for execution. A topology includes two roles, spout and bolt. A spout is the source of the stream: it emits data into the topology in the form of tuples. A bolt is responsible for transforming those streams; inside a bolt, operations such as computation and filtering can be performed, and a bolt can in turn emit data onward to other bolts. The tuples emitted by a spout are immutable arrays corresponding to fixed key-value pairs.
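
As a minimal sketch of these ideas, the topology below wires one spout to one bolt, assuming the org.apache.storm Java API (older releases use the backtype.storm packages); the class names and the emitted words are illustrative, not anything prescribed by the article.

```java
import java.util.Map;
import java.util.Random;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class WordTopology {

    // Spout: emits a stream of single-word tuples.
    public static class RandomWordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final String[] words = {"storm", "spark", "samza"};
        private final Random random = new Random();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(words[random.nextInt(words.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt: transforms (here, simply prints) each tuple it receives.
    public static class PrinterBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println("received: " + tuple.getStringByField("word"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt emits nothing downstream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new RandomWordSpout());
        builder.setBolt("printer", new PrinterBolt()).shuffleGrouping("words");

        // Run locally; on a real cluster you would use StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-topology", new Config(), builder.createTopology());
    }
}
```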



Apache Spark

Spark Streaming is an extension of the core Spark API. Rather than processing stream items one at a time as Storm does, it slices the incoming data into micro-batches at fixed time intervals before processing them. Spark's abstraction for a continuous data stream is called a DStream (Discretized Stream). A DStream is a sequence of micro-batches, each represented as an RDD (Resilient Distributed Dataset), a distributed collection that can be operated on in parallel. A DStream can be processed in two ways: by applying an arbitrary function to each batch, or by transforming the data over a sliding window.
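
A minimal word-count sketch using the Spark 2.x Java API is shown below; the 1-second batch interval, the localhost:9999 socket source, and the 30-second window with a 10-second slide are illustrative choices, not anything prescribed by the article.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingWordCount").setMaster("local[2]");

        // Slice the incoming stream into 1-second micro-batches.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Each batch of lines arrives as an RDD inside the DStream.
        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        JavaDStream<String> words =
            lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // Transformation over a sliding window: count the words seen in the
        // last 30 seconds, recomputed every 10 seconds.
        JavaPairDStream<String, Integer> windowedCounts = words
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(30), Durations.seconds(10));

        windowedCounts.print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```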


Apache Samza

When Samza processes a data stream, it handles each received message individually. Samza's stream unit is neither a tuple nor a DStream but a message. In Samza, a data stream is divided into partitions, each of which is an ordered sequence of read-only messages, and each message has a specific ID (its offset). The system also supports batching, in which several messages from the same stream partition are processed one after another. Samza's execution and streaming modules are pluggable, although by default Samza relies on Hadoop's YARN (a resource manager) and Apache Kafka.
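
Below is a minimal sketch of a task written against Samza's classic (pre-1.0) low-level API, in which each call to process handles exactly one message. The UppercaseTask class, the kafka system name, and the uppercased-text output stream are hypothetical; the actual wiring between streams and the task lives in the job's configuration file, which is not shown.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/**
 * Handles one message at a time from an input stream partition.
 * The input/output stream names are illustrative; the real wiring
 * (systems, streams, serdes) is declared in the job's properties file.
 */
public class UppercaseTask implements StreamTask {

    private static final SystemStream OUTPUT =
        new SystemStream("kafka", "uppercased-text");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Each envelope carries the message plus its partition and offset (its ID).
        String text = (String) envelope.getMessage();

        // Transform the message and send it to the output stream.
        collector.send(new OutgoingMessageEnvelope(OUTPUT, text.toUpperCase()));
    }
}
```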


Common ground

The three real-time computation systems above are all open-source distributed systems with advantages such as low latency, scalability, and fault tolerance. What they have in common is that they allow you to distribute tasks across a set of machines and run them in parallel with fault tolerance. In addition, they all provide simple APIs that hide the complexity of the underlying implementation.

The three frameworks use different terminology, but the concepts they represent are very similar; for example, Storm's tuple, Spark's DStream, and Samza's message each name the basic unit of the stream.


Comparison

The main differences lie in how messages are delivered and how state is managed.


There are three kinds of message-delivery guarantee:


  1. At-most-once: Messages may be lost, which is usually the least desirable outcome.
  2. At-least-once: Messages may be delivered more than once (nothing is lost, but there can be duplicates). This is sufficient in many use cases; one way to cope with the duplicates is sketched after this list.
  3. Exactly-once: Every message is delivered exactly once (no loss, no duplication). This is the ideal case, although it is difficult to guarantee in all use cases.
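
As a framework-agnostic illustration of the second guarantee, the sketch below shows one common way to make at-least-once delivery behave like exactly-once processing: remember the highest offset already applied per partition and drop redeliveries. The class and method names are hypothetical, and a real implementation would keep the offsets in durable storage and update them atomically with the processing result.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only: under at-least-once delivery the same message can arrive
 * twice, so the consumer remembers the highest offset it has applied for each
 * partition and skips anything it has already seen.
 */
public class DeduplicatingConsumer {

    private final Map<Integer, Long> lastAppliedOffset = new HashMap<>();

    /** Returns true if the message was processed, false if it was a duplicate. */
    public boolean processOnce(int partition, long offset, String message) {
        long highest = lastAppliedOffset.getOrDefault(partition, -1L);
        if (offset <= highest) {
            return false; // duplicate redelivery, safe to drop
        }
        apply(message);                           // side effect happens once per offset
        lastAppliedOffset.put(partition, offset); // remember progress
        return true;
    }

    private void apply(String message) {
        System.out.println("applying: " + message);
    }
}
```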


Another aspect is state management. There are different strategies for storing state: Spark Streaming writes data into a distributed file system such as HDFS; Samza uses an embedded key-value store; and Storm either pushes state management up to the application level or relies on the higher-level Trident abstraction.
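
To illustrate the embedded key-value store mentioned above, here is a sketch against Samza's classic (pre-1.0) API; the word-counts store name is hypothetical and would have to be declared in the job configuration, which is not shown.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

/**
 * Keeps a running count per word in Samza's embedded key-value store.
 * The store named "word-counts" is assumed to be declared in the job
 * configuration (stores.word-counts.* properties).
 */
public class WordCountTask implements StreamTask, InitableTask {

    private KeyValueStore<String, Integer> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        counts = (KeyValueStore<String, Integer>) context.getStore("word-counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String word = (String) envelope.getMessage();
        Integer current = counts.get(word);
        counts.put(word, current == null ? 1 : current + 1);
    }
}
```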

Use cases

All three frameworks are efficient and perform well when processing continuous streams of large amounts of real-time data, so which one should you use? There are no hard and fast rules here, at most a few general guidelines.

If what you want is a high-speed event-processing system that allows incremental computation, Storm is the best choice. It can also handle distributed computation on demand, while the client waits for the result, using its out-of-the-box distributed RPC (DRPC). Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery, you should look at the higher-level Trident API, which also provides a micro-batching approach.
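
As a sketch of what that Trident style looks like, here is the canonical word-count topology, assuming the org.apache.storm.trident Java API; persistentAggregate keeps the counts in a Trident-managed state, and the in-memory MemoryMapState used here is only a stand-in for a real state backend.

```java
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.FixedBatchSpout;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentWordCount {

    // Splits each sentence tuple into one tuple per word.
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static TridentTopology buildTopology() {
        // A test spout that replays small batches of sentences.
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
            new Values("the cow jumped over the moon"),
            new Values("four score and seven years ago"));
        spout.setCycle(true);

        TridentTopology topology = new TridentTopology();
        topology.newStream("sentences", spout)
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // persistentAggregate keeps per-word counts in Trident state,
                // giving exactly-once semantics over micro-batches.
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
        return topology;
    }
}
```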


Companies using Storm include Twitter, Yahoo, Spotify, and The Weather Channel, among others.

Speaking of micro-batching: if you must have stateful computation and exactly-once delivery, and you don't mind higher latency, consider Spark Streaming, especially if you also plan to use graph processing, machine learning, or SQL. The Apache Spark stack lets you combine these libraries (Spark SQL, MLlib, GraphX) with data streams, which gives you a convenient all-in-one programming model. In particular, streaming algorithms such as streaming k-means allow Spark to support real-time decision making.
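
As a sketch of streaming k-means on top of Spark Streaming (assuming the Spark 2.x MLlib Java API), the example below updates cluster centers as each micro-batch arrives; the input directories, the number of clusters, and the batch interval are illustrative choices.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.mllib.clustering.StreamingKMeans;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class StreamingKMeansSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("StreamingKMeansSketch").setMaster("local[2]");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Training and test points arrive as text files dropped into these
        // (hypothetical) directories, one vector per line, e.g. "[1.0,2.0]".
        JavaDStream<Vector> trainingData =
            jssc.textFileStream("/tmp/kmeans/train").map(Vectors::parse);
        JavaDStream<Vector> testData =
            jssc.textFileStream("/tmp/kmeans/test").map(Vectors::parse);

        // The model's cluster centers are updated as each micro-batch arrives.
        StreamingKMeans model = new StreamingKMeans()
            .setK(3)
            .setDecayFactor(1.0)
            .setRandomCenters(2, 0.0, 42L);

        model.trainOn(trainingData);
        model.predictOn(testData).print(); // cluster index per incoming point

        jssc.start();
        jssc.awaitTermination();
    }
}
```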

Companies using Spark include Amazon, Yahoo, NASA JPL, eBay, and Baidu.

If you have a large amount of state to work with (for example, many gigabytes per partition), Samza is the way to go. Because Samza co-locates storage and processing on the same machine, it can work with state efficiently without having to hold it all in memory. The framework offers a flexible, pluggable API: its default execution, messaging, and storage engines can each be swapped out for your own choice. Additionally, if your data stream passes through a large number of processing stages owned by different teams in different codebases, Samza's fine-grained jobs are especially useful, since jobs can be added or removed with minimal disruption.

Companies using Samza include LinkedIn, Intuit, Metamarkets, Quantiply, and Fortscale, among others.

Conclusion

In this article we have only taken a brief look at these three Apache frameworks and have not covered many of their features and subtler differences. The comparison is also necessarily limited, because all of these frameworks are evolving constantly, something we should keep in mind.
