Three frameworks for big data stream processing: how Storm, Spark, and Samza differ and how to choose between them

Many distributed computing systems can process big data streams in real time or near real time. This article gives a brief introduction to three such Apache frameworks and then attempts a quick, high-level overview of their similarities and differences.

Apache Storm

In Storm, you first design a graph structure for the real-time computation, called a topology. The topology is submitted to the cluster, where the master node distributes the code and assigns tasks to worker nodes for execution. A topology contains two kinds of roles: spouts and bolts. A spout is a message source and emits the data stream as tuples; a bolt receives and forwards the stream, can perform computations along the way such as filtering, and can itself emit data on to other bolts. In Storm, each tuple is an immutable array corresponding to a fixed set of key-value pairs.
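To make these roles concrete, here is a minimal Java sketch of how a topology might be wired together and run locally. RandomWordSpout and WordCountBolt are hypothetical stand-ins for your own spout and bolt classes, and the example assumes the org.apache.storm Java API.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The spout emits the stream as tuples, e.g. one word per tuple
        // (RandomWordSpout is assumed to declare an output field "word").
        builder.setSpout("word-spout", new RandomWordSpout(), 2);

        // The bolt receives tuples, performs a computation (here: counting),
        // and may emit new tuples to downstream bolts.
        builder.setBolt("count-bolt", new WordCountBolt(), 4)
               .fieldsGrouping("word-spout", new Fields("word"));

        Config conf = new Config();
        conf.setNumWorkers(2);

        // Run in a local in-process cluster; on a real cluster you would
        // submit with StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());
        Thread.sleep(10_000);
        cluster.shutdown();
    }
}
```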


Apache Spark

Spark Streaming is an extension of the core Spark API. It does not process data one record at a time the way Storm does; instead, it cuts the stream into small batches according to a preconfigured time interval before processing them. Spark's abstraction for a continuous data stream is called a DStream (Discretized Stream); a DStream is a micro-batched sequence of RDDs (Resilient Distributed Datasets). An RDD is a distributed data set that can be operated on in parallel in two ways: by transforming it with arbitrary functions and by computing over sliding windows of data.
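A minimal Java sketch of this micro-batching model is shown below. It assumes the Spark 2.x streaming API and a plain text source on localhost port 9999 (for example `nc -lk 9999`); each one-second interval of input becomes one RDD in the DStream.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount");
        // Every 1-second slice of input becomes one micro-batch (one RDD in the DStream).
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // Ordinary transformations are applied batch by batch.
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
        JavaPairDStream<String, Integer> counts = words
                .mapToPair(w -> new Tuple2<>(w, 1))
                .reduceByKey((a, b) -> a + b);

        // Sliding windows are the other way to compute over the stream, e.g.:
        // counts.reduceByKeyAndWindow((a, b) -> a + b, Durations.seconds(30), Durations.seconds(10));

        counts.print();
        jssc.start();
        jssc.awaitTermination();
    }
}
```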


Apache Samza

When Samza processes a data stream, it handles each received message individually, one at a time. Samza's unit of a stream is neither a tuple nor a DStream, but a message. In Samza, the data stream is divided into partitions, each of which consists of an ordered sequence of read-only messages, and each message carries a specific ID (its offset). The system also supports batching, that is, processing several messages from the same stream partition in sequence. Samza's execution and streaming modules are both pluggable, although Samza is characterized by its reliance on Hadoop's YARN (Yet Another Resource Negotiator) and Apache Kafka.
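A minimal Java sketch of the per-message model is shown below, using the classic low-level StreamTask API. The FilterTask class and its filtering rule are hypothetical; each call to process handles exactly one message, and the envelope exposes that message's offset within its partition.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class FilterTask implements StreamTask {
    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        // Each call handles exactly one message from one stream partition.
        String message = (String) envelope.getMessage();
        String offset = envelope.getOffset();   // the message's ID within its partition

        // Hypothetical filtering logic; a real task would typically re-emit
        // interesting messages to an output stream through the collector.
        if (message != null && message.contains("ERROR")) {
            System.out.println("offset=" + offset + " message=" + message);
        }
    }
}
```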


What they have in common

All three are open-source, distributed, real-time computing systems with low latency, scalability, and fault tolerance. Their common trait is that they let you run data-stream code by distributing tasks across a series of fault-tolerant computers working in parallel. They also all offer simple APIs that hide the complexity of the underlying implementation.

The three frameworks use different terminology, but the concepts behind the terms are very similar; for example, the basic unit of a stream is a tuple in Storm, a DStream in Spark Streaming, and a message in Samza.


Comparison

The main differences lie in how messages are delivered and how state is managed.


Message delivery guarantees fall into three categories:

At-most-once: messages may be lost. This is usually the least desirable outcome.

At-least-once: messages may be delivered more than once (nothing is lost, but there can be duplicates). This is sufficient for many use cases, especially when duplicates can be filtered out downstream (see the sketch after this list).

Exactly-once: every message is delivered once and only once (no loss, no duplicates). This is the ideal, although it is hard to guarantee in every use case.
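As a generic illustration, not tied to any of the three frameworks, at-least-once delivery is often turned into effectively exactly-once processing by deduplicating on a message ID such as an offset. The class below is a hypothetical sketch of that idea.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

public class DedupingProcessor {
    // In production this set would live in a durable store, not in memory.
    private final Set<String> seenIds = new HashSet<>();
    private final Consumer<String> handler;

    public DedupingProcessor(Consumer<String> handler) {
        this.handler = handler;
    }

    // Process a message only if its ID (e.g. a Kafka-style offset) has not been seen before.
    public void onMessage(String messageId, String payload) {
        if (seenIds.add(messageId)) {
            handler.accept(payload);   // first delivery: do the real work
        }
        // A redelivered duplicate is silently ignored, so the result is
        // effectively exactly-once even though delivery is at-least-once.
    }
}
```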

Another aspect is state management. The three systems take different approaches to storing state: Spark Streaming writes it to a distributed file system (such as HDFS); Samza uses an embedded key-value store; and in Storm you either roll your own state management at the application level or use the higher-level Trident abstraction.
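To illustrate Samza's approach, here is a hypothetical stateful task that keeps word counts in its local store. The sketch assumes the older StreamTask/InitableTask API and a store named "word-counts" declared in the job configuration.

```java
import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class StatefulCountTask implements StreamTask, InitableTask {
    private KeyValueStore<String, Integer> counts;

    @Override
    @SuppressWarnings("unchecked")
    public void init(Config config, TaskContext context) {
        // The embedded store is co-located with the task on the same machine.
        counts = (KeyValueStore<String, Integer>) context.getStore("word-counts");
    }

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String word = (String) envelope.getMessage();
        Integer current = counts.get(word);
        counts.put(word, current == null ? 1 : current + 1);
    }
}
```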

Use cases

All three frameworks are excellent and efficient at continuously processing large volumes of real-time data, so which one should you use? There are no hard and fast rules; the choice mostly comes down to a few guidelines.

If you want a high-speed event-processing system that allows incremental computation, Storm is the best choice. It can handle the case where you need further distributed computation while a client is waiting for the result, using distributed RPC (DRPC) out of the box. Last but not least, because Storm uses Apache Thrift, you can write topologies in any programming language. If you need state persistence and/or exactly-once delivery, you should look at the higher-level Trident API, which also offers micro-batching.


Companies using Storm include Twitter, Yahoo, Spotify, and The Weather Channel.

Speaking of micro-batching: if you need stateful computation and exactly-once delivery, and you do not mind higher latency, consider Spark Streaming, especially if you also plan on graph operations, machine learning, or SQL access. The Apache Spark stack lets you combine several libraries with the data stream (Spark SQL, MLlib, GraphX), which gives you a convenient unified programming model. In particular, streaming algorithms such as streaming k-means allow Spark to facilitate real-time decision making.

Companies using Spark include Amazon, Yahoo, NASA JPL, eBay, and Baidu.

If you have a large amount of state to manage, for example many gigabytes per partition, choose Samza. Because Samza places storage and processing on the same machine, it keeps processing efficient without loading additional state into memory. The framework offers a pluggable API: its default execution, messaging, and storage engines can each be replaced with your own choice. Moreover, if you have many stream-processing stages owned by different teams with different code bases, Samza's fine-grained jobs are especially suitable, because stages can be added or removed with minimal knock-on effects.

Companies using Samza include LinkedIn, Intuit, Metamarkets, Quantiply, and Fortscale.

Conclusion

In this article we have only taken a brief look at these three Apache frameworks and have not covered the large number of features and subtler differences between them. It is also worth remembering that any comparison of the three is limited, because all of these frameworks are evolving constantly.



Origin blog.csdn.net/chengxvsyu/article/details/92203620