What are the advantages of Storm over Spark and Hadoop in big data?

Abstract: Many newcomers to big data may not know what Storm is. In short, Storm is a distributed real-time computing system: Storm is to real-time computing what Hadoop is to batch processing.

1. Many newcomers to big data may not know what Storm is, so let me introduce it first: Storm is a distributed real-time computing system. Storm is to real-time computing what Hadoop is to batch processing.
Typical scenarios for Storm:
Streaming data processing. Storm can process messages as they arrive and write the results to a data store.
Distributed RPC. Because Storm's processing components are distributed and its processing latency is extremely low, it can be used as a general distributed RPC framework. In fact, a search engine is itself a distributed RPC system.
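To make the streaming scenario concrete, here is a minimal plain-Python sketch of the pattern described above: messages arrive, a bolt-like step transforms them, and results land in a store. The names (`incoming`, `process`, `store`) are illustrative only, not Storm API; in a real topology the source would be a spout reading from a queue and the store would be a real database.

```python
import json

# Toy stand-in for a message source (in Storm this would be a spout
# reading from e.g. a message queue); names here are illustrative.
incoming = ['{"user": "a", "clicks": 3}', '{"user": "b", "clicks": 5}']

store = {}  # stand-in for a real datastore


def process(raw):
    """Bolt-like step: parse one message and aggregate clicks per user."""
    event = json.loads(raw)
    store[event["user"]] = store.get(event["user"], 0) + event["clicks"]


for msg in incoming:
    process(msg)

print(store)  # {'a': 3, 'b': 5}
```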


Second, let's take a look at the characteristics of Storm, Spark, and Hadoop.
Storm: distributed real-time computing; it emphasizes real-time processing and is often used where latency requirements are strict.
Hadoop: distributed batch computing; it emphasizes batch processing and is commonly used for data mining and analysis.
Spark: an open-source cluster computing system based on in-memory computing, designed to make data analysis faster. Spark is a cluster computing environment similar to Hadoop, but with some useful differences that make it perform better on certain workloads. In particular, Spark keeps distributed datasets in memory, and in addition to supporting interactive queries it can also optimize iterative workloads.
Third, the advantages of Storm

  1. Simple programming
    Everyone working in big data is familiar with Hadoop. Based on Google's MapReduce, Hadoop gives developers the map and reduce primitives, which make parallel batch processing simple and elegant. Likewise, Storm provides simple, elegant primitives for real-time computation on big data, which greatly reduces the complexity of developing parallel real-time processing tasks and helps you build applications quickly and efficiently.
    Spark provides many types of dataset operations, unlike Hadoop, which offers only Map and Reduce. Its transformations include map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, mapValues, sort, and partitionBy; it also provides actions such as count, collect, reduce, lookup, and save. This variety of operations is convenient for users, and the communication model between processing nodes is no longer limited to Hadoop's single data-shuffle pattern: users can name, materialize, and control the partitioning of intermediate results. In short, Spark's programming model is more flexible than Hadoop's.
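To show what the richer operation set buys you, here is a plain-Python sketch of three of the transformations listed above (map, filter, reduceByKey) applied to a word-count. The function names mirror the RDD API, but this is ordinary Python, not Spark itself:

```python
from itertools import groupby

# Plain-Python sketch of Spark-style transformations; not real Spark.
data = ["spark", "storm", "hadoop", "spark", "storm", "spark"]

# map: pair each word with a count of 1
pairs = [(w, 1) for w in data]

# filter: keep only words longer than 4 characters
pairs = [(w, n) for (w, n) in pairs if len(w) > 4]

# reduceByKey: sum the counts per word
pairs.sort(key=lambda kv: kv[0])
counts = {k: sum(n for _, n in group)
          for k, group in groupby(pairs, key=lambda kv: kv[0])}

print(counts)  # {'hadoop': 1, 'spark': 3, 'storm': 2}
```

In real Spark the same chain would read `rdd.map(...).filter(...).reduceByKey(...)`, and the engine would distribute each stage across the cluster while keeping intermediate datasets in memory.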
  2. Multi-language support
    Besides implementing spouts and bolts in Java, you can use any programming language you are familiar with, thanks to Storm's multi-language protocol. The multi-language protocol is a special protocol within Storm that lets a spout or bolt exchange messages over standard input and standard output; each message is a single line of text or multiple lines of JSON.
    Storm supports multi-language programming mainly through ShellBolt, ShellSpout, and ShellProcess. These classes implement the IBolt and ISpout interfaces and the protocol for executing shell scripts or programs via Java's ProcessBuilder class.
    Note that with this approach every tuple must be JSON-encoded and decoded during processing, which has a noticeable impact on throughput.
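The shape of those JSON messages can be sketched in a few lines. This example encodes and decodes messages in memory rather than over real pipes, and the exact field set is a simplified assumption, not the full protocol; it mainly illustrates the per-tuple encode/decode cost mentioned above:

```python
import json

# Sketch of JSON-line framing in the style of Storm's multi-lang
# protocol (messages terminated by an "end" line). Simplified; a real
# shell bolt reads these from stdin and writes replies to stdout.

def encode(msg):
    return json.dumps(msg) + "\nend\n"

def decode(wire):
    body, _, _ = wire.partition("\nend\n")
    return json.loads(body)

tuple_in = {"id": "42", "comp": "words", "stream": "default",
            "tuple": ["hello storm"]}

wire = encode(tuple_in)   # what crosses the pipe
received = decode(wire)

# bolt logic: split the sentence and emit one tuple per word, then ack
replies = [encode({"command": "emit", "tuple": [w]})
           for w in received["tuple"][0].split()]
replies.append(encode({"command": "ack", "id": received["id"]}))

print(len(replies))  # 3 -> two emits plus one ack
```

Every tuple passes through `json.dumps`/`json.loads` on both sides of the pipe, which is exactly the overhead that limits shell-component throughput.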
  3. Horizontal scalability
    Three kinds of entities actually run a topology in a Storm cluster: worker processes, threads, and tasks. Each machine in a Storm cluster can run multiple worker processes, each worker process can create multiple threads, and each thread can execute multiple tasks. Tasks are the entities that actually process data; the spouts and bolts we develop are executed as one or more tasks.
    Computation is therefore performed in parallel across threads, processes, and servers, which supports flexible horizontal scaling.
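The worker/thread/task hierarchy can be modeled in miniature with a thread pool. This is a toy analogy, not Storm code: the script stands in for one worker process, the pool threads for executors, and the function calls for tasks:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of Storm's execution hierarchy: this process ~ one worker,
# pool threads ~ executors, function invocations ~ tasks. Illustrative
# names only; not the Storm API.

def task(batch):
    """One task: the unit that actually processes data."""
    return sum(batch)

batches = [[1, 2], [3, 4], [5, 6], [7, 8]]  # tuples routed to 4 tasks

with ThreadPoolExecutor(max_workers=2) as pool:  # 2 "executor" threads
    results = list(pool.map(task, batches))

print(results)  # [3, 7, 11, 15]
```

Scaling out in Storm amounts to raising these parallelism numbers (more workers, more executors, more tasks) and letting the cluster spread them over machines.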
  4. Strong fault tolerance
    If something goes wrong during message processing, Storm reschedules the failed processing unit. Storm guarantees that a processing unit runs forever (unless you explicitly kill it).
  5. Reliable message guarantees
    Storm can ensure that every message emitted by a spout is "fully processed", which directly distinguishes it from other real-time systems such as S4.
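The spirit of "fully processed" is that the spout tracks each emitted tuple by id and replays it until downstream processing acknowledges it. The sketch below mimics that ack/replay loop; it is a simplified assumption for illustration, not Storm's real XOR-based acker algorithm:

```python
import random

# Sketch of spout-side completeness tracking: tuples are replayed
# until acked. Mimics Storm's ack/fail idea in spirit only.

random.seed(7)
pending = {1: "msg-a", 2: "msg-b", 3: "msg-c"}  # tuple id -> message
processed = []

def bolt(msg):
    """Fails randomly to simulate a flaky processing unit."""
    if random.random() < 0.3:
        return False          # fail -> spout will replay the tuple
    processed.append(msg)
    return True               # ack

while pending:
    for mid, msg in list(pending.items()):
        if bolt(msg):
            del pending[mid]  # acked: the spout can forget this tuple

print(sorted(processed))  # every message eventually gets processed
```

Note the semantics this gives: nothing is lost, because a tuple stays pending until it is acked. In real Storm a replayed tuple may be processed more than once (at-least-once delivery) unless you layer exactly-once logic such as Trident on top.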
  6. Fast message processing
    Storm uses ZeroMQ as its underlying message transport, ensuring that messages are processed quickly.
  7. Local mode for rapid development and testing
    Storm has a "local mode" that simulates all the functionality of a Storm cluster inside a single process. Running a topology in local mode is similar to running it on a cluster, which is very useful for development and testing.


Origin blog.csdn.net/mnbvxiaoxin/article/details/105081704