1. Storm Basics

Apache Storm is an open-source, distributed, real-time, scalable, fault-tolerant computing system. Storm can reliably process unbounded streams of data, much as Hadoop processes big data in batches. Storm is also very fast: each node can process more than one million tuples per second.
      Typical Apache Storm application scenarios include real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and so on.

Think of water flowing continuously from a tap: it can flow into different pools and finally merge into one pool. This is a vivid picture of how Storm processes an unbounded data flow: at each node the data can be filtered, computed, and so on, and then sent to the next node for further processing.

2. Storm Architecture                                         
      
       An Apache Storm distributed cluster is mainly composed of a control node (the Nimbus node) and worker nodes (Supervisor nodes). A cluster has one control node and multiple worker nodes, while ZooKeeper is responsible for coordinating the communication between the Nimbus node and the Supervisor nodes.

    The Nimbus node is responsible for resource allocation and task assignment, and monitors Supervisor status through ZooKeeper. Supervisor nodes periodically receive the tasks assigned by the Nimbus node, download code from Nimbus, start the corresponding worker processes, and monitor worker status. Each worker is an independent JVM process; the Spout/Bolt Task threads run inside the worker. Storm uses ZooKeeper to coordinate the whole cluster, and the state information (tasks assigned by Nimbus, Supervisor and worker heartbeats, etc.) is stored in ZooKeeper.

3. Distributed real-time computing application structure                                 
  
   Apache Storm is an open-source, distributed, real-time computing system. A real-time computing application is composed of Topologies, Streams, Spouts, Bolts, Stream groupings, and other elements.

1. Topologies (Topology)
       A Storm Topology is a distributed real-time computing application. It connects Spouts and Bolts through Stream groupings to form a stream-processing structure. A topology keeps running in the cluster until it is killed (storm kill topology-name [-w wait-time-secs]); the kill command waits the given number of seconds before shutting the topology down.
     To run a topology, you only need to package the code into a jar and then execute the following command in the Storm cluster environment:
           storm jar topology-jar-path class ...
     A topology can run in two modes: local mode and distributed mode.
       Local mode:
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("word-count", conf, builder.createTopology());

Distributed mode:
conf.setNumWorkers(2);
StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
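For illustration, here is a minimal sketch of how the two modes are commonly combined in a topology's main method. The class names, component ids, and the "word-count" topology name are hypothetical, and the packages assume Storm 1.x (org.apache.storm.*; older releases use backtype.storm.*):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // hypothetical spout and bolt; sketches of both appear later in this article
        builder.setSpout("sentence-spout", new RandomSentenceSpout());
        builder.setBolt("split-bolt", new SplitSentenceBolt())
               .shuffleGrouping("sentence-spout");

        Config conf = new Config();
        if (args != null && args.length > 0) {
            // distributed mode: submit to the cluster under the name given on the command line
            conf.setNumWorkers(2);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // local mode: run the whole topology inside one JVM for development and testing
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
        }
    }
}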


2. Streams (data stream)
      Streams are the core abstraction of Storm. A Stream is an unbounded sequence of tuples. Streams are composed of Tuples; the types supported inside a Tuple are Integer, Long, Short, Byte, String, Double, Float, Boolean, and byte arrays. Tuples can also carry custom serializable objects.

A Tuple is the basic processing unit in a data stream. It contains multiple fields and their corresponding values and can be understood as a key-value map. Because a Bolt declares in advance, via declareOutputFields, the field names it will pass downstream, constructing a Tuple only requires passing in the corresponding values (a value list).
    1) declareOutputFields declares the field names to pass downstream in advance:
@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    // Declare the fields passed to the next bolt
    declarer.declare(new Fields("field"));
}

  
    2) Then call emit to fill in the values:
collector.emit(new Values(value));


Notes:
          1) A data stream has a default id (streamId = "default"), as the following excerpt from the Storm source shows:
      
public List<Integer> emit(List<Object> tuple) {
    return emit(Utils.DEFAULT_STREAM_ID, tuple);
}
                                                 --- source code
        2) We can also define our own data stream id. When declareOutputFields declares the field names in advance, we can declare the corresponding stream id as well: declarer.declareStream("streamId", new Fields("field"))
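As a minimal sketch of named streams (the stream id "update-stream", the field name "word", and the class name are all hypothetical; packages assume Storm 1.x), a bolt can declare both the default stream and a named stream and emit to either of them:

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class MultiStreamBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String word = input.getString(0);
        collector.emit(new Values(word));                   // emit on the default stream
        collector.emit("update-stream", new Values(word));  // emit on the named stream
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));                        // default stream
        declarer.declareStream("update-stream", new Fields("word")); // named stream
    }
}

A downstream bolt can then subscribe to the named stream, for example with builder.setBolt(...).shuffleGrouping("multi-stream-bolt", "update-stream").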

3. Spouts (data source)

        A Spout is the source of a topology's data stream. The Spout continuously reads data from outside (external sources such as databases or Kafka) and sends it into the topology for real-time processing.
       A Spout works in active mode: it extends the BaseRichSpout class (or implements IRichSpout), Storm calls its nextTuple() method continuously, and the spout sends the data stream out through emit (a minimal spout sketch follows the parameter notes below).
   Parameter description:
      Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: the maximum time to wait for a tuple to be fully processed; the default is 30 seconds.
      Config.TOPOLOGY_MAX_SPOUT_PENDING: the maximum number of pending tuples allowed for a single spout task at any one time (pending means the tuple has not yet been acked or failed); setting this value can prevent queue overflow.
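Here is a minimal spout sketch under those assumptions (the RandomSentenceSpout name matches the hypothetical spout used in the submission sketch above; the sample sentences and the "sentence" field are made up; packages assume Storm 1.x):

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class RandomSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random random;
    private final String[] sentences = {"the cow jumped over the moon", "an apple a day"};

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.random = new Random();
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; emit one tuple per call
        collector.emit(new Values(sentences[random.nextInt(sentences.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}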

  4. Bolts (data stream processing component)
       A Bolt receives the Tuples (data stream) sent by a Spout or by an upstream Bolt and then processes the data stream (filtering, statistics, and so on).
     A Bolt works in passive mode: it extends the BaseBasicBolt class or implements the IRichBolt interface. When the Bolt receives a Tuple sent by a Spout or an upstream Bolt, its execute method is called; the tuple is processed and the results are sent on through the OutputCollector. The execute method is responsible for both receiving and sending data streams in a Bolt.
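A minimal bolt sketch under the same assumptions (it consumes the hypothetical "sentence" field from the spout sketch above and splits it into words; packages assume Storm 1.x):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class SplitSentenceBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // execute is called once for every incoming tuple
        String sentence = input.getStringByField("sentence");
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}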

5. Stream groupings (data stream grouping)
      Storm uses Stream groupings to connect Spouts and Bolts into a stream-processing structure (a wiring sketch follows this list).
      Storm defines 8 stream grouping methods:
       1) Shuffle grouping (random grouping): Tuples are distributed randomly across the Tasks of the next bolt; each task has the same chance of receiving a tuple, which balances the load across the cluster.
       2) Fields grouping (field grouping): the data stream is grouped by the specified field. Tuples with the same value for that field are handed to the same Task of the bolt, while tuples with different values are distributed to different tasks.
       3) Partial Key grouping (partial field grouping): grouping by field, very similar to Fields grouping, but this method takes the load balance of the downstream Bolt into account and gives better resource utilization; see the official documentation for details.
       4) All grouping (complete grouping): each Tuple is sent to all Tasks of the next bolt at the same time.
       5) Global grouping (global grouping): all Tuples are sent to the task of the bolt with the smallest id.
       6) None grouping: using this method indicates that you do not care how the data stream is grouped. At present it is completely equivalent to Shuffle grouping, but in the future non-grouping may be used to let a Bolt execute in the same thread as the Spout or Bolt it subscribes to.
       7) Direct grouping: the producer specifies, through the OutputCollector's emitDirect method, exactly which Task of the next bolt will process the tuple.
      8) Local or shuffle grouping (local or random grouping): if the target Bolt has one or more tasks in the same worker process, Tuples are sent to those tasks; otherwise Tuples are distributed randomly (the same as Shuffle grouping).
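For illustration, the grouping methods map onto methods of the declarer returned by setBolt. The sketch below continues the hypothetical builder from the submission example above; the "count-bolt" id and the WordCountBolt class are also made-up placeholders:

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("sentence-spout", new RandomSentenceSpout());

// shuffle grouping: tuples are distributed randomly across the split bolt's tasks
builder.setBolt("split-bolt", new SplitSentenceBolt(), 2)
       .shuffleGrouping("sentence-spout");

// fields grouping: tuples with the same "word" value always reach the same task
builder.setBolt("count-bolt", new WordCountBolt(), 2)
       .fieldsGrouping("split-bolt", new Fields("word"));

// the other groupings follow the same pattern, for example:
//   .allGrouping("split-bolt")            // every task receives every tuple
//   .globalGrouping("split-bolt")         // all tuples go to the lowest-id task
//   .localOrShuffleGrouping("split-bolt") // prefer tasks in the same worker process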

6. Reliability
     In order to guarantee that every tuple emitted by a spout is processed successfully, Storm tracks each tuple through a tuple tree. Storm configures a maximum time for a tuple to be fully processed; if this time is exceeded, the tuple is considered failed by default and is re-emitted. Tuple reliability is guaranteed through this mechanism; see the official documentation on guaranteeing message processing for details. A small sketch of anchoring and acking follows the parameter below.
     Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS: the maximum time to wait for a tuple to be fully processed; the default is 30 seconds.
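A minimal sketch of how a bolt participates in this mechanism, assuming BaseRichBolt with an explicit OutputCollector (the class and field names are hypothetical; packages assume Storm 1.x): emitted tuples are anchored to the input so they join the tuple tree, and the input is acked or failed explicitly.

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ReliableSplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        try {
            for (String word : input.getString(0).split(" ")) {
                // anchoring: the new tuple is attached to the input tuple's tree
                collector.emit(input, new Values(word));
            }
            collector.ack(input);   // mark the input tuple as successfully processed
        } catch (Exception e) {
            collector.fail(input);  // the spout will be asked to replay the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}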

7. Tasks (task)
      A Task corresponds to one spout or bolt instance. In a Storm cluster, each spout or bolt can be executed as multiple tasks. If a spout or bolt is given a parallelism hint (setSpout/setBolt), there are correspondingly many tasks, and the spout's nextTuple() / the bolt's execute() is run in each of them.
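A short sketch of configuring parallelism, continuing the hypothetical builder from above: the parallelism hint sets the number of executors, and setNumTasks sets the number of task instances those executors run.

// 2 executors (threads) and 4 tasks for the split bolt: each executor runs 2 task instances
builder.setBolt("split-bolt", new SplitSentenceBolt(), 2)
       .setNumTasks(4)
       .shuffleGrouping("sentence-spout");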

8. Workers (worker process)
      A topology runs on one or more workers. Each worker is an independent JVM process; the process contains one or more executors (threads), and each executor runs one or more Tasks.
   Config.TOPOLOGY_WORKERS sets the number of workers.
  
   
