Some remarks about Storm

1. Installation

Stand-alone installation

Cluster installation

2. Introduction

  1.     Storm is a distributed, real-time computing system.
  2.     It is initialized once and then computes continuously, using ZeroMQ (Netty in later versions) as the underlying message queue.

3. Architecture

  1.    Storm adopts a master-slave structure consisting of Nimbus and Supervisor. The Nimbus process runs on the cluster's master node and is responsible for task assignment and distribution; Supervisor processes run on the slave nodes and execute their assigned parts of the task.

  • Nimbus: responsible for resource allocation and task scheduling.
  • Supervisor: accepts tasks assigned by Nimbus, and starts and stops the worker processes it manages.
  • Worker: a process that runs the concrete processing logic of components. A topology runs in one worker by default, and a node can run multiple workers.
  • Task: each spout/bolt instance running inside a worker is called a task. Tasks of the same spout/bolt may share one physical thread, called an executor; raising the spout/bolt parallelism in the code increases the number of executors.
  • The Spout/Bolt programming model is how the Storm architecture streams messages. A message stream is Storm's basic data abstraction: the encapsulation of a piece of input data, with the continuous stream of such messages processed in a distributed manner. The Spout component is the message producer, i.e. the data source of a Storm topology; it can read from a variety of heterogeneous data sources and emit message streams. The Bolt component receives the streams emitted by Spout components and carries out the concrete processing logic; in complex business logic, multiple Bolt components can be chained, each implementing a different function, to realize the overall processing.
  • ZooKeeper's role: Nimbus distributes tasks by writing state information to ZooKeeper; in plain terms, it records which supervisors should execute which tasks. Supervisors receive tasks by reading this state from ZooKeeper. Supervisors and tasks also send heartbeats to ZooKeeper, so Nimbus can monitor the state of the whole cluster and restart tasks whose execution has failed.

 

4. Storm and Hadoop comparison

Structure             Hadoop        Storm
Master node           JobTracker    Nimbus
Slave node            TaskTracker   Supervisor
Application           Job           Topology
Worker process name   Child         Worker
Computational model   Map/Reduce    Spout/Bolt

In the Hadoop architecture, a job represents a task with determined input that completes within finite time: when the job finishes, its life cycle ends and it outputs a definite result. In the Storm architecture, a Topology does not represent a definite job but a continuous computation: under a fixed business-logic framework, input data keeps entering the system and, after stream processing, output is produced with low latency. Unless you actively kill the Topology or shut down the Storm cluster, the data processing continues indefinitely.

How the Storm architecture addresses the bottlenecks of the Hadoop architecture:

  • Storm's Topology is initialized only once. When a topology is submitted to a Storm cluster, the cluster initializes it once; after that, for as long as the topology runs, no time-consuming framework initialization is performed for incoming data, which avoids that recurring initialization cost.
  • Storm uses Netty as the underlying message queue to deliver messages, ensuring they are processed quickly. At the same time, Storm adopts an in-memory computing model: intermediate results are transmitted directly over the network without file storage, avoiding the time lost moving large amounts of data between components.

 

To sum up, Storm has the following advantages:

  • Simple programming. For big-data batch processing, Hadoop's Google-inspired Map/Reduce primitives make parallel batch programs simple and elegant. Similarly, Storm provides simple and elegant primitives for real-time computation on big data, greatly reducing the complexity of developing parallel real-time processing tasks and helping you build applications quickly and efficiently.
  • Multi-language support. Besides implementing spouts and bolts in Java, you can use any language you are familiar with, thanks to what Storm calls the multi-lang protocol: a special internal protocol that lets a spout or bolt exchange messages over standard input and standard output, each message being a single line of text or JSON-encoded multiple lines. (A sketch of wrapping a non-Java bolt follows this list.)
  • Horizontal scaling. Three kinds of entities actually run a topology in a Storm cluster: worker processes, threads (executors), and tasks. Each machine in the cluster can run multiple worker processes, each worker can create multiple threads, and each thread can execute multiple tasks; tasks are the entities that actually process data, and a spout or bolt is executed as one or more tasks. Computation therefore runs in parallel across threads, processes, and servers, supporting flexible horizontal scaling.
  • Strong fault tolerance. If an exception occurs during message processing, Storm reschedules the processing unit in question. Storm guarantees that a processing unit runs forever unless you explicitly kill it.
  • Reliable message guarantees. Storm can guarantee that every message emitted by a spout is "fully processed".
  • Fast message processing. Using Netty as the underlying message queue ensures that messages are handled quickly.
  • Local mode, which supports quick programming and testing.
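
As a minimal sketch of the multi-lang protocol from the Java side: the class below wraps an external script (splitsentence.py is a hypothetical script that would itself have to speak the multi-lang protocol over stdin/stdout). This mirrors the pattern used in storm-starter, not this article's own code.

import java.util.Map;

import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Java-side wrapper: the actual splitting logic lives in the external script.
public class SplitSentencePy extends ShellBolt implements IRichBolt {
    public SplitSentencePy() {
        super("python", "splitsentence.py"); // hypothetical script implementing the protocol
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word")); // output fields are still declared on the Java side
    }

    public Map<String, Object> getComponentConfiguration() {
        return null; // no per-component configuration
    }
}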

5. Concepts

1 Topology: a real-time computing task is called a Topology; it consists of Spouts and Bolts.

2 Tuple: the data model representing the basic processing unit. A tuple can contain multiple fields, like a K/V map.

3 Worker: a topology may execute in one or more workers (worker processes), each of which is a physical JVM executing part of the topology. For example, for a topology with a total parallelism of 300 executed by 50 worker processes, each worker handles 6 of the tasks. Storm tries to distribute the work evenly across all workers; the last parameter of setBolt is the parallelism you want for that bolt.
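
A minimal sketch of that arithmetic in code (MySpout and MyBolt are placeholders; assume definitions like those in the full example in section 6):

Config conf = new Config();
conf.setNumWorkers(50); // 50 worker JVMs across the cluster

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout(), 100);  // 100 executors
builder.setBolt("bolt", new MyBolt(), 200)      // 200 executors, 300 in total
       .shuffleGrouping("spout");               // each worker therefore runs 300 / 50 = 6 executors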

4 Spouts

A message source (spout) is the message producer in a Storm topology. Typically a spout reads data from an external source and emits messages into the topology as tuples. Spouts can be reliable or unreliable: a reliable spout can re-emit a tuple if Storm fails to process it, while an unreliable spout cannot re-send a tuple once it has been emitted.

  • A message source can emit multiple message streams. Use OutputFieldsDeclarer.declareStream to define each stream, then use SpoutOutputCollector to emit to the specified stream; a single-stream emit looks like collector.emit(new Values(str)); (see the sketch after this list for the multi-stream form).

  • The most important method of a Spout is nextTuple: it either emits a new tuple into the topology or simply returns if there is nothing to emit. Note that nextTuple must not block, because Storm calls all of a spout's methods on the same thread. The other two important spout methods are ack and fail: Storm calls ack when it detects that a tuple has been fully processed by the entire topology, and calls fail otherwise. ack and fail are only called for reliable spouts.
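
A sketch of a spout that declares and emits named streams (the stream names "words" and "signals", and the variables str and msgId, are placeholders; imports as in the full example in section 6):

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream("words", new Fields("word"));
    declarer.declareStream("signals", new Fields("signal"));
}

public void nextTuple() {
    // emit(streamId, tuple, messageId): the message id is what makes ack/fail reporting possible
    _collector.emit("words", new Values(str), msgId);
}

A downstream bolt then subscribes to one stream explicitly, e.g. builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout", "words").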

5 Bolts

  • All message processing logic is encapsulated in bolts. Bolts can do many things: filter, aggregate, query databases, and more.

  • Bolts can do simple stream transformations (take a tuple, call execute once). Complex stream processing often requires many steps and therefore many bolts. For example, computing the most-forwarded images in a set of images takes at least two steps: first count the forwards of each image, then find the top 10. (Making this process more scalable may require more steps.)

  • Bolts can emit multiple message streams. Use OutputFieldsDeclarer.declareStream to define the stream and OutputCollector.emit to select the stream to emit.

  • The main method of a bolt is execute, which takes a tuple as input. Bolts use OutputCollector to emit tuples (whereas spouts use SpoutOutputCollector to emit to a specified stream). A bolt must call OutputCollector.ack for every tuple it processes, so Storm knows the tuple has been handled and can eventually notify the emitting spout. The general flow: a bolt processes an input tuple, emits zero or more tuples, then calls ack to tell Storm it has processed the input. Storm provides IBasicBolt, which calls ack automatically (see the sketch after this list).
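
A minimal sketch of the manual version with BaseRichBolt (class and field names are illustrative; BaseBasicBolt, used in the full example below, does the anchoring and acking for you):

import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class PassThroughBolt extends BaseRichBolt {
    private OutputCollector collector;

    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    public void execute(Tuple input) {
        // anchoring the new tuple to the input attaches it to the input's tuple tree
        collector.emit(input, new Values(input.getString(0)));
        collector.ack(input); // tell Storm this input tuple has been handled here
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}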

6 Reliability

  • Storm guarantees that each spout tuple will be fully processed by the topology. It tracks the tuple tree produced by each spout tuple (a bolt processing a tuple may emit further tuples, forming a tree) and detects when the tree has been completely processed. Every topology has a message timeout: if Storm cannot determine within the timeout that a tuple tree has completed, it marks the tuple as failed and calls the spout's fail method so the tuple can be re-emitted. (The timeout is configurable since Storm 0.9.0.1; the default is 30 s.)
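
A sketch of adjusting that timeout, raising it from the default 30 s to an arbitrary 60 s:

Config conf = new Config();
conf.setMessageTimeoutSecs(60); // tuple trees not fully acked within 60 s are failed, triggering the spout's fail method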

7 Data flow model

  • Each computing component (Spout or Bolt) in a topology has a degree of parallelism, which can be specified when creating the topology; Storm allocates that many threads in the cluster to execute the component concurrently. This raises a question: since each Spout or Bolt runs as multiple task threads, how are tuples routed between two components? Storm answers this with stream grouping strategies. When defining a Topology, you specify for each Bolt which Stream it receives as its input (note: a Spout does not receive Streams, it only emits them). Storm currently provides the following seven stream grouping strategies: Shuffle Grouping, Fields Grouping, All Grouping, Global Grouping, Non Grouping, Direct Grouping, and Local or Shuffle Grouping.

8 The seven stream grouping strategies in Storm

  • Shuffle Grouping: random grouping; tuples in the stream are distributed randomly so that each bolt task receives roughly the same number of tuples.
  • Fields Grouping: grouping by field. For example, when grouping by userid, tuples with the same userid always go to the same task of the bolt, while different userids may go to different tasks.
  • All Grouping: broadcast; every task of the bolt receives each tuple.
  • Global Grouping: the entire stream goes to a single task of the bolt, specifically the task with the lowest id.
  • Non Grouping: no grouping; the stream does not care who receives its tuples. Currently this has the same effect as Shuffle Grouping, with one difference: Storm may run the subscribing bolt in the same thread as the component it subscribes to.
  • Direct Grouping: a special grouping in which the sender of a message decides which task of the receiving component processes it. Only streams declared as direct streams can use this grouping, and such tuples must be emitted with the emitDirect method. The receiver can obtain the ids of the tasks processing its messages through the TopologyContext (the OutputCollector.emit method also returns the ids of the tasks the tuple was sent to).
  • Local or Shuffle Grouping: if the target bolt has one or more tasks in the same worker process, tuples are shuffled among those in-process tasks; otherwise it behaves like a normal Shuffle Grouping. (A wiring sketch for these groupings follows.)
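
A sketch of wiring several of these groupings with TopologyBuilder (MySpout and BoltA through BoltE are placeholder classes, and the spout is assumed to declare a userid field):

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spout", new MySpout(), 2);

builder.setBolt("shuffled", new BoltA(), 4).shuffleGrouping("spout");                    // random
builder.setBolt("byUser", new BoltB(), 4).fieldsGrouping("spout", new Fields("userid")); // by field
builder.setBolt("all", new BoltC(), 4).allGrouping("spout");                             // broadcast
builder.setBolt("global", new BoltD(), 1).globalGrouping("spout");                       // lowest-id task
builder.setBolt("local", new BoltE(), 4).localOrShuffleGrouping("spout");                // prefer same worker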

9 Record-level fault tolerance

  • Compared with other real-time computing systems such as S4 and Puma, Storm's biggest highlights are its record-level fault tolerance and its transaction mechanism, which together ensure accurate message processing. Let's look at how these two are implemented.

  • The basic principle of Storm's record-level fault tolerance. First, what is record-level fault tolerance? Storm allows the user to attach a message id to each new source tuple emitted by a spout. The message id can be any object, and multiple source tuples may share one message id, indicating that they form a single message unit from the user's point of view. Record-level fault tolerance means that Storm informs the user, within a specified time, whether each message unit has been fully processed. "Fully processed" means that the source tuples bound to the message id, and all tuples subsequently derived from them, have been processed by every bolt they should reach in the topology.
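
A minimal sketch of the spout side of this contract (nextRecord() is a hypothetical helper; the full example in section 6 does the same thing with a counter):

public void nextTuple() {
    Object msgId = counter.getAndIncrement();         // any object can serve as the message id
    _collector.emit(new Values(nextRecord()), msgId); // bind the id to the source tuple
}

public void ack(Object msgId)  { /* this message unit was fully processed */ }
public void fail(Object msgId) { /* it timed out or failed; re-emit here if needed */ }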

 10 Storm's transactional topology

     Used for operations with strict processing requirements, e.g., each message must be counted exactly once.

6. An example

package com.ding.storm;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.StringTokenizer;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Created by Ding on 2018/1/22.
 */
public class WordCountTopolopgyAllInJava {
    // Define a spout that produces data; the class extends BaseRichSpout
    public static class RandomSentenceSpout extends BaseRichSpout {
        SpoutOutputCollector _collector;
        Random _rand;
        private AtomicInteger counter;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector){
            _collector = collector;
            _rand = new Random();
            counter = new AtomicInteger();
        }

        public void nextTuple(){

            // Sleep for a while before producing the next tuple
            Utils.sleep(100);

            // Array of candidate sentences
            String[] sentences = new String[]{ "the cow jumped over the moon", "an apple a day keeps the doctor away",
                    "four score and seven years ago", "snow white and the seven dwarfs", "i am at two with nature" };

            // Pick a sentence at random
            String sentence = sentences[_rand.nextInt(sentences.length)];

            // Emit the sentence to the bolts; emitting with a message id is what enables ack feedback
            _collector.emit(new Values(sentence), this.counter.getAndIncrement());
        }

        // Called when a tuple is acknowledged as fully processed
        @Override
        public void ack(Object id){
            System.out.println("success! id: "+id.toString());
        }

        // Called when processing of a tuple fails
        @Override
        public void fail(Object id){
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer){
            // Declare a single output field named word
            declarer.declare(new Fields("word"));
        }
    }

    // Define a bolt that splits sentences into words
    public static class SplitSentence extends BaseBasicBolt {

        public void execute(Tuple tuple, BasicOutputCollector collector){
            // Receive a sentence
            String sentence = tuple.getString(0);
            // Split the sentence into words
            StringTokenizer iter = new StringTokenizer(sentence);
            // Emit each word
            while(iter.hasMoreElements()){
                collector.emit(new Values(iter.nextToken()));
            }
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer){
            // Declare a single output field
            declarer.declare(new Fields("word"));
        }
    }

    // Define a bolt that counts words
    public static class WordCount extends BaseBasicBolt {
        Map<String, Integer> counts = new HashMap<String, Integer>();

        public void execute(Tuple tuple, BasicOutputCollector collector){
            // Receive a word
            String word = tuple.getString(0);
            // Look up the current count for the word
            Integer count = counts.get(word);
            if(count == null)
                count = 0;
            // Increment the count
            count++;
            // Store the updated count in the map
            counts.put(word, count);
            System.out.println(word + "  " + count);
            // Emit the word and its count (fields word and count)
            collector.emit(new Values(word, count));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer){
            // Declare two output fields: word and count
            declarer.declare(new Fields("word","count"));
        }
    }
    public static void main(String[] args) throws Exception
    {
        // Build the topology
        TopologyBuilder builder = new TopologyBuilder();
        // Add the spout with id "spout" and parallelism 5 (the number of executor threads)
        builder.setSpout("spout", new RandomSentenceSpout(), 5);
        // Add the "split" bolt with parallelism 8, consuming the spout's stream
        builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
        // Add the "count" bolt with parallelism 12, grouped by the word field of split's output
        builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));

        Config conf = new Config();
        conf.setDebug(false);
        conf.setMaxTaskParallelism(20); // maximum parallelism of the topology


        if (args != null && args.length > 0) {
            // Run on a cluster
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // Run in local mode
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
}
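
To run it on a cluster, package the class into a jar and submit it with the storm client, for example storm jar word-count.jar com.ding.storm.WordCountTopolopgyAllInJava word-count (the jar name here is just an assumption); run with no arguments, main falls back to local mode.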

7. Comparison with Spark Streaming

Spark Streaming sits inside the Spark ecosystem, so it integrates seamlessly with Spark Core and Spark SQL. This means that, within one program, the intermediate data produced by real-time processing can also be handled with delayed batch processing, interactive queries, and so on. This capability greatly enhances Spark Streaming's advantages.

Beyond that, Spark Streaming has no obvious advantage over Storm; which one to use depends on the specific business scenario.

 
