Demo tutorial for getting started with Storm

Introduction to Storm

Storm is an open-source distributed real-time big data processing framework that was originally open sourced by Twitter on GitHub. Since around version 0.9.1 it has belonged to the Apache community, and in the industry it is known as the real-time counterpart of Hadoop. As more and more scenarios cannot tolerate the high latency of Hadoop MapReduce, such as website statistics, recommendation systems, early-warning systems, and financial systems (high-frequency trading, stocks), real-time big data processing solutions (stream computing) are being applied more and more widely. Stream computing is currently one of the hottest areas in distributed technology, and Storm is a leader and a mainstream choice among stream computing technologies.

Storm's core components

  • Nimbus : The Master of Storm, responsible for resource allocation and task scheduling. A Storm cluster has only one Nimbus.
  • Supervisor : The Slave of Storm, responsible for receiving tasks assigned by Nimbus and managing all Workers. A Supervisor node contains multiple Worker processes.
  • Worker : A worker process; each Worker process runs multiple Tasks.
  • Task : A task. Each Spout and Bolt in a Storm cluster is executed by a number of tasks, and each task corresponds to one thread of execution.
  • Topology : The computing topology. A Storm topology encapsulates the logic of a real-time computing application. Its role is very similar to a MapReduce job; the difference is that a MapReduce job eventually finishes once it produces its result, whereas a topology keeps running in the cluster until you terminate it manually. A topology can also be understood as a structure composed of Spouts and Bolts connected to each other by data streams (via stream groupings).
  • Stream : Streams are the core abstraction in Storm. A data stream refers to an unbounded sequence of tuples created and processed in parallel in a distributed environment. A data stream can be defined by a schema that expresses the fields of tuples in the data stream.
  • Spout : A data source (Spout) is the source of the data streams in a topology. Typically a spout reads tuples from an external data source and emits them into the topology. Depending on the requirements, a Spout can be defined as either a reliable or an unreliable data source. A reliable spout can re-emit a tuple if it fails to be processed, ensuring that all tuples are processed correctly; an unreliable spout does no further processing of a tuple once it has been emitted. A spout can emit multiple data streams.
  • Bolt : All data processing in a topology is done by Bolts. Through filtering, functions, aggregations, joins, database interactions, and so on, a Bolt can fulfill almost any data processing requirement. A single bolt can implement a simple data stream transformation, while more complex transformations often require multiple bolts working in multiple steps.
  • Stream grouping : Determining the input data stream of each Bolt in the topology is an important part of defining a topology. A stream grouping defines how a data stream is partitioned among the different tasks of a Bolt. Storm has eight built-in stream groupings (the two most commonly used ones are sketched just after this list).
  • Reliability : Storm can guarantee, per topology, that every emitted tuple is fully processed. It does this by tracking the tuple tree formed by each tuple emitted from a Spout to determine whether that tuple has completed processing. Every topology has a "message timeout" parameter; if Storm does not detect within the timeout that a tuple has finished processing, it marks the tuple as failed and re-emits it later.
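To make the grouping idea concrete, here is a minimal sketch of the two groupings used later in this tutorial (the component and class names here are made up for illustration): shuffleGrouping spreads tuples randomly and evenly over a Bolt's tasks, while fieldsGrouping sends all tuples with the same value of the chosen field to the same task.

    // hypothetical wiring, only to illustrate the grouping calls
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentence-spout", new MySentenceSpout());
    // shuffleGrouping: tuples are spread randomly over the bolt's tasks
    builder.setBolt("split-bolt", new MySplitBolt(), 2).shuffleGrouping("sentence-spout");
    // fieldsGrouping: tuples with the same "word" value always reach the same task
    builder.setBolt("count-bolt", new MyCountBolt(), 2).fieldsGrouping("split-bolt", new Fields("word"));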

The figure below illustrates how a Storm program runs in a Storm cluster:
[figure: a Storm program running in a Storm cluster]

Topology

Why single out the Topology? Because the Topology is the main component we use when developing Storm programs.
A Topology is very similar to MapReduce.
In MapReduce, Map fetches the data and Reduce processes it.
In a Topology, Spouts fetch the data and Bolts do the computation.
In short, a Topology is made up of one or more Spouts and Bolts.

To understand how the data actually flows, take a look at the example diagram below.
Example diagram:
[figure: example topology data flows]

Note: image source: http://www.tianshouzhi.com/api/tutorials/storm/52

The figure shows three modes, explained as follows:
The first is relatively simple: one Spout fetches the data and hands it to one Bolt for processing.
The second is slightly more complex: one Spout fetches the data and hands it to a Bolt, which then hands its output to the next Bolt for further processing.
The third is the most complex: one Spout can send data to several Bolts at the same time, and one Bolt can accept input from several Spouts or several Bolts, eventually forming multiple data streams. However, such a data flow must be directed, with a starting point and an ending point; otherwise an infinite loop is created and the data is never fully processed. For example, if the Spout sends to Bolt1, Bolt1 sends to Bolt2, and Bolt2 sends back to Bolt1, a cycle is formed.

Storm cluster installation

Cluster installation was covered in an earlier post, so it is not repeated here.
Blog address: http://www.panchengming.com/2018/01/26/pancm70/

Storm Hello World

The Storm concepts above may still feel a bit abstract, so let's use a Hello World example to get a feel for how a Storm program runs.

Environment preparation

Before starting development, some preparation is needed.
The project is built with Maven and uses Storm version 1.1.1.
The relevant dependencies of Maven are as follows:

  <!-- Storm-related jar -->
  <dependency>
    <groupId>org.apache.storm</groupId>
    <artifactId>storm-core</artifactId>
    <version>1.1.1</version>
    <scope>provided</scope>
  </dependency>
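One thing to note: the storm-core dependency is declared with provided scope, so it is not bundled into the jar you later submit with storm jar (the cluster supplies Storm at runtime). When running the local-mode examples below from an IDE the dependency is normally still on the classpath, although depending on your IDE you may need to enable including provided-scope dependencies in the run configuration.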

Specific process

Before writing the code, let's first define what we want Storm to do.
This first program simply prints out a message.
The specific steps are as follows:
1. Build the topology and set up the Spout and the Bolt.
2. The Spout passes the data it obtains to the Bolt.
3. The Bolt receives the Spout's data and prints it.

Spout

So let's start by writing the Spout class. In general you either implement IRichSpout or extend the BaseRichSpout class and then override its methods.
Here we extend BaseRichSpout, whose main methods to implement are:

1. open

The open() method is defined in the ISpout interface and is called when the Spout component is initialized.
It has three parameters, whose roles are:
1. the Storm configuration Map;
2. information about the components in the topology;
3. the collector used to emit tuples.

Code example:

  @Override
    public void open(Map map, TopologyContext arg1, SpoutOutputCollector collector) {
        System.out.println("open:"+map.get("test"));
        this.collector = collector;
    }
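
The snippets in this section reference a few instance fields that are not shown explicitly in this first demo. Judging from the rest of the code and the output further down, the class presumably looks roughly like the sketch below; treat the field values as an assumption.

    public class TestSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;                    // saved in open()
        private int count = 1;                                     // number of nextTuple() rounds
        private final String field = "test";                       // assumed field name; the Bolt reads it via getStringByField("test")
        private final String message = "This is a test message!";  // assumed from the demo output
        // ... the methods shown in this section
    }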
2. nextTuple

The nextTuple() method is the core of a Spout implementation.
It is the main execution method, used to emit data through the collector.emit method.

Here the data is hard-coded, so we simply emit it directly.
It is set up to emit only twice.
Code sample:

    @Override
    public void nextTuple() {
        if(count<=2){
            System.out.println("Sending data, round "+count+"...");
            this.collector.emit(new Values(message));
        }
        count++;
    }
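
One practical note, not needed for this toy example: Storm calls nextTuple() in a tight loop and it must not block, so when a spout temporarily has nothing to emit it is common to back off briefly instead of spinning. A sketch of that variant:

    @Override
    public void nextTuple() {
        if (count <= 2) {
            System.out.println("Sending data, round "+count+"...");
            this.collector.emit(new Values(message));
            count++;
        } else {
            // nothing left to emit; sleep briefly to avoid busy-looping
            org.apache.storm.utils.Utils.sleep(100);
        }
    }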
3. declareOutputFields

declareOutputFields is defined in the IComponent interface and is used to declare the data format,
that is, how many fields an output Tuple contains.

Since we only emit one field here, we declare just one. If there are several, separate them with commas; a small sketch follows the snippet below.
Code example:

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        System.out.println("Declaring output fields...");
        declarer.declare(new Fields(field));
    }
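
For reference, if a spout emitted tuples with more than one value, the declaration and the corresponding emit simply list the fields and values in the same order (the field names here are made up for illustration):

    // declaring two fields (hypothetical names)
    declarer.declare(new Fields("word", "count"));
    // the matching emit must supply the values in the same order
    collector.emit(new Values("hello", 1));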
4. ack

ack is defined in the ISpout interface and is used to indicate that the Tuple is processed successfully.

Code example:

    @Override
    public void ack(Object obj) {
        System.out.println("ack:"+obj);
    }
5. fail

fail is defined in the ISpout interface to indicate that Tuple processing fails.

Code example:

    @Override
    public void fail(Object obj) {
        System.out.println("fail:"+obj);
    }
6. close

close is defined in the ISpout interface and is called when the Spout shuts down, for example when the Topology is stopped.

Code example:

    @Override
    public void close() {
        System.out.println("Closing...");
    }

As for the others, I will not list them one by one here.

Bolt

A Bolt is the component that processes data; its main logic lives in the execute method. Generally speaking, you either implement IRichBolt or extend the BaseRichBolt class and then implement its methods.
The method that needs to be implemented is as follows:

1. prepare

Executed before the Bolt starts; it is the entry point for setting up the Bolt's runtime environment.
The parameters are basically the same as the Spout's.
It is usually used to instantiate non-serializable objects.
Here we simply print a message:

    @Override
    public void prepare(Map map, TopologyContext arg1, OutputCollector collector) {
        System.out.println("prepare:"+map.get("test"));
        this.collector=collector;
    }

Note: If it is a serializable object, it is better to use the constructor.

2. execute

The execute() method is the core of a Bolt implementation.
It is called every time the Bolt receives a subscribed tuple from the stream.
To get a value out of a tuple you can use either tuple.getString() or tuple.getStringByField(). I personally recommend the second, since it lets you pick the value by its field name.
Note: if you implement IRichBolt directly you must ack tuples yourself (BaseBasicBolt is the variant that acks automatically). In this demo acking is not needed anyway, because the spout emits tuples without a message ID, so Storm does not track them (a sketch of a reliable variant follows the code below).
Code example:

    @Override
    public void execute(Tuple tuple) {
//      String msg=tuple.getString(0);
        String msg=tuple.getStringByField("test");
        // we do no real processing of the message here, just print it
        System.out.println("Message "+count+" received by Bolt: "+msg);
        count++;
        /**
         * Each call processes one input tuple. Every tuple must be answered
         * within a certain time, either with ack or with fail; otherwise the
         * spout will re-emit the tuple.
         */
//      collector.ack(tuple);
    }
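
For completeness: if the spout emitted tuples with message IDs and this bolt forwarded data downstream, a reliable version of execute() would anchor the outgoing tuple to the incoming one and then ack it. A minimal sketch, assuming the same "test" field and a declared output field:

    @Override
    public void execute(Tuple tuple) {
        String msg = tuple.getStringByField("test");
        // anchor the new tuple to the input so Storm can track the tuple tree
        collector.emit(tuple, new Values(msg));
        // acknowledge the input once it is handled; call collector.fail(tuple) on error
        collector.ack(tuple);
    }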
3. declareOutputFields

Same as in the Spout.
Since this Bolt emits nothing further, nothing is declared here.

    @Override
    public void declareOutputFields(OutputFieldsDeclarer arg0) {        
    }
4. cleanup

Cleanup is defined in the IBolt interface and is used to release the resources occupied by the bolt.
Storm calls this method before terminating a bolt.
Because there are no resources to release here, it is enough to simply print a sentence.

    @Override
    public void cleanup() {
        System.out.println("Releasing resources");
    }
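
One caveat worth knowing: cleanup() is only guaranteed to run when a topology is killed in local mode. On a real cluster Storm may kill the worker process outright, so do not rely on cleanup() for critical resource release in production.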

Topology

Here we use the main method to submit the topology.
Before submitting the topology, however, we need to do some configuration.
I won't go into detail here; the code comments are already quite detailed.
Code example:

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;

    /**
     * Title: App
     * Description: Storm test
     * Version: 1.0.0
     * @author pancm
     * @date 2018-03-06
     */
    public class App {

        private static final String str1="test1";
        private static final String str2="test2";

        public static void main(String[] args)  {
            // define a topology
            TopologyBuilder builder=new TopologyBuilder();
            // set one executor (thread); one by default
            builder.setSpout(str1, new TestSpout());
            // set one executor (thread) and one task
            builder.setBolt(str2, new TestBolt(),1).setNumTasks(1).shuffleGrouping(str1);
            Config conf = new Config();
            conf.put("test", "test");
            try{
                // run the topology
                if(args !=null&&args.length>0){ // with arguments: submit to the cluster, using the first argument as the topology name
                    System.out.println("Remote mode");
                    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
                } else{ // no arguments: submit locally
                    // start local mode
                    System.out.println("Local mode");
                    LocalCluster cluster = new LocalCluster();
                    cluster.submitTopology("111", conf, builder.createTopology());
                    Thread.sleep(10000);
                    // shut down the local cluster
                    cluster.shutdown();
                }
            }catch (Exception e){
                e.printStackTrace();
            }
        }
    }

Running this method, the output is as follows:

Local mode
Declaring output fields...
open:test
Sending data, round 1...
Sending data, round 2...
prepare:test
Message 1 received by Bolt: This is a test message!
Message 2 received by Bolt: This is a test message!
Releasing resources
Closing...

At this point, have you basically understood the operation of Storm?
This demo implements the first of the three modes described above: one Spout emits data and one Bolt processes it.

So how do we implement the second mode?
Suppose we want to count how often each word appears in some text; we only need a few steps:
1. Change the message in the Spout to an array and emit the messages to TestBolt one by one.
2. TestBolt splits the data it receives and sends the individual words to Test2Bolt.
3. Test2Bolt counts the words and prints the result when the program shuts down.
4. After the Topology is configured and started successfully, wait about 20 seconds, then stop the program and read the output.

The code example is as follows:

Spout
Used to send messages.

    import java.util.Map;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    /**
     * Title: TestSpout
     * Description: sends the messages
     * Version: 1.0.0
     * @author pancm
     * @date 2018-03-06
     */
    public class TestSpout extends BaseRichSpout{

        private static final long serialVersionUID = 225243592780939490L;

        private SpoutOutputCollector collector;
        private static final String field="word";
        private int count=1;
        private String[] message =  {
            "My nickname is xuwujing",
            "My blog address is http://www.panchengming.com/",
            "My interest is playing games"
        };

        /**
         * open() is defined in the ISpout interface and is called when the Spout component is initialized.
         * It has three parameters:
         * 1. the Storm configuration Map;
         * 2. information about the components in the topology;
         * 3. the collector used to emit tuples.
         */
        @Override
        public void open(Map map, TopologyContext arg1, SpoutOutputCollector collector) {
            System.out.println("open:"+map.get("test"));
            this.collector = collector;
        }

        /**
         * nextTuple() is the core of the Spout implementation.
         * It is the main execution method, used to emit data through the collector.emit method.
         */
        @Override
        public void nextTuple() {
            if(count<=message.length){
                System.out.println("Sending data, round "+count+"...");
                this.collector.emit(new Values(message[count-1]));
            }
            count++;
        }

        /**
         * declareOutputFields is defined in the IComponent interface and declares the data format,
         * i.e. how many fields an output Tuple contains.
         */
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            System.out.println("Declaring output fields...");
            declarer.declare(new Fields(field));
        }

        /**
         * Called when a Tuple has been processed successfully.
         */
        @Override
        public void ack(Object obj) {
            System.out.println("ack:"+obj);
        }

        /**
         * Called when the Topology is stopped.
         */
        @Override
        public void close() {
            System.out.println("Closing...");
        }

        /**
         * Called when a Tuple fails to be processed.
         */
        @Override
        public void fail(Object obj) {
            System.out.println("fail:"+obj);
        }

    }

TestBolt

Used to split words.

    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;


    /**
     * Title: TestBolt
     * Description: splits the sentences into words
     * Version: 1.0.0
     * @author pancm
     * @date 2018-03-16
     */
    public class TestBolt extends BaseRichBolt{

        private static final long serialVersionUID = 4743224635827696343L;

        private OutputCollector collector;

        /**
         * Executed before the Bolt starts; the entry point for setting up the Bolt's runtime environment.
         * Usually used to instantiate non-serializable objects.
         * Note: if an object is serializable, it is better to pass it in through the constructor.
         */
        @Override
        public void prepare(Map map, TopologyContext arg1, OutputCollector collector) {
            System.out.println("prepare:"+map.get("test"));
            this.collector=collector;
        }

        /**
         * execute() is the core of the Bolt implementation.
         * It is called every time the Bolt receives a subscribed tuple from the stream.
         */
        @Override
        public void execute(Tuple tuple) {
            String msg=tuple.getStringByField("word");
            System.out.println("Splitting words: "+msg);
            String[] words = msg.toLowerCase().split(" ");
            for (String word : words) {
                this.collector.emit(new Values(word)); // emit the word to the next bolt
            }
        }

        /**
         * Declares the data format.
         */
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("count"));
        }

        /**
         * cleanup is defined in the IBolt interface and releases the resources the bolt occupies.
         * Storm calls this method before it terminates a bolt.
         */
        @Override
        public void cleanup() {
            System.out.println("TestBolt releasing resources");
        }
    }

Test2Bolt
Used to count the number of times each word occurs.


    import java.util.HashMap;
    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    /**
     * Title: Test2Bolt
     * Description: counts how many times each word occurs
     * Version: 1.0.0
     * @author pancm
     * @date 2018-03-16
     */
    public class Test2Bolt extends BaseRichBolt{

        private static final long serialVersionUID = 4743224635827696343L;

        /**
         * Stores each word and its count
         */
        private HashMap<String, Integer> counts = null;

        private long count=1;

        /**
         * Executed before the Bolt starts; the entry point for setting up the Bolt's runtime environment.
         * Usually used to instantiate non-serializable objects.
         * Note: if an object is serializable, it is better to pass it in through the constructor.
         */
        @Override
        public void prepare(Map map, TopologyContext arg1, OutputCollector collector) {
            System.out.println("prepare:"+map.get("test"));
            this.counts=new HashMap<String, Integer>();
        }

        /**
         * execute() is the core of the Bolt implementation.
         * It is called every time the Bolt receives a subscribed tuple from the stream.
         */
        @Override
        public void execute(Tuple tuple) {
            String msg=tuple.getStringByField("count");
            System.out.println("Counting words, tuple "+count);
            /**
             * If the map does not contain the word yet, this is its first occurrence;
             * otherwise add 1 to its count.
             */
            if (!counts.containsKey(msg)) {
                counts.put(msg, 1);
            } else {
                counts.put(msg, counts.get(msg)+1);
            }
            count++;
        }

        /**
         * cleanup is defined in the IBolt interface and releases the resources the bolt occupies.
         * Storm calls this method before it terminates a bolt.
         */
        @Override
        public void cleanup() {
            System.out.println("=========== Word counts ============");
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                System.out.println(entry.getKey() + ": " + entry.getValue());
            }
            System.out.println("=========== End ============");
            System.out.println("Test2Bolt releasing resources");
        }

        /**
         * Declares the data format.
         */
        @Override
        public void declareOutputFields(OutputFieldsDeclarer arg0) {

        }
    }

Topology

Main program entry.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    /**
     * Title: App
     * Description: Storm test
     * Version: 1.0.0
     * @author pancm
     * @date 2018-03-06
     */
    public class App {

        private static final String test_spout="test_spout";
        private static final String test_bolt="test_bolt";
        private static final String test2_bolt="test2_bolt";

        public static void main(String[] args)  {
            // define a topology
            TopologyBuilder builder=new TopologyBuilder();
            // set one executor (thread)
            builder.setSpout(test_spout, new TestSpout(),1);
            // shuffleGrouping: random grouping
            // set one executor (thread) and one task
            builder.setBolt(test_bolt, new TestBolt(),1).setNumTasks(1).shuffleGrouping(test_spout);
            // fieldsGrouping: grouping by field
            // set one executor (thread) and one task
            builder.setBolt(test2_bolt, new Test2Bolt(),1).setNumTasks(1).fieldsGrouping(test_bolt, new Fields("count"));
            Config conf = new Config();
            conf.put("test", "test");
            try{
                // run the topology
                if(args !=null&&args.length>0){ // with arguments: submit to the cluster, using the first argument as the topology name
                    System.out.println("Running in remote mode");
                    StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
                } else{ // no arguments: submit locally
                    // start local mode
                    System.out.println("Running in local mode");
                    LocalCluster cluster = new LocalCluster();
                    cluster.submitTopology("Word-counts", conf, builder.createTopology());
                    Thread.sleep(20000);
                    // shut down the local cluster
                    cluster.shutdown();
                }
            }catch (Exception e){
                e.printStackTrace();
            }
        }
    }

Output result:

Running in local mode
Declaring output fields...
open:test
Sending data, round 1...
Sending data, round 2...
Sending data, round 3...
prepare:test
prepare:test
Splitting words: My nickname is xuwujing
Splitting words: My blog address is http://www.panchengming.com/
Splitting words: My interest is playing games
Counting words, tuple 1
Counting words, tuple 2
Counting words, tuple 3
Counting words, tuple 4
Counting words, tuple 5
Counting words, tuple 6
Counting words, tuple 7
Counting words, tuple 8
Counting words, tuple 9
Counting words, tuple 10
Counting words, tuple 11
Counting words, tuple 12
Counting words, tuple 13
Counting words, tuple 14
=========== Word counts ============
address: 1
interest: 1
nickname: 1
games: 1
is: 3
xuwujing: 1
playing: 1
my: 3
blog: 1
http://www.panchengming.com/: 1
=========== End ============
Test2Bolt releasing resources
TestBolt releasing resources
Closing...

The above runs in local mode. If you want to run it on a Storm cluster, you only need to package the program as a jar, upload it to the Storm cluster, and
run:

storm jar xxx.jar xxx xxx
Explanation: the first xxx is the name of the jar produced from the Storm program, the second xxx is the fully qualified main class to run, and the third xxx represents the arguments passed to the main method (optional).
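
For example, assuming the packaged jar is named storm-demo-1.0.jar (the actual name depends on your pom) and the main class is com.pancm.storm.App as configured in the plugin below, submitting the word-count topology under the name wordCount might look like this:

storm jar storm-demo-1.0.jar com.pancm.storm.App wordCount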

If you package with Maven, you need to add the following to pom.xml:

<plugin>
          <artifactId>maven-assembly-plugin</artifactId>
          <configuration>
            <descriptorRefs>
              <descriptorRef>jar-with-dependencies</descriptorRef>
            </descriptorRefs>
            <archive>
              <manifest>
                <mainClass>com.pancm.storm.App</mainClass>
              </manifest>
            </archive>
          </configuration>
      </plugin>
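
Note that this snippet does not bind the plugin to a lifecycle phase, so you would typically build the fat jar with something like mvn clean package assembly:single (or add an execution bound to the package phase); adjust this to your own build.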

After running the program successfully, you can view the status of the program on the UI interface of the Storm cluster.

This is the end of this article, thank you for reading!
