Big Data Cluster Setup and Usage, Part 7: Storm Configuration and Usage

This series of guides builds on a real cluster environment rather than a pseudo-distributed one, using three Tencent Cloud servers.

You can also read this series on my personal blog site (link).

Storm

Configuration

Edit /conf/storm.yaml (the settings below are enough for now). Storm depends on the ZooKeeper cluster. Pay special attention to whitespace when editing storm.yaml, since the file is YAML and indentation matters. Key names also differ across Storm versions: nimbus.seeds may be called nimbus.host in older releases, and package names use the backtype prefix in old versions but org.apache in new ones.
For more configuration options, see here.

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
 storm.zookeeper.servers:
     - "master"
     - "slave1"
     - "slave2"
 storm.local.dir: "/root/storm"
 nimbus.seeds: ["master"]
 supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
 java.library.path: "/opt/java/jdk1.8"
 storm.messaging.transport: "backtype.storm.messaging.netty.Context"
########### These MUST be filled in for a storm configuration
# storm.zookeeper.servers:
#     - "server1"
#     - "server2"
#
# nimbus.seeds: ["host1", "host2", "host3"]
#
#
# ##### These may optionally be filled in:
#
## List of custom serializations
# topology.kryo.register:
#     - org.mycompany.MyType
#     - org.mycompany.MyType2: org.mycompany.MyType2Serializer
#
## List of custom kryo decorators
# topology.kryo.decorators:
#     - org.mycompany.MyDecorator
#
## Locations of the drpc servers
# drpc.servers:
#     - "server1"
#     - "server2"

## Metrics Consumers
## max.retain.metric.tuples
## - task queue will be unbounded when max.retain.metric.tuples is equal or less than 0.
## whitelist / blacklist
## - when none of configuration for metric filter are specified, it'll be treated as 'pass all'.
## - you need to specify either whitelist or blacklist, or none of them. You can't specify both of them.
## - you can specify multiple whitelist / blacklist with regular expression
## expandMapType: expand metric with map type as value to multiple metrics
## - set to true when you would like to apply filter to expanded metrics
## - default value is false which is backward compatible value
## metricNameSeparator: separator between origin metric name and key of entry from map
## - only effective when expandMapType is set to true
# topology.metrics.consumer.register:
#   - class: "org.apache.storm.metric.LoggingMetricsConsumer"
#     max.retain.metric.tuples: 100
#     parallelism.hint: 1
#   - class: "org.mycompany.MyMetricsConsumer"
#     max.retain.metric.tuples: 100
#     whitelist:
#       - "execute.*"
#       - "^__complete-latency$"
#     parallelism.hint: 1
#     argument:
#       - endpoint: "metrics-collector.mycompany.org"
#     expandMapType: true
#     metricNameSeparator: "."

## Cluster Metrics Consumers
# storm.cluster.metrics.consumer.register:
#   - class: "org.apache.storm.metric.LoggingClusterMetricsConsumer"
#   - class: "org.mycompany.MyMetricsConsumer"
#     argument:
#       - endpoint: "metrics-collector.mycompany.org"
#
# storm.cluster.metrics.consumer.publish.interval.secs: 60
  • Remember to create the /root/storm directory.
  • storm.zookeeper.servers lists the addresses of the ZooKeeper service.
    Storm keeps its state in ZooKeeper, so it has to be told where the ZooKeeper service lives. If ZooKeeper runs on a single machine, specify just that one address!
  • storm.local.dir is the local storage directory.
    The Nimbus and Supervisor daemons need a directory on local disk to store a small amount of state (jars, confs, and so on). Create it on every machine and grant the appropriate permissions.
  • nimbus.seeds lists the candidate master hosts.
    Workers need to know which machines are candidates for master (election runs through the ZooKeeper cluster) so that they can download topology jars and confs.
  • supervisor.slots.ports lists the worker ports.
    For each supervisor machine, this setting configures how many workers may run on it. Every worker uses one dedicated port to receive messages, so the list also defines which ports are open for use. Defining five ports here means at most five workers can run on this supervisor node; defining three means at most three. By default (as set in defaults.yaml), four workers run on ports 6700, 6701, 6702, and 6703.
    The supervisor does not launch these workers immediately at startup; it starts them only when tasks are assigned, and how many it starts depends on how many workers our topology needs on that supervisor. If a topology is specified to run in a single worker, the supervisor starts just one worker, not all of them.
    Use scp to copy the configuration file to every node.

Fixing the problem of Storm processes exiting on their own

The official site's guidance is: "Launches the nimbus daemon. This command should be run under supervision with a tool like daemontools or monit. See Setting up a Storm cluster for more information." Following that advice, we can use daemontools to supervise the relevant services and restart them when they die.

  • Prerequisite: the GCC compiler; yum install gcc is all you need.
  • Go to /opt, wget http://cr.yp.to/daemontools/daemontools-0.76.tar.gz, unpack it, and enter ./admin/daemontools-0.76/src. Edit error.h with vim, replacing extern int errno; with #include <errno.h>, then run the package/install command to compile. If nothing goes wrong, you will see a log like this:
Copying commands into ./command...  
Creating symlink daemontools -> daemontools-0.76...  
Making command links in /command...  
Making compatibility links in /usr/local/bin...  
Creating /service...  
Adding svscanboot to inittab...  
init should start svscan now. 
  • Configure the run scripts
cd /service
mkdir nimbus supervisor ui
cd nimbus
vim run
    #!/bin/bash
    exec 2>&1
    exec /opt/storm/storm1.1/bin/storm nimbus
chmod 755 run
cd ../supervisor
vim run
    #!/bin/bash
    exec 2>&1
    exec /opt/storm/storm1.1/bin/storm supervisor
chmod 755 run
cd ../ui
vim run
    #!/bin/bash
    exec 2>&1
    exec /opt/storm/storm1.1/bin/storm ui
chmod 755 run
  • Start the processes
    Start them on each node according to your own cluster plan:
nohup supervise /service/nimbus &
nohup supervise /service/supervisor &
nohup supervise /service/ui &
  • How to kill a supervised process
ps -ef|grep supervise

Find the supervise processes and kill -9 them first; then run jps and kill -9 the actual Storm processes. (daemontools' svc tool is a cleaner alternative: svc -dx /service/nimbus stops the service and tells its supervise process to exit.)

Starting manually (not recommended)

Before starting, make sure ZooKeeper is already running; you can check with the jps command.
On the master: storm nimbus > /dev/null 2>&1 &
On each slave: storm supervisor > /dev/null 2>&1 &
Once they are up, run ./storm ui from Storm's bin directory; Storm's web UI then lets you monitor how the cluster is doing.

Using Storm

(Real-time stream processing; data is not buffered.)
The official site describes it as:
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

Storm in combination with other systems:

Storm is usually deployed together with a database and a message queue: the message queue serves as the data source, and the database is where the results end up.

Basic concepts

Topologies: a topology, i.e. one complete job
Spouts: the message sources of a topology
Bolts: the processing units of a topology
Tuple: a message tuple
Streams: streams of tuples
Stream groupings: the grouping strategy for a stream
Tasks: the units of processing
Executor: a worker thread
Workers: worker processes
Configuration: topology configuration

  • A topology describes which components make up a Storm job and how they are wired together. As the figure shows, the spout component is the input of the whole flow (later we will see that a Kafka consumer can serve as a spout, which is how Storm and Kafka are combined); the input stream then passes through the bolt components you define. Bolts run concurrently, and after a bolt processes a message it hands it to the next tier of bolts or to the output. How one tier of bolts connects to the next is configurable: connections can be defined by field, partitioned at random, and so on.

  • Physical structure of the cluster (three machines here). The nimbus node is the master and coordinates the whole cluster; it must not stay down, that is, if it crashes it must be brought back up immediately so it can resume its work. The supervisor nodes are managed by nimbus and do the actual computation. The configuration file sets how many workers a slave node may run at most; within each worker, executors are created (as specified in code, or adapted by the cluster itself) to run the actual tasks. Workers and executors are all distributed.

Storm Java demo


The task scenario: the data is a stream of phone brand names arriving in lowercase; they are first converted to uppercase, then given a suffix, and finally written out.

  • RandomWordSpout.java is the input data source. Note that a low Storm version is used here; with a higher version the dependency imports must be redone (org.apache.storm instead of backtype.storm).

    package cn.colony.cloud.storm;
    
    import java.util.Map;
    import java.util.Random;
    
    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import backtype.storm.utils.Utils;
    
    
    public class RandomWordSpout extends BaseRichSpout{
    
        private static final long serialVersionUID = -927405933479827173L;
        private SpoutOutputCollector collector;
        //some mock data to emit
        String[] words = {"iphone","xiaomi","mate","sony","sumsung","moto","meizu"};
    
        //called by Storm over and over to pull the next message into the stream
        @Override
        public void nextTuple() {
            Random random = new Random();
            int index = random.nextInt(words.length);
    
            String goodName = words[index];
    
            //wrap the product name in a tuple and send it to the next component
            //(emit is reminiscent of the signal-slot mechanism in Qt GUIs)
            //note: several fields could be emitted here at once
            collector.emit(new Values(goodName));
            //collector can also emit to a specific stream:
            //collector.emit(streamId, tuple)
    
            Utils.sleep(500);
        }
    
        //initialization; called once when the spout instance is set up
        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }
    
        //declare the fields of the tuples this spout emits
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            //note: several fields can be declared, but they must match collector.emit
            //declarer.declare(new Fields("orignname","id","price"));
            declarer.declare(new Fields("orignname"));
            //likewise, declare can target a specific stream:
            //declarer.declareStream(String arg0, Fields arg1);
        }
    }
    
  • UpperBolt.java converts the incoming stream to uppercase. Again, a higher Storm version needs the packages re-imported.

    package cn.colony.cloud.storm;
    
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    
    public class UpperBolt extends BaseBasicBolt{
    
        private static final long serialVersionUID = 7761803499934346929L;
    
        //the business logic
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            //first fetch the data handed over by the previous component; it sits
            //inside the tuple and is retrieved here by index
            //(it could also be fetched by name: tuple.getStringByField("orignname"))
            String goodName = tuple.getString(0);
    
            //convert the product name to uppercase
            String goodName_upper = goodName.toUpperCase();
    
            //send the converted name on; emit expects a list of output values,
            //and Values is a convenient ArrayList subclass for building one
            collector.emit(new Values(goodName_upper));
        }
    
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("uppername"));
        }
    }
  • SuffixBolt.java appends a suffix to the result of the previous step and writes it to a file.

    package cn.colony.cloud.storm;
    
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Map;
    import java.util.UUID;
    
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Tuple;
    
    
    public class SuffixBolt extends BaseBasicBolt{
    
        private static final long serialVersionUID = 2567257599496393226L;
        FileWriter fileWriter = null;
    
        @Override
        public void prepare(Map stormConf, TopologyContext context) {
            try {
                //note: this path assumes a Windows node; adjust it (e.g. to /root/...)
                //when the bolt runs on the Linux cluster
                fileWriter = new FileWriter("D:\\"+UUID.randomUUID());
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            //first fetch the product name sent by the previous component
            String upper_name = tuple.getString(0);
            String suffix_name = upper_name + "_itisok";
    
            try {
                fileWriter.write(suffix_name);
                fileWriter.write("\n");
                fileWriter.flush();
    
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    
        //close the file when the bolt is shut down
        @Override
        public void cleanup() {
            try {
                fileWriter.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
    
        @Override
        public void declareOutputFields(OutputFieldsDeclarer arg0) {
            //this bolt is a sink and emits nothing, so there is nothing to declare
        }
    }
  • Finally, TopoMain.java assembles the whole topology (a local-mode variant is sketched after the class).

    package cn.colony.cloud.storm;
    
    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.StormSubmitter;
    import backtype.storm.generated.StormTopology;
    import backtype.storm.topology.TopologyBuilder;
    
    /**
     * Wires the individual processing components into one complete pipeline,
     * the so-called topology (analogous to a job in a MapReduce program),
     * and submits that topology to the Storm cluster to run. Once submitted,
     * a topology runs forever unless it is killed manually or dies on an error.
     */
    public class TopoMain {
    
    
        public static void main(String[] args) throws Exception {
    
            TopologyBuilder builder = new TopologyBuilder();
    
            //add our spout component to the topology
            //parallelism_hint = 4 means four executors run this component
            //setNumTasks(8) sets the number of concurrent tasks for the component,
            //so here each executor runs two tasks
            builder.setSpout("randomspout", new RandomWordSpout(), 4).setNumTasks(8);
    
            //add the uppercase bolt to the topology and subscribe it to randomspout
            //.shuffleGrouping("randomspout") carries two meanings:
            //1. the tuples that upperbolt receives always come from randomspout
            //2. among the many concurrent tasks of the two components, messages are
            //   distributed with the random shuffleGrouping strategy
            builder.setBolt("upperbolt", new UpperBolt(), 4).shuffleGrouping("randomspout");
    
            //add the suffix bolt and subscribe it to upperbolt
            builder.setBolt("suffixbolt", new SuffixBolt(), 4).shuffleGrouping("upperbolt");
    
            //use the builder to create the topology
            StormTopology demotop = builder.createTopology();
    
    
            //parameters for running this topology on the cluster
            Config conf = new Config();
            //the number of slots, i.e. workers, that the whole demotop occupies
            conf.setNumWorkers(4);
            conf.setDebug(true);
            //the ack mechanism tracks whether each tuple is fully processed, much
            //like acks in network protocols; setting the acker count to 0 disables it
            conf.setNumAckers(0);
    
            //submit the topology to the storm cluster
            StormSubmitter.submitTopology("demotopo", conf, demotop);
    //      
    //      LocalCluster cluster = new LocalCluster();
    //        cluster.submitTopology("demotopo", conf, builder.createTopology());
        }
    }
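
For debugging, the commented-out LocalCluster lines above point at a handier alternative: run the topology in local mode inside the current JVM instead of submitting it. A minimal sketch of what the end of main could look like in that case (the ten-second run time is an arbitrary choice; backtype.storm.utils.Utils must be imported in addition to the imports above):

            //local mode: run the topology in this JVM, no Storm cluster needed
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("demotopo", conf, demotop);
            //let it run for a while, then tear everything down
            Utils.sleep(10000);
            cluster.killTopology("demotopo");
            cluster.shutdown();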


Besides the two grouping strategies mentioned above (shuffling at random and grouping by field), there are many others; a few are sketched below.
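
A small sketch of some other groupings on the same TopologyBuilder API (the component names reuse the demo classes above purely for illustration):

    package cn.colony.cloud.storm;
    
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;
    
    public class GroupingSketch {
    
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("randomspout", new RandomWordSpout(), 2);
    
            //fieldsGrouping: tuples with the same "orignname" value always reach the same task
            builder.setBolt("byfield", new UpperBolt(), 4)
                   .fieldsGrouping("randomspout", new Fields("orignname"));
    
            //allGrouping: every task of the bolt gets a copy of every tuple
            builder.setBolt("broadcast", new UpperBolt(), 4).allGrouping("randomspout");
    
            //globalGrouping: all tuples funnel into a single task of the bolt
            builder.setBolt("single", new UpperBolt(), 4).globalGrouping("randomspout");
        }
    }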

Common shell commands

  • storm ui: start the Storm web UI daemon
  • storm jar jar-name main-class: submit the topology built by main-class in the given jar
  • storm list: list the topologies that are running
  • storm kill topology-name: kill the named topology

Learn to read the logs

Storm's logs live in ${STORM_HOME}/logs; the master's and the slaves' logs are kept separately. Whatever happens to the Storm cluster, look at the logs first. (A lesson learned the hard way.)

Going further

Distributed shared locks
The implementation mechanism of transactional topologies
Framework integration
Data in: flume / ActiveMQ / kafka (distributed message queue systems); a Kafka spout sketch follows below
Data out: redis / hbase / mysql cluster
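
To give a flavor of the "data in" integration: with the old storm-kafka module (matching the backtype-era API used in this post), a Kafka topic can feed a topology through a ready-made spout. A minimal sketch, assuming the storm-kafka dependency is on the classpath; the topic name, offset root, and consumer id are illustrative:

    package cn.colony.cloud.storm;
    
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.TopologyBuilder;
    
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;
    
    public class KafkaTopoSketch {
    
        public static void main(String[] args) {
            //the ZooKeeper ensemble in which Kafka registers its brokers
            ZkHosts zkHosts = new ZkHosts("master:2181,slave1:2181,slave2:2181");
            //topic to read, zk root for storing offsets, and a consumer id
            SpoutConfig cfg = new SpoutConfig(zkHosts, "goods", "/kafka-offsets", "demo-consumer");
            //decode the raw Kafka messages as plain strings
            cfg.scheme = new SchemeAsMultiScheme(new StringScheme());
    
            TopologyBuilder builder = new TopologyBuilder();
            //the KafkaSpout plays the role RandomWordSpout played above
            builder.setSpout("kafkaspout", new KafkaSpout(cfg), 2);
            builder.setBolt("upperbolt", new UpperBolt(), 4).shuffleGrouping("kafkaspout");
            //...then submit with StormSubmitter or LocalCluster as shown earlier
        }
    }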


Reposted from blog.csdn.net/moquancsdn/article/details/81700418