This series of guides is built on a real cluster, not a pseudo-cluster, using three Tencent Cloud servers.
Storm
Configuration
Modify /conf/storm.yaml (the settings below are enough for now). Storm depends on a ZooKeeper cluster. Pay special attention to whitespace when editing storm.yaml, and note that in some Storm versions nimbus.seeds is instead called nimbus.host. Also be aware that old versions use the backtype package prefix, while new versions use org.apache. More configuration options can be found in the official documentation.
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
storm.zookeeper.servers:
- "master"
- "slave1"
- "slave2"
storm.local.dir: "/root/storm"
nimbus.seeds: ["master"]
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
java.library.path: "/opt/java/jdk1.8"
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
########### These MUST be filled in for a storm configuration
# storm.zookeeper.servers:
# - "server1"
# - "server2"
#
# nimbus.seeds: ["host1", "host2", "host3"]
#
#
# ##### These may optionally be filled in:
#
## List of custom serializations
# topology.kryo.register:
# - org.mycompany.MyType
# - org.mycompany.MyType2: org.mycompany.MyType2Serializer
#
## List of custom kryo decorators
# topology.kryo.decorators:
# - org.mycompany.MyDecorator
#
## Locations of the drpc servers
# drpc.servers:
# - "server1"
# - "server2"
## Metrics Consumers
## max.retain.metric.tuples
## - task queue will be unbounded when max.retain.metric.tuples is equal or less than 0.
## whitelist / blacklist
## - when none of configuration for metric filter are specified, it'll be treated as 'pass all'.
## - you need to specify either whitelist or blacklist, or none of them. You can't specify both of them.
## - you can specify multiple whitelist / blacklist with regular expression
## expandMapType: expand metric with map type as value to multiple metrics
## - set to true when you would like to apply filter to expanded metrics
## - default value is false which is backward compatible value
## metricNameSeparator: separator between origin metric name and key of entry from map
## - only effective when expandMapType is set to true
# topology.metrics.consumer.register:
# - class: "org.apache.storm.metric.LoggingMetricsConsumer"
# max.retain.metric.tuples: 100
# parallelism.hint: 1
# - class: "org.mycompany.MyMetricsConsumer"
# max.retain.metric.tuples: 100
# whitelist:
# - "execute.*"
# - "^__complete-latency$"
# parallelism.hint: 1
# argument:
# - endpoint: "metrics-collector.mycompany.org"
# expandMapType: true
# metricNameSeparator: "."
## Cluster Metrics Consumers
# storm.cluster.metrics.consumer.register:
# - class: "org.apache.storm.metric.LoggingClusterMetricsConsumer"
# - class: "org.mycompany.MyMetricsConsumer"
# argument:
# - endpoint: "metrics-collector.mycompany.org"
#
# storm.cluster.metrics.consumer.publish.interval.secs: 60
- Remember to create the /root/storm directory (see the sketch after the scp step below).
- storm.zookeeper.servers specifies the ZooKeeper server addresses. Storm keeps its cluster state in ZooKeeper, so these addresses must be configured. If ZooKeeper runs on a single machine, just specify that one.
- storm.local.dir is the local storage directory. The Nimbus and Supervisor daemons need a directory on local disk to store a small amount of state (jars, confs, and so on). Create it on every machine and grant the appropriate permissions.
- nimbus.seeds lists the candidate master hosts. Workers need to know which machines are Nimbus candidates (leader election runs through the ZooKeeper cluster) so that they can download topology jars and confs.
- supervisor.slots.ports defines the worker ports. For each supervisor machine, this setting controls how many workers may run on it. Each worker listens on its own port for messages, and the list also declares which ports are open for use. Defining five ports here means at most five workers can run on that supervisor node; defining three means at most three. By default (in defaults.yaml) four workers run on ports 6700, 6701, 6702, and 6703.
The supervisor does not launch these workers as soon as it starts. It only launches them when it is assigned tasks, and how many it launches depends on how many workers the topology needs on that supervisor. If a topology is set to run with a single worker, the supervisor starts one worker, not all of them.
Use scp to copy the files to every node.
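A sketch of pushing the configuration out and preparing the local state directory, assuming Storm is installed under /opt/storm/storm1.1 (the path used in the run scripts below) and that the nodes are reachable as root@slave1 and root@slave2:

# on master: copy the edited storm.yaml to the other nodes
scp /opt/storm/storm1.1/conf/storm.yaml root@slave1:/opt/storm/storm1.1/conf/
scp /opt/storm/storm1.1/conf/storm.yaml root@slave2:/opt/storm/storm1.1/conf/
# on every node: create the directory named by storm.local.dir
mkdir -p /root/storm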
Fixing Storm processes exiting on their own
The official documentation says: "Launches the nimbus daemon. This command should be run under supervision with a tool like daemontools or monit. See Setting up a Storm cluster for more information." Accordingly, we can use daemontools to watch these services and pull them back up.
- Prerequisite: the GCC compiler; yum install gcc is enough.
- Go to the /opt directory and run wget http://cr.yp.to/daemontools/daemontools-0.76.tar.gz. Extract it, go into ./admin/daemontools-0.76/src and edit error.h with vim, replacing extern int errno; with #include <errno.h>. Then run the package/install command from the daemontools-0.76 directory to compile. If nothing goes wrong, you will see a log like the following (a condensed command sketch follows the log):
Copying commands into ./command...
Creating symlink daemontools -> daemontools-0.76...
Making command links in /command...
Making compatibility links in /usr/local/bin...
Creating /service...
Adding svscanboot to inittab...
init should start svscan now.
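For reference, the preparation above condensed into one command sequence (a sketch: it assumes the stock daemontools-0.76 tarball layout and uses sed instead of vim for the error.h edit):

cd /opt
wget http://cr.yp.to/daemontools/daemontools-0.76.tar.gz
tar xzf daemontools-0.76.tar.gz
cd admin/daemontools-0.76
# replace "extern int errno;" with "#include <errno.h>"
sed -i 's/extern int errno;/#include <errno.h>/' src/error.h
package/install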
- Configure the run commands
cd /service
mkdir nimbus supervisor ui
cd nimbus
vim run
#!/bin/bash
exec 2>&1
exec /opt/storm/storm1.1/bin/storm nimbus
chmod 755 run
cd ../supervisor
vim run
#!/bin/bash
exec 2>&1
exec /opt/storm/storm1.1/bin/storm supervisor
chmod 755 run
cd ../ui
vim run
#!/bin/bash
exec 2>&1
exec /opt/storm/storm1.1/bin/storm ui
chmod 755 run
- Start the processes
Start them on each node according to your own role layout:
nohup supervise /service/nimbus &
nohup supervise /service/supervisor &
nohup supervise /service/ui &
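To check that daemontools has actually picked up the services, svstat (shipped with daemontools) reports the state and uptime of each service directory:

svstat /service/nimbus /service/supervisor /service/ui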
- How to kill a process that is being supervised
ps -ef|grep supervise
Find the relevant supervise process and kill -9 it. Then run jps and kill -9 the actual Storm process.
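The order matters: kill the supervise watcher first, otherwise it will immediately respawn the daemon. A sketch, with the PIDs as placeholders you read off the ps and jps output:

ps -ef | grep supervise      # note the PID of "supervise /service/nimbus"
kill -9 <supervise_pid>
jps                          # note the PID of the Storm process itself
kill -9 <storm_pid>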
Starting by hand (not recommended)
Before starting, make sure ZooKeeper is already running; you can check with the jps command.
On master: storm nimbus > /dev/null 2>&1 &
On the slaves: storm supervisor > /dev/null 2>&1 &
After they are up, run ./storm ui from Storm's bin directory; Storm's web UI can then be used to monitor how the cluster is running.
Using Storm
(processes streaming data in real time, without caching it)
Introduction from the official site:
Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.
How Storm combines with other systems:
Storm is usually used together with a database or a message queue: the message queue acts as the data source, and the database is where the results are stored.
Basic concepts
- Topologies: a topology, also referred to as one job
- Spouts: the message sources of a topology
- Bolts: the processing logic units of a topology
- Tuple: a message tuple
- Streams: streams of tuples
- Stream groupings: the grouping strategy for a stream
- Tasks: task processing units
- Executor: a worker thread
- Workers: worker processes
- Configuration: topology configuration
A topology describes how the processing components are composed and connected. The spout component is the input of the whole topology (later we will see that a Kafka consumer can act as a spout, which is how Storm and Kafka can be combined). The input stream flows through the defined bolt components, which run concurrently; after a bolt finishes processing a message, it passes it on to the next bolt or to the topology output. The connection between one level of bolts and the next can be customized: tuples can be routed by field, split randomly, and so on.
Physical cluster structure (three machines are needed here): the nimbus node is the master and coordinates the cluster. It must not stay down; or rather, if it does go down it has to be brought back up immediately so it can resume its work. The supervisor nodes are managed by nimbus and carry out the actual computation. The configuration file sets how many workers a slave node may run at most; inside each worker, executors are created, either as specified in the code or adaptively by the cluster, to execute the actual tasks. These workers and executors are all distributed.
storm java demo
Task scenario: the data is a stream of phone brand names arriving in lowercase; they are first converted to uppercase, then given a suffix, and finally written out.
RandomWordSpout.java serves as the data source. Note that an older Storm version is used here; with a newer version the imports have to be redone (backtype.storm becomes org.apache.storm).

package cn.colony.cloud.storm;

import java.util.Map;
import java.util.Random;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class RandomWordSpout extends BaseRichSpout {

    private static final long serialVersionUID = -927405933479827173L;

    private SpoutOutputCollector collector;

    // Some simulated data
    String[] words = {"iphone", "xiaomi", "mate", "sony", "sumsung", "moto", "meizu"};

    // Tells Storm how to keep producing the input stream
    @Override
    public void nextTuple() {
        Random random = new Random();
        int index = random.nextInt(words.length);
        String goodName = words[index];
        // Wrap the product name in a tuple and send it to the next component
        // (an emit method also shows up in Qt's signal-slot mechanism)
        // Note: several fields can be emitted here
        collector.emit(new Values(goodName));
        // The collector can also target a specific stream:
        // collector.emit(streamId, tuple)
        Utils.sleep(500);
    }

    // Initialization method, called once when the spout component is instantiated
    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    // Declares the fields of the emitted tuples
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Note: several fields can be declared here, but they must match what collector.emit sends
        // declarer.declare(new Fields("orignname", "id", "price"));
        declarer.declare(new Fields("orignname"));
        // Likewise, declare can target a specific stream:
        // declarer.declareStream(String arg0, Fields arg1);
    }
}
UpperBolt.java converts the incoming stream to uppercase; again, with a newer Storm version the imports have to be redone.

package cn.colony.cloud.storm;

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class UpperBolt extends BaseBasicBolt {

    private static final long serialVersionUID = 7761803499934346929L;

    // Business processing logic
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // First fetch the data passed in by the previous component; it lives in the tuple and is read by index
        String godName = tuple.getString(0);
        // Convert the product name to uppercase
        String godName_upper = godName.toUpperCase();
        // Emit the converted product name
        // (emit expects a list of values; Values is a convenience subclass of ArrayList, so it can be passed directly)
        collector.emit(new Values(godName_upper));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("uppername"));
    }
}
SuffixBolt.java appends a suffix to the result of the previous step and writes it to a file.
package cn.colony.cloud.storm;

import java.io.FileWriter;
import java.io.IOException;
import java.util.Map;
import java.util.UUID;

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;

public class SuffixBolt extends BaseBasicBolt {

    private static final long serialVersionUID = 2567257599496393226L;

    FileWriter fileWriter = null;

    // Called once when the bolt is initialized: open one output file per bolt instance
    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        try {
            fileWriter = new FileWriter("D:\\" + UUID.randomUUID());
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        // First get the product name sent by the previous component
        String upper_name = tuple.getString(0);
        String suffix_name = upper_name + "_itisok";
        try {
            fileWriter.write(suffix_name);
            fileWriter.write("\n");
            fileWriter.flush();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    // This bolt is the end of the chain and emits nothing, so no fields are declared
    @Override
    public void declareOutputFields(OutputFieldsDeclarer arg0) {
    }
}
Finally, TopoMain.java wires everything into a complete topology.
package cn.colony.cloud.storm;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;

/**
 * Organizes the processing components into a complete flow, the so-called topology
 * (similar to a job in a MapReduce program), and submits it to the Storm cluster.
 * Once submitted, the topology runs forever unless it is killed manually or exits abnormally.
 */
public class TopoMain {

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Add our spout component to the topology
        // parallelism_hint: 4 means 4 executors run this component
        // setNumTasks(8) sets the number of concurrent tasks for the component, so each executor runs 2 tasks
        builder.setSpout("randomspout", new RandomWordSpout(), 4).setNumTasks(8);

        // Add the uppercase-conversion bolt to the topology and have it receive messages from randomspout
        // .shuffleGrouping("randomspout") means two things:
        // 1. the tuples received by upperbolt always come from the randomspout component
        // 2. messages between the many concurrent tasks of randomspout and upperbolt are distributed with the random shuffleGrouping strategy
        builder.setBolt("upperbolt", new UpperBolt(), 4).shuffleGrouping("randomspout");

        // Add the suffix-appending bolt to the topology and have it receive messages from upperbolt
        builder.setBolt("suffixbolt", new SuffixBolt(), 4).shuffleGrouping("upperbolt");

        // Use the builder to create a topology
        StormTopology demotop = builder.createTopology();

        // Configure some parameters for running the topology on the cluster
        Config conf = new Config();
        // This sets the number of slots, i.e. workers, the whole demotop occupies
        conf.setNumWorkers(4);
        conf.setDebug(true);
        // The ack mechanism keeps a single failed task from failing the whole chain, much like acks
        // in network communication; 0 acker executors disables it here
        conf.setNumAckers(0);

        // Submit this topology to the Storm cluster
        StormSubmitter.submitTopology("demotopo", conf, demotop);

        // To run in local mode instead:
        // LocalCluster cluster = new LocalCluster();
        // cluster.submitTopology("demotopo", conf, builder.createTopology());
    }
}
Besides the two ways of running a topology shown above (submitting to the cluster with StormSubmitter, or running locally with LocalCluster), there are many more operations available from the command line; a usage sketch follows the list below.
Common shell commands
- storm ui
- storm jar jar-name main-class
- storm list
- storm kill topology-name
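For example, to submit the demo topology from the previous section and later remove it (storm-demo.jar is an assumed jar name; use whatever your build actually produces):

storm jar storm-demo.jar cn.colony.cloud.storm.TopoMain
storm list
storm kill demotopo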
Learn to read the logs
Storm's logs live in ${STORM_HOME}/logs, and the master's and the slaves' logs are separate. Whatever happens to the Storm cluster, always look at the logs first. (A lesson learned the hard way.)
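A quick way to follow the daemon logs, assuming the /opt/storm/storm1.1 install path used earlier (exact file names vary a little between Storm versions):

tail -f /opt/storm/storm1.1/logs/nimbus.log        # on the master
tail -f /opt/storm/storm1.1/logs/supervisor.log    # on a slave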
Going further
Distributed shared locks
Transactional topology implementation mechanism
Framework integration
Data in: Flume / ActiveMQ / Kafka (distributed message queue systems)
Data out: Redis / HBase / MySQL Cluster