I. Storm Overview
---------------------------------------------------------
1. Open-source, distributed, real-time computation.
2. Processes unbounded streams of data reliably and in real time; components can be written in any language.
3. Suited to real-time analytics, online machine learning, distributed RPC, and ETL.
4. Can process millions of records (tuples) per second.
5. Scalable and fault-tolerant, with an at-least-once processing guarantee.
6. Low latency: results within seconds to minutes.
II. Core Concepts
------------------------------------------------------------
1. Topology:
    a. The object that encapsulates a real-time computation, analogous to an MR job, except that it never terminates.
    b. Spouts and bolts wired together form a topology: a directed graph [vertices -- edges] whose vertices are the computation units and whose edges are the data streams.
2. Nimbus:
    a. Handles resource scheduling and allocation, analogous to the JobTracker.
    b. Master-node daemon: the core component, the "leader" that manages and assigns tasks.
    c. Analyzes the topology, collects the tasks to run, and distributes them to the supervisors.
    d. Communicates with the supervisors through an internal messaging system.
    e. Monitors topologies for failure. Nimbus itself is stateless, so it relies on ZooKeeper to track the running state of topologies.
3. Supervisor:
    a. Receives instructions from Nimbus and starts and manages all of its workers, analogous to a TaskTracker.
    b. Worker-node daemon: the "foreman" that assigns tasks to its workers.
    c. Each supervisor hosts n worker processes and manages all of them.
4. Worker: does not execute tasks itself; it spawns the processing threads [executors] and lets those executors run the tasks.
5. Executor: a physical thread spawned by a worker; all tasks running inside one executor must belong to the same component [spout/bolt].
6. Task: the smallest unit of work in Storm, analogous to an MR task; it performs the actual processing and is an instance of either a spout or a bolt.
7. Spout: the "tap", the source of a data stream. It fetches data from a source (custom code, Kafka, etc.) and emits it to the bolts via the nextTuple() method.
8. Bolt: the "fitting", a logic-processing unit. It receives tuples emitted by spouts (or other bolts) and filters, aggregates, writes to a DB, etc.
9. Tuple: the basic unit of a message and Storm's main data structure: an ordered list of elements.
10. Stream: an unbounded sequence of tuples.
11. Stream grouping: how tuples are partitioned among a bolt's tasks -- shuffle, fields, all, global, none, direct, localOrShuffle, etc.
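The idea behind fields grouping can be sketched without the Storm API: tuples with the same value in the grouping field always go to the same task, typically via a hash of the field value modulo the number of target tasks. A minimal plain-Java illustration (FieldsGroupingDemo and taskFor are hypothetical names, not Storm classes):

```java
public class FieldsGroupingDemo {
    // Route a tuple to a task index by hashing the grouping-field value.
    // Equal field values always land on the same task, which is what makes
    // per-key state (e.g. counters in a bolt) consistent.
    static int taskFor(String fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 4;
        String[] words = {"how", "are", "you", "how", "are"};
        for (String w : words) {
            System.out.println(w + " -> task " + taskFor(w, numTasks));
        }
    }
}
```

Shuffle grouping, by contrast, distributes tuples randomly (but evenly) and gives no such per-key guarantee.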
III. Storm vs. Hadoop
-----------------------------------------------------------
    Storm                                       Hadoop
    real-time stream processing                 batch processing
    stateless (state kept in ZooKeeper)         stateful
    master-slave architecture coordinated       master-slave architecture
      through ZooKeeper                           without ZooKeeper
    millions of records per second              MR jobs take minutes to hours
    a topology runs until killed                a job stops when it finishes
    topology -- task                            MapReduce job -- MR task
    Nimbus -- Supervisor                        JobTracker -- TaskTracker
    spout -- bolt                               map -- reduce
IV. Storm Workflow
-----------------------------------------------------------
1. Nimbus waits for a topology to be submitted.
2. A topology is submitted.
3. Nimbus receives the topology and extracts its tasks.
4. Nimbus distributes the tasks among all available supervisors.
5. Every supervisor periodically sends a heartbeat to Nimbus to prove it is still alive. If a supervisor dies, Nimbus stops sending it tasks and sends them to other supervisors instead.
6. If Nimbus dies, the tasks already assigned to the supervisors keep running, unaffected.
7. Once its tasks are done, a supervisor waits for new tasks.
8. A dead Nimbus can be restarted automatically by a process-supervision tool and resumes from where it left off.
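The distribution and failover behaviour in steps 4-5 can be sketched with a toy scheduler. This is plain Java to illustrate the idea only; ToyScheduler, assign, and reassign are invented names, and real Nimbus scheduling is considerably more sophisticated:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyScheduler {
    // Step 4: distribute task ids round-robin over the available supervisors.
    static Map<String, List<Integer>> assign(List<String> supervisors, int numTasks) {
        Map<String, List<Integer>> plan = new LinkedHashMap<>();
        for (String s : supervisors) plan.put(s, new ArrayList<>());
        for (int t = 0; t < numTasks; t++) {
            plan.get(supervisors.get(t % supervisors.size())).add(t);
        }
        return plan;
    }

    // Step 5: a supervisor missed its heartbeats, so its tasks are
    // redistributed among the remaining supervisors.
    static Map<String, List<Integer>> reassign(Map<String, List<Integer>> plan, String dead) {
        List<String> alive = new ArrayList<>(plan.keySet());
        alive.remove(dead);
        int total = plan.values().stream().mapToInt(List::size).sum();
        return assign(alive, total);
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> plan = assign(Arrays.asList("s200", "s300", "s400"), 6);
        System.out.println(plan);                    // 2 tasks per supervisor
        System.out.println(reassign(plan, "s300"));  // s300's tasks move to s200/s400
    }
}
```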
V. Installing and Deploying Storm
-------------------------------------------------------------
1. Prepare 4 machines: 1 for Nimbus, 3 for supervisors.
2. Download apache-storm-1.1.0.tar.gz, untar it, create a symlink, and distribute it to all nodes.
3. Configure the environment variables and distribute the file:
    [/etc/environment]
    ...
    STORM_HOME="/soft/storm"
    PATH="...:/soft/storm/bin"
4. Verify the installation:
    $> storm version
5. Deploy Storm:
    a. Edit the config file [/soft/storm/conf/storm.yaml] and distribute it:
        storm.local.dir: "/home/ubuntu/storm"
        storm.zookeeper.servers:
            - "s200"
            - "s300"
        storm.zookeeper.port: 2181
        ### nimbus.* configs are for the master
        nimbus.seeds: ["s201"]
        ### ui.* configs are for the web UI
        ui.host: 0.0.0.0
        ui.port: 8080
        supervisor.slots.ports:
            - 6700
            - 6701
            - 6702
            - 6703
6. Start Storm:
    a. Start the Nimbus daemon on s100:
        $> storm nimbus
    b. Start the supervisor daemon on s200, s300, and s400:
        $> storm supervisor
    c. Start the UI daemon on s100:
        $> cd /soft/storm/bin
        $bin> ./storm ui &
    d. Open the web UI:
        http://s100:8080
VI. Case Study: Phone Call Log Analysis
--------------------------------------------------------------
1. Spout class
----------------------------------------------------------
package test.storm;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

/**
 * Storm spout: generates the stream of simulated call-log tuples.
 */
public class CallLogSpout implements IRichSpout {

    //output collector: passes tuples on to the bolts
    private SpoutOutputCollector collector;

    //whether emission is finished
    private boolean completed = false;

    //topology context (runtime information about the topology)
    private TopologyContext context;

    //random generator for callers, callees, and durations
    private Random randomGenerator = new Random();

    //number of tuples emitted so far
    private Integer idx = 0;

    public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) {
        this.context = topologyContext;
        this.collector = spoutOutputCollector;
    }

    public void close() {
    }

    public void activate() {
    }

    public void deactivate() {
    }

    /**
     * Emit the next tuple (call record).
     */
    public void nextTuple() {
        if (this.idx <= 1000) {
            List<String> mobileNumbers = new ArrayList<String>();
            mobileNumbers.add("1234123401");
            mobileNumbers.add("1234123402");
            mobileNumbers.add("1234123403");
            mobileNumbers.add("1234123404");
            Integer localIdx = 0;
            while (localIdx++ < 100 && this.idx++ < 1000) {
                //random caller
                String caller = mobileNumbers.get(randomGenerator.nextInt(4));
                //random callee, distinct from the caller (compare with equals(), not ==)
                String callee = mobileNumbers.get(randomGenerator.nextInt(4));
                while (caller.equals(callee)) {
                    callee = mobileNumbers.get(randomGenerator.nextInt(4));
                }
                //random call duration
                Integer duration = randomGenerator.nextInt(60);
                //emit the tuple to the bolts
                this.collector.emit(new Values(caller, callee, duration));
            }
        }
    }

    public void ack(Object o) {
    }

    public void fail(Object o) {
    }

    /**
     * Declare the names of the output fields.
     * @param declarer
     */
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("caller", "callee", "duration"));
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
2. Bolt A class
-----------------------------------------------------------------------
package test.storm;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

import java.util.Map;

/**
 * Call-log bolt -- assembles each call record into a single log string.
 */
public class CallLogBolt implements IRichBolt {

    //collector: emits the newly assembled tuples downstream
    private OutputCollector collector;

    public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
        this.collector = outputCollector;
    }

    /**
     * Processing logic -- runs once per incoming tuple.
     * @param tuple
     */
    public void execute(Tuple tuple) {
        //caller, from the tuple
        String from = tuple.getString(0);
        //callee, from the tuple
        String to = tuple.getString(1);
        //call duration, from the tuple
        Integer duration = tuple.getInteger(2);
        //assemble the call record into a new tuple and emit it
        collector.emit(new Values(from + " - " + to, duration));
    }

    public void cleanup() {
    }

    /**
     * Declare the output field names for the emitted tuples.
     * @param declarer
     */
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("calllog", "duration"));
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
3. Bolt B class
-------------------------------------------------------------
package test.storm;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;

import java.util.HashMap;
import java.util.Map;

/**
 * Call-log bolt -- counts how many times each call-log string occurs.
 */
public class CounterBolt implements IRichBolt {

    Map<String, Integer> counterMap;
    private OutputCollector collector;

    /**
     * Initialization.
     */
    public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
        this.counterMap = new HashMap<String, Integer>();
        collector = outputCollector;
    }

    /**
     * Count the number of calls per call-log string.
     */
    public void execute(Tuple tuple) {
        String calllog = tuple.getString(0);
        Integer duration = tuple.getInteger(1);
        if (!counterMap.containsKey(calllog)) {
            counterMap.put(calllog, 1);
        } else {
            Integer c = counterMap.get(calllog) + 1;
            counterMap.put(calllog, c);
        }
        //this is the last bolt in the chain, so ack the tuple here to mark it fully processed
        collector.ack(tuple);
    }

    public void cleanup() {
        for (Map.Entry<String, Integer> entry : counterMap.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("call"));
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
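The counting pattern in execute() (check containsKey, then put) can be exercised outside Storm, and Java's Map.merge expresses the same update in one call. A standalone sketch (CallCounterDemo is an invented name for illustration):

```java
import java.util.HashMap;
import java.util.Map;

public class CallCounterDemo {
    // Same per-key counting the bolt performs in execute().
    static Map<String, Integer> count(String[] calllogs) {
        Map<String, Integer> counters = new HashMap<>();
        for (String log : calllogs) {
            // merge: insert 1 if the key is absent, otherwise add 1 to the count
            counters.merge(log, 1, Integer::sum);
        }
        return counters;
    }

    public static void main(String[] args) {
        String[] logs = {"a - b", "a - b", "b - c"};
        System.out.println(count(logs)); // "a - b" counted twice, "b - c" once
    }
}
```

Because the topology routes tuples to boltB with fieldsGrouping on "calllog", every occurrence of one call-log string reaches the same task, so a per-task map like this yields correct totals.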
4. App class
----------------------------------------------------------------------
package test.storm;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

/**
 * Entry-point class -- builds and submits the topology.
 */
public class App_Toplogy {

    public static void main(String[] args) {
        //storm cluster configuration
        Config config = new Config();
        config.setDebug(true);

        //build the topology
        TopologyBuilder builder = new TopologyBuilder();
        //the spout (the tap)
        builder.setSpout("spoutA", new CallLogSpout());
        //attach boltA to spoutA with shuffle grouping
        builder.setBolt("boltA", new CallLogBolt()).shuffleGrouping("spoutA");
        //attach boltB to boltA with fields grouping on the "calllog" field
        builder.setBolt("boltB", new CounterBolt()).fieldsGrouping("boltA", new Fields("calllog"));

        ////local mode
        //LocalCluster cluster = new LocalCluster();
        ////submit the topology
        //cluster.submitTopology("LogAnalyserStorm", config, builder.createTopology());
        //try {
        //    Thread.sleep(10000);
        //} catch (InterruptedException e) {
        //    e.printStackTrace();
        //}
        ////stop the topology
        //cluster.shutdown();

        //cluster mode
        try {
            StormSubmitter.submitTopology("mytop", config, builder.createTopology());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
VII. Running on the Storm Cluster
-------------------------------------------------------------------------------
1. Switch the App class to cluster-submission mode:
    //cluster mode
    try {
        StormSubmitter.submitTopology("mytop", config, builder.createTopology());
    } catch (Exception e) {
        e.printStackTrace();
    }
2. Export the jar.
3. Run it on Ubuntu:
    $bin> ./storm jar /share/storm/TestStrom-1.0-SNAPSHOT.jar test.storm.App_Toplogy
4. Start the log viewer on s200:
    $bin> ./storm logviewer &
5. Check the result [call-log statistics]:
    $s400> cat /soft/storm/logs/workers-artifacts/mytop-1-1538085858/6700/worker.log | grep 1234
VIII. Case Study: Word Count
---------------------------------------------------------------------------
1. WordSpout -- generates lines of words
---------------------------------------------
package test.storm.wc;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import util.Util;

import java.util.Map;
import java.util.Random;

/**
 * Word-source spout -- the tap of the topology.
 */
public class WordSpout implements IRichSpout {

    private TopologyContext context;
    private SpoutOutputCollector collector;

    public void open(Map map, TopologyContext topologyContext, SpoutOutputCollector spoutOutputCollector) {
        //custom debug helper: reports the lifecycle event to a socket listener on port 7777
        Util.sendToClient(this, "WordSpout.open()", 7777);
        context = topologyContext;
        collector = spoutOutputCollector;
    }

    public void close() {
    }

    public void activate() {
    }

    public void deactivate() {
    }

    /**
     * Emit the next line.
     */
    public void nextTuple() {
        //Util.sendToClient(this, "WordSpout.nextTuple()", 7777);
        String line = "how are you" + " tom" + new Random().nextInt(100);
        collector.emit(new Values(line));
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }

    public void ack(Object o) {
    }

    public void fail(Object o) {
    }

    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("line"));
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
2. SpiltBolt -- splits lines into words
----------------------------------------------
package test.storm.wc;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import util.Util;

import java.util.Map;

/**
 * Splits each incoming line into words and emits one tuple per word.
 */
public class SpiltBolt implements IRichBolt {

    private TopologyContext context;
    private OutputCollector collector;

    public void prepare(Map map, TopologyContext topologyContext, OutputCollector outputCollector) {
        Util.sendToClient(this, "SpiltBolt.prepare()", 8888);
        this.context = topologyContext;
        this.collector = outputCollector;
    }

    /**
     * Processing logic: split the line on spaces and emit each word.
     */
    public void execute(Tuple tuple) {
        //Util.sendToClient(this, "SpiltBolt.execute()" + tuple.toString(), 8888);
        String str = tuple.getString(0);
        String[] strs = str.split(" ");
        for (String s : strs) {
            collector.emit(new Values(s));
        }
    }

    public void cleanup() {
    }

    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
        outputFieldsDeclarer.declare(new Fields("word"));
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
3. CounterBolt -- counts the words
----------------------------------------------
package test.storm.wc;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Tuple;
import util.Util;

import java.util.HashMap;
import java.util.Map;

public class CounterBolt implements IRichBolt {

    private TopologyContext context;
    private OutputCollector collector;
    //per-word counters
    private Map<String, Integer> map = new HashMap<String, Integer>();

    //the config parameter is named conf so it does not shadow the counter map above
    public void prepare(Map conf, TopologyContext topologyContext, OutputCollector outputCollector) {
        Util.sendToClient(this, "CounterBolt.prepare()", 9999);
        context = topologyContext;
        collector = outputCollector;
        map = new HashMap<String, Integer>();
    }

    public void execute(Tuple tuple) {
        //Util.sendToClient(this, "CounterBolt.execute()" + tuple.toString(), 9999);
        String word = tuple.getString(0);
        if (map.containsKey(word)) {
            int count = map.get(word) + 1;
            map.put(word, count);
        } else {
            map.put(word, 1);
        }
        collector.ack(tuple);
    }

    public void cleanup() {
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }

    //terminal bolt: emits nothing, so no output fields are declared
    public void declareOutputFields(OutputFieldsDeclarer outputFieldsDeclarer) {
    }

    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
4. App main program
----------------------------------------------
package test.storm.wc;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class App {

    public static void main(String[] args) {
        //storm cluster configuration
        Config config = new Config();
        config.setDebug(true);
        //use 5 workers
        config.setNumWorkers(5);

        //build the topology
        TopologyBuilder builder = new TopologyBuilder();
        //the spout, with a parallelism hint of 3 (3 executors in total) and 3 tasks
        builder.setSpout("wordSpout", new WordSpout(), 3).setNumTasks(3);
        //attach the split bolt to the spout with shuffle grouping, parallelism hint 4
        builder.setBolt("spiltBolt", new SpiltBolt(), 4).shuffleGrouping("wordSpout");
        //attach the counter bolt to the split bolt with fields grouping on "word", parallelism hint 5
        builder.setBolt("counterBolt", new CounterBolt(), 5).fieldsGrouping("spiltBolt", new Fields("word"));

        ////local mode
        //LocalCluster cluster = new LocalCluster();
        ////submit the topology
        //cluster.submitTopology("wc", config, builder.createTopology());
        //try {
        //    Thread.sleep(10000);
        //} catch (InterruptedException e) {
        //    e.printStackTrace();
        //}
        ////stop the topology
        //cluster.shutdown();

        //cluster mode
        try {
            StormSubmitter.submitTopology("wc", config, builder.createTopology());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
IX. Configuring Topology Parallelism -- tasks, workers, and executors
---------------------------------------------------------------------
1. Set the number of workers:
    conf.setNumWorkers(3);    //default is 1
2. Set the number of executor threads (the parallelism hint):
    //spout with a parallelism hint of 3 [3 executor threads in total, spread across the workers, generate the words]
    builder.setSpout("wordSpout", new WordSpout(), 3);
    //split bolt attached to the spout with shuffle grouping, parallelism hint 4 [4 executor threads in total do the splitting]
    builder.setBolt("spiltBolt", new SpiltBolt(), 4).shuffleGrouping("wordSpout");
3. Set the number of tasks:
    //3 executor threads, 6 tasks -- on average 2 tasks per executor
    builder.setSpout("wordSpout", new WordSpout(), 3).setNumTasks(6);
4. With the settings above (taking the spout as the example):
    -- 1 Nimbus; 3 workers run on 3 hosts; the spout's 3 executors share its 6 tasks evenly, 2 tasks per executor.
    -- A host can run at most 4 workers [determined by supervisor.slots.ports in the config file], and workers are spread evenly, so the 3 workers land on 3 hosts.
    -- Hosts run worker processes, workers spawn executor threads, and executors run one or more tasks. The task is the smallest unit and performs the actual computation.
5. The parallelism of a topology == the total number of tasks across all components (equal to the number of executors unless setNumTasks says otherwise).
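The arithmetic above can be checked with a tiny helper. This is plain Java expressing the workers/executors/tasks relations described in this section, not a Storm API (ParallelismMath is an invented name):

```java
public class ParallelismMath {
    // A component's tasks are divided evenly among its executors.
    static int tasksPerExecutor(int numTasks, int numExecutors) {
        return numTasks / numExecutors;
    }

    // Total parallelism of a topology = sum of tasks over all components.
    static int totalParallelism(int[] tasksPerComponent) {
        int total = 0;
        for (int t : tasksPerComponent) total += t;
        return total;
    }

    public static void main(String[] args) {
        //the spout example: 3 executors, 6 tasks -> 2 tasks per executor
        System.out.println(tasksPerExecutor(6, 3)); // 2
        //spout: 6 tasks; spiltBolt: 4; counterBolt: 5 (hint == tasks when setNumTasks is unset)
        System.out.println(totalParallelism(new int[]{6, 4, 5})); // 15
    }
}
```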