Storm Stream Groupings Explained

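This article can serve as reading notes for Section 1.5 of the book <<Storm分布式实时计算模式>> (Storm Blueprints: Patterns for Distributed Real-time Computation).
A stream grouping defines how the tuples of a stream are distributed among the tasks of the different bolts in a topology.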

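Shuffle grouping: distributes tuples randomly across the bolt's tasks, so that each task receives roughly the same number of tuples.
Fields grouping: partitions the stream by the values of the specified fields. For example, if a stream is grouped on the "word" field, all tuples with the same "word" value are routed to the same bolt task.
All grouping: replicates every tuple to all of the bolt's tasks. Each task subscribed to the stream receives its own copy of the tuple.
Global grouping: routes all tuples to one single task. Storm picks the task with the lowest task ID as the receiver. Note that setting the task parallelism of the bolt is pointless with global grouping, because every tuple is forwarded to the same task. Use it with care: since all tuples end up in one JVM instance, that JVM or server can become a performance bottleneck or even crash.
None grouping: functionally the same as shuffle grouping; reserved for future use.
Direct grouping: the emitter calls emitDirect() to decide which task of the consuming component receives a tuple. It can only be used on streams that have been declared as direct streams.
Local or shuffle grouping: similar to shuffle grouping, but tuples are sent to bolt tasks inside the same worker when that worker hosts tasks of the receiving bolt. Otherwise it falls back to shuffle grouping. Depending on the parallelism of the topology, local or shuffle grouping can reduce network traffic and thereby improve topology performance.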
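To make the differences concrete, here is a small plain-Java sketch of how four of these groupings could pick the receiving task(s). It is only a simplified model for illustration, not Storm's actual implementation: the class GroupingSketch and its methods are hypothetical names, and real fields grouping hashes the selected field values in its own way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Simplified model of target-task selection. NOT Storm code; all names are hypothetical.
class GroupingSketch {

    // Shuffle grouping: one task, chosen at random.
    static List<Integer> shuffle(List<Integer> taskIds, Random rnd) {
        return Collections.singletonList(taskIds.get(rnd.nextInt(taskIds.size())));
    }

    // Fields grouping: equal field values always map to the same task.
    static List<Integer> fields(List<Integer> taskIds, Object fieldValue) {
        int idx = Math.floorMod(fieldValue.hashCode(), taskIds.size());
        return Collections.singletonList(taskIds.get(idx));
    }

    // All grouping: every task receives a copy.
    static List<Integer> all(List<Integer> taskIds) {
        return new ArrayList<>(taskIds);
    }

    // Global grouping: the task with the lowest ID receives everything.
    static List<Integer> global(List<Integer> taskIds) {
        return Collections.singletonList(Collections.min(taskIds));
    }

    public static void main(String[] args) {
        List<Integer> tasks = Arrays.asList(7, 5, 6, 8);
        System.out.println("shuffle: " + shuffle(tasks, new Random()));
        System.out.println("fields : " + fields(tasks, "the"));
        System.out.println("all    : " + all(tasks));
        System.out.println("global : " + global(tasks));
    }
}
```

The point of the model is the shape of the result: shuffle, fields and global return exactly one task, while all returns every task.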

Shuffle grouping

The groupings used most often are shuffle grouping, fields grouping and direct grouping.
Let's look at an example,
the familiar word count:

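// Package names are those of Storm 1.0 and later;
// releases before 1.0 use backtype.storm instead of org.apache.storm.
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class WordCountBolt extends BaseRichBolt {
    private OutputCollector collector;
    // Running count per word, local to this task
    private HashMap<String, Long> counts = null;

    public void prepare(Map config, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
        this.counts = new HashMap<String, Long>();
    }

    public void execute(Tuple tuple) {
        String word = tuple.getStringByField("word");
        Long count = this.counts.get(word);
        if (count == null) {
            count = 0L;
        }
        count++;
        this.counts.put(word, count);
        // Emit the updated count for this word
        this.collector.emit(new Values(word, count));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}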

When this bolt is added to the topology, it uses fields grouping:

builder.setBolt(COUNT_BOLT_ID, countBolt, 4)
        .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));

If we change the grouping to

builder.setBolt(COUNT_BOLT_ID, countBolt, 4)
        .shuffleGrouping(SPLIT_BOLT_ID);

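then the word counts come out too low.
Why?
Think about it: there are 4 instances of the count bolt (let's call them countbolt a, b, c and d for now). With shuffle grouping, suppose the word "the" appears 3 times; the first two occurrences go to countbolt a and the third goes to countbolt b. The downstream report bolt first receives the tuple <the,2> (from countbolt a) and then the tuple <the,1> (from countbolt b). Since it keeps only the latest count it received for each word, the final output is the:1 instead of the:3.
If we instead use

builder.setBolt(COUNT_BOLT_ID, countBolt, 4)
        .fieldsGrouping(SPLIT_BOLT_ID, new Fields("word"));

the problem naturally goes away: every tuple with the same "word" value is routed to the same task, so each word has exactly one running count.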
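The undercounting is easy to reproduce without a cluster. The following is a minimal plain-Java sketch, not Storm code: WordCountGroupingDemo is a hypothetical class, round-robin routing stands in for shuffle grouping so that the result is deterministic, and the report side is assumed to keep only the latest count it saw per word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical simulation of 'tasks' count-bolt tasks feeding one report bolt.
class WordCountGroupingDemo {

    // router maps (position of the word, word) to the index of the receiving task.
    static Map<String, Long> run(List<String> words, int tasks,
                                 BiFunction<Integer, String, Integer> router) {
        List<Map<String, Long>> counters = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            counters.add(new HashMap<>());
        }
        Map<String, Long> report = new HashMap<>();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            Map<String, Long> local = counters.get(router.apply(i, word));
            long count = local.merge(word, 1L, Long::sum); // task-local count
            report.put(word, count);                       // report keeps the latest value only
        }
        return report;
    }

    // Stand-in for shuffle grouping: spread tuples over tasks regardless of content.
    static Map<String, Long> roundRobin(List<String> words, int tasks) {
        return run(words, tasks, (i, w) -> i % tasks);
    }

    // Stand-in for fields grouping on "word": same word, same task.
    static Map<String, Long> byField(List<String> words, int tasks) {
        return run(words, tasks, (i, w) -> Math.floorMod(w.hashCode(), tasks));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "the", "the");
        System.out.println(roundRobin(words, 4)); // "the" ends at 1
        System.out.println(byField(words, 4));    // "the" ends at 3
    }
}
```

With content-blind routing each task sees only a slice of the occurrences of "the", so no single emitted count is the true total; routing by the word itself keeps the whole count in one place.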

Direct grouping

Here I use an example that appends an exclamation mark to sentences with Storm; the full code is at the end.

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

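The following was revised on 16-7-4

Actually, my example below is not a good one.

Direct grouping is mainly about making sure a message goes to one specific task of a bolt,

whereas what the example below really wants is to send message A to bolt A and message B to bolt B.

For that there is a more convenient approach:

When emitting: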

public void execute(Tuple tuple, BasicOutputCollector collector) {
    tpsCounter.count();

    Long tupleId = tuple.getLong(0);
    Object obj = tuple.getValue(1);

    if (obj instanceof TradeCustomer) {
        TradeCustomer tradeCustomer = (TradeCustomer) obj;
        Pair trade = tradeCustomer.getTrade();
        Pair customer = tradeCustomer.getCustomer();

        // Emit each half on its own named stream
        collector.emit(SequenceTopologyDef.TRADE_STREAM_ID,
                new Values(tupleId, trade));
        collector.emit(SequenceTopologyDef.CUSTOMER_STREAM_ID,
                new Values(tupleId, customer));
    } else if (obj != null) {
        LOG.info("Unknown type " + obj.getClass().getName());
    } else {
        LOG.info("Nullpointer ");
    }
}

When building the topology:

builder.setBolt(SequenceTopologyDef.SPLIT_BOLT_NAME, new SplitRecord(), 2)
        .shuffleGrouping(SequenceTopologyDef.SEQUENCE_SPOUT_NAME);

builder.setBolt(SequenceTopologyDef.TRADE_BOLT_NAME, new PairCount(), 1)
        .shuffleGrouping(SequenceTopologyDef.SPLIT_BOLT_NAME,  // --- name of the sender
                SequenceTopologyDef.TRADE_STREAM_ID);          // --- receive the sender's tuples on this stream

builder.setBolt(SequenceTopologyDef.CUSTOMER_BOLT_NAME, new PairCount(), 1)
        .shuffleGrouping(SequenceTopologyDef.SPLIT_BOLT_NAME,  // --- name of the sender
                SequenceTopologyDef.CUSTOMER_STREAM_ID);       // --- receive the sender's tuples on this stream

Declaring the output streams:

public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declareStream(SequenceTopologyDef.TRADE_STREAM_ID, new Fields("ID", "TRADE"));
    declarer.declareStream(SequenceTopologyDef.CUSTOMER_STREAM_ID, new Fields("ID", "CUSTOMER"));
}

Finally, a bolt that receives both streams still has to check which stream a tuple came from:

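// The body below handles the customer half, so the check is against CUSTOMER_STREAM_ID
if (input.getSourceStreamId().equals(SequenceTopologyDef.CUSTOMER_STREAM_ID)) {
    customer = pair;
    customerTuple = input;

    // Look for the trade half with the same tuple ID
    tradeTuple = tradeMap.get(tupleId);
    if (tradeTuple == null) {
        // The trade has not arrived yet: park the customer tuple and wait
        customerMap.put(tupleId, input);
        return;
    }

    trade = (Pair) tradeTuple.getValue(1);
}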
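The receiving logic is essentially a two-sided join on the tuple ID: whichever half arrives first is parked in a map until its partner shows up. A minimal plain-Java sketch of that idea follows. Everything in it is an assumption made for illustration: StreamJoinSketch is a hypothetical class, the stream names "trade" and "customer" and the String payloads are stand-ins, and removing the parked entry once a pair is matched is my own choice, since the snippet above does not show any cleanup.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical two-stream join keyed by tuple ID; not taken from the Storm example.
class StreamJoinSketch {
    private final Map<Long, String> pendingTrades = new HashMap<>();
    private final Map<Long, String> pendingCustomers = new HashMap<>();

    // Returns "trade+customer" once both halves for tupleId have arrived, otherwise null.
    String onTuple(String streamId, long tupleId, String payload) {
        if ("trade".equals(streamId)) {
            String customer = pendingCustomers.remove(tupleId);
            if (customer == null) {
                pendingTrades.put(tupleId, payload); // partner not here yet: park it
                return null;
            }
            return payload + "+" + customer;
        } else if ("customer".equals(streamId)) {
            String trade = pendingTrades.remove(tupleId);
            if (trade == null) {
                pendingCustomers.put(tupleId, payload); // partner not here yet: park it
                return null;
            }
            return trade + "+" + payload;
        }
        throw new IllegalArgumentException("unknown stream: " + streamId);
    }

    public static void main(String[] args) {
        StreamJoinSketch join = new StreamJoinSketch();
        System.out.println(join.onTuple("customer", 1L, "c1")); // first half only
        System.out.println(join.onTuple("trade", 1L, "t1"));    // pair complete
    }
}
```

The order of arrival does not matter, which is why the check on the source stream ID is needed on the receiving side.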

References

Splitting a data stream (数据的分流)

The above was revised on 16-7-4

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

At first (with the shuffle-grouped topology listed at the end),
the run produced the following output:

mystorm.PrintBolt@67178f5d   String recieved: edi:I'm happy!
mystorm.PrintBolt@67178f5d   String recieved: marry:I'm angry!
mystorm.PrintBolt@393ddf54   String recieved: ted:I'm excited!
mystorm.PrintBolt@393ddf54   String recieved: john:I'm sad!
mystorm.PrintBolt@5f97cfcb   String recieved: marry:I'm angry!

The tuples are spread roughly evenly over the different tasks.

Next, I wanted certain sentences to be received by one particular task only. How can that be done?
First, look at ExclaimBasicBolt:

public class ExclaimBasicBolt extends BaseBasicBolt {

    private static final long serialVersionUID = -6239845315934660303L;

    // Task IDs of the two downstream components, filled in by prepare()
    private List<Integer> list;
    private List<Integer> list2;

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        //String sentence = tuple.getString(0);
        String sentence = (String) tuple.getValue(0);
        String out = sentence + "!";
        if (out.startsWith("e")) {
            collector.emitDirect(list.get(0), new Values(out));
        } else {
            collector.emitDirect(list2.get(0), new Values(out));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // true declares the output stream as a direct stream
        declarer.declare(true, new Fields("excl_sentence"));
    }

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        list = context.getComponentTasks("print");
        list2 = context.getComponentTasks("print2");
    }
}

When building the topology,
use directGrouping:

builder.setSpout("spout", new RandomSpout());
builder.setBolt("exclaim", new ExclaimBasicBolt(), 3).shuffleGrouping("spout");
builder.setBolt("print", new PrintBolt(), 3).directGrouping("exclaim");
builder.setBolt("print2", new PrintBolt2(), 3).directGrouping("exclaim");

PrintBolt2 is similar to PrintBolt;
the only difference is that it prints with System.err.println(this+" i am two String recieved: " + rec);
OK, now when we run the topology we see:

mystorm.PrintBolt2@238ac8bf   String recieved: ted:I'm excited!
mystorm.PrintBolt2@238ac8bf   String recieved: john:I'm sad!
mystorm.PrintBolt2@238ac8bf   String recieved: marry:I'm angry!
mystorm.PrintBolt2@238ac8bf   String recieved: ted:I'm excited!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt@611b7a20  i am two   String recieved: edi:I'm happy!
mystorm.PrintBolt2@238ac8bf   String recieved: marry:I'm angry!
mystorm.PrintBolt2@238ac8bf   String recieved: ted:I'm excited!
mystorm.PrintBolt2@238ac8bf   String recieved: marry:I'm angry!

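All sentences beginning with "e" ended up in one single task of one print bolt, and all other sentences in one single task of the other, as selected by list.get(0) and list2.get(0) in ExclaimBasicBolt.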
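The routing decision itself is just a function from the outgoing sentence to a task ID, which can be tried out in isolation. Below is a rough plain-Java sketch with hypothetical names (DirectTargetSketch, pickTask, pickTaskByKey). The first method mirrors the rule used in ExclaimBasicBolt; the second is a variant of my own that spreads sentences over all tasks of one component by the speaker name before the colon, and is not something the original example does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helpers for choosing the task ID that would be passed to emitDirect().
class DirectTargetSketch {

    // Same rule as ExclaimBasicBolt: "e..." goes to the first task of one
    // component, everything else to the first task of the other.
    static int pickTask(String out, List<Integer> printTasks, List<Integer> print2Tasks) {
        return out.startsWith("e") ? printTasks.get(0) : print2Tasks.get(0);
    }

    // Variant (an assumption, not in the original): the same speaker always
    // lands on the same task, but different speakers may use different tasks.
    static int pickTaskByKey(String out, List<Integer> tasks) {
        String key = out.split(":", 2)[0];
        return tasks.get(Math.floorMod(key.hashCode(), tasks.size()));
    }

    public static void main(String[] args) {
        List<Integer> printTasks = Arrays.asList(4, 5, 6);   // made-up task IDs
        List<Integer> print2Tasks = Arrays.asList(7, 8, 9);  // made-up task IDs
        System.out.println(pickTask("edi:I'm happy!", printTasks, print2Tasks));
        System.out.println(pickTask("ted:I'm excited!", printTasks, print2Tasks));
        System.out.println(pickTaskByKey("john:I'm sad!", printTasks));
    }
}
```

In a real bolt the task lists would come from context.getComponentTasks(...), as in prepare() above.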

The complete code for this section (the initial, shuffle-grouped version) is as follows:

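// Package names below are those of Storm 1.0 and later;
// releases before 1.0 use backtype.storm instead of org.apache.storm.

// ---- ExclaimBasicTopo.java ----
package mystorm;

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class ExclaimBasicTopo {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        builder.setSpout("spout", new RandomSpout());
        builder.setBolt("exclaim", new ExclaimBasicBolt(), 3).shuffleGrouping("spout");
        builder.setBolt("print", new PrintBolt(), 3).shuffleGrouping("exclaim");

        Config conf = new Config();
        conf.setDebug(false);

        if (args != null && args.length > 0) {
            // A topology name was given: submit to a real cluster
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            // No arguments: run in local mode
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test", conf, builder.createTopology());
        }
    }
}

// ---- RandomSpout.java ----
package mystorm;

import java.util.Map;
import java.util.Random;

import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class RandomSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random rand;
    private int index;
    private static String[] sentences = new String[] {
        "edi:I'm happy", "marry:I'm angry", "john:I'm sad", "ted:I'm excited", "laden:I'm dangerous"};

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.rand = new Random();
    }

    @Override
    public void nextTuple() {
        // Emit 10 * sentences.length random sentences, then idle
        if (index < 10 * sentences.length) {
            String toSay = sentences[rand.nextInt(sentences.length)];
            this.collector.emit(new Values(toSay));
            index++;
        } else {
            try {
                Thread.sleep(1000);
                System.out.println("I paused for one second");
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

// ---- ExclaimBasicBolt.java ----
package mystorm;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class ExclaimBasicBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        //String sentence = tuple.getString(0);
        String sentence = (String) tuple.getValue(0);
        String out = sentence + "!";
        collector.emit(new Values(out));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("excl_sentence"));
    }
}

// ---- PrintBolt.java ----
package mystorm;

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PrintBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String rec = tuple.getString(0);
        System.err.println(this + "   String recieved: " + rec);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // do nothing
    }
}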
