Storm Study Notes

Written by: 王宇
2016-10-20

Storm Concepts

Let us now have a closer look at the components of Apache Storm:

  • Tuple − The main data structure in Storm: an ordered list of elements (see the sketch after this list). By default, a tuple supports all data types. It is generally modelled as a set of comma-separated values and passed to a Storm cluster.
  • Stream − An unbounded sequence of tuples.
  • Spouts − The source of a stream. Generally, Storm accepts input data from raw data sources like the Twitter Streaming API, an Apache Kafka queue, a Kestrel queue, etc. Otherwise you can write spouts to read data from other data sources. "ISpout" is the core interface for implementing spouts; some of the specific interfaces and base classes are IRichSpout, BaseRichSpout, KafkaSpout, etc.
  • Bolts − Logical processing units. Spouts pass data to bolts, and bolts process it and produce a new output stream. Bolts can perform filtering, aggregation, joins, and interaction with data sources and databases. A bolt receives data and emits to one or more bolts. "IBolt" is the core interface for implementing bolts; some of the common interfaces are IRichBolt, IBasicBolt, etc.
  • Topology − Spouts and bolts connected together form a topology. Real-time application logic is specified inside a Storm topology. In simple words, a topology is a directed graph where the vertices are computations and the edges are streams of data.
  • Tasks − In simple words, a task is the execution of a spout or a bolt.
  • Nimbus − The master node of a Storm cluster. All other nodes in the cluster are called worker nodes. The master node is responsible for distributing data among the worker nodes, assigning tasks to them, and monitoring failures.
  • Supervisor − The nodes that follow the instructions given by the nimbus are called supervisors. A supervisor has multiple worker processes and governs them to complete the tasks assigned by the nimbus.
  • Worker process − A worker process executes the tasks of a specific topology. A worker process does not run tasks by itself; instead it creates executors and asks them to perform particular tasks. A worker process can have multiple executors.
  • Executor − An executor is simply a single thread spawned by a worker process. An executor runs one or more tasks, but only for a specific spout or bolt.
  • Task − A task performs the actual data processing. So it is either a spout or a bolt.
  • ZooKeeper framework − Apache ZooKeeper is a service used by a cluster (group of nodes) to coordinate among themselves and maintain shared data with robust synchronization techniques. Nimbus is stateless, so it depends on ZooKeeper to monitor the status of the worker nodes. ZooKeeper also helps the supervisors interact with the nimbus and is responsible for maintaining the state of nimbus and supervisors.
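To make the tuple/field relationship concrete, here is a minimal, self-contained sketch (assuming the storm-core jar is on the classpath; the class name TupleDemo and the sample values are illustrative only, not part of the example later in this note):

    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    // A tuple is an ordered list of values; Fields names the positions.
    public class TupleDemo {
        public static void main(String[] args) {
            Fields schema = new Fields("from", "to", "duration");     // declared once per stream
            Values call = new Values("1234123401", "1234123402", 60); // one tuple matching it
            // fieldIndex() maps a field name back to its position in the tuple.
            int i = schema.fieldIndex("duration");
            System.out.println(schema.get(i) + " = " + call.get(i));  // prints: duration = 60
        }
    }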


  • Stream Grouping (how tuples are distributed among the tasks of a consuming bolt; see the sketch after this list)
    • Shuffle Grouping − tuples are distributed randomly and evenly across the bolt's tasks
    • Fields Grouping − tuples with the same value in the grouping field(s) always go to the same task
    • All Grouping − broadcast: every task of the bolt receives a copy of each tuple
    • Global Grouping − all tuples go to a single task of the bolt
    • None Grouping − currently equivalent to shuffle grouping
    • Direct Grouping − the producer of a tuple decides which task of the consumer receives it
    • Local or Shuffle Grouping − prefer tasks in the same worker process; otherwise fall back to shuffle
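    A minimal sketch of how these groupings are declared on a TopologyBuilder (it reuses the spout and bolt classes from the example later in this note; the "audit" bolt id is a hypothetical placeholder, and the wiring is for illustration only):

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    // Wiring-only sketch: each setBolt(...) call picks one grouping per input.
    public class GroupingDemo {
        public static void main(String[] args) {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("spout", new FakeCallLogReaderSpout());
            builder.setBolt("creator", new CallLogCreatorBolt())
                   .shuffleGrouping("spout");                      // random, even distribution
            builder.setBolt("counter", new CallLogCounterBolt())
                   .fieldsGrouping("creator", new Fields("call")); // same "call" -> same task
            builder.setBolt("audit", new CallLogCounterBolt())
                   .allGrouping("creator");                        // broadcast a copy to every task
            // globalGrouping(), noneGrouping(), directGrouping() and
            // localOrShuffleGrouping() are declared the same way.
        }
    }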

Storm Workflow

  • Local Mode − the topology runs inside a single JVM via LocalCluster; convenient for development and debugging (used in the example below)
  • Production Mode − the topology is packaged as a jar and submitted to a real cluster with StormSubmitter (see the sketch near the end of this note)

Storm Configuration

  • Step 1: Install the JDK and set the JAVA_HOME and CLASSPATH environment variables
  • Step 2: Install ZooKeeper

    Download ZooKeeper and unpack it:

    $ tar xzvf zookeeper-3.5.2-alpha.tar.gz
    $ mv ./zookeeper-3.5.2-alpha /opt/zookeeper
    $ cd /opt/zookeeper
    $ mkdir data

    Create the configuration file:

    $ cd /opt/zookeeper
    $ vim conf/zoo.cfg

    # tickTime is the basic time unit (ms); initLimit and syncLimit are measured in ticks
    tickTime=2000
    dataDir=/path/to/zookeeper/data
    clientPort=2181
    initLimit=5
    syncLimit=2

    Start the ZooKeeper server:

    $ bin/zkServer.sh start

    You can verify that it is running with bin/zkServer.sh status.
  • Step 3: Install and configure Storm

    Download Storm and unpack it:

    $ tar xvfz apache-storm-1.0.2.tar.gz
    $ mv apache-storm-1.0.2 /opt/storm
    $ cd /opt/storm
    $ mkdir data

    Edit the Storm configuration:

    $ cd /opt/storm
    $ vim conf/storm.yaml

    storm.zookeeper.servers:
        - "localhost"
    storm.local.dir: "/path/to/storm/data"   # any local path
    nimbus.host: "localhost"
    supervisor.slots.ports:
        - 6700
        - 6701
        - 6702
        - 6703
    ui.port: 6969

    Note: in Storm 1.0 and later, nimbus.host is deprecated in favour of nimbus.seeds: ["localhost"].

    Start Nimbus:

    $ cd /opt/storm
    $ ./bin/storm nimbus

    Start the Supervisor:

    $ cd /opt/storm
    $ ./bin/storm supervisor

    Start the UI:

    $ cd /opt/storm
    $ ./bin/storm ui

    The UI is then reachable at http://localhost:6969 (the ui.port set above).

Developing a Counting Task on Storm

  • Scenario: count the calls made between mobile phone numbers.

    In the spout, prepare four phone numbers and generate a random number of calls between them.
    Create separate bolts: one to build call-log records and one to do the counting.
    Use a topology to wire the spout and the bolts together.

  • The following program compiles and runs under 64-bit Ubuntu 16.04 with JDK 1.8.
  • Create the spout component
    The spout implements the IRichSpout interface, whose methods are described below:

    open − Provides the spout with an environment to execute. The executors run this method to initialize the spout.
    nextTuple − Emits the generated data through the collector.
    close − Called when the spout is going to shut down.
    declareOutputFields − Declares the output schema of the tuple.
    ack − Acknowledges that a specific tuple has been fully processed.
    fail − Signals that a specific tuple has not been fully processed; a reliable spout can re-emit it.
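    Note that ack and fail are only invoked when the spout emits tuples with a message ID; the example below emits unanchored tuples, so these callbacks stay empty. A minimal sketch of a reliable spout (the class name, the pending map, and the sample values are assumptions for illustration, not part of the example):

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    // Hypothetical reliable spout: emitting with a message id makes Storm
    // call ack()/fail(); failed tuples are kept and replayed.
    public class ReliableDemoSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Map<Object, Values> pending = new ConcurrentHashMap<Object, Values>();

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Values tuple = new Values("1234123401", "1234123402", 42);
            Object msgId = UUID.randomUUID().toString();
            pending.put(msgId, tuple);    // remember until acked
            collector.emit(tuple, msgId); // the msgId enables ack/fail callbacks
        }

        @Override
        public void ack(Object msgId) {
            pending.remove(msgId); // fully processed downstream
        }

        @Override
        public void fail(Object msgId) {
            collector.emit(pending.get(msgId), msgId); // replay the failed tuple
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("from", "to", "duration"));
        }
    }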

    import java.util.*;
    //import storm tuple packages
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;
    //import Spout interface packages
    import org.apache.storm.topology.IRichSpout;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;

    //FakeCallLogReaderSpout implements the IRichSpout interface
    public class FakeCallLogReaderSpout implements IRichSpout {
        //SpoutOutputCollector passes tuples to the bolts.
        private SpoutOutputCollector collector;
        private boolean completed = false;
        //TopologyContext contains topology data.
        private TopologyContext context;
        //Random generator for the fake call data.
        private Random randomGenerator = new Random();
        private Integer idx = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.context = context;
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            if (this.idx <= 1000) {
                List<String> mobileNumbers = new ArrayList<String>();
                mobileNumbers.add("1234123401");
                mobileNumbers.add("1234123402");
                mobileNumbers.add("1234123403");
                mobileNumbers.add("1234123404");

                Integer localIdx = 0;
                while (localIdx++ < 100 && this.idx++ < 1000) {
                    String fromMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
                    String toMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
                    //Re-draw the callee until it differs from the caller
                    //(use equals(), not ==, to compare strings).
                    while (fromMobileNumber.equals(toMobileNumber)) {
                        toMobileNumber = mobileNumbers.get(randomGenerator.nextInt(4));
                    }
                    Integer duration = randomGenerator.nextInt(60);
                    this.collector.emit(new Values(fromMobileNumber, toMobileNumber, duration));
                }
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("from", "to", "duration"));
        }

        //Override the remaining interface methods
        @Override
        public void close() {}

        //Legacy method; not part of IRichSpout in Storm 1.x
        public boolean isDistributed() {
            return false;
        }

        @Override
        public void activate() {}

        @Override
        public void deactivate() {}

        @Override
        public void ack(Object msgId) {}

        @Override
        public void fail(Object msgId) {}

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }
  • Create the bolt components
    A bolt implements the IRichBolt interface, whose methods are described below:

    prepare − Provides the bolt with an environment to execute. The executors run this method to initialize the bolt.
    execute − Processes a single tuple of input.
    cleanup − Called when a bolt is going to shut down.
    declareOutputFields − Declares the output schema of the tuple.
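    On the bolt side, reliability means anchoring each output tuple to its input and then acking the input. A minimal sketch using BaseRichBolt (the class name is hypothetical; the CallLogCreatorBolt below emits unanchored tuples instead):

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Hypothetical bolt showing anchored emit + ack, the reliable counterpart
    // of the unanchored emit used in CallLogCreatorBolt below.
    public class AnchoredDemoBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            String call = input.getString(0) + " - " + input.getString(1);
            // Anchoring to the input lets downstream failures replay from the spout.
            collector.emit(input, new Values(call, input.getInteger(2)));
            collector.ack(input); // this input is done at this bolt
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("call", "duration"));
        }
    }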

    //import util packages
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    //import Storm IRichBolt package
    import org.apache.storm.topology.IRichBolt;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.tuple.Tuple;

    //CallLogCreatorBolt implements the IRichBolt interface
    public class CallLogCreatorBolt implements IRichBolt {
        //OutputCollector collects and emits tuples to produce the output stream
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            String from = tuple.getString(0);
            String to = tuple.getString(1);
            Integer duration = tuple.getInteger(2);
            //Combine caller and callee into a single "call" key
            collector.emit(new Values(from + " - " + to, duration));
        }

        @Override
        public void cleanup() {}

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("call", "duration"));
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }
    The counter bolt tallies how many calls occurred for each "from - to" pair and prints the totals in cleanup():

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.IRichBolt;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.tuple.Tuple;

    public class CallLogCounterBolt implements IRichBolt {
        Map<String, Integer> counterMap;
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.counterMap = new HashMap<String, Integer>();
            this.collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            String call = tuple.getString(0);
            Integer duration = tuple.getInteger(1);
            if (!counterMap.containsKey(call)) {
                counterMap.put(call, 1);
            } else {
                Integer c = counterMap.get(call) + 1;
                counterMap.put(call, c);
            }
            collector.ack(tuple);
        }

        @Override
        public void cleanup() {
            //Print the per-call totals when the topology shuts down
            for (Map.Entry<String, Integer> entry : counterMap.entrySet()) {
                System.out.println(entry.getKey() + " : " + entry.getValue());
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("call"));
        }

        @Override
        public Map<String, Object> getComponentConfiguration() {
            return null;
        }
    }
  • Create the topology and run it on a local cluster

    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;
    //import storm configuration packages
    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;

    //Main class LogAnalyserStorm builds and submits the topology.
    public class LogAnalyserStorm {
        public static void main(String[] args) throws Exception {
            //Create a Config instance for cluster configuration
            Config config = new Config();
            config.setDebug(true);

            //Wire the spout and bolts together
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("call-log-reader-spout", new FakeCallLogReaderSpout());
            builder.setBolt("call-log-creator-bolt", new CallLogCreatorBolt())
                   .shuffleGrouping("call-log-reader-spout");
            builder.setBolt("call-log-counter-bolt", new CallLogCounterBolt())
                   .fieldsGrouping("call-log-creator-bolt", new Fields("call"));

            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("LogAnalyserStorm", config, builder.createTopology());
            //Let the topology run for ten seconds, then stop it
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }
  • Remote mode: the topology is submitted to a real cluster (see the sketch below). For distributed RPC, see
    http://storm.apache.org/releases/current/Distributed-RPC.html
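    In remote (production) mode the same topology is submitted to a running cluster with StormSubmitter instead of LocalCluster. A minimal sketch (the class name, worker count, and jar name are assumptions):

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    // Hypothetical production-mode variant of LogAnalyserStorm: same wiring,
    // but the topology runs on the cluster configured in storm.yaml.
    public class LogAnalyserStormProd {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("call-log-reader-spout", new FakeCallLogReaderSpout());
            builder.setBolt("call-log-creator-bolt", new CallLogCreatorBolt())
                   .shuffleGrouping("call-log-reader-spout");
            builder.setBolt("call-log-counter-bolt", new CallLogCounterBolt())
                   .fieldsGrouping("call-log-creator-bolt", new Fields("call"));

            Config config = new Config();
            config.setNumWorkers(2); // assumed worker count
            //Package the classes as a jar and launch with:
            //  storm jar my-example.jar LogAnalyserStormProd
            StormSubmitter.submitTopology("LogAnalyserStorm", config, builder.createTopology());
        }
    }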
  • Compile and run the application

    $ cd /opt/storm/my-example
    $ javac -cp "/opt/storm/lib/*" *.java
    $ java -cp "/opt/storm/lib/*:." LogAnalyserStorm

    (The Storm jars under /opt/storm/lib must be on the classpath both to compile and to run.)
  • Output

    1234123402 - 1234123401 : 78
    1234123402 - 1234123404 : 88
    1234123402 - 1234123403 : 105
    1234123401 - 1234123404 : 74
    1234123401 - 1234123403 : 81
    1234123401 - 1234123402 : 81
    1234123403 - 1234123404 : 86
    1234123404 - 1234123401 : 63
    1234123404 - 1234123402 : 82
    1234123403 - 1234123402 : 83
    1234123404 - 1234123403 : 86
    1234123403 - 1234123401 : 93

References

Storm website: http://storm.apache.org
Tutorial: https://www.tutorialspoint.com/apache_storm/index.htm
Storm Javadoc: http://storm.apache.org/releases/current/javadocs/index.html

  • PDF books
    "Storm Applied"
    "Getting Started with Storm"
    "Storm Real-time Processing Cookbook"
    "Learning Storm"
    "Storm Blueprints: Patterns for Distributed Real-time Computation"
    "Hadoop: The Definitive Guide"


Reposted from wangyuxxx.iteye.com/blog/2342471