Understanding Storm

Storm is a free, open source, distributed, and fault-tolerant real-time computing system.

Relevant sample projects: Leek - Simple Real-Time Smart Stock Picking Platform

1. Storm cluster architecture

  • Nimbus: the master node of the Storm cluster. It distributes user code and assigns it to Worker processes on specific Supervisor nodes, which run the Tasks of the Topology's components (Spouts/Bolts).

  • Supervisor: a slave node of the Storm cluster. It manages the startup and termination of every Worker process running on that node. The supervisor.slots.ports configuration item in Storm's configuration file specifies the maximum number of slots allowed on a Supervisor; each slot is uniquely identified by a port number, and each port number corresponds to one Worker process (if that Worker process is started). See the storm.yaml fragment after this list.

  • ZooKeeper: coordinates Nimbus and the Supervisors. If a Supervisor cannot run its part of the Topology because of a fault, Nimbus detects it immediately and reassigns that Topology to run on other available Supervisors.
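For example, a storm.yaml fragment on a Supervisor node declaring four slots might look like the following (6700-6703 are Storm's default slot ports; an actual deployment may configure different ones):

supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703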

2. Storm component abstraction

The Tasks corresponding to the Spouts/Bolts of a Topology may be distributed across multiple Workers on multiple Supervisors, and each Worker contains multiple Executors; these are computed and allocated at runtime according to the Topology's actual configuration.

  • Topology: Storm's abstraction of a distributed computing application; its purpose is to accomplish one thing (from a business point of view) through a single Topology implementation. A Topology is composed of a set of static program components (Spouts/Bolts) plus the stream groupings that connect them.
  • Spout: describes how data enters the Storm cluster from an external system (or is generated directly inside the component) to be processed by the Topology the Spout belongs to. A Spout usually reads data from a data source or does only simple processing (to avoid slowing the continuous, real-time flow of data into the system, it is generally not recommended to put complex processing logic here). A minimal spout sketch follows this list.
  • Bolt: Describes business-related processing logic.
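As a rough illustration of the Spout role, here is a minimal sketch, assuming the org.apache.storm package names (pre-1.0 releases used backtype.storm); RandomWordSpout and its word list are made up for the example and simply stand in for "reading from a data source":

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical spout: emits random words, standing in for a real data source
public class RandomWordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] words = {"storm", "spout", "bolt"};
    private final Random random = new Random();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector; // keep the collector for use in nextTuple()
    }

    @Override
    public void nextTuple() {
        // keep this lightweight: pull one record and emit it, no heavy processing here
        collector.emit(new Values(words[random.nextInt(words.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}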

The concepts above describe static things (components): once we have written a Topology, these components exist only in a static form. Next, let's look at the dynamic components (concepts) that are created once the Topology is submitted and running:

  • Task: the runtime instances of a Spout/Bolt are called Tasks. A Spout/Bolt may correspond to one or more Spout Tasks/Bolt Tasks at runtime, depending on the configuration chosen when the Topology was written.
  • Worker: the top-level container in which Tasks run at runtime. Executors run inside Workers, and each Worker corresponds to one JVM instance created on a Supervisor.
  • Executor: the direct container in which Tasks run at runtime; a Task's processing logic is executed inside its Executor. One or more Executor instances can run in the same Worker process, and one or more Tasks can run in the same Executor. So on top of Worker-level parallelism, Executors run in parallel, and Tasks can in turn be computed in parallel within Executors.

3. Storm's stream grouping strategies

The most important abstraction in Storm is arguably the stream grouping, which controls how the Tasks of a Spout/Bolt distribute Tuples, that is, which Tasks of the destination Bolt each Tuple is delivered to. The built-in strategies are listed below, followed by a short wiring sketch:

  • Shuffle grouping: randomly distributes tuples across the target bolt's tasks such that each task receives an equal number of tuples.
  • Fields grouping: routes tuples to bolt tasks based on the values of the fields specified in the grouping. For example, if a stream is grouped on the "word" field, tuples with the same value for the "word" field will always be routed to the same bolt task.
  • All grouping: replicates the tuple stream across all bolt tasks, so that each task receives a copy of every tuple.
  • Global grouping: routes all tuples in a stream to a single task, choosing the task with the lowest task ID. Note that setting a parallelism hint or number of tasks on a bolt is meaningless with the global grouping, since all tuples are routed to the same task. Use it with caution: it routes all tuples to a single JVM instance and can create a bottleneck or overwhelm a specific JVM/machine in the cluster.
  • None grouping: functionally equivalent to the shuffle grouping; it is reserved for future use.
  • Direct grouping: with a direct grouping, the source decides which task receives a given tuple by calling the emitDirect() method. It can only be used on streams that have been declared as direct streams. (The tuple's producer decides which downstream Bolt Task receives it, so this must be controlled explicitly in the Bolt code you write.)
  • Local or shuffle grouping: similar to the shuffle grouping, but shuffles tuples only among bolt tasks running in the same worker process, if there are any; otherwise it falls back to the shuffle grouping behavior. Depending on a topology's parallelism, the local or shuffle grouping can improve performance by limiting network transfer.
  • Custom grouping: you can define your own stream grouping by implementing the CustomStreamGrouping interface.
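To make the groupings concrete, here is a small wiring sketch, again assuming org.apache.storm packages; RandomWordSpout is the spout sketched earlier, while WordCountBolt and PrinterBolt are hypothetical bolts used only for illustration:

import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-spout", new RandomWordSpout(), 2);

// Fields grouping: tuples with the same "word" value always reach the same count-bolt task
builder.setBolt("count-bolt", new WordCountBolt(), 4)
       .fieldsGrouping("word-spout", new Fields("word"));

// Shuffle grouping: counts are spread evenly across the printer-bolt tasks
builder.setBolt("printer-bolt", new PrinterBolt(), 2)
       .shuffleGrouping("count-bolt");

// The other strategies hang off the same declarer, e.g. .allGrouping("count-bolt"),
// .globalGrouping("count-bolt"), .localOrShuffleGrouping("count-bolt"),
// or .customGrouping("count-bolt", new MyGrouping()) for a CustomStreamGrouping implementation.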

4. Topology parallelism calculation

An example from the official documentation:

conf.setNumWorkers(2); // this Topology runs in 2 Worker processes on the Supervisor nodes
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // parallelism hint 2, so 2 * 1 = 2 Tasks
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2)
           .setNumTasks(4)
           .shuffleGrouping("blue-spout"); // parallelism hint 2 with 4 Tasks configured, so 4 Tasks
topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6)
           .shuffleGrouping("green-bolt"); // parallelism hint 6, so 6 * 1 = 6 Tasks

Now let's look at how Storm computes this Topology's runtime parallelism and distributes it across the 2 Workers:

  • Total number of Tasks: 2 × 1 + 4 + 6 × 1 = 12 (12 Task instances are created in total)
  • Runtime parallelism of the Topology: 10 / 2 = 5 (the parallelism hints add up to 2 + 2 + 6 = 10 Executors, so each Worker gets 5 Executors)
  • The 12 Tasks are assigned to the 5 × 2 = 10 Executors in the 2 Workers: each Worker runs 5 Executors, and 6 Tasks are assigned to those 5 Executors
  • Each Worker is assigned 6 Tasks: 3 Yellow Tasks, 2 Green Tasks, and 1 Blue Task
  • Storm's internal optimization: Tasks of the same type are placed in the same Executor whenever possible
  • Assignment process: start with the component that has the fewest Tasks. The 1 Blue Task can only occupy one Executor (1 Executor used); the 2 Green Tasks can share one Executor (2 Executors used so far); finally, the remaining 3 Yellow Tasks fit into the remaining 5 − 2 = 3 Executors, one Yellow Task per Executor

5. Bolt lifecycle

A Bolt is a component that takes tuples as input and produces new tuples as output. To implement a bolt you usually implement the IRichBolt interface. Bolt objects are created on the client machine, serialized into the topology, and submitted to the cluster's master node. The cluster then launches worker processes that deserialize the bolt, call prepare(), and finally start processing tuples. The key methods are listed below, followed by a minimal implementation.

// Declare the bolt's output schema
declareOutputFields(OutputFieldsDeclarer declarer)
// Called once, just before the bolt starts processing tuples
prepare(java.util.Map stormConf, TopologyContext context, OutputCollector collector)
// Process a single input tuple
execute(Tuple input)
// Called when the bolt is about to shut down
cleanup()
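Putting the four methods together, here is a minimal bolt sketch, assuming the org.apache.storm package names; UpperCaseBolt is a hypothetical example that simply upper-cases the incoming "word" field:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt: upper-cases the "word" field of each incoming tuple
public class UpperCaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector; // called once on the worker, after deserialization
    }

    @Override
    public void execute(Tuple input) {
        // emit anchored to the input tuple, then ack it so the tuple tree can complete
        collector.emit(input, new Values(input.getStringByField("word").toUpperCase()));
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }

    @Override
    public void cleanup() {
        // release any resources; note this is not guaranteed to run on a cluster kill
    }
}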

6. The lifecycle of a topology on Storm is as follows:

  1. Upload the code and validate it (/data/nimbus/inbox);
  2. Create the local directory (/data/nimbus/stormdist/topology-id/);
  3. Create the heartbeat directory on ZooKeeper;
  4. Compute the topology's workload (parallelism hints), assign task ids, and write them to ZooKeeper;
  5. Assign the tasks to Supervisors for execution;
  6. Each Supervisor periodically checks for new tasks, downloads new code, deletes old code, and hands the remaining work to its Worker processes;
  7. Each Worker picks up its tasks, looks at which spouts/bolts they contain, then works out which tasks it needs to send messages to and establishes the connections;
  8. When Nimbus terminates the topology, it deletes the related information from ZooKeeper.
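This sequence is triggered when a client submits the topology. A minimal submission fragment might look like the following, reusing the topologyBuilder from the example in section 4; "my-topology" is an illustrative name:

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;

Config conf = new Config();
conf.setNumWorkers(2);
// submitting uploads the jar to Nimbus (step 1 above) and kicks off the rest of the lifecycle
StormSubmitter.submitTopology("my-topology", conf, topologyBuilder.createTopology());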

7. Reliable message processing

Internally, Storm uses a clever XOR algorithm to determine whether each tuple has been completely and correctly processed.

  1. When a Spout Task creates a Tuple, i.e. inside the Spout's nextTuple() method where data is read from a particular data source, it communicates with the Acker, sending it a message; the Acker stores that Tuple's information: {:spout-task task-id :val ack-val}.
  2. When a Bolt emits a new child Tuple, it records the relationship between the child Tuple and its parent Tuple.
  3. When a Bolt acks, it computes the XOR of the parent Tuple and all the child Tuples generated from that parent, and sends this single value to the Acker (the XOR value: tuple-id ^ (child-tuple-id1 ^ child-tuple-id2 … ^ child-tuple-idN)). Note that the Bolt does not send all of the generated child Tuples to the Acker, which would be far more data than a single XOR value; sending only the XOR value greatly reduces the network overhead between the Bolt and the Acker.
  4. When the Acker receives the XOR value from a Bolt, it XORs it with the ack-val it currently holds for that task-id. Since the tuple-id and the initial ack-val are the same, their XOR is 0, but the child-tuple-ids all differ from one another, so ack-val only returns to 0 once every child Tuple has been acked, which indicates that the whole Tuple tree has been processed successfully (see the toy sketch after this list). Whether it succeeds or fails, the entry is eventually removed from the queue maintained by the Acker.
  5. Finally, the Acker notifies the Task of the Spout that produced the original parent Tuple of the success or failure by calling back the Spout's ack or fail method. If we override ack and fail when implementing the Spout, this is where that logic runs.
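A toy illustration of the Acker's XOR bookkeeping (this is not Storm source code; the ids are made up): anchoring XORs a tuple id into ack-val, and acking XORs it out again, so ack-val returns to 0 exactly when every tuple in the tree has been both anchored and acked:

// Toy XOR bookkeeping, with made-up 64-bit ids
long ackVal = 0L;
long rootId = 0x1debd0b4L;
long childA = 0x5f3a91c2L;
long childB = 0x7c44e1a9L;

ackVal ^= rootId;                    // spout emits the root tuple
ackVal ^= rootId ^ childA ^ childB;  // bolt acks the root and anchors two children
ackVal ^= childA;                    // downstream bolt acks child A
ackVal ^= childB;                    // downstream bolt acks child B

System.out.println(ackVal == 0);     // true: the whole tuple tree completed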

Of course, this XOR scheme has a 1/2^64 probability of a false positive, which is negligible.
During development, for messages that must not be lost, specify a messageId when emitting the tuple and anchor downstream emits to it; this tells the tuple tree that a new node has been added and guarantees reliable processing.

collector.emit(tuple, messageId); // reliable message
collector.emit(tuple); // unreliable message

collector.emit(tuple, new Values(word)); // anchored emit, reliable message
collector.emit(new Values(word)); // unanchored emit, unreliable message

Note: bolts built on BaseBasicBolt are reliable by default; you do not need to anchor emits yourself or call the ack and fail methods. The spout side is sketched below.
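On the spout side, reliability means emitting with a messageId and overriding ack/fail. Here is a minimal sketch, assuming org.apache.storm packages; ReliableSentenceSpout, its fixed sentence, and the in-memory pending map are all made up for illustration (a real spout would track state against its data source):

import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

// Hypothetical reliable spout: caches each emitted message by id so fail() can replay it
public class ReliableSentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final Map<String, Values> pending = new ConcurrentHashMap<>();

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        String msgId = UUID.randomUUID().toString();
        Values tuple = new Values("the cow jumped over the moon");
        pending.put(msgId, tuple);
        collector.emit(tuple, msgId); // emitting with a messageId makes the tuple tracked
    }

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);        // fully processed, safe to forget
    }

    @Override
    public void fail(Object msgId) {
        collector.emit(pending.get(msgId), msgId); // replay the failed message
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}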

8. Storm's fault-tolerance mechanisms

1. Task-level fault tolerance

  • A Bolt task crashes, so messages go unacknowledged. All messages associated with that Bolt task in the acker eventually fail by timeout, and the corresponding Spout's fail method is called.
  • The acker task fails. All messages it was tracking before the failure time out and fail, and the Spout's fail method is called.
  • The Spout task fails. In this case, the external system the Spout connects to (such as an MQ) is responsible for message integrity. For example, when a client fails, a Kestrel queue puts all messages in the pending state back into the queue.

2. Task slot failures

  • Worker failure. Each Worker contains several Bolt (or Spout) tasks. The Supervisor monitors these tasks; when a worker fails, the Supervisor tries to restart it on the same machine. If it fails repeatedly at startup and cannot send heartbeats to Nimbus, Nimbus reassigns the worker to another host.
  • Supervisor failure. The Supervisor is stateless (all state is kept in ZooKeeper or on disk) and fail-fast (the process kills itself whenever it hits an unexpected situation), so a Supervisor failure does not affect the tasks that are currently running; it just needs to be restarted promptly.
  • Nimbus failure. Nimbus is also stateless and fail-fast, so a Nimbus failure does not affect currently running tasks, but no new topologies can be submitted while it is down; it just needs to be restarted promptly.

3. Cluster node (machine) failures:

  • A node in the Storm cluster fails. Nimbus moves all tasks that were running on that machine to other available machines.
  • A node in the ZooKeeper cluster fails. ZooKeeper keeps the system running as long as fewer than half of its machines are down; simply repair the failed machine in time.

9. Storm's DRPC Server

The overall workflow of the DRPC Server:
(1) Receive an RPC request
(2) Send the request to a Storm Topology
(3) Perform the corresponding processing
(4) Return the result to the client
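From a client's point of view, calling a DRPC function can be sketched as follows, assuming the org.apache.storm.utils.DRPCClient API (constructor signatures differ between Storm versions), the default DRPC port 3772, and an illustrative function name "reach" registered by some running DRPC topology; "drpc-server-host" is a placeholder:

import org.apache.storm.Config;
import org.apache.storm.utils.DRPCClient;

Config conf = new Config();
DRPCClient client = new DRPCClient(conf, "drpc-server-host", 3772);
String result = client.execute("reach", "http://example.com"); // blocks until the topology returns
System.out.println(result);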
