A detailed explanation of Storm's transactional batch processing

Transactions: Storm's fault-tolerance mechanism uses a system-level component, the acker, together with an XOR verification mechanism to determine whether a tuple has been fully processed. If not, the spout resends the tuple, which guarantees that in the event of an error every tuple is processed at least once.
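The XOR check can be illustrated with a minimal sketch (this is a simplification of the idea, not Storm's actual acker code): every tuple id is XORed into a running value once when the tuple is emitted and once when it is acked, so the value returns to zero exactly when every tuple in the tree has been acked.

```java
// Minimal sketch of the acker's XOR idea (illustrative, not Storm internals).
public class XorAckerSketch {
    private long ackVal = 0;

    public void emitted(long tupleId) { ackVal ^= tupleId; } // XOR in on emit
    public void acked(long tupleId)   { ackVal ^= tupleId; } // XOR in again on ack
    public boolean fullyProcessed()   { return ackVal == 0; } // zero => all acked

    public static void main(String[] args) {
        XorAckerSketch acker = new XorAckerSketch();
        acker.emitted(7L);
        acker.emitted(9L);             // two tuples in flight
        acker.acked(7L);
        System.out.println(acker.fullyProcessed()); // false: 9L still pending
        acker.acked(9L);
        System.out.println(acker.fullyProcessed()); // true: tuple tree complete
    }
}
```

Because `x ^ x == 0`, the check works regardless of the order in which acks arrive.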

In scenarios that require exact counts, such as totaling sales amounts, we want each tuple to be "processed exactly once". Storm 0.7.0 introduced the Transactional Topology, which guarantees that each tuple is processed once and only once, so that we can build counting applications that are both accurate and highly fault tolerant.

Batch processing: running one transaction per tuple is inefficient, because handling each tuple individually adds a great deal of overhead, such as too-frequent database writes and result output. Storm therefore introduces batch processing: a batch of tuples is processed in a single transaction. The transaction guarantees that the batch is either processed successfully as a whole or, if processing fails, not counted at all; Storm resends the failed batch and ensures that each batch is processed exactly once.

The principle of the transaction mechanism: in Storm transaction processing, the computation of a batch is divided into two phases, processing and committing:

Processing phase: multiple batches can be computed in parallel.
Committing phase: batches are forced to commit in strict order.

Example: in the processing phase, multiple batches can be computed in parallel. In the example above, bolt2 is an ordinary batch bolt (implementing IBatchBolt), so multiple batches can be executed in parallel across bolt2 tasks.
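The two-phase model above can be sketched in plain Java (an assumed simplification, not Storm's scheduler): batches are processed concurrently, but commits are collected strictly in ascending txid order, so a later batch that finishes early still waits its turn.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the processing/committing pipeline (illustrative model only).
public class PipelinedCommitSketch {
    public static List<Integer> run(int batches) {
        ExecutorService pool = Executors.newFixedThreadPool(batches);
        List<Future<Integer>> processed = new ArrayList<>();
        for (int txid = 1; txid <= batches; txid++) {
            final int id = txid;
            // Processing phase: all batches compute concurrently; later
            // batches are made to finish earlier on purpose here.
            processed.add(pool.submit(() -> {
                Thread.sleep(Math.max(0, 50 - id * 10));
                return id;
            }));
        }
        List<Integer> commitOrder = new ArrayList<>();
        try {
            // Committing phase: wait for txid 1, then 2, ... regardless of
            // which batch happened to finish processing first.
            for (Future<Integer> f : processed) commitOrder.add(f.get());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        pool.shutdown();
        return commitOrder;
    }

    public static void main(String[] args) {
        System.out.println(run(4)); // commits in txid order: [1, 2, 3, 4]
    }
}
```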











Committing phase: batches are forced to commit in strict order. In the figure above, bolt3 implements IBatchBolt and is marked as requiring transactional processing (it implements the ICommitter interface, or is added to the topology through the setCommitterBolt method of TransactionalTopologyBuilder). Storm then considers the batch committable, and when finishBatch is called, the txid comparison and state persistence are performed inside finishBatch. In the example, batch 2 must wait for batch 1 to commit before it can commit.


For exactly-once processing: in principle, the txid must be carried along when a tuple is sent. When transactional processing is required, the txid is used to decide whether the tuple has already been processed successfully; naturally, the txid must be saved together with the processing result.
In transactional batch processing, a batch of tuples is assigned one txid. To improve the parallelism of batch processing, Storm adopts a pipelined processing model, so that multiple transactions can execute in parallel while commits remain strictly ordered.
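The "save the txid together with the result" trick can be sketched as follows (a minimal in-memory stand-in for the real external store): if the stored txid equals the incoming one, the batch is a replay and is skipped, which makes the count update idempotent.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of exactly-once counting via txid-stamped state (illustrative).
public class TxidStateSketch {
    static class State { long count; long lastTxid = -1; }
    private final Map<String, State> db = new HashMap<>(); // stands in for a real store

    public void applyBatch(String key, long txid, long batchCount) {
        State s = db.computeIfAbsent(key, k -> new State());
        if (txid == s.lastTxid) return;  // replayed batch: already counted, skip
        s.count += batchCount;
        s.lastTxid = txid;               // txid saved together with the result
    }

    public long count(String key) { return db.get(key).count; }

    public static void main(String[] args) {
        TxidStateSketch store = new TxidStateSketch();
        store.applyBatch("sales", 1, 100);
        store.applyBatch("sales", 1, 100); // replay of txid 1: ignored
        store.applyBatch("sales", 2, 50);
        System.out.println(store.count("sales")); // 250, not 350
    }
}
```

Because commits are strictly ordered, comparing against the last applied txid is sufficient; no full history of txids needs to be kept.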



Transaction Topology Implementation
1. Transaction spout implementation

A transactional spout needs to implement ITransactionalSpout, which contains two inner interfaces, Coordinator and Emitter. When the topology runs, the transactional spout itself contains a sub-topology, structured as in the following diagram:


There are two kinds of tuples here: transactional tuples, and the tuples within a batch.
The Coordinator starts a transaction and prepares to emit a batch; this enters the processing phase of the transaction and emits a transactional tuple (transactionAttempt & metadata) onto the "batch emit" stream.
The Emitter subscribes to the Coordinator's "batch emit" stream with an all grouping (broadcast) and is responsible for actually emitting the tuples of each batch. Each emitted tuple must carry the TransactionAttempt as its first field; Storm uses this field to determine which batch a tuple belongs to.
There is only one Coordinator, while the Emitter can have multiple instances depending on the parallelism.
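The coordinator/emitter split can be sketched like this (an assumed simplification of the broadcast, with hypothetical partition lists standing in for real data sources): the single coordinator creates one attempt per batch, every emitter partition receives it, and each emitted tuple carries the attempt as its first field.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the coordinator/emitter structure (illustrative, not Storm's API).
public class SpoutTopologySketch {
    record Attempt(long txid, long attemptId) {}
    record Tuple(Attempt attempt, String value) {}

    public static List<Tuple> emitBatch(long txid, long attemptId,
                                        List<List<String>> partitions) {
        // Coordinator side: one attempt per batch, broadcast to all emitters.
        Attempt attempt = new Attempt(txid, attemptId);
        List<Tuple> batch = new ArrayList<>();
        // Emitter side: each partition (one emitter instance) emits its tuples,
        // tagging every tuple with the attempt as the first field.
        for (List<String> partition : partitions)
            for (String v : partition)
                batch.add(new Tuple(attempt, v));
        return batch;
    }

    public static void main(String[] args) {
        List<Tuple> batch = emitBatch(1, 0,
                List.of(List.of("a", "b"), List.of("c")));
        System.out.println(batch.size()); // 3 tuples, all tagged with txid 1
    }
}
```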


TransactionAttempt and metadata
TransactionAttempt contains two values: a transaction id (txid) and an attempt id. The txid is unique per batch, as described above, and stays the same no matter how many times the batch is replayed.
The attempt id is a unique id for each attempt at a batch: for the same batch, the attempt id after a replay differs from the one before. You can think of the attempt id as a replay count; Storm uses it to distinguish different versions of the tuples emitted by the same batch.

The metadata contains the point from which the current transaction can replay its data. It is stored in ZooKeeper, and the spout serializes and deserializes the metadata from ZooKeeper via Kryo.
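The txid/attempt-id relationship can be shown with a tiny sketch (illustrative only; the counter stands in for Storm's attempt tracking): replaying the same batch keeps the txid but yields a fresh attempt id.

```java
// Sketch: txid is stable across replays, attempt id is not (illustrative).
public class AttemptSketch {
    static long nextAttempt = 0; // stands in for Storm's attempt bookkeeping

    // Returns {txid, attemptId} for one emission of the batch.
    static long[] emit(long txid) {
        return new long[]{ txid, nextAttempt++ };
    }

    public static void main(String[] args) {
        long[] first  = emit(5); // batch 5, first attempt
        long[] replay = emit(5); // same batch replayed after a failure
        System.out.println(first[0] == replay[0]); // true: same txid
        System.out.println(first[1] == replay[1]); // false: different attempt id
    }
}
```

Downstream state that was written under an old attempt id can thereby be told apart from tuples of the replayed version of the same batch.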

Transaction internal processing flow chart



2. Transactional Bolt
BaseTransactionalBolt
To process a batch of tuples, the execute method is called once for each tuple, and the finishBatch method is called when the whole batch has been processed. If the BatchBolt is marked as a committer, finishBatch is only called during the commit phase. Storm guarantees that the commit phase of a batch executes only after the previous batch has committed successfully, and it retries until every bolt in the topology has completed the commit. So how does a bolt know that the batch is complete, i.e. that it has received and processed all tuples of the batch? Internally, this is handled by the CoordinatedBolt model.

The specific principle of CoordinatedBolt is as follows:

Each CoordinatedBolt records two values: which tasks have sent tuples to it (based on the topology's grouping information), and which tasks it will send tuples to (also based on the grouping information).

After all its tuples have been sent, a CoordinatedBolt uses emitDirect on a separate special stream to tell every task it has sent tuples to how many tuples it sent to that task. The downstream task compares this number with the number of tuples it has actually received; when the two are equal, it knows that all tuples of the batch have been processed.

The downstream CoordinatedBolt then repeats the above steps to notify its own downstream.
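The completion check described above can be sketched as follows (an assumed simplification: upstream counts are simply accumulated, whereas the real CoordinatedBolt also tracks which upstream tasks have reported):

```java
// Sketch of the CoordinatedBolt batch-completion check (illustrative).
public class CoordinatedBoltSketch {
    private long received = 0;
    private long expected = -1; // unknown until upstream tasks report their counts

    public void onTuple() { received++; } // a normal batch tuple arrived

    // An upstream task reports, on the special stream, how many tuples it sent us.
    public void onSentCount(long count) {
        expected = (expected < 0 ? 0 : expected) + count;
    }

    public boolean batchComplete() {
        return expected >= 0 && received == expected;
    }

    public static void main(String[] args) {
        CoordinatedBoltSketch bolt = new CoordinatedBoltSketch();
        bolt.onTuple(); bolt.onTuple(); bolt.onTuple();
        System.out.println(bolt.batchComplete()); // false: no counts reported yet
        bolt.onSentCount(2); // upstream task A sent 2 tuples
        bolt.onSentCount(1); // upstream task B sent 1 tuple
        System.out.println(bolt.batchComplete()); // true: 3 received == 3 expected
    }
}
```

Once its own batch is complete, the bolt would in turn report its sent counts downstream, propagating the check through the topology.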






