Introduction to Storm concepts

This article introduces the following basic concepts of Storm.


Topologies  

The logic of a real-time application is packaged into a Storm topology. A Storm topology is analogous to a MapReduce job; the key difference is that a MapReduce job eventually finishes, while a topology runs forever (unless it is killed). A topology is a directed acyclic graph (DAG) of Spouts and Bolts connected by stream groupings.
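
As a minimal sketch of the above, the snippet below wires one spout and one bolt into a topology and runs it on an in-process LocalCluster. The RandomSentenceSpout and SplitSentenceBolt class names are hypothetical placeholders, sketched in the Spouts and Bolts sections below.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.topology.TopologyBuilder;

    public class DemoTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // Nodes of the DAG: a spout feeding a bolt.
            builder.setSpout("sentences", new RandomSentenceSpout());
            builder.setBolt("split", new SplitSentenceBolt())
                   .shuffleGrouping("sentences"); // the stream grouping is the DAG edge

            // A topology runs until it is killed, so this demo kills it
            // after ten seconds and shuts the local cluster down.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("demo", new Config(), builder.createTopology());
            Thread.sleep(10_000);
            cluster.killTopology("demo");
            cluster.shutdown();
        }
    }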


Streams

A stream is the core abstraction in Storm. A stream is an unbounded sequence of tuples that is created and processed in parallel in a distributed fashion. A stream is defined by a schema that names the fields in the stream's tuples. By default, tuples can contain integers, longs, shorts, bytes, strings, doubles, floats, booleans, and byte arrays. You can also define your own serializers so that custom types can be used in tuples.

Each stream is assigned an ID when it is declared. Since spouts and bolts that emit only a single stream are common, OutputFieldsDeclarer has convenience methods for declaring a single stream without specifying an ID; such a stream is assigned the default ID, "default".
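
As an illustration, here is a hedged sketch of declaring streams from inside a component; the class name and field names are hypothetical, and only declareOutputFields matters here.

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;

    // Hypothetical component; the processing methods are left empty.
    public class StreamDeclarationExample extends BaseRichBolt {
        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) { }

        @Override
        public void execute(Tuple input) { }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Anonymous declaration: this stream gets the ID "default"
            // and a schema naming two tuple fields.
            declarer.declare(new Fields("user-id", "tweet"));
            // A second, explicitly named stream with its own schema.
            declarer.declareStream("retweets", new Fields("user-id", "retweet-count"));
        }
    }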


Spouts

A spout is a source of streams in a topology. Usually a spout reads tuples from an external data source (such as a Kestrel queue or the Twitter API) and emits them into the topology. Spouts can be reliable or unreliable: a reliable spout can replay a tuple if Storm fails to process it, while an unreliable spout forgets a tuple as soon as it has been emitted.

A spout can emit more than one stream. Multiple streams are declared using the declareStream method of OutputFieldsDeclarer, and the stream to emit to is specified in the emit method of SpoutOutputCollector.

The main method on a spout is nextTuple. nextTuple either emits a new tuple into the topology or simply returns if there is nothing to emit. nextTuple must never block in any spout implementation, because Storm calls all of the spout's methods on the same thread.

The other important methods on a spout are ack and fail. Storm calls these when it detects that a tuple emitted by the spout was either fully processed or failed. ack and fail are only called for reliable spouts. For more information, see the Javadoc.
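
Putting those pieces together, here is a minimal sketch of a reliable spout (the class name and sentence data are hypothetical): nextTuple emits without blocking, the message ID makes the tuple trackable, and ack/fail receive the outcome.

    import java.util.Map;
    import java.util.UUID;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    // Hypothetical spout that emits sentences from a fixed in-memory list.
    public class RandomSentenceSpout extends BaseRichSpout {
        private static final String[] SENTENCES = {
            "the cow jumped over the moon",
            "an apple a day keeps the doctor away"
        };
        private SpoutOutputCollector collector;
        private int index = 0;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            // Must not block: emit one tuple (or nothing) and return quickly.
            Utils.sleep(100);
            String sentence = SENTENCES[index++ % SENTENCES.length];
            // Supplying a message ID makes this a reliable emit: Storm will
            // report the tuple's fate back through ack or fail.
            collector.emit(new Values(sentence), UUID.randomUUID().toString());
        }

        @Override
        public void ack(Object msgId) {
            // The tuple tree rooted at msgId completed successfully.
        }

        @Override
        public void fail(Object msgId) {
            // Processing failed or timed out; a real spout would replay msgId.
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }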


Bolts

All business processing in a topology is done in bolts. Bolts can do almost anything: filtering, functions, aggregations, joins, talking to databases, and more.

Bolts can do simple stream transformations. Complex stream transformations usually require multiple steps, and therefore multiple bolts working together. For example, transforming a stream of tweets into a stream of trending images requires at least two steps: a bolt to keep a rolling count of favorites for each image, and one or more bolts to emit the top X images (this particular transformation can be done in a more scalable way with three bolts instead of two).

A bolt can emit more than one stream. Multiple streams are declared using the declareStream method of OutputFieldsDeclarer, and the stream to emit to is specified in the emit method of OutputCollector.

When you declare a bolt's input streams, you always subscribe to specific streams of other components. If you want to subscribe to all the streams of another component, you must subscribe to each one individually. InputDeclarer has syntactic sugar for subscribing to streams declared with the default stream ID: declarer.shuffleGrouping("1") subscribes to the default stream of component "1" and is equivalent to declarer.shuffleGrouping("1", DEFAULT_STREAM_ID).

The main method on a bolt is execute, which is called whenever a new tuple arrives as input. Bolts emit new tuples using an OutputCollector object. A bolt must call the ack method on the OutputCollector for every tuple it processes, so that Storm knows when the tuple is complete (and can eventually determine that acking the original spout tuple is safe). For the common case of processing an input tuple, emitting zero or more tuples based on it, and then acking it, Storm provides the IBasicBolt interface, which performs the acking automatically.
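
A minimal sketch of that common case (the SplitSentenceBolt name is hypothetical): BaseBasicBolt implements IBasicBolt, so each tuple emitted inside execute is automatically anchored to the input tuple, and the input is acked when execute returns.

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Hypothetical bolt that splits sentences into words; anchoring and
    // acking are handled for us by the IBasicBolt machinery.
    public class SplitSentenceBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }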

It is perfectly fine to launch new threads in a bolt to process tuples asynchronously; OutputCollector is thread-safe and can be called at any time.


Stream groupings

Part of defining a topology is specifying, for each bolt, which streams it receives as input. A stream grouping defines how a stream is partitioned among the bolt's tasks.

There are eight built-in stream groupings in Storm. You can also define a custom stream grouping by implementing the CustomStreamGrouping interface.

Shuffle grouping: tuples are randomly distributed across the bolt's tasks in such a way that each task is guaranteed to receive an equal share of tuples.

Fields grouping: the stream is partitioned by the fields specified in the grouping. For example, if a stream is grouped by the "user-id" field, tuples with the same "user-id" always go to the same task, while tuples with different "user-id"s may go to different tasks.

Partial Key grouping: the stream is partitioned by the fields specified in the grouping, much like a fields grouping, but the load is balanced between two downstream bolts, which gives better resource utilization when the incoming data is skewed. This paper explains how it works and the advantages it provides.

All grouping: the stream is replicated to all of the bolt's tasks. Use this grouping with care.

Global grouping: the entire stream goes to a single one of the bolt's tasks; specifically, the task with the lowest ID.

None grouping: this grouping says you don't care how the stream is grouped. Currently, a none grouping is equivalent to a shuffle grouping. Eventually, though, Storm will push bolts with none groupings down to execute on the same thread as the bolt or spout they subscribe to (when possible).

Direct grouping: this is a special kind of grouping. A stream grouped this way means the producer of a tuple decides which task of the consumer receives it. Direct groupings can only be declared on streams that have been declared as direct streams. Tuples sent to a direct stream must be emitted using one of the emitDirect methods. A bolt can get the task IDs of its consumers either through the TopologyContext or by keeping track of the return value of the emit method on OutputCollector (which returns the task IDs the tuple was sent to).

Local or shuffle grouping: if the target bolt has one or more tasks in the same worker process as the stream's source, tuples are shuffled only to those in-process tasks. Otherwise, this behaves like an ordinary shuffle grouping.
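
As an illustrative fragment, here is how several of these groupings are attached when wiring a topology; it reuses the hypothetical components sketched earlier, plus hypothetical WordCountBolt and ReportBolt classes.

    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new RandomSentenceSpout());
    // Shuffle grouping: sentences are spread randomly and evenly over 4 tasks.
    builder.setBolt("split", new SplitSentenceBolt(), 4)
           .shuffleGrouping("sentences");
    // Fields grouping: tuples with the same "word" always reach the same task,
    // so each task can keep a consistent per-word count.
    builder.setBolt("count", new WordCountBolt(), 4)
           .fieldsGrouping("split", new Fields("word"));
    // Global grouping: the entire "count" stream goes to a single task.
    builder.setBolt("report", new ReportBolt())
           .globalGrouping("count");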

Reliability

Storm guarantees that every spout tuple will be fully processed by the topology. It does this by tracking the tree of tuples triggered by each spout tuple and determining when that tree has been successfully completed. Every topology has an associated "message timeout". If Storm fails to detect that a spout tuple's tree has completed within this timeout, the tuple is considered failed and is replayed later.

To take advantage of this reliability, you must tell Storm whenever you create a new edge in a tuple tree, and also tell Storm whenever you have finished processing an individual tuple. Both are done through the OutputCollector object that bolts use to emit tuples: anchoring is done in the emit method, and you declare that you have successfully finished with a tuple using the ack method.
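
A minimal sketch of manual anchoring and acking, as the reliable counterpart of the BaseBasicBolt example above (the class name is hypothetical):

    import java.util.Map;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;

    // Hypothetical bolt showing explicit anchoring and acking.
    public class AnchoringSplitBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(Tuple input) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                // Anchoring: passing the input tuple as the first argument
                // creates a new edge in the tuple tree, so downstream failures
                // will cause the spout tuple to be replayed.
                collector.emit(input, new Values(word));
            }
            // Acking: tell Storm this bolt has finished with the input tuple.
            collector.ack(input);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }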


Tasks

Each spout or bolt executes as many tasks across the cluster. Each task corresponds to one thread of execution, and stream groupings define how tuples are sent from one set of tasks to another. You set the parallelism of each spout or bolt using the setSpout and setBolt methods of TopologyBuilder.
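
For example (an illustrative fragment with the same hypothetical components), the parallelism is passed as the third argument to setSpout/setBolt, and the setNumTasks method of the returned declarer can additionally fix the number of tasks:

    import org.apache.storm.topology.TopologyBuilder;

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("sentences", new RandomSentenceSpout(), 2); // parallelism of 2
    builder.setBolt("split", new SplitSentenceBolt(), 4)         // parallelism of 4...
           .setNumTasks(8)                                       // ...running 8 tasks
           .shuffleGrouping("sentences");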


Workers

Topologies execute across one or more worker processes. Each worker process is a separate JVM and executes a subset of the topology's tasks. For example, if the combined parallelism of a topology is 300 and 50 workers are allocated, each worker will run 6 tasks (as threads inside the worker). Storm tries to spread the tasks evenly across all the workers.
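
A brief sketch of requesting worker processes when submitting a topology; Config.setNumWorkers and StormSubmitter are the standard APIs, and the topology name and builder are carried over from the earlier fragments:

    import org.apache.storm.Config;
    import org.apache.storm.StormSubmitter;

    // Matching the numbers above: 50 workers for a topology with a combined
    // parallelism of 300 yields 6 tasks per worker.
    Config conf = new Config();
    conf.setNumWorkers(50);
    StormSubmitter.submitTopology("my-topology", conf, builder.createTopology());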
