Talk about Storm

This interview-oriented post is presented in Q&A style.


Question 1: Briefly introduce Storm.

Answer 1:

Storm is a free and open-source distributed real-time computation system. With Storm it is easy to reliably process unbounded streams of data. Just as Hadoop processes big data in batches, Storm processes data in real time.

The Storm cluster architecture is shown below:

(Figure: Storm cluster architecture)

Explanation of the figure:

1. Nimbus (master; distributes code to the Supervisors)
The master node of a Storm cluster. Nimbus distributes user code and assigns the Tasks of a Topology's components (Spouts/Bolts) to Worker processes on specific Supervisor nodes.

2. Supervisor (slave; manages the start and termination of Worker processes)
A slave node of the Storm cluster, responsible for starting and stopping the Worker processes that run on it. The supervisor.slots.ports item in Storm's configuration file (storm.yaml) specifies the maximum number of slots allowed on a Supervisor. Each slot is uniquely identified by a port number, and each port number corresponds to one Worker process (if that Worker process is started). A submission sketch that ties these pieces together follows this list.

3. Worker (the process that runs component logic)
A Worker is a process that runs the actual component logic. A Worker runs only two kinds of tasks: Spout tasks and Bolt tasks.

4. Task
Each Spout/Bolt "thread" running in a Worker is called a Task. Since Storm 0.8, a Task no longer corresponds to a physical thread: multiple Tasks of the same Spout/Bolt may share one physical thread, which is called an Executor.

5. ZooKeeper
ZooKeeper coordinates Nimbus and the Supervisors. If a Supervisor cannot run its part of the Topology because of a failure, Nimbus senses this immediately and reassigns the Topology to run on other available Supervisors.
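
A minimal submission sketch, assuming the Storm 1.x Java API (org.apache.storm packages) and a hypothetical topology name, showing where Nimbus, the Supervisors, and the Workers come into play from the user's side:

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // setSpout / setBolt wiring omitted here; see Question 2.

        Config conf = new Config();
        // Ask for 2 Worker processes; each occupies one slot (a port from
        // supervisor.slots.ports in storm.yaml) on some Supervisor node.
        conf.setNumWorkers(2);

        // The client uploads the jar to Nimbus; Nimbus distributes the code
        // to Supervisors and assigns the Spout/Bolt Tasks to the Workers.
        StormSubmitter.submitTopology("demo-topology", conf, builder.createTopology());
    }
}
```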


Question 2: Talk about the Storm programming model (Spout -> Tuple -> Bolt).

Answer 2:

At runtime Storm is built from two kinds of components: Spouts and Bolts. Data originates at a Spout and is sent to Bolts in the form of Tuples. Several Bolts can be chained in series, and a single Bolt can also subscribe to multiple Spouts/Bolts, as the wiring sketch below shows.
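
A sketch of that wiring, assuming the Storm 1.x Java API; SentenceSpout, SplitBolt, CountBolt, and AuditBolt are hypothetical component classes (concrete implementations of the first two appear later in this answer):

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

class WiringSketch {
    // SentenceSpout, SplitBolt, CountBolt, AuditBolt are hypothetical classes.
    static TopologyBuilder wire() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout());
        // Bolts chained in series: spout -> split -> count.
        builder.setBolt("split", new SplitBolt()).shuffleGrouping("sentences");
        builder.setBolt("count", new CountBolt()).fieldsGrouping("split", new Fields("word"));
        // A single bolt can also subscribe to several upstream components.
        builder.setBolt("audit", new AuditBolt())
               .shuffleGrouping("sentences")
               .shuffleGrouping("split");
        return builder;
    }
}
```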

The overall runtime structure is as follows:

(Figure: Storm programming model)

Explanation of the main concepts:

1. Topology
The name for a real-time application running in Storm. A Topology is a graph that combines Spouts and Bolts, defining how they are connected, their parallelism, their configuration, and so on.

2. Spout
The component that produces the source data stream in a Topology. Normally a Spout reads data from an external data source and converts it into the Topology's internal source stream.

3. Bolt
The component that receives data and processes it; this is where users implement whatever operations they want.

4. Tuple
The basic unit of a single message transfer; a group of values sent together is understood as one Tuple.

5. Stream
A sequence of Tuples, representing a flow of data.
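
A minimal Spout and Bolt pair, assuming the Storm 1.x Java API, to make these concepts concrete: the Spout emits a stream of sentence Tuples, and the Bolt consumes them and emits word Tuples.

```java
import java.util.Map;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Spout: produces the source stream of the topology.
class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Each emit sends one Tuple into the stream.
        collector.emit(new Values("the quick brown fox"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence"));
    }
}

// Bolt: receives Tuples and applies the user's processing logic.
class SplitBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        for (String word : input.getStringByField("sentence").split(" ")) {
            collector.emit(new Values(word));   // downstream Tuples, one per word
        }
        collector.ack(input);                   // acknowledge the input Tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
```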


Question 3: Talk about how a Topology runs.

Answer 3:

In Storm, the computing task of a real-time application is packaged and released as a Topology, similar to a Hadoop MapReduce job. There is one difference, though: in Hadoop, a MapReduce task eventually finishes once execution completes; in Storm, a submitted Topology never ends unless you explicitly kill it (for example with the storm kill command). A Topology is a graph of Spouts and Bolts connected by data streams (Streams). When Storm runs a Topology on a cluster, the execution is carried out by the following three entities:

(1). Worker (process)
(2). Executor (thread)
(3). Task
(Figure: the relationship between Worker, Executor, and Task)

1. Worker (one worker process executes a subset of one topology)
A worker process executes a subset of one topology (note: a single worker never serves multiple topologies). A worker process starts one or more executor threads to execute the components (spouts or bolts) of a topology. A running topology therefore consists of multiple worker processes spread over multiple physical machines in the cluster.

2. Executor (a separate thread started by a worker process)
An executor is a separate thread started by a worker process. Each executor runs only tasks of one component (spout or bolt) of a topology (note: there can be one or more such tasks; by default Storm creates one task per component, and the executor thread calls all of its task instances sequentially in a loop).

3. Task (the unit that actually runs spout or bolt code)
A Task is the unit that actually runs the code in a spout or bolt (note: one task is one instance of a spout or bolt, and the executor thread calls the task's nextTuple or execute method during execution). After a topology is started, the number of tasks of a component (spout or bolt) is fixed, but the number of executor threads used by the component can be adjusted dynamically (for example, one executor thread can execute one or more task instances of the component). This implies the condition #threads <= #tasks for a component (the number of threads is at most the number of tasks). By default, the number of tasks equals the number of executor threads, i.e., each executor thread runs exactly one task. The parallelism sketch below shows how these numbers are set.
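
A parallelism sketch, assuming the Storm 1.x Java API; GreenSpout and YellowBolt are hypothetical components. It shows where the worker, executor, and task counts are configured:

```java
import org.apache.storm.Config;
import org.apache.storm.topology.TopologyBuilder;

class ParallelismSketch {
    // GreenSpout and YellowBolt are hypothetical stand-ins.
    static Config configure(TopologyBuilder builder) {
        builder.setSpout("green", new GreenSpout(), 2);  // parallelism hint: 2 executors
        builder.setBolt("yellow", new YellowBolt(), 2)   // 2 executor threads...
               .setNumTasks(4)                           // ...sharing 4 tasks (2 per executor)
               .shuffleGrouping("green");

        Config conf = new Config();
        conf.setNumWorkers(2);  // spread the executors across 2 worker processes
        return conf;
    }
}
```

The executor count can later be changed at runtime with the storm rebalance command, while the task count stays fixed for the lifetime of the topology.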



Question 4: Talk about Storm stream groupings.

Answer 4:

Stream grouping is one of the most important abstractions in Storm. It controls how the Tasks of a Spout/Bolt distribute Tuples, that is, to which Task of the target Bolt each Tuple is delivered.

(Figure: Storm stream grouping types)
Storm currently supports the following grouping types, illustrated in the sketch after this list:

1. Shuffle Grouping
Random grouping: tuples are distributed as evenly as possible to the tasks of the downstream Bolt. Tuples emitted by the Spout (or upstream Bolt) are shuffled, i.e., randomly distributed among the Bolt's tasks, so that each task receives roughly the same number of tuples.

2. Fields Grouping
Grouping by field: tuples are partitioned by the value of a given field. This grouping mechanism guarantees that tuples with the same field value are always sent to the same Task.

3. All Grouping (broadcast)
Broadcast transmission: every tuple is copied to every task of the Bolt for processing.

4. Global Grouping
Global grouping: all tuples in the Stream are sent to a single Task of the Bolt, specifically the task with the smallest task id. It can be used, for example, to implement transactional topologies or global aggregation.

5. None Grouping
No grouping: used when you do not care about the load-balancing strategy. It is currently equivalent to Shuffle Grouping, except that Storm may place the Bolt's task in the same thread as its upstream data-producing task.

6. Direct Grouping
Direct grouping: the emitter of a tuple directly decides which task of the receiving Bolt will process it. This is a special grouping method: using it means the sender of a message specifies which task of the receiver handles that message. Only streams declared as direct streams can use this grouping, and such tuples must be emitted with the emitDirect method. The receiving component can obtain the task ids of its consumers through the TopologyContext (the OutputCollector.emit method also returns the task ids the tuple was sent to).
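
A grouping sketch, assuming the Storm 1.x Java API; MySpout and the bolt classes are hypothetical placeholders. Each line wires one of the grouping types above:

```java
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

class GroupingSketch {
    // MySpout and the bolt classes are hypothetical placeholders.
    static void wire() {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("spout", new MySpout(), 2);

        builder.setBolt("splitter", new SplitBolt(), 4)
               .shuffleGrouping("spout");                       // 1. random, even spread
        builder.setBolt("counter", new CountBolt(), 4)
               .fieldsGrouping("splitter", new Fields("word")); // 2. same value -> same task
        builder.setBolt("signaler", new SignalBolt())
               .allGrouping("spout");                           // 3. broadcast to every task
        builder.setBolt("total", new TotalBolt())
               .globalGrouping("counter");                      // 4. all tuples to the lowest task id
        builder.setBolt("logger", new LogBolt())
               .noneGrouping("spout");                          // 5. currently same as shuffle
        builder.setBolt("router", new RouteBolt())
               .directGrouping("spout");                        // 6. sender chooses the task
        // For direct grouping, the spout must declare a direct stream and
        // emit with SpoutOutputCollector.emitDirect(taskId, tuple).
    }
}
```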
