Concurrency mechanism of Storm program

Foreword:

To make Storm programs run more efficiently in practice, it helps to understand Storm's concurrency mechanism. (Haha, the blogger is still a rookie and has not yet had to tune a real program, so what follows is theory only, but the parallelism mechanism is still worth understanding.)

1. Concept

  • Concurrency: A component (spout or bolt) defined by the user can be executed by multiple threads, and its degree of concurrency equals its number of threads. These threads are spread across multiple workers (JVMs) by a roughly even load-balancing strategy. Storm also tries to minimize network I/O, much like local computation in Hadoop MapReduce.

  • Workers (JVMs): One or more independent JVM processes can run on a physical node. A topology can contain one or more workers, which may run in parallel on different physical machines. Each worker process executes a subset of a topology, and a worker belongs to exactly one topology.
  • Executors (threads): Java threads running inside a worker JVM process. An executor can run one or more tasks, but by default each executor runs exactly one task. A worker can contain one or more executors, and each component (spout or bolt) has at least one executor. An executor therefore executes a subset of one component, and an executor belongs to exactly one component.
  • Tasks (bolt/spout instances): A task is an object that carries the actual processing logic. Each spout or bolt runs as many tasks across the cluster. By default each task gets its own executor thread, but several tasks of the same component can share one executor. A stream grouping defines how tuples are emitted from one set of tasks to another. The parallelism argument of TopologyBuilder.setSpout and TopologyBuilder.setBolt sets the number of executors; the number of tasks defaults to the same value.
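
To make the executor/task relationship concrete, here is a minimal sketch in plain Java of how the tasks of one component could be spread over its executors. The class and method names (TaskAssignmentSketch, assign) are made up for illustration, and the even split is an assumption, not a description of Storm's real scheduler. It only relies on the rule that a component never has more executors than tasks.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: spreads the task ids of one component over its executors. */
final class TaskAssignmentSketch {

    /** Returns one list of task ids per executor; the first executors take any leftover tasks. */
    static List<List<Integer>> assign(int numTasks, int numExecutors) {
        if (numExecutors < 1 || numTasks < numExecutors) {
            throw new IllegalArgumentException("need 1 <= executors <= tasks");
        }
        List<List<Integer>> executors = new ArrayList<>();
        int base = numTasks / numExecutors;   // tasks that every executor gets
        int extra = numTasks % numExecutors;  // the first `extra` executors get one more
        int nextTaskId = 0;
        for (int e = 0; e < numExecutors; e++) {
            int count = base + (e < extra ? 1 : 0);
            List<Integer> tasks = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                tasks.add(nextTaskId++);
            }
            executors.add(tasks);
        }
        return executors;
    }

    public static void main(String[] args) {
        // 4 tasks on 2 executors: each thread runs 2 task instances.
        System.out.println(assign(4, 2)); // [[0, 1], [2, 3]]
        // Default case, tasks == executors: one task per thread.
        System.out.println(assign(3, 3)); // [[0], [1], [2]]
    }
}
```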

2. Configure parallelism

  • Parallelism can be configured in several places in Storm. The priority, from lowest to highest, is: defaults.yaml < storm.yaml < topology-specific configuration < internal component-specific configuration < external component-specific configuration
  • The number of worker processes can be set in configuration files or in code. Workers are the processes that do the actual executing, so to benefit from concurrency the number of workers should usually be at least as large as the number of machines.
  • The number of executors (a component's concurrent threads) can only be set in code, through the parallelism parameter of setBolt and setSpout, for example setBolt("green-bolt", new GreenBolt(), 2).
  • The number of tasks is optional. By default it equals the number of executors (1:1), or it can be set with setNumTasks(). The number of workers of a topology, that is, the number of worker (Java) processes that execute it, is set through Config and can be adjusted later with the storm rebalance command.
  • Dynamically changing the degree of parallelism
    Storm supports changing (increasing or decreasing) the number of worker processes and executors without restarting the topology. This is called rebalancing, and it can be done through the Storm web UI or with the storm rebalance command:
    storm rebalance mytopology -n 5 -e blue-spout=3 -e yellow-bolt=10
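
The priority order in the first bullet of this section can be pictured as layered maps, where each higher-priority layer overwrites keys from the layers below it. The sketch below is plain Java with no Storm dependency. The class name ConfigLayersSketch and the sample values are assumptions for illustration; topology.workers and topology.debug are real Storm config keys, but this is not how Storm itself loads its configuration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: resolves a setting across config layers of increasing priority. */
final class ConfigLayersSketch {

    /** Merges the layers from lowest to highest priority; later layers overwrite earlier keys. */
    static Map<String, Object> resolve(List<Map<String, Object>> layersLowToHigh) {
        Map<String, Object> effective = new LinkedHashMap<>();
        for (Map<String, Object> layer : layersLowToHigh) {
            effective.putAll(layer);
        }
        return effective;
    }

    public static void main(String[] args) {
        // Sample values are made up; only the ordering of the layers matters here.
        Map<String, Object> defaultsYaml = Map.of("topology.workers", 1, "topology.debug", false);
        Map<String, Object> stormYaml = Map.of("topology.workers", 4);
        Map<String, Object> topologyConf = Map.of("topology.workers", 2);

        Map<String, Object> effective = resolve(List.of(defaultsYaml, stormYaml, topologyConf));
        System.out.println(effective.get("topology.workers")); // 2
        System.out.println(effective.get("topology.debug"));   // false
    }
}
```

The topology-specific value wins for topology.workers, while topology.debug falls through to the lowest layer because nothing above it sets that key.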

The concurrency description was shown in a figure at this point:
[Figure: image not preserved in the source]

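Configuration example

// Package names assume Storm 1.0+ (org.apache.storm); older releases use backtype.storm.
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

TopologyBuilder topologyBuilder = new TopologyBuilder();
Config conf = new Config();
conf.setNumWorkers(2); // use 2 workers
topologyBuilder.setSpout("blue-spout", new BlueSpout(), 2); // parallelism of 2 (2 executors)
topologyBuilder.setBolt("green-bolt", new GreenBolt(), 2).setNumTasks(4).shuffleGrouping("blue-spout"); // 2 executors, 4 tasks
topologyBuilder.setBolt("yellow-bolt", new YellowBolt(), 6).shuffleGrouping("green-bolt"); // 6 executors
StormSubmitter.submitTopology("mytopology", conf, topologyBuilder.createTopology());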

The parallelism of the three components adds up to 2 + 2 + 6 = 10, so the topology has 10 executors in total. With 2 workers, each worker runs 10 / 2 = 5 executor threads.
The green bolt is configured with 2 executors and 4 tasks, so each of its executors runs 2 tasks.
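
The arithmetic for this example can be checked with a few lines of plain Java. The class name ExampleTopologyMath is made up for illustration, and the numbers count only the three user components from the example; Storm may run its own system executors (for example ackers) on top of these.

```java
/** Illustrative only: recomputes the executor and task counts of the example topology. */
final class ExampleTopologyMath {

    /** Adds up any number of counts. */
    static int sum(int... values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        int workers = 2;
        int blueSpoutExecutors = 2;
        int greenBoltExecutors = 2;
        int yellowBoltExecutors = 6;
        int greenBoltTasks = 4; // set explicitly with setNumTasks(4)

        int totalExecutors = sum(blueSpoutExecutors, greenBoltExecutors, yellowBoltExecutors);
        // Components without setNumTasks get one task per executor.
        int totalTasks = sum(blueSpoutExecutors, greenBoltTasks, yellowBoltExecutors);

        System.out.println("executors: " + totalExecutors);                     // executors: 10
        System.out.println("threads per worker: " + totalExecutors / workers);  // threads per worker: 5
        System.out.println("tasks: " + totalTasks);                             // tasks: 12
        System.out.println("green-bolt tasks per executor: "
                + greenBoltTasks / greenBoltExecutors);                         // green-bolt tasks per executor: 2
    }
}
```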

Summary:

With the concurrency mechanism understood, how should the parallelism of each component, and the number of workers, be chosen in the driver class in real production? Some reference points:
1. Set the spout parallelism according to the volume of upstream data.
2. Set the bolt parallelism according to the business complexity and the execution time of the execute method.
3. Size the topology according to the resources available in the cluster. A resource utilization of around 70% is a common target.
4. In theory, the number of workers is chosen so that the topology's total number of tasks divides evenly among them. In real business scenarios it has to be tuned repeatedly.
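
As a starting point for point 2, the bolt parallelism can be estimated from the input rate and the average time spent in execute. The sketch below is a rule of thumb of my own, not something Storm provides: the class BoltSizingSketch, the method suggestedExecutors and its parameters are hypothetical, and reusing the 70% figure from point 3 as per-thread headroom is an assumption. Treat the result as a first guess to be adjusted by measurement.

```java
/** Illustrative only: a rough first guess for the number of executors of one bolt. */
final class BoltSizingSketch {

    /**
     * Estimates how many executor threads are needed so that each one is busy
     * for at most targetUtilization of the time (for example 0.7 for 70%).
     */
    static int suggestedExecutors(double tuplesPerSecond, double avgExecuteMillis,
                                  double targetUtilization) {
        if (tuplesPerSecond < 0 || avgExecuteMillis < 0
                || targetUtilization <= 0 || targetUtilization > 1) {
            throw new IllegalArgumentException("invalid sizing input");
        }
        // Threads that would be 100% busy handling this load.
        double fullyBusyThreads = tuplesPerSecond * avgExecuteMillis / 1000.0;
        // Leave headroom, and never go below one executor.
        return Math.max(1, (int) Math.ceil(fullyBusyThreads / targetUtilization));
    }

    public static void main(String[] args) {
        // 2000 tuples/s at 2 ms each keeps 4 threads fully busy; 4 / 0.7 rounds up to 6.
        System.out.println(suggestedExecutors(2000, 2.0, 0.7)); // 6
        // A light load still needs one executor.
        System.out.println(suggestedExecutors(100, 1.0, 0.7));  // 1
    }
}
```

The estimate ignores network and queueing delays, so the real number still has to be found by adjusting and measuring repeatedly, as point 4 says.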
