Storm: Basic Principles, Concepts, and Usage

1. Background introduction

1.1 What is offline computing

Offline computing: data is acquired in batches, transmitted in batches, computed periodically in batches, and then displayed;

Representative technologies: Sqoop for batch data import, HDFS for batch data storage, MapReduce for batch computation, Hive for batch computation

1.2 What is stream computing

Stream computing: data is generated in real time, transmitted in real time, computed in real time, and displayed in real time

Representative technologies: Flume for real-time data collection, Kafka/MetaQ for real-time data storage, Storm/JStorm for real-time computation, Redis for caching real-time results, and persistent storage (e.g., MySQL)

Summarized in one sentence: continuously generated data is collected and computed in real time, so that results are obtained as quickly as possible

1.3 What is Storm

Storm is used to process data in real time. Its features: low latency, high availability, distributed, scalable, and no data loss. It also provides a simple, easy-to-understand interface that makes development convenient.

1.4 Difference between Storm and Hadoop

1. Storm is used for real-time computing, while Hadoop is used for offline computing;

2. The data Storm processes is kept in memory and arrives as a continuous stream; the data Hadoop processes is stored in the file system and handled batch by batch;

3. Storm's data is transmitted over the network; Hadoop's data is stored on disk;

4. The programming models of Storm and Hadoop are similar.

2. Storm core components

Component descriptions:

Nimbus: Responsible for resource allocation and task scheduling.

Supervisor: Responsible for accepting tasks assigned by Nimbus and for starting and stopping the worker processes it manages. The number of workers started on a given supervisor is set through its configuration file.

Worker: The process that runs the concrete processing logic (in effect, a JVM). A worker runs only two kinds of tasks: Spout tasks and Bolt tasks.

Task: Each Spout/Bolt thread in a worker is called a task. Since Storm 0.8, a task no longer corresponds to a physical thread; multiple tasks of the same Spout/Bolt may share one physical thread, which is called an executor.

Zookeeper: Stores task assignment information, heartbeat information, and metadata.

Concurrency: A Spout/Bolt defined by the user can be executed by multiple threads, and its concurrency equals the number of threads. The threads of one component are run across multiple workers (JVMs), with a load-balancing strategy similar to an averaging algorithm. Minimizing network IO as much as possible follows the same idea as local computation in Hadoop MapReduce.
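
To make the relationship between workers, executors, and tasks concrete, here is a minimal sketch of how these parallelism settings are expressed in Storm's Java API (assuming the org.apache.storm packages of recent releases; older 0.x releases used backtype.storm instead, and MySpout/MyBolt are hypothetical placeholder components):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class ParallelismDemo {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // MySpout/MyBolt are hypothetical components, not part of Storm.
        // 2 executors (threads) for the spout; one task per executor by default.
        builder.setSpout("my-spout", new MySpout(), 2);

        // 2 executors but 4 tasks for the bolt: each thread runs 2 task instances.
        builder.setBolt("my-bolt", new MyBolt(), 2)
               .setNumTasks(4)
               .shuffleGrouping("my-spout");

        Config conf = new Config();
        conf.setNumWorkers(2); // spread the executors across 2 worker JVMs

        StormSubmitter.submitTopology("parallelism-demo", conf, builder.createTopology());
    }
}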

3. Storm programming model

(Figure: Storm programming model diagram)

Component descriptions:

DataSource: External data source.

Spout: The component that receives data from the external data source, converts it into Storm's internal data, and sends it to Bolts with the Tuple as the basic transmission unit.

Bolt: Receives data sent by a Spout or by an upstream Bolt, processes it according to the business logic, and then sends it to the next Bolt or stores it in some medium; the medium can be Redis, MySQL, or others.

Tuple: The basic unit of data transmission in Storm; it encapsulates a List object to hold the data.

StreamGrouping: Data grouping strategy. Storm has 7 built-in groupings, including shuffleGrouping (random distribution), noneGrouping (also effectively random), fieldsGrouping (hash modulo on the grouping fields), and localOrShuffleGrouping (prefers tasks in the same worker process, falling back to random).
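
As a minimal sketch of this programming model (again assuming the org.apache.storm packages of recent releases), the following topology defines a Spout that emits Tuples and a Bolt that consumes them, wired together with a shuffle grouping; the class and component names are illustrative only:

import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class DemoTopology {

    // Spout: turns an external source (here, a hard-coded sentence) into Tuples.
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(1000); // throttle for demo purposes
            // A Tuple wraps a List of values; here a single field named "sentence".
            collector.emit(new Values("the quick brown fox"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("sentence"));
        }
    }

    // Bolt: receives Tuples from the Spout and emits one Tuple per word.
    public static class SplitBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentence-spout", new SentenceSpout());
        // shuffleGrouping distributes tuples randomly among SplitBolt tasks;
        // fieldsGrouping("sentence-spout", new Fields("sentence")) would instead
        // send tuples with equal field values to the same task.
        builder.setBolt("split-bolt", new SplitBolt()).shuffleGrouping("sentence-spout");

        // Run in-process for testing; a real cluster would use StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo", new Config(), builder.createTopology());
    }
}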

Worker and Topology

A worker belongs to only one topology, and the tasks running in a worker can only belong to that topology. Conversely, a topology contains multiple workers; in effect, the topology runs on those workers. If the number of workers a topology requests cannot be satisfied, the cluster assigns tasks to the workers that are available; if the number of free workers in the cluster is 0, a newly submitted topology is only marked active and does not run until the cluster has free resources.

4. Storm common operation commands

Storm has many simple and useful commands for managing topologies; they can submit, kill, deactivate, activate, and rebalance a topology.

4.1 Submit task command

storm jar [jar path] [package name.topology class name] [topology name]

storm  jar  examples/storm-starter/storm-starter-topologies-0.9.6.jar  storm.starter.WordCountTopology  wordcount

4.2 Kill task command

storm kill [topology name] -w 10 (when executing the kill command, you can use -w [wait seconds] to specify the wait time after the topology is deactivated)

storm  kill  topology-name  -w  10

4.3 Deactivate task command

storm deactivate [topology name]

storm  deactivate  topology-name

A running topology can be suspended by deactivating it. When a topology is deactivated, all tuples already distributed are still processed, but the Spouts' nextTuple method is no longer called. To destroy a topology, use the kill command: it destroys the topology in a safe manner, first deactivating it and then waiting for the specified period so the topology can finish processing the data currently in flight.

4.4 Activate task command

storm activate [topology name]

storm  activate  topology-name

4.5 Redeploy task command

storm rebalance [topology name]

storm  rebalance  topology-name

Rebalance lets you redistribute a topology's tasks across the cluster. This is a very powerful command; for example, after you add nodes to a running cluster, the rebalance command deactivates the topology, redistributes its workers after the specified timeout, and then returns the topology to its active state.
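
For example, the wait time, the number of workers, and the parallelism of individual components can all be adjusted in one call (a hypothetical invocation; split-bolt is a placeholder component name, -w sets the wait time in seconds, -n the new worker count, and -e the executor count for a named component):

storm  rebalance  topology-name  -w 10  -n 4  -e split-bolt=8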

 
