Introduction and principle of Storm

1. Overview

    Storm is an open-source distributed real-time computing system that makes it simple to reliably process large streams of data.

    Storm has many usage scenarios, such as real-time analytics, online machine learning, continuous computation, distributed RPC, ETL, and more.

    Storm scales horizontally, is highly fault-tolerant, guarantees that every message will be processed, and is very fast: on a small cluster, each node can process more than a million tuples per second.

    Storm is easy to deploy and operate and, more importantly, topologies can be developed in any programming language.

 

2. Components

1. Structure

    A Storm application is structured as a topology, which is composed of streams (flows of data), spouts (stream sources), and bolts (stream operators).

    [Figure: the topology model provided by the official website]

    Unlike a job in Hadoop, a topology in Storm keeps running until the process is killed or the topology is undeployed.

2. Stream

    Storm's core data structure is the tuple: an ordered list of values in which each value is bound to a named field, so a tuple effectively behaves as a list of one or more key-value pairs. A stream is an unbounded sequence of tuples.
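
    A minimal sketch of how names and values pair up (the field names here are illustrative, not part of the word-count example below): the schema is declared once with Fields, and each emitted Values list is matched against it by position.

// Declared once per component, in declareOutputFields():
declarer.declare(new Fields("word", "count"));
// Each emitted tuple supplies values in the same order,
// conceptually producing {"word":"storm","count":1}:
collector.emit(new Values("storm", 1));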

3. Spout

    A spout connects to a data source, converts the data into tuples, and emits those tuples as a stream. The main work in developing a spout is writing the code that consumes data from the source.

    A spout's data can come from many sources:

    clickstreams from web or mobile applications, messages from social networks, data collected by sensors, and log output from applications.

    A spout is usually responsible only for converting and emitting data, not for business logic; keeping it decoupled from the business makes the spout easy to reuse.

4. Bolt

    A bolt is responsible for operating on data: it processes the tuples it receives and then, optionally, emits one or more new streams.

    A bolt can receive multiple streams emitted by spouts or other bolts, so complex networks of stream transformations can be built.

    Typical bolt functions include:

    filtering, joins, aggregations, calculations, and database reads and writes.

3. Introductory case

1. Case structure

    Case: word count.

    Sentence Spout --> Sentence Split Bolt --> Word Count Bolt --> Report Bolt.

2. Code implementation

1. Sentence-generating Spout

    SentenceSpout

    As an introductory case, the spout can simply read sentences from an array over and over as its data source.

    SentenceSpout continuously reads a sentence, wraps it in a single-field tuple (field name sentence, value the sentence string), and emits it downstream.

  {"sentence":"i am so shuai!"}

    Code:

/**
 * BaseRichSpout is a convenience implementation of the ISpout and IComponent
 * interfaces. It follows the adapter pattern, providing default implementations
 * for the methods you do not need.
 */
public class SentenceSpout extends BaseRichSpout {

	private SpoutOutputCollector collector = null;
	private String[] sentences = { 
			"i am so shuai", 
			"do you look me", 
			"you can see me", 
			"i am so shuai",
			"do you bilive" 
			};
	private int index = 0;

	/**
	 * Defined in the ISpout interface; called on every spout component at
	 * initialization. conf holds the Storm configuration, context provides
	 * information about the component's place in the topology, and collector
	 * provides the methods for emitting tuples.
	 */
	@Override
	public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
		this.collector = collector;
	}

	/**
	 * Overridden from BaseRichSpout. The core method: Storm calls it to ask
	 * the spout to emit tuples.
	 */
	@Override
	public void nextTuple() {
		this.collector.emit(new Values(sentences[index]));
		index = (index + 1 >= sentences.length ? 0 : index + 1);
		Utils.sleep(1000);
	}

	/**
	 * Defined in the IComponent interface, which every Storm component (spout
	 * or bolt) implements. Tells Storm which streams this component emits and
	 * which fields the tuples of each stream contain.
	 */
	@Override
	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		declarer.declare(new Fields("sentence"));
	}
}

2. Sentence-splitting Bolt

    SplitSentenceBolt

    The sentence-splitting bolt subscribes to the tuples emitted by SentenceSpout. Each time it receives a tuple it extracts the value of the "sentence" field, splits the sentence into words on spaces, and emits one tuple per word downstream.

{"word":"I"}
{"word":"am"}
{"word":"so"}
{"word":"shuai"}

    Code:

/**
 * BaseRichBolt is a convenience implementation of the IComponent and IBolt
 * interfaces. It follows the adapter pattern, providing default implementations
 * for the methods you do not need.
 */
public class SplitSentenceBolt extends BaseRichBolt {

	private OutputCollector collector = null;

	/**
	 * Defined in IBolt; called when the bolt is initialized. stormConf holds
	 * the Storm configuration, context provides information about the
	 * component's place in the topology, and collector provides the methods
	 * for emitting tuples.
	 */
	@Override
	public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
		this.collector = collector;
	}

	/**
	 * Overridden from BaseRichBolt. The core method: Storm calls it once for
	 * each tuple the bolt receives.
	 */
	@Override
	public void execute(Tuple input) {
		String sentence = input.getStringByField("sentence");
		String[] words = sentence.split(" ");
		for (String word : words) {
			collector.emit(new Values(word));
		}
	}

	/**
	 * Defined in the IComponent interface, which every Storm component (spout
	 * or bolt) implements. Tells Storm which streams this component emits and
	 * which fields the tuples of each stream contain.
	 */
	@Override
	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		declarer.declare(new Fields("word"));
	}
}

3. Word Count Bolt

    WordCountBolt

    The word count bolt subscribes to the output of SplitSentenceBolt and keeps a running count of each word it sees. When a new tuple is received, the count for the corresponding word is incremented by one and the word's current count is emitted downstream.

{"word":"I","count":3}

    Code:

public class WordCountBolt extends BaseRichBolt {

	private OutputCollector collector = null;
	private Map<String, Integer> counts = null;

	/**
	 * Note: it is best to create mutable state like this inside prepare().
	 * Reason: Storm serializes all bolt and spout components and ships them
	 * to the cluster, so any non-serializable object created before
	 * serialization causes a NotSerializableException to be thrown.
	 * A HashMap is itself serializable, so it would not cause the problem
	 * here, but it is a habit worth forming.
	 */
	@Override
	public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
		this.collector = collector;
		this.counts = new HashMap<>();
	}

	@Override
	public void execute(Tuple input) {
		String word = input.getStringByField("word");
		counts.put(word, counts.getOrDefault(word, 0) + 1);
		this.collector.emit(new Values(word, counts.get(word)));
	}

	@Override
	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		declarer.declare(new Fields("word", "count"));
	}
}

4. Report Bolt

    ReportBolt

    The report bolt subscribes to the output of WordCountBolt and internally maintains a table of the current count for every word. When a tuple is received it updates the table; the counts are printed to the terminal when the bolt shuts down (see cleanup() below).

/**
 * This bolt sits at the end of the data flow, so it only receives tuples and
 * does not emit any stream.
 */
public class ReportBolt extends BaseRichBolt {

	private Map<String, Integer> counts = null;

	@Override
	public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
		this.counts = new HashMap<>();
	}

	@Override
	public void execute(Tuple input) {
		String word = input.getStringByField("word");
		Integer count = input.getIntegerByField("count");
		counts.put(word, count);
	}

	/**
	 * Storm calls this method before terminating a bolt; it is normally used
	 * to release resources before the bolt exits. Here we use it to print the
	 * final counts to the console.
	 * Note: in a real cluster environment, cleanup() is unreliable and is not
	 * guaranteed to run; this is discussed later.
	 */
	@Override
	public void cleanup() {
		List<String> keys = new ArrayList<>();
		keys.addAll(counts.keySet());
		Collections.sort(keys);
		for (String key : keys) {
			Integer count = counts.get(key);
			System.err.println("--" + key + " count: " + count + "--");
		}
	}

	@Override
	public void declareOutputFields(OutputFieldsDeclarer declarer) {
		// This bolt is at the end of the stream and emits nothing, so this
		// method is left empty.
	}
}

5. Word Count Topology

    The main method assembles the processing flow. Here we test in local (stand-alone) mode.

public class WCDriver {
	public static void main(String[] args) throws Exception {
		// -- Instantiate the spout and bolts
		SentenceSpout sentenceSpout = new SentenceSpout();
		SplitSentenceBolt splitSentenceBolt = new SplitSentenceBolt();
		WordCountBolt wordCountBolt = new WordCountBolt();
		ReportBolt reportBolt = new ReportBolt();
		// -- Create a TopologyBuilder instance
		TopologyBuilder builder = new TopologyBuilder();
		// -- Register SentenceSpout
		builder.setSpout("sentence_spout", sentenceSpout);
		// -- Register SplitSentenceBolt, subscribing to the tuples emitted by SentenceSpout.
		// shuffleGrouping distributes all tuples randomly and evenly among the
		// instances of SplitSentenceBolt.
		builder.setBolt("split_sentence_bolt", splitSentenceBolt).shuffleGrouping("sentence_spout");
		// -- Register WordCountBolt, subscribing to the tuples emitted by SplitSentenceBolt.
		// fieldsGrouping routes tuples whose named field has the same value to
		// the same WordCountBolt instance.
		builder.setBolt("word_count_bolt", wordCountBolt).fieldsGrouping("split_sentence_bolt", new Fields("word"));
		// -- Register ReportBolt, subscribing to the tuples emitted by WordCountBolt.
		// globalGrouping routes all tuples to the single ReportBolt instance.
		builder.setBolt("report_bolt", reportBolt).globalGrouping("word_count_bolt");
		// -- Create the configuration object
		Config config = new Config();
		// -- Create the object representing the cluster. LocalCluster simulates a
		// complete Storm cluster inside the local development environment.
		// Local mode is a simple way to develop and test, avoiding the overhead of
		// repeated deployment to a distributed cluster, and it makes breakpoint
		// debugging very convenient.
		LocalCluster cluster = new LocalCluster();
		// -- Submit the topology to the cluster
		cluster.submitTopology("Wc_Topology", config, builder.createTopology());
		// -- After running for 10 seconds, kill the topology and shut down the cluster
		Thread.sleep(10 * 1000);
		cluster.killTopology("Wc_Topology");
		cluster.shutdown();
	}
}
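
    In a real cluster, the topology is submitted through StormSubmitter instead of LocalCluster. A minimal sketch, assuming the jar has been packaged and is submitted with the storm jar command so the cluster can locate the classes:

// -- Cluster-mode submission (sketch): builder is assembled exactly as above
Config config = new Config();
config.setNumWorkers(2);
StormSubmitter.submitTopology("Wc_Topology", config, builder.createTopology());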

4. Concurrency Mechanism

1. Concurrency level

    A topology in a Storm cluster exhibits parallelism at the following four levels:

1.Nodes

    Server: a machine configured into the Storm cluster that performs part of the topology's work. A Storm cluster contains one or more nodes.

2.Workers

    JVM process: a JVM process running on a node, independent of the others. Each node can be configured to run one or more workers, and a topology is assigned to run on one or more workers.

3.Executor

    Thread: a Java thread running inside a worker's JVM. Multiple tasks can be assigned to the same executor; unless explicitly specified, Storm assigns one task to each executor.

4.Task

    Spout/bolt instance: a task is an instance of a spout or bolt; its nextTuple() and execute() methods are called by an executor thread.

    In most cases, unless explicitly specified, Storm defaults each of these settings to 1: the topology gets one worker on one server (node), and each executor runs one task.

    [Figure: Storm's default concurrency mechanism]

    The only concurrency at this point occurs at the thread level, namely the executors.

2. Increase concurrency at all levels

1. Add nodes

    This simply means adding servers to the cluster.

2. Add workers

    The number of workers allocated to a topology can be changed in two ways: through the API or through configuration.

    Adding workers through the API:

Config config = new Config();
config.setNumWorkers(2);

    In stand-alone mode, increasing the number of workers will not produce any speed-up.
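
    The configuration route (a sketch): setNumWorkers(2) simply writes the topology.workers entry, so the same value can be set directly through the Config map, or given a cluster-wide default via topology.workers in storm.yaml:

Config config = new Config();
// Equivalent to config.setNumWorkers(2): both set the "topology.workers" entry.
config.put(Config.TOPOLOGY_WORKERS, 2);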

3. Add executors

    Adding executors through the API:

builder.setSpout(spout_id, spout, 2);
builder.setBolt(bolt_id, bolt, executor_num);

    This sets the number of executor threads for a spout or bolt. By default, each thread runs one task of that component.

4. Add tasks

    Adding tasks through the API:

builder.setSpout(...).setNumTasks(2);
builder.setBolt(...).setNumTasks(task_num);

    If you set the number of tasks manually, the total number of tasks is the specified number regardless of the number of threads, and those tasks are divided among the executor threads for execution.
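
    Putting the three API-level settings together, a sketch that reuses the component ids from the word-count example: 2 workers for the topology, 2 executor threads for the split bolt, and 4 tasks shared by those threads (2 tasks per thread).

Config config = new Config();
config.setNumWorkers(2); // 2 worker processes for the topology
builder.setBolt("split_sentence_bolt", new SplitSentenceBolt(), 2) // 2 executors
		.setNumTasks(4) // 4 tasks in total, divided among the 2 executors
		.shuffleGrouping("sentence_spout");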

3. Data flow grouping

    A stream grouping defines how a stream's tuples are distributed among the tasks of the bolts that subscribe to it.

    Storm has seven built-in stream grouping methods:

1.Shuffle Grouping

    Random grouping.

    Tuples in the stream are distributed randomly across the bolt's tasks, so that each task receives an equal number of tuples.

2.Fields Grouping

    Group by field.

    Groups tuples by the value of the specified field: tuples with the same value for that field are always routed to the same task of the bolt.

3.All Grouping

    Full copy grouping.

    Every tuple is copied and delivered to all tasks of the subscribing bolt.

4.Global Grouping

    Global grouping.

    Routes all tuples to a single task; Storm picks the task with the smallest task id as the receiver. Configuring a parallelism hint or task count for the bolt is therefore meaningless under this grouping.

    Because every tuple is sent to one JVM instance, this method can create a performance bottleneck on, or even crash, a single JVM or server in the Storm cluster.

5.None Grouping

    Not grouped.

    Functionally the same as shuffle grouping; reserved for future use.

6.Direct Grouping

    Direct grouping.

    The emitting component uses the emitDirect() method to decide which task of the consuming component receives each tuple. It can only be used on streams that have been declared as direct streams.
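
    A sketch of the calls involved ("direct_stream" is an illustrative stream id, and taskId would be looked up from the TopologyContext in prepare() or open()):

// In declareOutputFields(): the second argument declares the stream as direct.
declarer.declareStream("direct_stream", true, new Fields("word"));
// In execute()/nextTuple(): the emitter names the receiving task explicitly.
collector.emitDirect(taskId, "direct_stream", new Values("storm"));
// The consumer subscribes with builder.setBolt(...).directGrouping(componentId, "direct_stream").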

7.Local or shuffle Grouping

    Local or shuffle grouping.

    Similar to shuffle grouping, but tuples are delivered to tasks of the target bolt running in the same worker when such tasks exist; shuffle grouping is used otherwise. This reduces network traffic and can thereby improve the topology's performance.
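
    Wiring it up (a sketch that reuses ids from the word-count example):

builder.setBolt("split_sentence_bolt", new SplitSentenceBolt(), 2)
		.localOrShuffleGrouping("sentence_spout");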

8. Custom grouping

    In addition, you can define your own grouping by writing a class that implements the CustomStreamGrouping interface.

    Code:

/**
 * A custom stream grouping.
 * @author park
 */
public class MyStreamGrouping implements CustomStreamGrouping {

	private List<Integer> targetTasks;

	/**
	 * Called at runtime to initialize the grouping.
	 * context: the topology context object
	 * stream: the stream to be grouped
	 * targetTasks: the identifiers of all candidate tasks
	 */
	@Override
	public void prepare(WorkerTopologyContext context, GlobalStreamId stream, List<Integer> targetTasks) {
		this.targetTasks = targetTasks;
	}

	/**
	 * The core method: choose the target task(s).
	 * taskId: the id of the component emitting the tuple
	 * values: the values of the tuple
	 * The return value is the list of tasks the tuple should be sent to.
	 * Here, as a simple example, the first value is hashed to pick one task.
	 */
	@Override
	public List<Integer> chooseTasks(int taskId, List<Object> values) {
		int index = Math.abs(values.get(0).hashCode()) % targetTasks.size();
		return Collections.singletonList(targetTasks.get(index));
	}
}
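
    Hooking the custom grouping into a topology (a sketch, using component ids from the word-count example):

builder.setBolt("word_count_bolt", new WordCountBolt(), 2)
		.customGrouping("split_sentence_bolt", new MyStreamGrouping());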

 

5. Concepts in a Storm Cluster

1. Overview

    Storm clusters follow a master/slave structure, and Storm's master node is semi-fault-tolerant.

    A Storm cluster consists of one master node (nimbus) and one or more worker nodes (supervisors).

    In addition, a Storm cluster needs a ZooKeeper ensemble for cluster coordination.

2. nimbus

    The main responsibility of the nimbus daemon is to manage, coordinate, and monitor the topologies running on the cluster: deploying topologies, assigning tasks, and reassigning tasks when processing fails.

1. Topology submission process

    Publishing a topology to the Storm cluster works as follows.

    The topology and its configuration are packaged into a jar file and submitted to the nimbus server. Once nimbus receives the package, it distributes the jar to a sufficient number of supervisor nodes. When the supervisor nodes have received it, nimbus assigns a portion of the tasks to each supervisor and signals them to spawn enough workers to execute the assigned tasks.

    nimbus records the status of all supervisor nodes and the tasks assigned to them. If a supervisor stops reporting heartbeats or becomes unreachable, nimbus reassigns that supervisor's tasks to other supervisor nodes in the cluster.

2. Semi-fault-tolerant mechanism

    Nimbus does not take part in the topology's data processing; it is only responsible for initializing the topology, distributing tasks, and monitoring processes. Therefore, even if the nimbus daemon dies while a topology is running, the topology keeps processing data as long as the assigned supervisors and workers stay healthy. This is why Storm is called semi-fault-tolerant.

3. supervisor

    The supervisor daemon waits for nimbus to assign it tasks, then spawns and monitors worker processes to execute those tasks.

    The supervisor and its workers run in separate JVM processes. If a worker exits abnormally because of an error, the supervisor will try to respawn a new worker process.

 
