[Storm Stream Computing] An Introduction to Stream Computing: Storm Basics, Installation, and Configuration

1. Streaming or Batch: Stream Computing and Batch Computing


1.1. Basic concepts:

Streaming Analytics (stream computing): as the name suggests, this means processing data streams. Stream-processing engines such as Storm and Flink process and analyze data in real time; common scenarios include real-time dashboards and real-time reports.

Batch Analytics (batch computing): data is first collected and stored in a database, and then processed in batches. In the traditional sense this means using MapReduce, Hive, Spark Batch, and the like to analyze and process the data and produce offline reports.

The most essential difference between stream computing and batch computing is how the data is treated:
In stream computing, data arrives as a flow, like a river: it grows dynamically, has no boundary, and is processed continuously.
In batch computing, data is processed in blocks, like parcels delivered one by one: one batch of data is handled at a time.

This difference in processing shows up externally as a difference in timeliness and in data volume. Stream computing is characterized by high timeliness: the interval between a piece of data entering the system and the output being produced is short, so the latest results are available in real time. Batch computing is characterized by high data volume: it processes data sets straight from persistent storage, or loads them into memory, where sufficient processing resources are available.




1.2. Off-Line or Real-Time: Offline Computing and Real-Time Computing

Offline computing: all input data is known before the computation starts and does not change while it runs. The data volume is typically large and the computation takes a long time; it is a form of processing and mining historical data, whose volume is large and fixed. Such workloads are essentially always computed in batch mode.

Real-time computing: input data can arrive and be processed one record at a time, in a serialized fashion; there is no need to know all of the input up front. Runs are short, the volume per computation is relatively small, and results are produced in real time.

Is real-time computing necessarily stream computing? Is offline computing necessarily batch computing?

In fact, real-time versus offline is a distinction of timeliness requirements, while stream versus batch is a distinction of how the data is processed.
There is no one-to-one correspondence between the two pairs; rather, stream computing improves the efficiency of real-time computing, and offline computing lets larger data volumes be computed in batches.
Running a batch computation at high frequency can also approximate the effect of real-time computation.


2. Storm Basics:


2.1. The background of Storm:

Twitter Storm is a free and open-source distributed real-time computation system. Storm means to stream processing what Hadoop means to batch processing. Storm processes streaming data simply, efficiently, and reliably, supports multiple programming languages, and integrates easily with database systems, making it straightforward to build powerful real-time computing systems.


2.2. Features of Storm:

Storm can be used in many fields, such as real-time analytics, online machine learning, continuous computation, distributed RPC, and ETL (extract, transform, load). Storm has the following characteristics:

Integration: Storm integrates easily with queueing systems and database systems;

Simple API: Storm's API is simple and convenient to use;

Scalability: Storm's inherent parallelism allows it to run on distributed clusters;

Fault tolerance: Storm automatically restarts failed nodes and reassigns their tasks;

Reliable message processing: Storm guarantees that every message is processed completely;

Support for multiple programming languages: tasks can be defined in a variety of programming languages;

Rapid deployment: Storm can be deployed and put into use quickly;

Free and open source: Storm is an open-source framework that can be used free of charge.


2.3. Key Terms in Storm:

Nimbus: The Master of Storm, responsible for resource allocation and task scheduling. There is only one Nimbus in a Storm cluster.

Supervisor: the Slave of Storm, responsible for receiving the tasks Nimbus assigns and managing its Worker processes. A Supervisor node contains multiple Worker processes.

Worker: a worker process; each Worker process runs multiple tasks.

Task: each Spout and Bolt in a Storm cluster is executed by a number of tasks, and each task corresponds to one thread of execution.

Streams: Storm models streaming data as Streams: unbounded sequences of Tuples that are created and processed in parallel, in a distributed fashion.

Tuple: a tuple is a list of values, each with a name; a value can be of any type. A Tuple might look like it should be a key-value map, but because the field names of the tuples passed between components are declared in advance, a Tuple only needs to fill in its values in order; it is therefore effectively a value list.

Spout: Storm regards every Stream as having a source, and abstracts this source as a Spout. A Spout usually reads data from an external source (a queue, a database, etc.), wraps it into Tuples, and emits them into the Stream. A Spout is an active role: its interface includes a nextTuple function, which the Storm framework calls continuously.
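To make this concrete, here is a minimal Spout sketch against Storm's 1.x Java API; the class name, the word list, and the emitted field "word" are illustrative assumptions, not from the original article:

import java.util.Map;
import java.util.Random;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class RandomWordSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private Random rand;
    private final String[] words = {"storm", "flink", "spark"};

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        // Called once when the task starts; keep the collector for emitting later.
        this.collector = collector;
        this.rand = new Random();
    }

    @Override
    public void nextTuple() {
        // The Storm framework calls this continuously; each call emits one tuple.
        Utils.sleep(100); // throttle a little so the demo does not busy-spin
        collector.emit(new Values(words[rand.nextInt(words.length)]));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Field names are declared up front, which is why a Tuple is just a value list.
        declarer.declare(new Fields("word"));
    }
}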

Bolt: Storm abstracts the state-transformation steps of Streams as Bolts. A Bolt can both process Tuples and emit the processed Tuples as new Streams to other Bolts; Bolts can perform filtering, functions, joins, database operations, or any other processing. A Bolt is a passive role: its interface has an execute(Tuple input) method, which is called when a message is received, and users put their own processing logic in this method.
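A matching Bolt sketch, again with illustrative names: it consumes the "word" tuples from the Spout above, and the anchoring and ack calls show how the reliable-processing guarantee from Section 2.2 is engaged:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class UpperCaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        // Values can be read by their declared field name (or by position).
        String word = input.getStringByField("word");
        // Anchoring the new tuple to the input ties it into the tuple tree,
        // so Storm can track whether the original message is fully processed.
        collector.emit(input, new Values(word.toUpperCase()));
        collector.ack(input); // report successful processing of the input tuple
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper"));
    }
}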

Topology: Storm abstracts a network of Spouts and Bolts as a Topology, which can be submitted to a Storm cluster for execution. A Topology can be viewed as a stream-transformation graph: each node is a Spout or a Bolt, and each edge indicates which Stream a Bolt subscribes to. When a Spout or Bolt emits a tuple, it sends the tuple to every Bolt that subscribes to that Stream. Every processing component (Spout or Bolt) in a Topology contains processing logic, and the connections between components indicate the direction of data flow. Every component in a Topology runs in parallel; the degree of parallelism can be specified per component, and Storm allocates that many threads across the cluster to compute concurrently. As for implementation, a Topology definition in Storm is merely a set of Thrift structs (Thrift being a binary, high-performance communication middleware), so Topologies can be defined in any programming language.
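A hedged sketch of wiring the two illustrative components above into a Topology and running it in-process with LocalCluster for testing (a production cluster is submitted to with StormSubmitter instead, shown in Section 2.4.3):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.utils.Utils;

public class DemoTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Nodes of the stream-transformation graph:
        builder.setSpout("word-spout", new RandomWordSpout());
        // The edge: this bolt subscribes to the spout's stream.
        builder.setBolt("upper-bolt", new UpperCaseBolt()).shuffleGrouping("word-spout");

        // Run the topology inside the local JVM for testing.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("demo-topology", new Config(), builder.createTopology());
        Utils.sleep(10000);  // let it run for a while
        cluster.shutdown();  // then tear the local cluster down
    }
}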

Stream Groupings: Stream Groupings tell a Topology how Tuples are transferred between components (between a Spout and a Bolt, or between different Bolts). Every Spout and Bolt can have multiple distributed tasks, and the Stream Grouping determines when, and in what way, a task's Tuples are sent on.
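For illustration, two common groupings, reusing the TopologyBuilder from the previous sketch (WordCountBolt is a hypothetical bolt, and the parallelism of 4 is arbitrary):

// shuffleGrouping: tuples from word-spout are distributed randomly and
// evenly across the 4 upper-bolt tasks.
builder.setBolt("upper-bolt", new UpperCaseBolt(), 4).shuffleGrouping("word-spout");

// fieldsGrouping: tuples with the same "upper" value always reach the same
// count-bolt task, which is what per-word aggregation requires.
builder.setBolt("count-bolt", new WordCountBolt(), 4)
       .fieldsGrouping("upper-bolt", new Fields("upper"));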

2.4. Storm's Architecture and Workflow:


2.4.1. The Overall Architecture of Storm:

[Figure 2.4.1: Storm cluster architecture]

A Storm cluster follows a "Master-Worker" node pattern:
The Master node runs a daemon called "Nimbus": it distributes code across the cluster, assigns tasks to Workers, and monitors for failures.
Worker nodes run a daemon called "Supervisor": each one listens for the work assigned to its machine.

That is, based on the tasks Nimbus assigns, a Supervisor decides to start or stop Worker processes; several Worker processes run concurrently on one Worker node.

Storm uses Zookeeper as its distributed coordination component, handling all coordination between Nimbus and the Supervisors. Thanks to Zookeeper, if the Nimbus or a Supervisor process terminates unexpectedly, it can read back its previous state on restart and carry on working, which makes Storm extremely stable.


2.4.2. A Closer Look Inside the Supervisor:

[Figure 2.4.2: Internal structure of a Supervisor node]

Worker: each Worker process belongs to one specific Topology. A Supervisor node may run multiple Workers, and each Worker runs one or more Executor threads for the components (Spouts or Bolts) of its Topology, providing the runtime for their tasks.

Executor: an Executor is a thread created inside a Worker process; it executes one or more Tasks of the same component.

Task: the actual data processing is carried out by Tasks. Over a Topology's lifetime, the number of Tasks per component never changes, whereas the number of Executors may. The number of Executors is at most the number of Tasks; by default the two are equal.
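These three levels map directly onto settings in the Java API. A hedged sketch, reusing the illustrative component names from Section 2.3:

Config conf = new Config();
conf.setNumWorkers(2); // two Worker processes for this Topology

TopologyBuilder builder = new TopologyBuilder();
// Parallelism hint 2 = two Executor threads; setNumTasks(4) = four Tasks,
// so each Executor initially runs two Tasks. The Task count is fixed for
// the Topology's lifetime, while the Executor count can change on rebalance.
builder.setSpout("word-spout", new RandomWordSpout(), 2).setNumTasks(4);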


2.4.3. Storm's Workflow:

[Figure 2.4.3: Storm workflow]

All Topologies must be submitted from a Storm client node; once submitted, the Nimbus node assigns them to Supervisor nodes for processing. Nimbus first slices the submitted Topology into individual Tasks, assigns them to the appropriate Supervisors, and publishes the Task-to-Supervisor assignment information to the Zookeeper cluster. Each Supervisor then claims its own Tasks from Zookeeper and notifies its Worker processes to execute them.

Note: after a Topology is submitted, Storm creates the Spout/Bolt instances and serializes them. The serialized components are then sent to every machine that hosts one of their tasks (i.e., the Supervisor nodes), where they are deserialized for each task.
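On the client side, this whole flow is triggered by a single call. A hedged sketch, reusing the earlier illustrative components (in practice the class is packaged into a jar and launched with Storm's command-line client):

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SubmitDemo {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word-spout", new RandomWordSpout());
        builder.setBolt("upper-bolt", new UpperCaseBolt()).shuffleGrouping("word-spout");

        Config conf = new Config();
        conf.setNumWorkers(2);
        // Serializes the components and uploads the Topology to Nimbus, which
        // slices it into Tasks and writes the assignments to Zookeeper.
        StormSubmitter.submitTopology("demo-topology", conf, builder.createTopology());
    }
}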

3. Installing Storm:

Presumably you have already installed and configured CentOS, the Java JDK, and Zookeeper, the components this walkthrough depends on.
If you have not, that is fine too; at minimum you should have a CentOS 7.x virtual machine set up.
The versions used here are Java JDK 1.8.0, Zookeeper 3.4.6, and CentOS 7, and the goal is to install Storm 1.2.3.


3.1. Installing Java JDK 1.8.0 (skip this step if you already have a JDK)

(Make sure you have a network connection.) In the virtual machine, open a shell window and run:

sudo yum install java-1.8.0-openjdk java-1.8.0-openjdk-devel

The command above installs OpenJDK; the default installation location is /usr/lib/jvm/java-1.8.0-openjdk.
Once OpenJDK is installed, the java and javac commands can be used directly.
Next, configure the JAVA_HOME environment variable:

vim ~/.bashrc

Append the following line (pointing to the JDK installation location) at the end of the file, on a line of its own, then save:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

Then make the environment variable take effect by running:

source ~/.bashrc

With that set, verify the configuration:

java -version

If everything is configured correctly, java -version prints the Java version information. The Java runtime that Storm requires is now installed.

3.2. Installing Zookeeper (skip this step if you already have Zookeeper)

Choose the stable Zookeeper release (3.4.6).
Download: Zookeeper download link 1, Zookeeper download link 2.
How to download: open the page and click "zookeeper-3.4.6.tar.gz" under Projects.
Transfer: you can download it directly into the Downloads folder inside the virtual machine, or download the archive on Windows and copy it over with Xshell.


After downloading, run the following commands to install Zookeeper:

sudo tar -zxf ~/Downloads/zookeeper-3.4.6.tar.gz -C /usr/local

cd /usr/local

sudo mv zookeeper-3.4.6 zookeeper

sudo chown -R hadoop ./zookeeper

The last command grants ownership of the directory; hadoop is the virtual machine user name (a hadoop user with root privileges is commonly created when setting up the VM).
Next, run the following commands to configure Zookeeper:

cd /usr/local/zookeeper

mkdir tmp

cp ./conf/zoo_sample.cfg ./conf/zoo.cfg

vim ./conf/zoo.cfg

Change the line dataDir=/tmp/zookeeper in it to:

dataDir=/usr/local/zookeeper/tmp

Start Zookeeper:

./bin/zkServer.sh start

If it prints Starting zookeeper … STARTED, the startup succeeded.

3.3. Installing Storm (standalone mode)

Download: Storm 1.2.3 download link.
After downloading, run the following commands to install Storm:

sudo tar -zxf ~/Downloads/apache-storm-1.2.3.tar.gz -C /usr/local

cd /usr/local

sudo mv apache-storm-1.2.3 storm

sudo chown -R hadoop ./storm

Next, run the following commands to configure Storm:

cd /usr/local/storm

vim ./conf/storm.yaml

Modify the two settings storm.zookeeper.servers and nimbus.host.
Since we are configuring Storm in single-machine mode, uncomment both entries and set their values to the host name (an IP address works as well), as shown below.
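For example, on a single machine whose host name maps to localhost, the two settings might end up as follows (a sketch; substitute your own host name, and note that Storm 1.x sample files may ship a nimbus.seeds: ["host1", "host2"] entry instead of nimbus.host, in which case uncomment and edit that line instead):

storm.zookeeper.servers:
     - "localhost"

nimbus.host: "localhost"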


Add an IP mapping for the host name:

sudo vim /etc/hosts

Add a line (IP address + host name):

127.0.0.1 localhost


After this simple configuration, Storm can be started. Run the following commands to start the nimbus daemon:

cd /usr/local/storm

./bin/storm nimbus &

If the startup succeeds, the nimbus log output appears in the terminal.

Once nimbus is running, the terminal is occupied by that process and cannot accept further commands.
We therefore need to open another terminal, and run the following command there to start the supervisor daemon:

/usr/local/storm/bin/storm supervisor


Likewise, after starting the supervisor, yet another terminal is needed to run further commands.
We can also use the jps command to check whether everything started successfully; if so, it lists nimbus, supervisor, and QuorumPeerMain (the Zookeeper process).


With that, the installation and configuration of Storm in standalone mode is complete.
