Overview and installation of Apache Flink (Chapter 1)

Apache Flink

Flink is a framework for stateful computation over data streams, and it is usually regarded as a third-generation big data analysis solution.

The first generation: Hadoop MapReduce (2006) for static batch processing and Storm (September 2014) for stream computing; two independent computing engines, difficult to use, with low throughput but relatively fast computation.
The second generation: Spark RDD static batch processing (February 2014) plus DStream / Structured Streaming stream computing; a unified computing engine, easier to use, with high throughput but relatively slow computation.

The third generation: Flink DataStream (December 2014) stream computing and Flink DataSet batch processing; a unified computing engine of moderate difficulty.

It can be seen that Spark and Flink were born around the same time, but Flink developed more slowly at first: people did not yet have a deep understanding of big-data analysis, and the business scenarios of the time were largely confined to batch processing. Flink's development therefore lagged behind Spark's, and it was not until around 2016 that the importance of stream computing gradually came to be recognized.

Typical stream computing scenarios: system monitoring, public opinion monitoring, traffic forecasting, power grid monitoring, disease forecasting, banking/financial risk control, etc.

More detailed analysis: https://blog.csdn.net/weixin_38231448/article/details/100062961


Runtime architecture


Concepts

Task and Operator Chain

Flink is a distributed stream computing engine. The engine splits a computing job into several Tasks (comparable to Stages in Spark). Each Task has its own parallelism, and each parallel instance is executed by a thread; because a Task runs in parallel, it corresponds to a group of threads at the bottom layer, and Flink calls these threads the SubTasks of that Task.

The difference from Spark is that Spark divides stages through RDD dependencies, while Flink divides tasks through the concept of an OperatorChain.

The so-called OperatorChain means that, when Flink weaves a job together, it tries to chain multiple operators into one Task in order to reduce the overhead of passing data between threads. Currently there are two connection modes between Flink operators: forward and hash|rebalance; only forward-connected operators can be chained.

Task - equivalent to a Stage in Spark; each Task contains several SubTasks
SubTask - equivalent to a thread; a subtask of a Task
OperatorChain - a mechanism for merging multiple operators into one Task; the merging principle is similar to the wide/narrow dependencies of Spark RDDs (see the sketch below)
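As a rough reference, here is a minimal DataStream sketch (reusing the CentOS host and port 9999 of the word-count example later in this chapter; these are assumptions) showing how a user can influence chaining: disableChaining() cuts the chain at one operator, while env.disableOperatorChaining() turns chaining off for the whole job.

import org.apache.flink.streaming.api.scala._

object OperatorChainDemo {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // env.disableOperatorChaining()      // option 1: turn chaining off for the entire job

        env.socketTextStream("CentOS", 9999)  // assumed host/port, same as the word-count example
        .flatMap(_.split("\\s+"))             // flatMap -> map is a forward connection, chained by default
        .map((_, 1))
        .disableChaining()                    // option 2: cut the chain, this map becomes its own Task
        .keyBy(0)                             // hash exchange: a chain never crosses this boundary
        .sum(1)
        .print()

        env.execute("Operator Chain Demo")
    }
}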

JobManagers, TaskManagers, Clients

JobManagers (also known as masters) coordinate the distributed execution: they schedule tasks, coordinate checkpoints, coordinate failure recovery, and so on, which is roughly equivalent to the Master + Driver in Spark. A cluster has at least one JobManager; in HA mode one JobManager is active and the others are in standby.

TaskManagers (also called Workers) are the compute nodes that actually execute the Tasks, and they report their own status and workload to the JobManager. A cluster usually has several TaskManagers.

Clients - unlike in Spark, the Client in Flink is not part of the cluster's computation. The Client is only responsible for submitting the Dataflow Graph of a job to the JobManager, and it may exit once the submission is done; it therefore plays no scheduling role while the job is running.

Task Slots and Resources

Each Worker (TaskManager) is a JVM process that can execute one or more SubTasks (threads). To control how many Tasks a Worker accepts, the Worker exposes so-called Task Slots, which express the computing capacity of the node (every compute node has at least one Task Slot).

Each Task Slot represents a fixed subset of the TaskManager's resources. For example, if a TaskManager has 3 Task Slots, each slot represents 1/3 of the memory of that TaskManager process. Each job obtains its own fixed set of Task Slots when it starts, which avoids memory competition between different jobs at runtime: the allocated slots can only be used by the tasks of the current job, and tasks of different jobs never share or preempt resources.

However, a job is split into several Tasks, and each Task consists of several SubTasks (depending on its parallelism). By default, the memory of a Task Slot can only be shared among SubTasks of different Tasks of the same job; that is, two SubTasks of the same Task cannot run in the same Task Slot, but SubTasks of different Tasks of the same job can.

If SubTasks of different Tasks of the same job did not share slots, resources would be wasted. For example, the source and map operations in the figure below are resource-sparse operations because they take up little memory, whereas keyBy/window()/apply() involves a shuffle, occupies a large amount of memory, and is therefore a resource-intensive operation.

Therefore, Flink by default shares Task Slot resources among SubTasks of different Tasks. The user then only needs to adjust the parallelism of the source/map and keyBy/window()/apply() tasks, for example from 2 to 6 as in the figure above, and Flink's bottom layer will allocate resources as follows:

It can be seen that Flink's default behavior is to let SubTasks of different Tasks of the same job share Task Slots. This means that the number of Task Slots a job needs equals the maximum parallelism among its Tasks. Of course, users can also control the Task Slot sharing strategy between Flink Tasks programmatically, as sketched below.
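For illustration, a minimal sketch of adjusting parallelism and the slot-sharing strategy from the DataStream API; the group name "intensive" and the parallelism values are assumptions mirroring the 2-to-6 example above.

import org.apache.flink.streaming.api.scala._

object SlotSharingDemo {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        env.socketTextStream("CentOS", 9999)
        .flatMap(_.split("\\s+")).setParallelism(6)   // resource-sparse tasks
        .map((_, 1)).setParallelism(6)
        .keyBy(0)
        .sum(1).setParallelism(6)                     // resource-intensive task
        .slotSharingGroup("intensive")                // assumed group name; downstream operators inherit it
        .print()

        env.execute("Slot Sharing Demo")
    }
}

With everything left in the default sharing group, the job above would need 6 Task Slots (the maximum task parallelism); moving the aggregation into its own group isolates it from the source/map subtasks at the cost of additional slots.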

Conclusion: the amount of resources a Flink job needs is computed automatically and does not have to be specified by the user; the user only needs to specify the computation and its parallelism.

State Backends

Flink is a stateful stream computing engine. The exact data structure in which the key/value state is stored depends on the selected State Backend. For example, the Memory State Backend keeps the data in an in-memory HashMap, while RocksDB (an embedded NoSQL store, similar in spirit to the embedded Derby database) can be used as a State Backend to store the state.

In addition to defining the data structure that holds the state, a State Backend also implements the logic to take a point-in-time snapshot of the key/value state and to store that snapshot as part of a Checkpoint.
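As a rough sketch (the HDFS path is an assumption, and the RocksDB variant additionally needs the flink-statebackend-rocksdb_2.11 dependency), a State Backend can be chosen per job from the DataStream API:

import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.scala._

object StateBackendDemo {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // keep working state on the heap, write snapshots to HDFS during checkpoints
        env.setStateBackend(new FsStateBackend("hdfs://CentOS:9000/flink-checkpoints"))

        // alternative: store state in embedded RocksDB (extra dependency required)
        // env.setStateBackend(new RocksDBStateBackend("hdfs://CentOS:9000/flink-checkpoints", true))

        // ... define the DataStream job and call env.execute() as usual
    }
}

A cluster-wide default can also be configured in flink-conf.yaml through state.backend and state.checkpoints.dir.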


Savepoints

Programs written with the DataStream API can be resumed from a Savepoint. Savepoints allow programs and Flink clusters to be updated without losing any state.

A Savepoint is a manually triggered Checkpoint: it takes a snapshot of the program and writes it to the State Backend, relying on the regular checkpointing mechanism. A so-called Checkpoint means that, while the program runs, it periodically snapshots the state on the worker nodes and produces a checkpoint. For recovery only the latest completed checkpoint is needed, so older checkpoints can be safely discarded as soon as a new one completes.

Savepoints are similar to these periodic checkpoints, except that they are triggered by the user and do not expire automatically when newer checkpoints complete. A user can create a Savepoint with the command line or through the REST API, for example when cancelling a job.
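For reference, a minimal sketch of enabling the periodic checkpointing that Savepoints build on; the 5-second interval is just an assumed value.

import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup
import org.apache.flink.streaming.api.scala._

object CheckpointDemo {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        env.enableCheckpointing(5000)   // take a checkpoint every 5 seconds (assumed interval)
        env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
        env.getCheckpointConfig.enableExternalizedCheckpoints(
            ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION)  // keep the latest checkpoint if the job is cancelled

        // ... define the DataStream job and call env.execute() as usual
    }
}

A Savepoint itself is then triggered from outside the job, for example with ./bin/flink savepoint <jobId> [targetDirectory], or with ./bin/flink cancel -s [targetDirectory] <jobId> to take one while cancelling the job.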

Environment installation


Prerequisites

JDK 1.8+ is required, with JAVA_HOME configured.

Install Hadoop and make sure it runs normally (passwordless SSH, HADOOP_HOME configured).

Flink installation (Standalone)

Upload and unzip

[root@CentOS ~]# tar -zxf flink-1.10.0-bin-scala_2.11.tgz -C /usr/
[root@CentOS flink-1.10.0]# tree -L 1 ./
./
├── bin        # executable scripts
├── conf       # configuration files
├── examples   # example jars
├── lib        # dependency jars
├── LICENSE
├── licenses
├── log        # runtime logs
├── NOTICE
├── opt        # optional third-party plugin packages
├── plugins
└── README.txt
8 directories, 3 files

Configure flink-conf.yaml

[root@CentOS flink-1.10.0]# vi conf/flink-conf.yaml
#==============================================================================
# Common
#==============================================================================
# host (or IP address) where the JobManager runs
jobmanager.rpc.address: CentOS
# number of Task Slots offered by each TaskManager
taskmanager.numberOfTaskSlots: 4
# default parallelism used when a job does not set one
parallelism.default: 3

Configure slaves

[root@CentOS flink-1.10.0]# vi conf/slaves
CentOS

Start Flink

[root@CentOS flink-1.10.0]# ./bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host CentOS.
Starting taskexecutor daemon on host CentOS.
[root@CentOS flink-1.10.0]# jps
6978 NameNode
7123 DataNode
109157 StandaloneSessionClusterEntrypoint
7301 SecondaryNameNode
109495 TaskManagerRunner
109544 Jps

Check if the startup is successful

Users can access Flink's WEB UI at: http://CentOS:8081

Quick start

Import dependencies

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.9.2</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.10.0</version>
</dependency>

Client program

import org.apache.flink.streaming.api.scala._
object FlinkWordCountQiuckStart {
    def main(args: Array[String]): Unit = {
        //1. Create the stream execution environment
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        //2. Create the DataStream (to be refined later)
        val text = env.socketTextStream("CentOS", 9999)
        //3. Apply transformation operators to the DataStream
        val counts = text.flatMap(line=>line.split("\\s+"))
        .map(word=>(word,1))
        .keyBy(0)
        .sum(1)
        //4. Print the results to the console
        counts.print()
        //5. Execute the streaming job
        env.execute("Window Stream WordCount")
    }
}
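Note: the program reads from a socket, so before running it, start a simple text server on the CentOS host, for example with nc -lk 9999 (assuming netcat is installed), and type words into that terminal; the running job then prints running counts such as (hello,1), (hello,2).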

Add the Maven packaging plugins

<build>
    <plugins>
        <!-- Scala compiler plugin -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>4.0.1</version>
            <executions>
                <execution>
                    <id>scala-compile-first</id>
                    <phase>process-resources</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- plugin to create the fat jar -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>2.4.3</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <filters>
                            <filter>
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Java compiler plugin -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
            <executions>
                <execution>
                    <phase>compile</phase>
                    <goals>
                        <goal>compile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Use mvn package to package

Use the WEB UI to submit tasks
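On the Web UI (http://CentOS:8081), open the job-submission page, upload the fat jar produced by mvn package, fill in the entry class (here FlinkWordCountQiuckStart) and, if desired, the parallelism, then submit. Alternatively, the same jar can be submitted from the command line with something like ./bin/flink run -c FlinkWordCountQiuckStart -p 3 /path/to/the-fat-jar.jar (the jar path here is illustrative).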

Origin blog.csdn.net/origin_cx/article/details/104662633