HA configuration of Flink service


By default, each Flink cluster has only one JobManager, which is a single point of failure (SPOF): if the JobManager fails, no new jobs can be submitted and running jobs fail as well. With JobManager HA, the cluster can recover from JobManager failures and the single point of failure is avoided. Users can configure Flink cluster HA (high availability) in Standalone or Flink on Yarn cluster mode.
Flink's HA depends on ZooKeeper and HDFS, so first install and start both.
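As an optional sanity check (a sketch, not part of the original setup), you can confirm on a node that the ZooKeeper and HDFS daemons are actually running before continuing:

```shell
# List Java daemons and keep only the ZooKeeper/HDFS ones;
# print a hint if none of them are up on this node.
jps 2>/dev/null | grep -E 'QuorumPeerMain|NameNode|DataNode' \
  || echo "start ZooKeeper and HDFS first"
```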

Standalone mode HA

In Standalone mode, the basic idea of JobManager high availability is that at any time there is one Alive (leader) JobManager and several Standby JobManagers. When the Alive JobManager fails, a Standby JobManager takes over the cluster and becomes the new Alive JobManager, so the running program can continue and a single point of failure is avoided. There is no fixed distinction between Alive and Standby instances: each JobManager can take either role. The configuration steps are as follows.

Modify the conf/masters file, adding one host:webui-port entry per JobManager node:

[root@server01 conf]# vi masters
server01:8081
server02:8081
server03:8081

Modify conf/flink-conf.yaml, adding the following high-availability configuration items:

high-availability: zookeeper

# The path where metadata for master recovery is persisted. While ZooKeeper stores
# the small ground truth for checkpoint and leader election, this location stores
# the larger objects, like persisted dataflow graphs.
#
# Must be a durable file system that is accessible from all nodes
# (like HDFS, S3, Ceph, nfs, ...)
#
high-availability.storageDir: hdfs://server01:9000/flink/ha

# The list of ZooKeeper quorum peers that coordinate the high-availability
# setup. This must be a list of the form:
# "host1:clientPort,host2:clientPort,..." (default clientPort: 2181)
#
high-availability.zookeeper.quorum: server03:2181
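Note that the quorum above lists a single ZooKeeper server, which is itself a single point of failure. In a production setup you would list every peer of the ZooKeeper ensemble; a hedged sketch, assuming a three-node ensemble on server01 through server03 (which this tutorial's single-server ZooKeeper does not have):

```yaml
# List every ZooKeeper peer, not just one, so HA survives a ZooKeeper node failure.
high-availability.zookeeper.quorum: server01:2181,server02:2181,server03:2181

# Optional: the ZooKeeper root path under which Flink stores its HA data
# (this is the default value).
high-availability.zookeeper.path.root: /flink
```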

Copy the modified configuration files to the other Flink nodes

[root@server01 conf]# scp masters root@server03:/opt/apps/flink/conf
[root@server01 conf]# scp masters root@server02:/opt/apps/flink/conf
[root@server01 conf]# scp flink-conf.yaml root@server03:/opt/apps/flink/conf
[root@server01 conf]# scp flink-conf.yaml root@server02:/opt/apps/flink/conf

Start the cluster

[root@server01 flink]# bin/start-cluster.sh 
Starting HA cluster with 3 masters.
Starting standalonesession daemon on host server01.
Starting standalonesession daemon on host server02.
Starting standalonesession daemon on host server03.
Starting taskexecutor daemon on host server01.
Starting taskexecutor daemon on host server02.
Starting taskexecutor daemon on host server03.
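To check that the JobManagers are serving requests after startup, you can probe the REST endpoint of each master (a sketch assuming the hosts and default web UI port 8081 from the masters file above; requires curl and a running cluster):

```shell
# Probe each master's REST API; any endpoint that answers confirms a
# JobManager process is up (standby web UIs typically forward to the leader).
found=""
for host in server01 server02 server03; do
  if curl -s --max-time 2 "http://${host}:8081/overview" >/dev/null 2>&1; then
    found="$found $host"
  fi
done
echo "reachable masters:${found}"
```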

Flink On Yarn mode HA

When a Flink program is submitted on Yarn, whether in yarn-session mode or yarn-cluster mode, killing the corresponding Flink cluster process "YarnSessionClusterEntrypoint" causes the Yarn application to fail without any automatic retry. Therefore, Flink tasks running on Yarn also need an HA setup, and again ZooKeeper is used to accomplish this.

Modify yarn-site.xml on all Hadoop nodes to increase the maximum number of ApplicationMaster attempts, adding the following configuration:

<property>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>4</value>
</property>

Copy the modified file to the other Hadoop nodes

[root@server01 hadoop]# scp yarn-site.xml  root@server02:/opt/apps/hadoop/etc/hadoop
yarn-site.xml                                                                                100% 1146   414.4KB/s   00:00    
[root@server01 hadoop]# scp yarn-site.xml  root@server03:/opt/apps/hadoop/etc/hadoop
yarn-site.xml                                                                                100% 1146   401.7KB/s   00:00    
[root@server01 hadoop]# 

Restart HDFS and ZooKeeper

Modify the contents of conf/flink-conf.yaml as follows:

high-availability: zookeeper

# The path where metadata for master recovery is persisted. While ZooKeeper stores
# the small ground truth for checkpoint and leader election, this location stores
# the larger objects, like persisted dataflow graphs.
#
# Must be a durable file system that is accessible from all nodes
# (like HDFS, S3, Ceph, nfs, ...)
#
high-availability.storageDir: hdfs://server01:9000/flink/ha

# The list of ZooKeeper quorum peers that coordinate the high-availability
# setup. This must be a list of the form:
# "host1:clientPort,host2:clientPort,..." (default clientPort: 2181)
#
high-availability.zookeeper.quorum: server03:2181

# The number of ApplicationMaster restarts Flink requests from Yarn. Note that
# yarn.resourcemanager.am.max-attempts (4 above) is an upper bound, so the
# effective number of attempts is capped by the Yarn setting.
yarn.application-attempts: 10

Start a yarn-session

[root@server01 conf]# ../bin/yarn-session.sh -n 2
2020-08-21 16:29:04,396 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.address, server01
2020-08-21 16:29:04,398 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.rpc.port, 6123
2020-08-21 16:29:04,398 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: jobmanager.heap.size, 1024m
2020-08-21 16:29:04,398 INFO  org.apache.flink.configuration.GlobalConfiguration            - Loading configuration property: 
2020-08-21 16:29:37,803 INFO  org.apache.flink.runtime.rest.RestClient                      - Rest client endpoint started.
Flink JobManager is now running on server03:43760 with leader id 0d61bb85-a445-4de1-8095-6316288dee5e.
JobManager Web Interface: http://server03:34306

In the Yarn web UI we can see the Flink cluster we started, and that the Flink JobManager is running on server03.

Enter the corresponding node and kill the corresponding "YarnSessionClusterEntrypoint" process.

[root@server03 zookeeper]# jps
7506 DataNode
7767 QuorumPeerMain
8711 TaskManagerRunner
8760 Jps
7625 NodeManager
8761 Jps
8251 StandaloneSessionClusterEntrypoint
[root@server03 zookeeper]# kill -9 8251
[root@server03 zookeeper]# jps
9057 NodeManager
9475 YarnSessionClusterEntrypoint
7767 QuorumPeerMain
8711 TaskManagerRunner
9577 Jps
8958 DataNode
[root@server03 zookeeper]# kill -9 9475
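The manual kill above can also be scripted. This sketch (a hypothetical helper, not part of Flink) extracts the PID from jps output and kills it to trigger a failover:

```shell
# Find the YarnSessionClusterEntrypoint PID in jps output and kill it to
# trigger a failover; do nothing if no such process is running.
pid=$(jps 2>/dev/null | awk '/YarnSessionClusterEntrypoint/ {print $1}')
if [ -n "$pid" ]; then
  kill -9 "$pid"
  echo "killed YarnSessionClusterEntrypoint (pid $pid)"
else
  echo "no YarnSessionClusterEntrypoint running"
fi
```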

Observe in Yarn that the "applicationxxxx_0001" application is retried and its job information is still available afterwards.

Origin blog.csdn.net/zhangxm_qz/article/details/108204917