Flink: Configuring high availability for the YARN-based JobManager

Preface

The JobManager coordinates scheduling and resource management for every Flink job. By default, each Flink cluster runs only one JobManager instance, which makes it a single point of failure: if the JobManager crashes, no new jobs can be submitted and running jobs fail.

JobManager high availability recovers the JobManager after it goes down, eliminating this single point of failure.

Both Flink's standalone deployment and its deployment on YARN support JobManager high availability; in production, most clusters are deployed on YARN.

YARN-based JobManager high availability still runs only one JobManager (ApplicationMaster) instance, but when that JobManager crashes, YARN restarts it.

Configuration

Flink's high availability relies on ZooKeeper, so install ZooKeeper before starting the configuration. Beyond that, only two configuration files need to be modified: yarn-site.xml and flink-conf.yaml.
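Before touching the Flink configuration, it can help to confirm that the ZooKeeper ensemble is actually up. A minimal check, assuming the hosts hd01/hd02/hd03 used later in flink-conf.yaml and that the ruok four-letter-word command is enabled:

# Each node should answer "imok" (ruok must be whitelisted on ZooKeeper 3.5+)
echo ruok | nc hd01 2181
echo ruok | nc hd02 2181
echo ruok | nc hd03 2181

# Alternatively, on each ZooKeeper node:
zkServer.sh status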

1. Configure yarn-site.xml and add the following configuration.

<property>
  <name>yarn.resourcemanager.am.max-attempts</name>
  <value>10</value>
  <description>
    Maximum number of ApplicationMaster execution attempts.
    The default is 2 (i.e. one JobManager failure can be tolerated).
  </description>
</property>
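yarn-site.xml is only read at startup, so the ResourceManager (and usually the NodeManagers) must be restarted after the change. A sketch, assuming a standard Hadoop installation with the sbin scripts and the ResourceManager web UI running on hd01:8088:

# Distribute the updated yarn-site.xml to all nodes, then restart YARN
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh

# The effective value of yarn.resourcemanager.am.max-attempts can then be
# checked on the ResourceManager's configuration page, e.g. http://hd01:8088/conf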

2. Configure flink-conf.yaml

yarn.application-attempts is the number of times the application is allowed to start. A value of 10 means the JobManager can be restarted 9 times (1 initial start + 9 restarts); YARN marks the application as failed once all 10 attempts have failed. Note that yarn.resourcemanager.am.max-attempts is the upper bound for this setting, so yarn.application-attempts cannot exceed it; if you want to raise the value here, yarn.resourcemanager.am.max-attempts must be raised as well.

high-availability: zookeeper
high-availability.zookeeper.quorum: hd01:2181,hd02:2181,hd03:2181
high-availability.storageDir: hdfs:///flink/recovery
high-availability.zookeeper.path.root: /flink
yarn.application-attempts: 10
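high-availability.storageDir must point at a location that the user running the Flink session can write to: Flink stores the JobManager metadata there and keeps only pointers to it in ZooKeeper. A quick sanity check, assuming the hdfs client is configured on the submitting machine (Flink will normally create the directory itself):

# Pre-create the recovery directory (optional) and confirm HDFS access
hdfs dfs -mkdir -p /flink/recovery
hdfs dfs -ls /flink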

3. Start yarn-session

yarn-session.sh -n 2 -s 2 -jm 1024 -tm 1024 -nm test -d
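After the session comes up, you can confirm that it is registered with YARN and that Flink has written its HA metadata under the configured ZooKeeper root. A sketch, assuming the ZooKeeper client scripts are available on the submitting machine:

# The session should show up as a RUNNING application
yarn application -list

# Flink's HA znodes should appear under the configured root path
zkCli.sh -server hd01:2181 ls /flink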

Note

Different YARN versions handle an application's containers differently after the ApplicationMaster fails.

YARN 2.3.0 < version < 2.4.0: All containers are restarted when the ApplicationMaster fails.

YARN 2.4.0 < version < 2.6.0: The TaskManager containers are kept alive when the ApplicationMaster fails. This makes recovery faster, because the new JobManager does not have to wait for container resources to be allocated again.

YARN 2.6.0 <= version: In addition to the 2.4.0–2.6.0 behaviour, YARN applies an attempt failure validity interval, which Flink sets to its Akka timeout value. Only ApplicationMaster failures that occur within one such interval count towards the limit, and the application is killed only when the yarn.application-attempts value configured in Flink is reached within that interval. This prevents a long-running job from slowly exhausting its application attempts.
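Regardless of the YARN version, a simple way to observe the restart behaviour is to kill the running JobManager and watch YARN start a new application attempt. A rough sketch; the application id, host, and JobManager process name are placeholders and vary by Flink and YARN version:

# Find the application id and the current attempt / ApplicationMaster host
yarn application -list
yarn applicationattempt -list <application_id>

# On that host, kill the JobManager (ApplicationMaster) JVM found via jps,
# then list the attempts again: a new attempt id should appear
jps
kill <jobmanager_pid>
yarn applicationattempt -list <application_id>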

Origin: blog.csdn.net/x950913/article/details/108567927