The previous article gave a brief walkthrough of deploying a Flink standalone cluster in a production environment. With only one jobmanager, once that node goes down every running task is interrupted, which has a fairly large impact. In production there should be at least two jobmanager nodes to ensure high availability; a jobmanager and a taskmanager can run as two instances on the same physical node, so multiple jobmanagers and multiple taskmanagers can coexist in a highly available setup. Failover and recovery rely on ZooKeeper, so first prepare a ZooKeeper cluster. An independent ZooKeeper cluster is recommended; do not use Flink's built-in single-node ZooKeeper. The original environment is as follows:
bigdata1 - jobmanager
bigdata2, bigdata3, bigdata4 - taskmanager
Current ZooKeeper cluster: bigdata1, bigdata2, bigdata3, port 2181
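Before touching the Flink configuration, it is worth confirming every ZooKeeper node is actually serving. A hedged sketch, assuming the four-letter-word commands (srvr) are enabled on the ensemble and nc is installed:

```shell
# Check each ZooKeeper node and show whether it is the leader or a follower.
for zk in bigdata1 bigdata2 bigdata3; do
  echo "--- $zk ---"
  echo srvr | nc "$zk" 2181 | grep Mode
done
```

One node should report "Mode: leader" and the others "Mode: follower"; if a node does not answer, fix ZooKeeper first.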
Next we expand the jobmanagers to achieve high availability: run a jobmanager on bigdata4 alongside the existing one on bigdata1.
First, configure one node; here we start with bigdata1:
Edit conf/flink-conf.yaml and find the high availability section. It is commented out by default, meaning high availability is disabled; remove the comments and add a few configuration items, specifically:
high-availability: zookeeper
high-availability.storageDir: file:///data/flink/ha
high-availability.zookeeper.quorum: bigdata1:2181,bigdata2:2181,bigdata3:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /flink_cluster
high-availability defaults to NONE, meaning high availability is not used; here it is set to zookeeper.
high-availability.storageDir is where high availability stores the larger objects needed for recovery. The documentation recommends storage that all nodes can access, such as HDFS. A local file system path is configured here; whether that actually works needs to be verified, and HDFS is recommended for production.
high-availability.zookeeper.quorum is the ZooKeeper cluster connection string.
high-availability.zookeeper.path.root is the root path for Flink in ZooKeeper and must be the same across the whole cluster, here /flink. If multiple Flink clusters share the same ZooKeeper cluster, this path should distinguish them.
high-availability.cluster-id identifies the cluster and must be consistent across the whole cluster. A directory named after this cluster-id is created both under storageDir and under the ZooKeeper path to store the coordination data needed for recovery.
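Putting the options together, the cluster-id nests under both path.root in ZooKeeper and storageDir on disk. A rough sketch of the resulting paths (the exact layout is an assumption, consistent with the leader znode queried later in this article):

```shell
# Illustrate how cluster-id nests under path.root and storageDir.
ROOT=/flink                   # high-availability.zookeeper.path.root
CLUSTER_ID=/flink_cluster     # high-availability.cluster-id
STORAGE=file:///data/flink/ha # high-availability.storageDir

echo "ZooKeeper cluster path: ${ROOT}${CLUSTER_ID}"
echo "Leader znode example:   ${ROOT}${CLUSTER_ID}/leader/rest_server_lock"
echo "Blob/metadata storage:  ${STORAGE}${CLUSTER_ID}"
```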
Save the file once the configuration above is correct.
Configure the masters file, conf/masters: add node bigdata4.
Meanwhile conf/slaves remains unchanged: bigdata2, bigdata3, bigdata4.
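Assuming the default web UI port 8081 (the port suffix in masters is the usual convention, not stated in the original), the two files would look like:

```
# conf/masters
bigdata1:8081
bigdata4:8081

# conf/slaves
bigdata2
bigdata3
bigdata4
```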
Then synchronize flink-conf.yaml and masters to all the other nodes in the cluster, and make sure the ZooKeeper service is up and running.
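The synchronization can be done with a simple scp loop; a sketch, assuming passwordless SSH and that Flink lives at the same path on every node (FLINK_HOME=/opt/flink is an assumption):

```shell
# Push the two changed config files to the remaining cluster nodes.
FLINK_HOME=/opt/flink
for node in bigdata2 bigdata3 bigdata4; do
  scp "$FLINK_HOME/conf/flink-conf.yaml" "$FLINK_HOME/conf/masters" \
      "$node:$FLINK_HOME/conf/"
done
```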
Execute bin/start-cluster.sh to start the cluster. You will find that an extra StandaloneSessionClusterEntrypoint process has started on bigdata4. Now use the ZooKeeper client to run get /flink/flink_cluster/leader/rest_server_lock to see the current jobmanager leader, which is normally bigdata1.
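With the stock ZooKeeper CLI, that leader lookup might look like this (the zkCli.sh location is an assumption about your ZooKeeper installation, and the znode holds serialized leader data rather than plain text):

```shell
# Query the leader znode created under path.root + cluster-id.
zkCli.sh -server bigdata1:2181 get /flink/flink_cluster/leader/rest_server_lock
```

The output should contain the address of the current leading jobmanager, normally bigdata1 right after startup.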
You can then try killing the StandaloneSessionClusterEntrypoint process on bigdata1 and visiting the web UI at bigdata4:8081. While the failover is in progress the Flink log may report errors; wait a short while and the page will load successfully, showing the slots, task managers, and task details as normal. At this point the jobmanager has failed over successfully and high availability is working; checking the node in ZooKeeper shows the leader has switched to bigdata4.
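The failover drill above can be scripted roughly as follows; treat it as a sketch, since the jps/awk process lookup and the 30-second wait are assumptions about the environment:

```shell
# Kill the leading jobmanager on bigdata1, then poll the standby's web UI.
ssh bigdata1 'kill $(jps | awk "/StandaloneSessionClusterEntrypoint/ {print \$1}")'
sleep 30   # give the standby time to take over leadership
curl -s http://bigdata4:8081/overview
```

Once the REST call returns cluster overview JSON, the standby on bigdata4 has become the active jobmanager.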
Also note that once high availability is configured, the jobmanager.rpc.port setting in flink-conf.yaml no longer takes effect; it only applies to the earlier standalone cluster with a single jobmanager. The port is now chosen automatically and differs between jobmanagers, but we do not need to care about it, as it has no impact on using Flink.
That is all there is to configuring Flink jobmanager high availability. It is fairly simple to set up and is recommended for production, giving better cluster stability.
Reference: https://ci.apache.org/projects/flink/flink-docs-release-1.9/zh/ops/jobmanager_high_availability.html