[FLINK] FLIP-6 on YARN

YARN TaskManager Runner FLINK-4929

The YARN TaskManager Runner has the following responsibilities:

Read the configuration and all environment variables and compute the effective configuration
Start all services (RPC, High Availability, Security, etc.)
Instantiate and start the Task Manager Runner
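
The first step above, computing the effective configuration, can be sketched as a merge of the loaded configuration with overrides taken from the container environment. This is only an illustration: the class `EffectiveConfig`, the `FLINK_CONF_` prefix and the underscore-to-dot key mapping are assumptions made for the example, not names defined by this document or by Flink.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of Flink's API.
final class EffectiveConfig {

    // Assumed convention: the variable FLINK_CONF_foo_bar overrides the key "foo.bar".
    // Keys that themselves contain underscores cannot be expressed this way.
    static final String ENV_PREFIX = "FLINK_CONF_";

    static Map<String, String> compute(Map<String, String> fileConfig,
                                       Map<String, String> env) {
        // Start from the values read from the configuration file.
        Map<String, String> effective = new HashMap<>(fileConfig);
        for (Map.Entry<String, String> entry : env.entrySet()) {
            String name = entry.getKey();
            if (name.startsWith(ENV_PREFIX)) {
                // Environment values take precedence over file values.
                String key = name.substring(ENV_PREFIX.length()).replace('_', '.');
                effective.put(key, entry.getValue());
            }
        }
        return effective;
    }
}
```

In a real runner the second argument would typically be `System.getenv()`; unrelated variables such as `PATH` are ignored by the prefix check.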

YARN Application Master Runner FLINK-4928

The Application Master Runner is the master process started in a YARN container when submitting the Flink-on-YARN job to YARN.

It has the following data available:

Its responsibilities are the following:

Read all configuration and environment variables, computing the effective configuration
Start all shared components (RPC, HighAvailability Services)
Start the ResourceManager
Start the JobManager Runner

YARN Resource Manager FLINK-4927

The Flink YARN Resource Manager communicates with YARN's Resource Manager to acquire and release containers.

It is also responsible for eagerly notifying the JobManager about container failures.
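
One way to picture the eager failure notification is a small tracker that remembers which party each container was allocated for and forwards an abnormal completion as soon as it is reported. The names `ContainerFailureListener` and `ContainerTracker` are hypothetical, and treating any non-zero exit status as a failure is an assumption of this sketch rather than something this document specifies.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical callback, standing in for the JobManager side of the notification.
interface ContainerFailureListener {
    void onContainerFailed(String containerId, int exitStatus);
}

// Hypothetical bookkeeping inside the Flink YARN Resource Manager.
final class ContainerTracker {

    // Maps each live container to the listener that should hear about its failure.
    private final Map<String, ContainerFailureListener> owners = new ConcurrentHashMap<>();

    void containerAllocated(String containerId, ContainerFailureListener owner) {
        owners.put(containerId, owner);
    }

    // Called when the cluster reports that a container has completed.
    void containerCompleted(String containerId, int exitStatus) {
        ContainerFailureListener owner = owners.remove(containerId);
        // Assumption: a non-zero exit status means the container failed.
        if (owner != null && exitStatus != 0) {
            owner.onContainerFailed(containerId, exitStatus);
        }
    }
}
```

Because the owner is removed on the first completion report, a duplicate report for the same container does not trigger a second notification.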

YARN Client FLINK-4930

The FLIP-6 YARN client can follow parts of the existing YARN client.

The main difference is that it does not wait for the cluster to be fully started and for all TaskManagers to register. It simply submits

Set up all configurations and environment variables
Set up the resources: Flink jar, utility jars (logging), user jar
Set up attached tokens / certificates
Submit the YARN application
Listen for leader / attempt to connect to the JobManager to subscribe to updates
Integration with the Flink CLI (command line interface)

Yarn HighAvailability Services FLINK-5254

The Yarn HighAvailability Services should be

Default

This option takes the YARN Application's working directory as HA storage
It automatically uses that working directory for the BlobStore
It creates an HDFS-based "RunningJobsRegistry" (see below)
ResourceManager leader election has a pre-configured leader, via the configuration, pointing to the AppMaster address.
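
The pre-configured leader in the last point can be thought of as a leader retrieval that skips election entirely and reports one fixed address. The sketch below is a minimal illustration under that reading; `FixedLeaderRetrieval` is a hypothetical class, and the address in the usage note is a made-up placeholder.

```java
import java.util.Objects;
import java.util.function.Consumer;

// Hypothetical stand-in for a leader retrieval service with a fixed leader.
final class FixedLeaderRetrieval {

    // Taken from the configuration; points at the AppMaster.
    private final String leaderAddress;

    FixedLeaderRetrieval(String leaderAddress) {
        this.leaderAddress = Objects.requireNonNull(leaderAddress);
    }

    // No election takes place: the listener is told the configured leader immediately.
    void start(Consumer<String> listener) {
        listener.accept(leaderAddress);
    }
}
```

A caller would construct it with the configured address, for example `new FixedLeaderRetrieval("appmaster-host:6123")`, and receive exactly that address in its callback.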

ZooKeeper Based

The ZooKeeper-based services use ZooKeeper for the ResourceManager and JobManager leader election. That way, they are safe against network partition scenarios that would otherwise lead to "split brain" situations.