Flink from entry to proficiency series (3)

4. Flink runtime architecture

4.1. System Architecture

Flink is a distributed parallel stream processing system. Simply put, it consists of multiple processes, which generally run on different machines.

For a distributed system, many thorny problems have to be faced, the core ones being: resource allocation and management in the cluster, process coordination and scheduling, persistent and highly available data storage, and failure recovery. Flink does not handle all of these problems by itself; instead, it builds on existing cluster architectures and services so that it can focus on its core work: distributed data stream processing.

Flink can be configured to run as an independent (Standalone) cluster, and it can also be easily integrated with cluster resource management tools such as YARN, Kubernetes, and Mesos. Flink likewise does not provide persistent distributed storage of its own, but directly uses existing distributed file systems (such as HDFS) or object storage (such as S3). For high-availability setups, Flink relies on Apache ZooKeeper.

4.1.1. Overall composition

In Flink's runtime architecture, the two most important components are:

  • Job Manager (JobManager)
  • Task Manager (TaskManager).

For a job submitted for execution, the JobManager is the real "manager" (Master), responsible for management and scheduling, so without considering high availability there can be only one; the TaskManagers are the "workers" (Worker, Slave), responsible for executing tasks and processing data, so there can be one or more. Flink's system of job submission and task processing is shown in the figure below.

(figure)
Here we must first explain the "client". The client is not actually part of the processing system; it is only responsible for job submission. Specifically, it calls the main method of the program, converts the code into a dataflow graph, and finally generates a JobGraph, which is sent to the JobManager.

After submission, the execution of the job no longer depends on the client; the client can either disconnect from the JobManager or keep the connection open. When we submitted a job on the command line earlier, adding the -d parameter indicated detached mode, i.e. disconnecting after submission.

Of course, the client can connect to the JobManager at any time to obtain the status and results of the current job, or send a request to cancel the job. Whether operations related to "flink run" are performed through the Web UI or the command line, they all go through the client.
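
As a rough sketch of these client-side interactions on the command line (the jar and class names reuse the earlier example, and <jobId> is a placeholder; exact options may vary by Flink version):

# submit in detached mode (-d): the client disconnects right after submission
bin/flink run -d -c com.song.wc.StreamWordCount flink_demo-1.0-SNAPSHOT.jar
# the same client can later query job status or cancel a job
bin/flink list
bin/flink cancel <jobId>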

JobManagers and TaskManagers can be started in different ways:

  • Started directly on the machines, as processes of a standalone cluster
  • Started in containers
  • Scheduled and started by a resource management platform, such as YARN or Kubernetes

This corresponds to the different deployment modes. After a TaskManager starts, the JobManager establishes a connection with it, converts the job graph (JobGraph) into an executable execution graph (ExecutionGraph), distributes it to the available TaskManagers, and the TaskManagers then execute the specific tasks.

4.1.2. Job Manager (JobManager)

JobManager is the core of task management and scheduling in a Flink cluster, and is the main process that controls application execution. That is to say, each application should be controlled and executed by a unique JobManager.

Of course, in a high-availability (HA) scenario, there may be multiple JobManagers. At this time, only one is the running leader node (leader), and the others are standby nodes (standby).

The JobManager in turn consists of 3 different components.

4.1.2.1. JobMaster

JobMaster is the core component of JobManager, responsible for processing individual jobs (Job).

Therefore, there is a one-to-one correspondence between a JobMaster and a specific job (Job). Multiple jobs can run in a Flink cluster at the same time, and each job has its own JobMaster. Note that early versions of Flink had no JobMaster concept; the JobManager then had a narrower meaning, roughly corresponding to what is now called the JobMaster.

  1. When a job is submitted, the JobMaster will first receive the application to be executed. The "application" mentioned here is generally submitted by the client, including: Jar package, data flow graph (dataflow graph), and job graph (JobGraph).
  2. The JobMaster converts the JobGraph into a physical dataflow graph called the "ExecutionGraph", which contains all the tasks that can be executed concurrently.
  3. JobMaster will send a request to the Resource Manager (ResourceManager) to apply for the necessary resources to execute the task.
  4. Once it has acquired enough resources, it distributes the execution graphs to the TaskManagers that actually run them.
  5. While running, the JobMaster will be responsible for all operations that require central coordination, such as the coordination of checkpoints.
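
For example, the checkpoint coordination mentioned in step 5 is something the user enables on the execution environment; a minimal sketch (the 10-second interval is just an illustrative value):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// ask for a checkpoint every 10 seconds; the JobMaster coordinates these checkpoints at runtime
env.enableCheckpointing(10000L);
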
4.1.2.2. Resource Manager (ResourceManager)

ResourceManager is mainly responsible for the allocation and management of resources, and there is only one in the Flink cluster.

The so-called "resources" mainly refer to the task slots of TaskManager. The task slot is the resource allocation unit in the Flink cluster, which includes a set of CPU and memory resources used by the machine to perform calculations. Each task (Task) needs to be assigned to a slot
for execution.

Note that Flink's built-in ResourceManager should be distinguished from the ResourceManager of other resource management platforms (such as YARN). Flink's ResourceManager has different concrete implementations for different environments and resource management platforms (such as Standalone deployment, or YARN).

  • In Standalone deployment, because the TaskManager is started separately (there is no Per-Job mode), the ResourceManager can only distribute the task slots of the available TaskManagers, and cannot start a new TaskManager independently.
  • When a resource management platform is available, there is no such restriction. When a new job applies for resources, the ResourceManager assigns TaskManagers with free slots to the JobMaster. If the ResourceManager does not have enough task slots, it can also request containers from the resource-providing platform in which to start new TaskManager processes.
  • In addition, ResourceManager is also responsible for stopping idle TaskManagers to release computing resources.
4.1.2.3. Dispatcher

The Dispatcher is mainly responsible for providing the REST interface used to submit applications, and for starting a new JobMaster component for each newly submitted job.
The Dispatcher also launches a Web UI for conveniently displaying and monitoring job execution information. The Dispatcher is not a required part of the architecture and may be omitted in some deployment modes.

4.1.3. Task Manager (TaskManager)

TaskManagers are the worker processes in Flink. They perform the actual computation on the data streams, so they are also called "workers".

There must be at least one TaskManager in the Flink cluster; of course, due to the consideration of distributed computing, there are usually multiple TaskManagers running, and each TaskManager contains a certain number of task slots.

Slot is the smallest unit of resource scheduling, and the number of slots limits the number of tasks that TaskManager can process in parallel.

  • After startup, the TaskManager will register its slots with the resource manager;
  • After receiving instructions from the resource manager, the TaskManager provides one or more slots to the JobMaster, which can then assign tasks to them for execution.
  • During execution, the TaskManager can buffer data and exchange data with other TaskManagers running the same application.

4.2. Job submission process

4.2.1. High-level abstract perspective

The submission process of Flink will vary with different deployment modes and resource management platforms. First, from a high-level perspective, let's take a look at how each component interacts and collaborates macroscopically when a job is submitted.
(figure)
As shown in the figure above, the specific steps are as follows:

  1. Generally, the client (App) submits the job to the JobManager through the REST interface provided by the Dispatcher.
  2. The Dispatcher starts the JobMaster and submits the job (including the JobGraph) to the JobMaster.
  3. The JobMaster parses the JobGraph into an executable ExecutionGraph, obtains the required number of resources, and then requests resources (slots) from the resource manager.
  4. The resource manager determines whether there are enough resources available; if not, a new TaskManager is started.
  5. After the TaskManager starts, it registers its available task slots (slots) with the ResourceManager.
  6. The resource manager notifies the TaskManager to provide slots for new jobs.
  7. TaskManager connects to the corresponding JobMaster and provides slots.
  8. JobMaster distributes the tasks that need to be executed to TaskManager.
  9. TaskManager executes tasks and can exchange data with each other.

If the deployment mode is different, or the cluster environment is different (such as Standalone, YARN, K8S, etc.), some of these steps may be different or omitted, and some components may run in the same JVM process.

4.2.2. Standalone mode (Standalone)

In the standalone mode (Standalone), there are only two deployment modes: session mode and application mode. The overall process of the two is very similar:

  • TaskManager needs to be started manually, so when ResourceManager receives a request from JobMaster, it will directly ask TaskManager to provide resources.
  • As for when the JobMaster is started: in session mode it is started in advance, while in application mode it is started only when the job is submitted.

The overall process of submission is shown in the figure below.

(figure)

4.2.3. YARN cluster

4.2.3.1. Session mode

In session mode, we need to start a YARN session first, which will create a Flink cluster.
(figure)
Only the JobManager is started here, and the TaskManager can be started dynamically as needed. Inside the JobManager, since the job has not been submitted, only the ResourceManager and Dispatcher are running, as shown in the figure below.
(figure)
The next step is the process of actually submitting the job, as shown in the following figure:

  1. The client submits the job to the Dispatcher through the REST interface.
  2. The Dispatcher starts the JobMaster and submits the job (including the JobGraph) to it.
  3. The JobMaster requests resources (slots) from the resource manager.
  4. The resource manager requests container resources from the YARN resource manager.
  5. YARN starts new TaskManager containers.
  6. After TaskManager starts, it registers its available task slots with Flink's resource manager.
  7. The resource manager notifies the TaskManager to provide slots for new jobs.
  8. TaskManager connects to the corresponding JobMaster and provides slots.
  9. JobMaster distributes the tasks that need to be executed to TaskManager and executes the tasks.

The whole process is almost the same as in standalone mode, except that when requesting resources it also needs to "report" to YARN's resource manager.

4.2.3.2. Per-Job mode (single job)

In single-job mode, the Flink cluster will not be started in advance, but a new JobManager will be started when the job is submitted. The specific process is shown in the figure below.

(figure)

  1. The client submits the job to the resource manager of YARN. In this step, the Jar package and configuration of Flink will be uploaded to HDFS at the same time, so that the container of Flink related components can be started later.
  2. YARN's resource manager allocates Container resources, starts Flink JobManager, and submits jobs to JobMaster. The Dispatcher component is omitted here.
  3. The JobMaster requests resources (slots) from the resource manager.
  4. The resource manager requests container resources from the YARN resource manager.
  5. YARN starts new TaskManager containers.
  6. After TaskManager starts, it registers its available task slots with Flink's resource manager.
  7. The resource manager notifies the TaskManager to provide slots for new jobs.
  8. TaskManager connects to the corresponding JobMaster and provides slots.
  9. JobMaster distributes the tasks that need to be executed to TaskManager and executes the tasks.

The only differences lie in how the JobManager is started and in the omission of the Dispatcher; the subsequent process is exactly the same as in session mode.

4.2.3.3. Application mode

The application mode is very similar to the submission process of the single job mode, except that the initial submission to the YARN resource manager is no longer a specific job, but the entire application. An application may contain multiple jobs, and these jobs will start their respective JobMasters in the Flink cluster.

4.3. Some important concepts

We have now seen the core components and overall architecture of the Flink runtime, as well as the specific job submission process in different scenarios.
But some details still need further consideration:

  • How does a specific job convert from the code we write into a task that TaskManager can execute?
  • How does the JobManager determine the total number of tasks and the resources required after receiving a submitted job?

4.3.1. Dataflow Graph

Flink is a stream processing framework. A Flink program actually defines a series of processing operations, and each incoming piece of data passes through these computation steps in turn.

In Flink code, each processing or transformation operation we define is called an "operator", so our program can be regarded as a pipeline composed of a series of operators through which the data flows in order, like water through a pipe. For example, in the earlier WordCount code, the socketTextStream() method called on the execution environment is an operator that reads a text stream, and the flatMap() that follows it is an operator that splits the string data and converts it into two-element tuples.

All Flink programs can be summarized as consisting of three parts: Source, Transformation and Sink.

  • Source means "source operator" and is responsible for reading in the data.
  • Transformation means "transformation operator" and processes the data with various operators.
  • Sink means "sink operator" and is responsible for outputting the data.
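
As a minimal sketch of this three-part structure (it reuses the WordCount example from earlier in the series; the host name and port are placeholders for your own socket source):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SimpleWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 7777)                      // Source: read a text stream
           .flatMap((String line, Collector<Tuple2<String, Long>> out) -> {
               for (String word : line.split(" ")) {                 // Transformation: split lines into words
                   out.collect(Tuple2.of(word, 1L));
               }
           })
           .returns(Types.TUPLE(Types.STRING, Types.LONG))
           .keyBy(t -> t.f0)
           .sum(1)                                                   // Transformation: aggregate per word
           .print();                                                 // Sink: output the results

        env.execute("simple word count");
    }
}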

(figure)
At runtime, a Flink program is mapped into a graph in which all operators are connected in logical order; this is called the "logical dataflow", or "dataflow graph". After we submit a job, we can open Flink's Web UI and click on the job to see the corresponding dataflow graph, as shown above. In the dataflow graph, the three parts Source, Transformation and Sink are clearly visible.

A dataflow graph can be an arbitrary directed acyclic graph (DAG), which is consistent with other frameworks such as Spark.

Each data flow in the graph starts with one or more source operators and ends with one or more sink operators. In most cases, there is a one-to-one correspondence between the operators in dataflow and the conversion operations in the program.

Does that mean that every DataStream API method call in our code is an operator? Not quite. Apart from sources that read data and sinks that output data, an intermediate transformation operator (Transformation Operator) must actually perform a transformation on the data; yet some method calls in the code do not complete a transformation by themselves. They may merely set a property, define how data is transmitted rather than transformed, or need to be combined with other methods to express one complete transformation.

For example, in the earlier code we used the keyBy method to define grouping, which is just a data partitioning operation, not an operator. In fact, we can see in the code that the type returned by the other transformation calls is SingleOutputStreamOperator, indicating an operator operation, while the type returned by keyBy is KeyedStream.
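
A small sketch of the return types involved (continuing from a DataStream<String> named lines, which is an assumed variable; the classes come from the DataStream API):

// map() performs a transformation, so it returns a SingleOutputStreamOperator
SingleOutputStreamOperator<Tuple2<String, Long>> mapped =
        lines.map(word -> Tuple2.of(word, 1L))
             .returns(Types.TUPLE(Types.STRING, Types.LONG));

// keyBy() only repartitions the data and is not an operator, so it returns a KeyedStream
KeyedStream<Tuple2<String, Long>, String> keyed = mapped.keyBy(t -> t.f0);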

4.3.2. Parallelism

In most cases, one operator operation corresponds to one task. Does that mean the number of operators in the program equals the number of tasks that are finally executed?

4.3.2.1. What is parallel computing

In big data scenarios, we rely on distributed architectures and parallel computing to improve data throughput. Since data can be sent elsewhere after one operation has processed it, we can assign the tasks of different operators to different nodes for execution. In this way the work is divided up and parallel processing is achieved.

However, careful analysis reveals that this kind of "parallelism" is not thorough. Because there is an execution order between operators, a piece of data must pass through them sequentially, and an operator can only process one piece of data at a time. For example, in the earlier WordCount: after a piece of data arrives, it must first be read in by the source operator and then transformed by flatMap; while one piece of data is being read by the source, the previous piece may be being processed by flatMap, so the different operator tasks do run in parallel. However, if multiple pieces of data arrive at the same time, a single operator cannot process them simultaneously; we still have to wait for one piece of data to finish before processing the next, which does not really improve throughput.

So compared to the above-mentioned "task parallelism", what we really care about is "data parallelism". That is to say, if multiple pieces of data arrive at the same time, we should be able to read them in at the same time and perform flatMap operations on different nodes at the same time.

4.3.2.2. Parallel subtasks and degree of parallelism

How do we achieve data parallelism? It is actually quite simple: we "copy" an operator to multiple nodes, and when data arrives it can be processed on any one of them. In this way, one operator task is split into multiple parallel "subtasks" (subtasks), which are distributed to different nodes, truly realizing parallel computing.

During the execution of Flink, each operator (operator) can contain one or more subtasks (operator subtask), and these subtasks are executed completely independently in different threads, different physical machines or different containers.

(figure)
The number of subtasks of a particular operator is called its parallelism. A data stream consisting of parallel subtasks is a parallel data stream, and it requires multiple stream partitions to distribute the parallel tasks. In general, the parallelism of a stream program can be considered to be the maximum parallelism among all of its operators. Different operators in one program may have different parallelisms.

As shown in the figure above, the current data flow contains four operators: source, map, window, and sink. Except for the last sink, whose parallelism is 1, the parallelism of the other operators is 2. The whole program therefore contains 7 subtasks and needs at least 2 partitions to execute in parallel. We can say that the parallelism of this stream processing program is 2.

4.3.2.3. Parallelism setting

In Flink, parallelism can be set in different ways, with different effective ranges and priority levels.

  1. Setting in the code
    In the code, we can simply call the setParallelism() method after the operator to set the parallelism of the current operator:
stream.map(word -> Tuple2.of(word, 1L)).setParallelism(2);

The parallelism set in this way is only valid for the current operator. In addition, we can also call the setParallelism() method of the execution environment to set a global parallelism: env.setParallelism(2); in this way, the default parallelism of all operators in the code becomes 2.

We generally do not set a global parallelism in the program, because hard-coding the global parallelism makes dynamic scaling impossible. Note also that since keyBy is not an operator, parallelism cannot be set on keyBy.

  2. Setting when submitting the application
    When submitting the application with the flink run command, you can add the -p parameter to specify the parallelism for this execution, which acts like the global setting on the execution environment:
bin/flink run -p 2 -c com.song.wc.StreamWordCount flink_demo-1.0-SNAPSHOT.jar

If we submit the job directly on the Web UI, we can also directly add the degree of parallelism in the corresponding input box.

  3. Setting in the configuration file
    We can also change the default parallelism directly in the cluster configuration file flink-conf.yaml: parallelism.default: 2. This setting applies to all jobs submitted on the cluster, and its initial value is 1.

Neither setting the parallelism in the code nor passing the -p parameter at submission is mandatory; when no parallelism is specified, the cluster's default parallelism from the configuration file is used. In a development environment there is no configuration file, and the default parallelism is then the number of CPU cores of the current machine.

We can summarize all parallelism setting methods, and their priorities are as follows:

  • For an operator, first check whether its parallelism is specified separately in the code. This specific setting has the highest priority and will override all subsequent settings.
  • If not set separately, the parallelism set globally by the execution environment in the current code is used.
  • If nothing is set in the code, then the parallelism specified by the -p parameter at submission time is used.
  • If the -p parameter is also not specified when submitting, then the default degree of parallelism in the cluster configuration file is used.

It should also be noted that the parallelism of an operator is sometimes constrained by its own implementation. For example, the socketTextStream operator we used earlier to read the socket text stream is itself a non-parallel source operator, so no matter what is set, its parallelism at runtime is always 1, corresponding to only one parallel subtask.

So what is the recommended way to set parallelism in practice? Set the parallelism only on individual operators in the code where needed, and do not set a global parallelism in the code; this makes it easy to scale dynamically when submitting the job.
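
A hedged sketch of that recommendation (Tokenizer is an assumed FlatMapFunction; only the sink's parallelism is pinned in code, and everything else follows the -p value given at submission):

env.socketTextStream("localhost", 7777)   // non-parallel source: parallelism is always 1
   .flatMap(new Tokenizer())              // no setParallelism(): follows -p or the cluster default
   .keyBy(t -> t.f0)
   .sum(1)
   .print().setParallelism(1);            // only the sink is pinned, e.g. to produce a single output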

4.3.3. Operator Chain

4.3.3.1. Data transmission between operators

(figure)
As shown in the figure above, a data stream can transmit data between operators in a one-to-one (forwarding) pattern or in a redistributing pattern that reshuffles the data. Which form is used depends on the types of the operators.

  1. One-to-one (forwarding)
    In this mode, the data stream maintains the order of partitions and elements. For example, in the source and map operators in the figure, after the source operator reads data, it can directly send it to the map operator for processing. There is no need to re-partition or adjust the order of data between them. This means that the subtasks of the map operator see exactly the same number and order of elements as those generated by the subtasks of the source operator, ensuring a "one-to-one" relationship. Operators such as map, filter, and flatMap are all one-to-one correspondences.

This relationship is similar to narrow dependencies in Spark.

  2. Redistributing
    In this pattern, the partitioning of the data stream changes. This happens between the map and the subsequent keyBy/window operator in the figure (here keyBy defines the data transmission, and the window and apply methods that follow it together make up the window operator), and also between the keyBy/window operator and the sink operator. The subtasks of each operator send data to different downstream target tasks according to the data transmission strategy. For example, keyBy() is a grouping operation that essentially repartitions the data according to the hash code of the key; and when the parallelism changes, for example when passing from a window operator with parallelism 2 to a sink operator with parallelism 1, the transmission method is rebalance, which distributes the data evenly among the downstream subtasks. These transmission methods all cause a redistribution process, similar to the shuffle in Spark.

Overall, this relationship between operators is similar to wide dependencies in Spark.

4.3.3.2. Merge operator chain

In Flink, one-to-one operators with the same parallelism can be linked together directly to form one "big" task, so that each original operator becomes part of this merged task, as shown below. Each task is executed by one thread. This technique is called "operator chaining" (Operator Chain).
(figure)
For example, in the figure above, source and map satisfy the requirements for operator chaining, so they can be merged into one task; since the parallelism is 2, the merged task also has 2 parallel subtasks. In this way, the job represented by this dataflow graph ends up with 5 tasks, executed in parallel by 5 threads.

Why does Flink have operator chaining? Because linking operators into tasks is a very effective optimization: it reduces thread switching and buffer-based data exchange, improving throughput while reducing latency.

By default, Flink merges operators into chains according to these rules. If we want to prohibit chaining, or define chains ourselves, we can also make specific settings on operators in the code:

// disable operator chaining for this operator
.map(word -> Tuple2.of(word, 1L)).disableChaining();
// start a new chain from this operator
.map(word -> Tuple2.of(word, 1L)).startNewChain();
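
Besides these per-operator settings, chaining can also be switched off for the whole job on the execution environment; a brief sketch (usually only worth doing for debugging):

// disable operator chaining globally for this job
env.disableOperatorChaining();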

4.3.4. Job graph (JobGraph) and execution graph (ExecutionGraph)

The dataflow graph directly mapped from a Flink program is also called the logical stream graph (logical StreamGraph), because it represents a high-level view of the computation logic. For actual execution we also need to consider the assignment of parallel subtasks, the transmission of data between tasks, and the optimization of merging operator chains.

To describe how the stream processing program is ultimately executed, Flink parses the logical stream graph and converts it into a physical dataflow graph.

In this conversion process, there are several different stages that will generate graphs of different levels, the most important of which are JobGraph and ExecutionGraph. The graph of task scheduling execution in Flink can be divided into four layers according to the order of generation:

Logical stream graph (StreamGraph) → job graph (JobGraph) → execution graph (ExecutionGraph) → physical graph (Physical Graph).

We can recall the StreamWordCount program that processed the socket text stream earlier:

env.socketTextStream().flatMap().keyBy(0).sum(1).print();

If the parallelism is set to 2 when submitting:

bin/flink run -p 2 -c com.song.wc.StreamWordCount flink_demo-1.0-SNAPSHOT.jar

Then, according to the previous analysis, except for socketTextStream(), which is a non-parallel source operator whose parallelism is always 1, all other operators have a parallelism of 2.
Next, let's analyze the evolution process of the program corresponding to the four-layer scheduling diagram, as shown in the figure below.
(figure)

  1. Logical stream graph (StreamGraph)
    This is the initial DAG generated from the user's DataStream API code and represents the topology of the program. This step is usually done on the client side. We can see that the nodes of the logical stream graph correspond exactly to the four operator operations in the code: source operator Source (socketTextStream()) → flat map operator Flat Map (flatMap()) → grouped aggregation operator Keyed Aggregation (keyBy/sum()) → output operator Sink (print()).

  2. Job Graph (JobGraph)
    The StreamGraph is optimized to produce the job graph (JobGraph), the data structure that is submitted to the JobManager; it determines how all tasks in the current job are divided. The main optimization is to link eligible nodes together into chained task nodes (operator chains), which reduces the cost of data exchange. The JobGraph is also generally generated on the client side and passed to the JobMaster when the job is submitted. In the figure above, the grouped aggregation operator (Keyed Aggregation) and the output operator Sink (print) both have parallelism 2 and are connected one-to-one, so they meet the chaining requirements and are merged into a single task node.

  3. Execution Graph (ExecutionGraph)
    After JobMaster receives the JobGraph, it will generate an execution graph (ExecutionGraph) based on it. ExecutionGraph is the parallelized version of JobGraph and is the core data structure of the scheduling layer. As can be seen from the figure above, the biggest difference from JobGraph is that the parallel subtasks are split according to the degree of parallelism, and the way of data transmission between tasks is clarified.

  4. Physical graph (Physical Graph)
    After the JobMaster generates the execution graph, it distributes it to the TaskManagers; each TaskManager deploys tasks according to the execution graph, and the resulting physical execution process also forms a "graph", generally called the physical graph (Physical Graph). This is only a graph at the level of concrete execution, not a specific data structure. Corresponding to the figure above, the physical graph mainly takes the execution graph and further determines where data is stored and exactly how it is sent and received. With the physical graph, the TaskManagers can process and compute on the data passed to them.

So we can see that four operator operations are defined in the program: source (Source) → transformation (flatMap) → grouped aggregation (keyBy/sum) → output (print). After the operator-chain merging optimization, only 3 task nodes remain; after taking the parallelism into account, there are 5 parallel subtasks in total, which ultimately require 5 threads to execute.

4.3.5. Tasks and Task Slots

4.3.5.1. Task slots

Each worker (that is, TaskManager) in Flink is a JVM process, which can start multiple independent threads to execute multiple subtasks (subtasks) in parallel.

So to execute 5 tasks we do not necessarily need 5 TaskManagers; we can let a TaskManager run tasks in multiple threads. If 5 threads can run at the same time, a single TaskManager is enough to meet the needs of our earlier program.

Obviously, the computing resources of a TaskManager are limited, and not all tasks can be executed in parallel on one TaskManager. The more tasks run in parallel, the fewer resources each thread gets.

How many tasks can one TaskManager process in parallel? To control the degree of concurrency, we need a clear division of the resources each task occupies on the TaskManager; this is what task slots are.

The concept of a slot is not unfamiliar in distributed frameworks. The word "slot" is a figurative expression: a TaskManager can be configured with multiple slots, and the tasks "inserted" into them can be executed in parallel.

Each task slot (task slot) actually represents a fixed-size subset of the TaskManager's computing resources. These resources are used to independently execute a subtask.
(figure)
If a TaskManager has three slots, it divides its managed memory into three equal parts, and each slot occupies one of them. In this way, when we execute a subtask in a slot, a piece of memory is in effect set aside for its dedicated use, and it does not have to compete for memory with tasks from other jobs. So now we only need 2 TaskManagers to process the 5 tasks in parallel, as shown in the figure above.

4.3.5.2. Setting the number of task slots

We can set the number of slots for TaskManager through the configuration file of the cluster:

taskmanager.numberOfTaskSlots: 8

By adjusting the number of slots, we can control the isolation level between subtasks.

Specifically, if a TaskManager has only one slot, each task runs in an independent JVM (which may, of course, be started inside a specific container); a TaskManager with multiple slots means that multiple subtasks can share the same JVM. The difference is that in the former case tasks run completely independently, the isolation level is higher, and mutual interference is minimized; in the latter case, tasks running in the same JVM process share TCP connections and heartbeat messages, and may also share data sets and data structures, which reduces the per-task overhead and improves performance at the cost of a lower isolation level.

It should be noted that slots currently only isolate memory and do not isolate CPU. In practice, the number of slots is often set to the number of CPU cores of the machine, so as to avoid CPU competition between different tasks as much as possible. This is also why the default parallelism in a development environment is the number of CPU cores of the machine.

4.3.5.3. Task-to-task slot sharing

Following this logic, we would need as many slots as there are tasks in total to process them in parallel. However, when actually submitting a job for testing, we find that our earlier WordCount program, submitted with parallelism 2 and thus 5 parallel subtasks in total, can be submitted and run successfully even if the cluster has only 2 task slots. Why is that?

We can continue to expand on the previous example. If we keep the parallelism of the sink task as 1, and set the global parallelism as 6 when submitting the job, then the first two task nodes will each have 6 parallel subtasks, and the entire stream processing program will have 13 subtasks. So for a cluster configuration with 2 TaskManagers, each with 3 slots, can it still work normally?

(figure)

The answer is that it works just fine. This is because, by default, Flink allows subtasks to share slots. As shown in the figure above, as long as they belong to the same job, parallel subtasks of different task nodes can execute in the same slot. So for the first task node, source→map, its 6 parallel subtasks must be assigned to different slots (if two of them shared a slot, the data could not be processed in parallel), while the parallel subtasks of the second task node, keyBy/window/apply, can share slots with those of the first.

So the final result is: the parallel subtasks of each task node line up and occupy different slots, while subtasks of different task nodes can share a slot. Within one slot, tasks from every stage of the program can execute, so we say the slot holds a complete running pipeline of the job.

This feature seems a bit strange: don't we want parallel processing and isolation between tasks, why is it allowed to share slots here?

This is because a slot corresponds to an independent set of computing resources. Without sharing, each task would occupy one slot of equal size, yet different tasks actually consume very different amounts of resources.

For example, of the first two tasks here, source/map (although it merges two operators into a chain) only does basic data reading and a simple conversion; its computation takes very little time and generally does not need much memory. The window operator, on the other hand, often involves large amounts of data, state storage, and computation; we generally call such tasks "resource-intensive" tasks. If each is given an independent slot of equal size, then when a lot of data arrives, the source/map and sink tasks finish quickly while the window task takes a long time; the slot occupied by the downstream sink task sits idle waiting, and the upstream source/map task, limited by the downstream processing capacity, quickly finishes its share of the data and then blocks its resources while waiting (in effect, backpressure). The result is a serious imbalance in resource utilization: the busy tasks are worked to death while the idle ones sit around doing nothing.

The way to solve this problem is to allow slot sharing. When resource-intensive and lighter tasks are placed in the same slot, they can divide the resources among themselves in whatever proportion they need, which also ensures that the heaviest tasks are spread evenly across all TaskManagers. Another benefit of slot sharing is that it allows a complete job pipeline to be kept in a single slot.

In this way, even if a TaskManager fails and goes down, other nodes will not be affected at all, and the tasks of the job can continue to be executed.

In addition, parallel subtasks of the same task node cannot share slots, so after allowing slot sharing, the number of slots required to run a job is exactly the maximum parallelism of all operators in the job. In this way, when we consider how many slot resources the current cluster needs to configure, we don't need to calculate in detail how many parallel subtasks a job contains in total. It is enough to only look at the maximum degree of parallelism .

Of course, Flink allows slot sharing by default. If you want the task corresponding to a certain operator to completely occupy a slot, or only a certain part of the operators share the slot, you can also specify it manually by setting the "slot sharing group" (SlotSharingGroup):

.map(word -> Tuple2.of(word, 1L)).slotSharingGroup("1");

In this way, only subtasks belonging to the same slot sharing group will enable slot sharing; tasks between different groups are completely isolated and must be assigned to different slots. In this scenario, the total number of slots required is the sum of the maximum parallelism of each slot sharing group.
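
A brief sketch of how sharing groups might be assigned (the group names "default" and "heavy", and the variable stream, a DataStream<String>, are illustrative; downstream operators stay in the group of their predecessor until a new one is set):

SingleOutputStreamOperator<Tuple2<String, Long>> counts = stream
        .map(word -> Tuple2.of(word, 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .slotSharingGroup("default");     // light preprocessing stays in the default sharing group

counts.keyBy(t -> t.f0)
      .sum(1)
      .slotSharingGroup("heavy")          // resource-intensive aggregation gets its own group of slots
      .print();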

4.3.5.4. The relationship between task slots and parallelism

Intuitively, slots are what a TaskManager provides in order to execute tasks in parallel, so are they the same thing as parallelism?

Slots and parallelism are both related to the parallel execution of a program, but they are two completely different concepts. In short, a task slot is a static concept: it describes the concurrent execution capacity of a TaskManager and is configured via the parameter taskmanager.numberOfTaskSlots. Parallelism is a dynamic concept: it is the concurrency actually used when the program runs and is configured via the parameter parallelism.default. In other words, if the parallelism is less than or equal to the total number of available slots in the cluster, the program can run normally, because the slots do not all have to be occupied; if the parallelism is greater than the total number of available slots, it exceeds the cluster's capacity for parallel execution, and the program has to wait for the resource manager to allocate more resources.
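
Putting the two settings side by side in flink-conf.yaml (the numbers are only illustrative):

taskmanager.numberOfTaskSlots: 3   # static: the concurrent execution capacity each TaskManager offers
parallelism.default: 2             # dynamic: the default parallelism used when a job sets none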

Let us give another specific example below. Suppose there are 3 TaskManagers in total, and the number of slots in each TaskManager is set to 3, then there are 9 task slots in total, as shown in the figure below, which means that the cluster can execute up to 9 tasks in parallel.
(figure)

And we define the processing operations of the WordCount program as four conversion operators:

source→ flatMap→ reduce→ sink

When all operators have the same parallelism, it is easy to see that source and flatMap can be chained, so there are finally 3 task nodes. If we make no parallelism settings at all, and the configuration file keeps the default parallelism.default: 1, then the default parallelism of the program is 1 and there are 3 tasks in total. Since tasks of different operators can share task slots, only one slot is used in the end: of the 9 slots, only 1 is occupied and 8 are free, as shown in the figure below.
(figure)
If we change the default parameter, or set the parallelism to 2 when submitting the job, then there are 6 tasks in total, and with slot sharing they occupy 2 slots, as shown in the figure below. Now 7 slots are idle and the computing resources are not fully utilized. So we can see that setting an appropriate parallelism can improve efficiency.
(figure)
We can instead set the parallelism directly to 9, so that all 27 tasks fully occupy the 9 slots. This is the maximum parallelism that can be executed with the current cluster resources, and the computing resources are fully utilized, as shown in the figure below.

(figure)

In addition, consider the scenario where the parallelism is set individually for one operator. For example, if the output may be written to a file and we do not want multiple files written in parallel, we set the parallelism of the sink operator to 1. The parallelism of the other operators remains 9, so there are 19 subtasks in total. By the slot sharing principle they end up occupying all 9 slots, while the sink task runs in only one of them, as shown in the figure below.
(figure)

From this example it can be clearly seen that the parallelism of the entire stream processing program is the maximum parallelism among all its operators, which also represents the number of slots required to run the program.
