1. Background
Flink jobs are deployed on a Kubernetes-based standalone cluster: the Flink cluster is first deployed in containers, and the Flink job is then submitted. Job submission happens concurrently with the creation and registration of the TaskManagers.
2. Problem
If the cluster has 35 TaskManagers with 140 slots in total, and the parallelism of a vertex is less than 140, the tasks of that vertex are unevenly distributed across the TaskManagers, causing unbalanced node load.
For example:
- The Flink topology has 5 vertices: two with a parallelism of 140, and the other three with parallelisms of 10, 30, and 35, matching their Kafka partition counts. The maximum parallelism of the job is 140, and the resource configuration is 35 TaskManager nodes of [4 cores, 8 GB] each.
- The web UI shows that even with cluster.evenly-spread-out-slots: true configured, the tasks of the other three vertices are still scheduled onto the same TaskManagers.
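For reference, the option mentioned above is set in the Flink configuration file (a flink-conf.yaml fragment):

```yaml
# flink-conf.yaml
# Ask the default scheduler to spread slots across all available TaskManagers.
# As described below, this is best-effort and does not guarantee task balance.
cluster.evenly-spread-out-slots: true
```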
3. Optimization method
1. Problem analysis
- The above problem can be simplified as follows:
Suppose the topology is Vertex A (p=2) -> Vertex B (p=4) -> Vertex C (p=2).
Based on slot sharing and the preference for local data exchange, the subtasks are divided into four ExecutionSlotSharingGroups: {A1, B1, C1}, {A2, B2, C2}, {B3}, {B4}.
If the resource configuration gives each TaskManager 2 slots, the following allocation may occur:
| | Slot1 | Slot2 |
| --- | --- | --- |
| TaskManager1 | {A1,B1,C1} | {A2,B2,C2} |
| TaskManager2 | {B3} | {B4} |
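The grouping above can be sketched as follows. This is an illustrative reconstruction, not Flink's actual implementation: assuming a single slot sharing group, the i-th ExecutionSlotSharingGroup holds the i-th subtask of every vertex whose parallelism exceeds i.

```java
import java.util.*;

// Illustrative sketch (not Flink internals): with one shared slot sharing
// group, the number of ExecutionSlotSharingGroups equals the maximum vertex
// parallelism, and group i holds subtask i of each sufficiently parallel vertex.
public class SlotSharingSketch {

    public static List<List<String>> buildGroups(Map<String, Integer> parallelism) {
        int maxP = Collections.max(parallelism.values());
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < maxP; i++) {
            List<String> group = new ArrayList<>();
            for (Map.Entry<String, Integer> v : parallelism.entrySet()) {
                if (i < v.getValue()) {
                    group.add(v.getKey() + (i + 1)); // e.g. "A1", "B3"
                }
            }
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) {
        // The example above: A (p=2) -> B (p=4) -> C (p=2).
        Map<String, Integer> p = new LinkedHashMap<>();
        p.put("A", 2); p.put("B", 4); p.put("C", 2);
        System.out.println(buildGroups(p));
        // → [[A1, B1, C1], [A2, B2, C2], [B3], [B4]]
    }
}
```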
Slots currently divide memory evenly and place no limit on CPU, so the allocation above leads to unbalanced node load: if the A and C tasks consume more computing resources, TaskManager1 becomes the computational bottleneck. Ideally, we would like the allocation to be:
| | Slot1 | Slot2 |
| --- | --- | --- |
| TaskManager1 | {A1,B1,C1} | {B3} |
| TaskManager2 | {A2,B2,C2} | {B4} |
2. Optimization
Modified policy:
- When requesting slots for ExecutionSlotSharingGroups, first sort the groups by the number of tasks they contain, and request slots for the groups with more tasks first.
- Delay task scheduling: wait until enough TaskManagers have registered to spread the ExecutionSlotSharingGroups evenly before requesting slots for them.
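A minimal sketch of the modified placement policy described above, with hypothetical class and method names (this is not Flink's internal API): sort the groups by task count, then always place the next group on the TaskManager that currently carries the fewest tasks and still has a free slot.

```java
import java.util.*;

// Hypothetical sketch of the modified allocation policy: sort
// ExecutionSlotSharingGroups by task count (descending), then place each group
// on the registered TaskManager that carries the fewest tasks so far and still
// has a free slot. Names are illustrative, not Flink internals.
public class BalancedSlotAssigner {

    /** Maps each group name to the index of the TaskManager it was placed on. */
    public static Map<String, Integer> assign(Map<String, Integer> groupTaskCounts,
                                              int numTaskManagers, int slotsPerTm) {
        // Sort groups by task count, largest first (stable for equal sizes).
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(groupTaskCounts.entrySet());
        sorted.sort((a, b) -> b.getValue() - a.getValue());

        int[] tasksOnTm = new int[numTaskManagers];
        int[] slotsUsed = new int[numTaskManagers];
        Map<String, Integer> placement = new LinkedHashMap<>();

        for (Map.Entry<String, Integer> group : sorted) {
            // Pick the least-loaded TaskManager that still has a free slot.
            int best = -1;
            for (int tm = 0; tm < numTaskManagers; tm++) {
                if (slotsUsed[tm] < slotsPerTm
                        && (best == -1 || tasksOnTm[tm] < tasksOnTm[best])) {
                    best = tm;
                }
            }
            slotsUsed[best]++;
            tasksOnTm[best] += group.getValue();
            placement.put(group.getKey(), best);
        }
        return placement;
    }

    public static void main(String[] args) {
        // The groups from the analysis: two 3-task groups and two 1-task groups.
        Map<String, Integer> groups = new LinkedHashMap<>();
        groups.put("{A1,B1,C1}", 3);
        groups.put("{A2,B2,C2}", 3);
        groups.put("{B3}", 1);
        groups.put("{B4}", 1);
        System.out.println(assign(groups, 2, 2));
        // The two 3-task groups land on different TaskManagers.
    }
}
```

With the example groups from the analysis above, this yields the ideal allocation: {A1,B1,C1} and {A2,B2,C2} end up on different TaskManagers, each paired with one single-task group.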
Effect:
- Task scheduling is balanced: the subtasks of the same vertex are spread evenly across different TaskManager nodes.
4. Performance comparison
1. CPU load comparison
- Before optimization: CPU load differs widely between nodes, and some nodes stay at 100% load for long periods.
- After optimization: CPU load is far more uniform between nodes, and no node stays at 100% load for long.
1.2 A further CPU usage comparison
The topology diagram shows two tasks with different parallelisms, 200 and 480. By balancing the slot sharing groups of tasks, CPU load is balanced across the TaskManager nodes, which lets us subsequently shrink the TaskManagers' resource quota.
2. Data Backlog
After optimization, the data backlog is roughly halved: with the same resources, the job has higher processing capacity and lower data latency.
- Before optimization:
- After optimization:
5. Further thoughts
1. Task balance
For the topology Vertex A (p=3) -> Vertex B (p=4) -> Vertex C (p=1), the groups will be distributed as follows:
| | Slot1 | Slot2 |
| --- | --- | --- |
| TaskManager1 | {A1,B1,C1} | {A3,B3} |
| TaskManager2 | {A2,B2} | {B4} |
Vertex B -> Vertex C has four data channels: (B1->C1), (B2->C1), (B3->C1), (B4->C1). For such non-forward connections, no matter which group C1 is assigned to, at least three of these channels must communicate across groups, and potentially across nodes.
So if we instead balance tasks at grouping time, e.g. {A1, B1}, {A3, B3}, {A2, B2}, {B4, C1}, the load will be balanced no matter how the groups are scheduled afterwards. However, when the task count is not divisible by the slot count (task_num % slot_num != 0), tasks can still cluster on a single TaskManager.
2. Improvements to delayed scheduling
When Flink generates the execution plan, derive the delay strategy from the topology itself, so that users need not configure or even be aware of it.
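One possible shape of such a topology-derived delay condition, as a hypothetical sketch (the names are illustrative, not Flink API): the scheduler computes from the plan the minimum number of TaskManagers needed to spread the groups, and holds back scheduling until that many have registered.

```java
// Hypothetical sketch: derive the scheduling delay condition from the
// execution plan itself — wait until enough TaskManagers have registered to
// spread the ExecutionSlotSharingGroups evenly, instead of relying on a
// user-configured fixed delay. Names are illustrative.
public class DelayCondition {

    /** Minimum TaskManagers needed so that every group can get its own slot. */
    public static int requiredTaskManagers(int numGroups, int slotsPerTm) {
        return (numGroups + slotsPerTm - 1) / slotsPerTm; // ceiling division
    }

    /** Scheduling may start once enough TaskManagers have registered. */
    public static boolean readyToSchedule(int registeredTms, int numGroups, int slotsPerTm) {
        return registeredTms >= requiredTaskManagers(numGroups, slotsPerTm);
    }

    public static void main(String[] args) {
        // 4 groups and 2 slots per TM: need at least 2 TaskManagers registered.
        System.out.println(requiredTaskManagers(4, 2)); // → 2
        System.out.println(readyToSchedule(1, 4, 2));   // → false
    }
}
```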