Xiaomi's Flink-based real-time computing resource governance practice

Abstract: This article is compiled from a talk given by Zhang Jiao, a senior software engineer at Xiaomi, in the production practice session of Flink Forward Asia 2022. The content is divided into four parts:

  1. Development Status and Scale
  2. Framework Layer Governance Practice
  3. Platform Layer Governance Practice
  4. Future Planning and Outlook


1. Development Status and Scale


As shown in the figure above, the bottom layer consists of basic services, including unified metadata service, unified permission management, unified task scheduling, and unified data integration.

On top of these are various distributed engines covering data sources, data collection, message middleware, data computation, and data query. Flink sits mainly in the data computation module. It is currently the de facto standard for real-time computing and is steadily expanding into offline computing scenarios, evolving toward a faster, more stable, and easier-to-use batch processing engine.


Currently, the Xiaomi Flink platform runs 5,000+ user jobs and about 12,000 data integration jobs. Together they consume roughly 130,000 CPU cores and 460 TB of memory, so the resource footprint is substantial.


Users run into a variety of problems when developing real-time jobs with Flink. We group these problems into two categories: experience-tax problems and non-experience-tax problems.

  • The experience tax refers to resource waste caused by users' lack of experience when developing Flink jobs: the inability to accurately estimate the resources a job really needs, the operational pressure caused by backlogs resulting from improper resource settings, and over-provisioning by users who reserve large resource margins for stability and to reduce operational effort.
  • The non-experience tax refers to problems that even experienced Flink developers run into. For example, because our internal Flink framework did not support fine-grained resource management, resources were wasted; large resources had to be reserved long-term to cope with short-lived traffic peaks; and resources could not be adjusted dynamically in scenarios with fluctuating traffic.


Having introduced the problems users face when developing real-time Flink jobs, let's look at the resource waste they cause. Although clusters with different resource configurations show different heap memory and CPU utilization, utilization is low across the board: the average resource utilization of the clusters hosting user jobs is only about 35%, and the lowest is only about 20%, which amounts to a huge waste of resources.


As shown in the figure above, over the past six months both user Flink jobs and data integration jobs have nearly doubled. If this trend continues, there will be a huge gap in cluster resources and even greater waste, so governing cluster resources has become urgent.


The stability of real-time Flink jobs is critical, so our basic principle for resource governance is to reduce cost without reducing quality: save resources on the premise that stability is not significantly affected. Based on this principle, we adopted an approach that is data-driven and value-quantified, goes deep into the business, keeps promoting governance to business teams, and collects their feedback, forming a closed loop.

2. Framework Layer Governance Practice


The main logic of elastic scheduling is concentrated in the JobMaster. We developed a brand-new module, DynamicSchedulerManager, as the controller for elastic scheduling. It is responsible for pulling and aggregating the elasticity-related metrics collected from the TaskManagers. These metrics, together with rules pulled from HDFS, are processed and triggered by Drools, and adjustments are made according to the trigger results in two categories: vertical scaling and horizontal scaling. Vertical scaling adjusts the resources of a single Container; at present these adjustments cannot be persisted. Horizontal scaling adjusts the parallelism, and its adjustments can be persisted. Drools is an open-source rule engine, and the rules can be tuned and updated dynamically as needed.
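
The sketch below illustrates what one round of such a controller might look like: aggregated metrics and HDFS-hosted DRL rules are fed into a Drools session, which emits scaling decisions. The class and field names (ElasticityMetrics, ScalingDecision, and so on) are illustrative assumptions rather than Xiaomi's actual implementation; only the Drools API calls themselves are real.

```java
import org.kie.api.KieBase;
import org.kie.api.io.ResourceType;
import org.kie.api.runtime.KieSession;
import org.kie.internal.utils.KieHelper;

import java.util.List;

// Hypothetical controller round; class and field names are illustrative only.
public class DynamicSchedulerManagerSketch {

    /** Aggregated, elasticity-related metrics pulled from all TaskManagers. */
    public static class ElasticityMetrics {
        public double avgCpuLoad;
        public double heapUsageRatio;
        public long fullGcCountLastMinute;
        public long sourceBacklog;
    }

    /** A decision the rules emit: resize a container or rescale parallelism. */
    public static class ScalingDecision {
        public enum Kind { VERTICAL, HORIZONTAL }
        public Kind kind;
        public String containerId;    // used for vertical scaling
        public int targetMemoryMb;    // used for vertical scaling
        public int targetParallelism; // used for horizontal scaling
    }

    private final KieBase ruleBase;

    public DynamicSchedulerManagerSketch(String drlPulledFromHdfs) {
        // Rules arrive as DRL text (e.g. read from HDFS) and are compiled once;
        // they can be re-pulled and rebuilt whenever the rule file changes.
        this.ruleBase = new KieHelper()
                .addContent(drlPulledFromHdfs, ResourceType.DRL)
                .build();
    }

    /** One evaluation round: metrics in, scaling decisions out. */
    public List<ScalingDecision> evaluate(ElasticityMetrics metrics,
                                          List<ScalingDecision> decisions) {
        KieSession session = ruleBase.newKieSession();
        try {
            session.setGlobal("decisions", decisions); // rules append decisions here
            session.insert(metrics);
            session.fireAllRules();
        } finally {
            session.dispose();
        }
        return decisions;
    }
}
```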


There are two main sources of elasticity-related key metrics (a few concrete metric names are sketched below):

  • Built-in TaskManager and Task metrics, such as CPU load and task idle time, used for CPU adjustment.

  • Metrics such as on-heap/off-heap memory utilization, GC count and frequency, traffic reported by third-party connectors, and backlog, used for memory and parallelism tuning.
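
For reference, the snippet below lists a few built-in Flink metric names that could serve as these signals; the mapping to adjustment targets is an assumption made for illustration, since the talk does not name the exact metric identifiers used internally.

```java
import java.util.Map;

// Candidate built-in Flink metrics (names as exposed by Flink's metric system)
// mapped to the kind of adjustment they could inform. Illustrative only.
public class ElasticitySignals {
    public static final Map<String, String> SIGNALS = Map.of(
            "Status.JVM.CPU.Load",                    "CPU adjustment",
            "idleTimeMsPerSecond",                    "CPU / parallelism adjustment (task idleness)",
            "Status.JVM.Memory.Heap.Used",            "heap memory adjustment",
            "Status.JVM.Memory.NonHeap.Used",         "off-heap memory adjustment",
            "Status.JVM.GarbageCollector.<GC>.Count", "GC-based memory adjustment",
            "Status.JVM.GarbageCollector.<GC>.Time",  "GC-based memory adjustment",
            "pendingRecords",                         "backlog-based parallelism adjustment");
}
```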


Next, a concrete example illustrates one of the memory adjustment rules we implemented. The left side of the figure above is the TaskManager memory model introduced in Flink 1.10; readers with some knowledge of Flink should be familiar with it. Following the usual rule for sizing a Java heap, assume the old generation occupies M after a FullGC; the recommended total heap size is then 3 to 4 times M. Taking 3M as the recommended heap size and combining it with the Flink memory model, we can derive a recommended TaskManager memory size; the relevant formula is shown in the figure above. This result is only an approximation, and the real TaskManager memory estimation process is considerably more complex.
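
Since the exact formula lives on the slide, here is a rough, self-contained sketch of the kind of calculation it describes, using Flink's default memory fractions and ignoring the min/max clamps Flink applies to network and JVM overhead memory. It is an approximation under those assumptions, not the production estimation rule.

```java
// Back-of-envelope estimate of a recommended TaskManager size from the
// post-FullGC old-generation footprint M, using Flink's default fractions.
public class TmMemoryEstimator {

    public static long recommendTotalProcessMemoryMb(long oldGenAfterFullGcMb) {
        long recommendedHeapMb = 3 * oldGenAfterFullGcMb; // heap ≈ 3~4 × M
        double managedFraction = 0.4;      // taskmanager.memory.managed.fraction
        double networkFraction = 0.1;      // taskmanager.memory.network.fraction
        long frameworkOffHeapMb = 128;     // taskmanager.memory.framework.off-heap.size
        long metaspaceMb = 256;            // taskmanager.memory.jvm-metaspace.size
        double jvmOverheadFraction = 0.1;  // taskmanager.memory.jvm-overhead.fraction

        // Heap (framework + task) is roughly what remains of total Flink memory
        // after managed, network and off-heap memory are carved out.
        double totalFlinkMb =
                (recommendedHeapMb + frameworkOffHeapMb) / (1 - managedFraction - networkFraction);

        // Total process memory adds metaspace and JVM overhead on top.
        double totalProcessMb = (totalFlinkMb + metaspaceMb) / (1 - jvmOverheadFraction);
        return Math.round(totalProcessMb);
    }

    public static void main(String[] args) {
        // e.g. the old generation holds ~1 GiB after a FullGC
        System.out.println(recommendTotalProcessMemoryMb(1024) + " MiB recommended");
    }
}
```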


Next, let's walk through the complete in-place restart process for expanding container resources.

First, the AppMaster (essentially the DynamicSchedulerManager inside the JobMaster) sends a resource-increase request to the ResourceManager, specifying the ContainerID and the target resource value. The Scheduler allocates the resources during its scheduling cycle, returns a token for the new resources, and registers a listener.

The AppMaster then uses the new token to send a container resource-increase request to the NodeManager. The ContainerManager notifies the ContainersMonitor to synchronously update resource monitoring and enforcement.

At the same time, it updates the container resource accounting and metrics. The NodeStatusUpdater reports the resource update to the ResourceManager via heartbeat, the Scheduler cancels the previously registered listener, and the entire expansion process is complete.
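
For orientation, the sketch below shows roughly how an ApplicationMaster drives such an in-place container resize with YARN's container update API (Hadoop 2.9+/3.x). It is a generic illustration of the protocol described above, not Xiaomi's modified scheduler code, and the surrounding allocate/heartbeat handling is omitted.

```java
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerUpdateType;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.UpdateContainerRequest;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;
import org.apache.hadoop.yarn.client.api.async.NMClientAsync;

// Generic sketch of an in-place container resize from the AM side.
public class ContainerResizeSketch {

    /** Step 1: ask the RM to grow the container to the target resource. */
    static void requestIncrease(AMRMClientAsync<?> amRmClient,
                                Container container, Resource target) {
        UpdateContainerRequest request = UpdateContainerRequest.newInstance(
                container.getVersion(),
                container.getId(),
                ContainerUpdateType.INCREASE_RESOURCE,
                target,
                null);
        amRmClient.requestContainerUpdate(container, request);
    }

    /** Step 2: once the RM returns the updated container (carrying a new token)
     *  in the allocate response, tell the NM to apply the new size. */
    static void applyOnNodeManager(NMClientAsync nmClient, Container updatedContainer) {
        nmClient.updateContainerResourceAsync(updatedContainer);
    }
}
```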


The process of shrinking resources differs slightly from expansion, mainly because no new token is needed to access the reduced resources. After the RM shrinks the resources, it directly notifies the AM and NM of the reduced container information through heartbeats. Once the NM receives the reduced container information, it notifies the ContainerManager and updates its own metrics. The ContainerManager notifies the ContainersMonitor to update resource monitoring and enforcement, and then updates its internal resource accounting and metrics.


Next, let's look at our practice around parallelism adjustment, which relies on the AdaptiveScheduler introduced in Flink 1.13. After Drools evaluates the rules against the metrics and determines how much the parallelism should be scaled, a validation step first checks whether scaling is possible. If the check fails, the adjustment is withdrawn; otherwise the Executing state is notified to proceed. Executing declares the new resource requirements and triggers a restart, and if scaling succeeds, the new parallelism is persisted. When parallelism is reduced, the freed resources also need to be released; this is implemented with a timed task that periodically checks for and releases idle slots.
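
As a baseline, the adaptive scheduler itself is enabled through standard Flink configuration; the snippet below uses Flink's public options (normally set in flink-conf.yaml, shown here via the Configuration API for illustration), while the elastic controller built on top of it is Xiaomi-internal.

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Enable Flink's adaptive scheduler (available since Flink 1.13), which lets
// a job run with a parallelism that can change across restarts.
public class AdaptiveSchedulerConfig {
    public static StreamExecutionEnvironment createEnv() {
        Configuration conf = new Configuration();
        conf.setString("jobmanager.scheduler", "adaptive");
        return StreamExecutionEnvironment.getExecutionEnvironment(conf);
    }
}
```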


As mentioned above, a validation step is required when adjusting parallelism, and it checks the scaling conditions: for example, whether the parallelism has already reached the maximum parallelism and cannot be increased, whether the scaling ratio is appropriate, whether the ratio needs to be increased to consume a backlog quickly, and whether scaling would cause data skew. If scaling makes the data distribution uneven, it is likely to affect job stability. In addition, the DAG before and after rescaling must be compared to make sure the DAG does not change, which could otherwise corrupt the computation logic or prevent stateful jobs from recovering properly.

Whether resources are sufficient is also very important when increasing parallelism, so it must be checked in advance; otherwise the resource request may time out during scale-up, which can seriously affect the job.
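
A simplified, hypothetical version of such a pre-scaling validation might look like the following; the field names and thresholds are illustrative, and real checks such as DAG comparison and skew detection are considerably more involved.

```java
// Hypothetical pre-scaling validation; field names and thresholds are illustrative.
public class RescaleValidator {

    public static class RescalePlan {
        public int currentParallelism;
        public int targetParallelism;
        public int maxParallelism;       // cannot be exceeded (key-group limit)
        public int availableSlots;       // free slots in the cluster/queue
        public boolean dagUnchanged;     // job graph identical before and after rescale
        public boolean causesDataSkew;   // e.g. target does not divide key groups evenly
    }

    /** Returns true if the adjustment may proceed; otherwise it is withdrawn. */
    public static boolean validate(RescalePlan plan) {
        if (plan.targetParallelism > plan.maxParallelism) {
            return false; // parallelism can never exceed max parallelism
        }
        double ratio = (double) plan.targetParallelism / plan.currentParallelism;
        if (ratio > 4.0 || ratio < 0.25) {
            return false; // scaling ratio too aggressive for a single step
        }
        if (!plan.dagUnchanged || plan.causesDataSkew) {
            return false; // DAG change or skew would hurt correctness/stability
        }
        int extraSlots = plan.targetParallelism - plan.currentParallelism;
        return extraSlots <= plan.availableSlots; // resources must be sufficient up front
    }
}
```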


We have designed and implemented a variety of scheduling strategies for different elasticity scenarios. By scheduling cycle, there is fixed-time scheduling for workloads with regular traffic fluctuations, periodic scheduling for general scenarios, and active scheduling that is triggered automatically based on thresholds. By trigger subject, there is automatic triggering by the framework without manual intervention, and a manual strategy triggered by the user.
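
The skeleton below shows one way the three automatic trigger types could be wired together with a standard scheduled executor; the structure mirrors the description above, but the code itself is an illustrative assumption.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative wiring of the three automatic trigger types described above.
public class ElasticTriggerSketch {
    private final ScheduledExecutorService timer = Executors.newScheduledThreadPool(2);

    /** Periodic scheduling: evaluate the rules at a fixed interval (general case). */
    public void startPeriodic(Runnable evaluateRules, Duration period) {
        timer.scheduleAtFixedRate(evaluateRules, period.toMillis(), period.toMillis(),
                TimeUnit.MILLISECONDS);
    }

    /** Fixed-time scheduling: e.g. adjust resources daily at an off-peak hour. */
    public void startFixedTime(Runnable adjustResources, LocalTime at) {
        long initialDelayMs = Duration.between(LocalTime.now(), at).toMillis();
        if (initialDelayMs < 0) {
            initialDelayMs += Duration.ofDays(1).toMillis();
        }
        timer.scheduleAtFixedRate(adjustResources, initialDelayMs,
                Duration.ofDays(1).toMillis(), TimeUnit.MILLISECONDS);
    }

    /** Active scheduling: a metric watcher calls this when a threshold is crossed. */
    public void onThresholdBreached(Runnable scaleImmediately) {
        scaleImmediately.run();
    }
}
```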


We also ran into various problems along the way. For example, periodic triggering cannot cope with sudden traffic surges, and this showed up very frequently right after a periodic trigger had scaled a job's resources down. We eventually added a FullGC monitor that compares against a threshold and triggers active scaling to handle such cases.
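
A minimal FullGC monitor can be built on the JVM's standard GarbageCollectorMXBean API; the sketch below is a generic illustration of that idea (the threshold and the old-generation collector matching are assumptions), not the internal implementation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Generic FullGC watcher: counts old-generation collections and fires a callback
// when the count within one observation window exceeds a threshold.
public class FullGcMonitor {
    private long lastOldGenCount = 0;

    /** Returns true and triggers scaling when FullGC frequency crosses the threshold. */
    public boolean checkAndMaybeScale(long maxFullGcPerWindow, Runnable triggerActiveScaling) {
        long oldGenCount = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            String name = gc.getName();
            // Old-generation collectors for common GCs; matching by name is an assumption.
            if (name.contains("Old") || name.contains("MarkSweep")) {
                oldGenCount += gc.getCollectionCount();
            }
        }
        long delta = oldGenCount - lastOldGenCount;
        lastOldGenCount = oldGenCount;
        if (delta > maxFullGcPerWindow) {
            triggerActiveScaling.run(); // e.g. request more memory or parallelism
            return true;
        }
        return false;
    }
}
```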

The variety of trigger scenarios also led to frequent in-framework job restarts, and sometimes even restarts at the application level, which hurt job stability. The resulting flood of alerts also increased our operational burden.

By lengthening the trigger's stabilization interval and adding a scaling orchestration function that merges multiple adjustments to the same Container, we kept job stability well under control.

In addition, every resource adjustment causes a restart within the framework. Anyone familiar with Flink knows that such restarts usually cannot guarantee end-to-end exactly-once semantics, which may lead to duplicated data. For such cases we introduced incremental savepoints, so that jobs that require end-to-end consistency can be restored from an incremental savepoint.


When tuning parallelism, Flink can never exceed the configured maximum parallelism. To surface such cases more easily, we compute the maximum parallelism of every operator of a job and display it to the user on the platform page. We also set the default lower bound of the maximum parallelism to 512 to avoid jobs that cannot be scaled up.
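
For context, when the maximum parallelism is not set explicitly, Flink derives an operator's default max parallelism from its parallelism roughly as below (this mirrors the logic of Flink's KeyGroupRangeAssignment.computeDefaultMaxParallelism); raising the lower bound to 512, as described above, would be a platform-side override, for example via StreamExecutionEnvironment#setMaxParallelism.

```java
// Mirrors Flink's default rule for deriving an operator's max parallelism:
// round 1.5x the operator parallelism up to a power of two, clamped to [128, 32768].
public class MaxParallelism {
    private static final int LOWER_BOUND = 128;
    private static final int UPPER_BOUND = 1 << 15; // 32768

    public static int computeDefault(int operatorParallelism) {
        int candidate = roundUpToPowerOfTwo(operatorParallelism + operatorParallelism / 2);
        return Math.min(Math.max(candidate, LOWER_BOUND), UPPER_BOUND);
    }

    private static int roundUpToPowerOfTwo(int x) {
        return x <= 1 ? 1 : Integer.highestOneBit(x - 1) << 1;
    }
}
```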


After this series of elastic-scaling improvements, we achieved some initial results in resource governance. The figure above shows the traffic of a Kafka (Talos) topic changing periodically and dynamically over time.


As shown in the figure above, after vertical elasticity is enabled, the TaskManager memory is adjusted as the traffic shown earlier changes. The total TaskManager memory configured by default for this job is about 131,000, and the observed scaling behavior is in line with expectations.


For jobs with relatively stable traffic, after vertical elasticity is enabled, the memory drops directly to a reasonable value and then continues to run stably without further changes.


Overall, elasticity has saved 34% of the memory configured by user jobs, that is, about one-third of the total configured memory, and improved cluster heap memory utilization by roughly 10%.

Our tuning strategy is still conservative and has not yet been rolled out to all jobs. As the rollout continues, these numbers will improve further and the optimization benefits will become more visible.

3. Platform Layer Governance Practice


Intelligent memory suggestions recommend appropriate JobManager and TaskManager memory sizes for a job based on its historical heap memory usage and job profile.

Currently, about a quarter of all online user jobs have received memory recommendations. In total, more than 21 TB of memory has been saved, an estimated monthly cost saving of more than 420,000 RMB.
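
One simple way to derive such a recommendation from history is to take a high percentile of observed heap usage and add headroom; the sketch below is an illustrative baseline under those assumptions, not the platform's actual recommendation model.

```java
import java.util.Arrays;

// Illustrative baseline: recommend heap as a high percentile of historical usage
// plus fixed headroom, clamped to a configured floor.
public class MemoryRecommender {

    public static long recommendHeapMb(long[] historicalHeapUsedMb,
                                       double percentile,    // e.g. 0.99
                                       double headroomRatio, // e.g. 0.3
                                       long minHeapMb) {
        if (historicalHeapUsedMb.length == 0) {
            return minHeapMb; // no history yet: fall back to the floor
        }
        long[] sorted = historicalHeapUsedMb.clone();
        Arrays.sort(sorted);
        int idx = (int) Math.ceil(percentile * sorted.length) - 1;
        long p = sorted[Math.max(0, Math.min(idx, sorted.length - 1))];
        return Math.max(minHeapMb, Math.round(p * (1 + headroomRatio)));
    }
}
```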


Usually, when developing a Flink job, a user needs to apply for a real-time queue and configure the required resources. These queues provide resources per user or user group. Since every user or user group has its own independent queue, user jobs run fairly stably and the platform's operation and maintenance cost is low.

Under the new architecture, all real-time jobs use a single unified queue. The unified real-time queue shares resources across all users and user groups, improving resource utilization, and its spare resources can serve as a buffer for elastic scheduling. It also lowers the learning cost for new users developing real-time jobs.


With the unified real-time queue, all jobs are submitted to the same public queue. However, since parallelism configured in code takes the highest priority by default, the parallelism set by the user on the platform would not take effect, leaving user jobs' resource usage uncontrolled.

To handle this, on the premise of keeping operator chaining unchanged, we force a job's parallelism not to exceed the parallelism configured on the platform. This lets us check in advance whether cluster resources are sufficient and prevents resource abuse.
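
Conceptually, the enforcement can be thought of as clamping each operator's configured parallelism to the platform limit before submission; the sketch below is a simplified illustration with hypothetical names and does not show how operator chaining is preserved.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified illustration of capping per-operator parallelism at the
// platform-configured limit before submission; real enforcement must also
// keep operator chains intact, which is not shown here.
public class ParallelismCap {

    /** Returns a copy of the plan where no operator exceeds the platform limit. */
    public static Map<String, Integer> cap(Map<String, Integer> operatorParallelism,
                                           int platformLimit) {
        Map<String, Integer> capped = new LinkedHashMap<>();
        operatorParallelism.forEach((operator, requested) ->
                capped.put(operator,
                        requested == null || requested <= 0
                                ? platformLimit                    // unset: use platform value
                                : Math.min(requested, platformLimit)));
        return capped;
    }
}
```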


In addition, for jobs with elastic scheduling enabled, we provide elastic audit logs that record when and why elasticity was triggered, so that both we and the users can verify the cause and effect of each trigger and refine the elastic rules accordingly.

Generally speaking, our governance of real-time computing resources can be divided into five stages:

  • The first stage: extensive configuration. Users could configure resources freely, with little or no control.
  • The second stage: memory suggestions. Provide governance recommendations for long-running jobs, which requires us to keep pushing and helping users make the adjustments.
  • The third stage: queue unification. Hide individual queues from users to improve queue resource utilization.
  • The fourth stage: parallelism limits. Use technical means to prevent resource misuse.
  • The fifth stage: elastic scaling. Use technical means to adapt resources automatically, achieving automated governance.

4. Future Planning and Outlook


Looking ahead, our plans include:

  • Container-level elasticity, i.e., persistence of vertical scaling, so that adjustments do not disappear after a job restart and have to be made again.
  • Smarter and more stable elastic rules, plus support for user-defined rules. The current rules are a general-purpose solution that works in many scenarios, but the effect can vary considerably from scenario to scenario.
  • Support for richer scenarios to make the effect more pronounced. The main goal is to raise cluster CPU and memory utilization to around 70%~80%.
  • Onboarding more business teams. The ultimate goal is to have elasticity fully enabled by default and effective for all jobs.

