Youzan's exploration and practice of Flink real-time task resource optimization

With Flink now running on Kubernetes and the real-time cluster migration complete, more and more Flink real-time tasks are running on K8s clusters. Moving Flink onto K8s improves the real-time cluster's ability to scale elastically during major sales promotions and reduces the cost of expanding and shrinking machines during those periods. At the same time, because a dedicated team maintains K8s inside the company, Flink on K8s also lowers the company's operation and maintenance costs.

However, the resources of Flink K8s tasks are currently configured by users on the real-time platform, and users themselves have little experience in judging how many resources a real-time task needs. As a result, users often allocate more resources than the task can actually use. For example, a Flink task that could meet the business's processing needs with a parallelism of 4 might be configured with a parallelism of 16. Such over-allocation wastes real-time computing resources and affects the resource utilization of the real-time cluster and the cost of the underlying machines. Against this background, this article explores and practices Flink task resource optimization from two aspects: task memory and message processing capability.

1. Flink computing resource types and optimization ideas

1.1 Flink computing resource types

I think the resources needed to run a Flink task can be divided into five categories:

  1. Memory resources
  2. Local disk (or cloud disk) storage
  3. External storage resources the task depends on, such as HDFS and S3 (task state/data) or HBase, MySQL, and Redis (data)
  4. CPU resources
  5. Network card resources

At present, memory and CPU are the resources Flink tasks use most, while local disks, dependent external storage, and network cards are generally not bottlenecks. Therefore, this article discusses optimizing real-time task resources from two aspects: the memory and the CPU resources of Flink tasks.

1.2 Flink real-time task resource optimization ideas

For analyzing the resources of Flink real-time tasks, we believe there are two main angles:

  • One is to analyze real-time tasks from the perspective of task memory, in particular heap memory.
  • The other starts from the real-time task's message processing capability: while meeting the business side's data processing requirements, use CPU resources as reasonably as possible.

We then combine the indicators obtained from the memory analysis with the reasonableness of the task's parallelism to derive a recommended resource setting for the real-time task. After fully communicating with the business side, we adjust the task's resources, ultimately making real-time task resource allocation reasonable and reducing machine costs.

1.2.1 Task memory perspective

So how do we analyze the heap memory of a Flink task? Here we analyze it using the task's GC logs. A GC log records, for each collection, how memory in each area of the heap changed and was used. It also tells us how much old-generation space remains in a TaskManager after each Full GC. Obtaining the GC logs of a real-time task is therefore the prerequisite for our real-time task memory analysis.

For analyzing GC log content, we use the open-source GCViewer tool. Each analysis yields a set of GC-related indicators. The following is a partial result of analyzing a GC log with GCViewer:

Through the GC log, the analysis above obtains the total heap size of a single Flink TaskManager, the memory allocated to the young and old generations, the remaining old-generation size after each Full GC, and so on. There are many other indicators as well; the definitions of the related indicators can be found on GitHub.

What matters most here is the old-generation space still occupied after a Full GC. Following the Java heap sizing rule in the book "Java Performance: The Definitive Guide": if the old-generation space occupied after a Full GC is M, then the recommended total heap size is 3 to 4 times M, the recommended young generation is 1 to 1.5 times M, and the recommended old generation is 2 to 3 times M. Of course, for the actual memory configuration you can scale these ratios up according to the situation to guard against traffic surges.
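As a worked illustration, here is a minimal sketch of this sizing rule (the Full GC occupancy value is hypothetical; the multipliers are taken from the rule above):

    // Heap sizing rule: M = old-generation space still occupied after a Full GC;
    // recommended heap = 3~4x M, young generation = 1~1.5x M, old generation = 2~3x M.
    public class HeapRecommendation {
        public static void main(String[] args) {
            long m = 512; // hypothetical old-gen occupancy after Full GC, in MB

            long recHeapMb = m * 4;             // total heap, taking the upper bound
            long recYoungMb = (long) (m * 1.5); // young generation
            long recOldMb = m * 3;              // old generation

            System.out.printf("recommended: heap=%dMB young=%dMB old=%dMB%n",
                    recHeapMb, recYoungMb, recOldMb);
        }
    }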

Therefore, through the GC logs of a Flink task, we can calculate a recommended total heap size for the task. When the gap between the recommended heap and the heap actually allocated to the task is too large, we conclude that the business side's memory configuration for the task can be reduced, thereby reducing machine memory usage.

1.2.2 The perspective of task message processing capabilities

For the analysis of a Flink task's message processing capability, we mainly look at whether the per-unit-time input of the data source the task consumes matches the message processing capability of each Operator / Task of the task. An Operator is a single operator of a Flink task, and a Task is the physical unit of execution formed after one or more operators are chained together.

Internally, the data source we generally use is Kafka. The per-unit-time input of a Kafka topic can be obtained by calling the Kafka broker's JMX metric interface. Alternatively, you can call the Flink REST monitoring APIs and sum the per-unit-time input of all Kafka source Tasks of the real-time task. However, since back pressure may affect the input measured on the source side, here we obtain the topic's per-unit-time input directly from the Kafka broker JMX interface.

After getting the per-unit-time input of the task's Kafka topic, the next step is to judge whether the task's message processing capability matches the data source input. The overall processing capability of a real-time task is bounded by its slowest Operator / Task. For example, suppose the Kafka topic consumed by a Flink task has an input of 20,000 records/s, but there is a Map operator with a parallelism of 10 in which the business side calls a Dubbo interface that takes 10 ms from request to response. The processing capacity of that Map operator is then 1,000 records/s (1,000 ms / 10 ms * 10 subtasks), so the task's overall processing capability drops to 1,000 records/s.

Since the processing of a single message record flows within a Task, we try to find the slowest Task in a real-time task; if all operators from the source end to the sink end are chained together, this amounts to finding the slowest Operator. At the source-code level, we added a custom metric recording the time a Flink Task / Operator spends processing a single record, which can then be obtained through the Flink REST API. We traverse all Tasks of a Flink job, find the JobVertex (a node of the JobGraph) whose Task processes records most slowly, then obtain the total output of all Tasks of that JobVertex, and finally compare it with the Kafka topic's per-unit-time input to judge whether the task's message processing capability is reasonable.

Suppose the per-unit-time input of the task's Kafka topic is S, the parallelism of the JobVertex containing the slowest Task is P, the per-unit-time output of that JobVertex is O, and the maximum single-record processing time among its Tasks is T. We then analyze with the following logic (a code sketch follows the list):

  1. When O is approximately equal to S, and 1 second / T * P is much greater than S, the task's parallelism can be reduced.
  2. When O is approximately equal to S, and 1 second / T * P is approximately equal to S, the task's parallelism is left unchanged.
  3. When O is much smaller than S, and 1 second / T * P is much smaller than S, the task's parallelism should be increased.
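A minimal sketch of this decision logic, with S, P, O and T as defined above (the factor used to judge "much greater" / "much smaller" and the 10% tolerance are hypothetical thresholds):

    // S: topic input (records/s), P: parallelism of the slowest JobVertex,
    // O: output of that JobVertex (records/s), T: max single-record processing time (seconds).
    public class ConcurrencyAdvisor {
        private static final double FACTOR = 2.0; // hypothetical "much greater/smaller" factor

        public static String advise(double s, int p, double o, double t) {
            double capacity = 1.0 / t * p; // total records/s the slowest vertex can handle
            if (approxEqual(o, s) && capacity > s * FACTOR) {
                return "reduce parallelism";
            } else if (approxEqual(o, s) && approxEqual(capacity, s)) {
                return "keep parallelism";
            } else if (o < s / FACTOR && capacity < s / FACTOR) {
                return "increase parallelism";
            }
            return "no clear conclusion";
        }

        private static boolean approxEqual(double a, double b) {
            return Math.abs(a - b) <= 0.1 * Math.max(a, b); // within 10%, hypothetical tolerance
        }
    }

For the Dubbo example above, advise(20000, 10, 1000, 0.01) returns "increase parallelism".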

At present, case 1 is the one we mainly encounter, and it is unreasonable in terms of CPU usage. Of course, a real-time task's traffic differs across time periods, so we run a periodic detection job; if a task is detected to match this case several times in a row, the platform automatically alerts the platform administrator to optimize and adjust its resources.
The following figure shows the resource analysis logic from the two perspectives of Flink task memory and message processing capability:

2. Analysis and practice of Flink from a memory perspective

2.1 Flink task garbage collector selection

A Flink task is essentially a Java program, so it also involves choosing a garbage collector. Choosing a garbage collector generally requires weighing two aspects:

  1. Throughput, that is, application running time / (application running time + garbage collection time). Note that shortening individual GC pauses does not necessarily increase throughput, because shorter pauses may come at the cost of a higher GC frequency.
  2. Latency. If your Java program interacts with external callers, GC pauses affect the latency those callers experience.

I think Flink tasks are throughput-oriented Java programs, so we weigh throughput more heavily. That does not mean latency is ignored entirely: heartbeats flow between the JobManager, TaskManager, and ResourceManager, and if pauses are too long, heartbeats may time out.

At present, we use an internal build of JDK 1.8. Our young-generation collector is Parallel Scavenge, which constrains the old-generation collector to Serial Old or Parallel Old. Since each pod of our Flink K8s tasks has a CPU limit of 0.6-1 core (at most 1 core), we chose Serial Old for the old generation: on a single core, multi-threaded garbage collection would pay the extra cost of thread switching.
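As a sketch of this setup, the collector combination and GC logging can be passed to the TaskManager JVM via the env.java.opts.taskmanager option in flink-conf.yaml (the log path here is hypothetical; on JDK 8, -XX:+UseParallelGC selects Parallel Scavenge for the young generation, and -XX:-UseParallelOldGC keeps the old generation on the serial collector):

    env.java.opts.taskmanager: -XX:+UseParallelGC -XX:-UseParallelOldGC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/opt/flink/log/gc.log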

2.2 Real-time task GC log acquisition

With the garbage collector set, the next step is to obtain the Flink task's GC logs. A Flink task generally consists of a single JobManager plus multiple TaskManagers, and it is the TaskManager GC logs we need for analysis. Do we need all of them? No: we sort TaskManagers by their Young GC count and take the top 16 for analysis. The Young GC count can be obtained through the Flink REST API.
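A minimal sketch of that ranking, assuming the standard Flink REST endpoints and the JDK 8 Parallel Scavenge collector, whose count is exposed as the Status.JVM.GarbageCollector.PS_Scavenge.Count metric (the web UI address is hypothetical):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;

    // Lists TaskManagers sorted by Young GC count, using the Flink REST API.
    public class YoungGcRanking {
        private static final ObjectMapper MAPPER = new ObjectMapper();
        private static final String YGC_METRIC =
                "Status.JVM.GarbageCollector.PS_Scavenge.Count";

        public static void main(String[] args) throws Exception {
            String base = "http://flink-web-ui:8081"; // hypothetical Flink web UI address
            JsonNode tms = MAPPER.readTree(new URL(base + "/taskmanagers")).get("taskmanagers");

            List<String[]> ranking = new ArrayList<>();
            for (JsonNode tm : tms) {
                String id = tm.get("id").asText();
                JsonNode metrics = MAPPER.readTree(new URL(
                        base + "/taskmanagers/" + id + "/metrics?get=" + YGC_METRIC));
                if (metrics.size() == 0) {
                    continue; // metric not available for this TaskManager
                }
                ranking.add(new String[]{id, metrics.get(0).get("value").asText()});
            }
            // Sort descending by Young GC count and keep the top 16 for GC log analysis.
            ranking.sort(Comparator
                    .comparingLong((String[] r) -> Long.parseLong(r[1])).reversed());
            ranking.stream().limit(16).forEach(r -> System.out.println(r[0] + " -> " + r[1]));
        }
    }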

For Flink on YARN real-time tasks, the GC log can be viewed directly via the TaskManager's log link and downloaded locally over HTTP. For Flink on K8s tasks, the GC log is first written to a cloud disk mounted into the pod via a K8s hostPath volume. We use Filebeat to watch the log files for changes and ship them to a downstream Kafka topic. Internally we have a custom log server that consumes these Kafka records, persists and manages the log files, and exposes a log download interface, through which we can download the GC log of any TaskManager that needs to be analyzed.
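A minimal Filebeat configuration for this collection path might look like the following sketch (the mount path, broker address, and topic name are hypothetical):

    filebeat.inputs:
      - type: log
        paths:
          - /mnt/flink-logs/*/gc.log    # hypothetical hostPath mount for the pods' GC logs
    output.kafka:
      hosts: ["kafka-broker-1:9092"]    # hypothetical broker list
      topic: "flink-gc-logs"            # hypothetical downstream topic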

2.3 Analyzing Flink task memory based on GCViewer

GCViewer is an open-source GC log analysis tool. Before using GCViewer, you need to clone the project code locally and then compile and package it.

When analyzing a real-time task's heap memory, first download the Flink TaskManager logs to the local machine, and then analyze them with GCViewer. If analyzing many TaskManager GC logs one at a time is too slow, you can use multiple threads; all of the above operations can be coded to produce the analysis results automatically (see the sketch at the end of this section). The following is the GCViewer command line:

java -jar gcviewer-1.37-SNAPSHOT.jar gc.log summary.csv

In the command above, gc.log is the name of a TaskManager GC log file, and summary.csv is the result of the log analysis. The following are the results of our platform's memory analysis for a real-time task:

The following describes some of the parameters in the screenshot above:

  1. RunHours, the number of hours the Flink task has been running
  2. YGSize, the maximum young-generation heap memory allocated to a TaskManager, in megabytes
  3. YGUsePC, the maximum usage rate of a TaskManager's young-generation heap
  4. OGSize, the maximum old-generation heap memory allocated to a TaskManager, in megabytes
  5. OGUsePC, the maximum usage rate of a TaskManager's old-generation heap
  6. YGCoun, the number of Young GCs of a TaskManager
  7. YGPerTime, the pause time of a single TaskManager Young GC, in seconds
  8. FGCount, the number of Full GCs of a TaskManager
  9. FGAllTime, the total Full GC time of a TaskManager, in seconds
  10. Throught, the TaskManager's throughput
  11. AVG PT (the avgPromotion parameter in the analysis result), the average size of objects promoted to the old generation per Young GC
  12. Rec Heap, the recommended heap size
  13. RecNewHeap, the recommended young-generation heap size
  14. RecOldHeap, the recommended old-generation heap size

Most of the memory analysis results above can be obtained directly through GCViewer, while the recommended heap size, recommended young-generation size, and recommended old-generation size are set according to the memory optimization rules of section 1.2.1.
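As mentioned above, each log is analyzed independently, so the per-log GCViewer runs can simply be parallelized; a minimal sketch (the log directory and thread count are hypothetical):

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Runs GCViewer over every downloaded TaskManager GC log in parallel.
    public class BatchGcAnalysis {
        public static void main(String[] args) throws Exception {
            File logDir = new File("/tmp/gc-logs"); // hypothetical download directory
            String gcViewerJar = "gcviewer-1.37-SNAPSHOT.jar";
            ExecutorService pool = Executors.newFixedThreadPool(4);

            for (File log : logDir.listFiles((d, name) -> name.endsWith(".log"))) {
                pool.submit(() -> {
                    try {
                        // Same invocation as the single-log command line above.
                        new ProcessBuilder("java", "-jar", gcViewerJar,
                                log.getPath(), log.getPath() + ".summary.csv")
                                .inheritIO().start().waitFor();
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(30, TimeUnit.MINUTES);
        }
    }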

3. Analysis and practice of Flink from the perspective of message processing

3.1 Obtaining a real-time task's Kafka Topic input per unit time

To analyze the message processing capability of a Flink task, the first step is to obtain the task's Kafka data source topics; currently, if the data source is not Kafka, we do not perform the analysis. Flink tasks fall into two categories: Flink Jar tasks and Flink SQL tasks. For a Flink SQL task, obtaining the Kafka data source is relatively simple: parse the Flink SQL code, collect the WITH parameters of each SqlCreateTable, and filter out the sink tables; if a table's connector type is kafka, the specific Kafka topic can be read from its WITH parameters.
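A deliberately simplified sketch of that extraction, using regular expressions over the CREATE TABLE statements instead of a real SQL parser (a production version would walk the parsed SqlCreateTable nodes and also exclude sink tables, which this sketch omits):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Pulls Kafka topics out of the WITH clauses of CREATE TABLE statements.
    public class SqlKafkaTopicExtractor {
        private static final Pattern CREATE_TABLE = Pattern.compile(
                "CREATE\\s+TABLE[\\s\\S]+?WITH\\s*\\(([\\s\\S]+?)\\)",
                Pattern.CASE_INSENSITIVE);
        private static final Pattern KAFKA_CONNECTOR = Pattern.compile(
                "'connector(?:\\.type)?'\\s*=\\s*'kafka", Pattern.CASE_INSENSITIVE);
        private static final Pattern TOPIC = Pattern.compile("'topic'\\s*=\\s*'([^']+)'");

        public static List<String> extract(String flinkSql) {
            List<String> topics = new ArrayList<>();
            Matcher tables = CREATE_TABLE.matcher(flinkSql);
            while (tables.find()) {
                String withClause = tables.group(1);
                // Keep only Kafka tables; telling sources from sinks (e.g. by checking
                // which tables are INSERT INTO targets) is omitted in this sketch.
                if (KAFKA_CONNECTOR.matcher(withClause).find()) {
                    Matcher topic = TOPIC.matcher(withClause);
                    if (topic.find()) {
                        topics.add(topic.group(1));
                    }
                }
            }
            return topics;
        }
    }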

Getting the Kafka topic data source of a Flink Jar task is more cumbersome. We have an internal real-time task lineage analysis service that automatically constructs a PackagedProgram (a class inside Flink) for each Flink Jar task. Through the PackagedProgram we can obtain the StreamGraph of the Jar task, which contains all of the source and sink StreamNodes. Through reflection, we can get the concrete source function from a StreamNode, and if it is a Kafka source function, we record its Kafka topic.
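A rough sketch of this lineage extraction, assuming a Flink version whose client API provides PackagedProgramUtils.getPipelineFromProgram and the universal Kafka connector (the jar path is hypothetical, and the topicsDescriptor field is a connector internal whose name may differ across versions):

    import org.apache.flink.api.dag.Pipeline;
    import org.apache.flink.client.program.PackagedProgram;
    import org.apache.flink.client.program.PackagedProgramUtils;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.graph.StreamGraph;
    import org.apache.flink.streaming.api.graph.StreamNode;
    import org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator;
    import org.apache.flink.streaming.api.operators.SimpleOperatorFactory;

    import java.io.File;
    import java.lang.reflect.Field;

    // Builds the StreamGraph of a Flink Jar task and pulls Kafka topics out of its sources.
    public class JarKafkaTopicExtractor {
        public static void main(String[] args) throws Exception {
            PackagedProgram program = PackagedProgram.newBuilder()
                    .setJarFile(new File("/path/to/user-job.jar")) // hypothetical user jar
                    .build();
            Pipeline pipeline = PackagedProgramUtils
                    .getPipelineFromProgram(program, new Configuration(), 1, true);
            StreamGraph streamGraph = (StreamGraph) pipeline;

            for (Integer sourceId : streamGraph.getSourceIDs()) {
                StreamNode node = streamGraph.getStreamNode(sourceId);
                Object factory = node.getOperatorFactory();
                if (!(factory instanceof SimpleOperatorFactory)) {
                    continue;
                }
                Object operator = ((SimpleOperatorFactory<?>) factory).getOperator();
                if (operator instanceof AbstractUdfStreamOperator) {
                    Object sourceFn = ((AbstractUdfStreamOperator<?, ?>) operator).getUserFunction();
                    // For a Kafka source, read the subscribed topics reflectively; the
                    // field below is a connector internal and may vary by version.
                    if (sourceFn.getClass().getName().contains("FlinkKafkaConsumer")) {
                        Field f = sourceFn.getClass().getSuperclass()
                                .getDeclaredField("topicsDescriptor");
                        f.setAccessible(true);
                        System.out.println("kafka source topics: " + f.get(sourceFn));
                    }
                }
            }
        }
    }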

After obtaining the Kafka topic data sources of the Flink task, the next step is to obtain the number of message records entering each topic per unit time. This can be obtained through the Kafka broker's JMX metric interface; we obtain it through an external interface provided by our internal Kafka management platform.
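For reference, if you query a broker's JMX interface directly, the per-topic input rate is exposed by the MessagesInPerSec MBean; a minimal sketch, assuming JMX is enabled on the broker (the address and topic are hypothetical, and the rate must be summed across brokers for the topic total):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Reads a Kafka topic's one-minute input rate from a broker's JMX metrics.
    public class TopicInputRate {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://kafka-broker-1:9999/jmxrmi"); // hypothetical
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection conn = connector.getMBeanServerConnection();
                ObjectName mbean = new ObjectName(
                        "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec,topic=my_topic");
                Double rate = (Double) conn.getAttribute(mbean, "OneMinuteRate");
                // This is the rate on one broker; sum across all brokers for the topic.
                System.out.println("records/s on this broker: " + rate);
            }
        }
    }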

3.2 Automatically detecting the slowest Task in Flink message processing

First of all, at the source-code level we added a metric for the time a Flink Task spends processing a single record; this metric can be obtained through the Flink REST API. The next step is to use the Flink REST API to traverse all Tasks of the Flink job to be analyzed. The Flink REST API has the following interface:

base_flink_web_ui_url/jobs/:jobid

This interface returns all vertices of a job. A vertex can be simply understood as a JobVertex in the Flink task's JobGraph; a JobVertex represents a piece of execution logic of the real-time task.

After obtaining all vertices of the Flink task, the next step is to obtain the single-record processing metric of each vertex's Tasks, using the vertex metrics interface:

base_flink_web_ui_url/jobs/:jobid/vertices/:vertexid/metrics

You append ?get=(specific metric) to the metrics interface above, for example metrics?get=0.Filter.numRecordsOut, where 0 is the subtask id of the vertex's Task and Filter.numRecordsOut is the specific metric name. Internally we use taskOneRecordDealTime as the metric for a Task's single-record processing time, so 0.taskOneRecordDealTime fetches the single-record processing time of subtask 0. The interface supports querying multiple metrics at once: separate their names with commas after get=.

The overall steps for automatically detecting the slowest Task of a Flink job are as follows (a code sketch follows the list):

  1. Get all vertices of the real-time task.
  2. For each vertex, fetch the taskOneRecordDealTime of all of its parallel Tasks and record the maximum value.
  3. Compare the maxima of all vertices and take the vertex with the largest single-record processing time as the slowest.
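A minimal sketch of these steps over the REST API (the web UI address is hypothetical, and taskOneRecordDealTime exists only in our internally patched Flink):

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.net.URL;

    // Finds the JobVertex whose Tasks are slowest at processing a single record.
    public class SlowestVertexDetector {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static void main(String[] args) throws Exception {
            String base = "http://flink-web-ui:8081"; // hypothetical Flink web UI address
            String jobId = args[0];

            JsonNode vertices = MAPPER.readTree(new URL(base + "/jobs/" + jobId)).get("vertices");
            String slowestVertex = null;
            double slowestTime = -1;

            for (JsonNode vertex : vertices) {
                String vertexId = vertex.get("id").asText();
                int parallelism = vertex.get("parallelism").asInt();
                // Step 2: query taskOneRecordDealTime for every subtask of this vertex.
                StringBuilder get = new StringBuilder();
                for (int i = 0; i < parallelism; i++) {
                    if (i > 0) get.append(',');
                    get.append(i).append(".taskOneRecordDealTime");
                }
                JsonNode metrics = MAPPER.readTree(new URL(base + "/jobs/" + jobId
                        + "/vertices/" + vertexId + "/metrics?get=" + get));
                for (JsonNode metric : metrics) {
                    double v = metric.get("value").asDouble();
                    if (v > slowestTime) { // steps 2-3: keep the global maximum
                        slowestTime = v;
                        slowestVertex = vertex.get("name").asText();
                    }
                }
            }
            System.out.println("slowest vertex: " + slowestVertex + ", time: " + slowestTime);
        }
    }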

The following is the result of our real-time platform analysis of a Flink real-time task:

4. Youzan Flink real-time task resource optimization practice

Now that we have ways to analyze a Flink task's memory and message processing capability, the next step is to apply them on the real-time platform. Our platform scans all running Flink tasks once a day. On the memory side, we combine the task's GC logs with the memory optimization rules to calculate a recommended heap size, and compare it with the heap actually allocated to the task. If the gap between the two is too large, we consider the task's memory configuration wasteful and alert the platform administrator to optimize it.

After receiving the alert, the platform administrator also judges whether the task's message processing capability is reasonable. If, for the slowest vertex (a piece of the real-time logic), the total number of records processed by all of its Tasks per unit time is approximately equal to the per-unit-time input of the Kafka topic the task consumes, but the capacity computed from the vertex's parallelism and single-record processing metric is much larger than that input, then the Flink task's parallelism can be reduced appropriately. The specific adjustment is made after communicating with the business side. The overall Flink task resource optimization process is as follows:

5. Summary

At present, Youzan's real-time computing platform has taken the first step in exploring Flink task resource optimization: tasks that can be optimized are discovered automatically, and the platform administrator then steps in to analyze them and decide whether a task's resources should be adjusted. The overall optimization pipeline is not yet fully automated, because the second half still requires human judgment. In the future, we plan to make Flink task resource optimization fully automatic: combining a task's historical resource usage across different time periods, automatically predicting and adjusting its resource configuration, and thereby improving the resource utilization of the whole real-time cluster.

At the same time, we will work with colleagues on the metadata platform to analyze more dimensions of possible resource optimization for real-time tasks. They have accumulated a great deal of optimization experience with offline task resources, which we can draw on and apply to real-time task resource optimization in the future.

Of course, the ideal end state is for real-time tasks to scale their resource usage elastically and automatically on their own. I have heard community members discuss work in this direction before, and everyone is welcome to discuss it with me.

Author|Shen Lei

Original link

This article is the original content of Alibaba Cloud and may not be reproduced without permission.


Origin blog.csdn.net/weixin_43970890/article/details/114063765