The practice of cost reduction and efficiency improvement for Xiaohongshu's Flink data integration service

Abstract: This article is compiled from a talk by Yuan Kui, a real-time engine R&D engineer, in the data integration session of Flink Forward Asia 2022. The content is divided into four parts:

  1. Background: cost reduction and efficiency improvement for Xiaohongshu's real-time services
  2. Flink and online/offline colocation in practice
  3. Problems encountered in practice and solutions
  4. Future outlook

Click to view the original video & speech PPT

1. Background: cost reduction and efficiency improvement for Xiaohongshu's real-time services

1.1 Characteristics of Flink usage scenarios at Xiaohongshu

Flink usage at Xiaohongshu has three main characteristics:

  • First, a cloud-native, complex multi-cloud architecture spanning domestic and overseas regions. Since its founding, Xiaohongshu has built its entire technology stack on the public cloud, making it cloud-native in the truest sense.

    We work with many cloud vendors, such as AWS, Tencent Cloud, Huawei Cloud, and Alibaba Cloud. After years of growth, business data has spread across the different vendors. Being cloud-native brings natural benefits: resource isolation and scaling, for example, are very easy.

  • Second, the data integration links are long, and jobs contend with each other for resources during peak hours. Under the multi-cloud architecture, data is frequently transferred across clouds, so data integration jobs are indispensable. We used to run Flink data integration on a dedicated cluster, but as the number of integration jobs grew, resource contention became more and more common.

    Since Flink integration jobs are batch jobs, most of them run at the same time in the early morning, and some fail because they cannot obtain resources. Meanwhile, the overall utilization of the resource pool is low, because few batch jobs run during the day and the resources sit idle.

  • Third, both high-priority and low-priority data integration jobs run on Flink's streaming-mode engine. There are historical reasons for this: the batch-mode engine in early Flink versions was not mature, and streaming mode is simpler and fast, with no intermediate data to store. When resources were abundant it was the better choice.

1.2 Xiaohongshu Flink data integration service

Typical data integration at Xiaohongshu comes in many flavors, such as Hive to ClickHouse, Hive to Doris, Hive to MySQL, and MongoDB to Hive.

The right side of the figure above shows a job topology: a data source performs a lookup join against MongoDB, and the result is split into two streams that are written to downstream systems. This is a typical Flink data integration job.
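
Below is a minimal sketch of what an integration job of this kind might look like with Flink's Table API in batch mode. It is illustrative only: the catalog name, Hive configuration path, table names, schema, and JDBC options are assumptions rather than Xiaohongshu's actual setup, and the MongoDB lookup join and second sink shown in the figure are omitted for brevity.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveToMysqlSketch {
    public static void main(String[] args) {
        // Batch-mode Table API job: read a Hive partition and write it to MySQL.
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Register the Hive catalog (catalog name, default database, conf dir are illustrative).
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "ods", "/etc/hive/conf"));

        // JDBC sink; declaring a primary key lets the connector write upserts
        // rather than plain appends.
        tEnv.executeSql(
            "CREATE TABLE mysql_sink ("
                + "  id BIGINT,"
                + "  note_cnt BIGINT,"
                + "  PRIMARY KEY (id) NOT ENFORCED"
                + ") WITH ("
                + "  'connector' = 'jdbc',"
                + "  'url' = 'jdbc:mysql://mysql-host:3306/report',"
                + "  'table-name' = 'note_stats'"
                + ")");

        // The integration itself: one INSERT INTO moving a day's partition downstream.
        tEnv.executeSql(
            "INSERT INTO mysql_sink "
                + "SELECT id, note_cnt FROM hive.ods.note_stats WHERE dt = '2022-11-01'");
    }
}
```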

1.3 The broader environment calls for cost reduction and efficiency improvement

As Xiaohongshu has grown, its infrastructure has matured and resource usage has become more standardized. The era of requesting resources without restraint is over; we now pay close attention to cluster CPU utilization.

Against this backdrop, let's look at our Flink resource clusters. On the one hand, they are mostly dedicated clusters, and the small resource pools with few jobs are prone to fragmentation and wasted resources. On the other hand, the cluster that runs Flink integration jobs suffers resource contention at night, while during the day its resources sit idle and go unused, so overall utilization is low.

Given these two problems, how can we improve overall resource utilization? There are two directions:

  • First, how to avoid small clusters. We can merge the small clusters and use Kubernetes ResourceQuota for isolation between tenants (a hedged sketch follows this list). An even better option is the colocated cluster (mixed deployment of online services and offline jobs) provided by the container team: migrate the jobs from the small clusters onto the colocated cluster and release the small clusters' resources.
  • Second, how to reduce resource contention during peak hours. On the platform side, we can improve resource scheduling and refine job priorities. On the Flink engine side, we can promote the batch-mode engine, since it requires fewer resources. Our entry point, however, is the resource angle.
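
To make the ResourceQuota part of the first point concrete, here is a hedged sketch using the Fabric8 Kubernetes client; the namespace, quota name, and limits are invented for illustration, and in practice the same quota is just as often applied as a plain YAML manifest.

```java
import io.fabric8.kubernetes.api.model.Quantity;
import io.fabric8.kubernetes.api.model.ResourceQuota;
import io.fabric8.kubernetes.api.model.ResourceQuotaBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;

public class TeamQuotaSketch {
    public static void main(String[] args) {
        try (KubernetesClient client = new KubernetesClientBuilder().build()) {
            // Cap what one team's namespace may request in the merged cluster, so that
            // merging small clusters does not let a single tenant starve the others.
            ResourceQuota quota = new ResourceQuotaBuilder()
                .withNewMetadata()
                    .withName("team-a-quota")
                    .withNamespace("flink-team-a")   // illustrative namespace
                .endMetadata()
                .withNewSpec()
                    .addToHard("requests.cpu", new Quantity("200"))
                    .addToHard("requests.memory", new Quantity("800Gi"))
                    .addToHard("pods", new Quantity("500"))
                .endSpec()
                .build();

            client.resourceQuotas().inNamespace("flink-team-a").createOrReplace(quota);
        }
    }
}
```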

1.4 Comparing Flink's streaming mode and batch mode from the cost and efficiency perspective

Next, let's compare Flink's streaming mode and batch mode from the resource perspective.

Flink's streaming-mode engine has no concept of stages at runtime; data flows through the job as a pipeline, which requires the resources for all operators and all parallel instances to be available at the same time before the job can run. With the batch-mode engine, a job is divided into several stages, and the next stage runs only after the previous one finishes, so at any moment only part of the operators and parallel instances need resources.

From another angle, aggregation-style batch jobs inevitably introduce state and watermarks when they run in streaming mode, which costs extra CPU and memory. In batch mode, state and watermarks are not needed; only the intermediate shuffle data has to be stored, which puts more pressure on disk, but disk is cheaper than CPU and memory.

That is the resource-oriented comparison of streaming mode and batch mode, and also the main consideration behind switching batch jobs from streaming mode to batch mode.
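
As a concrete illustration of the switch, the same DataStream program can be flipped from streaming to batch execution with a single setting; this is a generic Flink sketch rather than one of Xiaohongshu's production jobs.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExecutionModeSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // STREAMING: every operator needs resources at the same time and the
        // aggregation keeps state (and watermarks) in memory.
        // BATCH: stages run one after another, intermediate data goes to blocking
        // shuffles on disk, and each key emits one final result.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        // The same switch without code changes, e.g. at submission time:
        //   -Dexecution.runtime-mode=BATCH

        env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 1), Tuple2.of("a", 1))
           .keyBy(t -> t.f0)
           .sum(1)
           .print();

        env.execute("execution-mode-sketch");
    }
}
```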

2. Flink and online/offline colocation in practice

2.1 The K8s cluster under online/offline colocation

Let's first look at what online/offline colocation is. Companies generally run two kinds of workloads. The first is online services, which run continuously and whose traffic and resource utilization are tidal: during the day, when there are many users, traffic and utilization are high; at night, as users drop off, utilization falls. The second is offline jobs, which run only for a bounded period, with very high resource utilization while running, and which are generally insensitive to latency as long as they finish before a given deadline; outside that window their resources are idle.

Colocation means using the idle resources of online services to run offline jobs, improving overall resource utilization. For offline workloads it greatly reduces resource costs. While offline jobs run on the colocated cluster, the online services must be protected, so the offline jobs may be subject to measures such as resource throttling.

The figure above is a schematic of a colocated cluster. The container team aggregates the idle resources of the various online service clusters into one resource cluster. From the user's point of view only virtual nodes are visible, but each virtual node is actually backed by one or more real resource nodes. For users, a virtual cluster is used exactly like a real dedicated cluster; the only difference is that the resources behind the virtual nodes may change continuously. The container team offered this colocated cluster just as we had offline jobs and pressure on resource utilization, so the two sides hit it off.

2.2 Characteristics of offline jobs suited to colocation

Which jobs are suitable for migration? We mainly look for three characteristics:

  • First, the migrated jobs must not be latency-sensitive, because offline resources on the colocated cluster can be squeezed and offline jobs may therefore take longer to run.

  • Second, the jobs should follow a tidal pattern: we want to migrate offline jobs that run in large numbers exactly when online resources are idle. Generally, online services have spare resources at night, and offline jobs are concentrated at night, so the two fit well.

  • Third, the jobs should be fault-tolerant, because on the colocated cluster their resources may be squeezed and their Pods may be evicted, so a certain degree of fault tolerance is required.

2.3 Flink jobs suited to colocation

We select jobs along the following lines:

  • For batch jobs, a Pod may be evicted and then restarted on another node, where it may re-consume data and produce duplicates. So we migrate batch jobs whose sink supports idempotent insertion, or which do not care about duplicate data (see the sketch after this list).
  • For the batch-mode engine, we prefer jobs whose operators can all be chained together. If operators are not chained, the intermediate data is spilled to disk, which places higher demands on the resource nodes.
  • We try to pick batch jobs that run in large numbers at night, because that is when the colocated cluster's resources are most idle.
  • A colocated cluster is generally not well suited to streaming jobs, but since it has some idle resources during the day that can carry part of the streaming load, we also migrate a few low-priority streaming jobs; these must be able to tolerate failover and a period of delay.
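
As a hedged sketch of the idempotent-sink idea above: a MySQL upsert keyed on the primary key, so that records replayed after a Pod eviction overwrite existing rows instead of duplicating them. The table name, schema, and connection details are illustrative.

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IdempotentSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(Tuple2.of(1L, 100L), Tuple2.of(2L, 200L))
           // Upsert by primary key: re-running the same records after an eviction
           // overwrites existing rows rather than inserting duplicates.
           .addSink(JdbcSink.sink(
               "INSERT INTO note_stats (id, cnt) VALUES (?, ?) "
                   + "ON DUPLICATE KEY UPDATE cnt = VALUES(cnt)",
               (ps, t) -> {
                   ps.setLong(1, t.f0);
                   ps.setLong(2, t.f1);
               },
               JdbcExecutionOptions.builder().withBatchSize(1000).build(),
               new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                   .withUrl("jdbc:mysql://mysql-host:3306/report")  // illustrative
                   .withDriverName("com.mysql.cj.jdbc.Driver")
                   .withUsername("user")
                   .withPassword("password")
                   .build()));

        env.execute("idempotent-upsert-sink-sketch");
    }
}
```

In SQL jobs the same effect comes from declaring a primary key on a JDBC sink table, which switches the connector into upsert mode.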

2.4 How Flink is built on the colocated cluster

First, we deploy a dedicated Flink cluster that has no dedicated nodes of its own, and the container team deploys virtual nodes into it. Behind each virtual node are a controller and real resource nodes. When we submit a job, we simply submit it to the virtual node: the Deployment brings up the JobManager Pod on the virtual node, and the creation request is forwarded by the virtual node's controller to the real resource nodes behind it for execution.

We use Flink's native Kubernetes integration, so the TaskManagers are started by the JobManager. Their creation follows the same path as the Deployment's and is likewise forwarded by the virtual node to the real resource nodes. In other words, the JobManager and TaskManager Pods all end up running on the backing resource nodes, and only mirror copies of the Pods exist on the virtual nodes. For Kubernetes resources such as ConfigMaps, Services, and Ingresses, the source data lives in etcd, and only part of it needs to be synchronized.

In this way we can submit jobs to the dedicated Flink cluster as usual and operate on Pods with kubectl as usual. For us, using the colocated virtual cluster is the same as using an ordinary dedicated Flink cluster. Of course, there were problems during implementation: for example, the JobManager and TaskManagers may belong to two different clusters, so how should they communicate, and how should logs and monitoring metrics be collected? These are engineering details that I won't go into here.

3. Problems encountered in practice and solutions

This section covers some of the problems we ran into in practice. Since everything we run is cloud-native, the problems and solutions here also center on cloud-native environments.

3.1 Avoiding leftover temporary data files on the host

The first problem is how to avoid leaving temporary data files behind on the host. Anyone who has used containers on K8s runs into this. By default, when a container starts, its temporary data files are written to the Docker disk, and if they grow too large they affect Docker's stability. To avoid this we can mount an additional data disk into the container and have the temporary files written there instead, so that Docker's stability is not affected.

In K8s, data disks are usually mounted as hostPath volumes. The advantage is that you can choose a specific directory on the host and the mount is simple, but hostPath relies on the program's own cleanup logic for temporary files. If a Pod exits abnormally, for example it hits OOM and is killed by K8s before the cleanup logic has had a chance to run, the temporary files remain on the host. As leftover files accumulate and eventually fill the data disk, they start to affect the stability of the jobs running there. So how did we solve this?

K8s offers another volume type, emptyDir, whose lifecycle is tied to the Pod. Whether the Pod ends normally or abnormally, the temporary data files under the emptyDir mount are cleaned up when the Pod ends, which removes the dependence on the program's own cleanup logic.

One caveat is that emptyDir cannot be pointed at an arbitrary host directory; its data is stored under the kubelet working directory, which by default sits on the system disk. Left unchanged, writing temporary files to the system disk can affect system stability, so we generally move the kubelet working directory to another data disk at startup.
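
To make the difference concrete, below is a sketch of a TaskManager-style Pod spec that mounts an emptyDir volume for Flink's temporary directory, built with the Fabric8 Kubernetes client (the client Flink's native Kubernetes integration itself uses). In practice this would normally be declared in a pod template YAML file; the image, names, and paths are illustrative.

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;
import io.fabric8.kubernetes.client.utils.Serialization;

public class EmptyDirPodSketch {
    public static void main(String[] args) {
        Pod pod = new PodBuilder()
            .withNewMetadata().withName("flink-taskmanager").endMetadata()
            .withNewSpec()
                // emptyDir lives and dies with the Pod, so leftover temporary files are
                // reclaimed even after an OOM kill; a hostPath volume would instead rely
                // on the job's own cleanup logic.
                .addNewVolume()
                    .withName("flink-tmp")
                    .withNewEmptyDir().endEmptyDir()
                .endVolume()
                .addNewContainer()
                    .withName("flink-main-container")
                    .withImage("flink:1.16")                      // illustrative image
                    // Point Flink's temporary files at this mount, e.g. io.tmp.dirs=/opt/flink/tmp.
                    .addNewVolumeMount()
                        .withName("flink-tmp")
                        .withMountPath("/opt/flink/tmp")
                    .endVolumeMount()
                .endContainer()
            .endSpec()
            .build();

        // Print the equivalent YAML, which could serve as a Flink pod template.
        System.out.println(Serialization.asYaml(pod));
    }
}
```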

3.2 Batch-mode OOM problems in the cloud-native setting

The second problem is an OOM issue with batch mode in a cloud-native environment. The job in question ran smoothly on the streaming-mode engine, but after switching to the batch-mode engine it started hitting OOM frequently.

Even after chaining, the job still has two operators, which means there is a data shuffle between them, and the OOM occurs in the stage that writes the shuffle data. The monitoring chart in the upper right of the figure clearly shows two stages: in the first, shuffle-writing stage, the working set sometimes spikes, and once it exceeds the container limit an OOM kill is triggered.

When this happened, we first checked heap memory usage in Flink's web UI, and it was normal; the GC monitoring also looked fine. We then suspected an off-heap memory leak, so we went into the Pod and inspected RSS with the pmap command. As the chart in the lower right corner shows, RSS was also normal, only about 7 GB against a 20 GB limit, so the problem was not an off-heap memory leak.

At this point the answer is nearly clear. The working-set metric can be roughly understood as RSS plus page cache. RSS was normal but the working set spiked, so we suspected the OOM was caused by the page cache.

Following this line of thought, we logged into the machine node and checked the kernel logs. As the figure above shows, we found a call stack indicating that the OOM was triggered while allocating page cache. The root cause is that the cloud disk's throughput is insufficient: when a large volume of shuffle data is written into the page cache in a short burst, it cannot be flushed to disk in time, memory usage overshoots, and an OOM kill is triggered.

We have interim mitigations: increase the number of Pods so that each Pod handles less data, and spread the Pods across different machine nodes to relieve per-node pressure; or upgrade the machine kernel and throttle the writes by tuning kernel parameters. Beyond that, we can also work on the Flink engine itself and rate-limit the shuffle-writing stage directly.

4. Future Outlook

Xiaohongshu's future exploration will focus on three directions.

  • First, dig deeper into batch-mode applications. We hope to work closely with users, explore more scenarios for the batch-mode engine, and genuinely advance Flink's unified stream and batch processing.
  • Second, use the K8s ResourceQuota feature to merge the business teams' many small clusters and reduce machine resource fragmentation.
  • Third, serverless deployment is an important goal for the batch-mode engine in cloud-native environments, but going serverless naively means that when a Pod is killed its intermediate data is wiped, which hurts the job's failure recovery. This is where a remote shuffle service proves its value: it effectively reduces the dependence on local disks, improves resource utilization, and fits the cloud-native architecture.

Click to view the original video & speech PPT
