Sharing a Flink checkpoint failure problem and its solution


I have been working with Flink for a while and have run into a few problems. One of them is job restarts caused by checkpoint failures. I had hit it many times, and since the job usually recovered after the restart, I didn't pay much attention. In the past two days it has been happening frequently, so here is a record of the solution and the analysis process.

Our Flink test environment has 3 nodes. The deployment co-locates an HDFS DataNode on each Flink node, and HDFS is used for Flink checkpoints and savepoints.
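For context, the checkpoint and savepoint locations point at HDFS roughly like the following flink-conf.yaml snippet (the hostname, port, and paths here are placeholders, not our real values):

state.backend: filesystem
state.checkpoints.dir: hdfs://namenode:8020/flink/checkpoints
state.savepoints.dir: hdfs://namenode:8020/flink/savepoints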

Phenomenon

Looking at the log, 3 DataNodes are alive and the replication factor is 1, yet writing the file fails:

There are 3 datanode(s) running and no node(s) are excluded

I searched the Internet for this error but found no direct answer. I looked at the NameNode log as well, and there was no more direct information there.

Everything looked normal on the 50070 web UI: the DataNodes still had plenty of free space, with utilization below 10%.

I tried putting a file onto HDFS and then getting it back, and both worked, which suggests the HDFS service itself is fine and the DataNodes are doing their job.
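The sanity check was nothing more than something like this (the paths are just examples):

hdfs dfs -put /tmp/test.txt /tmp/test.txt
hdfs dfs -get /tmp/test.txt /tmp/test-copy.txt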

Log phenomenon 1

Continuing through the NameNode log, I noticed some warning messages. At this point I suspected a problem with the block placement policy.

Following the hint in the log, I turned on the corresponding debug switches by modifying

etc/hadoop/log4j.properties

Find the existing line

log4j.logger.org.apache.hadoop.fs.s3a.S3AFileSystem=WARN

and, following the same format, add the lines below it:

log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy=DEBUG
log4j.logger.org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor=DEBUG
log4j.logger.org.apache.hadoop.net.NetworkTopology=DEBUG

Restart the NameNode, then rerun the Flink job.

Log phenomenon 2

What the log shows at this point is that the rack awareness policy cannot be satisfied, because we did not provide a rack mapping script, so by default all nodes are treated as being in the same rack. Thinking it through, though, that shouldn't matter.

Many production HDFS clusters don't configure a rack mapping script either, and the checkpoint failure is not always present; at the very least, put/get of files works normally.
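(For reference only: if we had wanted a rack mapping script, it would be wired in through the standard net.topology.script.file.name property in core-site.xml; the script path below is a placeholder, and we did not end up needing this.)

<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>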

At this point I started to look at the HDFS source code. Following the call stack in the log above, the first places to look are BlockPlacementPolicyDefault and the related DatanodeDescriptor.

The gist of this code is that when a DataNode is being chosen for a block, some checks are performed on it, such as how much space it has left and how busy it is.

When the problem recurred, the log showed some key information related to this:

What this log says is that the storage space is 43 GB and the block being allocated actually needs only a little over 100 MB, but the scheduled size exceeds 43 GB. So a DataNode that looks perfectly healthy to us is considered by the NameNode to have insufficient space.

The reason

What does "scheduled size" mean? From the code, the scheduled size is the block size multiplied by a counter, and the counter effectively counts the blocks of new files scheduled to be written to that node. HDFS uses these two values to estimate the storage space that may be needed, which amounts to reserving a certain amount of space on each DataNode in advance; the reservation is adjusted back down to the space actually used once the file has been written.
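A simplified sketch of that check, with made-up numbers close to our case (this is an illustration of the idea, not the actual HDFS source):

public class ScheduledSizeSketch {
    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;              // dfs.blocksize, 128 MB by default
        long remaining = 43L * 1024 * 1024 * 1024;        // ~43 GB actually free on the datanode

        int blocksScheduled = 350;                        // small checkpoint files recently scheduled here
        long scheduledSize = blockSize * blocksScheduled; // ~43.75 GB "reserved" in advance

        long requiredSize = blockSize;                    // a new block reserves a full block size
        // The datanode is rejected even though 43 GB is free, because the
        // advance reservation alone already exceeds the remaining space.
        boolean goodTarget = remaining - scheduledSize >= requiredSize;
        System.out.println("accepted as target? " + goodTarget);  // prints false
    }
}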

Once this principle is understood, the judgment is that too much space had been reserved on the DataNodes within a short period of time.

Flink's checkpoint mechanism is described in this article: https://www.jianshu.com/p/9c587bd491fc. Roughly speaking, many task threads on each TaskManager write to HDFS.
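A minimal sketch of how a job ends up doing that, assuming the FsStateBackend (this is illustrative, not our actual job code; the hostname and path are placeholders):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointToHdfsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L);  // snapshot operator state every 10 seconds
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        env.fromElements(1, 2, 3).print(); // stand-in for the real pipeline
        env.execute("checkpoint-to-hdfs-sketch");
    }
}

With many operators and high parallelism, every checkpoint interval produces many such state files under the checkpoint directory.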

Looking at the directory structure on HDFS, there are a large number of checkpoint files with UUID-like names, and each file is very small.

When a job runs with higher parallelism, correspondingly more checkpoint files are created on HDFS. Although each of our files is only a few KB, the space reserved on each DataNode is 128 MB times the number of files being allocated (since every file is far smaller than 128 MB). How many such reservations can 43 GB of space hold? Only a little over 300 per node (43 GB / 128 MB ≈ 340), or roughly 900 to 1000 across the three nodes. We run multiple jobs with a fairly large total parallelism, so this problem easily occurs before the reserved space is released again.

Everyone knows HDFS is not suited to storing small files, and the usual reason given is that a huge number of small files consumes inodes and grows the block location metadata, putting pressure on the NameNode's memory. This example also shows that when the block size is set large and the files are much smaller than the block size, such small files can directly make DataNodes look "unavailable".

Solution

The block size is not a cluster attribute but a file attribute, and it can be set by the client. Here, each Flink TaskManager and JobManager is an HDFS "client". According to the Flink documentation, we can do the following configuration:
1. Specify an HDFS (Hadoop) configuration path in conf/flink-conf.yaml

fs.hdfs.hadoopconf: /home/xxxx/flink/conf

Here it is the same directory as Flink's own configuration files.

2. Put two configuration files in that directory: core-site.xml and hdfs-site.xml.

core-site.xml can be left alone if the checkpoint and savepoint paths already specify the full HDFS address.

Add the block size configuration to hdfs-site.xml; for example, here we set it to 1 MB:
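A minimal hdfs-site.xml for this purpose could look like the following (dfs.blocksize is the standard HDFS property; 1m is just the value we chose):

<configuration>
  <property>
    <name>dfs.blocksize</name>
    <value>1m</value>
  </property>
</configuration>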

How large to set the block size depends on your own jobs; observe the job status and the actual checkpoint file sizes and adjust flexibly.

Restart the Flink cluster and submit the job. While it runs, keep an eye on the size of the HDFS fsimage; be careful not to let the metadata grow too large because the blocks are too small and the small files too many.

Summary

We have fed this issue back into the automated cluster deployment scripts, which now explicitly add the blocksize configuration during deployment.

Relying on HDFS for Flink checkpoints is a bit heavyweight for lightweight stream-computing scenarios: whether you checkpoint directly to the filesystem or use RocksDB, the distributed storage for checkpoints still needs HDFS. Considering the principle behind checkpoints and the kind of data they hold, ES (Elasticsearch) might also be a good choice; unfortunately, the community does not provide such an option.


Article source: https://club.perfma.com/article/1733519
