How does Flink State TTL (Time To Live) deal with rapidly growing state? How does the checkpoint mechanism work?

In Flink streaming jobs, we often encounter situations where state keeps accumulating and its size grows without bound, for example when a job defines a very long time window. If this is not handled well, it often causes OOM in heap memory, or the continuously growing off-heap memory (RocksDB) exceeds the container's quota, so that jobs that used to run stably crash frequently and the business is affected.

Starting from Flink 1.6, the community introduced the State TTL feature, which allows keyed state defined in a job to be cleaned up automatically after it expires (most state in Flink is keyed state and only a few places use operator state, so "state" in this article refers to keyed state). It also provides several configuration parameters that flexibly control when the timestamp is updated and whether expired state is visible, to cover different requirements.

Essentially, the State TTL feature attaches a timestamp to each piece of keyed state. Flink updates this timestamp when the state is created or written (and optionally when it is read), and uses it to determine whether the state has expired. If the state has expired, the visibility parameter decides whether an expired but not yet cleaned-up value may still be returned. Cleanup of expired state is not instantaneous; it is implemented lazily, which reduces the performance impact of state cleanup.

1. Usage of State TTL function

To become familiar with a feature, the most intuitive way is to look at how it is used. Flink's official documentation gives the following example:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
    .newBuilder(Time.seconds(1))
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    .build();
    
ValueStateDescriptor<String> stateDescriptor = new ValueStateDescriptor<>("text state", String.class);
stateDescriptor.enableTimeToLive(ttlConfig);

As you can see, to use the State TTL feature you first define a StateTtlConfig object, which is created through a builder pattern: the typical usage is to pass in a Time object as the TTL and then set the update type (UpdateType) and state visibility (StateVisibility); the meaning of these two parameters is described in detail below. Once the StateTtlConfig object has been built, the State TTL feature can be enabled on a state descriptor declared later.

The code above also shows that the expiration time configured via State TTL is not global but is bound to a specific state. In other words, if you want TTL to take effect for all states, you need to pass the StateTtlConfig object to every state descriptor you use, as in the sketch below.
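
For example, a minimal sketch (the state names and types are arbitrary examples) that reuses the ttlConfig object from above for several descriptors:

import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueStateDescriptor;

// TTL is configured per state: every descriptor that should expire needs the config.
ValueStateDescriptor<String> valueDesc = new ValueStateDescriptor<>("last-event", String.class);
ListStateDescriptor<Long> listDesc = new ListStateDescriptor<>("recent-ids", Long.class);
MapStateDescriptor<String, Long> mapDesc = new MapStateDescriptor<>("counters", String.class, Long.class);

valueDesc.enableTimeToLive(ttlConfig);
listDesc.enableTimeToLive(ttlConfig);
mapDesc.enableTimeToLive(ttlConfig);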

For more examples of State TTL usage, refer to the official test cases at https://github.com/apache/flink/tree/master/flink-end-to-end-tests/flink-stream-state-ttl-test/src/main/java/org/apache/flink/streaming/tests, which provide many cases for reference.

2. Parameter description of StateTtlConfig

(1) TTL : the expiration time of the state, an org.apache.flink.api.common.time.Time object. Once a TTL is set, the state is considered expired when the timestamp of the last access plus the TTL is no later than the current time (this is a simplified statement; for the rigorous definition see the expiration check in the org.apache.flink.runtime.state.ttl.TtlUtils class).
(2) UpdateType : when the state timestamp is refreshed, an enum. Disabled means the timestamp is never updated; OnCreateAndWrite means the timestamp is updated when the state is created and on every write; OnReadAndWrite means that, in addition to creation and writes, reads also update the state timestamp.
(3) StateVisibility : how state that has expired but has not yet been cleaned up is handled, also an enum. ReturnExpiredIfNotCleanedUp means that even if the timestamp indicates the state has expired, it is still returned to the caller as long as it has not actually been cleared; NeverReturnExpired means that once the state has expired it is never returned to the caller, only an empty value, which avoids interference from expired state.
(4) TimeCharacteristic and TtlTimeCharacteristic : the time mode that State TTL applies to, again an enum. The former is marked as deprecated, and new code is advised to use the newer TtlTimeCharacteristic parameter. As of Flink 1.8, only the ProcessingTime mode is supported, and State TTL support for EventTime mode is still under development.
(5) CleanupStrategies : the cleanup strategy for expired state. There are currently three enum values. FULL_STATE_SCAN_SNAPSHOT (corresponding to the EmptyCleanupStrategy class) means expired state is not cleaned up proactively: when a full snapshot (checkpoint) is taken, a smaller state file is produced, but the local state does not shrink; it only actually shrinks when the job is restarted and restored from the last snapshot, so the memory-pressure problem may remain unsolved. To address this, Flink also provides incremental cleanup values: INCREMENTAL_CLEANUP for the heap state backend (corresponding to the IncrementalCleanupStrategy class) and ROCKSDB_COMPACTION_FILTER for the RocksDB state backend (corresponding to the RocksdbCompactFilterCleanupStrategy class). For incremental cleanup, Flink can be configured to perform a cleanup pass every time a number of records are read, and you can specify how many expired entries to clean up each time. For RocksDB, state cleanup is implemented by calling a FlinkCompactionFilter written in C++ through JNI; under the hood, expired state is filtered out during RocksDB's background compaction. A configuration sketch follows this list.
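
For reference, here is a minimal sketch of configuring these cleanup strategies through the StateTtlConfig builder. The exact method signatures (for example the argument of cleanupInRocksdbCompactFilter) vary slightly between Flink versions, so treat this as illustrative rather than definitive:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;

// Heap state backend: incremental cleanup, checking 10 entries per state access,
// and additionally running a cleanup pass for every processed record.
StateTtlConfig heapTtlConfig = StateTtlConfig
    .newBuilder(Time.hours(1))
    .cleanupIncrementally(10, true)
    .build();

// RocksDB state backend: filter out expired entries during background compaction;
// the argument controls how often the current timestamp is re-queried (here every 1000 entries).
StateTtlConfig rocksDbTtlConfig = StateTtlConfig
    .newBuilder(Time.hours(1))
    .cleanupInRocksdbCompactFilter(1000)
    .build();

// Full-snapshot cleanup: expired entries are dropped when a full snapshot/savepoint is taken,
// but the local state does not shrink until the job is restored from that snapshot.
StateTtlConfig snapshotTtlConfig = StateTtlConfig
    .newBuilder(Time.hours(1))
    .cleanupFullSnapshot()
    .build();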

A streaming job is characterized by running 24/7: data must not be consumed twice or lost, each record should be computed exactly once, and results must be produced in real time without delay. But when the state is very large, memory is limited, an instance crashes, or the parallelism needs to be scaled, how do we make sure the state is managed correctly and the job runs correctly when it is re-executed? This is where state management becomes crucial, so let's analyze checkpoints next.

3. Checkpoint

The checkpoint mechanism is the cornerstone of Flink's reliability. It guarantees that when an operator fails for some reason (for example, an abnormal exit), the Flink cluster can restore the state of the whole application flow graph to a consistent state from before the failure. The principle of Flink's checkpoint mechanism comes from the Chandy-Lamport algorithm (distributed snapshots).

3.1. Checkpoint execution process in Flink

When an application that needs checkpoints starts, Flink's JobManager creates a CheckpointCoordinator for it, and the CheckpointCoordinator is solely responsible for taking snapshots of this application.
1. The CheckpointCoordinator periodically sends a barrier to all source operators of the streaming application.
2. When a source operator receives a barrier, it pauses data processing, takes a snapshot of its current state and saves it to the configured persistent storage, then reports the completed snapshot to the CheckpointCoordinator, broadcasts the barrier to all of its downstream operators, and resumes data processing.
3. When a downstream operator receives the barrier, it likewise pauses its own data processing, snapshots its own state and saves it to the configured persistent storage, then reports its snapshot to the CheckpointCoordinator, broadcasts the barrier to all of its downstream operators, and resumes data processing.
4. Each operator keeps snapshotting and broadcasting as in step 3 until the barrier reaches the sink operators and the snapshot is complete.
5. When the CheckpointCoordinator has received the reports of all operators, it considers this round's snapshot successful; otherwise, if it does not receive the reports of all operators within the configured time, it considers this round's snapshot failed.
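
As a quick reference, here is a minimal sketch of how checkpointing is typically enabled and tuned on the StreamExecutionEnvironment (the interval and timeout values are arbitrary examples):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Trigger a checkpoint every 60 seconds.
env.enableCheckpointing(60_000);

CheckpointConfig config = env.getCheckpointConfig();
// Exactly-once requires barrier alignment (see section 3.3).
config.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// A checkpoint that takes longer than 2 minutes is considered failed.
config.setCheckpointTimeout(120_000);
// Leave at least 500 ms between the end of one checkpoint and the start of the next.
config.setMinPauseBetweenCheckpoints(500);
// Do not allow overlapping checkpoints.
config.setMaxConcurrentCheckpoints(1);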

3.2. The checkpoint process with multiple parallel instances and multiple operators

(1) The JobManager sends a CheckpointTrigger to the Source Tasks, and each Source Task inserts a checkpoint barrier into the data stream.
(2) The Source Task takes a snapshot of itself and saves it to the state backend.
(3) The Source Task sends the barrier downstream along with the data stream.
(4) When a downstream operator instance receives the checkpoint barrier, it takes a snapshot of itself.

In the original figures (omitted here), there are 4 operator instances with state, and the corresponding state backend can be pictured as 4 cells to be filled in. The whole checkpoint process can be seen as each operator instance filling in its own cell: the operator instance writes its state to the corresponding cell in the state backend, and when all cells are filled, a complete checkpoint can be considered finished.

The above only describes the snapshot itself; the complete checkpoint execution process is as follows:

1. The CheckpointCoordinator of the JobManager sends a CheckpointTrigger to all Source Tasks, and each Source Task inserts a checkpoint barrier into the data stream.
2. When a task has received all barriers, it passes the barrier on to its own downstream, then takes its snapshot and writes its state asynchronously to persistent storage. An incremental checkpoint writes only the latest updates to the external storage. To let downstream tasks checkpoint as early as possible, the task first sends the barrier downstream and only then performs its own snapshot.
3. When a task has finished its backup, it notifies the CheckpointCoordinator of the JobManager of the address of the backup data (the state handle). If the duration of the checkpoint exceeds the configured checkpoint timeout and the CheckpointCoordinator has not yet collected all state handles, the CheckpointCoordinator considers this checkpoint failed and deletes all state data produced by it.
4. Finally, the CheckpointCoordinator packages all the state handles into a completed checkpoint metadata object and writes it to HDFS.
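
To make the incremental-checkpoint and persistent-storage pieces concrete, here is a minimal sketch. The HDFS path is an arbitrary example, and the RocksDBStateBackend constructor shown is the classic pre-1.13 API; it also declares an IOException, so the enclosing method needs to declare or handle it:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000);

// RocksDB state backend with incremental checkpoints enabled:
// each checkpoint uploads only the newly created SST files instead of the full state.
env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));

// Keep the completed checkpoint data even if the job is cancelled,
// so the job can later be restored from it.
env.getCheckpointConfig().enableExternalizedCheckpoints(
        CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);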

3.3. Barrier alignment

(Figure omitted: an operator with two input streams, a digit stream and a letter stream, receiving checkpoint barriers.)
What is barrier alignment?
(1) Once an operator receives checkpoint barrier n from one input stream, it cannot process any further records from that stream until it has also received barrier n from all of its other inputs. Otherwise, it would mix records belonging to snapshot n with records belonging to snapshot n + 1.

(2) Streams from which barrier n has already been received are temporarily put on hold: records received from these streams are not processed but placed into an input buffer.

(3) In the second picture referenced above, although the barrier of the digit stream has already arrived, the records 1, 2 and 3 that follow it can only be put into the buffer while waiting for the barrier of the letter stream to arrive.

(4) Once barrier n has been received from all input streams, the operator emits all pending output records in the buffer and then sends checkpoint barrier n downstream.

At this point the operator also takes a snapshot of itself. Afterwards, the operator continues processing records from all input streams, handling the records in the input buffer before processing new records from the streams.

What is barrier misalignment?
In the figure referenced above, when barriers from some input streams have not yet arrived, the records 1, 2 and 3 that arrive after the received barrier are placed in the buffer and only processed once the barriers of the other streams have arrived; that is alignment. Not aligning the barriers means that, to avoid hurting performance while some barriers have not yet arrived, the operator does not bother buffering and simply keeps processing the records that follow the barrier; the operator is only checkpointed once the barriers of all streams have arrived.

Why is barrier alignment necessary? Can barriers be left unaligned?
For exactly-once semantics, barriers must be aligned; if barriers are not aligned, the guarantee degrades to at-least-once.

Barrier alignment is really the process of aligning the data of multiple upstream input streams. The implication is that if an operator instance has only one input stream, there is no barrier alignment at all; it is trivially always aligned with itself.
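
If at-least-once is acceptable for a job, the alignment requirement can be relaxed by choosing the corresponding checkpointing mode; a minimal sketch (the interval is an arbitrary example):

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// AT_LEAST_ONCE: operators do not block on barrier alignment, which lowers latency
// under backpressure but allows records to be replayed (and thus processed twice) on recovery.
env.enableCheckpointing(60_000, CheckpointingMode.AT_LEAST_ONCE);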

The following are questions the author has run into during development.

(1) Kafka has only one partition; does it make any difference whether the job is exactly-once or at-least-once?
Answer: If there is only one partition, the parallelism of the source task of the corresponding Flink job can only be 1. In that case there is indeed no difference: at-least-once behaviour cannot occur, and the result is effectively exactly-once. Duplicate processing can only happen when barriers are not aligned, and with parallelism 1 the barriers are aligned by default. Only when the upstream has multiple parallel instances do the barriers they send downstream need to be aligned; a single degree of parallelism cannot cause barrier misalignment, so the result must be exactly-once. The key point is still that aligned barriers give exactly-once with no duplicate consumption, while unaligned barriers give at-least-once and records may be consumed repeatedly. With only a single degree of parallelism there can be no barrier misalignment, so at-least-once semantics cannot arise.

(2) To let downstream tasks checkpoint as early as possible, the barrier is sent downstream first and the task then takes its own snapshot. In this step, what if the task's own snapshot is slow after the barrier has been sent down, and the downstream has already finished its snapshot while the upstream has not?
Answer: It can happen that the downstream snapshot completes earlier than the upstream snapshot, but that does not affect the snapshot result; it just means the downstream snapshot is more timely. As long as the downstream has processed all the data before the barrier and none of the data after the barrier before it takes its snapshot, the downstream still supports exactly-once. Do not think about this problem only from a global perspective: if you look at the upstream and the downstream separately, you will find that the state of each is exact, with nothing lost and nothing counted twice. One thing to note here: if any operator's checkpoint fails, or times out (which also counts as a failure), the JobManager considers the whole checkpoint failed, and a failed checkpoint cannot be used to restore the job. Only if the checkpoints of all operators succeed is this checkpoint considered successful and usable for restoring the job.

(3) The checkpoint semantics of my Flink program is set to exactly-once, but the data in my MySQL table is duplicated? The program checkpoints once every minute, but writes data to MySQL and commits every 5 seconds.
Answer: For end-to-end exactly-once, Flink requires the sink to implement two-phase commit, for example via TwoPhaseCommitSinkFunction. If chk-100 succeeds and then, 30 seconds later (committing every 5 seconds, so 6 batches of data have already been written to MySQL), the program suddenly crashes and is restored from chk-100, the six batches that were already committed will be written again, hence the duplicate consumption. Flink's exactly-once comes in two flavours: exactly-once inside Flink and end-to-end exactly-once. This post describes Flink's internal exactly-once; a later post will explain in detail how Flink achieves end-to-end exactly-once.
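
For context, a minimal, illustrative skeleton of a sink built on TwoPhaseCommitSinkFunction; the class name MySqlBufferingTwoPhaseSink, the table demo and the JDBC URL are assumptions for this sketch, not an official example. It buffers rows in the transaction object and writes them only in commit(); a production sink would typically keep a real database transaction open across preCommit/commit instead:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.typeutils.base.ListSerializer;
import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

public class MySqlBufferingTwoPhaseSink
        extends TwoPhaseCommitSinkFunction<String, List<String>, Void> {

    public MySqlBufferingTwoPhaseSink() {
        // Serializers for the transaction object (a list of buffered rows) and the unused context.
        super(new ListSerializer<>(StringSerializer.INSTANCE), VoidSerializer.INSTANCE);
    }

    @Override
    protected List<String> beginTransaction() {
        // A new "transaction": rows written between two checkpoints are buffered here.
        return new ArrayList<>();
    }

    @Override
    protected void invoke(List<String> transaction, String value, Context context) {
        // Called for every record; nothing is visible in MySQL yet.
        transaction.add(value);
    }

    @Override
    protected void preCommit(List<String> transaction) {
        // Called during the checkpoint: flush anything still pending.
        // Nothing to do here because all rows are already in the buffer.
    }

    @Override
    protected void commit(List<String> transaction) {
        // Called only after the checkpoint is confirmed complete: make the writes visible.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/test", "user", "password")) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement("INSERT INTO demo(line) VALUES (?)")) {
                for (String row : transaction) {
                    ps.setString(1, row);
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            conn.commit();
        } catch (Exception e) {
            throw new RuntimeException("commit failed", e);
        }
    }

    @Override
    protected void abort(List<String> transaction) {
        // Called when the checkpoint fails: simply drop the buffered rows.
        transaction.clear();
    }
}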

Origin blog.csdn.net/qq_44962429/article/details/107689051