Flink Stream-Batch Unified Computing (11): TableEnvironment of the PyFlink Table API

Table of contents

Overview

Set the restart strategy

What is Flink's restart strategy (Restart Strategy)?

Flink's restart strategy (Restart Strategy) in practice

Flink's restart strategies

FixedDelayRestartStrategy (fixed-delay restart strategy)

FailureRateRestartStrategy (failure-rate restart strategy)

NoRestartStrategy (no-restart strategy)

Configure State Backends and Checkpointing

Checkpoint

Enable and configure

Select a state backend

MemoryStateBackend

FsStateBackend

RocksDBStateBackend

State backend comparison

Overview

The first step in writing a Flink Python Table API program is to create a TableEnvironment. This is the entry class for Python Table API jobs.

get_config() returns the TableConfig, through which the runtime behavior of the Table API can be configured.

from pyflink.table import EnvironmentSettings, TableEnvironment

table_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
table_env.get_config().get_configuration().set_string("parallelism.default", "1")
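If a batch job is needed instead, the same entry point can be used with batch settings; a minimal sketch, assuming PyFlink's EnvironmentSettings.in_batch_mode() helper:

# Create a TableEnvironment for batch jobs instead of streaming jobs
batch_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())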

Set the restart strategy

Restart strategies are configured in the TableConfig by setting key-value options.

What is Flink's restart strategy (Restart Strategy)?

A restart strategy (RestartStrategy) determines how Flink reacts when a Job or Task fails because of unpredictable problems such as machine failures or bugs in the code: according to the configured strategy, the Job or the affected Tasks are pulled up and re-executed so that the job returns to normal execution. The restart strategy decides whether the Job or Task is restarted, how many restart attempts are made, and how long to wait between attempts.

Flink's restart strategy (Restart Strategy) in practice

The purpose of Flink's restart strategy is to improve the robustness and fault tolerance of a job and to ensure that it keeps producing data in real time.

Which restart strategy to use depends largely on the company's data-processing requirements: different strategies are chosen for different business needs.

In practice the problems described above are quite common. For example, data issues (non-standard values, nulls, and so on) can trigger all kinds of exceptions while processing dirty data, such as null pointer exceptions, array index out of bounds, or data type conversion errors.

You might argue that simply filtering out such dirty data, or catching the exceptions, avoids the problem of the Job restarting continuously.

Nevertheless, in daily development we should do our best to make the code robust, but we should also configure a restart strategy for the Flink Job.

Flink's restart strategies

The default restart strategy is specified in Flink's flink-conf.yaml; the configuration parameter restart-strategy defines which strategy is used.

If checkpointing is not enabled, the no-restart strategy is used. If checkpointing is enabled but no restart strategy has been configured, the fixed-delay strategy is used with Integer.MAX_VALUE restart attempts.

The restart-strategy parameter accepts the following values:

fixed-delay (fixed delay)

failure-rate (failure rate)

none (no restart)

FixedDelayRestartStrategy (fixed-delay restart strategy)

FixedDelayRestartStrategy is the fixed-delay restart strategy. Flink tries to restart the job up to the number of times configured in the cluster configuration file or in the program; if that maximum number of attempts is exceeded and the job still has not started successfully, the job fails permanently. Between two consecutive restart attempts, the strategy waits for a fixed delay.

It can be configured as follows in flink-conf.yaml:

restart-strategy: fixed-delay

# Maximum number of restart attempts. Default: Integer.MAX_VALUE if checkpointing is enabled, otherwise 1.
restart-strategy.fixed-delay.attempts: 3

# Time to wait between two consecutive restart attempts, e.g. "1 min". Delaying the restart can be
# useful when the program interacts with external systems. Default: 10 s if checkpointing is enabled,
# otherwise the value of akka.ask.timeout.
restart-strategy.fixed-delay.delay: 10 s

In the Python program, the restart strategy is set to "fixed-delay" as follows:

table_env.get_config().get_configuration().set_string("restart-strategy", "fixed-delay")
table_env.get_config().get_configuration().set_string("restart-strategy.fixed-delay.attempts", "3")
table_env.get_config().get_configuration().set_string("restart-strategy.fixed-delay.delay", "30s")

FailureRateRestartStrategy (failure-rate restart strategy)

FailureRateRestartStrategy is the failure-rate restart strategy: the job is restarted after a failure, but if the number of failures within a fixed time interval exceeds the configured limit, the job fails and stops. This strategy also supports a wait time between two consecutive restart attempts.

It can be configured as follows in flink-conf.yaml:

restart-strategy: failure-rate

# Maximum number of failures allowed within the fixed time interval. Default: 1.
restart-strategy.failure-rate.max-failures-per-interval:

# Length of the fixed time interval. Default: 1 min.
restart-strategy.failure-rate.failure-rate-interval: 5 min

# Delay between two consecutive restart attempts. Default: the value of akka.ask.timeout.
restart-strategy.failure-rate.delay: 10 s

In the Python program, the restart strategy is set to "failure-rate" as follows:

table_env.get_config().get_configuration().set_string("restart-strategy", "failure-rate")
table_env.get_config().get_configuration().set_string("restart-strategy.failure-rate.delay", "1s")
table_env.get_config().get_configuration().set_string("restart-strategy.failure-rate.failure-rate-interval", "1 min")
table_env.get_config().get_configuration().set_string("restart-strategy.failure-rate.max-failures-per-interval", "1")

NoRestartStrategy (no-restart strategy)

With NoRestartStrategy the job is not restarted at all: on failure it simply fails and stops. The configuration in flink-conf.yaml is as follows:

restart-strategy: none
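The same strategy can also be set from the Python Table API, mirroring the examples above; a minimal sketch:

# Disable restarts entirely: the job fails permanently on the first failure
table_env.get_config().get_configuration().set_string("restart-strategy", "none")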

Configure State Backends and Checkpointing

Checkpoint

To make state fault-tolerant, Flink needs to checkpoint it.

Checkpoints allow Flink to restore both the state and the positions in the input streams, giving the program the same semantics as a failure-free execution.

Prerequisites for using checkpoints:

A persistent data source that can replay records within a certain time range, for example a persistent message queue (Apache Kafka, RabbitMQ, Amazon Kinesis, Google PubSub, ...) or a file system (HDFS, S3, GFS, NFS, Ceph, ...).

A persistent storage system for state, usually a distributed file system (HDFS, S3, GFS, NFS, Ceph, ...).

Enable and configure

Checkpointing is not enabled by default. In the DataStream API it is enabled by calling enableCheckpointing(n) on the StreamExecutionEnvironment, where n is the checkpoint interval in milliseconds; in the Table API the same behavior is controlled through configuration options, as shown below.
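For reference, a minimal sketch of enabling checkpointing from PyFlink's DataStream API, which the enableCheckpointing(n) call above refers to; the Table API examples that follow achieve the same through configuration options:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Take a checkpoint every 1000 ms
env.enable_checkpointing(1000)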

Checkpoint configuration items include:

# Set the checkpointing mode to EXACTLY_ONCE.
# Checkpointing supports two modes: exactly-once and at-least-once.
# For most applications EXACTLY_ONCE is preferred; at-least-once may be used by applications
# that require ultra-low latency (a few milliseconds).
table_env.get_config().get_configuration().set_string("execution.checkpointing.mode", "EXACTLY_ONCE")

# Checkpoint interval: how often a new checkpoint is triggered, here every 3 minutes.
table_env.get_config().get_configuration().set_string("execution.checkpointing.interval", "3min")

# Checkpoint timeout: if a checkpoint does not complete within this time, the in-progress checkpoint is aborted.
table_env.get_config().get_configuration().set_string("execution.checkpointing.timeout", "10min")

# Checkpoint concurrency: the maximum number of checkpoints that may be in progress at the same time.
# When this limit is reached, one of them must complete before a new one can be started.
table_env.get_config().get_configuration().set_string("execution.checkpointing.max-concurrent-checkpoints", "2")
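A minimum pause between checkpoints can also be configured, so that the next checkpoint only starts some time after the previous one has completed; a minimal sketch, assuming the execution.checkpointing.min-pause option of recent Flink versions:

# Start the next checkpoint at least 5 seconds after the previous one completes
table_env.get_config().get_configuration().set_string("execution.checkpointing.min-pause", "5s")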

Select a state backend

The storage location of Checkpoint depends on the configured State backend (JobManager memory, file system, database...).

By default, State is stored in TaskManager memory and Checkpoint is stored in JobManager memory.

Flink comes with the following out-of-the-box state backends:

MemoryStateBackend

FsStateBackend

RocksDBStateBackend

If nothing is configured, the system uses MemoryStateBackend by default.

MemoryStateBackend

With MemoryStateBackend, a snapshot of the state is taken during a checkpoint and the snapshot data is sent to the JobManager as part of the checkpoint acknowledgement; the snapshot is then stored in the JobManager's heap memory.
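A minimal sketch of selecting this backend explicitly from the Table API, using the "jobmanager" value mentioned in the configuration comments at the end of this section:

# Keep checkpoints in JobManager memory (the default when nothing is configured)
table_env.get_config().get_configuration().set_string("state.backend", "jobmanager")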

FsStateBackend

FsStateBackend needs to be configured with a file system URL, such as "hdfs://namenode:40010/flink/checkpoint" or "file:///data/flink/checkpoints".

FsStateBackend holds the data being processed in the TaskManager's memory.

During Checkpoint, the state snapshot is written to a file in the file system directory, and the path of the file is passed to the JobManager and stored in its memory.

FsStateBackend writes snapshots asynchronously by default, to avoid blocking the processing pipeline while state snapshots are written. If you want to disable asynchronous snapshots, you can pass the corresponding flag in the FsStateBackend constructor.
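A minimal sketch of selecting FsStateBackend from the Table API by configuration, using the "filesystem" value and one of the example checkpoint URLs above:

# Keep working state in TaskManager memory and write checkpoints to a file system directory
table_env.get_config().get_configuration().set_string("state.backend", "filesystem")
table_env.get_config().get_configuration().set_string("state.checkpoints.dir", "hdfs://namenode:40010/flink/checkpoint")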

RocksDBStateBackend

RocksDBStateBackend needs to be configured with a file system URL, such as "hdfs://namenode:40010/flink/checkpoint" or "file:///data/flink/checkpoints".

RocksDBStateBackend holds in-flight data in a RocksDB database located under the TaskManager's data directory.

During a checkpoint, the entire RocksDB database is written to a file in the configured file system directory, and the path of that file is passed to the JobManager and stored in its memory.

RocksDBStateBackend also takes snapshots asynchronously. It is currently the only backend that supports incremental checkpoints.

The amount of state that can be kept with RocksDB is limited only by the available disk space. The trade-off is lower maximum throughput: every read and write goes through RocksDB and must serialize/deserialize the state to store or retrieve it, which is more expensive than working with state held on the heap.

State backend comparison

StateBackend           In-flight state    Checkpoint    Throughput    Recommended usage scenarios
MemoryStateBackend     TM memory          JM memory     High          Debugging; stateless jobs; no requirements on data loss or duplication
FsStateBackend         TM memory          FS/HDFS       High          Normal state, windows, KV structures
RocksDBStateBackend    RocksDB on TM      FS/HDFS       Low           Very large state, very long windows, large KV structures

# Set the state backend type to "rocksdb"; other options include "filesystem" and "jobmanager".
# This property can also be set to the full class name of a StateBackendFactory,
# e.g. org.apache.flink.contrib.streaming.state.RocksDBStateBackendFactory
table_env.get_config().get_configuration().set_string("state.backend", "rocksdb")

# Set the checkpoint directory required by the RocksDB state backend
table_env.get_config().get_configuration().set_string("state.checkpoints.dir", "file:///tmp/checkpoints/")
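Since RocksDBStateBackend is currently the only backend that supports incremental checkpoints (as noted above), they can be switched on with the state.backend.incremental option; a minimal sketch:

# Take incremental checkpoints instead of full snapshots (RocksDB state backend only)
table_env.get_config().get_configuration().set_string("state.backend.incremental", "true")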

Origin blog.csdn.net/victory0508/article/details/131554234