Flink Stream-Batch Integrated Computing (19): State in the PyFlink DataStream API

Table of contents

Keyed State

Keyed DataStream

Using Keyed State

Implementing a simple count window

State time-to-live (TTL)

Cleaning up expired data

Cleanup during full snapshots

Incremental cleanup

Cleanup during RocksDB compaction

Operator State

Broadcast State


Keyed State

Keyed DataStream

To use keyed state, you first need to specify a key on the DataStream. This key is used to partition the state (and also the records in the stream themselves).

Use keyBy(KeySelector) in the Java/Scala API or key_by(KeySelector) in the Python API on a DataStream to specify the key. This produces a KeyedStream, which in turn allows operations on keyed state.

The KeySelector function takes a single record as input and returns the key for that record. The key can be of any type, but it must be computed deterministically.

Flink's data model is not based on key-value pairs, so there is no need to physically pack the data set into keys and values. Keys are "virtual": they are defined as functions over the actual data so that the grouping operators can use them.
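To make the "virtual key" concrete, here is a minimal sketch (the sample data and the lambda are illustrative, not from the original article) of specifying a key with key_by in PyFlink:

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# Hypothetical sample data: the first field of each tuple serves as the key.
ds = env.from_collection(
    [('a', 1), ('b', 2), ('a', 3)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]))

# The KeySelector is just a function over the record; key_by produces a
# KeyedStream on which keyed state can be used.
keyed = ds.key_by(lambda record: record[0])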

Using Keyed State

The keyed state interfaces provide access to different types of state, all of which are scoped to the key of the current input record.

In other words, these states can only be used on a KeyedStream, which is obtained via stream.keyBy(...) in the Java/Scala API or stream.key_by(...) in the Python API.

The supported state types are as follows:

ValueState<T>: Keeps a single value that can be updated and retrieved.

ListState<T>: Keeps a list of elements. You can append elements to the list and retrieve the current list.

ReducingState<T>: Keeps a single value that represents the aggregation of all values added to the state. Elements added via add(T) are combined using the provided ReduceFunction.

AggregatingState<IN, OUT>: Keeps a single value that represents the aggregation of all values added to the state. Elements added via add(IN) are aggregated using the specified AggregateFunction.

MapState<UK, UV>: Keeps a map of key-value pairs. You can add entries to the state and retrieve an iterator over all current mappings.

All state types also have a clear() method, which clears the state for the current key, i.e. the key of the current input element. A sketch of ListState and MapState usage follows below.
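The count window example below uses ValueState; as a hedged sketch (class, field, and state names here are illustrative, not from the original article), ListState and MapState handles are obtained the same way, via descriptors in open():

from pyflink.common.typeinfo import Types
from pyflink.datastream import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ListStateDescriptor, MapStateDescriptor

class StateTypesSketch(KeyedProcessFunction):

    def open(self, runtime_context: RuntimeContext):
        # list state: one list of strings per key
        self.events = runtime_context.get_list_state(
            ListStateDescriptor("events", Types.STRING()))
        # map state: one string -> count map per key
        self.counts = runtime_context.get_map_state(
            MapStateDescriptor("counts", Types.STRING(), Types.LONG()))

    def process_element(self, value, ctx):
        # value is assumed to be a (key, word) tuple; both states are
        # implicitly scoped to the key of the current element
        self.events.add(value[1])
        current = self.counts.get(value[1]) if self.counts.contains(value[1]) else 0
        self.counts.put(value[1], current + 1)
        yield value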

Implementing a simple count window

The example below uses the first element of the tuple as the key. The function stores the count and the running sum in a ValueState.

Once the count reaches 2, it emits the average downstream and clears the state, so counting starts over. Note that a separate value is kept for each distinct key (the first element of the tuple).

You must create a StateDescriptor to obtain a state handle. It holds the state name, the type of the values the state holds, and possibly user-specified functions such as a ReduceFunction.

Depending on the state type, you can create a ValueStateDescriptor, ListStateDescriptor, AggregatingStateDescriptor, ReducingStateDescriptor or MapStateDescriptor.

State is accessed through the RuntimeContext and therefore can only be used within rich functions.

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment, FlatMapFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class CountWindowAverage(FlatMapFunction):
    def __init__(self):
        self.sum = None

    def open(self, runtime_context: RuntimeContext):
        descriptor = ValueStateDescriptor(
            "average",  # the state name
            Types.PICKLED_BYTE_ARRAY()  # type information
        )
        self.sum = runtime_context.get_state(descriptor)

    def flat_map(self, value):
        # access the state value
        current_sum = self.sum.value()
        if current_sum is None:
            current_sum = (0, 0)
        # update the count
        current_sum = (current_sum[0] + 1, current_sum[1] + value[1])
        # update the state
        self.sum.update(current_sum)
        # if the count reaches 2, emit the average and clear the state
        if current_sum[0] >= 2:
            self.sum.clear()
            yield value[0], int(current_sum[1] / current_sum[0])


env = StreamExecutionEnvironment.get_execution_environment()
env.from_collection([(1, 3), (1, 5), (1, 7), (1, 4), (1, 2)]) \
    .key_by(lambda row: row[0]) \
    .flat_map(CountWindowAverage()) \
    .print()

env.execute()
# the printed output will be (1,4) and (1,5)

State time-to-live (TTL)

A time-to-live (TTL) can be assigned to keyed state of any type. If a TTL is configured and a state value has expired, the value is cleaned up on a best-effort basis. All state collection types support per-entry TTLs, which means that list elements and map entries expire independently.

Before using state TTL, you first need to build a StateTtlConfig configuration object. The TTL functionality is then enabled by passing this configuration to a state descriptor:

from pyflink.common.time import Time
from pyflink.common.typeinfo import Types
from pyflink.datastream.state import ValueStateDescriptor, StateTtlConfig

ttl_config = StateTtlConfig \
  .new_builder(Time.seconds(1)) \
  .set_update_type(StateTtlConfig.UpdateType.OnCreateAndWrite) \
  .set_state_visibility(StateTtlConfig.StateVisibility.NeverReturnExpired) \
  .build()

state_descriptor = ValueStateDescriptor("text state", Types.STRING())
state_descriptor.enable_time_to_live(ttl_config)

TTL configuration has the following options:

The first parameter of new_builder is the time-to-live of the data and is required.

TTL update type, i.e. when the TTL is refreshed (the default is OnCreateAndWrite):

StateTtlConfig.UpdateType.OnCreateAndWrite - refresh only on creation and write access

StateTtlConfig.UpdateType.OnReadAndWrite - also refresh on read access

State visibility configures whether expired values that have not yet been cleaned up are returned on read access (the default is NeverReturnExpired):

StateTtlConfig.StateVisibility.NeverReturnExpired - expired data is never returned

    (Note: in a PyFlink job this invalidates both the read and the write cache of the state, which causes some performance loss.)

    With NeverReturnExpired, expired data behaves as if it no longer exists, regardless of whether it has been physically removed. This is useful when expired data must no longer be accessible, for example with sensitive data.

StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp - expired data that has not yet been cleaned up is returned

    (Note: in a PyFlink job this invalidates the read cache of the state, which causes some performance loss.)

    ReturnExpiredIfNotCleanedUp returns expired values until they have been physically removed. A configuration sketch combining these non-default options is shown below.
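As a minimal, hedged sketch (the TTL duration is illustrative only), the two non-default options above can be combined in a single configuration:

from pyflink.common.time import Time
from pyflink.datastream.state import StateTtlConfig

# Illustrative sketch: refresh the TTL on reads as well as writes, and allow
# expired values to be returned until they have been cleaned up.
ttl_config = StateTtlConfig \
  .new_builder(Time.seconds(1)) \
  .set_update_type(StateTtlConfig.UpdateType.OnReadAndWrite) \
  .set_state_visibility(StateTtlConfig.StateVisibility.ReturnExpiredIfNotCleanedUp) \
  .build()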

Cleaning up expired data

By default, expired values are removed when they are read (for example via ValueState#value), and are periodically cleaned up in the background if the state backend supports it. Background cleanup can be disabled through StateTtlConfig:

from pyflink.common.time import Time
from pyflink.datastream.state import StateTtlConfig

ttl_config = StateTtlConfig \
  .new_builder(Time.seconds(1)) \
  .disable_cleanup_in_background() \
  .build()

A more fine-grained background cleanup strategy can also be configured. In the current implementation, the heap state backend relies on incremental cleanup, while the RocksDB state backend uses its compaction filter for background cleanup.

Cleanup during full snapshots

You can enable a strategy that cleans up expired state when a full snapshot is taken, which reduces the size of the snapshot. The current implementation does not clean up local state, but expired data that was dropped from the last snapshot will not be restored when restoring from it.

This strategy is configured through StateTtlConfig. It has no effect when the RocksDB state backend is used with incremental checkpoints.

from pyflink.common.time import Time
from pyflink.datastream.state import StateTtlConfig

ttl_config = StateTtlConfig \
  .new_builder(Time.seconds(1)) \
  .cleanup_full_snapshot() \
  .build()

This cleanup strategy can be enabled or disabled at any time through StateTtlConfig, for example when restoring from a savepoint.

Incremental cleanup

Currently, only the heap state backend supports incremental cleanup.

Incremental cleanup of state data is performed during state access and/or record processing. If no state is accessed and no records are processed, expired data is not cleaned up. Incremental cleanup adds to the record processing time.

If this cleanup strategy is enabled for a state, the storage backend keeps a lazy global iterator over all entries of that state.

Every time incremental cleanup is triggered, expired entries obtained from the iterator are removed.

from pyflink.common.time import Time
from pyflink.datastream.state import StateTtlConfig

ttl_config = StateTtlConfig \
  .new_builder(Time.seconds(1)) \
  .cleanup_incrementally(10, True) \
  .build()

This strategy takes two parameters. The first is the number of state entries checked on each cleanup run; a run is triggered on every state access.

The second parameter indicates whether cleanup is additionally triggered for every processed record. By default, the heap backend checks 5 entries and does not trigger cleanup per record.

Cleanup during RocksDB compaction

If the RocksDB state backend is used, Flink's custom compaction filter for RocksDB is enabled. RocksDB periodically merges and compacts data in the background to reduce storage space, and Flink's compaction filter drops expired state entries during this compaction.

from pyflink.common.time import Time
from pyflink.datastream.state import StateTtlConfig

ttl_config = StateTtlConfig \
  .new_builder(Time.seconds(1)) \
  .cleanup_in_rocksdb_compact_filter(1000) \
  .build()

After processing a certain number of state entries, Flink uses the current timestamp to check whether the entries in RocksDB have expired. The number of entries can be specified via StateTtlConfig.new_builder(...).cleanup_in_rocksdb_compact_filter(query_time_after_num_entries) (cleanupInRocksdbCompactFilter in the Java API).

Updating the timestamp more frequently cleans up expired state more promptly, but since the compaction filter is called from JNI, it also lowers overall compaction performance.

By default, the RocksDB backend queries the current timestamp after every 1000 processed entries.

Notice:

    Invoking the TTL filter during compaction slows compaction down. The TTL filter has to parse the timestamp of the last access and check expiration for every state entry taking part in the compaction. For collection state types (lists and maps), every element of the collection is checked.

    For list state whose elements do not have a fixed serialized length, the TTL filter additionally has to call Flink's Java serializer on every JNI call to determine the offset of the next unexpired element.

    For existing jobs, this cleanup strategy can be enabled or disabled through StateTtlConfig at any time, for example after restarting from a savepoint.

Operator State

Operator state is not yet supported in the Python DataStream API.

Operator state (or non-keyed state) is state bound to a parallel operator instance. The Kafka connector is a good motivating example of operator state in Flink.

Each parallel instance of the Kafka consumer maintains a map of topic partitions and offsets as its operator state.

When the parallelism changes, operator state supports redistributing the state across the parallel operator instances. There are several different schemes for this redistribution.

In a typical stateful Flink application you do not need operator state. It is mostly a special kind of state used to implement sources and sinks, and for scenarios where the state cannot be partitioned by a key.

Broadcast State

Broadcast state is not yet supported in the Python DataStream API.

Broadcast state is a special kind of operator state. It was introduced to support use cases where the records of one stream need to be broadcast to all downstream tasks, where they are used to keep all subtasks in the same state.

The broadcast state can then be accessed while processing the records of a second stream. A natural example is a low-throughput stream containing rules that are applied to the elements of another stream.

With use cases like the above in mind, broadcast state differs from other operator state in that:

it has a map format,

it is only available to specific operators whose inputs are a broadcast stream and a non-broadcast stream, and

such operators can have multiple broadcast states with different names.


Origin blog.csdn.net/victory0508/article/details/132538882