Flink Unified Stream and Batch Computing (17): StreamExecutionEnvironment of the PyFlink DataStream API

Table of contents

StreamExecutionEnvironment

Watermark

Introduction to watermark strategy

Use the Watermark strategy

Built-in watermark generator

Handle idle data sources

How operators handle Watermark

Ways to create a DataStream

Created from list object

Created using DataStream connectors

Created using Table & SQL connectors


StreamExecutionEnvironment

To write a Flink Python DataStream API program, you first need to declare an execution environment, the StreamExecutionEnvironment, which is the context in which the streaming program executes.

You will use it to set job properties (such as the default parallelism and the restart strategy), create your sources, and finally trigger execution of the job.

from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

env = StreamExecutionEnvironment.get_execution_environment()
env.set_runtime_mode(RuntimeExecutionMode.BATCH)
env.set_parallelism(1)

After creating a StreamExecutionEnvironment, you can use it to declare data sources. Data sources pull data from external systems (such as Apache Kafka, RabbitMQ, or Apache Pulsar) into a Flink job.

For simplicity, this tutorial reads a file as the data source.

from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream.connectors import FileSource, StreamFormat

# input_path is the path of the file or directory to read
ds = env.from_source(
    source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),
                                               input_path)
                     .process_static_file_set().build(),
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name="file_source"
)
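To turn this into a runnable job, the stream still needs a sink and a call to execute. A minimal sketch, assuming ds and env are defined as above (the print sink and the job name are only illustrative):

# Print each line of the file and submit the job.
ds.print()
env.execute("file_source_job")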

Watermark

In most cases, the data arriving at an operator is in the order in which the events were generated. However, due to the network, distribution, and other factors, out-of-order data can occur. Out-of-order means that the order in which Flink receives events is not strictly the order of their event time.

To deal with out-of-order data, Flink introduces watermarks. The watermark mechanism waits a period of time for late data and triggers the computation once that time is up. Data that arrives later than that is usually discarded or handled separately.

To use event time semantics, a Flink application needs to know which field carries the event timestamp, which means every element in the data stream needs an assignable event timestamp. The timestamp is usually accessed or extracted from one of the element's fields using the TimestampAssigner API.

Introduction to watermark strategy

Timestamp assignment goes hand in hand with watermark generation, which tells the Flink application how event time progresses. The way watermarks are generated can be configured by specifying a WatermarkGenerator.

When using the Flink API, you need to set up a WatermarkStrategy, which contains both a TimestampAssigner and a WatermarkGenerator. The WatermarkStrategy class also provides many commonly used watermark strategies out of the box, and users can build their own watermark strategies for scenarios where that is necessary.

Use the Watermark strategy

A WatermarkStrategy can be used in two places in a Flink application: directly on the data source, or directly after an operation that is not a data source.

The first option is preferable, because it lets the source exploit knowledge about shards/partitions/splits in its watermarking logic. Sources can then usually track watermarks at a finer granularity, and the watermarks produced are more accurate overall. Specifying the WatermarkStrategy directly on the source means you must use a source interface that supports it.

The second way (setting the WatermarkStrategy after an arbitrary transformation) should only be used if the strategy cannot be set directly on the data source, as sketched below.
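A minimal sketch of the second approach, assuming ds is an existing DataStream produced by earlier operations; the 5-second out-of-orderness bound is an illustrative choice:

from pyflink.common import Duration
from pyflink.common.watermark_strategy import WatermarkStrategy

# Apply the watermark strategy after a (non-source) operation.
ds_with_watermarks = ds.assign_timestamps_and_watermarks(
    WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5)))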

Built-in watermark generator

Watermark strategies define how watermarks are generated in stream sources. A WatermarkStrategy is the generator/factory for the WatermarkGenerator that produces watermarks and for the TimestampAssigner that assigns the records' internal timestamps.

for_bounded_out_of_orderness(Duration) is a common built-in way to create a WatermarkStrategy.

for_bounded_out_of_orderness(max_out_of_orderness: pyflink.common.time.Duration) creates a watermark strategy for records that arrive out of order, with an upper bound on how far out of order events may be.

An out-of-order bound B means that once an event with timestamp T has been seen, no events older than (T - B) will arrive any more.

WatermarkStrategy.for_bounded_out_of_orderness(Duration.of_seconds(5))

for_monotonous_timestamps() creates a watermark strategy for timestamps that increase monotonically.

Watermarks are generated periodically and closely track the latest timestamp seen in the data. The delay introduced by this strategy is mainly the periodic interval at which watermarks are generated.

WatermarkStrategy.for_monotonous_timestamps()

with_timestamp_assigner(timestamp_assigner: pyflink.common.watermark_strategy.TimestampAssigner)

Creates a new WatermarkStrategy that wraps the given TimestampAssigner, an implementation of the TimestampAssigner interface.

Parameters: timestamp_assigner – the given TimestampAssigner.

Returns: a WatermarkStrategy wrapping the TimestampAssigner.

watermark_strategy = WatermarkStrategy.for_monotonous_timestamps() \
    .with_timestamp_assigner(MyTimestampAssigner())
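A minimal sketch of what MyTimestampAssigner could look like; it assumes the event timestamp (in milliseconds) is stored in the first field of each record, which is only an illustrative choice:

from pyflink.common.watermark_strategy import TimestampAssigner, WatermarkStrategy

class MyTimestampAssigner(TimestampAssigner):
    # Assumption: the first field of each record holds the event time in milliseconds.
    def extract_timestamp(self, value, record_timestamp) -> int:
        return int(value[0])

watermark_strategy = WatermarkStrategy.for_monotonous_timestamps() \
    .with_timestamp_assigner(MyTimestampAssigner())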

Handle idle data sources

If a partition/shard of the data source does not send event data for a while, the WatermarkGenerator does not get any new data from which to generate watermarks. We call such a data source an idle input or an idle source. This becomes a problem when other partitions are still sending event data: since a downstream operator's watermark is the minimum of the watermarks of all its upstream parallel sources, its watermark will not advance.

To solve this problem, you can use a WatermarkStrategy that detects idle inputs and marks them as idle. WatermarkStrategy provides the with_idleness(Duration.of_minutes(1)) helper for this purpose.

with_idleness(idle_timeout: pyflink.common.time.Duration)

Creates a new enriched WatermarkStrategy that also performs idle detection in the created WatermarkGenerator.

Parameters: idle_timeout – the idle timeout.

Returns: a new watermark strategy configured with idle detection.
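A hedged usage sketch combining an out-of-orderness bound with idle detection; the 20-second bound is an illustrative value, and the one-minute idle timeout mirrors the example mentioned above:

from pyflink.common import Duration
from pyflink.common.watermark_strategy import WatermarkStrategy

# Partitions that are silent for one minute are marked idle so they
# no longer hold back the downstream watermark.
watermark_strategy = WatermarkStrategy \
    .for_bounded_out_of_orderness(Duration.of_seconds(20)) \
    .with_idleness(Duration.of_minutes(1))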

How operators handle Watermark

Normally, before forwarding a watermark downstream, an operator needs to completely process the events that the watermark triggers. For example, a WindowOperator first evaluates all windows triggered by the watermark, and only forwards the watermark downstream once all output produced by that evaluation has been emitted. In other words, all elements produced as a result of a watermark are emitted before the watermark itself.

The same rule applies to a TwoInputStreamOperator. In this case, however, the operator's current watermark is the minimum of the watermarks of its two inputs.

Ways to create a DataStream

Created from list object

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment


env = StreamExecutionEnvironment.get_execution_environment()
ds = env.from_collection(
    collection=[(1, 'aaa|bb'), (2, 'bb|a'), (3, 'aaa|a')],
    type_info=Types.ROW([Types.INT(), Types.STRING()]))
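As an illustrative follow-up (not part of the original example), the stream could then be transformed and printed; splitting the '|'-separated second field is only meant to show a simple transformation, and the job name is hypothetical:

# Split the second field on '|' and print each token.
ds.flat_map(lambda row: row[1].split('|'), output_type=Types.STRING()).print()
env.execute("from_collection_example")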

Created using DataStream connectors

Use the add_source function. This function only supports FlinkKafkaConsumer and can only be used in streaming execution mode.

from pyflink.common.serialization import JsonRowDeserializationSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer


env = StreamExecutionEnvironment.get_execution_environment()
# the sql connector for kafka is used here as it's a fat jar and could avoid dependency issues
env.add_jars("file:///path/to/flink-sql-connector-kafka.jar")
deserialization_schema = JsonRowDeserializationSchema.builder() \
    .type_info(type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_consumer = FlinkKafkaConsumer(
    topics='test_source_topic',
    deserialization_schema=deserialization_schema,
    properties={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds = env.add_source(kafka_consumer)

Use the from_source function. This function currently only supports the NumberSequenceSource and FileSource custom data sources and is only used in streaming execution mode.

from pyflink.common.typeinfo import Types
from pyflink.common.watermark_strategy import WatermarkStrategy
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import NumberSequenceSource

env = StreamExecutionEnvironment.get_execution_environment()
seq_num_source = NumberSequenceSource(1, 1000)
ds = env.from_source(
    source=seq_num_source,
    watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
    source_name='seq_num_source',
    type_info=Types.LONG())

Created using Table & SQL connectors

First create a table using the Table & SQL connectors, then convert it to a DataStream.

from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(stream_execution_environment=env)

t_env.execute_sql("""
        CREATE TABLE my_source (
          a INT,
          b VARCHAR
        ) WITH (
          'connector' = 'datagen',
          'number-of-rows' = '10'
        )
    """)

ds = t_env.to_append_stream(
    t_env.from_path('my_source'),
    Types.ROW([Types.INT(), Types.STRING()]))
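The converted stream can then be processed like any other DataStream. A minimal sketch (the print sink and the bare execute call are only illustrative):

ds.print()
env.execute()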
