Flink Stream-Batch Unified Computing (18): PyFlink DataStream API Computation and Sinks

Table of contents

1. Transform the data stream from the previous section and use a sink to write the data to an external system

2. File Sink

File Sink

Format Types

Row-encoded Formats

Bulk-encoded Formats

Bucket assignment

Rolling policy

3. How to output the results

Collect results to client

Send the results to the DataStream sink connector

Send results to Table & SQL sink connector

4. Execute the PyFlink DataStream API job.


1. Transform the data stream from the previous section and use a sink to write the data to an external system

This tutorial uses FileSink to write the resulting data to a file.

from pyflink.common.serialization import Encoder
from pyflink.common.typeinfo import Types
from pyflink.datastream.connectors import FileSink, OutputFileConfig, RollingPolicy

def split(line):
    yield from line.split()

# compute word count
ds = ds.flat_map(split) \
    .map(lambda i: (i, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \
    .key_by(lambda i: i[0]) \
    .reduce(lambda i, j: (i[0], i[1] + j[1]))

ds.sink_to(
    sink=FileSink.for_row_format(
        base_path=output_path,
        encoder=Encoder.simple_string_encoder())
    .with_output_file_config(
        OutputFileConfig.builder()
        .with_part_prefix("prefix")
        .with_part_suffix(".ext")
        .build())
    .with_rolling_policy(RollingPolicy.default_rolling_policy())
    .build()
)

The sink_to function sends the DataStream to a custom sink connector; it only supports FileSink and can be used in both batch and streaming execution modes.
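Since sink_to works in both execution modes, the sketch below (an illustrative addition, not part of the original example) shows how the same pipeline could be switched to batch execution before it is built; it assumes the input source is bounded.

from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode

env = StreamExecutionEnvironment.get_execution_environment()
# Run the DataStream pipeline with batch semantics; with the default STREAMING
# mode, the FileSink additionally requires checkpointing (see the note below).
env.set_runtime_mode(RuntimeExecutionMode.BATCH)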

2. File Sink

Streaming File Sink is a feature introduced in Flink 1.7 to address the following kind of scenario:

A common big-data pattern looks like this: external data is sent to Kafka, Flink sits in the middle, consumes the Kafka data and performs the business processing, and the processed results then need to be written to a database or a file system such as HDFS.

Streaming File Sink writes partitioned files to any file system that supports the Flink FileSystem interface, with exactly-once semantics. The exactly-once guarantee is built on the two-phase commit protocol driven by Flink checkpoints. It is mainly used in scenarios such as real-time data warehousing, topic splitting, and hourly analysis and processing.

Streaming File Sink is the connector recommended by the community after several rounds of optimization.

Streaming File Sink is flexible and powerful, and allows you to implement the serialization logic yourself.

Streaming File Sink has two ways to write output files: a row-encoded format (forRowFormat) and a bulk-encoded format (forBulkFormat).

forRowFormat is relatively simple: it provides SimpleStringEncoder for writing text files, and the character encoding can be specified.

Since a stream is by nature unbounded, the sink writes data into buckets. By default, a bucketing strategy based on system time (yyyy-MM-dd--HH) is used. Within each bucket, the output is split into part files according to the rolling strategy.

Flink provides two built-in bucketing strategies; a bucketing strategy implements the org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner interface:

  • BasePathBucketAssigner: no bucketing; all files are written to the base path (root directory)
  • DateTimeBucketAssigner: buckets based on system time (yyyy-MM-dd--HH)

In addition, you can also implement the BucketAssigner interface and customize the bucketing strategy.

Flink provides two built-in rolling strategies; a rolling strategy implements the org.apache.flink.streaming.api.functions.sink.filesystem.RollingPolicy interface:

  • DefaultRollingPolicy: rolls the file when the maximum part size is exceeded (default 128 MB), the rollover interval has elapsed (default 60 seconds), or no data has been written for the inactivity timeout (default 60 seconds)
  • OnCheckpointRollingPolicy: rolls the file on every checkpoint

File Sink

File Sink writes incoming data into buckets. Since the input stream can be unbounded, the data in each bucket is organized into part files of finite size. Bucketing can be configured to be time-based: for example, data can be written into a new bucket every hour, so that each bucket contains the records received within one hour.

The data in a bucket directory is split into multiple part files. A bucket contains at least one part file for every subtask of the sink that has received data for that bucket, and additional part files are created according to the configured rolling strategy. With the default strategy, part files are rolled based on size, a timeout that specifies the maximum duration a file can stay open, and an inactivity timeout after which the file is closed. Alternatively, files can be rolled on every checkpoint, and the user can add further conditions based on size or time.

Important: Checkpointing must be enabled when using FileSink in STREAMING mode. Part files are finalized only when a checkpoint succeeds. If checkpointing is not enabled, files will remain in the in-progress or pending state forever, and downstream systems will not be able to safely read their data.
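As a minimal sketch of that requirement, the snippet below enables checkpointing on the execution environment before the FileSink job is built; the 10-second interval is just an example value.

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Checkpoint every 10 seconds (example value): pending part files are moved to
# the finished state only after a checkpoint completes successfully.
env.enable_checkpointing(10 * 1000)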

Format Types 

FileSink supports not only Row-encoded but also Bulk-encoded formats, such as Apache Parquet. These two formats can be constructed through the following static methods:

  • Row-encoded sink: FileSink.forRowFormat(basePath, rowEncoder)
  • Bulk-encoded sink: FileSink.forBulkFormat(basePath, bulkWriterFactory)

When creating a Row-encoded or Bulk-encoded sink, you must specify the base path of the buckets and the logic for encoding the data.

Row-encoded Formats 

A Row-encoded Format requires an Encoder, which serializes individual rows to the OutputStream of the part file being written.

In addition to the bucket assigner, the RowFormatBuilder also allows users to specify the following properties:

  • Custom RollingPolicy: a custom rolling policy that overrides the DefaultRollingPolicy
  • bucketCheckInterval (default = 1 min): the interval at which the rolling policy conditions are checked

from pyflink.common.serialization import Encoder
from pyflink.datastream.connectors import FileSink, RollingPolicy

data_stream = ...
sink = FileSink \
    .for_row_format(OUTPUT_PATH, Encoder.simple_string_encoder("UTF-8")) \
    .with_rolling_policy(RollingPolicy.default_rolling_policy(
        part_size=1024 ** 3, rollover_interval=15 * 60 * 1000, inactivity_interval=5 * 60 * 1000)) \
    .build()
data_stream.sink_to(sink)

This example creates a simple sink that by default assigns records to hourly buckets. It also specifies a rolling policy that rolls the in-progress part file when any of the following three conditions is met:

  • The file contains at least 15 minutes worth of data
  • No new records have been received for the last 5 minutes
  • The file size has reached 1 GB (checked after each record is written)

Bulk-encoded Formats 

Creating a Bulk-encoded sink is similar to a Row-encoded one, except that instead of an Encoder you must specify a BulkWriter.Factory. The BulkWriter defines how new records are added and flushed, and how a batch of records is finalized.

Flink has five built-in BulkWriter factory classes:

  • ParquetWriterFactory
  • AvroWriterFactory
  • SequenceFileWriterFactory
  • CompressWriterFactory
  • OrcBulkWriterFactory

Important: A Bulk-encoded Format can only be combined with a rolling strategy that extends CheckpointRollingPolicy. Such a policy rolls on every checkpoint, and can additionally roll based on size or processing time.
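For illustration, here is a hedged sketch of a Bulk-encoded Parquet sink in PyFlink. It assumes a recent PyFlink release (1.16+), where FileSink lives in pyflink.datastream.connectors.file_system and ParquetBulkWriters in pyflink.datastream.formats.parquet; the field names are made up for the example, so check the formats documentation of your Flink version for the factory classes actually available in Python.

from pyflink.datastream.connectors.file_system import FileSink
from pyflink.datastream.formats.parquet import ParquetBulkWriters
from pyflink.table import DataTypes

# Row layout of the records to be written as Parquet (illustrative field names);
# the stream elements are assumed to be Rows matching this type.
row_type = DataTypes.ROW([
    DataTypes.FIELD('word', DataTypes.STRING()),
    DataTypes.FIELD('count', DataTypes.INT())
])

parquet_sink = FileSink \
    .for_bulk_format(output_path, ParquetBulkWriters.for_row_type(row_type)) \
    .build()
ds.sink_to(parquet_sink)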

Bucket assignment

The bucket logic defines how data is distributed into subdirectories within the base output directory.

Both Row-encoded Format and Bulk-encoded Format use DateTimeBucketAssigner as the default assigner. By default, DateTimeBucketAssigner creates hourly buckets based on the system's default time zone, using the format yyyy-MM-dd--HH. Both the date format (i.e. the bucket size) and the time zone can be configured manually.

You can also specify a custom BucketAssigner by calling the .withBucketAssigner(assigner) method on the format builder.

Flink has two built-in BucketAssigners:

  • DateTimeBucketAssigner: the default time-based assigner
  • BasePathBucketAssigner: stores all files under the base path (a single global bucket)

PyFlink only supports  DateTimeBucketAssigner and  BasePathBucketAssigner .
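As a small sketch (assuming the with_bucket_assigner builder method and the BucketAssigner helpers exposed by PyFlink), this is how the default hourly bucketing could be overridden:

from pyflink.common.serialization import Encoder
from pyflink.datastream.connectors import BucketAssigner, FileSink

# Write all part files directly under the base path instead of hourly subdirectories.
sink = FileSink \
    .for_row_format(output_path, Encoder.simple_string_encoder()) \
    .with_bucket_assigner(BucketAssigner.base_path_bucket_assigner()) \
    .build()

# Alternatively, keep time-based bucketing but switch to daily buckets in UTC.
daily_assigner = BucketAssigner.date_time_bucket_assigner("yyyy-MM-dd", "UTC")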

Rolling policy

RollingPolicy defines when a given in-progress part file is closed, moved to the pending state, and then to the finished state. Files in the finished state are safe to read and their data is guaranteed to be valid; they will not be rolled back in the event of a failure. In STREAMING mode, the rolling strategy together with the checkpoint interval (pending files become finished only once the next checkpoint succeeds) controls when part files become visible to downstream readers, as well as their size and number. In BATCH mode, part files become visible to downstream readers at the end of the job, and the rolling strategy only controls the maximum part file size.

Flink has two built-in RollingPolicies:

  • DefaultRollingPolicy
  • OnCheckpointRollingPolicy

PyFlink only supports  DefaultRollingPolicy and  OnCheckpointRollingPolicy .
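A short sketch of choosing OnCheckpointRollingPolicy in PyFlink (checkpointing must be enabled, as noted in section 2):

from pyflink.common.serialization import Encoder
from pyflink.datastream.connectors import FileSink, RollingPolicy

# Roll the in-progress part file on every successful checkpoint.
sink = FileSink \
    .for_row_format(output_path, Encoder.simple_string_encoder()) \
    .with_rolling_policy(RollingPolicy.on_checkpoint_rolling_policy()) \
    .build()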

3. How to output the results

Print

ds.print()

Collect results to client

The execute_and_collect method collects the data into the client's memory.

with ds.execute_and_collect() as results:
    for result in results:
        print(result)

Send the results to the DataStream sink connector

The add_sink function sends the DataStream to a sink connector. It only supports FlinkKafkaProducer, JdbcSink and StreamingFileSink, and can only be used in streaming execution mode.

from pyflink.common.typeinfo import Types
from pyflink.datastream.connectors import FlinkKafkaProducer
from pyflink.common.serialization import JsonRowSerializationSchema

serialization_schema = JsonRowSerializationSchema.builder().with_type_info(
    type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_producer = FlinkKafkaProducer(
    topic='test_sink_topic',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds.add_sink(kafka_producer)
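Since add_sink also accepts JdbcSink, the following is a hedged sketch of writing the same ROW<INT, STRING> records to a relational table. The table name, JDBC URL, driver and credentials are placeholders, the JDBC driver jar must be on the classpath, and the import path may differ slightly between PyFlink versions.

from pyflink.common.typeinfo import Types
from pyflink.datastream.connectors import JdbcSink, JdbcConnectionOptions, JdbcExecutionOptions

type_info = Types.ROW([Types.INT(), Types.STRING()])

jdbc_sink = JdbcSink.sink(
    "INSERT INTO word_count (id, word) VALUES (?, ?)",  # placeholder table and columns
    type_info,
    JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
        .with_url('jdbc:mysql://localhost:3306/test')    # placeholder URL
        .with_driver_name('com.mysql.cj.jdbc.Driver')
        .with_user_name('user')
        .with_password('password')
        .build(),
    JdbcExecutionOptions.builder()
        .with_batch_interval_ms(1000)
        .with_batch_size(200)
        .with_max_retries(3)
        .build())

ds.add_sink(jdbc_sink)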

The sink_to function sends DataStream data to a custom sink connector. It only supports FileSink and can be used in batch and streaming execution modes.

from pyflink.datastream.connectors import FileSink, OutputFileConfig
from pyflink.common.serialization import Encoder

output_path = '/opt/output/'
file_sink = FileSink \
    .for_row_format(output_path, Encoder.simple_string_encoder()) \
    .with_output_file_config(
        OutputFileConfig.builder().with_part_prefix('pre').with_part_suffix('suf').build()) \
    .build()
ds.sink_to(file_sink)

Send results to Table & SQL sink connector

Table & SQL connectors can also be used to write out a DataStream: the DataStream is first converted to a Table, which is then written to a Table & SQL sink connector.

from pyflink.common import Row
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(stream_execution_environment=env)
# option 1: the result type of ds is Types.ROW
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield Row(s[0], sp)

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.ROW([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: Row(i[0] + j[0], i[1]))

# option 2: the result type of ds is Types.TUPLE
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
       .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
       .key_by(lambda i: i[1]) \
       .reduce(lambda i, j: (i[0] + j[0], i[1]))

# emit ds to print sink
t_env.execute_sql("""
        CREATE TABLE my_sink (
          a INT,
          b VARCHAR
        ) WITH (
          'connector' = 'print'
        )
    """)

table = t_env.from_data_stream(ds)
table_result = table.execute_insert("my_sink")
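execute_insert submits the job asynchronously and returns a TableResult; when running a standalone script against a local mini-cluster, you can block until the job finishes (a usage note, not part of the original example):

# Block until the insert job finishes (useful when testing locally).
table_result.wait()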

4. Execute the PyFlink DataStream API job.

PyFlink applications are executed lazily: they are submitted to the cluster and run only after they have been fully built.

To execute an application, you simply call env.execute().

env.execute()
