1. Transform the data stream from the previous section, or use a sink to write the data to an external system.
This tutorial uses FileSink to write the resulting data to a file.
def split(line):
    yield from line.split()

# compute word count
ds = ds.flat_map(split) \
    .map(lambda i: (i, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \
    .key_by(lambda i: i[0]) \
    .reduce(lambda i, j: (i[0], i[1] + j[1]))

ds.sink_to(
    sink=FileSink.for_row_format(
        base_path=output_path,
        encoder=Encoder.simple_string_encoder())
    .with_output_file_config(
        OutputFileConfig.builder()
        .with_part_prefix("prefix")
        .with_part_suffix(".ext")
        .build())
    .with_rolling_policy(RollingPolicy.default_rolling_policy())
    .build()
)
The sink_to function sends DataStream data to a custom sink connector. It supports only FileSink and can be used in both batch and streaming execution modes.
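To make the word-count pipeline easier to follow, here is a plain-Python sketch that mimics flat_map, map, key_by and reduce on a small in-memory list. It does not use Flink at all; the function name running_word_count and the sample input are illustrative assumptions. It also shows that a keyed reduce on a stream emits an updated result for every incoming record, so in streaming mode the sink receives running counts.

```python
def split(line):
    # Same generator as in the job: one output element per word.
    yield from line.split()


def running_word_count(lines):
    """Mimic flat_map -> map -> key_by -> reduce on an in-memory list."""
    state = {}     # per-key reduce state (hypothetical stand-in for keyed state)
    emitted = []   # every intermediate result a streaming reduce would emit
    for line in lines:
        for word in split(line):                 # flat_map(split)
            record = (word, 1)                   # map(lambda i: (i, 1))
            key = record[0]                      # key_by(lambda i: i[0])
            if key in state:
                previous = state[key]
                # reduce(lambda i, j: (i[0], i[1] + j[1]))
                record = (previous[0], previous[1] + record[1])
            state[key] = record
            emitted.append(record)
    return emitted


results = running_word_count(["to be or", "not to be"])
print(results)
# [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 2), ('be', 2)]
```

With a row-encoded FileSink and the simple string encoder, each of these emitted records would become one line of text in a part file.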
2. File Sink
Streaming File Sink is a feature introduced in Flink 1.7 to address the following scenario:
In big-data workloads, a common pattern is that external data is sent to Kafka, and Flink acts as middleware that consumes the Kafka data and applies business logic. After processing, the results often need to be written to a database or to a file system such as HDFS.
Streaming File Sink writes partitioned files to any file system that supports the Flink FileSystem interface, and it provides exactly-once semantics. The exactly-once guarantee is based on a two-phase commit built on Flink checkpoints. It is mainly used in scenarios such as real-time data warehouses, topic splitting, and hour-based analysis and processing.
Streaming File Sink is a connector the community added as an optimization, and it is the recommended one to use.
Streaming File Sink is more flexible and powerful, and it lets you implement your own serialization.
Streaming File Sink has two ways to write to a file: the row-encoded format (forRowFormat) and the bulk-encoded format (forBulkFormat).
forRowFormat is the simpler of the two. It provides only SimpleStringEncoder, which writes text files with a configurable character encoding.
Because streaming data is unbounded, the sink writes it into buckets. By default, buckets are assigned by system time (yyyy-MM-dd--HH). Within each bucket, the output is split into part files according to the rolling strategy.
Flink provides two bucketing strategies, both of which implement the org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner interface:
- BasePathBucketAssigner: no bucketing; all files are written to the root directory.
- DateTimeBucketAssigner: bucketing based on system time (yyyy-MM-dd--HH).
In addition, you can also implement the BucketAssigner interface and customize the bucketing strategy.
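The two bucketing strategies can be pictured as functions that map the current processing time to a bucket ID, which becomes a subdirectory under the base path. The following is a plain-Python sketch, not Flink code; the function names are hypothetical, and the strftime pattern is an assumed Python equivalent of the Java pattern yyyy-MM-dd--HH.

```python
from datetime import datetime


def date_time_bucket_id(now, fmt="%Y-%m-%d--%H"):
    """Time-based bucketing: the bucket ID is the formatted processing time."""
    return now.strftime(fmt)


def base_path_bucket_id(now):
    """Single global bucket: an empty ID means 'write directly under the base path'."""
    return ""


ts = datetime(2024, 3, 5, 14, 37, 12)
print(date_time_bucket_id(ts))               # 2024-03-05--14 (hourly bucket)
print(date_time_bucket_id(ts, "%Y-%m-%d"))   # 2024-03-05 (daily bucket)
print(repr(base_path_bucket_id(ts)))         # ''
```

Changing the date format changes the bucket granularity: dropping the hour component yields one bucket per day.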
Flink provides two rolling strategies, both of which implement the org.apache.flink.streaming.api.functions.sink.filesystem.RollingPolicy interface:
- DefaultRollingPolicy: rolls the file when the part file exceeds the maximum size (default 128 MB), when the rollover interval is exceeded (default 60 seconds), or when no data has been written for the inactivity timeout (default 60 seconds).
- OnCheckpointRollingPolicy: rolls the file on every checkpoint.
File Sink
File Sink writes incoming data into buckets. Because the input stream can be unbounded, the data in each bucket is organized into part files of finite size. Bucketing is configurable; with time-based bucketing, for example, you can have data written into a new bucket every hour, so that each bucket contains the records received within a one-hour interval.
The data in a bucket directory is split into multiple part files. Each bucket contains at least one part file for every subtask of the sink that received data for that bucket. Additional part files are created according to the configured rolling policy. For Row-encoded Formats, the default policy rolls part files based on size, and you can also specify the maximum time a file may stay open and the inactivity timeout after which the file is closed. For Bulk-encoded Formats, files roll on every checkpoint, and the user can add further conditions based on size or time.
Important: Checkpointing must be enabled when using FileSink in STREAMING mode. Part files are only finalized when a checkpoint succeeds. If checkpointing is not enabled, part files stay in the in-progress or pending state forever, and downstream systems cannot safely read them.
Format Types
FileSink supports both Row-encoded and Bulk-encoded formats, such as Apache Parquet. The two variants are created with the following static methods:
- Row-encoded sink:
FileSink.forRowFormat(basePath, rowEncoder)
- Bulk-encoded sink:
FileSink.forBulkFormat(basePath, bulkWriterFactory)
When creating either a Row-encoded or a Bulk-encoded sink, you must specify the base path where the buckets are stored and the logic for encoding the data.
Row-encoded Formats
A Row-encoded Format requires an Encoder, which serializes each individual row to the OutputStream of the file being written.
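As a rough picture of what a row encoder does, the sketch below writes each record as text followed by a newline into a byte stream. It is plain Python, not the Flink Encoder interface; the function name simple_string_encode is hypothetical, io.BytesIO stands in for the part file's output stream, and Python's str() is used where the Java encoder would call toString().

```python
import io


def simple_string_encode(element, stream, charset="UTF-8"):
    """Write one record as text, followed by a newline, in the given charset."""
    stream.write(str(element).encode(charset))
    stream.write(b"\n")


out = io.BytesIO()   # stands in for the output stream of an in-progress part file
for record in [("flink", 2), ("pyflink", 1)]:
    simple_string_encode(record, out)

print(out.getvalue().decode("UTF-8"))
```

Because every record is written independently, a row-encoded part file can be closed after any record, which is why row formats can roll on size or time alone.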
In addition to the bucket assigner, RowFormatBuilder lets users specify the following properties:
- Custom RollingPolicy: a rolling policy that overrides DefaultRollingPolicy
- bucketCheckInterval (default = 1 min): the interval at which time-based rolling policies are checked
data_stream = ...

sink = FileSink \
    .for_row_format(OUTPUT_PATH, Encoder.simple_string_encoder("UTF-8")) \
    .with_rolling_policy(RollingPolicy.default_rolling_policy(
        part_size=1024 ** 3, rollover_interval=15 * 60 * 1000, inactivity_interval=5 * 60 * 1000)) \
    .build()

data_stream.sink_to(sink)
This example creates a simple sink that assigns records to the default hourly buckets. It also specifies a rolling policy that rolls the in-progress part file when any of the following three conditions is met:
- The file contains at least 15 minutes of data
- No new records have been received for the last 5 minutes
- The file size has reached 1 GB (after writing the last record)
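The three conditions can be expressed as a single decision function. The sketch below is plain Python rather than Flink code: PartFileInfo and should_roll are hypothetical names, the thresholds are the ones from the example, and the exact comparison operators Flink uses internally may differ.

```python
from dataclasses import dataclass


@dataclass
class PartFileInfo:
    size_bytes: int      # bytes written so far
    created_ms: int      # time the part file was opened
    last_write_ms: int   # time the last record was written


def should_roll(part, now_ms,
                part_size=1024 ** 3,
                rollover_interval=15 * 60 * 1000,
                inactivity_interval=5 * 60 * 1000):
    """Return True if any of the three rolling conditions is met."""
    too_big = part.size_bytes >= part_size
    too_old = now_ms - part.created_ms >= rollover_interval
    idle = now_ms - part.last_write_ms >= inactivity_interval
    return too_big or too_old or idle


MIN = 60 * 1000
part = PartFileInfo(size_bytes=10_000, created_ms=0, last_write_ms=2 * MIN)
print(should_roll(part, now_ms=3 * MIN))   # False: small, young, recently written
print(should_roll(part, now_ms=8 * MIN))   # True: idle for 6 minutes
```

The time-based conditions only fire when something checks them, which is the role of the periodic bucket check interval.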
Bulk-encoded Formats
A Bulk-encoded sink is created in the same way as a Row-encoded one, except that you specify a BulkWriter.Factory instead of an Encoder. The BulkWriter defines how new records are added and flushed, and how a batch of records is finalized for encoding.
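The add, flush and finalize steps of a bulk writer can be illustrated with a toy writer that compresses a whole batch of records together. This is a plain-Python sketch, not the Flink BulkWriter interface: ToyBulkWriter and toy_writer_factory are hypothetical names, and gzip-compressed JSON lines are used only as a stand-in for a real bulk format such as Parquet.

```python
import gzip
import io
import json


class ToyBulkWriter:
    """Encodes records together as one compressed batch."""

    def __init__(self, stream):
        # The gzip wrapper keeps internal state across records, so the
        # batch is only valid once finish() has written the trailer.
        self._gz = gzip.GzipFile(fileobj=stream, mode="wb")

    def add_element(self, element):
        self._gz.write((json.dumps(element) + "\n").encode("utf-8"))

    def flush(self):
        self._gz.flush()

    def finish(self):
        # Finalizes the batch; the underlying stream itself stays open.
        self._gz.close()


def toy_writer_factory(stream):
    """Plays the role of a writer factory: one writer per part file stream."""
    return ToyBulkWriter(stream)


part_file = io.BytesIO()
writer = toy_writer_factory(part_file)
for record in [{"word": "flink", "count": 2}, {"word": "pyflink", "count": 1}]:
    writer.add_element(record)
writer.flush()
writer.finish()

decoded = gzip.decompress(part_file.getvalue()).decode("utf-8")
print(decoded)
```

Because a batch must be finalized before the file is readable, a bulk-encoded part file cannot simply be cut off after an arbitrary record, which helps explain why bulk formats roll on checkpoints.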
Flink has five built-in BulkWriter factory classes:
- ParquetWriterFactory
- AvroWriterFactory
- SequenceFileWriterFactory
- CompressWriterFactory
- OrcBulkWriterFactory
Important: Bulk-encoded Formats support only rolling policies that extend CheckpointRollingPolicy. Such a policy rolls on every checkpoint and can additionally roll based on size or processing time.
Bucket assignment
The bucket logic defines how data is distributed into subdirectories within the base output directory.
Both Row-encoded Format and Bulk-encoded Format use DateTimeBucketAssigner as the default assigner. By default, DateTimeBucketAssigner creates hourly buckets based on the system's default time zone, using the format yyyy-MM-dd--HH.
Both the date format (i.e. the bucket size) and the time zone can be configured manually.
You can also specify a custom BucketAssigner by calling .withBucketAssigner(assigner) on the format builder.
Flink has two built-in BucketAssigners:
- DateTimeBucketAssigner: the default time-based assigner
- BasePathBucketAssigner: stores all files in the base path (a single global bucket)
PyFlink only supports DateTimeBucketAssigner and BasePathBucketAssigner.
Rolling policy
RollingPolicy defines when a given in-progress part file is closed and moved to the pending state, and later to the finished state. Files in the finished state can be read, are guaranteed to contain valid data, and will not be reverted in the event of a failure. In STREAMING mode, the rolling policy together with the checkpoint interval (a pending file becomes finished on the next successful checkpoint) controls how quickly part files become visible to downstream readers, as well as the size and number of these files. In BATCH mode, part files become visible to downstream readers at the end of the job, and the rolling policy only controls the maximum part file size.
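The in-progress, pending and finished states can be modelled as a small state machine. The sketch below is a simplified plain-Python illustration, not Flink's implementation: ToyFileSink, PartFile and the part-N file names are hypothetical, and details such as recovery of in-progress files after a failure are left out.

```python
class PartFile:
    def __init__(self, name):
        self.name = name
        self.state = "in-progress"


class ToyFileSink:
    def __init__(self):
        self.parts = []
        self._counter = 0
        self._current = None

    def write(self, record):
        # Open a new in-progress part file if none is currently open.
        if self._current is None:
            self._current = PartFile(f"part-{self._counter}")
            self._counter += 1
            self.parts.append(self._current)
        # (the record bytes would be appended to the file here)

    def roll(self):
        """The rolling policy fired: close the in-progress file."""
        if self._current is not None:
            self._current.state = "pending"
            self._current = None

    def checkpoint_completed(self):
        """A checkpoint succeeded: commit all pending files."""
        for part in self.parts:
            if part.state == "pending":
                part.state = "finished"

    def visible_to_readers(self):
        return [p.name for p in self.parts if p.state == "finished"]


sink = ToyFileSink()
sink.write("a")
sink.roll()                        # part-0: in-progress -> pending
sink.write("b")                    # part-1 is opened and stays in-progress
print(sink.visible_to_readers())   # []
sink.checkpoint_completed()        # part-0: pending -> finished
print(sink.visible_to_readers())   # ['part-0']
```

Without a call to checkpoint_completed, nothing ever reaches the finished state, which mirrors the requirement to enable checkpointing in STREAMING mode.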
Flink has two built-in RollingPolicies:
- DefaultRollingPolicy
- OnCheckpointRollingPolicy
PyFlink only supports DefaultRollingPolicy and OnCheckpointRollingPolicy.
3. How to output the results
ds.print()
Collect results to client
The execute_and_collect method collects the stream's results into client memory.
with ds.execute_and_collect() as results:
    for result in results:
        print(result)
Send the results to the DataStream sink connector
The add_sink function sends DataStream data to a sink connector. It supports only FlinkKafkaProducer, JdbcSink and StreamingFileSink, and can only be used in streaming execution mode.
from pyflink.common.typeinfo import Types
from pyflink.datastream.connectors import FlinkKafkaProducer
from pyflink.common.serialization import JsonRowSerializationSchema

serialization_schema = JsonRowSerializationSchema.builder().with_type_info(
    type_info=Types.ROW([Types.INT(), Types.STRING()])).build()

kafka_producer = FlinkKafkaProducer(
    topic='test_sink_topic',
    serialization_schema=serialization_schema,
    producer_config={'bootstrap.servers': 'localhost:9092', 'group.id': 'test_group'})

ds.add_sink(kafka_producer)
The sink_to function sends DataStream data to a custom sink connector. It supports only FileSink and can be used in both batch and streaming execution modes.
from pyflink.datastream.connectors import FileSink, OutputFileConfig
from pyflink.common.serialization import Encoder

output_path = '/opt/output/'
file_sink = FileSink \
    .for_row_format(output_path, Encoder.simple_string_encoder()) \
    .with_output_file_config(OutputFileConfig.builder().with_part_prefix('pre').with_part_suffix('suf').build()) \
    .build()
ds.sink_to(file_sink)
Send results to Table & SQL sink connector
Table & SQL connectors can also be used to write out a DataStream. The DataStream is first converted to a Table, which is then written to the Table & SQL sink connector.
from pyflink.common import Row
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(stream_execution_environment=env)

# Use either option 1 or option 2, not both.

# option 1: the result type of ds is Types.ROW
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield Row(s[0], sp)

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
    .flat_map(split, Types.ROW([Types.INT(), Types.STRING()])) \
    .key_by(lambda i: i[1]) \
    .reduce(lambda i, j: Row(i[0] + j[0], i[1]))

# option 2: the result type of ds is Types.TUPLE
def split(s):
    splits = s[1].split("|")
    for sp in splits:
        yield s[0], sp

ds = ds.map(lambda i: (i[0] + 1, i[1])) \
    .flat_map(split, Types.TUPLE([Types.INT(), Types.STRING()])) \
    .key_by(lambda i: i[1]) \
    .reduce(lambda i, j: (i[0] + j[0], i[1]))

# emit ds to print sink
t_env.execute_sql("""
    CREATE TABLE my_sink (
        a INT,
        b VARCHAR
    ) WITH (
        'connector' = 'print'
    )
""")

table = t_env.from_data_stream(ds)
table_result = table.execute_insert("my_sink")
4. Execute the PyFlink DataStream API job.
PyFlink applications are lazily evaluated: the job is submitted to the cluster for execution only after it has been fully built.
To execute an application, call env.execute().
env.execute()
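Lazy evaluation means that calls such as map only describe the job, and nothing is processed until execution is triggered. The toy class below illustrates the idea in plain Python; LazyStream and traced_double are hypothetical names, and this is an analogy rather than how PyFlink builds or submits a job graph.

```python
class LazyStream:
    """Records transformations; nothing runs until execute() is called."""

    def __init__(self, source):
        self._source = source
        self._ops = []

    def map(self, fn):
        # Only remember the function; do not apply it yet.
        self._ops.append(fn)
        return self

    def execute(self):
        results = []
        for item in self._source:
            for fn in self._ops:
                item = fn(item)
            results.append(item)
        return results


calls = []


def traced_double(x):
    calls.append(x)   # record that the function actually ran
    return x * 2


stream = LazyStream([1, 2, 3]).map(traced_double)
print(calls)              # []: building the pipeline ran nothing
print(stream.execute())   # [2, 4, 6]
print(calls)              # [1, 2, 3]
```

In the same spirit, forgetting the final execute call leaves a fully described job that never runs.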