Flink Stream-Batch Unified Computing (16): PyFlink DataStream API

Table of contents

Overview

Pipeline Dataflow

Code example: word_count.py

Execute the script word_count.py


Overview

Apache Flink provides the DataStream API for building robust, stateful streaming applications. It offers fine-grained control over state and time, allowing the implementation of advanced event-driven systems.

A user-defined Flink program is composed of two basic building blocks: Streams and Transformations.

A Stream is an intermediate result (a flow of data records), and a Transformation is an operation that takes one or more Streams as input, processes them, and produces one or more result Streams as output.

When a Flink program is executed, it is mapped to a Streaming Dataflow. A Streaming Dataflow consists of a set of Streams and Transformation Operators, forming a structure similar to a DAG: it starts at one or more Source Operators and ends at one or more Sink Operators.

For example, FlinkKafkaConsumer is a Source Operator; Map, KeyBy, TimeWindow, and Apply are Transformation Operators; and RollingSink is a Sink Operator.
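To make that shape concrete, here is a minimal PyFlink sketch of the same source → transformation → sink structure. The collection source and print sink below merely stand in for FlinkKafkaConsumer and RollingSink, and the job name is illustrative:

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# Source Operator (stand-in for FlinkKafkaConsumer)
ds = env.from_collection(["a", "b", "a"])
# Transformation Operators (Map, KeyBy, Reduce)
counts = ds.map(lambda w: (w, 1)) \
           .key_by(lambda t: t[0]) \
           .reduce(lambda a, b: (a[0], a[1] + b[1]))
# Sink Operator (stand-in for RollingSink)
counts.print()
env.execute("dataflow_sketch")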

Pipeline Dataflow

In Flink, programs run in a parallel and distributed manner. A Stream can be divided into multiple Stream Partitions, and an Operator can be divided into multiple Operator Subtasks.

Flink performs an internal optimization, deciding whether to merge operators based on how closely upstream and downstream operators are connected.

Operators that are not closely connected cannot be merged; each of their Operator Subtasks executes independently in a separate thread. The parallelism of an Operator equals the number of its Operator Subtasks, and the parallelism of a Stream (its total number of partitions) equals the parallelism of the Operator that produced it.

Closely connected operators can be merged: after this optimization, multiple Operator Subtasks are strung together into an Operator Chain, which is in effect a single execution chain. Each execution chain runs in its own thread on a TaskManager.
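Chaining can also be influenced explicitly. As a hedged sketch of the relevant PyFlink hooks (how they affect the final job graph depends on the version and the surrounding operators):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
# disable chaining for the whole job
env.disable_operator_chaining()

ds = env.from_collection(["a", "b"])
# or control it per operator:
mapped = ds.map(lambda w: (w, 1)).start_new_chain()    # start a fresh chain at this map
isolated = mapped.map(lambda t: t).disable_chaining()  # keep this operator out of any chain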

The upper part of the figure shows the two closely connected operators, Source and Map, being optimized into an Operator Chain. An Operator Chain is essentially one large Operator: in the figure, the Operator Chain is one Operator, keyBy is another, and Sink is a third. They are connected by Streams, and at runtime each Operator corresponds to a Task, so the upper part of the figure contains 3 Operators and therefore 3 Tasks.

The lower part of the figure is the parallel version of the upper part: each Task is parallelized into multiple Subtasks. Only a parallelism of 2 is shown here, with the Sink operator at a parallelism of 1.
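A hedged sketch of how the figure's parallelism could be expressed in PyFlink: the job-wide default is set to 2 and the sink is pinned to a single subtask (from_collection is only a stand-in for a genuinely parallel source):

from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(2)  # default parallelism for the job's operators
ds = env.from_collection(["a", "b", "a", "b"])
counts = ds.map(lambda w: (w, 1)) \
           .key_by(lambda t: t[0]) \
           .reduce(lambda a, b: (a[0], a[1] + b[1]))
counts.print().set_parallelism(1)  # the Sink runs as a single subtask
env.execute()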

Code example: word_count.py

In this chapter, you will learn how to build a simple streaming application using PyFlink and the DataStream API.

Write a simple Python DataStream job.

The program reads a text file (or falls back to a built-in sample data set), computes word frequencies, and writes the results to files in an output directory (or prints them to stdout).

import argparse
import logging
import sys
from pyflink.common import WatermarkStrategy, Encoder, Types
from pyflink.datastream import StreamExecutionEnvironment, RuntimeExecutionMode
from pyflink.datastream.connectors import (FileSource, StreamFormat, FileSink, OutputFileConfig, RollingPolicy)


word_count_data = ["To be, or not to be,--that is the question:--",
                   "Whether 'tis nobler in the mind to suffer",
                   "The slings and arrows of outrageous fortune",
                   "Or to take arms against a sea of troubles,",
                   "And by opposing end them?--To die,--to sleep,--",
                   "No more; and by a sleep to say we end",
                   "The heartache, and the thousand natural shocks",
                   "That flesh is heir to,--'tis a consummation",
                   "Devoutly to be wish'd. To die,--to sleep;--",
                   "To sleep! perchance to dream:--ay, there's the rub;",
                   "For in that sleep of death what dreams may come,",
                   "When we have shuffled off this mortal coil,",
                   "Must give us pause: there's the respect",
                   "That makes calamity of so long life;",
                   "For who would bear the whips and scorns of time,",
                   "The oppressor's wrong, the proud man's contumely,",
                   "The pangs of despis'd love, the law's delay,",
                   "The insolence of office, and the spurns",
                   "That patient merit of the unworthy takes,",
                   "When he himself might his quietus make",
                   "With a bare bodkin? who would these fardels bear,",
                   "To grunt and sweat under a weary life,",
                   "But that the dread of something after death,--",
                   "The undiscover'd country, from whose bourn",
                   "No traveller returns,--puzzles the will,",
                   "And makes us rather bear those ills we have",
                   "Than fly to others that we know not of?",
                   "Thus conscience does make cowards of us all;",
                   "And thus the native hue of resolution",
                   "Is sicklied o'er with the pale cast of thought;",
                   "And enterprises of great pith and moment,",
                   "With this regard, their currents turn awry,",
                   "And lose the name of action.--Soft you now!",
                   "The fair Ophelia!--Nymph, in thy orisons",
                   "Be all my sins remember'd."]


def word_count(input_path, output_path):
    env = StreamExecutionEnvironment.get_execution_environment()
    env.set_runtime_mode(RuntimeExecutionMode.BATCH)
    # write all the data to one file
    env.set_parallelism(1)
    # define the source
    if input_path is not None:
        ds = env.from_source(
            source=FileSource.for_record_stream_format(StreamFormat.text_line_format(),
                                                       input_path)
                             .process_static_file_set().build(),
            watermark_strategy=WatermarkStrategy.for_monotonous_timestamps(),
            source_name="file_source"
        )
    else:
        print("Executing word_count example with default input data set.")
        print("Use --input to specify file input.")
        ds = env.from_collection(word_count_data)

    def split(line):
        yield from line.split()

    # compute word count
    ds = ds.flat_map(split) \
           .map(lambda i: (i, 1), output_type=Types.TUPLE([Types.STRING(), Types.INT()])) \
           .key_by(lambda i: i[0]) \
           .reduce(lambda i, j: (i[0], i[1] + j[1]))
    # define the sink
    if output_path is not None:
        ds.sink_to(
            sink=FileSink.for_row_format(
                base_path=output_path,
                encoder=Encoder.simple_string_encoder())
            .with_output_file_config(
                OutputFileConfig.builder()
                .with_part_prefix("prefix")
                .with_part_suffix(".ext")
                .build())
            .with_rolling_policy(RollingPolicy.default_rolling_policy())
            .build()
        )
    else:
        print("Printing result to stdout. Use --output to specify output path.")
        ds.print()

    # submit for execution
    env.execute()


if __name__ == '__main__':
    logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--input',
        dest='input',
        required=False,
        help='Input file to process.')
    parser.add_argument(
        '--output',
        dest='output',
        required=False,
        help='Output file to write results to.')

    argv = sys.argv[1:]
    known_args, _ = parser.parse_known_args(argv)
    word_count(known_args.input, known_args.output)

Execute the script word_count.py

python word_count.py
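To read from a file and write results to a directory instead of stdout, pass the optional flags defined in the script (the paths below are placeholders):

python word_count.py --input /path/to/input.txt --output /path/to/output_dir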
