Flink Source Code Analysis: Walking Through a Simple Flink Program

This article first appeared on Toutiao as "How is a Flink program executed? Analyzing a simple Flink program through its source code". Follow the Toutiao account and the WeChat public account "Big Data and Artificial Intelligence" (search bigdata_ai_tech on WeChat) for more content. You are also welcome to follow my CSDN blog.

Previous posts covered how to set up a local Flink environment, how to create a Flink application, and how to build the Flink source code, step by step. This article uses the official SocketWindowWordCount example to analyze what a typical Flink program looks like and how it executes.

The sample program

public class SocketWindowWordCount {
    public static void main(String[] args) throws Exception {
        // the host and the port to connect to
        final String hostname;
        final int port;
        try {
            final ParameterTool params = ParameterTool.fromArgs(args);
            hostname = params.has("hostname") ? params.get("hostname") : "localhost";
            port = params.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWindowWordCount " +
                    "--hostname <hostname> --port <port>', where hostname (localhost by default) " +
                    "and port is the address of the text server");
            System.err.println("To start a simple text server, run 'netcat -l <port>' and " +
                    "type the input text into the command line");
            return;
        }
        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // get input data by connecting to the socket
        DataStream<String> text = env.socketTextStream(hostname, port, "\n");
        // parse the data, group it, window it, and aggregate the counts
        DataStream<WordWithCount> windowCounts = text
                .flatMap(new FlatMapFunction<String, WordWithCount>() {
                    @Override
                    public void flatMap(String value, Collector<WordWithCount> out) {
                        for (String word : value.split("\\s")) {
                            out.collect(new WordWithCount(word, 1L));
                        }
                    }
                })
                .keyBy("word")
                .timeWindow(Time.seconds(5))
                .reduce(new ReduceFunction<WordWithCount>() {
                    @Override
                    public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                        return new WordWithCount(a.word, a.count + b.count);
                    }
                });
        // print the results with a single thread, rather than in parallel
        windowCounts.print().setParallelism(1);
        env.execute("Socket Window WordCount");
    }
    // ------------------------------------------------------------------------
    /**
     * Data type for words with count.
     */
    public static class WordWithCount {
        public String word;
        public long count;
        public WordWithCount() {}
        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }
        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}

This is the SocketWindowWordCount example from the official website. It first reads the hostname and port of the socket connection from the command line, then obtains the execution environment, reads data from the socket connection, parses and transforms the data, and finally prints the results.
Every Flink program consists of essentially the same basic parts (a minimal skeleton follows the list):

  1. Obtain an execution environment,
  2. Load/create the initial data,
  3. Specify transformations on this data,
  4. Specify where to put the results of the computations,
  5. Trigger the program execution
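To make these five parts concrete, here is a minimal skeleton (a sketch only: the class name, host, port, and the map transformation are placeholders, not part of the official example):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Skeleton {
    public static void main(String[] args) throws Exception {
        // 1. Obtain an execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Load/create the initial data (here: a socket source)
        DataStream<String> lines = env.socketTextStream("localhost", 9000, "\n");
        // 3. Specify transformations on this data
        DataStream<Integer> lengths = lines.map(new MapFunction<String, Integer>() {
            @Override
            public Integer map(String value) {
                return value.length();
            }
        });
        // 4. Specify where to put the results of the computations
        lengths.print();
        // 5. Trigger the program execution
        env.execute("Skeleton");
    }
}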

Flink execution environment

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Every Flink program starts from this line of code. It returns an execution environment, which represents the context in which the program is currently executed. If the program is invoked standalone, this method creates a local execution environment, a LocalStreamEnvironment, via createLocalEnvironment(). This can be seen in its source code:

// Source file: org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java
public static StreamExecutionEnvironment getExecutionEnvironment() {
    if (contextEnvironmentFactory != null) {
        return contextEnvironmentFactory.createExecutionEnvironment();
    }
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    if (env instanceof ContextEnvironment) {
        return new StreamContextEnvironment((ContextEnvironment) env);
    } else if (env instanceof OptimizerPlanEnvironment || env instanceof PreviewPlanEnvironment) {
        return new StreamPlanEnvironment(env);
    } else {
        return createLocalEnvironment();
    }
}
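For comparison, an execution environment can also be created explicitly instead of letting getExecutionEnvironment() detect the context. A small sketch (the parallelism, host, port, and jar path below are placeholder values):

// Explicit local environment with parallelism 2, running in the same JVM
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(2);
// Explicit remote environment pointing at an already running cluster
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 8081, "/path/to/program.jar");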

Obtaining input data

DataStream<String> text = env.socketTextStream(hostname, port, "\n");

In this example the source data comes from a socket. This call creates a socket connection to the specified host and port, then creates a new data stream containing the unbounded sequence of strings received from the socket, decoded with the system's default character set. When the socket connection is closed, reading terminates immediately. Looking at the source shows that it actually constructs a SocketTextStreamFunction with the given socket configuration and then creates a data stream whose elements are continuously read from the socket connection:

// Source file: org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java
@PublicEvolving
public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter, long maxRetry) {
    return addSource(new SocketTextStreamFunction(hostname, port, delimiter, maxRetry),
            "Socket Stream");
}
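For reference, the three-argument overload called by the example forwards to this method with a maxRetry of 0, so the source terminates as soon as the connection is lost. A sketch of equivalent calls (hostname and port as in the example):

// Equivalent to env.socketTextStream(hostname, port, "\n"):
DataStream<String> text = env.socketTextStream(hostname, port, "\n", 0);
// A negative maxRetry makes the source retry the connection forever:
DataStream<String> retrying = env.socketTextStream(hostname, port, "\n", -1);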

The class hierarchy of SocketTextStreamFunction is as follows:
SocketTextStreamFunction class diagram

As the diagram shows, SocketTextStreamFunction is a subclass of SourceFunction, which is the basic interface of all Flink streaming data sources. SourceFunction is defined as follows:

// Source file: org/apache/flink/streaming/api/functions/source/SourceFunction.java
@Public
public interface SourceFunction<T> extends Function, Serializable {
    void run(SourceContext<T> ctx) throws Exception;
    void cancel();
    @Public
    interface SourceContext<T> {
        void collect(T element);
        @PublicEvolving
        void collectWithTimestamp(T element, long timestamp);
        @PublicEvolving
        void emitWatermark(Watermark mark);
        @PublicEvolving
        void markAsTemporarilyIdle();
        Object getCheckpointLock();
        void close();
    }
}

SourceFunction defines two methods, run and cancel, together with the inner interface SourceContext.

  • run(SourceContext): implements the data acquisition logic; data can be forwarded downstream through the ctx parameter.
  • cancel(): cancels the source. The run method usually contains a loop that continuously produces data, and cancel should cause that loop to terminate.
  • SourceContext: the interface through which the source emits elements and, possibly, watermarks; its type parameter is the type of the elements the source produces. A minimal custom source following this contract is sketched below.
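To make this contract concrete, below is a minimal custom source (a sketch: the class name and the emitted data are made up for illustration). It follows the usual pattern of sources like SocketTextStreamFunction: run loops while a volatile flag is set, and cancel clears the flag:

import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class CounterSource implements SourceFunction<Long> {
    private volatile boolean isRunning = true;
    private long counter = 0L;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            // hold the checkpoint lock so emitting and updating state stay atomic
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(counter++);
            }
            Thread.sleep(1000L);
        }
    }

    @Override
    public void cancel() {
        isRunning = false; // makes the loop in run() exit
    }
}

Such a source would be attached with env.addSource(new CounterSource()), which is exactly the path socketTextStream takes internally.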

With the SourceFunction interface understood, look at the concrete implementation of SocketTextStreamFunction (mainly its run method). The logic is very clear: it continuously reads data from the specified hostname and port, splits it into strings on the newline delimiter, and forwards the data downstream. Now back to the socketTextStream method of StreamExecutionEnvironment: it calls addSource and returns a DataStreamSource instance. Notice that the text variable in the example is of type DataStream, so why is the return type of the source DataStreamSource? Because DataStream is a parent class of DataStreamSource, as the class diagram below shows; this reflects Java polymorphism.
DataStreamSource class diagram

Dataflow operations

The DataStreamSource obtained above is then put through flatMap, keyBy, timeWindow, and reduce transformations.

DataStream<WordWithCount> windowCounts = text
        .flatMap(new FlatMapFunction<String, WordWithCount>() {
            @Override
            public void flatMap(String value, Collector<WordWithCount> out) {
                for (String word : value.split("\\s")) {
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        })
        .keyBy("word")
        .timeWindow(Time.seconds(5))
        .reduce(new ReduceFunction<WordWithCount>() {
            @Override
            public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                return new WordWithCount(a.word, a.count + b.count);
            }
        });

This code applies four transformations to the DataStreamSource obtained above: flatMap, keyBy, timeWindow, and reduce. The flatMap transformation is discussed below; readers can explore the source of the other three in the same way. As an aside, the keyBy call can also be written with an explicit KeySelector, sketched next.
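A sketch of the alternative keyBy form (same behavior, just more type-safe than the field-expression string):

import org.apache.flink.api.java.functions.KeySelector;

KeySelector<WordWithCount, String> byWord = new KeySelector<WordWithCount, String>() {
    @Override
    public String getKey(WordWithCount wc) {
        return wc.word;
    }
};
// used as .keyBy(byWord) in place of .keyBy("word")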

Now look at the source of the flatMap method:

// Source file: org/apache/flink/streaming/api/datastream/DataStream.java
public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
    TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),
            getType(), Utils.getCallLocationName(), true);
    return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));
}

This method does two things: first, it uses reflection to obtain the output type of the flatMap operator; second, it generates the operator itself. The core concept of Flink stream processing is that input data flows from a source through a chain of operators to a sink; each processing step is logically an operator. The transform call in the last line returns a SingleOutputStreamOperator, which extends DataStream and defines some helper methods that make it convenient to operate on the stream. Before returning, transform also registers the operator with the execution environment. The following diagram shows how a program maps to a Flink streaming dataflow:
Flink basic programming model
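Type extraction is also the reason a lambda version of this flatMap needs a hint: generic type information is erased from Java 8 lambdas, so the output type has to be supplied explicitly with returns(...). A sketch of the same transformation:

DataStream<WordWithCount> counts = text
        .flatMap((String value, Collector<WordWithCount> out) -> {
            for (String word : value.split("\\s")) {
                out.collect(new WordWithCount(word, 1L));
            }
        })
        .returns(WordWithCount.class); // tells Flink the erased output type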

Outputting the results

windowCounts.print().setParallelism(1);

Every Flink program runs from a source to a sink, and here the print method is a sink that writes the computed results to the standard output stream. In real development, results are usually written out to external systems such as Kafka, HBase, a file system, or Elasticsearch, through the connectors provided by Flink or through custom connectors. The setParallelism call sets the parallelism for this sink; the value must be greater than zero. A custom sink can also be attached with addSink, as sketched below.
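A minimal custom sink looks like this (a sketch that just mirrors print, writing each record to standard error instead):

import org.apache.flink.streaming.api.functions.sink.SinkFunction;

windowCounts.addSink(new SinkFunction<WordWithCount>() {
    @Override
    public void invoke(WordWithCount value) throws Exception {
        System.err.println(value); // uses WordWithCount.toString()
    }
}).setParallelism(1);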

Executing the program

env.execute("Socket Window WordCount");

Flink has two execution modes, remote and local, which differ slightly; here the analysis follows local mode. Look at the source of the execute method:

// Source file: org/apache/flink/streaming/api/environment/LocalStreamEnvironment.java
@Override
public JobExecutionResult execute(String jobName) throws Exception {
    // transform the streaming program into a JobGraph
    StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    JobGraph jobGraph = streamGraph.getJobGraph();
    jobGraph.setAllowQueuedScheduling(true);
    Configuration configuration = new Configuration();
    configuration.addAll(jobGraph.getJobConfiguration());
    configuration.setString(TaskManagerOptions.MANAGED_MEMORY_SIZE, "0");
    // add (and override) the settings with what the user defined
    configuration.addAll(this.configuration);
    if (!configuration.contains(RestOptions.BIND_PORT)) {
        configuration.setString(RestOptions.BIND_PORT, "0");
    }
    int numSlotsPerTaskManager = configuration.getInteger(TaskManagerOptions.NUM_TASK_SLOTS, jobGraph.getMaximumParallelism());
    MiniClusterConfiguration cfg = new MiniClusterConfiguration.Builder()
        .setConfiguration(configuration)
        .setNumSlotsPerTaskManager(numSlotsPerTaskManager)
        .build();
    if (LOG.isInfoEnabled()) {
        LOG.info("Running job on local embedded Flink mini cluster");
    }
    MiniCluster miniCluster = new MiniCluster(cfg);
    try {
        miniCluster.start();
        configuration.setInteger(RestOptions.PORT, miniCluster.getRestAddress().get().getPort());
        return miniCluster.executeJobBlocking(jobGraph);
    }
    finally {
        transformations.clear();
        miniCluster.close();
    }
}

This method does three things: it transforms the streaming program into a JobGraph, adds (and overrides) the settings defined by the user, and starts a MiniCluster to execute the job. The JobGraph is not discussed here; to follow the job execution, step into the line return miniCluster.executeJobBlocking(jobGraph);, whose source is as follows:

// Source file: org/apache/flink/runtime/minicluster/MiniCluster.java
@Override
public JobExecutionResult executeJobBlocking(JobGraph job) throws JobExecutionException, InterruptedException {
    checkNotNull(job, "job is null");
    final CompletableFuture<JobSubmissionResult> submissionFuture = submitJob(job);
    final CompletableFuture<JobResult> jobResultFuture = submissionFuture.thenCompose(
        (JobSubmissionResult ignored) -> requestJobResult(job.getJobID()));
    final JobResult jobResult;
    try {
        jobResult = jobResultFuture.get();
    } catch (ExecutionException e) {
        throw new JobExecutionException(job.getJobID(), "Could not retrieve JobResult.", ExceptionUtils.stripExecutionException(e));
    }
    try {
        return jobResult.toJobExecutionResult(Thread.currentThread().getContextClassLoader());
    } catch (IOException | ClassNotFoundException e) {
        throw new JobExecutionException(job.getJobID(), e);
    }
}

The core logic of this code is the line final CompletableFuture<JobSubmissionResult> submissionFuture = submitJob(job);, which calls the submitJob method of the MiniCluster class. Look at that method next:

// Source file: org/apache/flink/runtime/minicluster/MiniCluster.java
public CompletableFuture<JobSubmissionResult> submitJob(JobGraph jobGraph) {
    final CompletableFuture<DispatcherGateway> dispatcherGatewayFuture = getDispatcherGatewayFuture();
    // we have to allow queued scheduling in Flip-6 mode because we need to request slots
    // from the ResourceManager
    jobGraph.setAllowQueuedScheduling(true);
    final CompletableFuture<InetSocketAddress> blobServerAddressFuture = createBlobServerAddress(dispatcherGatewayFuture);
    final CompletableFuture<Void> jarUploadFuture = uploadAndSetJobFiles(blobServerAddressFuture, jobGraph);
    final CompletableFuture<Acknowledge> acknowledgeCompletableFuture = jarUploadFuture
        .thenCombine(
            dispatcherGatewayFuture,
            (Void ack, DispatcherGateway dispatcherGateway) -> dispatcherGateway.submitJob(jobGraph, rpcTimeout))
        .thenCompose(Function.identity());
    return acknowledgeCompletableFuture.thenApply(
        (Acknowledge ignored) -> new JobSubmissionResult(jobGraph.getJobID()));
}

The Dispatcher is the component responsible for receiving job submissions, persisting them, spawning JobManagers to execute the jobs, and recovering them in case of a master failure. There are two Dispatcher implementations: MiniDispatcher, started in a local environment, and StandaloneDispatcher, started in a cluster environment. The class diagram is as follows:
MiniDispatcher class diagram

Here the Dispatcher starts a JobManagerRunner, and the JobManagerRunner delegates the starting of the job to a JobMaster. The corresponding code is as follows:

// Source file: org/apache/flink/runtime/jobmaster/JobManagerRunner.java
private CompletableFuture<Void> verifyJobSchedulingStatusAndStartJobManager(UUID leaderSessionId) {
    final CompletableFuture<JobSchedulingStatus> jobSchedulingStatusFuture = getJobSchedulingStatus();
    return jobSchedulingStatusFuture.thenCompose(
        jobSchedulingStatus -> {
            if (jobSchedulingStatus == JobSchedulingStatus.DONE) {
                return jobAlreadyDone();
            } else {
                return startJobMaster(leaderSessionId);
            }
        });
}

Inside JobMaster, after a series of nested method calls, the following piece of logic is eventually executed:

// Source file: org/apache/flink/runtime/jobmaster/JobMaster.java
private void scheduleExecutionGraph() {
    checkState(jobStatusListener == null);
    // register self as job status change listener
    jobStatusListener = new JobManagerJobStatusListener();
    executionGraph.registerJobStatusListener(jobStatusListener);
    try {
        executionGraph.scheduleForExecution();
    }
    catch (Throwable t) {
        executionGraph.failGlobal(t);
    }
}

Here executionGraph.scheduleForExecution(); calls the startup method of the ExecutionGraph. In Flink's architecture, the ExecutionGraph is where the job is actually executed, so at this point the journey from submitting a job to actually running it is complete. To recap, the flow in local mode is:

  1. The client calls the execute method;
  2. MiniCluster completes most of the preparation work and then delegates the job directly to the MiniDispatcher;
  3. After receiving the job, the Dispatcher instantiates a JobManagerRunner and starts the job with this instance;
  4. The JobManagerRunner hands the job over to a JobMaster;
  5. The JobMaster uses the ExecutionGraph to start executing the whole graph, and with that the job is fully running.
