This article first appeared on Toutiao under the title "How is a Flink program executed? Analyzing a simple Flink program through its source code". For more content, follow the Toutiao account and the WeChat public account "Big Data and Artificial Intelligence" (WeChat search: bigdata_ai_tech), or the author's CSDN blog.
Earlier articles covered how to set up a local Flink environment, how to create a Flink application, and how to build Flink from source. This article uses the official SocketWindowWordCount example to walk through the basic steps of a typical Flink program.
The sample program
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class SocketWindowWordCount {
    public static void main(String[] args) throws Exception {
        // the host and the port to connect to
        final String hostname;
        final int port;
        try {
            final ParameterTool params = ParameterTool.fromArgs(args);
            hostname = params.has("hostname") ? params.get("hostname") : "localhost";
            port = params.getInt("port");
        } catch (Exception e) {
            System.err.println("No port specified. Please run 'SocketWindowWordCount " +
                "--hostname <hostname> --port <port>', where hostname (localhost by default) " +
                "and port is the address of the text server");
            System.err.println("To start a simple text server, run 'netcat -l <port>' and " +
                "type the input text into the command line");
            return;
        }
        // get the execution environment
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // get input data by connecting to the socket
        DataStream<String> text = env.socketTextStream(hostname, port, "\n");
        // parse the data, group it, window it, and aggregate the counts
        DataStream<WordWithCount> windowCounts = text
            .flatMap(new FlatMapFunction<String, WordWithCount>() {
                @Override
                public void flatMap(String value, Collector<WordWithCount> out) {
                    for (String word : value.split("\\s")) {
                        out.collect(new WordWithCount(word, 1L));
                    }
                }
            })
            .keyBy("word")
            .timeWindow(Time.seconds(5))
            .reduce(new ReduceFunction<WordWithCount>() {
                @Override
                public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                    return new WordWithCount(a.word, a.count + b.count);
                }
            });
        // print the results with a single thread, rather than in parallel
        windowCounts.print().setParallelism(1);
        env.execute("Socket Window WordCount");
    }
    // ------------------------------------------------------------------------
    /**
     * Data type for words with count.
     */
    public static class WordWithCount {
        public String word;
        public long count;
        public WordWithCount() {}
        public WordWithCount(String word, long count) {
            this.word = word;
            this.count = count;
        }
        @Override
        public String toString() {
            return word + " : " + count;
        }
    }
}
The above is the SocketWindowWordCount example from the official Flink website. It first reads the hostname and port of the socket connection from the command line, then obtains the execution environment, reads data from the socket connection, parses and transforms the data, and finally outputs the result.
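To make this description concrete, here is a minimal plain-Java sketch, with no Flink dependency, of what the pipeline computes for the lines that arrive within one window: split every line into words, group by word, and sum the counts. The class name WindowWordCountSketch and the method countWords are made up for this illustration, and the sketch ignores time, parallelism, and streaming entirely.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; none of these names come from Flink.
class WindowWordCountSketch {

    // Mirrors flatMap -> keyBy("word") -> reduce for one window's worth of lines.
    static Map<String, Long> countWords(List<String> linesInWindow) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String line : linesInWindow) {
            // flatMap: split the line on single whitespace characters, one record per word
            for (String word : line.split("\\s")) {
                // keyBy + reduce: records with the same word have their counts added
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(Arrays.asList("hello flink", "hello world")));
    }
}
```

For the two input lines in main, this prints {hello=2, flink=1, world=1}.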
Every Flink program consists of the same basic parts:
- Obtain an execution environment,
- Load/create the initial data,
- Specify the transformations on this data,
- Specify where to put the results of the computation,
- Trigger the program execution
Flink execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
Every Flink program starts from this line of code. It returns an execution environment, which represents the context in which the program is currently executed. If the program is invoked standalone, this method returns a local execution environment, a LocalStreamEnvironment created by createLocalEnvironment(). This can be seen in its source code:
// Source path: org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java
public static StreamExecutionEnvironment getExecutionEnvironment() {
    if (contextEnvironmentFactory != null) {
        return contextEnvironmentFactory.createExecutionEnvironment();
    }
    ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    if (env instanceof ContextEnvironment) {
        return new StreamContextEnvironment((ContextEnvironment) env);
    } else if (env instanceof OptimizerPlanEnvironment || env instanceof PreviewPlanEnvironment) {
        return new StreamPlanEnvironment(env);
    } else {
        return createLocalEnvironment();
    }
}
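The lookup pattern in this method (use a context-provided factory if one has been installed, otherwise fall back to a local environment) can be sketched in plain Java. This is only an illustration of the pattern: Env, contextFactory, and EnvironmentLookupSketch are hypothetical stand-ins, not Flink classes, and the intermediate instanceof branches are left out.

```java
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration; these are not Flink types.
class EnvironmentLookupSketch {

    interface Env {
        String describe();
    }

    // Would be installed by whatever submits the program to a cluster; null when run standalone.
    static Supplier<Env> contextFactory = null;

    static Env getExecutionEnvironment() {
        if (contextFactory != null) {
            // A context is present: let it decide which environment to create.
            return contextFactory.get();
        }
        // No context: fall back to a local environment.
        return () -> "local";
    }

    public static void main(String[] args) {
        System.out.println(getExecutionEnvironment().describe()); // local
        contextFactory = () -> () -> "cluster";
        System.out.println(getExecutionEnvironment().describe()); // cluster
    }
}
```

The benefit of this design is that the same program source runs unchanged in an IDE and on a cluster; only the surrounding context differs.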
Obtaining input data
DataStream<String> text = env.socketTextStream(hostname, port, "\n");
In this example the source data comes from a socket. The call opens a socket connection using the given configuration and creates a new data stream containing the unbounded sequence of strings received from the socket; received strings are decoded with the system's default character set. When the socket connection is closed, reading terminates immediately. Looking at the source shows that the method constructs a SocketTextStreamFunction instance from the given socket configuration and then creates a data stream that continuously reads its input from the socket connection.
// Source path: org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java
@PublicEvolving
public DataStreamSource<String> socketTextStream(String hostname, int port, String delimiter, long maxRetry) {
    return addSource(new SocketTextStreamFunction(hostname, port, delimiter, maxRetry),
        "Socket Stream");
}
The class hierarchy of SocketTextStreamFunction is as follows:
As can be seen, SocketTextStreamFunction implements SourceFunction, which is the base interface of all stream data sources in Flink. SourceFunction is defined as follows:
// Source path: org/apache/flink/streaming/api/functions/source/SourceFunction.java
@Public
public interface SourceFunction<T> extends Function, Serializable {
    void run(SourceContext<T> ctx) throws Exception;
    void cancel();
    @Public
    interface SourceContext<T> {
        void collect(T element);
        @PublicEvolving
        void collectWithTimestamp(T element, long timestamp);
        @PublicEvolving
        void emitWatermark(Watermark mark);
        @PublicEvolving
        void markAsTemporarilyIdle();
        Object getCheckpointLock();
        void close();
    }
}
SourceFunction defines two methods, run and cancel, and an inner interface, SourceContext.
- run(SourceContext): implements the data-acquisition logic; data is forwarded to downstream nodes through the ctx parameter.
- cancel(): cancels the data source. The run method usually contains a loop that continuously produces data; calling cancel should cause that loop to terminate.
- SourceContext: the interface through which a source function emits elements and, possibly, watermarks; its type parameter is the type of the elements the source produces.
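The interplay of run and cancel can be sketched without Flink. The following is a reduced, hypothetical version of the contract: SimpleSource keeps only run and cancel, and a plain java.util.function.Consumer stands in for SourceContext.collect. The volatile flag is a common way to let cancel stop the loop; treat the details as an assumption of this sketch rather than as Flink's implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class CounterSourceSketch {

    // Hypothetical, reduced contract: not the real SourceFunction interface.
    interface SimpleSource<T> {
        void run(Consumer<T> ctx) throws Exception;
        void cancel();
    }

    static class CounterSource implements SimpleSource<Long> {
        // volatile so that a cancel() issued from another thread is seen by the loop in run()
        private volatile boolean isRunning = true;
        private long next = 0;

        @Override
        public void run(Consumer<Long> ctx) {
            // Keep emitting elements downstream until cancel() flips the flag.
            while (isRunning) {
                ctx.accept(next++);
            }
        }

        @Override
        public void cancel() {
            isRunning = false;
        }
    }

    // Runs the source and cancels it once `limit` elements have been collected.
    static List<Long> collectFirst(int limit) {
        List<Long> out = new ArrayList<>();
        CounterSource source = new CounterSource();
        source.run(value -> {
            out.add(value);
            if (out.size() >= limit) {
                source.cancel();
            }
        });
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectFirst(3)); // [0, 1, 2]
    }
}
```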
With the SourceFunction interface understood, look at the concrete implementation in SocketTextStreamFunction (mainly its run method). The logic is straightforward: it continuously reads data from the specified hostname and port, splits it into strings at the configured delimiter (a newline in this example), and forwards the data downstream. Now back to the socketTextStream method of StreamExecutionEnvironment: by calling addSource it returns a DataStreamSource instance. In the example the text variable has type DataStream, so why is the return type of the source DataStreamSource? Because DataStream is a parent class of DataStreamSource, as the class diagram shows; this is Java polymorphism at work.
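The reading logic described above (buffer incoming characters, cut a record at every delimiter, forward it) can be sketched with a java.io.Reader in place of a socket. This is a simplified illustration, not the code of SocketTextStreamFunction: the class and method names are hypothetical, the chunk size is deliberately tiny, and emitting the unterminated tail at end of stream is a choice made for this sketch.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DelimiterSplitSketch {

    // Reads chunks from the reader, emits every complete record that ends with the
    // delimiter, and keeps the incomplete tail in the buffer until more data arrives.
    static List<String> readRecords(Reader reader, String delimiter) {
        List<String> records = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[4]; // tiny on purpose, so records span several reads
        try {
            int read;
            while ((read = reader.read(chunk)) != -1) {
                buffer.append(chunk, 0, read);
                int pos;
                while ((pos = buffer.indexOf(delimiter)) != -1) {
                    records.add(buffer.substring(0, pos));
                    buffer.delete(0, pos + delimiter.length());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Sketch-level choice: whatever is left when the stream ends becomes a final record.
        if (buffer.length() > 0) {
            records.add(buffer.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(readRecords(new StringReader("hello flink\nhello world\ntail"), "\n"));
    }
}
```

Running main prints [hello flink, hello world, tail].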
Dataflow operations
The DataStreamSource obtained above then goes through the flatMap, keyBy, timeWindow, and reduce transformations.
DataStream<WordWithCount> windowCounts = text
    .flatMap(new FlatMapFunction<String, WordWithCount>() {
        @Override
        public void flatMap(String value, Collector<WordWithCount> out) {
            for (String word : value.split("\\s")) {
                out.collect(new WordWithCount(word, 1L));
            }
        }
    })
    .keyBy("word")
    .timeWindow(Time.seconds(5))
    .reduce(new ReduceFunction<WordWithCount>() {
        @Override
        public WordWithCount reduce(WordWithCount a, WordWithCount b) {
            return new WordWithCount(a.word, a.count + b.count);
        }
    });
This logic applies four transformations, flatMap, keyBy, timeWindow, and reduce, to the DataStreamSource obtained above. Only the flatMap transformation is discussed below; readers can study the source code of the other three on their own.
The source code of the flatMap method is as follows.
// Source path: org/apache/flink/streaming/api/datastream/DataStream.java
public <R> SingleOutputStreamOperator<R> flatMap(FlatMapFunction<T, R> flatMapper) {
    TypeInformation<R> outType = TypeExtractor.getFlatMapReturnTypes(clean(flatMapper),
        getType(), Utils.getCallLocationName(), true);
    return transform("Flat Map", outType, new StreamFlatMap<>(clean(flatMapper)));
}
Two things happen here: first, the output type of the flatMap operator is obtained via reflection; second, an operator is created. The core idea of Flink stream processing is that data flows from the input stream through a chain of operators, one after another, to the output stream. Each processing step applied to the data is, logically, an operator. The transform method on the last line of the code above returns a SingleOutputStreamOperator, which extends the DataStream class and defines some helper methods that make working with streams more convenient. Before returning, transform also registers the new transformation with the execution environment. Below is a schematic of how a program maps to a Flink streaming dataflow:
Outputting the results
windowCounts.print().setParallelism(1);
Every Flink program starts at a source and ends at a sink. Here the print method sinks the computed result to the standard output stream. In real-world development, results are usually written to a designated destination such as Kafka, HBase, a file system, or Elasticsearch, using either the connectors provided on the official website or custom connectors. The setParallelism call sets the parallelism of this sink; the value must be greater than zero.
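Note that up to this point nothing has been computed: each call such as flatMap or print only registers a step with the execution environment, and the steps run when execution is triggered. This deferred style can be sketched in plain Java. MiniEnv and its methods are hypothetical and far simpler than Flink's environment; the sketch only shows "record now, run on execute".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class LazyPipelineSketch {

    // Hypothetical miniature environment; not a Flink class.
    static class MiniEnv {
        private final List<String> names = new ArrayList<>();
        private final List<Function<List<String>, List<String>>> steps = new ArrayList<>();

        // Only records the step under a name; no data is processed here.
        MiniEnv transform(String name, Function<List<String>, List<String>> step) {
            names.add(name);
            steps.add(step);
            return this;
        }

        List<String> registered() {
            return names;
        }

        // Runs all recorded steps in registration order.
        List<String> execute(List<String> input) {
            List<String> data = input;
            for (Function<List<String>, List<String>> step : steps) {
                data = step.apply(data);
            }
            return data;
        }
    }

    // Builds an environment with one step that notes in `trace` when it actually runs.
    static MiniEnv upperCaseEnv(List<String> trace) {
        return new MiniEnv().transform("To Upper", words -> {
            trace.add("To Upper ran");
            List<String> out = new ArrayList<>();
            for (String w : words) {
                out.add(w.toUpperCase());
            }
            return out;
        });
    }

    public static void main(String[] args) {
        List<String> trace = new ArrayList<>();
        MiniEnv env = upperCaseEnv(trace);
        System.out.println(env.registered() + " " + trace); // [To Upper] []
        System.out.println(env.execute(Arrays.asList("a", "b")) + " " + trace); // [A, B] [To Upper ran]
    }
}
```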
Executing the program
env.execute("Socket Window WordCount");
Flink has two execution modes, remote and local, which differ slightly; the analysis here follows local mode. The source code of the execute method is as follows:
// Source path: org/apache/flink/streaming/api/environment/LocalStreamEnvironment.java
@Override
public JobExecutionResult execute(String jobName) throws Exception {
    // transform the streaming program into a JobGraph
    StreamGraph streamGraph = getStreamGraph();
    streamGraph.setJobName(jobName);
    JobGraph jobGraph = streamGraph.getJobGraph();
    jobGraph.setAllowQueuedScheduling(true);
    Configuration configuration = new Configuration();
    configuration.addAll(jobGraph.getJobConfiguration());
    configuration.setString(TaskManagerOptions.MANAGED_MEMORY_SIZE, "0");
    // add (and override) the settings with what the user defined
    configuration.addAll(this.configuration);
    if (!configuration.contains(RestOptions.BIND_PORT)) {
        configuration.setString(RestOptions.BIND_PORT, "0");
    }
    int numSlotsPerTaskManager = configuration.getInteger(TaskManagerOptions.NUM_TASK_SLOTS, jobGraph.getMaximumParallelism());
    MiniClusterConfiguration cfg = new MiniClusterConfiguration.Builder()
        .setConfiguration(configuration)
        .setNumSlotsPerTaskManager(numSlotsPerTaskManager)
        .build();
    if (LOG.isInfoEnabled()) {
        LOG.info("Running job on local embedded Flink mini cluster");
    }
    MiniCluster miniCluster = new MiniCluster(cfg);
    try {
        miniCluster.start();
        configuration.setInteger(RestOptions.PORT, miniCluster.getRestAddress().get().getPort());
        return miniCluster.executeJobBlocking(jobGraph);
    }
    finally {
        transformations.clear();
        miniCluster.close();
    }
}
This method does three things: it converts the streaming program into a JobGraph, adds (and overrides) settings with what the user defined, and starts a MiniCluster to execute the job. JobGraph is left aside for now; here we only follow job execution. Following the line return miniCluster.executeJobBlocking(jobGraph); leads to this source:
// Source path: org/apache/flink/runtime/minicluster/MiniCluster.java
@Override
public JobExecutionResult executeJobBlocking(JobGraph job) throws JobExecutionException, InterruptedException {
    checkNotNull(job, "job is null");
    final CompletableFuture<JobSubmissionResult> submissionFuture = submitJob(job);
    final CompletableFuture<JobResult> jobResultFuture = submissionFuture.thenCompose(
        (JobSubmissionResult ignored) -> requestJobResult(job.getJobID()));
    final JobResult jobResult;
    try {
        jobResult = jobResultFuture.get();
    } catch (ExecutionException e) {
        throw new JobExecutionException(job.getJobID(), "Could not retrieve JobResult.", ExceptionUtils.stripExecutionException(e));
    }
    try {
        return jobResult.toJobExecutionResult(Thread.currentThread().getContextClassLoader());
    } catch (IOException | ClassNotFoundException e) {
        throw new JobExecutionException(job.getJobID(), e);
    }
}
The core of this code is the line final CompletableFuture<JobSubmissionResult> submissionFuture = submitJob(job);, which calls the submitJob method of the MiniCluster class. Next, look at that method:
// Source path: org/apache/flink/runtime/minicluster/MiniCluster.java
public CompletableFuture<JobSubmissionResult> submitJob(JobGraph jobGraph) {
    final CompletableFuture<DispatcherGateway> dispatcherGatewayFuture = getDispatcherGatewayFuture();
    // we have to allow queued scheduling in Flip-6 mode because we need to request slots
    // from the ResourceManager
    jobGraph.setAllowQueuedScheduling(true);
    final CompletableFuture<InetSocketAddress> blobServerAddressFuture = createBlobServerAddress(dispatcherGatewayFuture);
    final CompletableFuture<Void> jarUploadFuture = uploadAndSetJobFiles(blobServerAddressFuture, jobGraph);
    final CompletableFuture<Acknowledge> acknowledgeCompletableFuture = jarUploadFuture
        .thenCombine(
            dispatcherGatewayFuture,
            (Void ack, DispatcherGateway dispatcherGateway) -> dispatcherGateway.submitJob(jobGraph, rpcTimeout))
        .thenCompose(Function.identity());
    return acknowledgeCompletableFuture.thenApply(
        (Acknowledge ignored) -> new JobSubmissionResult(jobGraph.getJobID()));
}
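The future chaining used in submitJob may be unfamiliar: thenCombine waits for two futures, and because the combining lambda itself returns a future, the result is nested and has to be flattened with thenCompose(Function.identity()). The following standalone sketch reproduces only that chaining with the JDK's CompletableFuture. Gateway, the "ack" value, and the job id string are hypothetical placeholders, not Flink types.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

class FutureChainSketch {

    // Hypothetical stand-in for a gateway whose submit call is itself asynchronous.
    interface Gateway {
        CompletableFuture<String> submit(String jobId);
    }

    static CompletableFuture<String> submitJob(String jobId) {
        Gateway gateway = id -> CompletableFuture.completedFuture("ack");
        CompletableFuture<Gateway> gatewayFuture = CompletableFuture.completedFuture(gateway);
        // Stands in for the asynchronous upload step; it does no real work here.
        CompletableFuture<Void> uploadFuture = CompletableFuture.runAsync(() -> { });

        return uploadFuture
            // Waits for both futures. The lambda returns a future, so the result type
            // is CompletableFuture<CompletableFuture<String>>.
            .thenCombine(gatewayFuture, (Void ignored, Gateway g) -> g.submit(jobId))
            // Flattens the nested future into CompletableFuture<String>.
            .thenCompose(Function.identity())
            // Maps the acknowledgement to the value handed back to the caller.
            .thenApply(ack -> "submitted " + jobId);
    }

    public static void main(String[] args) {
        System.out.println(submitJob("job-1").join()); // submitted job-1
    }
}
```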
The Dispatcher component is responsible for receiving job submissions, persisting them, spawning JobManagers to execute the jobs, and recovering them in case of a master failure. Dispatcher has two implementations: MiniDispatcher is started in the local environment, and StandaloneDispatcher is started in a cluster environment. The class diagram is shown below:
Here the Dispatcher starts a JobManagerRunner and delegates to it the task of starting the job's JobMaster. The corresponding code is as follows:
// Source path: org/apache/flink/runtime/jobmaster/JobManagerRunner.java
private CompletableFuture<Void> verifyJobSchedulingStatusAndStartJobManager(UUID leaderSessionId) {
    final CompletableFuture<JobSchedulingStatus> jobSchedulingStatusFuture = getJobSchedulingStatus();
    return jobSchedulingStatusFuture.thenCompose(
        jobSchedulingStatus -> {
            if (jobSchedulingStatus == JobSchedulingStatus.DONE) {
                return jobAlreadyDone();
            } else {
                return startJobMaster(leaderSessionId);
            }
        });
}
After a series of nested method calls, JobMaster eventually executes the following logic:
// Source path: org/apache/flink/runtime/jobmaster/JobMaster.java
private void scheduleExecutionGraph() {
    checkState(jobStatusListener == null);
    // register self as job status change listener
    jobStatusListener = new JobManagerJobStatusListener();
    executionGraph.registerJobStatusListener(jobStatusListener);
    try {
        executionGraph.scheduleForExecution();
    }
    catch (Throwable t) {
        executionGraph.failGlobal(t);
    }
}
Here executionGraph.scheduleForExecution(); calls the start-up method of ExecutionGraph. In Flink's architecture, ExecutionGraph is where execution actually takes place, so at this point the path from submitting a job to actually executing it is complete. To recap, the execution flow in the local environment is:
- The client calls the execute method;
- MiniCluster does most of the work and then hands the job directly to MiniDispatcher;
- After receiving the job, the Dispatcher instantiates a JobManagerRunner and uses this instance to start the job;
- JobManagerRunner then hands the job to JobMaster for processing;
- JobMaster uses the ExecutionGraph's methods to start the whole execution graph, at which point the job is up and running.