Flink 1.8 DataStream API Programming Guide

 
DataStream programs in Flink are regular programs that implement transformations on data streams (e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may, for example, write the data to files or to standard output (e.g., the command line terminal). Flink programs run in a variety of contexts, standalone or embedded in other programs. The execution can happen in a local JVM or on clusters of many machines.

Example Program

The following program is a complete, working example of a streaming window word count application, which counts the words coming from a web socket in 5-second windows. You can copy and paste the code to run it locally.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> dataStream = env
                .socketTextStream("localhost", 9999)
                .flatMap(new Splitter())
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1);

        dataStream.print();

        env.execute("Window WordCount");
    }

    public static class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
            for (String word: sentence.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }

}
To run the example program, first start the input stream with netcat from a terminal:
nc -lk 9999
Just type some words, hitting return for a new word. These will be the input to the word count program. If you want to see counts greater than 1, type the same word repeatedly within 5 seconds (increase the window size from 5 seconds if you cannot type that fast ☺).
 

Data Sources

Sources are where your program reads its input from. You can attach a source to your program by using StreamExecutionEnvironment.addSource(sourceFunction). Flink comes with a number of pre-implemented source functions, but you can always write your own custom sources by implementing the SourceFunction interface for non-parallel sources, or by implementing the ParallelSourceFunction interface or extending RichParallelSourceFunction for parallel sources (a minimal custom-source sketch follows the list below).
 
Several predefined stream sources are accessible from the StreamExecutionEnvironment:
File-based:
  • readTextFile(path)
  • readFile(fileInputFormat, path)
  • readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)
Socket-based:
  • socketTextStream - Reads from a socket. Elements can be separated by a delimiter.
Collection-based:
  • fromCollection(Collection)
  • fromCollection(Iterator, Class)
  • fromElements(T ...)
  • fromParallelCollection(SplittableIterator, Class)
  • generateSequence(from, to)
Custom:
  • addSource - Attach a new source function.
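As a minimal sketch of the custom-source route (the class name and emission logic are illustrative assumptions, not part of this guide), a non-parallel source implementing the SourceFunction interface might look like this:
import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Hypothetical non-parallel source: emits an increasing counter until cancelled.
// Implementing SourceFunction (rather than ParallelSourceFunction) means
// Flink runs it with parallelism 1.
public class CounterSource implements SourceFunction<Long> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        while (running) {
            // Emit under the checkpoint lock so emission stays atomic
            // with respect to checkpointing.
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(counter++);
            }
            Thread.sleep(10);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
It would then be attached to a program with env.addSource(new CounterSource()).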
 

DataStream Transformations

For an overview of the available stream transformations, please see the operators documentation.
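As a quick sketch of what chaining transformations looks like (the specific operations here are illustrative, not an exhaustive list):
DataStream<String> lines = env.socketTextStream("localhost", 9999);

// map: transform each line into its length
DataStream<Integer> lengths = lines.map(line -> line.length());

// filter: keep only lines longer than 10 characters
DataStream<Integer> longOnes = lengths.filter(len -> len > 10);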
 

Data Sinks

Data sinks consume DataStreams and forward them to files, sockets, external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataStreams:
  • writeAsText() / TextOutputFormat
  • writeAsCsv(...) / CsvOutputFormat
  • print() / printToErr()
  • writeUsingOutputFormat() / FileOutputFormat
  • writeToSocket
  • addSink - Invokes a custom sink function (see the sketch after this list). Flink comes bundled with connectors to other systems (such as Apache Kafka) that are implemented as sink functions.
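As a small sketch (the output path and the logging sink are illustrative assumptions), writing the word counts from the example above to a file and attaching a custom sink could look like this:
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// Write elements line-wise as strings (obtained via toString()) to a hypothetical path.
dataStream.writeAsText("/tmp/wordcounts");

// addSink attaches a custom sink; this anonymous SinkFunction just prints
// each element, standing in for a real external system.
dataStream.addSink(new SinkFunction<Tuple2<String, Integer>>() {
    @Override
    public void invoke(Tuple2<String, Integer> value) {
        System.out.println("received: " + value);
    }
});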

Iterations

Iterative streaming programs implement a step function and embed it into an IterativeStream. As a DataStream program may never finish, there is no maximum number of iterations. Instead, you need to specify which part of the stream is fed back to the iteration and which part is forwarded downstream, using either a split transformation or a filter. Here, we show an example using filters. First, we define an IterativeStream.
IterativeStream<Integer> iteration = input.iterate();
Then, we specify the logic that is executed inside the loop as a series of transformations (here, a simple map transformation):
DataStream<Integer> iterationBody = iteration.map(/* this is executed many times */); 
To close an iteration and define the iteration tail, call the closeWith(feedbackStream) method of the IterativeStream. The DataStream given to the closeWith function is fed back to the head of the iteration. A common pattern is to use a filter to separate the part of the stream that is fed back from the part that is propagated forward. These filters can, for example, define the "termination" logic, which decides whether an element propagates downstream rather than being fed back.
iteration.closeWith(iterationBody.filter(/* one part of the stream */));
DataStream<Integer> output = iterationBody.filter(/* some other part of the stream */);
For example, here is a program that continuously subtracts 1 from a series of integers until they reach zero:
DataStream<Long> someIntegers = env.generateSequence(0, 1000);

IterativeStream<Long> iteration = someIntegers.iterate();

DataStream<Long> minusOne = iteration.map(value -> value - 1);

DataStream<Long> stillGreaterThanZero = minusOne.filter(value -> (value > 0));

iteration.closeWith(stillGreaterThanZero);

DataStream<Long> lessThanZero = minusOne.filter(value -> (value <= 0));

Execution Parameters

The StreamExecutionEnvironment contains the ExecutionConfig, which allows setting job-specific configuration values for the runtime.
For an explanation of most parameters, please refer to the execution configuration documentation. This parameter pertains specifically to the DataStream API:
  • setAutoWatermarkInterval(long milliseconds): Set the interval for automatic watermark emission. You can get the current value with long getAutoWatermarkInterval().
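A minimal sketch of setting the interval through the ExecutionConfig (the 100 ms value is just an example, not a recommendation from this guide):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Emit watermarks automatically every 100 ms (example value).
env.getConfig().setAutoWatermarkInterval(100);

// Read the current interval back.
long interval = env.getConfig().getAutoWatermarkInterval();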
 

Fault Tolerance

The State & Checkpointing documentation describes how to enable and configure Flink's checkpointing mechanism.
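As a brief, hedged sketch (the 5-second interval and exactly-once mode are example choices, not recommendations from this guide), enabling checkpointing on the environment looks like this:
import org.apache.flink.streaming.api.CheckpointingMode;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Take a checkpoint every 5000 ms with exactly-once processing guarantees.
env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE);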
 

Controlling Latency

By default, elements are not transferred over the network one-by-one (which would cause unnecessary network traffic) but are buffered. The size of the buffers (which are actually transferred between machines) can be set in the Flink configuration files. While this method is good for optimizing throughput, it can cause latency issues when the incoming stream is not fast enough. To control throughput and latency, you can use env.setBufferTimeout(timeoutMillis) on the execution environment (or on individual operators) to set a maximum wait time for the buffers to fill up. After this time, the buffers are sent automatically even if they are not full. The default value for this timeout is 100 milliseconds.
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();
env.setBufferTimeout(timeoutMillis);

env.generateSequence(1,10).map(new MyMapper()).setBufferTimeout(timeoutMillis);
To maximize throughput, set setBufferTimeout(-1), which removes the timeout so that buffers are only flushed when they are full. To minimize latency, set the timeout to a value close to 0 (for example, 5 or 10 ms). Avoid a buffer timeout of 0, since it can cause severe performance degradation.
 

Debugging

Before running a streaming program distributed in a cluster, it is a good idea to make sure that the implemented algorithm works as desired. Hence, implementing data analysis programs is usually an incremental process of checking results, debugging, and improving.
 
Flink significantly simplifies the development process of data analysis programs by supporting local debugging from within an IDE, injection of test data, and collection of result data. This section gives some hints on how to ease the development of Flink programs.
 

Local Execution Environment

A LocalStreamEnvironment starts a Flink system within the same JVM process it is created in. If you start the LocalEnvironment from an IDE, you can set breakpoints in your code and easily debug your program.
A LocalEnvironment is created and used as follows:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

DataStream<String> lines = env.addSource(/* some source */);
// build your program

env.execute();

Collection Data Sources

Flink provides special data sources backed by Java collections to ease testing. Once a program has been tested, the sources and sinks can easily be replaced by sources and sinks that read from / write to external systems.
 
Collection data sources can be used as follows:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironment();

// Create a DataStream from a list of elements
DataStream<Integer> myInts = env.fromElements(1, 2, 3, 4, 5);

// Create a DataStream from any Java collection
List<Tuple2<String, Integer>> data = ...
DataStream<Tuple2<String, Integer>> myTuples = env.fromCollection(data);

// Create a DataStream from an Iterator
Iterator<Long> longIt = ...
DataStream<Long> myLongs = env.fromCollection(longIt, Long.class);
Note: Currently, the collection data source requires that data types and iterators implement Serializable. Furthermore, collection data sources cannot be executed in parallel (parallelism = 1).

Iterator Data Sink

Flink also provides a sink to collect DataStream results for testing and debugging purposes. It can be used as follows:
import org.apache.flink.streaming.experimental.DataStreamUtils;

DataStream<Tuple2<String, Integer>> myResult = ...
Iterator<Tuple2<String, Integer>> myOutput = DataStreamUtils.collect(myResult);
Note: The flink-streaming-contrib module was removed in Flink 1.5.0. Its classes have been moved to flink-streaming-java and flink-streaming-scala.
