Basic usage of the Flink API

Overview

Flink represents data sets with DataSet and DataStream. DataSet is used for batch processing and represents bounded data, while DataStream is used for stream processing and represents unbounded data. Data in a data set is immutable, meaning elements cannot be added or removed. We create an initial DataSet or DataStream from a data source, then derive new data sets from it with transformation (transform) operations such as map and filter.

Writing a Flink program generally involves the following steps (a sketch combining them follows the list):

  • Obtain an execution environment
  • Create the input data
  • Apply transformation operations to the data sets (hereinafter collectively referred to as: transforms)
  • Output the data
  • Trigger execution
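As a preview, the following minimal sketch strings these five steps together in one stream-processing program (the class name and input path are placeholders; the map transform used in step 3 is explained later in this article):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BasicSkeleton {
    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Create the input data
        DataStream<String> text = env.readTextFile("file:///D:\\words.txt");
        // 3. Apply a transform to the data set
        DataStream<Tuple2<String, Integer>> words = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                return new Tuple2<>(s, 1);
            }
        });
        // 4. Output the data
        words.print();
        // 5. Trigger execution
        env.execute();
    }
}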

Below we introduce the basic APIs involved in writing Flink programs.

Input and output

First, you need to obtain an execution environment. Flink provides the following three ways (the latter two are sketched after this list):

  • getExecutionEnvironment()
  • createLocalEnvironment()
  • createRemoteEnvironment(String host, int port, String... jarFiles)
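The first approach is used in the examples below; for reference, here is a sketch of the other two (the host, port, and jar path are placeholder values):

// Explicitly start a local environment in the current JVM
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment();

// Connect to a remote cluster; the jar containing the program code is shipped along
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment(
        "flink-master", 8081, "/path/to/your-job.jar");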

The following code uses the first approach to create an execution environment.

Batch:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> text = env.readTextFile("file:///D:\\words.txt");
text.print();

Stream processing:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("file:///D:\\words.txt");
text.print();
env.execute();

The code above creates an execution environment and uses env to create an input source. You can call the print method on a data set to output its data to the console; of course, other methods such as writeAsText can be invoked to output the data to other media. Note that the last line of the stream-processing code calls the execute method: stream processing needs this explicit call to trigger execution of the program.
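For example, writing the stream to a text file instead of the console could look like this (a sketch continuing the stream example above; the output path is a placeholder):

// Write each element as a line of text to the given path
text.writeAsText("file:///D:\\output.txt");
env.execute();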

There are two ways to run the above code. One is to execute it directly in the IDE, like a normal Java program; Flink will start a local execution environment. The other is to package it and submit it to a Flink cluster for execution.

The example above contains the basic skeleton of a Flink program, but applies no further transform operations to the data set. Below we briefly introduce the basic transform operations.

map operation

The map operation here is similar to the map in MapReduce: it parses and processes each record of the data. Examples are as follows.

Batch:

DataSet<Tuple2<String, Integer>> words = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String s) throws Exception {
        return new Tuple2<>(s, 1);
    }
});
words.print();

Stream processing:

DataStream<Tuple2<String, Integer>> words = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
    @Override
    public Tuple2<String, Integer> map(String s) throws Exception {
        return new Tuple2<>(s, 1);
    }
});
words.print();

Apart from the different data set types, the batch and stream versions of this code are written identically. The map turns each word into a (word, 1) tuple. Similar to map there is also the filter transform, which filters out unwanted records, as sketched below.
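A minimal sketch of filter, keeping only non-empty lines of the text stream above (FilterFunction comes from org.apache.flink.api.common.functions):

DataStream<String> nonEmpty = text.filter(new FilterFunction<String>() {
    @Override
    public boolean filter(String s) throws Exception {
        // keep only the records for which the predicate returns true
        return !s.isEmpty();
    }
});
nonEmpty.print();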

Specifying the key

Big data processing often needs to proceed along some dimension, which requires specifying a key. On a DataSet the key is specified with groupBy, and on a DataStream with keyBy. Below we introduce keyBy with examples.

Flink's data model is not based on key-value pairs; keys are virtual and can be seen as functions defined over the data.

Defining keys on Tuples

// 0 refers to the first element of the Tuple2 (2-tuple)
KeyedStream<Tuple2<String, Integer>, Tuple> keyed = words.keyBy(0);
// 0,1 means the first and second elements of the 2-tuple together form the key
KeyedStream<Tuple2<String, Integer>, Tuple> keyed2 = words.keyBy(0, 1);
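Once the key is specified, per-key aggregations can be applied. Here is a sketch continuing from the words stream above, counting occurrences per word (on a batch DataSet, words.groupBy(0).sum(1) would play the same role):

// Key by the word (field 0), then sum the counts (field 1) per key
DataStream<Tuple2<String, Integer>> counts = words.keyBy(0).sum(1);
counts.print();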

Nested tuple

DataStream<Tuple3<Tuple2<Integer, Float>,String,Long>> ds;

ds.keyBy(0) will use the entire Tuple2<Integer, Float> as the key.

Specifying the key with field expressions

public class WC {
  public String word;
  public int count;
}

DataStream<WC> words = // [...]
DataStream<WC> wordCounts = words.keyBy("word");

Here the word field of the WC object is designated as the key.

Field expression syntax is as follows:

  • A Java object's field name can be used as the key, as in the example above
  • For Tuple types, the key can be specified by field name (f0, f1, ...) or by 0-based offset; for example, f0 and 5 refer to the first and sixth fields of a Tuple
  • Nested fields of Java objects and Tuples can also be used as keys; for example, f1.user.zip selects the zip field of the user object held in the Tuple's second field as the key
  • The wildcard * selects the full type as the key

Field expression examples

public static class WC {
  public ComplexNestedClass complex; //nested POJO
  private int count;
  // getter / setter for private field (count)
  public int getCount() {
    return count;
  }
  public void setCount(int c) {
    this.count = c;
  }
}
public static class ComplexNestedClass {
  public Integer someNumber;
  public float someFloat;
  public Tuple3<Long, Long, String> word;
  public IntWritable hadoopCitizen;
}
  • "Count": count like WC Fields
  • "Complex": complex of all fields (recursively)
  • "Complex.word.f2": ComplexNestedClass category word of the third field triplets
  • "Complex.hadoopCitizen": complex class hadoopCitizen field

Using a KeySelector to specify the key

A KeySelector function takes each input element and returns the key extracted from it; that key is then used as the specified key. Example:

words.keyBy(new KeySelector<Tuple2<String, Integer>, Object>() {
    @Override
    public Object getKey(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
        return stringIntegerTuple2.f0;
    }
});

We can see this achieves the same effect as keyBy(0).
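Since KeySelector has a single method, with Java 8 the same selector can also be written as a lambda (a sketch; the key type is then String rather than Object):

// Equivalent key selector written as a lambda
KeyedStream<Tuple2<String, Integer>, String> byWord = words.keyBy(value -> value.f0);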

Reprinted from: Flink basic API usage

Reproduced from: https://www.jianshu.com/p/6c0f20660c63
