Flink basic API

Flink represents data sets with DataSet and DataStream. A DataSet is for batch processing and represents bounded data; a DataStream is for stream processing and represents unbounded data. Data sets are immutable: elements cannot be added or removed. We create a DataSet or DataStream from a data source, then generate new data sets from it with transformation (transform) operations such as map and filter.

Writing a Flink program generally involves the following steps:

  • Obtain an execution environment
  • Create the input data
  • Perform transformation operations on the data sets (hereafter: transform)
  • Output the data
  • Trigger execution

Below we introduce the basic APIs involved in writing a Flink program.

Input and output

First, you need to obtain an execution environment. Flink provides three ways to do so:

getExecutionEnvironment()
createLocalEnvironment()
createRemoteEnvironment(String host, int port, String... jarFiles)
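
getExecutionEnvironment() is the most commonly used: it returns a local environment when the program runs in an IDE and the cluster environment when the program is submitted to a cluster. For the other two methods, a minimal sketch follows (the parallelism, host, port, and jar path are hypothetical placeholders):

// Local environment with an explicit parallelism of 2
StreamExecutionEnvironment localEnv = StreamExecutionEnvironment.createLocalEnvironment(2);

// Remote environment pointing at a running cluster; host, port, and jar path are placeholders
StreamExecutionEnvironment remoteEnv = StreamExecutionEnvironment.createRemoteEnvironment(
        "flink-master", 6123, "/path/to/my-flink-job.jar");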

The following code creates an execution environment using the first method.

Batch:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> text = env.readTextFile("file:///D:\\words.txt");
text.print();

Stream processing:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("file:///D:\\words.txt");
text.print();
env.execute();

Contents of the words.txt file:

a
b
c
d
e
a
b

The code above creates an execution environment and uses env to create the input source. Calling print on a data set outputs its data to the console; other methods, such as writeAsText, can output the data to other media. Note that the last line of the stream-processing code calls execute: stream processing programs must call execute explicitly to trigger execution.
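
For example, a minimal sketch that writes the data set to a text file instead of printing it (the output path is a hypothetical placeholder):

// Write each element as one line of text; the path is a placeholder
text.writeAsText("file:///D:\\output.txt");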

There are two ways to run the code above. One is to execute it directly in the IDE, like a normal Java program; Flink will start a local execution environment. The other is to package it and submit it to a Flink cluster. The example above contains the basic skeleton of a Flink program but no transform operations on the data sets. Below we briefly introduce the basic transform operations.

map operation

The map operation here is similar to map in MapReduce: it parses and processes the data. Examples follow.

Batch:

DataSet<Tuple2<String, Integer>> words = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                return new Tuple2<>(s, 1);
            }
});
words.print();

Stream processing:

DataStream<Tuple2<String, Integer>> words = text.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String s) throws Exception {
                return new Tuple2<>(s, 1);
            }
});
words.print();

Apart from the different data set types for batch and stream processing, the code is identical. The map turns each word into a (word, 1) tuple. Transforms similar to map include filter, which removes unwanted records; a minimal sketch is shown below, and readers can experiment further on their own.
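
As a sketch, a filter that keeps only the word "a" (the predicate is purely illustrative):

DataStream<String> filtered = text.filter(new FilterFunction<String>() {
            @Override
            public boolean filter(String s) throws Exception {
                return "a".equals(s); // keep only records matching the predicate
            }
});
filtered.print();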

Specifying keys

Big data processing often needs to process data along some dimension, which requires specifying a key. Use groupBy to specify the key on a DataSet, and keyBy to specify the key on a DataStream. Here we use keyBy as the example.

Flink's data model is not based on key-value pairs; keys are virtual and can be seen as functions defined over the data.

Defining keys on a Tuple

KeyedStream<Tuple2<String, Integer>, Tuple> keyed = words.keyBy(0); // 0 refers to the first element of the Tuple2
KeyedStream<Tuple2<String, Integer>, Tuple> keyed = words.keyBy(0, 1); // 0,1 means the first and second elements together form the key

For nested tuples:

DataStream<Tuple3<Tuple2<Integer, Float>,String,Long>> ds;

ds.keyBy(0) will use the entire Tuple2<Integer, Float> as the key.

Specifying keys with field expressions

public class WC {
  public String word;
  public int count;
}
DataStream<WC> words = // [...]
DataStream<WC> wordCounts = words.keyBy("word");

Here the word field of the WC object is specified as the key. The field expression syntax is as follows:

  • Use a Java object's field name as the key, as in the example above
  • For Tuple types, use field names (f0, f1, ...) or offsets (starting from 0) to specify the key; for example, f0 and 5 refer to the first and sixth fields of a Tuple respectively
  • Nested fields of Java objects and Tuples can also be keys; for example, f1.user.zip means the zip field of the user object held in the second field of the Tuple is the key
  • The wildcard * selects the entire type as the key

Field expression examples

public static class WC {
  public ComplexNestedClass complex; //nested POJO
  private int count;
  // getter / setter for private field (count)
  public int getCount() {
    return count;
  }
  public void setCount(int c) {
    this.count = c;
  }
}
public static class ComplexNestedClass {
  public Integer someNumber;
  public float someFloat;
  public Tuple3<Long, Long, String> word;
  public IntWritable hadoopCitizen;
}
  • "Count": count like WC Fields
  • "Complex": complex of all fields (recursively)
  • "Complex.word.f2": ComplexNestedClass category word of the third field triplets
  • "Complex.hadoopCitizen": complex class hadoopCitizen field

Specifying keys with a KeySelector

Keys can also be specified with a KeySelector: the selector function receives each input element and returns its key. An example follows:

words.keyBy(new KeySelector<Tuple2<String, Integer>, String>() {

            @Override
            public String getKey(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                return stringIntegerTuple2.f0; // use the word (f0) as the key
            }
});

This achieves the same effect as keyBy(0).

These are the ways to specify keys in Flink.
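
Once a key is specified, aggregations run per key. A minimal sketch, assuming the words stream of (word, 1) tuples from the map example above:

KeyedStream<Tuple2<String, Integer>, Tuple> keyed = words.keyBy(0); // group by the word
keyed.sum(1).print(); // sum the counts (field 1) per word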

Summary

This article introduced the basic skeleton of a Flink program: obtaining the environment, creating an input source, applying transforms to data sets, and producing output. Since data processing often aggregates along different dimensions (different keys), this article focused on how to specify keys in Flink. Subsequent articles will continue to cover the Flink API.
