2. Flink generic erasure ("In-depth understanding of Flink" series)

Article Directory

- 1 Generic erasure
- 2 The returns method on the operator

Java 8 lambda expressions make it possible to implement and pass functions directly, without declaring additional (anonymous) classes. Flink supports lambda expressions for all operators of the Java API, but when a lambda expression involves Java generics, developers must declare the type information explicitly. If they do not, the program fails with an error, because the Java compiler erases generic type information.

1 Generic erasure

The Java compiler discards most generic type information during compilation, which is known as generic (type) erasure. As a result, when a Flink application runs, an object instance does not know its generic type. For example, instances of DataStream<String> and DataStream<Long> look the same to the JVM.
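The effect can be observed without Flink. The following is a minimal plain-Java sketch, assuming only the standard library; the class name ErasureDemo and its method are hypothetical names chosen for illustration. It compares the runtime classes of two lists that differ only in their type argument.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of type erasure; no Flink classes are involved.
// ErasureDemo is a hypothetical name used only for this sketch.
final class ErasureDemo {

    // Returns true when both lists share one runtime class, i.e. the type
    // arguments String and Long are not part of the runtime type.
    static boolean sameRuntimeClass() {
        List<String> strings = new ArrayList<>();
        List<Long> longs = new ArrayList<>();
        return strings.getClass() == longs.getClass();
    }

    public static void main(String[] args) {
        // Both lists are plain java.util.ArrayList instances at runtime.
        System.out.println(sameRuntimeClass());
    }
}
```

The same reasoning applies to DataStream<String> and DataStream<Long>: the type argument exists for the compiler, not for the running object.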

The types of the input and output parameters of the example map function below do not need to be declared because they are inferred by the Java compiler.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> dataStream = env.generateSequence(1, 5);
// Use a lambda expression in the map operation
DataStream<Long> resultStream = dataStream.map(i -> i * i);
resultStream.print();
env.execute();

Flink can automatically extract the result type information from the implementation of the method signature OUT map(IN value). However, a map function with a generic input or return type, such as Tuple2<Long, Long> map(Tuple2<Long, Long> value), is compiled by the Java compiler to Tuple2 map(Tuple2 value), so Flink can no longer infer the input and output types automatically.

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> dataStream = env.generateSequence(1, 5);
// Convert the stream of Long values into a new stream of Tuple2<Long, Long>
DataStream<Tuple2<Long, Long>> mapStream = dataStream.map(new MapFunction<Long, Tuple2<Long, Long>>() {
    @Override
    public Tuple2<Long, Long> map(Long value) throws Exception {
        return new Tuple2<>(value, value);
    }
});
// Use a lambda expression in the map operation
DataStream<Tuple2<Long, Long>> resultStream = mapStream.map(value -> value);
resultStream.print();
env.execute();

In the second map operation, the stream elements have the generic type Tuple2<Long, Long> and the function is a lambda expression, so Flink cannot deduce the type parameters. The application throws an exception similar to the following:

Caused by: org.apache.flink.api.common.functions.InvalidTypesException: The generic type parameters of 'Tuple2' are missing. 
In many cases lambda methods don't provide enough information for automatic type extraction when Java generics are involved. 
An easy workaround is to use an (anonymous) class instead that 
implements the 'org.apache.flink.api.common.functions.MapFunction' interface.
Otherwise the type has to be specified explicitly using type information.
    at org.apache.flink.api.java.typeutils.TypeExtractionUtils.validateLambdaType(TypeExtractionUtils.java:350)
    at org.apache.flink.api.java.typeutils.TypeExtractor.getUnaryOperatorReturnType(TypeExtractor.java:579)
    at org.apache.flink.api.java.typeutils.TypeExtractor.getMapReturnTypes(TypeExtractor.java:175)
    at org.apache.flink.streaming.api.datastream.DataStream.map(DataStream.java:587)
    ... 1 more
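The exception message suggests an anonymous class as a workaround. The following plain-Java sketch illustrates why that helps: an anonymous class records its generic interface in its class file, while a lambda typically does not. It uses java.util.function.Function as a stand-in for Flink's MapFunction, the name LambdaVsAnonymousDemo and its methods are hypothetical, and the lambda result reflects how common OpenJDK runtimes generate lambda classes rather than a guarantee of the language specification.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical demo class; Function stands in for Flink's MapFunction.
final class LambdaVsAnonymousDemo {

    // Anonymous class: the compiler stores "implements Function<Long, List<Long>>"
    // in the class file, including the type arguments.
    static Function<Long, List<Long>> anonymousFunction() {
        return new Function<Long, List<Long>>() {
            @Override
            public List<Long> apply(Long value) {
                return Arrays.asList(value, value);
            }
        };
    }

    // Lambda: the class is generated at runtime and implements the raw interface.
    static Function<Long, List<Long>> lambdaFunction() {
        return value -> Arrays.asList(value, value);
    }

    // True if reflection can still see type arguments on the first interface.
    static boolean keepsTypeArguments(Object function) {
        Type iface = function.getClass().getGenericInterfaces()[0];
        return iface instanceof ParameterizedType;
    }

    public static void main(String[] args) {
        System.out.println("anonymous class: " + keepsTypeArguments(anonymousFunction()));
        System.out.println("lambda:          " + keepsTypeArguments(lambdaFunction()));
    }
}
```

A type extractor that relies on this kind of reflection has nothing to read for the lambda, which is consistent with the "generic type parameters of 'Tuple2' are missing" message above.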

2 The returns method on the operator

When Flink cannot deduce a generic type because a lambda expression was passed to an operator, call the returns(…) method directly after that operator to add a type information hint for the operator's return type. Otherwise, the output will be regarded as type Object, resulting in invalid serialization.

For the above program, we call the returns(…) method after calling the map operator on the mapStream data stream to add a type information hint for the return type of this operator.

import org.apache.flink.api.common.typeinfo.Types;

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> dataStream = env.generateSequence(1, 5);
// Convert the stream of Long values into a stream of Tuple2<Long, Long>
DataStream<Tuple2<Long, Long>> mapStream = dataStream.map(new MapFunction<Long, Tuple2<Long, Long>>() {
    @Override
    public Tuple2<Long, Long> map(Long value) throws Exception {
        return new Tuple2<>(value, value);
    }
});
// Use a lambda expression in the map operation
DataStream<Tuple2<Long, Long>> resultStream = mapStream.map(value -> value)
        // Provide explicit type information for the result
        .returns(Types.TUPLE(Types.LONG, Types.LONG));
resultStream.print();
env.execute();

The returns(…) method has three overloads:

public SingleOutputStreamOperator<T> returns(Class<T> typeClass): Classes can be used as type hints for non-generic types (classes without generic parameters), but not for generic types such as Tuples. For those generic types, use the returns(TypeHint<T> typeHint) method.

public SingleOutputStreamOperator<T> returns(TypeHint<T> typeHint):

Use this method as follows:
import org.apache.flink.api.common.typeinfo.TypeHint;
DataStream<Tuple2<String, Double>> result =
          stream.flatMap(new FunctionWithNonInferrableReturnType())
                .returns(new TypeHint<Tuple2<String, Double>>(){});
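The trailing {} in new TypeHint<…>(){} creates an anonymous subclass, and that is what lets the type arguments survive erasure. The following plain-Java sketch shows the general technique (often called a super type token). The TypeToken class here is a hypothetical stand-in written for this illustration; it is not Flink's TypeHint, whose internals may differ.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Map;

// Hypothetical demo of the super type token technique; not Flink code.
final class TypeTokenDemo {

    // Stand-in for a type-hint class. It must be subclassed to be useful.
    abstract static class TypeToken<T> {
        private final Type type;

        protected TypeToken() {
            // The anonymous subclass's class file records
            // "extends TypeToken<...>" with its full type arguments,
            // so reflection can read them back at runtime.
            ParameterizedType superType =
                    (ParameterizedType) getClass().getGenericSuperclass();
            this.type = superType.getActualTypeArguments()[0];
        }

        Type getType() {
            return type;
        }
    }

    // Captures Map<String, Double> through an anonymous subclass (note the {}).
    static String capturedTypeName() {
        return new TypeToken<Map<String, Double>>() {}.getType().getTypeName();
    }

    public static void main(String[] args) {
        // The printed name includes both type arguments, String and Double.
        System.out.println(capturedTypeName());
    }
}
```

Without the {} there would be no subclass, and the constructor would find only the raw superclass, so the hint would carry no generic information.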

public SingleOutputStreamOperator<T> returns(TypeInformation<T> typeInfo): In most cases, returns(Class) and returns(TypeHint) are the preferred overloads.

Use this method as follows:
import org.apache.flink.api.common.typeinfo.Types;
DataStream<Tuple2<String, Double>> result =
          stream.flatMap(new FunctionWithNonInferrableReturnType())
                .returns(Types.TUPLE(Types.STRING, Types.DOUBLE));