Flink 1.17 Tutorial: Transformation

After the data source reads in the data, we can use various transformation operators to convert one or more DataStreams into a new DataStream.

Basic transformation operators (map / filter / flatMap)

WaterSensor.java

package com.atguigu.bean;

import java.util.Objects;

/**
 * TODO
 *
 * @author cjp
 * @version 1.0
 */
public class WaterSensor {

    public String id;
    public Long ts;
    public Integer vc;

    // A no-argument constructor must be provided (required for Flink to treat this class as a POJO)
    public WaterSensor() {
    }

    public WaterSensor(String id, Long ts, Integer vc) {
        this.id = id;
        this.ts = ts;
        this.vc = vc;
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public Long getTs() {
        return ts;
    }

    public void setTs(Long ts) {
        this.ts = ts;
    }

    public Integer getVc() {
        return vc;
    }

    public void setVc(Integer vc) {
        this.vc = vc;
    }

    @Override
    public String toString() {
        return "WaterSensor{" +
                "id='" + id + '\'' +
                ", ts=" + ts +
                ", vc=" + vc +
                '}';
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        WaterSensor that = (WaterSensor) o;
        return Objects.equals(id, that.id) &&
                Objects.equals(ts, that.ts) &&
                Objects.equals(vc, that.vc);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, ts, vc);
    }
}

map

Map is one of the most familiar operators in big data processing. It converts each element of the data stream to form a new data stream. In simple terms, it is a "one-to-one mapping": one element is consumed and one element is produced.

We only need to call the map() method on the DataStream to perform the conversion. The method takes an implementation of the MapFunction interface as its parameter; the return value is still a DataStream, but the generic type (the element type in the stream) may change.
The following code extracts the id field of WaterSensor in different ways.

public class TransMap {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<WaterSensor> stream = env.fromElements(
                new WaterSensor("sensor_1", 1L, 1),
                new WaterSensor("sensor_2", 2L, 2)
        );

        // Method 1: pass in an anonymous class that implements MapFunction
        stream.map(new MapFunction<WaterSensor, String>() {
            @Override
            public String map(WaterSensor e) throws Exception {
                return e.id;
            }
        }).print();

        // Method 2: pass in an implementation class of MapFunction
        // stream.map(new UserMap()).print();

        env.execute();
    }

    public static class UserMap implements MapFunction<WaterSensor, String> {
        @Override
        public String map(WaterSensor e) throws Exception {
            return e.id;
        }
    }
}

Execution result: both methods print the id of every element in the stream, i.e. sensor_1 and sensor_2.

In the above code, the generic types of the MapFunction implementation are tied to the input and output data types. When implementing the MapFunction interface, you need to specify two generic types: the type of the input events and the type of the output events. You also need to override the map() method to define the concrete logic that converts one input event into one output event.
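When the conversion logic is this simple, it can also be written as a Lambda expression or a method reference; a minimal sketch based on the stream above:

stream.map(e -> e.id).print();

// or, equivalently, using the POJO getter as a method reference
// (if Flink cannot infer the result type, add .returns(Types.STRING))
stream.map(WaterSensor::getId).print();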

filter

The filter transformation, as the name implies, filters the data stream: a Boolean condition is evaluated for each element in the stream. If it is true, the element is passed on normally; if it is false, the element is filtered out.

The data type of the new stream produced by filter is the same as that of the original stream. The filter transformation takes an implementation of the FilterFunction interface as its parameter, whose filter() method is essentially a conditional expression that returns a Boolean.
Case requirement: the following code keeps only the data whose sensor id is sensor_1.

package com.atguigu.zxl_test;

import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransFilter {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<WaterSensor> stream = env.fromElements(
                new WaterSensor("sensor_1", 1L, 1),
                new WaterSensor("sensor_1", 2L, 2),
                new WaterSensor("sensor_2", 2L, 2),
                new WaterSensor("sensor_3", 3L, 3)
        );

        // Method 1: pass in an anonymous class implementing FilterFunction
        stream.filter(new FilterFunction<WaterSensor>() {
            @Override
            public boolean filter(WaterSensor e) throws Exception {
                return e.id.equals("sensor_1");
            }
        }).print();

        // Method 2: pass in a FilterFunction implementation class
        // stream.filter(new UserFilter()).print();

        env.execute();
    }

    public static class UserFilter implements FilterFunction<WaterSensor> {
        @Override
        public boolean filter(WaterSensor e) throws Exception {
            return e.id.equals("sensor_1");
        }
    }
}

Execution result: both methods print only the elements whose id is sensor_1.

Flat Map (flatMap)

The flatMap operation, also called flat mapping, splits a whole (generally a collection-like element) in the data stream into individual elements: consuming one element can produce zero or more elements. flatMap can be seen as a combination of the two operations "flatten" and "map": the data is first broken up according to some rule, and the resulting pieces are then converted.

Like map, flatMap can take either a Lambda expression or an implementation of the FlatMapFunction interface as its parameter. The type of the returned stream depends on the logic passed in, and may be the same as or different from the type of the original stream.
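For instance, a Lambda-based flatMap that emits every field of a WaterSensor as a separate String record could be sketched as follows; note that because the Collector's generic type is erased at compile time, the result type normally has to be declared explicitly with returns():

stream.flatMap((WaterSensor value, Collector<String> out) -> {
            out.collect(value.id);
            out.collect(String.valueOf(value.ts));
            out.collect(String.valueOf(value.vc));
        })
        .returns(Types.STRING)   // org.apache.flink.api.common.typeinfo.Types
        .print();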

Case requirement: if the sensor id is sensor_1, only vc is printed; if it is sensor_2, both ts and vc are printed.
The implementation code is as follows:

package com.atguigu.zxl_test;

import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;


public class TransFlatmap {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<WaterSensor> stream = env.fromElements(
                new WaterSensor("sensor_1", 1L, 1),
                new WaterSensor("sensor_1", 2L, 2),
                new WaterSensor("sensor_2", 2L, 2),
                new WaterSensor("sensor_3", 3L, 3)
        );

        stream.flatMap(new MyFlatMap()).print();

        env.execute();
    }

    public static class MyFlatMap implements FlatMapFunction<WaterSensor, String> {

        @Override
        public void flatMap(WaterSensor value, Collector<String> out) throws Exception {

            if (value.id.equals("sensor_1")) {
                out.collect(String.valueOf(value.vc));
            } else if (value.id.equals("sensor_2")) {
                out.collect(String.valueOf(value.ts));
                out.collect(String.valueOf(value.vc));
            }
        }
    }
}

Execution result: each sensor_1 record prints only its vc, each sensor_2 record prints its ts and vc, and sensor_3 records produce no output.

Aggregation operator (Aggregation)

Sometimes the calculation result depends not only on the current data but also on data that arrived before it; this amounts to collecting the data together and merging it, which is the so-called "aggregation" (Aggregation), similar to the reduce operation in MapReduce.

Key partition (keyBy)

In Flink, a DataStream has no API for direct aggregation. To aggregate massive data we must partition it and process it in parallel to improve efficiency. So in Flink, aggregation requires partitioning first, and this is done with keyBy.
keyBy is the operator that must be used before aggregation. By specifying a key, keyBy logically divides a stream into different partitions; the partitions mentioned here are in fact the parallel subtasks.
Based on different keys, the data in the stream is distributed to different partitions, so all data with the same key is sent to the same partition.

Internally, this is implemented by computing the hash code of the key and taking it modulo the number of partitions. So if the key is a POJO, its hashCode() method must be overridden.
The keyBy() method takes a parameter that specifies one key or a group of keys. There are many ways to specify the key: for example, for Tuple data types you can give the position of a field or a combination of positions; for POJO types you can give the field name (a String); you can also pass in a Lambda expression or implement a KeySelector to describe how the key is extracted from the data.
We can use id as the key to do a partition operation. The code is implemented as follows:

package com.atguigu.zxl_test;

import com.atguigu.bean.WaterSensor;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TransKeyBy {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<WaterSensor> stream = env.fromElements(
            new WaterSensor("sensor_1", 1L, 1),
            new WaterSensor("sensor_1", 2L, 2),
            new WaterSensor("sensor_2", 2L, 2),
            new WaterSensor("sensor_3", 3L, 3)
        );

        // Method 1: use a Lambda expression
        KeyedStream<WaterSensor, String> keyedStream = stream.keyBy(e -> e.id);

        // Add a sink such as print(); otherwise execute() fails with:
        // "No operators defined in streaming topology. Cannot execute."
        keyedStream.print();

        // Method 2: use an anonymous class implementing KeySelector
        /*KeyedStream<WaterSensor, String> keyedStream1 = stream.keyBy(new KeySelector<WaterSensor, String>() {
            @Override
            public String getKey(WaterSensor e) throws Exception {
                return e.id;
            }
        });

        keyedStream1.print();*/

        env.execute();
    }
}

Execution result: the four records are printed; records with the same id are always handled by the same parallel subtask.

It should be noted that the result of keyBy is no longer a plain DataStream; keyBy converts a DataStream into a KeyedStream. A KeyedStream can be thought of as a "partitioned stream" or "keyed stream". It is a logical partitioning of the DataStream by key, so it has two generic types: besides the element type of the current stream, the type of the key must also be specified.
KeyedStream still inherits from DataStream, so operations on it also belong to the DataStream API. But it is different from the SingleOutputStreamOperator obtained from the previous transformations: keyBy is only a stream partitioning operation, not a transformation operator. KeyedStream is a very important data structure; only on top of it can subsequent aggregation operations (such as sum and reduce) be performed.

Simple aggregation (sum/min/max/minBy/maxBy)

With a KeyedStream partitioned by key, we can perform aggregation operations on it. Flink provides some of the most basic and simple aggregation APIs, mainly the following:

  • sum(): on the input stream, sums the specified field.
  • min(): on the input stream, finds the minimum value of the specified field.
  • max(): on the input stream, finds the maximum value of the specified field.
  • minBy(): similar to min(), finds the minimum of the specified field on the input stream. The difference is that min() only computes the minimum of the specified field while the other fields keep the values of the first record, whereas minBy() returns the whole record that contains the minimum of that field.
  • maxBy(): similar to max(), computes the maximum of the specified field on the input stream. The difference between the two is exactly the same as for min()/minBy().

Simple aggregation operators are convenient to use and have very clear semantics. These aggregation methods also take a parameter; unlike the basic transformation operators, however, no custom function needs to be implemented. It is enough to specify the field to aggregate on, and a field can be specified in two ways: by position or by name.
For tuple-type data, both ways can be used. Note that the fields of a tuple are named f0, f1, f2, ....
If the stream's type is a POJO class, a field can only be specified by name, not by position.
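Before the POJO example below, here is a quick sketch of specifying a field by position on a tuple stream (the tuple data is made up purely for illustration):

env.fromElements(Tuple2.of("a", 1), Tuple2.of("a", 3), Tuple2.of("b", 2))
   .keyBy(t -> t.f0)   // key by the first tuple field
   .sum(1)             // aggregate the field at position 1
   .print();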

public class TransAggregation {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<WaterSensor> stream = env.fromElements(
            new WaterSensor("sensor_1", 1L, 1),
            new WaterSensor("sensor_1", 2L, 2),
            new WaterSensor("sensor_2", 2L, 2),
            new WaterSensor("sensor_3", 3L, 3)
        );

        stream.keyBy(e -> e.id)
              .max("vc")    // specify the field by name
              .print();

        env.execute();
    }
}
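To make the difference between max and maxBy visible, the two calls can be compared on the same keyed stream (a sketch based on the stream defined above; the printed values follow the semantics described earlier):

KeyedStream<WaterSensor, String> keyed = stream.keyBy(e -> e.id);

keyed.max("vc").print("max");      // only vc is updated; id/ts stay as in the first record per key
keyed.maxBy("vc").print("maxBy");  // the entire record holding the current maximum vc is emitted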

The return type of a simple aggregation operator is SingleOutputStreamOperator, that is, the KeyedStream is converted back into a regular DataStream. So it can be understood this way: keyBy and aggregation appear in pairs, partition first and then aggregate, and the result is again a DataStream. Moreover, the element type of the stream does not change after a simple aggregation.
An aggregation operator keeps one aggregated value per key, which in Flink is called "state". Whenever a new record arrives, the operator updates the saved aggregation result and sends an event with the updated value downstream. For unbounded streams this state is never cleared, so simple aggregation operators should only be used on streams with a bounded number of distinct keys.

Reduction aggregation (reduce)

reduce performs a reduction on the data seen so far: each new record is combined with the currently reduced value in an aggregation computation.
The reduce operation also converts a KeyedStream into a DataStream. It does not change the element type of the stream, so the output type is the same as the input type.
When calling reduce on a KeyedStream, you need to pass in an implementation of the ReduceFunction interface, which is defined in the source code as follows:

public interface ReduceFunction<T> extends Function, Serializable {

    T reduce(T value1, T value2) throws Exception;
}

The reduce() method of ReduceFunction receives two input events and, after the computation, outputs an event of the same type. In the underlying stream-processing implementation, the intermediate "merged result" is kept as state of the task; afterwards, every time a new record arrives it is reduced together with the previously aggregated state.
We can define a separate function class that implements the ReduceFunction interface, or pass in an anonymous class directly. Of course, the same effect can also be achieved with a Lambda expression.
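For example, a reduce written as a Lambda expression that keeps a running sum of vc per sensor could be sketched like this (the summing logic is just an illustration, not part of the case below):

stream.keyBy(WaterSensor::getId)
      .reduce((s1, s2) -> new WaterSensor(s1.getId(), s2.getTs(), s1.getVc() + s2.getVc()))
      .print();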
For subsequent use, define a WaterSensorMapFunction:

public class WaterSensorMapFunction implements MapFunction<String, WaterSensor> {

    @Override
    public WaterSensor map(String value) throws Exception {
        String[] datas = value.split(",");
        return new WaterSensor(datas[0], Long.valueOf(datas[1]), Integer.valueOf(datas[2]));
    }
}

Case: Use reduce to implement the functions of max and maxBy.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env
   .socketTextStream("hadoop102", 7777)
   .map(new WaterSensorMapFunction())
   .keyBy(WaterSensor::getId)
   .reduce(new ReduceFunction<WaterSensor>() {
       @Override
       public WaterSensor reduce(WaterSensor value1, WaterSensor value2) throws Exception {
           System.out.println("Demo7_Reduce.reduce");

           int maxVc = Math.max(value1.getVc(), value2.getVc());
           // To get the effect of max(vc): take the maximum vc, the other fields come from the first record of the group
           //value1.setVc(maxVc);
           // To get the effect of maxBy(vc): keep all fields of the record that holds the current maximum
           if (value1.getVc() > value2.getVc()) {
               value1.setVc(maxVc);
               return value1;
           } else {
               value2.setVc(maxVc);
               return value2;
           }
       }
   })
   .print();
env.execute();

Reduce, like the simple aggregation operators, also keeps state for each key. Because this state is never cleared, the reduce operator should likewise only be applied to streams with a bounded number of keys.

User Defined Function (UDF)

A user-defined function (UDF) means that users implement the operator's processing logic according to their own needs.
User-defined functions come in several forms: function classes, anonymous functions (Lambdas), and rich function classes.

Function Classes

Flink exposes interfaces for all of its UDFs, in the form of interfaces or abstract classes, such as MapFunction, FilterFunction, and ReduceFunction. Users can write their own function classes that implement the corresponding interface.
Requirement: keep only the records whose id is "sensor_1" from the WaterSensor stream:
Method 1: Implement the FilterFunction interface

public class TransFunctionUDF {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<WaterSensor> stream = env.fromElements(
			new WaterSensor("sensor_1", 1L, 1),
			new WaterSensor("sensor_1", 2L, 2),
			new WaterSensor("sensor_2", 2L, 2),
			new WaterSensor("sensor_3", 3L, 3)
        );

        DataStream<WaterSensor> filter = stream.filter(new UserFilter());

        filter.print();
        env.execute();
    }

    public static class UserFilter implements FilterFunction<WaterSensor> {
        @Override
        public boolean filter(WaterSensor e) throws Exception {
            return e.id.equals("sensor_1");
        }
    }
}

Method 2: Implement the FilterFunction interface through an anonymous class:

DataStream<WaterSensor> filter = stream.filter(new FilterFunction<WaterSensor>() {
    @Override
    public boolean filter(WaterSensor e) throws Exception {
        return e.id.equals("sensor_1");
    }
});

Optimization of method 2: to make the class more general, we can abstract the keyword used for filtering ("sensor_1" here) into an attribute of the class and pass it in through the constructor.

DataStreamSource<WaterSensor> stream = env.fromElements(
	new WaterSensor("sensor_1", 1L, 1),
	new WaterSensor("sensor_1", 2L, 2),
	new WaterSensor("sensor_2", 2L, 2),
	new WaterSensor("sensor_3", 3L, 3)
);

DataStream<WaterSensor> filter = stream.filter(new FilterFunctionImpl("sensor_1"));

public static class FilterFunctionImpl implements FilterFunction<WaterSensor> {

    private String id;

    FilterFunctionImpl(String id) {
        this.id = id;
    }

    @Override
    public boolean filter(WaterSensor value) throws Exception {
        return this.id.equals(value.id);
    }
}

Method 3: Using an anonymous function (Lambda)

public class TransFunctionUDF {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<WaterSensor> stream = env.fromElements(
			new WaterSensor("sensor_1", 1L, 1),
			new WaterSensor("sensor_1", 2L, 2),
			new WaterSensor("sensor_2", 2L, 2),
			new WaterSensor("sensor_3", 3L, 3)
        );

        // The filter logic is passed as a Lambda expression; no explicit type declaration is needed
        SingleOutputStreamOperator<WaterSensor> filter = stream.filter(sensor -> "sensor_1".equals(sensor.id));

        filter.print();

        env.execute();
    }
}

Rich Function Classes

"Rich function class" is also an interface of a function class provided by DataStream API, and all Flink function classes have their Rich version. Rich function classes generally appear in the form of abstract classes. For example: RichMapFunction, RichFilterFunction, RichReduceFunction, etc.
The main difference from regular function classes is that rich function classes can obtain the context of the running environment and have some lifecycle methods, so they can implement more complex functions.
Rich Function has the concept of life cycle. Typical lifecycle methods are:

  • The open() method is the initialization method of Rich Function, that is, it will start the life cycle of an operator. When an operator's actual working methods such as map() or filter() are called, open() will be called first.
  • The close() method is the last method called in the life cycle, similar to the end method. Generally used to do some cleaning work.

It should be noted that the life cycle method here will only be called once for a parallel subtask; and the corresponding actual working method, such as map() in RichMapFunction, will trigger a call after each piece of data arrives.
Let's look at an example to illustrate:

public class RichFunctionExample {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        env
                .fromElements(1, 2, 3, 4)
                .map(new RichMapFunction<Integer, Integer>() {
                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        System.out.println("Lifecycle of subtask " + getRuntimeContext().getIndexOfThisSubtask() + " starts");
                    }

                    @Override
                    public Integer map(Integer integer) throws Exception {
                        return integer + 1;
                    }

                    @Override
                    public void close() throws Exception {
                        super.close();
                        System.out.println("Lifecycle of subtask " + getRuntimeContext().getIndexOfThisSubtask() + " ends");
                    }
                })
                .print();

        env.execute();
    }
}

Physical Partitioning Operator (Physical Partitioning)

Common physical partition strategies include: random allocation (Random), round-robin allocation (Round-Robin), rescaling (Rescale) and broadcast (Broadcast).

Random partition (shuffle)

The easiest way to repartition is to "shuffle" directly. By calling the .shuffle() method on a DataStream, the data is randomly distributed to the parallel subtasks of the downstream operator.
Random partitioning follows a uniform distribution, so the data in the stream is shuffled randomly and spread evenly across the downstream task partitions. Because it is completely random, the result will generally differ from run to run even for the same input data.

After random partitioning, the result is still a DataStream.
We can do a simple test: read data in, print it directly to the console with output parallelism 2, and insert a shuffle in the middle. Run it several times to see whether the results are the same.

public class ShuffleExample {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(2);

        DataStreamSource<String> stream = env.socketTextStream("hadoop102", 7777);

        stream.shuffle().print();

        env.execute();
    }
}

Round-robin partition (rebalance)

Round-robin partitioning is, simply put, "dealing cards": data is distributed in sequence. Calling the .rebalance() method of DataStream implements round-robin repartitioning. rebalance uses the Round-Robin load-balancing algorithm and distributes the input stream evenly across the downstream parallel tasks.

stream.rebalance()
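A minimal runnable sketch: the fromElements source runs with parallelism 1, and rebalance deals its records round-robin to the parallel print subtasks (the element values are arbitrary):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements(1, 2, 3, 4, 5, 6, 7, 8)
   .rebalance()
   .print()
   .setParallelism(4);   // the 8 records are spread evenly over the 4 print subtasks

env.execute();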

Rescale partition (rescale)

Rescaling is very similar to round-robin partitioning. When rescale() is called, the underlying implementation also uses the Round-Robin algorithm, but the data is only distributed round-robin to a subset of the downstream parallel tasks. In the card-dealing analogy, rescale splits the players into small groups, and each dealer only deals cards, in turn, to the players in its own group.

stream.rescale()

Broadcast

Strictly speaking, this method should not be called "repartitioning", because after broadcasting each piece of data is kept in every partition and may be processed repeatedly. By calling the broadcast() method of DataStream, the input data is copied and sent to all parallel tasks of the downstream operator.

stream.broadcast()
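A small sketch to verify this behaviour: with the print sink running at parallelism 2, every element should be printed by both subtasks, i.e. appear twice in the output:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.fromElements(1, 2, 3, 4)
   .broadcast()
   .print()
   .setParallelism(2);   // each element is copied to both print subtasks

env.execute();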

Global partition (global)

Global partitioning is also a special kind of partitioning, and a rather extreme one. Calling the .global() method sends all input data to the first parallel subtask of the downstream operator. This effectively forces the parallelism of the downstream task to 1, so use this operation with great caution; it may put a lot of pressure on the program.

stream.global()

Custom Partition (Custom)

When none of the partitioning strategies provided by Flink can meet the user's needs, we can customize the partitioning strategy by using the partitionCustom() method.
1) Custom Partitioner

public class MyPartitioner implements Partitioner<String> {

    @Override
    public int partition(String key, int numPartitions) {
        return Integer.parseInt(key) % numPartitions;
    }
}

2) Use the custom partitioner

public class PartitionCustomDemo {

    public static void main(String[] args) throws Exception {
//        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration());

        env.setParallelism(2);

        DataStreamSource<String> socketDS = env.socketTextStream("hadoop102", 7777);

        DataStream<String> myDS = socketDS
                .partitionCustom(
                        new MyPartitioner(),
                        value -> value);

        myDS.print();

        env.execute();
    }
}

Splitting streams (split)

The so-called "splitting" refers to splitting a data stream into two or even multiple streams that are completely independent. That is, based on a DataStream, define some filter conditions, and select the qualified data into the corresponding stream.

Simple implementation

In fact, filtering data by condition is easy to implement: simply call the .filter() method on the same stream multiple times, once per condition, to obtain the split streams.

Case requirement: read a stream of integers and split it into an odd-number stream and an even-number stream.

Code:

public class SplitStreamByFilter {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        SingleOutputStreamOperator<Integer> ds = env.socketTextStream("hadoop102", 7777)
                                                    .map(Integer::valueOf);
        // Split ds into two streams, an odd stream and an even stream,
        // by calling filter twice
        SingleOutputStreamOperator<Integer> ds1 = ds.filter(x -> x % 2 == 0);
        SingleOutputStreamOperator<Integer> ds2 = ds.filter(x -> x % 2 == 1);

        ds1.print("even");
        ds2.print("odd");

        env.execute();
    }
}

This implementation is very simple, but the code is somewhat redundant: the processing logic is essentially the same for both split streams, yet it is written twice. Moreover, what this code really does is copy the original data stream and filter each copy separately, which is obviously not efficient. Naturally we wonder: can we split the stream directly with a single operator, without copying it?

Using side output streams

The use of side output streams inside process functions is covered in detail in Section 7.5. Simply put, you only need to call the output() method of the context (ctx) to emit data of any type. Marking and extracting a side output stream both rely on an "output tag" (OutputTag), which specifies the id and the type of the side output stream.
Code implementation: split the WaterSensor stream according to the sensor id.

public class SplitStreamByOutputTag {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        SingleOutputStreamOperator<WaterSensor> ds = env.socketTextStream("hadoop102", 7777)
              .map(new WaterSensorMapFunction());

        OutputTag<WaterSensor> s1 = new OutputTag<>("s1", Types.POJO(WaterSensor.class));
        OutputTag<WaterSensor> s2 = new OutputTag<>("s2", Types.POJO(WaterSensor.class));

        // process() returns the main stream
        SingleOutputStreamOperator<WaterSensor> ds1 = ds.process(new ProcessFunction<WaterSensor, WaterSensor>() {
            @Override
            public void processElement(WaterSensor value, Context ctx, Collector<WaterSensor> out) throws Exception {

                if ("s1".equals(value.getId())) {
                    ctx.output(s1, value);
                } else if ("s2".equals(value.getId())) {
                    ctx.output(s2, value);
                } else {
                    // main stream
                    out.collect(value);
                }
            }
        });

        ds1.print("main stream (sensors other than s1 and s2)");
        SideOutputDataStream<WaterSensor> s1DS = ds1.getSideOutput(s1);
        SideOutputDataStream<WaterSensor> s2DS = ds1.getSideOutput(s2);

        s1DS.printToErr("s1");
        s2DS.printToErr("s2");

        env.execute();
    }
}

Basic merge operation

In practical applications, we often encounter multiple streams that come from different sources and whose data needs to be processed together. Merging streams is therefore very common in Flink, and the corresponding APIs are quite rich.

Union

The simplest merge operation is to directly combine multiple streams, which is called the "union" of streams. Union requires that the data types of all streams be the same; the merged stream contains the elements of all input streams, and the data type does not change.

In code, we simply call the .union() method on a DataStream and pass other DataStreams as parameters to union the streams; the result is still a DataStream:

stream1.union(stream2, stream3, ...)

Note: union() can take multiple DataStreams as parameters, so the union operation can merge more than two streams at once.
Code implementation: We can use the following code to do a simple test:

public class UnionExample {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.setParallelism(1);

        DataStreamSource<Integer> ds1 = env.fromElements(1, 2, 3);
        DataStreamSource<Integer> ds2 = env.fromElements(2, 2, 3);
        DataStreamSource<String> ds3 = env.fromElements("2", "2", "3");

        ds1.union(ds2, ds3.map(Integer::valueOf))
           .print();

        env.execute();
    }
}

Connect

Although union is simple, it is limited by the requirement that the data types cannot differ, which greatly reduces its flexibility, so it is rarely used in practice. Besides union, Flink provides another convenient merge operation: connect.

1) Connected Streams (ConnectedStreams)

Code implementation: it takes two steps. First, call the .connect() method on one DataStream and pass another DataStream as the parameter; this connects the two streams and yields a ConnectedStreams. Then call a co-processing method to get back a DataStream; the methods that can be called here are .map(), .flatMap(), and .process().

public class ConnectDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

//        DataStreamSource<Integer> source1 = env.fromElements(1, 2, 3);
//        DataStreamSource<String> source2 = env.fromElements("a", "b", "c");

        SingleOutputStreamOperator<Integer> source1 = env
                .socketTextStream("hadoop102", 7777)
                .map(i -> Integer.parseInt(i));

        DataStreamSource<String> source2 = env.socketTextStream("hadoop102", 8888);

        /**
         * TODO Merging streams with connect:
         * 1. Only two streams can be connected at a time
         * 2. The data types of the two streams can be different
         * 3. After connecting, map, flatMap or process can be called, but each stream is processed separately
         */
        ConnectedStreams<Integer, String> connect = source1.connect(source2);

        SingleOutputStreamOperator<String> result = connect.map(new CoMapFunction<Integer, String, String>() {
            @Override
            public String map1(Integer value) throws Exception {
                return "from the number stream: " + value.toString();
            }

            @Override
            public String map2(String value) throws Exception {
                return "from the letter stream: " + value;
            }
        });

        result.print();

        env.execute();
    }
}

In the code above, ConnectedStreams has two type parameters, representing the data types of the two streams it contains. Because of the "one country, two systems" arrangement (each stream keeps its own type and processing logic), calling .map() no longer takes a simple MapFunction but a CoMapFunction, which maps the data of the two streams separately. This interface has three type parameters, representing the data type of the first stream, of the second stream, and of the merged output stream. The methods to implement are straightforward: .map1() is the map operation for data from the first stream, and .map2() is for the second.

2) CoProcessFunction
Similar to CoMapFunction: if you call .map() you pass in a CoMapFunction and implement map1() and map2(); if you call .process() you pass in a CoProcessFunction. It is also a member of the "process function" family and is used in much the same way: the methods to implement are processElement1() and processElement2(). When a record arrives, one of the two methods is called depending on which source stream it came from.
It is worth mentioning that ConnectedStreams can also call .keyBy() directly to partition by key, and the result is still a ConnectedStreams:
connectedStreams.keyBy(keySelector1, keySelector2);
Here the two parameters keySelector1 and keySelector2 are the key selectors of the two streams respectively; the position (keyPosition) or field name (field) of the key can also be passed in directly, exactly as with the ordinary keyBy. The keyBy of ConnectedStreams effectively brings the data with the same key in the two streams together, after which each source stream is still processed separately, which is very useful in some scenarios.

Case requirement: connect the two streams and output the records whose ids match (similar to an inner join).

public class ConnectKeybyDemo {

    public static void main(String[] args) throws Exception {

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(2);

        DataStreamSource<Tuple2<Integer, String>> source1 = env.fromElements(
                Tuple2.of(1, "a1"),
                Tuple2.of(1, "a2"),
                Tuple2.of(2, "b"),
                Tuple2.of(3, "c")
        );
        DataStreamSource<Tuple3<Integer, String, Integer>> source2 = env.fromElements(
                Tuple3.of(1, "aa1", 1),
                Tuple3.of(1, "aa2", 2),
                Tuple3.of(2, "bb", 1),
                Tuple3.of(3, "cc", 1)
        );

        ConnectedStreams<Tuple2<Integer, String>, Tuple3<Integer, String, Integer>> connect = source1.connect(source2);

        // With parallelism > 1, keyBy on the join condition is required so that records with the same key
        // end up in the same subtask and can be matched
        ConnectedStreams<Tuple2<Integer, String>, Tuple3<Integer, String, Integer>> connectKey = connect.keyBy(s1 -> s1.f0, s2 -> s2.f0);

        SingleOutputStreamOperator<String> result = connectKey.process(
                new CoProcessFunction<Tuple2<Integer, String>, Tuple3<Integer, String, Integer>, String>() {
                    // HashMaps caching the records seen so far: key = id, value = list of records
                    Map<Integer, List<Tuple2<Integer, String>>> s1Cache = new HashMap<>();
                    Map<Integer, List<Tuple3<Integer, String, Integer>>> s2Cache = new HashMap<>();

                    @Override
                    public void processElement1(Tuple2<Integer, String> value, Context ctx, Collector<String> out) throws Exception {
                        Integer id = value.f0;
                        // TODO 1. Cache every s1 record that arrives
                        if (!s1Cache.containsKey(id)) {
                            // 1.1 First record for this id: create the list and put it into the map
                            List<Tuple2<Integer, String>> s1Values = new ArrayList<>();
                            s1Values.add(value);
                            s1Cache.put(id, s1Values);
                        } else {
                            // 1.2 Not the first record: just add it to the list
                            s1Cache.get(id).add(value);
                        }

                        // TODO 2. Look up s2 records with the same id and output only the matches
                        if (s2Cache.containsKey(id)) {
                            for (Tuple3<Integer, String, Integer> s2Element : s2Cache.get(id)) {
                                out.collect("s1:" + value + "<--------->s2:" + s2Element);
                            }
                        }
                    }

                    @Override
                    public void processElement2(Tuple3<Integer, String, Integer> value, Context ctx, Collector<String> out) throws Exception {
                        Integer id = value.f0;
                        // TODO 1. Cache every s2 record that arrives
                        if (!s2Cache.containsKey(id)) {
                            // 1.1 First record for this id: create the list and put it into the map
                            List<Tuple3<Integer, String, Integer>> s2Values = new ArrayList<>();
                            s2Values.add(value);
                            s2Cache.put(id, s2Values);
                        } else {
                            // 1.2 Not the first record: just add it to the list
                            s2Cache.get(id).add(value);
                        }

                        // TODO 2. Look up s1 records with the same id and output only the matches
                        if (s1Cache.containsKey(id)) {
                            for (Tuple2<Integer, String> s1Element : s1Cache.get(id)) {
                                out.collect("s1:" + s1Element + "<--------->s2:" + value);
                            }
                        }
                    }
                });

        result.print();

        env.execute();
    }
}

Origin blog.csdn.net/a772304419/article/details/132647138