Process Function: the All-Rounder of the Flink DataStream API

An earlier article, a detailed explanation of Flink time and watermarks, covered time semantics and watermarks in Flink. A natural follow-up question is how to access timestamps and watermarks from user code. The ordinary DataStream API does not expose them; for that you need the low-level API that Flink provides, the Process Function. A Process Function can access timestamps and watermarks, and it can also register timers that fire at some point in the future. In addition, it can send data to multiple output streams through side outputs, which makes it possible to split a stream and is also one way to handle late data. The rest of this article starts from the source code and then uses a concrete use case to show how to use a Process Function.

Introduction

Flink provides several Process Functions, each suited to a different purpose. The main ones are:

  • ProcessFunction
  • KeyedProcessFunction
  • CoProcessFunction
  • ProcessJoinFunction
  • ProcessWindowFunction
  • ProcessAllWindowFunction
  • BaseBroadcastProcessFunction
    • KeyedBroadcastProcessFunction
    • BroadcastProcessFunction

The inheritance diagram is as follows:
[Figure: inheritance diagram of the Process Function classes]

As the inheritance diagram shows, all of these functions implement the RichFunction interface, so they support open(), close(), getRuntimeContext() and the other rich-function methods. As the names suggest, each function targets a different scenario, but their basic capabilities are similar. The following takes KeyedProcessFunction as an example to discuss what these functions have in common.

Source code

KeyedProcessFunction

/**
 * A low-level function that processes the elements of a KeyedStream.
 * processElement is invoked for every element in the input stream and may produce zero or more outputs.
 * Implementations can access the element's timestamp and the timers through the Context.
 * When a timer fires, onTimer is called back.
 * onTimer may also produce zero or more outputs and may register further timers.
 *
 * Note: to access keyed state and timers, KeyedProcessFunction must be applied on a KeyedStream.
 * In addition, its parent class AbstractRichFunction implements the RichFunction interface, so
 * open(), close() and getRuntimeContext() are available.
 *
 * @param <K> type of the key
 * @param <I> type of the input elements
 * @param <O> type of the output elements
 */
@PublicEvolving
public abstract class KeyedProcessFunction<K, I, O> extends AbstractRichFunction {

	private static final long serialVersionUID = 1L;

	/**
	 * Processes one element of the input stream.
	 * May emit zero or more outputs, similar to FlatMap.
	 * It can also update internal state or set timers.
	 * @param value the input element
	 * @param ctx  Context that gives access to the element's timestamp and to a TimerService
	 *  for registering timers and querying the time.
	 *  The Context is only valid while processElement is being invoked.
	 * @param out  collector for the results
	 * @throws Exception
	 */
	public abstract void processElement(I value, Context ctx, Collector<O> out) throws Exception;

	/**
	 * Callback that is invoked when a timer registered with the TimerService fires.
	 * @param timestamp timestamp of the timer that fired
	 * @param ctx  OnTimerContext that gives access to the timestamp, to the TimeDomain of the
	 * firing timer (the enum has two values: EVENT_TIME and PROCESSING_TIME), and to a
	 * TimerService for registering timers and querying the time.
	 * The OnTimerContext is only valid while onTimer is being invoked.
	 * @param out collector for the results
	 * @throws Exception
	 */
	public void onTimer(long timestamp, OnTimerContext ctx, Collector<O> out) throws Exception {
	}

	/**
	 * Only valid during an invocation of processElement() or onTimer().
	 */
	public abstract class Context {

		/**
		 * Timestamp of the element being processed, or of the timer that fired.
		 * May be null, for example when the program's time characteristic is
		 * TimeCharacteristic#ProcessingTime.
		 */
		public abstract Long timestamp();

		/**
		 * Gives access to the time and to the registered timers.
		 */
		public abstract TimerService timerService();

		/**
		 * Emits a record to a side output.
		 * @param outputTag tag that identifies the side output
		 * @param value the record to emit
		 * @param <X> type of the side output records
		 */
		public abstract <X> void output(OutputTag<X> outputTag, X value);

		/**
		 * Returns the key of the element being processed.
		 */
		public abstract K getCurrentKey();
	}

	/**
	 * Only available while onTimer is being invoked.
	 */
	public abstract class OnTimerContext extends Context {

		/**
		 * Time domain of the firing timer: EVENT_TIME or PROCESSING_TIME.
		 */
		public abstract TimeDomain timeDomain();

		/**
		 * Returns the key of the timer that fired.
		 */
		@Override
		public abstract K getCurrentKey();
	}
}

The source code above has two main methods:

  • processElement(I value, Context ctx, Collector<O> out)

This method is called once for every record in the stream. It can emit zero or more elements, similar to FlatMap, and sends its results through the Collector. Through the Context parameter, the user can access the timestamp and key of the current record as well as the TimerService (explained in detail below). The Context's output method can also send data to a side output, which is how a stream is split or late data is handled.

  • onTimer(long timestamp, OnTimerContext ctx, Collector<O> out)

This method is a callback that is invoked when a timer registered with the TimerService fires. The timestamp parameter is the timestamp of the timer that fired, and the Collector can be used to emit records. Attentive readers will notice that both methods take a context parameter: processElement receives a Context and onTimer receives an OnTimerContext. The two objects offer similar functionality; in addition, OnTimerContext can return the time domain of the timer that fired (EVENT_TIME or PROCESSING_TIME).
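
To make the cooperation between the two callbacks concrete, here is a minimal plain-Java model of a common pattern: count elements per key, then emit the count and drop the state once a key has been idle for a while. It has no Flink dependency so that it can be run on its own. The class name, TIMEOUT_MS, the string keys and the advanceWatermark driver are all assumptions made for illustration; in a real job the map would be keyed state and the timer would be registered through ctx.timerService().

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model (not Flink code) of "emit and clear per-key state after inactivity". */
final class ToyIdleStateCleanup {
    static final long TIMEOUT_MS = 60_000L; // hypothetical inactivity timeout

    // Per-key state: [0] = element count, [1] = timestamp of the last element.
    private final Map<String, long[]> state = new HashMap<>();
    // Pending timers: timer timestamp -> keys that have a timer at that timestamp.
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();
    // Stands in for the Collector: every emitted record is appended here.
    final List<String> output = new ArrayList<>();

    /** Role of processElement: update the state and set a timer TIMEOUT_MS after this element. */
    void processElement(String key, long timestamp) {
        long[] s = state.computeIfAbsent(key, k -> new long[2]);
        s[0]++;
        s[1] = timestamp;
        timers.computeIfAbsent(timestamp + TIMEOUT_MS, t -> new TreeSet<>()).add(key);
    }

    /** Role of onTimer: act only if no newer element arrived for the key since this timer was set. */
    private void onTimer(String key, long timerTimestamp) {
        long[] s = state.get(key);
        if (s != null && timerTimestamp == s[1] + TIMEOUT_MS) {
            output.add(key + "=" + s[0]);
            state.remove(key); // the key is idle, so its state is released
        }
    }

    /** Toy driver: fires every timer whose timestamp is less than or equal to the watermark. */
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, TreeSet<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                onTimer(key, due.getKey());
            }
        }
    }

    boolean hasState(String key) {
        return state.containsKey(key);
    }

    public static void main(String[] args) {
        ToyIdleStateCleanup f = new ToyIdleStateCleanup();
        f.processElement("user-1", 1_000L);
        f.processElement("user-1", 30_000L);
        f.processElement("user-2", 40_000L);
        f.advanceWatermark(61_000L);  // the 61000 timer is stale: user-1 was seen again at 30000
        System.out.println(f.output); // []
        f.advanceWatermark(100_000L); // the 90000 and 100000 timers fire
        System.out.println(f.output); // [user-1=2, user-2=1]
    }
}
```

The stale-timer check in onTimer is a design choice of this sketch: instead of deleting the old timer whenever a new element arrives, it lets the old timer fire and ignores it.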

TimerService

KeyedProcessFunction uses TimerService to query the current time and to manage timers. Its source code is as follows:

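@PublicEvolving
public interface TimerService {

	String UNSUPPORTED_REGISTER_TIMER_MSG = "Setting timers is only supported on a keyed streams.";
	String UNSUPPORTED_DELETE_TIMER_MSG = "Deleting timers is only supported on a keyed streams.";

	// Returns the current processing time
	long currentProcessingTime();

	// Returns the current event-time watermark
	long currentWatermark();

	/**
	 * Registers a timer that fires when processing time reaches the given time.
	 * @param time the processing time at which the timer fires
	 */
	void registerProcessingTimeTimer(long time);

	/**
	 * Registers a timer that fires when the event-time watermark reaches the given time.
	 * @param time the event time at which the timer fires
	 */
	void registerEventTimeTimer(long time);

	/**
	 * Deletes the processing-time timer with the given trigger time.
	 * Has no effect if no such timer exists, i.e. the call only takes effect
	 * if the timer was registered earlier and has not yet expired.
	 *
	 * @param time the trigger time of the timer to delete
	 */
	void deleteProcessingTimeTimer(long time);

	/**
	 * Deletes the event-time timer with the given trigger time.
	 * Has no effect if no such timer exists, i.e. the call only takes effect
	 * if the timer was registered earlier and has not yet expired.
	 *
	 * @param time the trigger time of the timer to delete
	 */
	void deleteEventTimeTimer(long time);
}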

TimerService provides the following methods:

  • currentProcessingTime()

Returns the current processing time.

  • currentWatermark()

Returns the timestamp of the current event-time watermark.

  • registerProcessingTimeTimer(long time)

Registers a processing-time timer for the current key. The timer fires when processing time reaches the timer's timestamp.

  • registerEventTimeTimer(long time)

Registers an event-time timer for the current key. The timer fires when the watermark timestamp is greater than or equal to the timer's timestamp.

  • deleteProcessingTimeTimer(long time)

Deletes a previously registered processing-time timer for the current key. If no such timer exists, the call has no effect.

  • deleteEventTimeTimer(long time)

Deletes a previously registered event-time timer for the current key. If no such timer exists, the call has no effect.

When a timer fires, the onTimer() callback is invoked. Flink synchronizes calls to processElement() and onTimer(), so the two never modify state concurrently.
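
The event-time half of these semantics can be modelled in a few lines of plain Java. The sketch below is not Flink's TimerService implementation; it is a toy stand-in with a hypothetical class name and an explicit key argument (in Flink the current key is implicit). It shows three behaviours: at most one timer is kept per key and timestamp, deleting a timer that does not exist has no effect, and a timer fires once the watermark is greater than or equal to its timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy stand-in for event-time timers; NOT Flink's TimerService implementation. */
final class ToyEventTimeTimers {
    // timer timestamp -> keys with a timer at that timestamp; the set removes duplicates
    private final TreeMap<Long, TreeSet<String>> timers = new TreeMap<>();
    private long currentWatermark = Long.MIN_VALUE;

    long currentWatermark() {
        return currentWatermark;
    }

    void registerEventTimeTimer(String key, long time) {
        timers.computeIfAbsent(time, t -> new TreeSet<>()).add(key);
    }

    void deleteEventTimeTimer(String key, long time) {
        TreeSet<String> keys = timers.get(time);
        if (keys == null) {
            return; // nothing registered at this timestamp: no effect
        }
        keys.remove(key);
        if (keys.isEmpty()) {
            timers.remove(time);
        }
    }

    /** Advances the watermark and returns the fired timers as "key@timestamp", in timestamp order. */
    List<String> advanceWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= currentWatermark) {
            Map.Entry<Long, TreeSet<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                fired.add(key + "@" + due.getKey()); // stands in for a call to onTimer
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        ToyEventTimeTimers t = new ToyEventTimeTimers();
        t.registerEventTimeTimer("user-1", 100L);
        t.registerEventTimeTimer("user-1", 100L); // same key and timestamp: kept once
        t.registerEventTimeTimer("user-2", 200L);
        t.deleteEventTimeTimer("user-3", 999L);   // no such timer: no effect
        System.out.println(t.advanceWatermark(99L));  // []
        System.out.println(t.advanceWatermark(150L)); // [user-1@100]
        System.out.println(t.advanceWatermark(500L)); // [user-2@200]
    }
}
```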

Note: the two error messages in the source code above indicate that timers can only be used on keyed streams. Timers are commonly used to clear keyed state for keys that are no longer in use, or to implement custom time-based windowing logic. To use timers on a stream that is not keyed, you can key it with a KeySelector that returns a fixed value (for example, a constant), so that all records are sent to a single partition.

Use Cases

The following example uses the side output feature of Process Function to split a stream. The code is as follows:

import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

// UserBehaviors is the article's own class (userId, behavior, product); its definition is not shown here.
public class ProcessFunctionExample {

    // Side output tags. The anonymous subclasses ({}) let Flink capture the generic type.
    static final OutputTag<UserBehaviors> buyTags = new OutputTag<UserBehaviors>("buy") {};
    static final OutputTag<UserBehaviors> cartTags = new OutputTag<UserBehaviors>("cart") {};
    static final OutputTag<UserBehaviors> favTags = new OutputTag<UserBehaviors>("fav") {};

    static class SplitStreamFunction extends ProcessFunction<UserBehaviors, UserBehaviors> {

        @Override
        public void processElement(UserBehaviors value, Context ctx, Collector<UserBehaviors> out) throws Exception {
            // Route each record to the side output that matches its behavior;
            // anything else stays in the main output.
            switch (value.behavior) {
                case "buy":
                    ctx.output(buyTags, value);
                    break;
                case "cart":
                    ctx.output(cartTags, value);
                    break;
                case "fav":
                    ctx.output(favTags, value);
                    break;
                default:
                    out.collect(value);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);

        // Simulated data source [userId, behavior, product]
        SingleOutputStreamOperator<UserBehaviors> splitStream = env.fromElements(
                new UserBehaviors(1L, "buy", "iphone"),
                new UserBehaviors(1L, "cart", "huawei"),
                new UserBehaviors(1L, "buy", "logi"),
                new UserBehaviors(1L, "fav", "oppo"),
                new UserBehaviors(2L, "buy", "huawei"),
                new UserBehaviors(2L, "buy", "onemore"),
                new UserBehaviors(2L, "fav", "iphone")).process(new SplitStreamFunction());

        // Purchase records after the split
        splitStream.getSideOutput(buyTags).print("data_buy");
        // Add-to-cart records after the split
        splitStream.getSideOutput(cartTags).print("data_cart");
        // Favorite records after the split
        splitStream.getSideOutput(favTags).print("data_fav");

        env.execute("ProcessFunctionExample");
    }
}
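
The same routing idea applies to late data. The plain-Java sketch below models it without any Flink dependency: a record whose event timestamp is not greater than the current watermark is treated as late and goes to a separate list that plays the role of a side output. The class name, the two lists and the onWatermark driver are hypothetical; in a Process Function the comparison would presumably use ctx.timestamp() and ctx.timerService().currentWatermark(), with ctx.output(...) sending the late record to its tag.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (not Flink code) of routing late records to a separate output. */
final class ToyLateDataSplitter {
    final List<Long> mainOutput = new ArrayList<>();
    final List<Long> lateOutput = new ArrayList<>(); // plays the role of a side output
    private long currentWatermark = Long.MIN_VALUE;

    /** Watermarks only move forward, so a smaller value is ignored. */
    void onWatermark(long watermark) {
        currentWatermark = Math.max(currentWatermark, watermark);
    }

    /** A record is late if the watermark has already reached or passed its timestamp. */
    void processElement(long eventTimestamp) {
        if (eventTimestamp <= currentWatermark) {
            lateOutput.add(eventTimestamp);
        } else {
            mainOutput.add(eventTimestamp);
        }
    }

    public static void main(String[] args) {
        ToyLateDataSplitter s = new ToyLateDataSplitter();
        s.processElement(5L);  // no watermark yet: on time
        s.onWatermark(10L);
        s.processElement(10L); // not greater than the watermark: late
        s.processElement(11L); // on time
        System.out.println(s.mainOutput); // [5, 11]
        System.out.println(s.lateOutput); // [10]
    }
}
```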

Summary

This article first introduced the low-level Process Function APIs that Flink provides. These APIs can access timestamps and watermarks, and they support registering timers that invoke the onTimer() callback. It then walked through the parts these APIs have in common from the perspective of the source code and explained the meaning and usage of each method. Finally, it presented a common use case for Process Function: splitting a stream with side outputs. Beyond that, users can register timers and define their own processing logic in the callback, which makes these functions very flexible.


Origin blog.csdn.net/jmx_bigdata/article/details/105937485