Flink Learning, Part 11: Window and Event Time Examples

The previous post covered processing time. This one looks at event time and at watermarks, which every event-time scenario has to deal with.

EventTime & Watermark

Event time programs must specify how to generate Event Time Watermarks, which is the mechanism that signals progress in event time.

In other words, an event-time program must specify how its watermarks are generated.

The following is quoted from "Learning Flink from 0 to 1" and the official documentation:

A stream processor that supports event time needs a way to measure the progress of event time. For example, a window operator that builds hourly windows needs to be notified when event time has passed beyond the end of an hour, so that the operator can close the window in progress.

Event time can progress independently of processing time (measured by wall clocks). For example, in one program the current event time of an operator may trail slightly behind the processing time (accounting for a delay in receiving the events), while both proceed at the same speed. On the other hand, another streaming program might progress through weeks of event time with only a few seconds of processing, by fast-forwarding through some historic data already buffered in a Kafka topic (or another message queue).
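
The hourly-window example in the quote can be made concrete with a little arithmetic. The following is a minimal plain-Java sketch with no Flink dependency; the class and method names are hypothetical, and the closing rule (the watermark has reached the window's last millisecond) is an assumption modelled on how event-time windows are commonly described, not a copy of Flink's internals.

```java
import java.time.Duration;

/** Plain-Java sketch (hypothetical names) of when an hourly window may close. */
final class HourlyWindowSketch {
    static final long WINDOW_SIZE_MS = Duration.ofHours(1).toMillis();

    /** Start of the hourly window that contains the given event timestamp. */
    static long windowStart(long timestampMs) {
        return timestampMs - Math.floorMod(timestampMs, WINDOW_SIZE_MS);
    }

    /** Exclusive end of the window that contains the given event timestamp. */
    static long windowEnd(long timestampMs) {
        return windowStart(timestampMs) + WINDOW_SIZE_MS;
    }

    /** Assumed rule: the window may close once the watermark reaches its last millisecond. */
    static boolean canClose(long windowEndMs, long watermarkMs) {
        return watermarkMs >= windowEndMs - 1;
    }

    public static void main(String[] args) {
        long event = 5_400_000L;      // 01:30:00 after the epoch
        long end = windowEnd(event);  // 02:00:00 after the epoch
        System.out.println(canClose(end, 7_000_000L)); // watermark still inside the hour
        System.out.println(canClose(end, 7_199_999L)); // watermark at the window's last millisecond
    }
}
```

Note that wall-clock time appears nowhere in the sketch: only the watermark decides when the window closes, which is why a job can replay weeks of historic data in seconds.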

The mechanism in Flink to measure progress in event time is watermarks. Watermarks flow as part of the data stream and carry a timestamp t. A Watermark(t) declares that event time has reached time t in that stream, meaning that there should be no more elements from the stream with a timestamp t' <= t (i.e. events with timestamps older than or equal to the watermark).

The figure below shows a stream of events with (logical) timestamps, and watermarks flowing inline. In this example the events are in order (with respect to their timestamps), meaning that the watermarks are simply periodic markers in the stream.

[Figure: stream_watermark_in_order]

Watermarks are crucial for out-of-order streams, as illustrated below, where the events are not ordered by their timestamps. In general a watermark is a declaration that by that point in the stream, all events up to a certain timestamp should have arrived. Once a watermark reaches an operator, the operator can advance its internal event time clock to the value of the watermark.

[Figure: stream_watermark_out_of_order]

My understanding: if the time characteristic is set to event time, watermarks must be configured, because they are the markers that tell Flink how far event time has progressed.

Once Watermark(time1) has been emitted, every element with a timestamp time2 <= time1 is assumed to have already arrived, whether the stream is ordered or out of order.

Who generates the watermarks? The code of the running Flink job does, not the external data source itself.

Does every element get its own watermark? It can be 1:1, but it does not have to be; it depends on the needs of the job.
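
The rule above can be sketched as a tiny event-time clock. This is plain Java rather than Flink code, and the class and method names are hypothetical; it only illustrates the semantics "no element with a timestamp <= the watermark is expected any more".

```java
/** Plain-Java sketch (hypothetical names) of an operator's event-time clock. */
final class EventTimeClockSketch {
    private long currentWatermark = Long.MIN_VALUE;

    /** Advances the clock to the given watermark; the clock never moves backwards. */
    void onWatermark(long watermarkMs) {
        currentWatermark = Math.max(currentWatermark, watermarkMs);
    }

    /** An element counts as late if its timestamp is not newer than the current watermark. */
    boolean isLate(long timestampMs) {
        return timestampMs <= currentWatermark;
    }

    long currentWatermark() {
        return currentWatermark;
    }

    public static void main(String[] args) {
        EventTimeClockSketch clock = new EventTimeClockSketch();
        clock.onWatermark(10);
        System.out.println(clock.isLate(10)); // timestamp equal to the watermark
        System.out.println(clock.isLate(11)); // timestamp newer than the watermark
    }
}
```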

It is possible to generate a watermark on every single event. However, because each watermark causes some computation downstream, an excessive number of watermarks degrades performance.

Watermarks in Parallel Streams

Watermarks are generated at, or directly after, source functions. Each parallel subtask of a source function usually generates its watermarks independently. These watermarks define the event time at that particular parallel source.

As the watermarks flow through the streaming program, they advance the event time at the operators where they arrive. Whenever an operator advances its event time, it generates a new watermark downstream for its successor operators.

Some operators consume multiple input streams; a union, for example, or operators following a keyBy(...) or partition(...) function. Such an operator's current event time is the minimum of its input streams' event times. As its input streams update their event times, so does the operator.

The figure below shows an example of events and watermarks flowing through parallel streams, and operators tracking event time.
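
The minimum rule for multi-input operators can be sketched in a few lines. This is plain Java with hypothetical names, not Flink's implementation; it assumes at least one input and that an input which has not yet sent a watermark holds the operator back.

```java
import java.util.Arrays;

/** Plain-Java sketch (hypothetical names) of event time in an operator with several inputs. */
final class MultiInputWatermarkSketch {
    private final long[] inputWatermarks;
    private long operatorWatermark = Long.MIN_VALUE;

    MultiInputWatermarkSketch(int numInputs) {
        inputWatermarks = new long[numInputs];
        // Inputs that have not sent a watermark yet hold the operator back.
        Arrays.fill(inputWatermarks, Long.MIN_VALUE);
    }

    /** Records a watermark from one input and returns the operator's event time afterwards. */
    long onWatermark(int input, long watermarkMs) {
        inputWatermarks[input] = Math.max(inputWatermarks[input], watermarkMs);
        long min = Arrays.stream(inputWatermarks).min().getAsLong();
        operatorWatermark = Math.max(operatorWatermark, min);
        return operatorWatermark;
    }

    public static void main(String[] args) {
        MultiInputWatermarkSketch op = new MultiInputWatermarkSketch(2);
        op.onWatermark(0, 30);                     // input 1 is still silent
        System.out.println(op.onWatermark(1, 20)); // the slower input decides
    }
}
```

The slowest input always determines the operator's event time, so a single stalled input stalls every event-time operation downstream of it.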

[Figure: flink_parallel_streams_watermarks]

As the figure shows, event time originates at the source, and so do the watermarks.

Data from the source passes through the map transformation and then on to window processing.

I did not understand the rest...

About TimeStamp and Watermark

In order to work with event time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element.

Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time.

There are two ways to assign timestamps and generate watermarks:

  1. Directly in the data stream source
  2. Via a timestamp assigner / watermark generator: in Flink, timestamp assigners also define the watermarks to be emitted

Attention: Both timestamps and watermarks are specified as milliseconds since the Java epoch of 1970-01-01T00:00:00Z.

Under event time, Flink must know the timestamp of each event. That is, every element in the stream must be assigned a timestamp, which is usually taken from a field of the element.

Timestamp assignment and watermark generation are usually handled together (hand-in-hand).

There are two ways to assign timestamps and generate watermarks:

  • Directly in the data source
  • Via a timestamp assigner (also known as a watermark generator). In Flink, a timestamp assigner is also a watermark generator

Directly specified in the datasource
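
As a quick illustration of the unit, here is a small plain-Java sketch that converts an ISO-8601 instant into epoch milliseconds using java.time. The class and method names are hypothetical. A common pitfall is a field that holds epoch seconds, which would have to be multiplied by 1000 first.

```java
import java.time.Instant;

/** Plain-Java sketch (hypothetical names): event timestamps as milliseconds since the epoch. */
final class EpochMillisSketch {
    /** Parses an ISO-8601 instant such as 1970-01-01T00:00:01Z into epoch milliseconds. */
    static long toEpochMillis(String isoInstant) {
        return Instant.parse(isoInstant).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1970-01-01T00:00:00Z")); // the epoch itself
        System.out.println(toEpochMillis("1970-01-01T00:00:01Z")); // one second later
    }
}
```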

Stream sources can directly assign timestamps to the elements they produce, and they can also emit watermarks. When this is done, no timestamp assigner is needed. Note that if a timestamp assigner is used, any timestamps and watermarks provided by the source will be overwritten.

To assign a timestamp to an element in the source directly, the source must use the collectWithTimestamp(...) method on the SourceContext. To generate watermarks, the source must call the emitWatermark(Watermark) function.

For example, the MySQL data source with Spring from an earlier post implemented run like this:

    @Override
    public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
        log.info("------query ");
        if (urlInfoManager == null) {
            init();
        }
        List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
        urlInfoList.parallelStream().forEach(urlInfo -> sourceContext.collect(urlInfo));
    }

To attach a timestamp, call collectWithTimestamp; to generate a watermark, call emitWatermark.

Modified as follows:

    @Override
    public void run(SourceContext<UrlInfo> sourceContext) throws Exception {
        log.info("------query ");
        if (urlInfoManager == null) {
            init();
        }
        List<UrlInfo> urlInfoList = urlInfoManager.queryAll();
        // Sequential on purpose: emission order matters once watermarks are involved.
        urlInfoList.forEach(urlInfo -> {
            // Use the event's own time if present, otherwise fall back to the wall clock.
            long timestamp = urlInfo.getCurrentTime() == null
                    ? System.currentTimeMillis()
                    : urlInfo.getCurrentTime().getTime();
            // Add the timestamp. collectWithTimestamp already emits the element,
            // so an extra collect(urlInfo) would emit it twice.
            sourceContext.collectWithTimestamp(urlInfo, timestamp);
            // Generate the watermark.
            sourceContext.emitWatermark(new Watermark(timestamp));
        });
    }

Note the two added calls: a timestamp and a watermark are produced for every element.

Specified by a Timestamp Assigner / Watermark Generator

Timestamp assigners take a stream and produce a new stream with timestamped elements and watermarks. If the original stream had timestamps and/or watermarks already, the timestamp assigner overwrites them.

Timestamp assigners are usually specified immediately after the data source, but it is not strictly required to do so. A common pattern, for example, is to parse (MapFunction) and filter (FilterFunction) before the timestamp assigner. In any case, the timestamp assigner needs to be specified before the first operation on event time (such as the first window operation). As a special case, when using Kafka as the source of a streaming job, Flink allows the specification of a timestamp assigner / watermark emitter inside the source (or consumer) itself. More information on how to do so can be found in the Kafka Connector documentation.

A timestamp assigner takes a stream as input and outputs a stream whose elements carry timestamps and watermarks. If the input stream already had timestamps or watermarks, they are overwritten.

A timestamp assigner is usually specified right after the data source is created, but it does not have to be. A common pattern is to parse and filter first and then specify the timestamp assigner. Either way, it must be in place before the first operation that needs event time.

Look at an example:

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
        DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Filter first, then assign timestamps and watermarks.
        SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream = dataStreamSource
                .filter((FilterFunction<UrlInfo>) o -> o.getDomain() == UrlInfo.BAIDU)
                .assignTimestampsAndWatermarks(new MyTimestampAndWatermarkAssigner());

        dataStreamSource.addSink(new PrintSinkFunction<>());
        env.execute("mysql Datasource with pool and spring");
    }

Here assignTimestampsAndWatermarks is applied after the filter.

With Periodic Watermarks (watermarks added periodically)

AssignerWithPeriodicWatermarks assigns timestamps and generates watermarks periodically (possibly depending on the stream elements, or purely based on processing time).

The interval (every n milliseconds) in which the watermark will be generated is defined via ExecutionConfig.setAutoWatermarkInterval(...). The assigner’s getCurrentWatermark() method will be called each time, and a new watermark will be emitted if the returned watermark is non-null and larger than the previous watermark.

To generate watermarks periodically rather than for every element, implement AssignerWithPeriodicWatermarks. The interval is given in milliseconds and is set with ExecutionConfig.setAutoWatermarkInterval.
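
The emission rule (emit only if the new watermark is non-null and larger than the previous one) can be sketched without Flink. The class below is plain Java with hypothetical names. It uses a "largest timestamp seen minus a fixed bound" strategy, which is an assumption chosen for illustration and differs from the processing-time-based generator used later in this post.

```java
/** Plain-Java sketch (hypothetical names) of periodic watermark generation. */
final class PeriodicWatermarkSketch {
    private final long maxOutOfOrdernessMs;
    private long maxSeenTimestamp = Long.MIN_VALUE;
    private long lastEmitted = Long.MIN_VALUE;

    PeriodicWatermarkSketch(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    /** Called for every element: remember the largest timestamp seen so far. */
    void onElement(long timestampMs) {
        maxSeenTimestamp = Math.max(maxSeenTimestamp, timestampMs);
    }

    /** Called on every tick of the interval; returns the watermark to emit, or null for none. */
    Long onPeriodicTick() {
        if (maxSeenTimestamp == Long.MIN_VALUE) {
            return null; // no element seen yet, nothing to report
        }
        long candidate = maxSeenTimestamp - maxOutOfOrdernessMs;
        if (candidate <= lastEmitted) {
            return null; // not larger than the previous watermark, so nothing is emitted
        }
        lastEmitted = candidate;
        return candidate;
    }

    public static void main(String[] args) {
        PeriodicWatermarkSketch generator = new PeriodicWatermarkSketch(5_000);
        generator.onElement(10_000);
        generator.onElement(8_000); // out of order, does not lower the maximum
        System.out.println(generator.onPeriodicTick());
        System.out.println(generator.onPeriodicTick()); // no progress since the last tick
    }
}
```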

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
        DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        // Set the watermark interval (in milliseconds).
        ExecutionConfig config = env.getConfig();
        config.setAutoWatermarkInterval(300);

        SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream = dataStreamSource
                .filter((FilterFunction<UrlInfo>) o -> o.getDomain() == UrlInfo.BAIDU)
                .assignTimestampsAndWatermarks(new TimeLagWatermarkGenerator());

        dataStreamSource.addSink(new PrintSinkFunction<>());
        env.execute("mysql Datasource with pool and spring");
    }

Here the watermark interval is set through ExecutionConfig, and a TimeLagWatermarkGenerator is added after the filter. Its code is as follows (from the official documentation, slightly modified):

    import myflink.model.UrlInfo;
    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;

    /**
     * This generator generates watermarks that are lagging behind processing time by a fixed amount.
     * It assumes that elements arrive in Flink after a bounded delay.
     */
    public class TimeLagWatermarkGenerator implements AssignerWithPeriodicWatermarks<UrlInfo> {

        private final long maxTimeLag = 5000; // 5 seconds

        @Override
        public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
            return element.getCurrentTime().getTime();
        }

        @Override
        public Watermark getCurrentWatermark() {
            // return the watermark as current time minus the maximum time lag
            return new Watermark(System.currentTimeMillis() - maxTimeLag);
        }
    }

With Punctuated Watermarks

To generate watermarks whenever a certain event indicates that a new watermark might be generated, use AssignerWithPunctuatedWatermarks. For this class Flink will first call the extractTimestamp(...) method to assign the element a timestamp, and then immediately call the checkAndGetNextWatermark(...) method on that element.

The checkAndGetNextWatermark(...) method is passed the timestamp that was assigned in the extractTimestamp(...) method, and can decide whether it wants to generate a watermark. Whenever the checkAndGetNextWatermark(...) method returns a non-null watermark, and that watermark is larger than the latest previous watermark, that new watermark will be emitted.
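
The two-step call sequence and the emission rule can be simulated in plain Java. Everything below is a hypothetical sketch (including the Event type and its marker flag) and does not use Flink classes. It only mirrors the described behaviour: extract the timestamp, ask for a watermark, and emit it only if it is non-null and larger than the previous one.

```java
/** Plain-Java sketch (hypothetical names) of punctuated watermark generation. */
final class PunctuatedWatermarkSketch {

    /** Hypothetical event: a timestamp plus a flag saying "a watermark may follow me". */
    static final class Event {
        final long timestampMs;
        final boolean marker;

        Event(long timestampMs, boolean marker) {
            this.timestampMs = timestampMs;
            this.marker = marker;
        }
    }

    private long lastEmitted = Long.MIN_VALUE;

    /** Returns the watermark emitted for this element, or null if none is emitted. */
    Long onElement(Event event) {
        long timestamp = event.timestampMs; // step 1: the extractTimestamp(...) part
        if (!event.marker) {
            return null;                    // step 2: the generator declines to produce a watermark
        }
        if (timestamp <= lastEmitted) {
            return null;                    // not larger than the previous watermark
        }
        lastEmitted = timestamp;
        return timestamp;
    }

    public static void main(String[] args) {
        PunctuatedWatermarkSketch generator = new PunctuatedWatermarkSketch();
        System.out.println(generator.onElement(new Event(100, false))); // no marker
        System.out.println(generator.onElement(new Event(200, true)));  // marker with a new maximum
    }
}
```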

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        MysqlDSWithSpringForFlink streamSource = new MysqlDSWithSpringForFlink();
        DataStreamSource<UrlInfo> dataStreamSource = env.addSource(streamSource);

        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        SingleOutputStreamOperator<UrlInfo> withTimestampAndWatermarkStream = dataStreamSource
                .filter((FilterFunction<UrlInfo>) o -> o.getDomain() == UrlInfo.BAIDU)
                .assignTimestampsAndWatermarks(new PunctuatedAssigner());

        dataStreamSource.addSink(new PrintSinkFunction<>());
        env.execute("mysql Datasource with pool and spring");
    }

PunctuatedAssigner is implemented as follows:

    import myflink.model.UrlInfo;
    import org.apache.flink.streaming.api.functions.AssignerWithPunctuatedWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;

    public class PunctuatedAssigner implements AssignerWithPunctuatedWatermarks<UrlInfo> {

        @Override
        public long extractTimestamp(UrlInfo element, long previousElementTimestamp) {
            return element.getCurrentTime().getTime();
        }

        @Override
        public Watermark checkAndGetNextWatermark(UrlInfo lastElement, long extractedTimestamp) {
            // Create a new watermark with the given timestamp (in milliseconds)
            // only when the element carries a watermark marker.
            return lastElement.hasWatermarkMarker() ? new Watermark(extractedTimestamp) : null;
        }
    }

Kafka

When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).

In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.

For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks.

The illustrations below show how to use the per-Kafka-partition watermark generation, and how watermarks propagate through the streaming dataflow in that case.

A Kafka topic has multiple partitions, and each partition may follow its own event-time pattern. On the consumer side, data from several partitions is processed in parallel and interleaved, so the per-partition patterns are lost.

In that case, use Flink's Kafka-partition-aware watermark generation, as in the following code:

    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        Properties properties = new Properties();
        properties.put("bootstrap.servers", "localhost:9092");
        properties.put("zookeeper.connect", "localhost:2181");
        properties.put("group.id", "metric-group");
        properties.put("auto.offset.reset", "latest");
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        FlinkKafkaConsumer010<String> kafkaSource = new FlinkKafkaConsumer010<>(
                "testjin", // topic
                new SimpleStringSchema(),
                properties);

        // Assign on the consumer itself, so that watermarks are generated per Kafka partition.
        // Assigning on the mapped stream instead would not be partition-aware.
        kafkaSource.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<String>() {
            @Override
            public long extractAscendingTimestamp(String element) {
                return JSON.parseObject(element, UrlInfo.class).getCurrentTime().getTime();
            }
        });

        SingleOutputStreamOperator<UrlInfo> dataStreamSource = env.addSource(kafkaSource)
                .setParallelism(1)
                // map: transform one stream into another, here String --> UrlInfo
                .map(string -> JSON.parseObject(string, UrlInfo.class));

        env.execute("save url to db");
    }

Note the use of AscendingTimestampExtractor, an assigner for ascending timestamps.
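
To see why per-partition generation matters, the following plain-Java sketch (hypothetical names, no Kafka or Flink involved) feeds the same interleaved records through two strategies: one ascending watermark for the whole interleaved stream, and one ascending watermark per partition merged by taking the minimum. It assumes timestamps ascend within each partition and that the watermark trails the latest timestamp by one millisecond.

```java
import java.util.Arrays;

/** Plain-Java sketch (hypothetical names): stream-wide vs. per-partition watermarks. */
final class PerPartitionWatermarkSketch {

    /** Each record is {partition, timestampMs}. Counts late records with one stream-wide watermark. */
    static int lateWithGlobalWatermark(long[][] records) {
        long watermark = Long.MIN_VALUE;
        int late = 0;
        for (long[] record : records) {
            long timestamp = record[1];
            if (timestamp <= watermark) {
                late++;
            }
            watermark = Math.max(watermark, timestamp - 1);
        }
        return late;
    }

    /** Counts late records when every partition has its own watermark, merged by minimum. */
    static int lateWithPerPartitionWatermarks(long[][] records, int numPartitions) {
        long[] partitionWatermarks = new long[numPartitions];
        Arrays.fill(partitionWatermarks, Long.MIN_VALUE);
        long merged = Long.MIN_VALUE;
        int late = 0;
        for (long[] record : records) {
            int partition = (int) record[0];
            long timestamp = record[1];
            if (timestamp <= merged) {
                late++;
            }
            partitionWatermarks[partition] = Math.max(partitionWatermarks[partition], timestamp - 1);
            merged = Math.max(merged, Arrays.stream(partitionWatermarks).min().getAsLong());
        }
        return late;
    }

    public static void main(String[] args) {
        // Partition 0 is far ahead of partition 1; both are ascending on their own.
        long[][] interleaved = {{0, 100}, {1, 10}, {0, 200}, {1, 20}};
        System.out.println(lateWithGlobalWatermark(interleaved));
        System.out.println(lateWithPerPartitionWatermarks(interleaved, 2));
    }
}
```

With the sample data, the stream-wide strategy treats both records from the slower partition as late, while the per-partition strategy treats none as late.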

References:

http://www.54tianzhisheng.cn/2018/12/11/Flink-time/

https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_time.html

https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/event_timestamps_watermarks.html


Reprinted: https://www.jianshu.com/p/13b6d180adcb

Origin www.cnblogs.com/cxhfuujust/p/10960313.html