Joins in the Flink DataStream API

Window Join

A window join joins the elements of two streams that share a common key and lie in the same window. These windows can be defined with a window assigner and are evaluated on elements from both streams.

The elements from both sides are then passed to a user-defined JoinFunction or FlatJoinFunction, where the user can emit the results that meet the join criteria.

General usage can be summarized as follows:

stream.join(otherStream)
    .where(<KeySelector>)
    .equalTo(<KeySelector>)
    .window(<WindowAssigner>)
    .apply(<JoinFunction>)

Some notes on semantics:

  • The creation of pairwise combinations of elements of the two streams behaves like an inner join: an element of one stream is not emitted if the other stream has no corresponding element to join it with.
  • Elements that do get joined carry, as their timestamp, the largest timestamp that still lies in the respective window. For example, a window with the boundaries [5, 10) produces joined elements with the timestamp 9.
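The two rules above can be sketched in plain Scala without any Flink dependency. This is only an illustrative model, not Flink's implementation: the names Window, maxTimestamp and joinWithinWindow are hypothetical, and each element is assumed to be a (key, value) pair that already belongs to the given window.

```scala
object WindowJoinSemantics {
  // Hypothetical model of a window with boundaries [start, end).
  final case class Window(start: Long, end: Long) {
    // Largest timestamp that still lies inside the window.
    def maxTimestamp: Long = end - 1
  }

  // Inner join of the elements of one window: every left/right pair with
  // an equal key is emitted once, stamped with the window's max timestamp.
  // Elements without a partner on the other side produce no output.
  def joinWithinWindow[K, A, B](
      w: Window,
      left: Seq[(K, A)],
      right: Seq[(K, B)]): Seq[(Long, A, B)] =
    for {
      (lk, l) <- left
      (rk, r) <- right
      if lk == rk
    } yield (w.maxTimestamp, l, r)
}
```

For the window [5, 10), the element with key "j" below has no partner and is dropped, while the joined pair receives the timestamp 9.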

The following sections give an overview of how the different kinds of window joins behave, using some example scenarios.

Tumbling Window Join

When performing a tumbling window join, all elements with a common key and a common tumbling window are joined as pairwise combinations and passed to a JoinFunction or FlatJoinFunction. Because this behaves like an inner join, elements of one stream that have no counterpart from the other stream in their tumbling window are not emitted!

As shown in the figure, we define a tumbling window with a size of 2 milliseconds, which yields windows of the form [0,1], [2,3], .... The figure shows the pairwise combinations of all elements in each window that are passed to the JoinFunction. Note that nothing is emitted for the tumbling window [6,7], because the green stream contains no element to join with the orange element ⑥.

import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...

orangeStream.join(greenStream)
    .where(elem => /* select key */)
    .equalTo(elem => /* select key */)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(2)))
    .apply { (e1, e2) => e1 + "," + e2 }
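To make the tumbling behavior concrete, here is a minimal plain-Scala sketch that simulates it on in-memory collections. It is an assumption-laden model rather than Flink code: Elem, windowStart and join are hypothetical names, windows are aligned to multiples of the window size (no offset), and the input data in the usage note is invented.

```scala
object TumblingJoinSketch {
  // Hypothetical element: join key, event timestamp in ms, payload.
  final case class Elem(key: String, ts: Long, value: Int)

  // Start of the tumbling window that contains ts (windows aligned at 0).
  def windowStart(ts: Long, size: Long): Long =
    ts - java.lang.Math.floorMod(ts, size)

  // Pairs every orange element with every green element that has the same
  // key and falls into the same tumbling window. Returns (windowStart, "o,g").
  def join(orange: Seq[Elem], green: Seq[Elem], size: Long): Seq[(Long, String)] = {
    val greenByWindow = green.groupBy(g => (g.key, windowStart(g.ts, size)))
    for {
      o <- orange
      g <- greenByWindow.getOrElse((o.key, windowStart(o.ts, size)), Seq.empty)
    } yield (windowStart(o.ts, size), s"${o.value},${g.value}")
  }
}
```

With a size of 2, orange elements at timestamps 0, 1 and 6 and green elements at 0 and 3 (all with the same key) produce only the pairs "0,0" and "1,0" from window [0,1]. The orange element at 6 and the green element at 3 have no partner in their windows, so nothing is emitted for them.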

Sliding Window Join

When performing a sliding window join, all elements with a common key and a common sliding window are joined as pairwise combinations and passed to the JoinFunction or FlatJoinFunction. Elements of one stream that have no counterpart from the other stream in the current sliding window are not emitted!

Note that some elements might be joined in one sliding window but not in another!

In this example we use sliding windows with a size of 2 milliseconds and slide them by 1 millisecond, which yields the sliding windows [-1,0], [0,1], [1,2], [2,3], .... The joined elements below the x-axis are the ones passed to the JoinFunction for each sliding window. Here you can also see, for example, how the orange ② is joined with the green ③ in the window [2,3] but is not joined with the green ③ in the window [1,2].

import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...

orangeStream.join(greenStream)
    .where(elem => /* select key */)
    .equalTo(elem => /* select key */)
    .window(SlidingEventTimeWindows.of(Time.milliseconds(2) /* size */, Time.milliseconds(1) /* slide */))
    .apply { (e1, e2) => e1 + "," + e2 }
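Why the same element can be joined in one sliding window but not in another becomes clearer when the window assignment is written out. The following plain-Scala sketch lists the windows a timestamp falls into; windowsFor is a hypothetical helper, windows are assumed to be aligned at 0 with no offset, and bounds are returned in the inclusive notation used in the text.

```scala
object SlidingWindowSketch {
  // All sliding windows (as inclusive [first, last] bounds) that contain ts,
  // latest window first. An element belongs to size / slide windows.
  def windowsFor(ts: Long, size: Long, slide: Long): List[(Long, Long)] = {
    val lastStart = ts - java.lang.Math.floorMod(ts, slide)
    Iterator
      .iterate(lastStart)(_ - slide)   // walk back one slide at a time
      .takeWhile(_ > ts - size)        // keep starts whose window still covers ts
      .map(start => (start, start + size - 1))
      .toList
  }
}
```

With a size of 2 and a slide of 1, timestamp 2 falls into [2,3] and [1,2], while timestamp 3 falls into [3,4] and [2,3]. The only window the two share is [2,3], which matches the behavior described for the orange ② and the green ③.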

Session Window Join

When performing a session window join, all elements with the same key that, when "combined", fulfill the session criteria are joined in pairwise combinations and passed to the JoinFunction or FlatJoinFunction. Again this performs an inner join, so a session window that contains elements from only one stream emits no output!

Here we define a session window join in which sessions are separated by a gap of at least 1 ms. There are three sessions. In the first two, the joined elements from both streams are passed to the JoinFunction. In the third session the green stream has no elements, so ⑧ and ⑨ are not joined!

import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

...

val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...

orangeStream.join(greenStream)
    .where(elem => /* select key */)
    .equalTo(elem => /* select key */)
    .window(EventTimeSessionWindows.withGap(Time.milliseconds(1)))
    .apply { (e1, e2) => e1 + "," + e2 }
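The session case can be simulated in plain Scala as well. The sketch below is a simplified model for a single key, not Flink's implementation: Elem, sessions and join are hypothetical names, and it assumes that each element opens a window [ts, ts + gap) and that windows which overlap or touch are merged into one session. Sessions are built over the elements of both streams together, and the input data in the usage note is invented.

```scala
object SessionJoinSketch {
  // Hypothetical element of a single key: event timestamp in ms and payload.
  final case class Elem(ts: Long, value: Int)

  // Splits elements into sessions: a new session starts whenever an element
  // arrives more than `gap` after the latest element of the current session.
  def sessions[T](elems: Seq[T], gap: Long)(ts: T => Long): List[List[T]] =
    elems.sortBy(ts).foldLeft(List.empty[List[T]]) {
      case (current :: done, e) if ts(e) <= ts(current.head) + gap =>
        (e :: current) :: done            // extend the current session
      case (acc, e) =>
        List(e) :: acc                    // open a new session
    }.map(_.reverse).reverse

  // Sessions are formed over both streams combined; within each session
  // every orange element is paired with every green element.
  def join(orange: Seq[Elem], green: Seq[Elem], gap: Long): List[String] = {
    val all = orange.map(e => (e, true)) ++ green.map(e => (e, false))
    sessions(all, gap)(_._1.ts).flatMap { session =>
      val os = session.collect { case (o, true) => o }
      val gs = session.collect { case (g, false) => g }
      for (o <- os; g <- gs) yield s"${o.value},${g.value}"
    }
  }
}
```

With a gap of 1, orange timestamps 1, 2, 5, 8, 9 and green timestamps 1, 4, 5 form three sessions. The last session contains only the orange elements 8 and 9, so it contributes no output.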

Interval Join

An interval join joins elements of two streams (call them A and B for now) that share a common key, where the elements of stream B have timestamps that lie in a relative time interval to the timestamps of elements in stream A.

This can also be expressed more formally as b.timestamp ∈ [a.timestamp + lowerBound; a.timestamp + upperBound] or a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound

Here a and b are elements of A and B that share a common key. Both the lower and the upper bound can be negative or positive, as long as the lower bound is always less than or equal to the upper bound. The interval join currently performs only inner joins. When a pair of elements is passed to the ProcessJoinFunction, it is assigned the larger of the two elements' timestamps (which can be accessed through ProcessJoinFunction.Context).

Note: the interval join currently supports only event time.

In this example, the two streams "orange" and "green" are joined with a lower bound of -2 milliseconds and an upper bound of +1 millisecond. By default these bounds are inclusive, but .lowerBoundExclusive() and .upperBoundExclusive() can be applied to change this behavior.

Using the more formal notation again, this translates to orangeElem.ts + lowerBound <= greenElem.ts <= orangeElem.ts + upperBound, as indicated by the triangles in the figure.

import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

...

val orangeStream: DataStream[Integer] = ...
val greenStream: DataStream[Integer] = ...

orangeStream
    .keyBy(elem => /* select key */)
    .intervalJoin(greenStream.keyBy(elem => /* select key */))
    .between(Time.milliseconds(-2), Time.milliseconds(1))
    .process(new ProcessJoinFunction[Integer, Integer, String] {
        override def processElement(left: Integer, right: Integer, ctx: ProcessJoinFunction[Integer, Integer, String]#Context, out: Collector[String]): Unit = {
            // emit every joined pair as "left,right"
            out.collect(left + "," + right)
        }
    })
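The interval condition and the timestamp rule can also be checked with a small plain-Scala model that needs no Flink dependency. This is a sketch under stated assumptions: Elem and intervalJoin are hypothetical names, timestamps are whole milliseconds (so an exclusive bound is modelled by shifting it by 1 ms), and the input data in the usage note is invented.

```scala
object IntervalJoinSketch {
  // Hypothetical element: join key, event timestamp in ms, payload.
  final case class Elem(key: String, ts: Long, value: Int)

  // Emits (timestamp, "a,b") for every pair with an equal key where
  // a.ts + lowerBound <= b.ts <= a.ts + upperBound (inclusive by default).
  // The emitted timestamp is the larger of the two element timestamps.
  def intervalJoin(
      a: Seq[Elem],
      b: Seq[Elem],
      lowerBound: Long,
      upperBound: Long,
      lowerExclusive: Boolean = false,
      upperExclusive: Boolean = false): Seq[(Long, String)] = {
    require(lowerBound <= upperBound, "lowerBound must not exceed upperBound")
    // With integer millisecond timestamps, excluding a bound equals moving it by 1.
    val lo = if (lowerExclusive) lowerBound + 1 else lowerBound
    val hi = if (upperExclusive) upperBound - 1 else upperBound
    for {
      l <- a
      r <- b
      if l.key == r.key && l.ts + lo <= r.ts && r.ts <= l.ts + hi
    } yield (math.max(l.ts, r.ts), s"${l.value},${r.value}")
  }
}
```

With bounds of -2 and +1, an orange element at timestamp 2 matches green elements at 0, 1 and 3 but not at 4. The pair with the green element at 3 is stamped with 3, the others with 2. Setting upperExclusive to true drops the pair with the green element at 3.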

Suggested Reading:

https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/operators/joining.html

https://github.com/perkinls/flink-local-train

https://yq.aliyun.com/users/ohyfzrwxmb3me?spm=a2c4e.11153940.0.0.763648d5EoX4bX

Origin blog.csdn.net/lp284558195/article/details/103997357