Flink Window Computation (Chapter 4)

Windows

Introduction

Window computation is at the core of stream processing. A window slices the stream into finite-size "buckets", and we can run computations over the finite data in each bucket.

Flink divides window computation into two broad categories, keyed-stream windows and non-keyed (DataStream) windows; the code structure is as follows:

Keyed Windows

stream
 .keyBy(...)                    <- keyed versus non-keyed windows
 .window(...)                   <- required: "window assigner"
 [.trigger(...)]                <- optional: "trigger" (else default trigger); decides when the window fires
 [.evictor(...)]                <- optional: "evictor" (else no evictor); removes elements from the window
 [.allowedLateness(...)]        <- optional: "lateness" (else zero); how long late elements are still accepted
 [.sideOutputLateData(...)]     <- optional: "output tag" (else no side output for late data)
 .reduce/aggregate/fold/apply() <- required: "window function"; the computation over the window's data
 [.getSideOutput(...)]          <- optional: "output tag"; retrieves the late data

Non-Keyed Windows

stream
 .windowAll(...)                <- required: "window assigner"
 [.trigger(...)]                <- optional: "trigger" (else default trigger); decides when the window fires
 [.evictor(...)]                <- optional: "evictor" (else no evictor); removes elements from the window
 [.allowedLateness(...)]        <- optional: "lateness" (else zero); how long late elements are still accepted
 [.sideOutputLateData(...)]     <- optional: "output tag" (else no side output for late data)
 .reduce/aggregate/fold/apply() <- required: "window function"; the computation over the window's data
 [.getSideOutput(...)]          <- optional: "output tag"; retrieves the late data

Window Lifecycle

A window is created as soon as the first element that belongs to it arrives. Once time (the watermark) passes the window's end time, the window is considered ready and the window function can be applied to its elements. Once the current time (watermark) passes the window's end time plus the allowed lateness, the window is removed. Only time-based windows have this lifecycle; Flink's global window is not time-based and therefore has none.

For example, with an event-time windowing strategy that creates non-overlapping (tumbling) windows of 5 minutes and an allowed lateness of 1 minute, Flink creates a new window for the interval between 12:00 and 12:05 when the first element with a timestamp in that interval arrives, and removes the window once the watermark passes 12:06.

Every window has a Trigger and a function bound to it. The function implements the computation over the window's contents, while the Trigger decides when the window is ready, since only ready windows have their function applied.

Beyond these settings you can also specify an Evictor, which can remove elements from the window after it becomes ready, before and/or after the function runs.

Keyed vs Non-Keyed Windows

Keyed Windows: at any given moment multiple window tasks may fire, one per distinct key.

Non-Keyed Windows: with no notion of key, only a single window task runs at any given moment.
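
Note that a non-keyed stream built with windowAll(...) is processed by a single task (parallelism 1), so it does not spread work across keys the way keyed windows do.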

Window Assigners

A window assigner defines how elements are assigned to windows. This is done by specifying a window assigner in window(...) / windowAll(...).

The window assigner is responsible for assigning each incoming element to one or more windows. Flink comes with predefined assigners: tumbling windows, sliding windows, session windows and global windows. Users can also implement custom windows by extending the WindowAssigner class. All windows except global windows are time-based TimeWindows; a time-based window has a start timestamp (inclusive) and an end timestamp (exclusive) that describe its extent.
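
For intuition, time windows are aligned to multiples of the window size counted from the epoch. A small sketch of how a tumbling window's bounds can be derived (it uses Flink's TimeWindow helper method; verify the signature against your Flink version):

import org.apache.flink.streaming.api.windowing.windows.TimeWindow

object WindowBoundsDemo {
    def main(args: Array[String]): Unit = {
        val windowSize = 5000L // a 5 s tumbling window
        val ts = 12345L        // an element timestamp in ms since the epoch
        // the start is the timestamp rounded down to a multiple of windowSize (offset 0)
        val start = TimeWindow.getWindowStartWithOffset(ts, 0L, windowSize)
        println(s"element $ts -> window [$start, ${start + windowSize})") // [10000, 15000)
    }
}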

Tumbling Windows

The tumbling window assigner assigns each element to a window of the specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window of size 5 minutes, the current window is evaluated and a new window starts every five minutes.


// imports shared by the examples in this chapter
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
val counts = text.flatMap(line=>line.split("\\s+"))
                .map(word=>(word,1))
                .keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .reduce((v1,v2)=>(v1._1,v1._2+v2._2))
                .print()
env.execute("Tumbling Window Stream WordCount")
Sliding Windows

The sliding window assigner assigns elements to windows of fixed length. As with the tumbling assigner, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a sliding window starts. If the slide is smaller than the window size, sliding windows overlap, in which case elements are assigned to multiple windows.

object FlinkProcessingTimeSlidingWindow {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
                .map(word=>(word,1))
                .keyBy(0)
                .window(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
                .aggregate(new UserDefineAggregateFunction)
                .print()
        // run the streaming job
        env.execute("Sliding Window Stream WordCount")
    }
}
class UserDefineAggregateFunction extends AggregateFunction[(String,Int),(String,Int),
                                                            (String,Int)]{
    override def createAccumulator(): (String, Int) = ("",0)
    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) =
    {
        (value._1,value._2+accumulator._2)
    }
    override def getResult(accumulator: (String, Int)): (String, Int) = accumulator
    override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
        (a._1,a._2+b._2)
    }
}
Session Windows

The session window assigner groups elements by sessions of activity. In contrast to tumbling and sliding windows, session windows do not overlap and have no fixed start and end time. Instead, a session window closes when it has not received elements for a certain period of time, i.e. when a gap of inactivity occurs.


object FlinkProcessingTimeSessionWindow {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
        .map(word=>(word,1))
        // use a key selector here: keying by tuple index would make the key type Tuple, so the WindowFunction below would not match
        .keyBy(t=>t._1)
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
        .apply(new UserDefineWindowFunction)
        .print()
        // run the streaming job
        env.execute("Session Window Stream WordCount")
    }
}
class UserDefineWindowFunction extends WindowFunction[(String,Int),
                                                      (String,Int),String,TimeWindow]{
    override def apply(key: String,
                       window: TimeWindow,
                       input: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
        val sdf = new SimpleDateFormat("HH:mm:ss")
        var start=sdf.format(window.getStart)
        var end=sdf.format(window.getEnd)
        var sum = input.map(_._2).sum
        out.collect((s"${key}\t${start}~${end}",sum))
    }
}
Global Windows

The global window assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger; otherwise no computation is ever performed, because the global window has no natural end at which to process the aggregated elements.

object FlinkGlobalWindow {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
        .map(word=>(word,1))
        // note: a key selector lets the key type be inferred as String
        .keyBy(t=>t._1)
        .window(GlobalWindows.create())
        // custom trigger: fire once a key has occurred four times
        .trigger(CountTrigger.of(4))
        .apply(new UserDefineGlobalWindowFunction)
        .print()
        // run the streaming job
        env.execute("Global Window Stream WordCount")
    }
}
class UserDefineGlobalWindowFunction extends WindowFunction[(String,Int),
                                                            (String,Int),String,GlobalWindow]{
    override def apply(key: String,
                       window: GlobalWindow,
                       input: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
        var sum = input.map(_._2).sum
        out.collect((s"${key}",sum))
    }
}

Window Functions

After defining the window assigner, we need to specify the computation to be performed on each window. This is the responsibility of the window function, which processes the elements of each window once the system determines that the window is ready.

The window function can be one of ReduceFunction, AggregateFunction, FoldFunction, ProcessWindowFunction, or the legacy WindowFunction. ReduceFunction and AggregateFunction run more efficiently than ProcessWindowFunction because they compute incrementally: the system invokes them as soon as data reaches the window. A ProcessWindowFunction instead buffers incoming data until the window fires and then processes the window's elements in one batch, but in exchange it can access the window's metadata.

This can be mitigated by combining a ProcessWindowFunction with a ReduceFunction, AggregateFunction or FoldFunction, giving you both incremental aggregation of the window elements and the extra window metadata the ProcessWindowFunction receives.

ReduceFunction
class UserDefineReduceFunction extends ReduceFunction[(String,Int)]{
    override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
        println("reduce:"+v1+"\t"+v2)
        (v1._1,v2._2+v1._2)
    }
}
object FlinkProcessingTimeTumblingWindowWithReduceFunction {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
                        .map(word=>(word,1))
                        .keyBy(0)
                        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                        .reduce(new UserDefineReduceFunction)
                        .print()
        // run the streaming job
        env.execute("Tumbling Window Stream WordCount")
    }
}
AggregateFunction
class UserDefineAggregateFunction extends AggregateFunction[(String,Int),(String,Int),
                                                            (String,Int)]{
    override def createAccumulator(): (String, Int) = ("",0)
    override def add(value: (String, Int), accumulator: (String, Int)): (String, Int) =
    {
        println("add:"+value+"\t"+accumulator)
        (value._1,value._2+accumulator._2)
    }
    override def getResult(accumulator: (String, Int)): (String, Int) = accumulator
    override def merge(a: (String, Int), b: (String, Int)): (String, Int) = {
        println("merge:"+a+"\t"+b)
        (a._1,a._2+b._2)
    }
}
object FlinkProcessingTimeTumblingWindowWithAggregateFunction {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
                        .map(word=>(word,1))
                        .keyBy(0)
                        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                        .aggregate(new UserDefineAggregateFunction)
                        .print()
        // run the streaming job
        env.execute("Tumbling Window Stream WordCount")
    }
}
FoldFunction
class UserDefineFoldFunction extends FoldFunction[(String,Int),(String,Int)]{
    override def fold(accumulator: (String, Int), value: (String, Int)): (String, Int) =
    {
        println("fold:"+accumulator+"\t"+value)
        (value._1,accumulator._2+value._2)
    }
}
val env = StreamExecutionEnvironment.getExecutionEnvironment
val text = env.socketTextStream("CentOS", 9999)
// apply the DataStream transformations
val counts = text.flatMap(line=>line.split("\\s+"))
                .map(word=>(word,1))
                .keyBy(0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                .fold(("",0),new UserDefineFoldFunction)
                .print()
// run the streaming job
env.execute("Tumbling Window Stream WordCount")

Note: FoldFunction cannot be used with session windows (a fold cannot merge windows). In newer Flink versions it is deprecated in favor of AggregateFunction.

ProcessWindowFunction
class UserDefineProcessWindowFunction extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow]{
    val sdf=new SimpleDateFormat("HH:mm:ss")
    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {
        val w = context.window // window metadata
        val start =sdf.format(w.getStart)
        val end = sdf.format(w.getEnd)
        val total=elements.map(_._2).sum
        out.collect((key+"\t["+start+"~"+end+"]",total))
    }
}
object FlinkProcessingTimeTumblingWindowWithProcessWindowFunction {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
                        .map(word=>(word,1))
                        .keyBy(t=>t._1)
                        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                        .process(new UserDefineProcessWindowFunction)
                        .print()
        // run the streaming job
        env.execute("Tumbling Window Stream WordCount")
    }
}
ProcessWindowFunction & Reduce/Aggregate/Fold
class UserDefineProcessWindowFunction2 extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow]{
    val sdf=new SimpleDateFormat("HH:mm:ss")
    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {
        val w = context.window // window metadata
        val start =sdf.format(w.getStart)
        val end = sdf.format(w.getEnd)
        // with pre-aggregation, elements holds the single pre-reduced value
        val list = elements.toList
        println("list:"+list)
        
        val total=list.map(_._2).sum
        out.collect((key+"\t["+start+"~"+end+"]",total))
    }
}
class UserDefineReduceFunction2 extends ReduceFunction[(String,Int)]{
    override def reduce(v1: (String, Int), v2: (String, Int)): (String, Int) = {
        println("reduce:"+v1+"\t"+v2)
        (v1._1,v2._2+v1._2)
    }
}
object FlinkProcessingTimeTumblingWindowWithReduceFucntionAndProcessWindowFunction {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
        .map(word=>(word,1))
        .keyBy(t=>t._1)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .reduce(new UserDefineReduceFunction2,new UserDefineProcessWindowFunction2)
        .print()
        // run the streaming job
        env.execute("Tumbling Window Stream WordCount")
    }
}

Per-window State in ProcessWindowFunction

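A ProcessWindowFunction's Context exposes two kinds of keyed state: context.windowState, which is scoped to the current key and window, and context.globalState, which is shared by all windows of the same key. The example below uses both to maintain a per-window count and a global per-key count.
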
class UserDefineProcessWindowFunction3 extends ProcessWindowFunction[(String,Int),(String,Int),String,TimeWindow]{
    val sdf=new SimpleDateFormat("HH:mm:ss")
    var wvsd:ValueStateDescriptor[Int]=_
    var gvsd:ValueStateDescriptor[Int]=_
    override def open(parameters: Configuration): Unit = {
        wvsd=new ValueStateDescriptor[Int]("ws",createTypeInformation[Int])
        gvsd=new ValueStateDescriptor[Int]("gs",createTypeInformation[Int])
    }
    override def process(key: String,
                         context: Context,
                         elements: Iterable[(String, Int)],
                         out: Collector[(String, Int)]): Unit = {
        val w = context.window // window metadata
        val start =sdf.format(w.getStart)
        val end = sdf.format(w.getEnd)
        val list = elements.toList
        //println("list:"+list)
        val total=list.map(_._2).sum
        var wvs:ValueState[Int]=context.windowState.getState(wvsd)
        var gvs:ValueState[Int]=context.globalState.getState(gvsd)
        wvs.update(wvs.value()+total)
        gvs.update(gvs.value()+total)
        println("Window Count:"+wvs.value()+"\t"+"Global Count:"+gvs.value())
        out.collect((key+"\t["+start+"~"+end+"]",total))
    }
}
object FlinkProcessingTimeTumblingWindowWithProcessWindowFunctionState {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
                    .map(word=>(word,1))
                    .keyBy(t=>t._1)
                    .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
                    .process(new UserDefineProcessWindowFunction3)
                    .print()
        // run the streaming job
        env.execute("Tumbling Window Stream WordCount")
    }
}
WindowFunction (Legacy)

In some places where a ProcessWindowFunction can be used you can also use a WindowFunction. This is an older version of ProcessWindowFunction that provides less contextual information and lacks some advanced features, such as per-window keyed state.

class UserDefineWindowFunction extends WindowFunction[(String,Int), (String,Int),String,TimeWindow]{
    override def apply(key: String,
                       window: TimeWindow,
                       input: Iterable[(String, Int)],
                       out: Collector[(String, Int)]): Unit = {
        val sdf = new SimpleDateFormat("HH:mm:ss")
        var start=sdf.format(window.getStart)
        var end=sdf.format(window.getEnd)
        var sum = input.map(_._2).sum
        out.collect((s"${key}\t${start}~${end}",sum))
    }
}
object FlinkProcessingTimeSessionWindowWithWindowFunction {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
                        .map(word=>(word,1))
                        .keyBy(t=>t._1)
                        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(5)))
                        .apply(new UserDefineWindowFunction)
                        .print()
        // run the streaming job
        env.execute("Session Window Stream WordCount")
    }
}

Trigger

The Trigger decides when a window is ready; only a ready window is processed by the window function. Every WindowAssigner comes with a default Trigger. If the default trigger does not meet your needs, you can define a custom one.

| Window type | Default trigger | Fires when |
| --- | --- | --- |
| EventTime (Tumbling/Sliding/Session) | EventTimeTrigger | the watermark passes the window's end time; the window is then ready |
| ProcessingTime (Tumbling/Sliding/Session) | ProcessingTimeTrigger | the processing node's system clock passes the window's end time |
| GlobalWindow | NeverTrigger | never fires |

The Trigger interface has five methods that allow a trigger to react to different events:

public abstract class Trigger<T, W extends Window> implements Serializable {
 /**
  * Called for every element that falls into the window.
  * @param element the element that arrived
  * @param timestamp the element's arrival timestamp
  * @param window the window the element belongs to
  * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime) callbacks
  */
 public abstract TriggerResult onElement(T element, long timestamp, W window, TriggerContext ctx) throws Exception;

 /**
  * Callback for a processing-time timer.
  * @param time the time at which the timer fired
  * @param window the window the timer fired for
  * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime) callbacks
  */
 public abstract TriggerResult onProcessingTime(long time, W window, TriggerContext ctx) throws Exception;

 /**
  * Callback for an event-time timer.
  * @param time the time at which the timer fired
  * @param window the window the timer fired for
  * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime) callbacks
  */
 public abstract TriggerResult onEventTime(long time, W window, TriggerContext ctx) throws Exception;

 /**
  * Called when several windows merge into one, e.g. for session windows, see
  * {@link org.apache.flink.streaming.api.windowing.assigners.WindowAssigner}.
  * @param window the new, merged window
  * @param ctx a context object, typically used to register timer (ProcessingTime/EventTime) callbacks and access state
  */
 public void onMerge(W window, OnMergeContext ctx) throws Exception {
  throw new UnsupportedOperationException("This trigger does not support merging.");
 }

 /**
  * Performs any cleanup needed when the window is removed, e.g. clearing timers or state.
  */
 public abstract void clear(W window, TriggerContext ctx) throws Exception;
}

Two things are worth noting about these methods:
1) The first three decide whether the window is ready by returning a TriggerResult.

public enum TriggerResult {
 /**
  * No action: neither fire the window nor purge its elements.
  */
    CONTINUE(false, false),
 /**
  * Fire the window, then purge its elements.
  */
    FIRE_AND_PURGE(true, true),
 /**
  * Fire the window but keep its elements.
  */
    FIRE(true, false),
 /**
  * Do not fire: discard the window and purge its elements.
  */
    PURGE(false, true);
    private final boolean fire;  // whether to apply the window function
    private final boolean purge; // whether to purge the window's elements
    ...
}

2) Any of these methods can be used to register processing-time or event-time timers for future actions.

Example

class UserDefineCountTrigger(maxCount:Long) extends Trigger[String,TimeWindow]{
    var rsd:ReducingStateDescriptor[Long]= new ReducingStateDescriptor[Long]("rsd",
        new ReduceFunction[Long] {
            override def reduce(value1: Long, value2: Long): Long = value1+value2
        },createTypeInformation[Long])
    override def onElement(element: String, timestamp: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult = {
        val state: ReducingState[Long] = ctx.getPartitionedState(rsd)
        state.add(1L)
        if(state.get() >= maxCount){
            state.clear()
            return TriggerResult.FIRE_AND_PURGE
        }else{
            return TriggerResult.CONTINUE
        }
    }
    override def onProcessingTime(time: Long, window: TimeWindow, ctx:
                                  Trigger.TriggerContext): TriggerResult = TriggerResult.CONTINUE
    override def onEventTime(time: Long, window: TimeWindow, ctx:
                             Trigger.TriggerContext): TriggerResult =TriggerResult.CONTINUE
    override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit = {
        println("==clear==")
        ctx.getPartitionedState(rsd).clear()
    }
}
object FlinkTumblingWindowWithCountTrigger {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts = text.flatMap(line=>line.split("\\s+"))
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        .trigger(new UserDefineCountTrigger(4L))
        // count each word's occurrences in the fired pane
        .apply((window: TimeWindow, input: Iterable[String], out: Collector[(String,Int)]) => {
            input.groupBy(identity).foreach(t => out.collect((t._1, t._2.size)))
        })
        .print()
        // run the streaming job
        env.execute("Global Window Stream WordCount")
    }
}
DeltaTrigger
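
The original post leaves this section empty. DeltaTrigger is another built-in trigger: it fires when a user-supplied DeltaFunction, applied to the last element that caused a fire and the current element, returns a delta above a given threshold. A hedged sketch (the (sensor, reading) input and the port are made up for illustration; check the signature of DeltaTrigger.of against your Flink version):

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows
import org.apache.flink.streaming.api.windowing.triggers.DeltaTrigger
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow
import org.apache.flink.streaming.api.functions.windowing.delta.DeltaFunction

object FlinkDeltaTriggerSketch {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // input: sensorId reading
        env.socketTextStream("CentOS", 9999)
        .map(line => { val ts = line.split("\\s+"); (ts(0), ts(1).toDouble) })
        .keyBy(t => t._1)
        .window(GlobalWindows.create())
        // fire once the reading has drifted by more than 10.0 since the last fire
        .trigger(DeltaTrigger.of[(String, Double), GlobalWindow](10.0,
            new DeltaFunction[(String, Double)] {
                override def getDelta(oldPoint: (String, Double), newPoint: (String, Double)): Double =
                    newPoint._2 - oldPoint._2
            },
            createTypeInformation[(String, Double)].createSerializer(env.getConfig)))
        .maxBy(1)
        .print()
        env.execute("Delta Trigger Sketch")
    }
}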

Evictors(剔除器)

Besides the WindowAssigner and the Trigger, Flink's windowing model allows an optional Evictor, specified with the evictor(...) method. The evictor can remove elements from the window after the trigger fires, before and/or after the window function is applied.

The Evictor interface:

public interface Evictor<T, W extends Window> extends Serializable {
    /**
     * Called before the windowing function is applied.
     *
     * @param elements the elements currently in the window
     * @param size the current number of elements in the window
     * @param window The {@link Window}
     * @param evictorContext the context for the Evictor
     */
    void evictBefore(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);

    /**
     * Called after the windowing function has been applied.
     *
     * @param elements the elements currently in the window
     * @param size the current number of elements in the window
     * @param window The {@link Window}
     * @param evictorContext the context for the Evictor
     */
    void evictAfter(Iterable<TimestampedValue<T>> elements, int size, W window, EvictorContext evictorContext);
}

evictBefore() contains the eviction logic applied before the window function, while evictAfter() contains the logic applied after it. Elements evicted before the window function runs are not processed by it.

Flink comes with three pre-implemented evictors:

CountEvictor

Keeps a user-specified number of elements in the window and discards the rest from the beginning of the window buffer.

private void evict(Iterable<TimestampedValue<Object>> elements, int size,
                   EvictorContext ctx) {
    if (size <= maxCount) {
        return;
    } else {
        int evictedCount = 0;        // number of elements inspected for eviction
        for (Iterator<TimestampedValue<Object>> iterator = elements.iterator();
             iterator.hasNext();){
            iterator.next();
            evictedCount++;          // evict from the head of the window buffer
            if (evictedCount > size - maxCount) {  // keep the last maxCount elements
                break;
            } else {
                iterator.remove();
            }
        }
    }
}
DeltaEvictor

Takes a DeltaFunction and a threshold, computes the delta between the last element in the window buffer and each of the remaining elements, and removes every element whose delta is greater than or equal to the threshold.

private void evict(Iterable<TimestampedValue<T>> elements, int size, EvictorContext ctx) {
    TimestampedValue<T> lastElement = Iterables.getLast(elements);
    for (Iterator<TimestampedValue<T>> iterator = elements.iterator();
         iterator.hasNext();){
        TimestampedValue<T> element = iterator.next();
        // remove elements whose delta from the last element reaches the threshold
        if(deltaFunction.getDelta(element.getValue(), lastElement.getValue()) >=this.threshold){
            iterator.remove();
        }
    }
}
TimeEvictor

Takes an interval in milliseconds as an argument. For a given window it finds the maximum timestamp max_ts among its elements and removes all elements with a timestamp smaller than max_ts - interval, i.e. it keeps only the most recent interval of data.

private void evict(Iterable<TimestampedValue<Object>> elements, int size,EvictorContext ctx) {
    if (!hasTimestamp(elements)) {
        return;
    }
    // find the maximum timestamp in the window
    long currentTime = getMaxTimestamp(elements);
    long evictCutoff = currentTime - windowSize;
    for(Iterator<TimestampedValue<Object>> iterator = elements.iterator();iterator.hasNext();){
        TimestampedValue<Object> record = iterator.next();
        if (record.getTimestamp() <= evictCutoff) {
            iterator.remove();
        }
    }
}
private boolean hasTimestamp(Iterable<TimestampedValue<Object>> elements) {
    Iterator<TimestampedValue<Object>> it = elements.iterator();
    if (it.hasNext()) {
        return it.next().hasTimestamp();
    }
    return false;
}
private long getMaxTimestamp(Iterable<TimestampedValue<Object>> elements) {
    long currentTime = Long.MIN_VALUE;
    for (Iterator<TimestampedValue<Object>> iterator = elements.iterator();iterator.hasNext();){
        TimestampedValue<Object> record = iterator.next();
        currentTime = Math.max(currentTime, record.getTimestamp());
    }
    return currentTime;
}
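In everyday use you rarely implement these evictors yourself; the built-in ones are attached via evictor(...). A hedged usage sketch (windowedStream stands for any keyed windowed stream; factory methods as found in Flink 1.x):

import org.apache.flink.streaming.api.windowing.evictors.{CountEvictor, TimeEvictor}
import org.apache.flink.streaming.api.windowing.time.Time

// keep at most the last 100 elements of each window
windowedStream.evictor(CountEvictor.of(100))
// keep only elements within 1 second of the window's maximum timestamp
windowedStream.evictor(TimeEvictor.of(Time.seconds(1)))
// DeltaEvictor.of(threshold, deltaFunction) attaches a delta-based evictor analogously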
UserDefineEvictor
public class UserDefineEvictor implements Evictor<String, TimeWindow> {
    private Boolean isEvictorAfter=false;
    private String excludeContent=null;
    public UserDefineEvictor(Boolean isEvictorAfter, String excludeContent) {
        this.isEvictorAfter = isEvictorAfter;
        this.excludeContent = excludeContent;
    }
    @Override
    public void evictBefore(Iterable<TimestampedValue<String>> elements, int size,
                            TimeWindow window, EvictorContext evictorContext) {
        if(!isEvictorAfter){
            evict(elements,size,window,evictorContext);
        }
    }
    @Override
    public void evictAfter(Iterable<TimestampedValue<String>> elements, int size,
                           TimeWindow window, EvictorContext evictorContext) {
        if(isEvictorAfter){
            evict(elements,size,window,evictorContext);
        }
    }
    private void evict(Iterable<TimestampedValue<String>> elements, int size,
                       TimeWindow window, EvictorContext evictorContext){
        for( Iterator<TimestampedValue<String>> iterator =
            elements.iterator();iterator.hasNext();){
            TimestampedValue<String> element = iterator.next();
            // remove elements that contain the excluded content
            System.out.println(element.getValue());
            if(element.getValue().contains(excludeContent)){
                iterator.remove();
            }
        }
    }
}
object FlinkSlidingWindowWithUserDefineEvictor {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val text = env.socketTextStream("CentOS", 9999)
        // apply the DataStream transformations
        val counts =
        text.windowAll(SlidingProcessingTimeWindows.of(Time.seconds(4),Time.seconds(2)))
        .evictor(new UserDefineEvictor(false,"error"))
        // join the window's elements with a | separator
        .apply(new UserDefineSlidingWindowFunction)
        .print()
        // run the streaming job
        env.execute("Sliding Window Stream WordCount")
    }
}
class UserDefineSlidingWindowFunction extends AllWindowFunction[String,String,TimeWindow]{
    override def apply(window: TimeWindow,
                       input: Iterable[String],
                       out: Collector[String]): Unit = {
        val sdf = new SimpleDateFormat("HH:mm:ss")
        var start=sdf.format(window.getStart)
        var end=sdf.format(window.getEnd)
        var windowContent=input.toList
        println("window:"+start+"\t"+end+" "+windowContent.mkString(" | "))
    }
}

EventTime Window

Flink streaming supports several notions of time: ProcessingTime, EventTime and IngestionTime.

If nothing is configured, Flink operators default to ProcessingTime. IngestionTime is similar to ProcessingTime in that both are generated automatically by the system; the difference is that IngestionTime is assigned by the data source while ProcessingTime is assigned by the computing operator. Neither strategy can properly express when an event actually occurred in a stream (consider network transfer delays).

Flink supports window computation with EventTime semantics and uses the watermark mechanism to measure the progress of event time.

Watermarks flow as part of the data stream. A watermark carries a timestamp t and asserts that no more elements with event time t' <= t will appear in the stream.

watermark = max event time seen by the processing node - max allowed out-of-orderness

That is, the watermark is the largest event time seen so far minus the maximum amount of time elements are allowed to arrive out of order.
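
Worked example: with a maximum allowed out-of-orderness of 2 s, after the node has seen an event stamped 12:00:10 the watermark stands at 12:00:08, so every window whose end time is at or before 12:00:08, e.g. [12:00:00, 12:00:06), is considered ready.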

Watermark

Common ways to compute watermarks in Flink:

① Periodic watermark computation (recommended):

class UserDefineAssignerWithPeriodicWatermarks extends AssignerWithPeriodicWatermarks[(String,Long)] {
    var maxAllowOrderness=2000L
    var maxSeenEventTime= 0L // do not use Long.MinValue (subtraction would underflow)
    var sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    // called periodically by the system to compute the current watermark
    override def getCurrentWatermark: Watermark = {
        new Watermark(maxSeenEventTime-maxAllowOrderness)
    }
    // called for every element: extracts its event time and tracks the max seen event time
    override def extractTimestamp(element: (String, Long), 
                                  previousElementTimestamp:Long): Long = {
        // always keep the maximum event time seen
        maxSeenEventTime=Math.max(maxSeenEventTime,element._2)
        println("ET:"+(element._1,sdf.format(element._2))+
                "WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
        element._2
    }
}

② Per-event watermark computation (not recommended): the watermark is recomputed for every arriving element.

class UserDefineAssignerWithPunctuatedWatermarks extends
AssignerWithPunctuatedWatermarks[(String,Long)] {
    var maxAllowOrderness=2000L
    var maxSeenEventTime= Long.MinValue
    var sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    // called once for every record received
    override def checkAndGetNextWatermark(lastElement: (String, Long),
                                          extractedTimestamp: Long): Watermark = {
        maxSeenEventTime=Math.max(maxSeenEventTime,lastElement._2)
        println("ET:"+(lastElement._1,sdf.format(lastElement._2))+
                "WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
        new Watermark(maxSeenEventTime-maxAllowOrderness)
    }
    override def extractTimestamp(element: (String, Long), previousElementTimestamp:
                                  Long): Long = {
        // return the element's event time
        element._2
    }
}

Test Case

object FlinkEventTimeTumblingWindow {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1) // parallelism 1 for testing; multiple parallel watermarks would interfere with the test
        // the default time characteristic is ProcessingTime; switch to EventTime
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // with periodic watermarks, set how often they are computed (1 s)
        //env.getConfig.setAutoWatermarkInterval(1000)
        // input: word timestamp
        env.socketTextStream("CentOS", 9999)
        .map(line=>line.split("\\s+"))
        // event-time processing requires a time field in each element
        .map(ts=>(ts(0),ts(1).toLong))
        // choose one of the two assigners above (the periodic one is recommended)
        .assignTimestampsAndWatermarks(new UserDefineAssignerWithPunctuatedWatermarks)
        .windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
        .apply(new UserDefineAllWindowFucntion)
        .print("output")
        env.execute("Tumbling Event Time Window Stream")
    }
}
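
The event-time examples reference a UserDefineAllWindowFucntion whose definition the post omits (the spelling is kept so the references resolve); a minimal sketch consistent with how it is used:

class UserDefineAllWindowFucntion extends AllWindowFunction[(String,Long),String,TimeWindow]{
    override def apply(window: TimeWindow,
                       input: Iterable[(String, Long)],
                       out: Collector[String]): Unit = {
        val sdf = new SimpleDateFormat("HH:mm:ss")
        val start = sdf.format(window.getStart)
        val end = sdf.format(window.getEnd)
        // emit the window bounds together with the words that fell into the window
        out.collect(s"[$start~$end] " + input.map(_._1).mkString(","))
    }
}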

Note: when a stream carries multiple watermarks (parallel instances), the minimum of them is taken as the watermark. (This matters when running with higher parallelism.)

Late Data

In Flink, once the watermark passes a window's end time, any element that subsequently falls into that already-submerged window is considered late. Spark has no way to process such data at all; in Flink, users can define an allowed lateness t' for window elements.

If the watermark time t < window end time t'' + allowed lateness t', a late element can still participate in the window computation.
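
For example, with 2 s tumbling windows and allowedLateness(Time.seconds(2)) as in the code below, an element belonging to [12:00:00, 12:00:02) is still evaluated as long as the watermark has not reached 12:00:04.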

object FlinkEventTimeTumblingWindowLateData {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1) // parallelism 1 for easier testing
        // the default time characteristic is ProcessingTime; switch to EventTime
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // with periodic watermarks, set how often they are computed (1 s)
        // env.getConfig.setAutoWatermarkInterval(1000)
        // input: word timestamp
        env.socketTextStream("CentOS", 9999)
        .map(line=>line.split("\\s+"))
        .map(ts=>(ts(0),ts(1).toLong))
        .assignTimestampsAndWatermarks(new UserDefineAssignerWithPunctuatedWatermarks)
        .windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
        .allowedLateness(Time.seconds(2))
        .apply(new UserDefineAllWindowFucntion)
        .print("output")
        env.execute("Tumbling Event Time Window Stream")
    }
}

If the watermark time t >= window end time t'' + allowed lateness t', Flink drops the element by default. Users can still obtain this too-late data via a side output, so they know which late records did not take part in the normal computation.

object FlinkEventTimeTumblingWindowTooLateData {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1) // parallelism 1 for easier testing
        // the default time characteristic is ProcessingTime; switch to EventTime
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // with periodic watermarks, set how often they are computed (1 s)
        // env.getConfig.setAutoWatermarkInterval(1000)
        val lateTag= new OutputTag[(String,Long)]("late")
        // input: word timestamp
        var result=env.socketTextStream("CentOS", 9999)
        .map(line=>line.split("\\s+"))
        .map(ts=>(ts(0),ts(1).toLong))
        .assignTimestampsAndWatermarks(new UserDefineAssignerWithPunctuatedWatermarks)
        .windowAll(TumblingEventTimeWindows.of(Time.seconds(2)))
        .allowedLateness(Time.seconds(2))
        .sideOutputLateData(lateTag)
        .apply(new UserDefineAllWindowFucntion)
        result.print("on time")
        result.getSideOutput(lateTag).printToErr("too late")
        env.execute("Tumbling Event Time Window Stream")
    }
}

Joining

Window Join (Overview)

A window join joins the elements of two streams that share a common key and lie in the same window. The windows are defined with a window assigner and evaluated on elements from both streams. The elements from both sides are then passed to a user-defined JoinFunction or FlatJoinFunction, where the user can emit results that satisfy the join criteria.

stream.join(otherStream)
 .where(<KeySelector>)
 .equalTo(<KeySelector>)
 .window(<WindowAssigner>)
 .apply(<JoinFunction>)

Note

Creating pairwise combinations of elements of the two streams behaves like an inner join: an element of one stream with no corresponding element in the other stream for that window is not emitted.
Elements that do get joined take as their timestamp the largest timestamp that still lies within the respective window; for example, a window with bounds [5, 10) yields joined elements with timestamp 9.

Tumbling Window Join

When performing a tumbling window join, all elements with a common key and a common tumbling window are joined as pairwise combinations and passed to the JoinFunction or FlatJoinFunction. Because this behaves like an inner join, an element of one stream that has no element from the other stream in its tumbling window is not emitted.

class TumblingAssignerWithPeriodicWatermarks extends
AssignerWithPeriodicWatermarks[(String,String,Long)] {
    var maxAllowOrderness=2000L
    var maxSeenEventTime= 0L // do not use Long.MinValue (subtraction would underflow)
    var sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    // called periodically by the system to compute the current watermark
    override def getCurrentWatermark: Watermark = {
        new Watermark(maxSeenEventTime-maxAllowOrderness)
    }
    // extracts the event time and updates the max seen event time
    override def extractTimestamp(element: (String,String, Long),
                                  previousElementTimestamp: Long): Long = {
        // always keep the maximum event time seen
        maxSeenEventTime=Math.max(maxSeenEventTime,element._3)
        println("ET:"+(element._1,element._2,sdf.format(element._3))+
                " WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
        element._3
    }
}
object FlinkTumblingWindowJoin {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // input: 001 zhangsan timestamp
        val stream1 = env.socketTextStream("CentOS", 9999)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new TumblingAssignerWithPeriodicWatermarks)
        // input: apple 001 timestamp
        val stream2 = env.socketTextStream("CentOS", 8888)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new TumblingAssignerWithPeriodicWatermarks)
        stream1.join(stream2)
        .where(t=>t._1)    // stream1 user id
        .equalTo(t=> t._2) // stream2 user id
        .window(TumblingEventTimeWindows.of(Time.seconds(2)))
        .apply((s1,s2,out:Collector[(String,String,String)])=>{
            out.collect((s1._1,s1._2,s2._1))
        })
        .print("join result")
        env.execute("FlinkTumblingWindowJoin")
    }
}
Sliding Window Join

When performing a sliding window join, all elements with a common key and a common sliding window are joined as pairwise combinations and passed to the JoinFunction or FlatJoinFunction. Elements of one stream that have no element from the other stream in the current sliding window are not emitted. Note that some elements might be joined in one sliding window but not in another.


class SlidlingAssignerWithPeriodicWatermarks extends
AssignerWithPeriodicWatermarks[(String,String,Long)] {
    var maxAllowOrderness=2000L
    var maxSeenEventTime= 0L // do not use Long.MinValue (subtraction would underflow)
    var sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    // called periodically by the system to compute the current watermark
    override def getCurrentWatermark: Watermark = {
        new Watermark(maxSeenEventTime-maxAllowOrderness)
    }
    // extracts the event time and updates the max seen event time
    override def extractTimestamp(element: (String,String, Long),
                                  previousElementTimestamp: Long): Long = {
        // always keep the maximum event time seen
        maxSeenEventTime=Math.max(maxSeenEventTime,element._3)
        println("ET:"+(element._1,element._2,sdf.format(element._3))+
                " WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
        element._3
    }
}
object FlinkSlidlingWindowJoin {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // input: 001 zhangsan timestamp
        val stream1 = env.socketTextStream("CentOS", 9999)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new SlidlingAssignerWithPeriodicWatermarks)
        // input: apple 001 timestamp
        val stream2 = env.socketTextStream("CentOS", 8888)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new SlidlingAssignerWithPeriodicWatermarks)
        stream1.join(stream2)
        .where(t=>t._1)    // stream1 user id
        .equalTo(t=> t._2) // stream2 user id
        .window(SlidingEventTimeWindows.of(Time.seconds(4),Time.seconds(2)))
        .apply((s1,s2,out:Collector[(String,String,String)])=>{
            out.collect((s1._1,s1._2,s2._1))
        })
        .print("join result")
        env.execute("FlinkSlidlingWindowJoin")
    }
}
Session Window Join

When performing a session window join, all elements with the same key that, when "combined", fulfill the session criterion are joined in pairwise combinations and passed to the JoinFunction or FlatJoinFunction. Again this performs an inner join, so if there is a session window that contains elements from only one stream, no output is emitted.


class SessionAssignerWithPeriodicWatermarks extends
AssignerWithPeriodicWatermarks[(String,String,Long)] {
    var maxAllowOrderness=2000L
    var maxSeenEventTime= 0L // do not use Long.MinValue (subtraction would underflow)
    var sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    // called periodically by the system to compute the current watermark
    override def getCurrentWatermark: Watermark = {
        new Watermark(maxSeenEventTime-maxAllowOrderness)
    }
    // extracts the event time and updates the max seen event time
    override def extractTimestamp(element: (String,String, Long),
                                  previousElementTimestamp: Long): Long = {
        // always keep the maximum event time seen
        maxSeenEventTime=Math.max(maxSeenEventTime,element._3)
        println("ET:"+(element._1,element._2,sdf.format(element._3))+
                " WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
        element._3
    }
}
object FlinkSessionWindowJoin {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // input: 001 zhangsan timestamp
        val stream1 = env.socketTextStream("CentOS", 9999)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new SessionAssignerWithPeriodicWatermarks)
        // input: apple 001 timestamp
        val stream2 = env.socketTextStream("CentOS", 8888)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new SessionAssignerWithPeriodicWatermarks)
        stream1.join(stream2)
        .where(t=>t._1)    // stream1 user id
        .equalTo(t=> t._2) // stream2 user id
        .window(EventTimeSessionWindows.withGap(Time.seconds(2)))
        .apply((s1,s2,out:Collector[(String,String,String)])=>{
            out.collect((s1._1,s1._2,s2._1))
        })
        .print("join result")
        env.execute("FlinkSessionWindowJoin")
    }
}
Interval Join

The interval join joins the elements of two streams (call them A and B) with a common key, where an element of B has a timestamp that lies within an interval relative to the timestamp of an element of A.

Join condition:

b.timestamp ∈ [a.timestamp + lowerBound; a.timestamp + upperBound]

or equivalently:

a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound
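
For example, with between(Time.seconds(0), Time.seconds(2)) as in the code below, an A element stamped 12:00:02 joins every same-key B element stamped from 12:00:02 up to and including 12:00:04.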


class IntervaAssignerWithPeriodicWatermarks extends
AssignerWithPeriodicWatermarks[(String,String,Long)] {
    var maxAllowOrderness=2000L
    var maxSeenEventTime= 0L // do not use Long.MinValue (subtraction would underflow)
    var sdf=new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    // called periodically by the system to compute the current watermark
    override def getCurrentWatermark: Watermark = {
        new Watermark(maxSeenEventTime-maxAllowOrderness)
    }
    // extracts the event time and updates the max seen event time
    override def extractTimestamp(element: (String,String, Long),
                                  previousElementTimestamp: Long): Long = {
        // always keep the maximum event time seen
        maxSeenEventTime=Math.max(maxSeenEventTime,element._3)
        println("ET:"+(element._1,element._2,sdf.format(element._3))+
                " WM:"+sdf.format(maxSeenEventTime-maxAllowOrderness))
        element._3
    }
}
object FlinkIntervalJoin {
    def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setParallelism(1)
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
        // input: 001 zhangsan timestamp
        val stream1 = env.socketTextStream("CentOS", 9999)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new IntervaAssignerWithPeriodicWatermarks)
        .keyBy(t=>t._1)
        // input: apple 001 timestamp
        val stream2 = env.socketTextStream("CentOS", 8888)
        .map(_.split("\\s+"))
        .map(ts=>(ts(0),ts(1),ts(2).toLong))
        .assignTimestampsAndWatermarks(new IntervaAssignerWithPeriodicWatermarks)
        .keyBy(t=>t._2)
        stream1.intervalJoin(stream2)
        .between(Time.seconds(0),Time.seconds(2)) // bounds are inclusive by default
        //.lowerBoundExclusive() excludes the lower bound
        //.upperBoundExclusive() excludes the upper bound
        .process(new ProcessJoinFunction[(String,String,Long),
                                         (String,String,Long),String] {
            override def processElement(left: (String, String, Long),
                                        right: (String, String, Long),
                                        ctx: ProcessJoinFunction[(String, String, Long),
                                                                 (String, String, Long), String]#Context,
                                        out: Collector[String]): Unit = {
                println("l:"+ctx.getLeftTimestamp+
                        " r:"+ctx.getRightTimestamp+",t:"+ctx.getTimestamp)
                out.collect(left._1+" "+left._2+" "+right._1)
            }
        })
        .print()
        env.execute("FlinkIntervalJoin")
    }
}

Flink JobManager HA


Overview

The JobManager coordinates every Flink deployment. It is responsible for scheduling and resource management. By default, each Flink cluster has a single JobManager instance. This creates a single point of failure (SPOF): if the JobManager crashes, no new programs can be submitted and running programs fail. With JobManager high availability you can recover from JobManager failures and thereby eliminate the SPOF. You can configure high availability for both standalone and YARN clusters.

Standalone Cluster High Availability

The general idea of JobManager high availability for standalone clusters is that there is one leader JobManager at any time and multiple standby JobManagers that take over leadership if the leader fails. This guarantees that there is no single point of failure, and programs make progress as soon as a standby JobManager has taken leadership. There is no explicit distinction between standby and leader JobManager instances; each JobManager can take the master or standby role.

Setup

Clock synchronization (the Apple NTP server is used below; for Windows-hosted VMs use the hypervisor's time-sync option)

[root@CentOSX ~]# ntpdate time.apple.com
13 Mar 17:09:10 ntpdate[6581]: step time server 17.253.84.253 offset 2169739.408920 sec
[root@CentOSX ~]# clock -w

IP and hostname mapping

[root@CentOSX ~]# vi /etc/hosts
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.52.130 CentOSA
192.168.52.131 CentOSB
192.168.52.132 CentOSC

Passwordless SSH

[root@CentOSX ~]# ssh-keygen -t rsa
[root@CentOSX ~]# ssh-copy-id CentOSA
[root@CentOSX ~]# ssh-copy-id CentOSB
[root@CentOSX ~]# ssh-copy-id CentOSC

Disable the firewall

[root@CentOSX ~]# systemctl stop firewalld
[root@CentOSX ~]# systemctl disable firewalld

Install the JDK and configure JAVA_HOME

[root@CentOSX ~]# rpm -ivh jdk-8u171-linux-x64.rpm
[root@CentOSX ~]# vi .bashrc
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin
CLASSPATH=.
export JAVA_HOME
export CLASSPATH
export PATH
[root@CentOSX ~]# source .bashrc

Install and start the ZooKeeper cluster

[root@CentOSX ~]# tar -zxf zookeeper-3.4.6.tar.gz -C /usr/
[root@CentOSX ~]# mkdir /root/zkdata
[root@CentOSA ~]# echo 1 >> /root/zkdata/myid
[root@CentOSB ~]# echo 2 >> /root/zkdata/myid
[root@CentOSC ~]# echo 3 >> /root/zkdata/myid
[root@CentOSX ~]# touch /usr/zookeeper-3.4.6/conf/zoo.cfg
[root@CentOSX ~]# vi /usr/zookeeper-3.4.6/conf/zoo.cfg
tickTime=2000
dataDir=/root/zkdata
clientPort=2181
initLimit=5
syncLimit=2
server.1=CentOSA:2887:3887
server.2=CentOSB:2887:3887
server.3=CentOSC:2887:3887
[root@CentOSX ~]# /usr/zookeeper-3.4.6/bin/zkServer.sh start zoo.cfg
[root@CentOSX ~]# /usr/zookeeper-3.4.6/bin/zkServer.sh status zoo.cfg
JMX enabled by default
Using config: /usr/zookeeper-3.4.6/bin/../conf/zoo.cfg
Mode: `follower|leader`
[root@CentOSX ~]# jps
5879 `QuorumPeerMain`
7423 Jps

Install HDFS HA

[root@CentOSX ~]# tar -zxf hadoop-2.9.2.tar.gz -C /usr/
[root@CentOSX ~]# vi .bashrc
HADOOP_HOME=/usr/hadoop-2.9.2
JAVA_HOME=/usr/java/latest
PATH=$PATH:$JAVA_HOME/bin:$M2_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
CLASSPATH=.
export JAVA_HOME
export PATH
export CLASSPATH
export HADOOP_HOME
export HADOOP_CLASSPATH=`hadoop classpath`
[root@CentOSX ~]# source .bashrc
[root@CentOSX ~]# vi /usr/hadoop-2.9.2/etc/hadoop/core-site.xml
<!-- NameNode service id -->
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://mycluster</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop-2.9.2/hadoop-${user.name}</value>
</property>
<property>
    <name>fs.trash.interval</name>
    <value>30</value>
</property>
<!-- ZooKeeper quorum -->
<property>
    <name>ha.zookeeper.quorum</name>
    <value>CentOSA:2181,CentOSB:2181,CentOSC:2181</value>
</property>
<!-- fencing method and SSH key location -->
<property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
</property>
<property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/root/.ssh/id_rsa</value>
</property>
[root@CentOSX ~]# vi /usr/hadoop-2.9.2/etc/hadoop/hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<!-- enable automatic failover -->
<property>
    <name>dfs.ha.automatic-failover.enabled</name>
    <value>true</value>
</property>
<!-- resolve the nameservice declared in core-site.xml -->
<property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
</property>
<property>
    <name>dfs.ha.namenodes.mycluster</name>
    <value>nn1,nn2</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn1</name>
    <value>CentOSA:9000</value>
</property>
<property>
    <name>dfs.namenode.rpc-address.mycluster.nn2</name>
    <value>CentOSB:9000</value>
</property>
<!-- JournalNode shared edits directory -->
<property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://CentOSA:8485;CentOSB:8485;CentOSC:8485/mycluster</value>
</property>
<!-- failover proxy provider class -->
<property>
    <name>dfs.client.failover.proxy.provider.mycluster</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
[root@CentOSX ~]# vi /usr/hadoop-2.9.2/etc/hadoop/slaves
CentOSA
CentOSB
CentOSC

Start HDFS (first-time cluster initialization)

[root@CentOSX ~]# hadoop-daemon.sh start journalnode (wait ~10 s)
[root@CentOSA ~]# hdfs namenode -format
[root@CentOSA ~]# hadoop-daemon.sh start namenode
[root@CentOSB ~]# hdfs namenode -bootstrapStandby
[root@CentOSB ~]# hadoop-daemon.sh start namenode
# register the NameNode info in ZooKeeper; run on either CentOSA or CentOSB only
[root@CentOSA|B ~]# hdfs zkfc -formatZK
[root@CentOSA ~]# hadoop-daemon.sh start zkfc
[root@CentOSB ~]# hadoop-daemon.sh start zkfc
[root@CentOSX ~]# hadoop-daemon.sh start datanode

Install and configure Flink

[root@CentOSX ~]# tar -zxf flink-1.10.0-bin-scala_2.11.tgz -C /usr/
[root@CentOSX ~]# vi /usr/flink-1.10.0/conf/flink-conf.yaml
taskmanager.numberOfTaskSlots: 4
parallelism.default: 3
queryable-state.enable: true
#==============================================================================
# High Availability
#==============================================================================
high-availability: zookeeper
high-availability.storageDir: hdfs:///flink/ha/
high-availability.zookeeper.quorum: CentOSA:2181,CentOSB:2181,CentOSC:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /default_ns
#==============================================================================
# Fault tolerance and checkpointing
#==============================================================================
state.backend: rocksdb
state.checkpoints.dir: hdfs://mycluster/flink-checkpoints
state.savepoints.dir: hdfs://mycluster/flink-savepoints
state.backend.incremental: false
[root@CentOSX ~]# vi /usr/flink-1.10.0/conf/masters
CentOSA:8081
CentOSB:8081
CentOSC:8081
[root@CentOSX ~]# vi /usr/flink-1.10.0/conf/slaves
CentOSA
CentOSB
CentOSC
[root@CentOSA flink-1.10.0]# ./bin/start|stop-cluster.sh
Starting HA cluster with 3 masters.
Starting standalonesession daemon on host CentOSA.
Starting standalonesession daemon on host CentOSB.
Starting standalonesession daemon on host CentOSC.
Starting taskexecutor daemon on host CentOSA.
Starting taskexecutor daemon on host CentOSB.
Starting taskexecutor daemon on host CentOSC.

A single JobManager can also be started or stopped on any master host:

[root@CentOSX flink-1.10.0]# ./bin/jobmanager.sh start|stop

Source: blog.csdn.net/origin_cx/article/details/104850383