Flink in practice: page advertisement analysis and real-time detection of malicious clicks

Websites typically set ad pricing and adjust their marketing strategy based on click volume, and often collect user preferences and similar information as well. This example counts ad clicks per province (or city), which helps the marketing team place ads more precisely. It also guards against malicious clicking, where someone repeatedly clicks the same ad (or, of course, the same IP keeps clicking different ads).

The prepared log file ClickLog.csv (fields: userId, adId, province, city, timestamp in seconds):

543462,1715,beijing,beijing,1512652431
543461,1713,shanghai,shanghai,1512652433
543464,1715,shanxi,xian,1512652435
543464,1715,shanxi,weinan,1512652441
543464,1715,shanxi,weinan,1512652442
543464,1715,shanxi,weinan,1512652443
543464,1715,shanxi,weinan,1512652444
543464,1715,shanxi,weinan,1512652445
543464,1715,shanxi,weinan,1512652446
543464,1715,shanxi,weinan,1512652447
543464,1715,shanxi,weinan,1512652451
543464,1715,shanxi,weinan,1512652452
543464,1715,shanxi,weinan,1512652453
543464,1715,shanxi,weinan,1512652454
543464,1715,shanxi,weinan,1512652455
543464,1715,shanxi,weinan,1512652456
543464,1715,shanxi,weinan,1512652457
543464,1715,shanxi,hanzhong,1512652461
543464,1715,shanxi,yanan,1512652561
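
Before wiring this into Flink, the per-province count can be checked in plain Scala. The following is only a minimal sketch, not part of the Flink job: `ClickLogSketch`, `Click`, `parse` and `countByProvince` are illustrative names, and skipping malformed lines is an assumption of the sketch (the Flink code in this article does not do that).

```scala
// Plain-Scala sketch (no Flink): parse click-log lines and count clicks per province.
// All names here are hypothetical and exist only for illustration.
object ClickLogSketch {
  final case class Click(userId: Long, adId: Long, province: String, city: String, timestamp: Long)

  // Returns None for lines that do not have exactly five well-formed fields.
  def parse(line: String): Option[Click] =
    line.split(",").map(_.trim) match {
      case Array(u, a, p, c, t) =>
        scala.util.Try(Click(u.toLong, a.toLong, p, c, t.toLong)).toOption
      case _ => None
    }

  // Groups the successfully parsed clicks by province and counts them.
  def countByProvince(lines: Seq[String]): Map[String, Int] =
    lines.flatMap(parse).groupBy(_.province).map { case (province, cs) => province -> cs.size }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "543462,1715,beijing,beijing,1512652431",
      "543461,1713,shanghai,shanghai,1512652433",
      "543464,1715,shanxi,xian,1512652435",
      "543464,1715,shanxi,weinan,1512652441"
    )
    println(countByProvince(sample))
  }
}
```

The Flink job computes the same kind of count, but per sliding window and incrementally as records arrive.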

Code:

/*
 *
 * @author mafei
 * @date 2021/1/10
*/
package com.mafei.market_analysis

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import java.sql.Timestamp

/**
 * Input case class.
 * Example record: 543464,1715,shanxi,weinan,1512652459
 */
case class AdClickLog(userId: Long, adId: Long, province: String, city: String, timestamp: Long)

/**
 * Output case class:
 * ad click count per province for one window
 */
case class AdClickCountByProvince(windowEnd: String, province: String, count: Long)

object AdClickAnalysis {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // use event time for windows and watermarks
    env.setParallelism(1)

    // Read the data from a file on the classpath
    val resource = getClass.getResource("/ClickLog.csv")
    val inputStream = env.readTextFile(resource.getPath)

    // Convert each line to the case class and assign timestamps/watermarks
    val adLogStream = inputStream
      .map(d => {
        val arr = d.split(",")
        AdClickLog(arr(0).toLong, arr(1).toLong, arr(2), arr(3), arr(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000L) // seconds -> milliseconds

    // Key by province, then aggregate over a 1-day sliding window that slides every 50 seconds
    val adCountResultStream = adLogStream
      .keyBy(_.province)
      .timeWindow(Time.days(1), Time.seconds(50))
      .aggregate(new AdCountAgg(), new AdCountWindowResult())

    adCountResultStream.print()
    env.execute("Ad click statistics")
  }
}

// Incremental aggregation: the accumulator is simply the number of clicks seen so far
class AdCountAgg() extends AggregateFunction[AdClickLog, Long, Long] {
  override def createAccumulator(): Long = 0L

  override def add(in: AdClickLog, acc: Long): Long = acc + 1

  override def getResult(acc: Long): Long = acc

  override def merge(acc: Long, acc1: Long): Long = acc + acc1
}

// Wraps the pre-aggregated count with the window end time and the province key
class AdCountWindowResult() extends WindowFunction[Long, AdClickCountByProvince, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[AdClickCountByProvince]): Unit = {
    // input holds exactly one element: the result of AdCountAgg for this window
    out.collect(AdClickCountByProvince(windowEnd = new Timestamp(window.getEnd).toString, province = key, count = input.head))
  }
}

[Figure: code structure and run output]

Filtering click-fraud users with a blacklist

The code above counts every click, including repeated clicks by the same user. In production a user may legitimately click an ad more than once, but clicking it very frequently within a short period is clearly not normal behavior. So we cap the click count: for example, the same user may click the same ad at most 100 times per day. Once the cap is exceeded, the user is added to a blacklist, an alert is emitted, and that user's subsequent clicks on the ad are no longer counted.
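
The rule itself is independent of Flink and can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not the article's implementation: `BlacklistSketch`, `Click`, `Warning` and `filter` are hypothetical names, the daily reset is left out, and the sketch lets exactly `maxCount` clicks through per (user, ad) pair, which is one reading of "at most N times".

```scala
import scala.collection.mutable

// Plain-Scala sketch of the blacklist rule (no Flink, no daily reset).
// All names are hypothetical and exist only for illustration.
object BlacklistSketch {
  final case class Click(userId: Long, adId: Long)
  final case class Warning(userId: Long, adId: Long)

  // Returns (clicks that still count, one warning per blacklisted (user, ad) pair).
  def filter(clicks: Seq[Click], maxCount: Int): (Seq[Click], Seq[Warning]) = {
    val counts = mutable.Map.empty[(Long, Long), Int]
    val blacklisted = mutable.Set.empty[(Long, Long)]
    val passed = Seq.newBuilder[Click]
    val warnings = Seq.newBuilder[Warning]

    for (c <- clicks) {
      val key = (c.userId, c.adId)
      val seen = counts.getOrElse(key, 0)
      if (seen >= maxCount) {
        // Limit already used up: drop the click and warn only the first time.
        if (blacklisted.add(key)) warnings += Warning(c.userId, c.adId)
      } else {
        counts.update(key, seen + 1)
        passed += c
      }
    }
    (passed.result(), warnings.result())
  }
}
```

In the Flink job the two mutable collections become keyed state (one counter and one flag per key), and the warnings go to a side output rather than a second return value.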
The improved version, with the threshold set to 10:

/*
 *
 * @author mafei
 * @date 2021/1/10
*/
package com.mafei.market_analysis

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

import java.sql.Timestamp

/**
 * Input case class.
 * Example record: 543464,1715,shanxi,weinan,1512652459
 */
case class AdClickLog(userId: Long, adId: Long, province: String, city: String, timestamp: Long)

/**
 * Output case class:
 * ad click count per province for one window
 */
case class AdClickCountByProvince(windowEnd: String, province: String, count: Long)

/**
 * Case class for the blacklist warning output
 */
case class UserBlackListWarning(userId: String, adId: String, msg: String)

object AdClickAnalysis {
  def main(args: Array[String]): Unit = {

    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) // use event time for windows and watermarks
    env.setParallelism(1)

    // Read the data from a file on the classpath
    val resource = getClass.getResource("/ClickLog.csv")
    val inputStream = env.readTextFile(resource.getPath)

    // Convert each line to the case class and assign timestamps/watermarks
    val adLogStream = inputStream
      .map(d => {
        val arr = d.split(",")
        AdClickLog(arr(0).toLong, arr(1).toLong, arr(2), arr(3), arr(4).toLong)
      })
      .assignAscendingTimestamps(_.timestamp * 1000L) // seconds -> milliseconds

    // Extra step: emit users showing click-fraud behavior to the blacklist (a side output) and filter out their clicks
    val userBlackListFilterStream: DataStream[AdClickLog] = adLogStream
      .keyBy(data => {
        (data.userId, data.adId)
      })
      .process(new FilterUserBlackListResult(10L))

    // Key by province, then aggregate over a 1-day sliding window that slides every 50 seconds
    val adCountResultStream = userBlackListFilterStream
      .keyBy(_.province)
      .timeWindow(Time.days(1), Time.seconds(50))
      .aggregate(new AdCountAgg(), new AdCountWindowResult())

    adCountResultStream.print()

    // Print the side output
    userBlackListFilterStream.getSideOutput(new OutputTag[UserBlackListWarning]("warning")).print("side output")
    env.execute("Ad click statistics")
  }
}

// Incremental aggregation: the accumulator is simply the number of clicks seen so far
class AdCountAgg() extends AggregateFunction[AdClickLog, Long, Long] {
  override def createAccumulator(): Long = 0L

  override def add(in: AdClickLog, acc: Long): Long = acc + 1

  override def getResult(acc: Long): Long = acc

  override def merge(acc: Long, acc1: Long): Long = acc + acc1
}

// Wraps the pre-aggregated count with the window end time and the province key
class AdCountWindowResult() extends WindowFunction[Long, AdClickCountByProvince, String, TimeWindow] {
  override def apply(key: String, window: TimeWindow, input: Iterable[Long], out: Collector[AdClickCountByProvince]): Unit = {
    // input holds exactly one element: the result of AdCountAgg for this window
    out.collect(AdClickCountByProvince(windowEnd = new Timestamp(window.getEnd).toString, province = key, count = input.head))
  }
}

/**
 * The key is the (userId, adId) tuple defined above.
 * Input and output types are the same; this function only filters.
 */
class FilterUserBlackListResult(macCount: Long) extends KeyedProcessFunction[(Long, Long), AdClickLog, AdClickLog] {
  /**
   * State: number of clicks by this user on this ad
   */
  lazy val countState: ValueState[Long] = getRuntimeContext.getState(new ValueStateDescriptor[Long]("count", classOf[Long]))

  /**
   * State: timestamp of the timer that clears the state at midnight every day
   */
  lazy val resetTimeTsState: ValueState[Long] = getRuntimeContext.getState(new ValueStateDescriptor[Long]("resetTs", classOf[Long]))

  /**
   * State: whether this user is already on the blacklist
   */
  lazy val isBlackList: ValueState[Boolean] = getRuntimeContext.getState(new ValueStateDescriptor[Boolean]("isBlackList", classOf[Boolean]))

  override def processElement(i: AdClickLog, context: KeyedProcessFunction[(Long, Long), AdClickLog, AdClickLog]#Context, collector: Collector[AdClickLog]): Unit = {
    val curCount = countState.value()

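    // Initial state: first click for this key since the last reset
    if (curCount == 0) {
      /*
       * Compute the timestamp of the next midnight in UTC+8 and register a timer
       * that clears all state at that time.
       *
       * (now + offsetMs) / dayMs + 1  is the local day number of tomorrow,
       * * dayMs                       converts it back to milliseconds,
       * - offsetMs                    shifts local midnight back to a UTC epoch timestamp.
       *
       * The offset is added before dividing; subtracting it only at the end would
       * give a time already in the past for local times between 00:00 and 08:00.
       */
      val dayMs = 24 * 60 * 60 * 1000L
      val offsetMs = 8 * 60 * 60 * 1000L
      val now = context.timerService().currentProcessingTime()
      val ts = ((now + offsetMs) / dayMs + 1) * dayMs - offsetMs
      context.timerService().registerProcessingTimeTimer(ts)
      resetTimeTsState.update(ts) // remember the reset time
    }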

    // If the count has reached the threshold, send a warning to the side output and drop the click.
    // `>=` is used so that at most macCount clicks are counted.
    if (curCount >= macCount) {
      // Only emit the warning if the user is not yet blacklisted; otherwise it would be emitted repeatedly
      if (!isBlackList.value()) {
        isBlackList.update(true)
        context.output(new OutputTag[UserBlackListWarning]("warning"), UserBlackListWarning(i.userId.toString, i.adId.toString, "click count " + curCount + " reached the daily limit of " + macCount))
      }
      return
    }

    // Normal case: increment the count and pass the record through unchanged; this function is only a filtering wrapper
    countState.update(curCount + 1)
    collector.collect(i)
  }

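  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[(Long, Long), AdClickLog, AdClickLog]#OnTimerContext, out: Collector[AdClickLog]): Unit = {
    // Midnight reset: clear the counter and the blacklist flag for this key.
    // The next click sees a count of 0 and registers a new timer.
    if (timestamp == resetTimeTsState.value()) {
      isBlackList.clear()
      countState.clear()
    }
  }
}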
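
The reset timer depends on computing the next local midnight from an epoch timestamp, and this arithmetic has a pitfall: rounding on UTC days first and only then subtracting 8 hours yields an instant that is already in the past whenever the local time is between 00:00 and 08:00. The following is a minimal sketch assuming a fixed UTC+8 offset with no daylight saving; `MidnightSketch`, `naive` and `nextLocalMidnight` are illustrative names, not Flink APIs.

```scala
// Plain-Scala sketch: epoch milliseconds of the next local midnight for a fixed UTC offset.
// All names are hypothetical and exist only for illustration.
object MidnightSketch {
  private val DayMs = 24L * 60 * 60 * 1000

  // Rounds on UTC days, then subtracts the offset.
  // The result can be earlier than nowMs during the first offsetMs of each local day.
  def naive(nowMs: Long, offsetMs: Long): Long =
    (nowMs / DayMs + 1) * DayMs - offsetMs

  // Shifts into local time first, rounds up to the next day boundary, then shifts back.
  // The result is always strictly after nowMs.
  def nextLocalMidnight(nowMs: Long, offsetMs: Long): Long =
    (Math.floorDiv(nowMs + offsetMs, DayMs) + 1) * DayMs - offsetMs

  def main(args: Array[String]): Unit = {
    val offset = 8L * 60 * 60 * 1000
    val now = 1610218800000L // 2021-01-09T19:00:00Z, i.e. 03:00 on 2021-01-10 in UTC+8
    println(naive(now, offset) < now)             // the naive result is already in the past
    println(nextLocalMidnight(now, offset) > now) // the shifted version is in the future
  }
}
```

For zones with daylight saving time a fixed offset is not enough; `java.time.ZonedDateTime` would be the safer tool there.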

[Figure: code structure and run output]

Origin blog.51cto.com/mapengfei/2604456