Application of Spark in risk control user groups

Introduction

At the beginning of 2020, with the business growing steadily, risk control ran into new challenges. On a C2C trading platform (similar to Dewu or Xianyu), it is inevitable that groups of users will band together to farm platform promotions and to fake volume by buying from and selling to themselves. After consolidating users' historical data, the risk control team wanted to circle out groups of all sizes so that transactions inside a group and between groups could be blocked. The data volume is large and the dimensions are complicated. After several weeks of trial and error, plus exchanges with a few small lending companies, a complete gang-discovery system was finally settled on and eventually moved toward real-time data, in which Spark's GraphX and Streaming play an important role. The following is a brief description of the version 1.0 application.

Spark GraphX

To be updated...
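
This part of the write-up is still pending. Since the introduction notes that GraphX carries the group computation, here is a minimal sketch, assuming the relationships produced by the streaming job are available as "uid,flag_id" edge pairs already encoded as Long ids (the input and output paths are hypothetical), of circling out groups with connected components:

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object GroupDiscovery {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GroupDiscovery").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input: one "uid,flag_id" relationship per line, both sides encoded as Long ids
    val edges = sc.textFile("hdfs:///risk/user_relations")
      .map { line =>
        val Array(uid, flagId) = line.split(",")
        Edge(uid.toLong, flagId.toLong, 1)
      }

    // Every vertex ends up labelled with the smallest vertex id of its component,
    // which can serve as the group (gang) id
    val graph = Graph.fromEdges(edges, defaultValue = 0)
    val groups = graph.connectedComponents().vertices // (vertexId, groupId)

    groups.map { case (vid, gid) => s"$vid,$gid" }
      .saveAsTextFile("hdfs:///risk/user_groups")

    spark.stop()
  }
}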

Neo4j

Neo4j is an open-source graph database (the cluster edition is not open source and is licensed per node, which is painfully expensive; a single node is free), so the same business is split across instances to meet current data needs. It is used only for queries, not for computation; running the computation in Neo4j was considered at first, but it was inefficient and memory-hungry, so Spark GraphX was chosen for the computation instead.

To be updated...

Spark Streaming

Spark Streaming (Flink can be used where conditions allow) does two things:

1. Second-level real-time relationship storage

Data source: binlog (Kafka)
Data storage: Neo4j, Kafka, Redis
Processing:
(1) Ordinary data: a single feature of user A is written directly to Neo4j, Kafka, and Redis.
(2) Multi-join data: when two topics have to be joined, a plain join is not much use, so a window plus groupByKey is used instead. The window size is 4s and the sliding interval is 2s, determined by the latency of the business itself and of the data itself: after the action is triggered, the app waits a few seconds for the results of several actions, which guarantees the risk-control data is in place before the business calls it. Detecting the risk only after the wool party has already finished farming would be a bit pointless. (So if some actions feel a little sluggish, it is not data congestion, it may just be a sleep(n) written by the programmer, hahaha.) For example, the relationship between uid and pay_id needs two tables whose key fields are (order_id, uid) and (order_id, pay_id): first union the two topics, then group the matching fields together, as in the snippet below.

 .map(elem => (((elem._1, elem._2), elem._3), 1))  // ((table name, order_id), uid/pay_id)
 .reduceByKeyAndWindow((x: Int, y: Int) => x + y, Seconds(30), Seconds(15))  // deduplicate within the window
 .map(elem => (elem._1._1, elem._1._2))            // ((table name, order_id), uid/pay_id) without the count
 .groupByKey()                                     // (table name, order_id) => Iterable(uid, pay_id)
 .map(elem => {
   // assemble the data; besides the withdrawal-table association there are many other relationships
   elem
 })
 .foreachRDD(rdd => { /* store to Neo4j / Kafka / Redis */ })

Neo4j

  // Assumed imports for the snippets below (Neo4j Java driver 1.x):
  //   import java.util.concurrent.TimeUnit
  //   import org.neo4j.driver.v1.{AuthTokens, Config, Driver, GraphDatabase, Session, StatementResult}

  // Update the relationship: create the nodes and the edge only when the edge does not exist yet
  private def relationShip(session: Session, uid: Int, flag_label: String, flag_type: String, fuid: String): Any = {
    val sql =  s"""
                  |match (p1:User{userId:$uid})-[r:$flag_type]->(p2:$flag_label{flagId:"$fuid"}) return p1.userId as uid
                """.stripMargin

    val result: StatementResult = session.run(sql)

    if (!result.hasNext) {
      val sql1 =
        s"""
           |MERGE (p1:User{userId:$uid})
           |MERGE (p2:$flag_label{flagId:"$fuid"})
        """.stripMargin

      session.run(sql1)

      val sql =
        s"""
           |MATCH (p1:User{userId:$uid}),(p2:$flag_label{flagId:"$fuid"})
           |MERGE (p1)-[r:$flag_type{score:1}]->(p2)
           |RETURN p1.userId as uid
        """.stripMargin
      val rel: StatementResult = session.run(sql)
      if (rel.hasNext) {
        val record = rel.next()
        val uid = record.get("uid").toString
        println(uid)
      } else {
        System.err.println("error:" + uid + flag_label + flag_type + fuid)
      }
    }
  }

  /**
    * Get the Neo4j driver
    * @return
    */
  def getDriver(): Driver = {
    val url = "bolt://neo4j01:8687"
    val user = "user"
    val password = "password"
    val driver = GraphDatabase.driver(url, AuthTokens.basic(user, password), Config.build()
      .withMaxIdleSessions(1000)
      .withConnectionLivenessCheckTimeout(10, TimeUnit.SECONDS)
      .toConfig)
    driver
  }

  /**
    * Get a session from the driver
    * @param driver
    * @return
    */
  def getSession(driver: Driver): Session = {
    val session = driver.session()
    session
  }
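
For reference, a minimal sketch of how these helpers might be wired into the streaming output; the DStream name and tuple layout are assumptions, with one driver and one session per partition:

  // Assumed: relationDStream carries (uid, flag_label, flag_type, fuid) tuples
  relationDStream.foreachRDD { rdd =>
    rdd.foreachPartition { partition =>
      val driver = getDriver()
      val session = getSession(driver)
      partition.foreach { case (uid, flagLabel, flagType, fuid) =>
        relationShip(session, uid, flagLabel, flagType, fuid)
      }
      session.close()
      driver.close()
    }
  }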

Kafka

 def resToKafka(ssc: StreamingContext, kafkaDStreamValue: DStream[(String, String, String, String)]): Unit = {

    // Broadcast the KafkaSink so that each executor lazily creates its own producer
    val kafkaProducer: Broadcast[KafkaSink[String, String]] = {
      val kafkaProducerConfig = {
        val p = new Properties()
        p.put("group.id", "realtime")
        p.put("acks", "all")
        p.put("retries ", "1")
        p.setProperty("bootstrap.servers", GetPropKey.brokers)
        p.setProperty("key.serializer", classOf[StringSerializer].getName)
        p.setProperty("value.serializer", classOf[StringSerializer].getName)
        p
      }
      ssc.sparkContext.broadcast(KafkaSink[String, String](kafkaProducerConfig))
    }

    // Write the result to Kafka
    kafkaDStreamValue.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        rdd.foreachPartition(partition => {
          partition.foreach(elem => {
            val flag_label = elem._3
            if (!flag_label.equals("null")) {
              val auth_info = elem._1
              val uid = elem._2.toInt
              val flag_type = elem._4
              val value = dataJson(uid, auth_info, flag_label, flag_type)
              kafkaProducer.value.send("risk_user_auth_info", value)
            }
          })
        })
      }
    })
  }


  /**
    * Format a record as JSON
    *
    * @param uid
    * @param fid
    * @param flag_label
    * @param flag_type
    * @return
    */
  def dataJson(uid: Int, fid: String, flag_label: String, flag_type: String): String = {
    val map = Map(
      "user_id" -> uid,
      "flag_id" -> fid,
      "flag_label" -> flag_label,
      "flag_type" -> flag_type
    )
    JSONObject.apply(map).toString()
  }

2. Dimension association + storage

Data source: risk_user_auth_info (the delay queue written to Kafka in step 1)
Data storage: Redis
Process: take records from the delay queue, look up the existing relationships in Redis, boil each record down to its parent node (the iterated version adds an algorithm that merges and splits the mapping), and then write the result back to Redis, as sketched below.
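
As a rough illustration of "boiling a record down to the parent node", here is a minimal sketch, assuming a Jedis client and a hypothetical Redis hash risk:parent that maps each node to its parent; it walks up to the root and points the new node at it (the real version also merges and splits groups):

import redis.clients.jedis.Jedis

object ParentLookup {
  // Follow parent pointers stored in the hypothetical hash "risk:parent"
  // until a node with no parent (the root, i.e. the group id) is reached
  def findRoot(jedis: Jedis, node: String): String = {
    var current = node
    var parent = jedis.hget("risk:parent", current)
    while (parent != null && parent != current) {
      current = parent
      parent = jedis.hget("risk:parent", current)
    }
    current
  }

  // Attach a newly seen node (e.g. a uid or pay_id) to the root of an existing node
  def attach(jedis: Jedis, newNode: String, existingNode: String): Unit = {
    val root = findRoot(jedis, existingNode)
    jedis.hset("risk:parent", newNode, root)
  }
}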

Drools

To be updated...

Origin: blog.csdn.net/jklcl/article/details/112852112