FileBeat + Flume + Kafka + HDFS + Neo4j + Flink + Redis: [Case Study] Three-Degree Relationship Recommendation V2.0, Part 07: Every Monday, Compute the Three-Degree Relationship List of Anchors Active in the Past Week

I. Summary of Data Computation Steps

Let's walk through the concrete data computation steps in text form.
Step 1: Initialize the historical fan-follow data
Step 2: Maintain the fan-follow data in real time
Step 3: Update anchor levels on a daily schedule
Step 4: Update user active timestamps on a daily schedule
Step 5: Every Monday, compute anchor video ratings for the past month
Step 6: Every Monday, compute the three-degree relationship list of anchors active in the past week
Step 7: Export the three-degree relationship list to Redis

Code download:

Link: https://pan.baidu.com/s/1kzuwD3XarH26_roq255Yyg?pwd=559p
Extraction code: 559p

II. Every Monday, Compute the Three-Degree Relationship List of Anchors Active in the Past Week

We use a Flink program to compute, every Monday, the three-degree relationship list of anchors active in the past week.

1. Create the Project

Create a sub-module project: get_recommend_list
Create a scala directory in the project and add the Scala 2.12 SDK
Create the package: com.imooc.flink
Add the dependencies to pom.xml

<dependencies>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </dependency>
    <dependency>
        <groupId>org.neo4j.driver</groupId>
        <artifactId>neo4j-java-driver</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.12</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
    </dependency>
</dependencies>

Add a log4j.properties configuration file to the resources directory
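
The original doesn't show the file's contents; a minimal log4j 1.x sketch could look like this (the log level and pattern here are assumptions, adjust to taste):

log4j.rootLogger=info,stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d [%t] %-5p %c - %m%n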

Note: at this point we need Flink to read data from Neo4j. The DataSet API does not support addSource, but it does provide createInput, which accepts a custom InputFormat, so we need to define a Neo4jInputFormat.
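
As a quick preview of that API (the complete job appears in section 3 below), a minimal sketch looks like this, using the Neo4jInputFormat defined in the next section with placeholder connection parameters:

package com.imooc.flink

import org.apache.flink.api.scala._

//Minimal sketch of createInput: read uids from Neo4j through the custom InputFormat
object CreateInputSketch {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    //Placeholder connection parameters; adjust to your environment
    val param = Map("boltUrl" -> "bolt://bigdata04:7687", "userName" -> "neo4j",
      "passWord" -> "admin", "timestamp" -> "0", "level" -> "4")
    val uidSet: DataSet[String] = env.createInput(new Neo4jInputFormat(param))
    uidSet.print() //print() triggers execution for a DataSet
  }
}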

2. Create Neo4jInputFormat

Create the class: Neo4jInputFormat
The code is as follows:

Note: this input component must run with a parallelism of 1; running the query with multiple parallel instances could produce duplicate data.

package com.imooc.flink

import org.apache.flink.api.common.io.statistics.BaseStatistics
import org.apache.flink.api.common.io.{DefaultInputSplitAssigner, NonParallelInput, RichInputFormat}
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.io.{GenericInputSplit, InputSplit, InputSplitAssigner}
import org.neo4j.driver.{AuthTokens, Driver, GraphDatabase, Result, Session}

/**
 * Query the anchors in Neo4j that meet the criteria:
 * active within the last week and with anchor level greater than 4
 *
 */
class Neo4jInputFormat extends RichInputFormat[String,InputSplit] with NonParallelInput{
  //Note: "with NonParallelInput" indicates that this component does not support a parallelism greater than 1

  //Holds the Neo4j-related configuration parameters
  var param: Map[String,String] = Map()

  var driver: Driver = _
  var session: Session = _
  var result: Result = _

  /**
   * Auxiliary constructor
   * Receives the Neo4j-related configuration parameters
   * @param param
   */
  def this(param: Map[String,String]){
    this()
    this.param = param
  }

  /**
    * Configure this input format
    * @param parameters
    */
  override def configure(parameters: Configuration): Unit = {}

  /**
    * Get basic statistics about the input data
    * @param cachedStatistics
    * @return
    */
  override def getStatistics(cachedStatistics: BaseStatistics): BaseStatistics = {
    cachedStatistics
  }

  /**
    * Split the input data into input splits
    * @param minNumSplits
    * @return
    */
  override def createInputSplits(minNumSplits: Int): Array[InputSplit] = {
    Array(new GenericInputSplit(0,1))
  }

  /**
    * Get the assigner that hands out the input splits
    * @param inputSplits
    * @return
    */
  override def getInputSplitAssigner(inputSplits: Array[InputSplit]): InputSplitAssigner = {
    new DefaultInputSplitAssigner(inputSplits)
  }

  /**
   * Initialization method: executed only once
   * Obtain the Neo4j connection and open a session
   */
  override def openInputFormat(): Unit = {
    //Initialize the Neo4j connection
    this.driver = GraphDatabase.driver(param("boltUrl"), AuthTokens.basic(param("userName"), param("passWord")))
    this.session = driver.session()
  }

  /**
   * Close the Neo4j connection
   */
  override def closeInputFormat(): Unit = {
    if (driver != null) {
      driver.close()
    }
  }

  /**
   * This method is also executed only once
   * Runs the query
   * @param split
   */
  override def open(split: InputSplit): Unit = {
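    //Cypher: select User nodes whose active timestamp and level are at or above the configured thresholds, and return their uid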
    this.result = session.run("match (a:User) where a.timestamp >= "+param("timestamp")+" and a.level >= "+param("level")+" return a.uid")
  }

  /**
   * Returns true once all data has been read
   * @return
   */
  override def reachedEnd(): Boolean = {
    !result.hasNext
  }

  /**
   * Read the result data, one record at a time
   * @param reuse
   * @return
   */
  override def nextRecord(reuse: String): String = {
    val record = result.next()
    val uid = record.get(0).asString()
    uid
  }

  /**
   * Close the session
   */
  override def close(): Unit = {
    if (session != null) {
      session.close()
    }
  }
}

3. Create GetRecommendList

Create the class: GetRecommendListScala
The code is as follows:

package com.imooc.flink

import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.scala.ExecutionEnvironment
import org.neo4j.driver.{AuthTokens, GraphDatabase}

import scala.collection.mutable.{ArrayBuffer, ListBuffer}

/**
 * Task 6:
 * Every Monday, compute the three-degree relationship list of anchors active in the past week
 * Notes:
 * 1: The candidate anchor must have been active within the past week
 * 2: The candidate anchor's level must be > 4
 * 3: The candidate anchor's video rating over the past month must be 3B+ or 2A+ (flag=1)
 * 4: The follower-list overlap with the candidate anchor must be > 2
 *
 */
object GetRecommendListScala {
  def main(args: Array[String]): Unit = {
    var appName = "GetRecommendListScala"
    var boltUrl = "bolt://bigdata04:7687"
    var userName = "neo4j"
    var passWord = "admin"
    var timestamp = 0L //filter: was the anchor active within the past week
    var duplicateNum = 2 //follower-list overlap threshold
    var level = 4 //anchor level
    var outputPath = "hdfs://bigdata01:9000/data/recommend_data/20260201"
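    //If command-line arguments are provided (e.g. when submitted via the startup script), they override all of the defaults above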
    if(args.length > 0){
      appName = args(0)
      boltUrl = args(1)
      userName = args(2)
      passWord = args(3)
      timestamp = args(4).toLong
      duplicateNum = args(5).toInt
      level = args(6).toInt
      outputPath = args(7)
    }

    val env = ExecutionEnvironment.getExecutionEnvironment
    //Import the implicit conversions
    import org.apache.flink.api.scala._
    val param = Map("boltUrl"->boltUrl,"userName"->userName,"passWord"->passWord,"timestamp"->timestamp.toString,"level"->level.toString)
    //Get anchors that were active within the past week and have a level greater than 4
    val uidSet = env.createInput(new Neo4jInputFormat(param))

    //Process one partition (batch) at a time
    //Keep only pairs with a follower overlap > 2, sorted by overlap in descending order
    //The resulting format is: anchor id, candidate anchor id
    val mapSet = uidSet.mapPartition(it=>{
      //Obtain a Neo4j connection
      val driver = GraphDatabase.driver(boltUrl, AuthTokens.basic(userName, passWord))
      //Open a session
      val session = driver.session()
      //Holds the computed results
      val resultArr = ArrayBuffer[String]()
      it.foreach(uid=>{
        //Compute this user's three-degree relationships (the anchor's two-degree relationships)
        //Note: once the data volume grows, this computation becomes very time-consuming
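        //a is the current anchor, b is a follower of a, and c is another anchor that b also follows;
        //sum counts how many of a's followers also follow c (the follower overlap), keeping the top 30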
        val result = session.run("match (a:User {uid:'" + uid + "'}) <-[:follow]- (b:User) -[:follow]-> (c:User) return a.uid as auid,c.uid as cuid,count(c.uid) as sum order by sum desc limit 30")
        //Also filter on the active timestamps of b and c, and on the level and flag values of c
        /*val result = session.run("match (a:User {uid:'" + uid + "'}) <-[:follow]- (b:User) -[:follow]-> (c:User) " +
          "where b.timestamp >= " + timestamp + " and c.timestamp >= " + timestamp + " and c.level >= " + level + " and c.flag = 1 " +
          "return a.uid as auid,c.uid as cuid,count(c.uid) as sum order by sum desc limit 30")*/
        while (result.hasNext) {
          val record = result.next()
          val sum = record.get("sum").asInt()
          if (sum > duplicateNum) {
            resultArr += record.get("auid").asString() + "\t" + record.get("cuid").asString()
          }
        }

      })
      //Close the session
      session.close()
      //Close the connection
      driver.close()
      resultArr.iterator
    })

    //Convert each line into a Tuple2
    val tup2Set = mapSet.map(line => {
      val splits = line.split("\t")
      (splits(0), splits(1))
    })

    //Group by anchor id to build that anchor's recommendation candidate list
    val reduceSet = tup2Set.groupBy(_._1).reduceGroup(it=>{
      val list = it.toList
      val tmpList = ListBuffer[String]()
      for(l <- list){
        tmpList += l._2
      }
      //Assemble the result into this form: 1001  1002,1003,1004
      (list.head._1,tmpList.toList.mkString(","))
    })

    //Note: writeAsCsv can only save tuple-typed data
    //writeAsText supports any type; for an object it calls the object's toString method when writing to the file
    reduceSet.writeAsCsv(outputPath,"\n","\t")

    //Execute the job
    env.execute(appName)
  }

}

In fact, we could also write the results directly to Redis here, but to keep the overall pipeline cleaner we first write the data temporarily to HDFS.
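
If you did want to write straight to Redis at this point, one option is to replace the writeAsCsv call with a mapPartition that uses the Jedis client. This is only a sketch: the Jedis library (which would also need the redis.clients:jedis dependency in pom.xml), the host/port, and the "rec:" key prefix are assumptions, not part of the original code.

import org.apache.flink.api.java.io.DiscardingOutputFormat
import redis.clients.jedis.Jedis

    //Hypothetical sketch: push each (anchor uid, recommendation list) pair into Redis instead of HDFS
    reduceSet.mapPartition(it => {
      val jedis = new Jedis("bigdata04", 6379) //host and port are assumptions
      it.foreach(t => jedis.set("rec:" + t._1, t._2))
      jedis.close()
      Iterator.single("done") //dummy element so the transformation has an output type
    }).output(new DiscardingOutputFormat[String]())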

Run the code locally, then verify the results on HDFS:

[root@bigdata04 jobs]# hdfs dfs -cat /data/recommend_data/20260201/*
1000    1005,1004
1005    1000
1004    1000

4. Packaging Configuration

Next, compile and package the program.
Add the compile and packaging configuration to pom.xml:

<dependencies>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.neo4j.driver</groupId>
        <artifactId>neo4j-java-driver</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.12</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <scope>provided</scope>
    </dependency>
</dependencies>
<build>
    <plugins>
        <!-- 编译插件 -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.0</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
                <encoding>UTF-8</encoding>
            </configuration>
        </plugin>
        <!-- scala编译插件 -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.1.6</version>
            <configuration>
                <scalaCompatVersion>2.12</scalaCompatVersion>
                <scalaVersion>2.12.11</scalaVersion>
                <encoding>UTF-8</encoding>
            </configuration>
            <executions>
                <execution>
                    <id>compile-scala</id>
                    <phase>compile</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>compile</goal>
                    </goals>
                </execution>
                <execution>
                    <id>test-compile-scala</id>
                    <phase>test-compile</phase>
                    <goals>
                        <goal>add-source</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <!-- 打jar包插件(会包含所有依赖) -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.6</version>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <!-- 可以设置jar包的入口类(可选) -->
                        <mainClass></mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

5. Build the Jar

Build the jar package:

D:\IdeaProjects\db_video_recommend_v2\get_recommend_list>mvn clean package -DskipTests
[INFO] Scanning for projects...
[INFO] Building jar: D:\IdeaProjects\db_video_recommend_v2\get_recommend_list\target\get_recommend_list-1.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-assembly-plugin:2.6:single (make-assembly) @ get_recommend_list ---
[INFO] Building jar: D:\IdeaProjects\db_video_recommend_v2\get_recommend_list\target\get_recommend_list-1.0-SNAPSHOT-jar-with-dependencies.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.543s
[INFO] Finished at: Mon Sep 07 12:31:23 CST 2020
[INFO] Final Memory: 45M/890M
[INFO] ------------------------------------------------------------------------


6. Develop the Script

Develop the task script
startGetRecommendList.sh

#!/bin/bash
#By default use the date of last Monday (7 days ago)
dt=`date -d "7 days ago" +"%Y%m%d"`
if [ "x$1" != "x" ]
then
    dt=`date -d "7 days ago $1" +"%Y%m%d"`
fi

masterUrl="yarn-cluster"
appName="GetRecommendListScala"`date +%s`
boltUrl="bolt://bigdata04:7687"
userName="neo4j"
passWord="admin"
#Get last Monday's timestamp (appending 000 converts seconds to milliseconds)
timestamp=`date --date="${dt}" +%s`000
#Follower-list overlap threshold
duplicateNum=2
#Anchor level
level=4

#Output path for the result data
outputPath="hdfs://bigdata01:9000/data/recommend_data/${dt}"

#Note: the flink command's directory must be added to the Linux PATH environment variable
flink run \
-m ${masterUrl} \
-ynm ${appName} \
-yqu default \
-yjm 1024 \
-ytm 1024 \
-ys 1 \
-p 2 \
-c com.imooc.flink.GetRecommendListScala \
/data/soft/video_recommend_v2/jobs/get_recommend_list-1.0-SNAPSHOT-jar-with-dependencies.jar ${appName} ${boltUrl} ${userName} ${passWord} ${timestamp} ${duplicateNum} ${level} ${outputPath}

#Verify the job's execution status
appStatus=`yarn application -appStates FINISHED -list | grep ${appName} | awk '{print $8}'`
if [ "${appStatus}" != "SUCCEEDED" ]
then
    echo "任务执行失败"
    # 发送短信或者邮件
else
    echo "任务执行成功"
fi

7. Upload the Jar and Script

Upload the jar package and the task script to the jobs directory:

[root@bigdata04 jobs]# ll
total 98296
-rw-r--r--. 1 root root  4575461 Sep  7  2020 get_recommend_list-1.0-SNAPSHOT-jar-with-dependencies.jar
-rw-r--r--. 1 root root     1189 Sep  7  2020 startGetRecommendList.sh


8. Submit the Job

Submit the job to the cluster:

[root@bigdata04 jobs]# sh -x startGetRecommendList.sh 20260201


The job executed successfully; no problems.

Verify the results:

[root@bigdata04 jobs]# hdfs dfs -cat /data/recommend_data/20260125/*
1000    1005,1004
1005    1000
1004    1000
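
The section doesn't show how the weekly run is scheduled. Assuming cron drives the Monday schedule, an entry along these lines would launch the script every week (the time of day and log path are assumptions):

# Hypothetical crontab entry: run the script at 03:00 every Monday (day-of-week 1)
0 3 * * 1 sh /data/soft/video_recommend_v2/jobs/startGetRecommendList.sh >> /data/soft/video_recommend_v2/jobs/startGetRecommendList.log 2>&1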

Reposted from blog.csdn.net/weixin_40612128/article/details/123543616