1. Summary of the data computation steps
Let's walk through the concrete data computation steps:
Step 1: initialize the historical follower data
Step 2: maintain the follower data in real time
Step 3: update anchor levels on a daily schedule
Step 4: update user active times on a daily schedule
Step 5: every Monday, compute each anchor's video rating over the last month
Step 6: every Monday, compute the three-degree relationship list of anchors active in the last week
Step 7: export the three-degree relationship list to Redis
Code download:
Link: https://pan.baidu.com/s/1kzuwD3XarH26_roq255Yyg?pwd=559p
Extraction code: 559p
2. Every Monday, compute the three-degree relationship list of anchors active in the last week
We implement this weekly computation with a Flink program.
1. Create the project
Create a child module: get_recommend_list
Create a scala directory in the project and add the Scala 2.12 SDK
Create the package: com.imooc.flink
Add the following dependencies to pom.xml:
<dependencies>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
</dependency>
<dependency>
<groupId>org.neo4j.driver</groupId>
<artifactId>neo4j-java-driver</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.12</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.12</artifactId>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
</dependency>
</dependencies>
Add the log4j.properties configuration file to the resources directory.
Note: we now need Flink to read data from Neo4j. The DataSet API does not support addSource, but it does provide createInput, which accepts a custom InputFormat, so we need to implement a Neo4jInputFormat.
2. Create Neo4jInputFormat
Create the class: Neo4jInputFormat
The code is as follows.
Note: this input format must run with a parallelism of 1; querying with multiple parallel instances could produce duplicate data.
package com.imooc.flink
import org.apache.flink.api.common.io.statistics.BaseStatistics
import org.apache.flink.api.common.io.{DefaultInputSplitAssigner, NonParallelInput, RichInputFormat}
import org.apache.flink.configuration.Configuration
import org.apache.flink.core.io.{GenericInputSplit, InputSplit, InputSplitAssigner}
import org.neo4j.driver.{AuthTokens, Driver, GraphDatabase, Result, Session}
/**
 * Reads the anchors that satisfy the filter conditions from Neo4j:
 * active within the last week, with an anchor level greater than 4
 *
 */
class Neo4jInputFormat extends RichInputFormat[String,InputSplit] with NonParallelInput{
//Note: mixing in NonParallelInput marks this input format as single-parallelism only
//Neo4j connection parameters
var param: Map[String,String] = Map()
var driver: Driver = _
var session: Session = _
var result: Result = _
/**
 * Auxiliary constructor
 * Receives the Neo4j connection parameters
 * @param param
 */
def this(param: Map[String,String]){
this()
this.param = param
}
/**
 * Configures this input format
 * @param parameters
 */
override def configure(parameters: Configuration): Unit = {}
/**
 * Returns basic statistics about the input data
 * @param cachedStatistics
 * @return
 */
override def getStatistics(cachedStatistics: BaseStatistics): BaseStatistics = {
cachedStatistics
}
/**
 * Splits the input data into input splits
 * @param minNumSplits
 * @return
 */
override def createInputSplits(minNumSplits: Int): Array[InputSplit] = {
Array(new GenericInputSplit(0,1))
}
/**
 * Returns the assigner for the created splits
 * @param inputSplits
 * @return
 */
override def getInputSplitAssigner(inputSplits: Array[InputSplit]): InputSplitAssigner = {
new DefaultInputSplitAssigner(inputSplits)
}
/**
 * Initialization method, executed only once:
 * opens the Neo4j connection and starts a session
 */
override def openInputFormat(): Unit = {
//Initialize the Neo4j connection
this.driver = GraphDatabase.driver(param("boltUrl"), AuthTokens.basic(param("userName"), param("passWord")))
this.session = driver.session()
}
/**
 * Closes the Neo4j connection
 */
override def closeInputFormat(): Unit = {
if (driver != null) {
driver.close()
}
}
/**
 * Also executed only once:
 * runs the query
 * @param split
 */
override def open(split: InputSplit): Unit = {
this.result = session.run("match (a:User) where a.timestamp >= "+param("timestamp")+" and a.level >= "+param("level")+" return a.uid")
}
/**
 * Returns true once all data has been read
 * @return
 */
override def reachedEnd(): Boolean = {
!result.hasNext
}
/**
 * Reads the result data, one record at a time
 * @param reuse
 * @return
 */
override def nextRecord(reuse: String): String = {
val record = result.next()
val uid = record.get(0).asString()
uid
}
/**
 * Closes the session
 */
override def close(): Unit = {
if (session != null) {
session.close()
}
}
}
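The query in open() builds its Cypher statement by string concatenation. As a hedged alternative sketch (assuming the same neo4j-java-driver is on the classpath), the driver can bind $timestamp and $level as query parameters instead, which avoids quoting and injection issues:

```scala
import org.neo4j.driver.Values

// Equivalent to the query in open(), but with $timestamp and $level bound
// by the driver rather than concatenated into the Cypher string.
this.result = session.run(
  "match (a:User) where a.timestamp >= $timestamp and a.level >= $level return a.uid",
  Values.parameters(
    "timestamp", Long.box(param("timestamp").toLong),
    "level", Int.box(param("level").toInt)
  )
)
```

The result iteration in reachedEnd() and nextRecord() stays unchanged; only the way the filter values reach the server differs.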
3. Create GetRecommendList
Create the object: GetRecommendListScala
The code is as follows:
package com.imooc.flink
import org.apache.flink.api.common.typeinfo.BasicTypeInfo
import org.apache.flink.api.scala.ExecutionEnvironment
import org.neo4j.driver.{AuthTokens, GraphDatabase}
import scala.collection.mutable.{ArrayBuffer, ListBuffer}
/**
 * Task 6:
 * Every Monday, compute the three-degree relationship list of anchors active in the last week
 * Conditions:
 * 1: the candidate anchor was active within the last week
 * 2: the candidate anchor's level is > 4
 * 3: the candidate anchor's video rating over the last month reaches 3B+ or 2A+ (flag=1)
 * 4: the follower-list overlap with the candidate anchor is > 2
 *
 */
object GetRecommendListScala {
def main(args: Array[String]): Unit = {
var appName = "GetRecommendListScala"
var boltUrl = "bolt://bigdata04:7687"
var userName = "neo4j"
var passWord = "admin"
var timestamp = 0L //threshold for "active within the last week"
var duplicateNum = 2 //follower-list overlap threshold
var level = 4 //anchor level threshold
var outputPath = "hdfs://bigdata01:9000/data/recommend_data/20260201"
if(args.length > 0){
appName = args(0)
boltUrl = args(1)
userName = args(2)
passWord = args(3)
timestamp = args(4).toLong
duplicateNum = args(5).toInt
level = args(6).toInt
outputPath = args(7)
}
val env = ExecutionEnvironment.getExecutionEnvironment
//add the implicit conversion imports
import org.apache.flink.api.scala._
val param = Map("boltUrl"->boltUrl,"userName"->userName,"passWord"->passWord,"timestamp"->timestamp.toString,"level"->level.toString)
//get the anchors active within the last week whose level is greater than 4
val uidSet = env.createInput(new Neo4jInputFormat(param))
//process one batch at a time
//keep only pairs whose follower-list overlap is > 2, sorted by overlap in descending order
//the resulting format is: anchor uid, candidate anchor uid
val mapSet = uidSet.mapPartition(it=>{
//get a Neo4j connection
val driver = GraphDatabase.driver(boltUrl, AuthTokens.basic(userName, passWord))
//open a session
val session = driver.session()
//collect the computed results
val resultArr = ArrayBuffer[String]()
it.foreach(uid=>{
//compute the user's three-degree relationships (the anchor's second-degree relationships)
//note: as the data volume grows, this query becomes very expensive
val result = session.run("match (a:User {uid:'" + uid + "'}) <-[:follow]- (b:User) -[:follow]-> (c:User) return a.uid as auid,c.uid as cuid,count(c.uid) as sum order by sum desc limit 30")
//filter b and c by active time, and filter c by level and flag
/*val result = session.run("match (a:User {uid:'" + uid + "'}) <-[:follow]- (b:User) -[:follow]-> (c:User) " +
"where b.timestamp >= " + timestamp + " and c.timestamp >= " + timestamp + " and c.level >= " + level + " and c.flag = 1 " +
"return a.uid as auid,c.uid as cuid,count(c.uid) as sum order by sum desc limit 30")*/
while (result.hasNext) {
val record = result.next()
val sum = record.get("sum").asInt()
if (sum > duplicateNum) {
resultArr += record.get("auid").asString() + "\t" + record.get("cuid").asString()
}
}
})
//close the session
session.close()
//close the connection
driver.close()
resultArr.iterator
})
//convert the lines into tuple2 form
val tup2Set = mapSet.map(line => {
val splits = line.split("\t")
(splits(0), splits(1))
})
//group by anchor uid to obtain each anchor's recommendation list
val reduceSet = tup2Set.groupBy(_._1).reduceGroup(it=>{
val list = it.toList
val tmpList = ListBuffer[String]()
for(l <- list){
tmpList += l._2
}
//assemble the result in the form: 1001    1002,1003,1004
(list.head._1,tmpList.toList.mkString(","))
})
//note: writeAsCsv can only persist tuple types
//writeAsText supports any type; for an object it writes the object's toString output
reduceSet.writeAsCsv(outputPath,"\n","\t")
//execute the job
env.execute(appName)
}
}
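To see what the groupBy/reduceGroup step produces, the same grouping can be sketched with plain Scala collections (the sample uids below are made up for illustration):

```scala
object GroupDemo {
  def main(args: Array[String]): Unit = {
    // (anchor uid, candidate uid) pairs, as emitted by the mapPartition step
    val pairs = List(("1000", "1005"), ("1000", "1004"), ("1005", "1000"), ("1004", "1000"))
    // group by anchor uid and join candidate uids with commas,
    // mirroring the reduceGroup logic on the Flink DataSet
    val grouped = pairs.groupBy(_._1).map { case (uid, cands) =>
      uid + "\t" + cands.map(_._2).mkString(",")
    }
    grouped.foreach(println) // e.g. one line reads: 1000	1005,1004
  }
}
```

Each output line matches one row of the CSV file written by writeAsCsv.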
We could also write the results directly to Redis here, but to keep the overall pipeline tidy we first stage the data on HDFS.
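If we did want to push the list straight into Redis, a minimal sketch could look like the following (assuming the jedis client is added as a dependency; the host, port, and "rec:" key prefix are illustrative assumptions, not part of the project):

```scala
import redis.clients.jedis.Jedis

// Minimal sketch: store one anchor's recommendation list as a Redis string.
// Key: "rec:" + anchor uid; value: comma-joined candidate uids.
val jedis = new Jedis("bigdata04", 6379)
try {
  jedis.set("rec:1000", "1005,1004")
} finally {
  jedis.close()
}
```

In the Flink job this would sit in a mapPartition sink, opening one connection per partition just as the Neo4j session is handled above.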
Run the code locally, then verify the result on HDFS:
[root@bigdata04 jobs]# hdfs dfs -cat /data/recommend_data/20260201/*
1000 1005,1004
1005 1000
1004 1000
4. Build configuration
Next we compile and package the program.
Add the build and packaging configuration to pom.xml:
<dependencies>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-log4j12</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.neo4j.driver</groupId>
<artifactId>neo4j-java-driver</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-scala_2.12</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.12</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<scope>provided</scope>
</dependency>
</dependencies>
<build>
<plugins>
<!-- compiler plugin -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<!-- Scala compiler plugin -->
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.1.6</version>
<configuration>
<scalaCompatVersion>2.12</scalaCompatVersion>
<scalaVersion>2.12.11</scalaVersion>
<encoding>UTF-8</encoding>
</configuration>
<executions>
<execution>
<id>compile-scala</id>
<phase>compile</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>test-compile-scala</id>
<phase>test-compile</phase>
<goals>
<goal>add-source</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<!-- jar packaging plugin (bundles all dependencies) -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>2.6</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<!-- optionally set the jar's main class -->
<mainClass></mainClass>
</manifest>
</archive>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
5. Package
Build the jar:
D:\IdeaProjects\db_video_recommend_v2\get_recommend_list>mvn clean package -DskipTests
[INFO] Scanning for projects...
[INFO] Building jar: D:\IdeaProjects\db_video_recommend_v2\get_recommend_list\target\get_recommend_list-1.0-SNAPSHOT.jar
[INFO]
[INFO] --- maven-assembly-plugin:2.6:single (make-assembly) @ get_recommend_list ---
[INFO] Building jar: D:\IdeaProjects\db_video_recommend_v2\get_recommend_list\target\get_recommend_list-1.0-SNAPSHOT-jar-with-dependencies.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 9.543s
[INFO] Finished at: Mon Sep 07 12:31:23 CST 2020
[INFO] Final Memory: 45M/890M
[INFO] ------------------------------------------------------------------------
6. Develop the job script
Develop the task script:
startGetRecommendList.sh
#!/bin/bash
#default to last Monday's date (7 days ago, assuming the script runs on a Monday)
dt=`date -d "7 days ago" +"%Y%m%d"`
if [ "x$1" != "x" ]
then
dt=`date -d "7 days ago $1" +"%Y%m%d"`
fi
masterUrl="yarn-cluster"
appName="GetRecommendListScala"`date +%s`
boltUrl="bolt://bigdata04:7687"
userName="neo4j"
passWord="admin"
#timestamp of last Monday (in milliseconds)
timestamp=`date --date="${dt}" +%s`000
#follower-list overlap threshold
duplicateNum=2
#anchor level threshold
level=4
#output path for the result data
outputPath="hdfs://bigdata01:9000/data/recommend_data/${dt}"
#note: the flink launcher script must be on the PATH
flink run \
-m ${masterUrl} \
-ynm ${appName} \
-yqu default \
-yjm 1024 \
-ytm 1024 \
-ys 1 \
-p 2 \
-c com.imooc.flink.GetRecommendListScala \
/data/soft/video_recommend_v2/jobs/get_recommend_list-1.0-SNAPSHOT-jar-with-dependencies.jar ${appName} ${boltUrl} ${userName} ${passWord} ${timestamp} ${duplicateNum} ${level} ${outputPath}
#check the job's final status
appStatus=`yarn application -appStates FINISHED -list | grep ${appName} | awk '{print $8}'`
if [ "${appStatus}" != "SUCCEEDED" ]
then
echo "the job failed"
# send an SMS or email alert
else
echo "the job succeeded"
fi
7. Upload the jar and the script
Upload the jar and the task script to the jobs directory:
[root@bigdata04 jobs]# ll
total 98296
-rw-r--r--. 1 root root 4575461 Sep 7 2020 get_recommend_list-1.0-SNAPSHOT-jar-with-dependencies.jar
-rw-r--r--. 1 root root 1189 Sep 7 2020 startGetRecommendList.sh
8. Submit the job
Submit the job to the cluster:
[root@bigdata04 jobs]# sh -x startGetRecommendList.sh 20260201
The job executes successfully.
Verify the result (the script converts 20260201 into dt=20260125, the date 7 days earlier, which is why the output lands in that directory):
[root@bigdata04 jobs]# hdfs dfs -cat /data/recommend_data/20260125/*
1000 1005,1004
1005 1000
1004 1000