Big Data in Practice: An E-Commerce Recommendation System

Chapter 1 Project System Framework Design


Chapter 2 Tool Environment Construction

  • Install the latest version of MongoDB (and resolve the missing-dependency problem when installing MongoDB on Ubuntu)
  • On a CentOS 7 system, follow the tool-environment setup process to install MongoDB, Redis, Spark, Zookeeper, Flume-ng, and Kafka

Chapter 3 Project creation and initialization of business data

3.1 Create the Maven project in IDEA (omitted)

3.2 Data loading preparation (instructions)

3.3 Initialize data to MongoDB [DataLoader data loading module]

Data loader main implementation + data writing to MongoDB

  • Define case classes for the raw data, read the files with SparkContext's textFile method, convert them into DataFrames, and then use the write method provided by Spark SQL to insert the data into MongoDB in a distributed way.
  • Create a new package under DataLoader/src/main/scala, name it com.atguigu.recommender, and create a new scala class file named DataLoader.

Program main code:

package com.atguigu.recommender

import com.mongodb.casbah.commons.MongoDBObject
import com.mongodb.casbah.{MongoClient, MongoClientURI}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

/**
  *
  * Product dataset
  * 3982                            product ID
  * Fuhlen 富勒 M8眩光舞者时尚节能    product name
  * 1057,439,736                    product category IDs, not needed
  * B009EJN4T2                      Amazon ID, not needed
  * https://images-cn-4.ssl-image   product image URL
  * 外设产品|鼠标|电脑/办公           product categories
  * 富勒|鼠标|电子产品|好用|外观漂亮   product UGC tags
  */
case class Product( productId: Int, name: String, imageUrl: String, categories: String, tags: String )

/**
  * Rating dataset
  * 4867        user ID
  * 457976      product ID
  * 5.0         score
  * 1395676800  timestamp
  */
case class Rating( userId: Int, productId: Int, score: Double, timestamp: Int )

/**
  * MongoDB connection configuration
  * @param uri    MongoDB connection URI
  * @param db     the database to operate on
  */
case class MongoConfig( uri: String, db: String )

object DataLoader {
  // paths of the data files
  val PRODUCT_DATA_PATH = "D:\\Projects\\BigData\\ECommerceRecommendSystem\\recommender\\DataLoader\\src\\main\\resources\\products.csv"
  val RATING_DATA_PATH = "D:\\Projects\\BigData\\ECommerceRecommendSystem\\recommender\\DataLoader\\src\\main\\resources\\ratings.csv"
  // collection names used in MongoDB
  val MONGODB_PRODUCT_COLLECTION = "Product"
  val MONGODB_RATING_COLLECTION = "Rating"

  def main(args: Array[String]): Unit = {
    val config = Map(
      "spark.cores" -> "local[*]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender"
    )
    // create a spark config
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("DataLoader")
    // create a spark session
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    import spark.implicits._

    // load the data
    val productRDD = spark.sparkContext.textFile(PRODUCT_DATA_PATH)
    val productDF = productRDD.map( item => {
      // product fields are separated by ^, split them
      val attr = item.split("\\^")
      // convert to Product
      Product( attr(0).toInt, attr(1).trim, attr(4).trim, attr(5).trim, attr(6).trim )
    } ).toDF()

    val ratingRDD = spark.sparkContext.textFile(RATING_DATA_PATH)
    val ratingDF = ratingRDD.map( item => {
      val attr = item.split(",")
      Rating( attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt )
    } ).toDF()

    implicit val mongoConfig = MongoConfig( config("mongo.uri"), config("mongo.db") )
    storeDataInMongoDB( productDF, ratingDF )

    spark.stop()
  }

  /**
    * write the data to MongoDB
    */
  def storeDataInMongoDB( productDF: DataFrame, ratingDF: DataFrame )(implicit mongoConfig: MongoConfig): Unit ={
    // create a new MongoDB connection (client)
    val mongoClient = MongoClient( MongoClientURI(mongoConfig.uri) )
    // the MongoDB collections to operate on, think of them as db.Product / db.Rating
    val productCollection = mongoClient( mongoConfig.db )( MONGODB_PRODUCT_COLLECTION )
    val ratingCollection = mongoClient( mongoConfig.db )( MONGODB_RATING_COLLECTION )

    // if the collections already exist, drop them
    productCollection.dropCollection()
    ratingCollection.dropCollection()

    // write the current data into the corresponding collections
    productDF.write
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_PRODUCT_COLLECTION)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    ratingDF.write
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    // create indexes on the collections
    productCollection.createIndex( MongoDBObject( "productId" -> 1 ) )
    ratingCollection.createIndex( MongoDBObject( "productId" -> 1 ) )
    ratingCollection.createIndex( MongoDBObject( "userId" -> 1 ) )

    mongoClient.close()
  }
}
  • Firewall problem: The firewall needs to be turned off when connecting to mongodb


Chapter 4 Offline recommendation service construction

4.1 Offline recommendation service

  • The offline recommendation service integrates all of the users' historical data and uses the configured offline statistics and offline recommendation algorithms to compute and save results periodically. The computed results stay fixed for a period of time; how often they change depends on how often the algorithm jobs are scheduled.
  • The offline recommendation service mainly calculates some indicators that can be counted and calculated in advance to provide data support for real-time calculation and front-end business response.
  • Offline recommendation services are mainly divided into statistical recommendations, latent semantic model-based collaborative filtering recommendations, and content-based and Item-CF-based similar recommendations.
  • This chapter mainly introduces the first two parts. Content-based and Item-CF recommendations are similar in overall structure and implementation. We will introduce them in detail in Chapter 7.

4.2 Offline statistical service [statistical recommendation module]

  • Create a new sub-project StatisticsRecommender under recommender; its pom.xml only needs the relevant Spark, Scala and MongoDB dependencies:
<dependencies>
    <!-- Spark dependencies -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
    </dependency>
    <!-- Scala -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
    </dependency>
    <!-- MongoDB drivers -->
    <!-- Casbah: connect to MongoDB directly from code -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>casbah-core_2.11</artifactId>
        <version>${casbah.version}</version>
    </dependency>
    <!-- Connector between Spark and MongoDB -->
    <dependency>
        <groupId>org.mongodb.spark</groupId>
        <artifactId>mongo-spark-connector_2.11</artifactId>
        <version>${mongodb-spark.version}</version>
    </dependency>
</dependencies>

Add log4j.properties under the resources folder, and then create a new Scala singleton object com.atguigu.statistics.StatisticsRecommender under src/main/scala.
As before, we first define the case classes, then in the main() method define the configuration, create the SparkSession, load the data, and finally stop Spark.
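
A minimal log4j.properties is enough here; this is just an illustrative sketch, and any standard log4j 1.x configuration that logs to the console will do:

log4j.rootLogger=info, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %5p --- [%t] %c : %m%n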

  • Historical popular product statistics: based on all historical rating data, find the products with the most ratings overall.
    • Read the rating data set through Spark SQL and count the number of ratings per product.
    • Sort from largest to smallest and write the final results into MongoDB's RateMoreProducts collection.
  • Recent popular product statistics: based on the ratings, count per month the products with the most ratings in the most recent months.
    • Read the rating data set through Spark SQL, convert the rating timestamp to a month with a UDF, and then count the number of product ratings per month.
    • After the statistics are completed, the data is written to MongoDB's RateMoreRecentlyProducts collection.
  • Product average score statistics: based on all users' historical ratings of each product, periodically calculate each product's average score.
    • Read the Rating data set saved in MongoDB through Spark SQL and compute the average score per product with a SQL statement.
    • After the statistics are completed, the resulting DataFrame is written out to MongoDB's AverageProducts collection.

Main code (src/main/scala/com.atguigu.statistics/StatisticsRecommender.scala):

package com.atguigu.statistics

import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

case class Rating( userId: Int, productId: Int, score: Double, timestamp: Int )
case class MongoConfig( uri: String, db: String )

object StatisticsRecommender {
  // collection names used in MongoDB
  val MONGODB_RATING_COLLECTION = "Rating"
  val RATE_MORE_PRODUCTS = "RateMoreProducts"
  val RATE_MORE_RECENTLY_PRODUCTS = "RateMoreRecentlyProducts"
  val AVERAGE_PRODUCTS = "AverageProducts"

  def main(args: Array[String]): Unit = {
    val config = Map(
      "spark.cores" -> "local[1]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender"
    )
    // create a spark config
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("StatisticsRecommender")
    // create a spark session
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    import spark.implicits._
    implicit val mongoConfig = MongoConfig( config("mongo.uri"), config("mongo.db") )

    // load the data
    val ratingDF = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .format("com.mongodb.spark.sql")
      .load()
      .as[Rating]
      .toDF()

    // create a temporary view named ratings
    ratingDF.createOrReplaceTempView("ratings")

    // TODO: use spark sql to produce the different statistics-based recommendations
    // todo: (1) historically popular products, counted by number of ratings: productId, count
    val rateMoreProductsDF = spark.sql("select productId, count(productId) as count from ratings group by productId order by count desc")
    storeDFInMongoDB( rateMoreProductsDF, RATE_MORE_PRODUCTS )

    // todo: (2) recently popular products: convert the timestamp to yyyyMM and count ratings, giving productId, count, yearmonth
    // create a date formatter
    val simpleDateFormat = new SimpleDateFormat("yyyyMM")
    // register a UDF that converts a timestamp into the year-month format yyyyMM
    spark.udf.register("changeDate", (x: Int)=>simpleDateFormat.format(new Date(x * 1000L)).toInt)
    // transform the raw rating data into the desired structure: productId, score, yearmonth
    val ratingOfYearMonthDF = spark.sql("select productId, score, changeDate(timestamp) as yearmonth from ratings")
    ratingOfYearMonthDF.createOrReplaceTempView("ratingOfMonth")
    val rateMoreRecentlyProductsDF = spark.sql("select productId, count(productId) as count, yearmonth from ratingOfMonth group by yearmonth, productId order by yearmonth desc, count desc")
    // save the DataFrame to mongodb
    storeDFInMongoDB( rateMoreRecentlyProductsDF, RATE_MORE_RECENTLY_PRODUCTS )

    // todo: (3) high-quality products: the average score of each product, productId, avg
    val averageProductsDF = spark.sql("select productId, avg(score) as avg from ratings group by productId order by avg desc")
    storeDFInMongoDB( averageProductsDF, AVERAGE_PRODUCTS )

    spark.stop()
  }

  // TODO: save a DataFrame to MongoDB
  def storeDFInMongoDB(df: DataFrame, collection_name: String)(implicit mongoConfig: MongoConfig): Unit ={
    df.write
      .option("uri", mongoConfig.uri)
      .option("collection", collection_name)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()
  }
}

4.3 Collaborative filtering recommendation based on latent semantic model [LFM offline recommendation module]

  • The project uses ALS as a collaborative filtering algorithm, and calculates the offline user product recommendation list and product similarity matrix based on the user rating table in MongoDB.

4.3.1 User product recommendation list

  • The model trained by ALS is used to compute a product recommendation list for every current user. The main idea is as follows:
    • The Cartesian product of userId and productId produces a tuple of (userId, productId)
    • Predict the rating corresponding to (userId, productId) through the model.
    • Sort prediction results by prediction score.
    • Return the K products with the highest scores as the current user's recommendation list.
  • The final result is saved to the UserRecs table in MongoDB; each document holds a userId together with a recs list of (productId, score) recommendations.
  • Create a new recommender sub-project OfflineRecommender and introduce the dependencies of spark, scala, mongo and jblas:
<dependencies>

    <dependency>
        <groupId>org.scalanlp</groupId>
        <artifactId>jblas</artifactId>
        <version>${jblas.version}</version>
    </dependency>

    <!-- Spark dependencies -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_2.11</artifactId>
    </dependency>
    <!-- Scala -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
    </dependency>

    <!-- MongoDB drivers -->
    <!-- Casbah: connect to MongoDB directly from code -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>casbah-core_2.11</artifactId>
        <version>${casbah.version}</version>
    </dependency>
    <!-- Connector between Spark and MongoDB -->
    <dependency>
        <groupId>org.mongodb.spark</groupId>
        <artifactId>mongo-spark-connector_2.11</artifactId>
        <version>${mongodb-spark.version}</version>
    </dependency>
</dependencies>
  • After the same steps as before (defining case classes, declaring the configuration, and creating the SparkSession), you can load the data and start training the model.

The core code is as follows: src/main/scala/com.atguigu.offline/OfflineRecommender.scala

case class ProductRating(userId: Int, productId: Int, score: Double, timestamp: Int)

case class MongoConfig(uri:String, db:String)

// standard recommendation object: productId, score
case class Recommendation(productId: Int, score:Double)

// a user's recommendation list
case class UserRecs(userId: Int, recs: Seq[Recommendation])

// product similarity (product recommendations)
case class ProductRecs(productId: Int, recs: Seq[Recommendation])

object OfflineRecommender {

  // constants
  val MONGODB_RATING_COLLECTION = "Rating"

  // names of the recommendation collections
  val USER_RECS = "UserRecs"
  val PRODUCT_RECS = "ProductRecs"

  val USER_MAX_RECOMMENDATION = 20

  def main(args: Array[String]): Unit = {
    // configuration
    val config = Map(
      "spark.cores" -> "local[*]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender"
    )

    // create spark session
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("OfflineRecommender")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    implicit val mongoConfig = MongoConfig(config("mongo.uri"),config("mongo.db"))

    import spark.implicits._
    // read the business data from MongoDB
    val ratingRDD = spark
      .read
      .option("uri",mongoConfig.uri)
      .option("collection",MONGODB_RATING_COLLECTION)
      .format("com.mongodb.spark.sql")
      .load()
      .as[ProductRating]
      .rdd
      .map(rating=> (rating.userId, rating.productId, rating.score)).cache()
    // the set of users, RDD[Int]
    val userRDD = ratingRDD.map(_._1).distinct()
    val productRDD = ratingRDD.map(_._2).distinct()

    // build the training data set
    val trainData = ratingRDD.map(x => Rating(x._1,x._2,x._3))
    // rank is the number of latent factors, iterations the number of iterations, lambda the ALS regularization parameter
    val (rank,iterations,lambda) = (50, 5, 0.01)
    // train the latent factor model with ALS
    val model = ALS.train(trainData,rank,iterations,lambda)

    // compute the user recommendation matrix
    val userProducts = userRDD.cartesian(productRDD)
    // the model is trained; feeding the ids in gives the predicted ratings RDD[Rating] (userId,productId,rating)
    val preRatings = model.predict(userProducts)

    val userRecs = preRatings
      .filter(_.rating > 0)
      .map(rating => (rating.user,(rating.product, rating.rating)))
      .groupByKey()
      .map{
        case (userId,recs) => UserRecs(userId,recs.toList.sortWith(_._2 > _._2).take(USER_MAX_RECOMMENDATION).map(x => Recommendation(x._1,x._2)))
      }.toDF()

    userRecs.write
      .option("uri",mongoConfig.uri)
      .option("collection",USER_RECS)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    // TODO: compute the product similarity matrix

    // stop spark
    spark.stop()
  }
}

4.3.2 Product similarity matrix

  • The product similarity matrix is calculated from the ALS model; it is used to look up products similar to a given product and serves the real-time recommendation system.

  • The offline ALS computation ultimately produces a feature matrix for users and one for products: a U (m x k) user feature matrix, in which each user is described by k features, and a V (n x k) product feature matrix, in which each product is likewise described by k features.

  • V (n x k) is the product feature matrix, and each of its rows is a k-dimensional vector. Although we do not know what each individual dimension means, the k-dimensional vector as a whole represents the features of the corresponding product.

  • So each product is represented by the vector <t1, t2, t3, ...> in its row of V (n x k). For any two products p and q, with feature vectors Vp = <tp1, tp2, ..., tpk> and Vq = <tq1, tq2, ..., tqk>, their similarity sim(p, q) can be expressed as the cosine of the angle between the two vectors:
    sim(p, q) = (Vp · Vq) / (|Vp| × |Vq|)

  • The similarity between any two products in the data set can thus be calculated with this formula, and product similarities are essentially fixed over a period of time. The final results are saved to MongoDB's ProductRecs table, where each document holds a productId and its recs list of (productId, score) pairs.

// compute the product similarity matrix
// get the product feature vectors, data format RDD[(scala.Int, scala.Array[scala.Double])]
val productFeatures = model.productFeatures.map{
  case (productId, features) => (productId, new DoubleMatrix(features))
}

// compute the cartesian product of the features with themselves, then filter and combine
val productRecs = productFeatures.cartesian(productFeatures)
  .filter{ case (a, b) => a._1 != b._1 }
  .map{ case (a, b) =>
    val simScore = this.consinSim(a._2, b._2) // cosine similarity
    (a._1, (b._1, simScore))
  }.filter(_._2._2 > 0.6)
  .groupByKey()
  .map{ case (productId, items) =>
    ProductRecs(productId, items.toList.map(x => Recommendation(x._1, x._2)))
  }.toDF()

productRecs
  .write
  .option("uri", mongoConfig.uri)
  .option("collection", PRODUCT_RECS)
  .mode("overwrite")
  .format("com.mongodb.spark.sql")
  .save()

// cosine similarity between two products
def consinSim(product1: DoubleMatrix, product2: DoubleMatrix): Double ={
  product1.dot(product2) / ( product1.norm2() * product2.norm2() )
}

4.3.3 Model evaluation and parameter selection

  • In the process of training the above model, we directly gave the three parameters of rank, iterations and lambda of the latent semantic model.
  • For our model, this is not necessarily the optimal parameter selection, so we need to evaluate the model.
  • A common approach is to calculate the root mean square error (RMSE), which measures the error between the predicted scores and the actual scores:
    RMSE = sqrt( (1/N) * Σ (r_ui - r̂_ui)² )
    where r_ui is user u's actual rating of product i, r̂_ui is the predicted rating, and N is the number of test ratings.
  • By adjusting the parameter values several times and comparing the RMSE, we can pick the parameter set with the smallest RMSE as the best choice for our model.

Core code: Create a new singleton object ALSTrainer under scala/com.atguigu.offline/

def main(args: Array[String]): Unit = {
  val config = Map(
    "spark.cores" -> "local[*]",
    "mongo.uri" -> "mongodb://localhost:27017/recommender",
    "mongo.db" -> "recommender"
  )
  // create a SparkConf
  val sparkConf = new SparkConf().setAppName("ALSTrainer").setMaster(config("spark.cores"))
  // create a SparkSession
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  val mongoConfig = MongoConfig(config("mongo.uri"),config("mongo.db"))

  import spark.implicits._

  // load the rating data
  val ratingRDD = spark
    .read
    .option("uri",mongoConfig.uri)
    .option("collection",OfflineRecommender.MONGODB_RATING_COLLECTION)
    .format("com.mongodb.spark.sql")
    .load()
    .as[ProductRating]
    .rdd
    .map(rating => Rating(rating.userId,rating.productId,rating.score)).cache()

  // randomly split the RDD into two, to serve as training set and test set
  val splits = ratingRDD.randomSplit(Array(0.8, 0.2))

  val trainingRDD = splits(0)
  val testingRDD = splits(1)

  // print the optimal parameters
  adjustALSParams(trainingRDD, testingRDD)

  // stop Spark
  spark.close()
}

  • The adjustALSParams method is the core of model evaluation. It inputs a set of training data and test data, and outputs the set of parameters that calculates the minimum RMSE.
  • The code is implemented as follows:
// print the final optimal parameters
def adjustALSParams(trainData:RDD[Rating], testData:RDD[Rating]): Unit ={
  // the number of iterations is fixed at 5; rank and lambda are tuned over a few candidate values
  val result = for(rank <- Array(100,200,250); lambda <- Array(1, 0.1, 0.01, 0.001))
    yield {
      val model = ALS.train(trainData,rank,5,lambda)
      val rmse = getRMSE(model, testData)
      (rank,lambda,rmse)
    }
  // sort by rmse and print the best combination
  println(result.sortBy(_._3).head)
}
  • The function getRMSE code for calculating RMSE is implemented as follows:
def getRMSE(model:MatrixFactorizationModel, data:RDD[Rating]):Double={
  val userProducts = data.map(item => (item.user,item.product))
  val predictRating = model.predict(userProducts)
  val real = data.map(item => ((item.user,item.product),item.rating))
  val predict = predictRating.map(item => ((item.user,item.product),item.rating))
  // compute the RMSE
  sqrt(
    real.join(predict).map{
      case ((userId,productId),(real,pre))=>
      // difference between the actual value and the predicted value
      val err = real - pre
      err * err
    }.mean()
  )
}
  • Run the code to get the optimal model parameters for the current data

Code body:

package com.atguigu.offline

import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.sql.SparkSession
import org.jblas.DoubleMatrix

case class ProductRating( userId: Int, productId: Int, score: Double, timestamp: Int )
case class MongoConfig( uri: String, db: String )

// standard recommendation object
case class Recommendation( productId: Int, score: Double )
// a user's recommendation list
case class UserRecs( userId: Int, recs: Seq[Recommendation] )
// a product's similarity list
case class ProductRecs( productId: Int, recs: Seq[Recommendation] )

object OfflineRecommender {
  // collection names used in MongoDB
  val MONGODB_RATING_COLLECTION = "Rating"

  val USER_RECS = "UserRecs"
  val PRODUCT_RECS = "ProductRecs"
  val USER_MAX_RECOMMENDATION = 20

  def main(args: Array[String]): Unit = {
    val config = Map(
      "spark.cores" -> "local[*]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender"
    )
    // create a spark config
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("OfflineRecommender")
    // create a spark session
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    import spark.implicits._
    implicit val mongoConfig = MongoConfig( config("mongo.uri"), config("mongo.db") )

    // load the data
    val ratingRDD = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .format("com.mongodb.spark.sql")
      .load()
      .as[ProductRating]
      .rdd
      .map(
        rating => (rating.userId, rating.productId, rating.score)
      ).cache()

    // extract the sets of all users and all products
    val userRDD = ratingRDD.map(_._1).distinct()
    val productRDD = ratingRDD.map(_._2).distinct()

    // core computation
    // 1. train the latent factor model
    val trainData = ratingRDD.map(x=>Rating(x._1,x._2,x._3))
    // model training parameters: rank is the number of latent features, iterations the number of iterations, lambda the regularization coefficient
    val ( rank, iterations, lambda ) = ( 5, 10, 0.01 )
    val model = ALS.train( trainData, rank, iterations, lambda )

    // 2. get the predicted rating matrix and derive each user's recommendation list
    // the cartesian product of userRDD and productRDD gives the empty userProducts rating matrix
    val userProducts = userRDD.cartesian(productRDD)
    val preRating = model.predict(userProducts)

    // extract each user's recommendation list from the predicted rating matrix
    val userRecs = preRating.filter(_.rating>0)
      .map(
        rating => ( rating.user, ( rating.product, rating.rating ) )
      )
      .groupByKey()
      .map{
        case (userId, recs) =>
          UserRecs( userId, recs.toList.sortWith(_._2>_._2).take(USER_MAX_RECOMMENDATION).map(x=>Recommendation(x._1,x._2)) )
      }
      .toDF()
    userRecs.write
      .option("uri", mongoConfig.uri)
      .option("collection", USER_RECS)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    // 3. use the product feature vectors to compute each product's similarity list
    val productFeatures = model.productFeatures.map{
      case (productId, features) => ( productId, new DoubleMatrix(features) )
    }
    // pair products with each other and compute the cosine similarity
    val productRecs = productFeatures.cartesian(productFeatures)
      .filter{
        case (a, b) => a._1 != b._1
      }
      // compute the cosine similarity
      .map{
        case (a, b) =>
          val simScore = consinSim( a._2, b._2 )
          ( a._1, ( b._1, simScore ) )
      }
      .filter(_._2._2 > 0.4)
      .groupByKey()
      .map{
        case (productId, recs) =>
          ProductRecs( productId, recs.toList.sortWith(_._2>_._2).map(x=>Recommendation(x._1,x._2)) )
      }
      .toDF()
    productRecs.write
      .option("uri", mongoConfig.uri)
      .option("collection", PRODUCT_RECS)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    spark.stop()
  }
  def consinSim(product1: DoubleMatrix, product2: DoubleMatrix): Double ={
    product1.dot(product2)/ ( product1.norm2() * product2.norm2() )
  }
}

Chapter 5 Real-time recommendation service construction [Real-time recommendation module]

  • The biggest difference between real-time and offline computation in a recommendation system is that real-time recommendation results should reflect the user's preferences over the most recent period, while offline recommendation results reflect the user's overall preferences based on all rating records since the very first rating.
  • User preferences for items always change over time.
    • For example, if a user u gives a very high rating to product p at a certain moment, then u is very likely to like other products similar to product p in the near future;
    • And if user u gives a very low rating to product q at a certain moment, then it is very likely that u will not like other products similar to product q in the near future.
    • Therefore, for real-time recommendations, when a user rates a product, the user expects the recommendation results to be updated based on the recent ratings, so that they match the user's recent preferences and tastes.
  • If real-time recommendation simply reused the offline ALS algorithm, its long running time would make it impossible to produce new recommendation results in real time; moreover, since the algorithm works on the full rating table and the user's new rating only changes one entry in that table, the recommendations after retraining would be essentially the same as before the rating, giving the user the impression that nothing has changed and badly hurting the user experience.
  • In addition, since real-time recommendation must meet real-time or quasi-real-time response requirements, the computational cost of the algorithm cannot be too large, to avoid degrading the user experience with heavy computation. Given this, the recommendation accuracy tends not to be very high. A real-time recommendation system cares more about how dynamically the recommendation results change: as long as the recommendations are updated for sound reasons, the accuracy requirement can be relaxed appropriately.
  • Therefore, there are two main requirements for real-time recommendation algorithms:
    • The system can obviously update the recommendation results after the user’s current rating, or after the last few ratings;
    • The amount of calculation is not large, and it meets the real-time or quasi-real-time requirements in terms of response time;

5.2 Real-time recommendation model and code framework

5.2.1 Real-time recommendation model algorithm design

5.2.2 Real-time recommendation module framework

  • Create a new sub-project StreamingRecommender under recommender and introduce dependencies on spark, scala, mongo, redis and kafka:
<dependencies>
    <!-- Spark dependencies -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.11</artifactId>
    </dependency>
    <!-- Scala -->
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
    </dependency>

    <!-- MongoDB drivers -->
    <!-- Casbah: connect to MongoDB directly from code -->
    <dependency>
        <groupId>org.mongodb</groupId>
        <artifactId>casbah-core_2.11</artifactId>
        <version>${casbah.version}</version>
    </dependency>
    <!-- Connector between Spark and MongoDB -->
    <dependency>
        <groupId>org.mongodb.spark</groupId>
        <artifactId>mongo-spark-connector_2.11</artifactId>
        <version>${mongodb-spark.version}</version>
    </dependency>

    <!-- redis -->
    <dependency>
        <groupId>redis.clients</groupId>
        <artifactId>jedis</artifactId>
        <version>2.9.0</version>
    </dependency>

    <!-- kafka -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.10.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>

</dependencies>
  • In the code, we first define the case classes and a connection helper object (used to establish the Redis and MongoDB connections), and define some constants in StreamingRecommender.

Core code: src/main/scala/com.atguigu.streaming/StreamingRecommender.scala

// connection helper object
object ConnHelper extends Serializable{
  lazy val jedis = new Jedis("localhost")
  lazy val mongoClient = MongoClient(MongoClientURI("mongodb://localhost:27017/recommender"))
}

case class MongConfig(uri:String,db:String)

// standard recommendation
case class Recommendation(productId:Int, score:Double)

// a user's recommendations
case class UserRecs(userId:Int, recs:Seq[Recommendation])

// a product's similarity list
case class ProductRecs(productId:Int, recs:Seq[Recommendation])

object StreamingRecommender {

  val MAX_USER_RATINGS_NUM = 20
  val MAX_SIM_PRODUCTS_NUM = 20
  val MONGODB_STREAM_RECS_COLLECTION = "StreamRecs"
  val MONGODB_RATING_COLLECTION = "Rating"
  val MONGODB_PRODUCT_RECS_COLLECTION = "ProductRecs"

  // entry point
  def main(args: Array[String]): Unit = {
  }
}
  • The real-time recommendation body code is as follows:
def main(args: Array[String]): Unit = {

  val config = Map(
    "spark.cores" -> "local[*]",
    "mongo.uri" -> "mongodb://localhost:27017/recommender",
    "mongo.db" -> "recommender",
    "kafka.topic" -> "recommender"
  )
  // create a SparkConf configuration
  val sparkConf = new SparkConf().setAppName("StreamingRecommender").setMaster(config("spark.cores"))
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc,Seconds(2))

  implicit val mongConfig = MongConfig(config("mongo.uri"),config("mongo.db"))
  import spark.implicits._

  // broadcast the product similarity matrix
  // converted into a Map[Int, Map[Int,Double]]
  val simProductsMatrix = spark
    .read
    .option("uri",config("mongo.uri"))
    .option("collection",MONGODB_PRODUCT_RECS_COLLECTION)
    .format("com.mongodb.spark.sql")
    .load()
    .as[ProductRecs]
    .rdd
    .map{ recs =>
      (recs.productId,recs.recs.map(x=> (x.productId,x.score)).toMap)
    }.collectAsMap()

  val simProductsMatrixBroadCast = sc.broadcast(simProductsMatrix)

  // create the connection to Kafka
  val kafkaPara = Map(
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "recommender",
    "auto.offset.reset" -> "latest"
  )

  val kafkaStream = KafkaUtils.createDirectStream[String,String](ssc,LocationStrategies.PreferConsistent,ConsumerStrategies.Subscribe[String,String](Array(config("kafka.topic")),kafkaPara))

  // UID|MID|SCORE|TIMESTAMP
  // produce the rating stream
  val ratingStream = kafkaStream.map{ case msg=>
    var attr = msg.value().split("\\|")
    (attr(0).toInt,attr(1).toInt,attr(2).toDouble,attr(3).toInt)
  }

  // core real-time recommendation algorithm
  ratingStream.foreachRDD{ rdd =>
    rdd.map{ case (userId,productId,score,timestamp) =>
      println(">>>>>>>>>>>>>>>>")

      // get the user's most recent M product ratings
      val userRecentlyRatings = getUserRecentlyRating(MAX_USER_RATINGS_NUM,userId,ConnHelper.jedis)

      // get the K products most similar to product P
      val simProducts = getTopSimProducts(MAX_SIM_PRODUCTS_NUM,productId,userId,simProductsMatrixBroadCast.value)

      // compute the recommendation priority of each candidate product
      val streamRecs = computeProductScores(simProductsMatrixBroadCast.value,userRecentlyRatings,simProducts)

      // save the results to MongoDB
      saveRecsToMongoDB(userId,streamRecs)

    }.count()
  }

  // start the Streaming program
  ssc.start()
  ssc.awaitTermination()
}

5.3 Implementation of real-time recommendation algorithm

  • The premise of real-time recommendation algorithm:
    • Redis stores each user's K most recent product ratings, so the real-time algorithm can read them quickly.
    • The offline recommendation algorithm has calculated the product similarity matrix in MongoDB in advance.
    • Kafka has obtained real-time user rating data.
  • The algorithm process is as follows:
    • The input of the real-time recommendation algorithm is a rating <userId, productId, rate, timestamp>
    • Core elements of execution include:
      • Get the last K ratings of userId
      • Get the K most similar products with productId
      • Calculate the recommendation priority of candidate products
      • Update real-time recommendation results for userId

5.3.1 Get the user’s K latest ratings

  • When the business server receives a user rating, it inserts the rating into that user's queue in Redis. In the real-time algorithm, we only need to read the corresponding queue content, as the following function does (a producer-side sketch is shown after it).
import scala.collection.JavaConversions._
/**
  * get the user's most recent M product ratings
  * @param num     number of ratings to fetch
  * @param userId  whose ratings to fetch
  * @return
  */
def getUserRecentlyRating(num:Int, userId:Int,jedis:Jedis): Array[(Int,Double)] ={
  // take num ratings from the user's queue; each entry has the form PRODUCTID:SCORE
  jedis.lrange("userId:"+userId.toString, 0, num).map{ item =>
    val attr = item.split("\\:")
    (attr(0).trim.toInt, attr(1).trim.toDouble)
  }.toArray
}
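
For reference, a hypothetical producer-side sketch (not part of the project code) of what the business server has to write into Redis for the function above to work: each rating is LPUSHed onto the list keyed by "userId:<userId>", with the value formatted as "<productId>:<score>".

import redis.clients.jedis.Jedis

def pushRatingToRedis(userId: Int, productId: Int, score: Double, jedis: Jedis): Unit = {
  val key = "userId:" + userId
  // the newest rating goes to the head of the list
  jedis.lpush(key, s"$productId:$score")
  // optionally trim the queue so that only the most recent ratings are kept
  jedis.ltrim(key, 0, 19)
}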

5.3.2 Get the K most similar products to the current product

In the offline algorithm, the product similarity matrix has already been computed, so the K products most similar to a given productId are easy to obtain: look up the sub-map for productId in the broadcast similarity matrix, filter out the products the user has already rated (queried from MongoDB's Rating collection), and take the K products with the highest similarity. The output is an Array[Int] of the products most similar to productId, named candidateProducts, the set of candidate products.

/**
  * get the K products most similar to the current product
  * @param num          number of similar products
  * @param productId    ID of the current product
  * @param userId       the user who just rated
  * @param simProducts  value of the broadcast product similarity matrix
  * @param mongConfig   MongoDB configuration
  * @return
  */
def getTopSimProducts(num:Int, productId:Int, userId:Int, simProducts:scala.collection.Map[Int,scala.collection.immutable.Map[Int,Double]])(implicit mongConfig: MongConfig): Array[Int] ={
  // get all products similar to the current one from the broadcast similarity matrix
  val allSimProducts = simProducts.get(productId).get.toArray
  // get the products the user has already rated
  val ratingExist = ConnHelper.mongoClient(mongConfig.db)(MONGODB_RATING_COLLECTION).find(MongoDBObject("userId" -> userId)).toArray.map{ item =>
    item.get("productId").toString.toInt
  }
  // filter out the products that have already been rated, then sort and output
  allSimProducts.filter(x => !ratingExist.contains(x._1)).sortWith(_._2 > _._2).take(num).map(x => x._1)
}

5.3.3 Product recommendation priority calculation

  • For the candidate product set and the user's K most recent ratings, the recommendation priority is computed as follows:
/**
  * compute the recommendation score of each candidate product
  * @param simProducts          product similarity matrix
  * @param userRecentlyRatings  the user's k most recent ratings
  * @param topSimProducts       the K products most similar to the current product
  * @return
  */
def computeProductScores(
    simProducts:scala.collection.Map[Int,scala.collection.immutable.Map[Int,Double]],
    userRecentlyRatings:Array[(Int,Double)],
    topSimProducts: Array[Int]): Array[(Int,Double)] ={

  // holds the weighted score of each candidate product against each recently rated product
  val score = scala.collection.mutable.ArrayBuffer[(Int,Double)]()

  // counts the "boost" factor of each product
  val increMap = scala.collection.mutable.HashMap[Int,Int]()

  // counts the "penalty" factor of each product
  val decreMap = scala.collection.mutable.HashMap[Int,Int]()

  for (topSimProduct <- topSimProducts; userRecentlyRating <- userRecentlyRatings){
    val simScore = getProductsSimScore(simProducts,userRecentlyRating._1,topSimProduct)
    if(simScore > 0.6){
      score += ((topSimProduct, simScore * userRecentlyRating._2 ))
      if(userRecentlyRating._2 > 3){
        increMap(topSimProduct) = increMap.getOrDefault(topSimProduct,0) + 1
      }else{
        decreMap(topSimProduct) = decreMap.getOrDefault(topSimProduct,0) + 1
      }
    }
  }

  score.groupBy(_._1).map{ case (productId,sims) =>
    (productId,sims.map(_._2).sum / sims.length + log(increMap.getOrDefault(productId, 1)) - log(decreMap.getOrDefault(productId, 1)))
  }.toArray.sortWith(_._2>_._2)

}
  • Here, getProductsSimScore looks up the similarity between a candidate product and an already-rated product. The code is as follows:
/**
  * get the similarity between two products
  * @param simProducts        product similarity matrix
  * @param userRatingProduct  a product the user has already rated
  * @param topSimProduct      a candidate product
  * @return
  */
def getProductsSimScore(
    simProducts:scala.collection.Map[Int,scala.collection.immutable.Map[Int,Double]],
    userRatingProduct:Int, topSimProduct:Int): Double ={
  simProducts.get(topSimProduct) match {
    case Some(sim) => sim.get(userRatingProduct) match {
      case Some(score) => score
      case None => 0.0
    }
    case None => 0.0
  }
}
  • Here log is a logarithm, implemented as the base-10 (common) logarithm:
// base-10 logarithm
def log(m:Int):Double ={
  math.log(m) / math.log(10)
}

5.3.4 Save results to mongoDB

  • The saveRecsToMongoDB function implements the saving of results:
/**
  * save the data to MongoDB    userId -> 1,  recs -> 22:4.5|45:3.8
  * @param streamRecs  the streaming recommendation results
  * @param mongConfig  MongoDB configuration
  */
def saveRecsToMongoDB(userId:Int,streamRecs:Array[(Int,Double)])(implicit mongConfig: MongConfig): Unit ={
  // connection to the StreamRecs collection
  val streaRecsCollection = ConnHelper.mongoClient(mongConfig.db)(MONGODB_STREAM_RECS_COLLECTION)

  streaRecsCollection.findAndRemove(MongoDBObject("userId" -> userId))
  streaRecsCollection.insert(MongoDBObject("userId" -> userId, "recs" ->
    streamRecs.map( x => MongoDBObject("productId"->x._1,"score"->x._2)) ))
}

5.3.5 Update real-time recommendation results

  • After the updatedRecommends<productId, E> array of candidate-product recommendation priorities has been computed, it is sent to the web backend, where it is merged with userId's previous real-time recommendation result recentRecommends<productId, E>, deduplicated, and cut down to the K products with the highest priority E as the new real-time recommendation (a rough sketch follows this list). Specifically:
    • Merge: combine updatedRecommends and recentRecommends into one new <productId, E> array;
    • Replace (deduplicate): when updatedRecommends and recentRecommends contain the same productId, the priority stored in recentRecommends comes from the previous round, so it is discarded and replaced by the updated priority from updatedRecommends;
    • Select top K: from the merged and deduplicated <productId, E> array, pick the K products with the highest recommendation priority as the final result of this round of real-time recommendation.
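
A minimal sketch of this merge / replace / top-K step (the names updatedRecommends, recentRecommends and mergeRecs are illustrative, not taken from the project code):

def mergeRecs(updatedRecommends: Array[(Int, Double)],
              recentRecommends: Array[(Int, Double)],
              k: Int): Array[(Int, Double)] = {
  // toMap keeps the last value seen for a duplicate key, so the updated priorities
  // override the ones from the previous round
  (recentRecommends ++ updatedRecommends).toMap
    .toArray
    .sortWith(_._2 > _._2)   // sort by recommendation priority E, descending
    .take(k)                 // keep the top K products
}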

Code body:

package com.atguigu.online

import com.mongodb.casbah.commons.MongoDBObject
import com.mongodb.casbah.{MongoClient, MongoClientURI}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

// connection helper object that holds the redis and mongodb connections
object ConnHelper extends Serializable{
  // lazy vals, only initialized when first used
  lazy val jedis = new Jedis("localhost")
  lazy val mongoClient = MongoClient(MongoClientURI("mongodb://localhost:27017/recommender"))
}

case class MongoConfig( uri: String, db: String )

// standard recommendation object
case class Recommendation( productId: Int, score: Double )
// a user's recommendation list
case class UserRecs( userId: Int, recs: Seq[Recommendation] )
// a product's similarity list
case class ProductRecs( productId: Int, recs: Seq[Recommendation] )

object OnlineRecommender {
  // constants and collection names
  val MONGODB_RATING_COLLECTION = "Rating"
  val STREAM_RECS = "StreamRecs"
  val PRODUCT_RECS = "ProductRecs"

  val MAX_USER_RATING_NUM = 20
  val MAX_SIM_PRODUCTS_NUM = 20

  def main(args: Array[String]): Unit = {
    val config = Map(
      "spark.cores" -> "local[*]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender",
      "kafka.topic" -> "recommender"
    )

    // create spark conf
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("OnlineRecommender")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val sc = spark.sparkContext
    val ssc = new StreamingContext(sc, Seconds(2))

    import spark.implicits._
    implicit val mongoConfig = MongoConfig( config("mongo.uri"), config("mongo.db") )

    // load the similarity matrix and broadcast it
    val simProductsMatrix = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", PRODUCT_RECS)
      .format("com.mongodb.spark.sql")
      .load()
      .as[ProductRecs]
      .rdd
      // convert the data into a map so similarity lookups are convenient later
      .map{ item =>
        ( item.productId, item.recs.map( x=>(x.productId, x.score) ).toMap )
      }
      .collectAsMap()
    // define the broadcast variable
    val simProcutsMatrixBC = sc.broadcast(simProductsMatrix)

    // kafka configuration parameters
    val kafkaParam = Map(
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "recommender",
      "auto.offset.reset" -> "latest"
    )
    // create a DStream
    val kafkaStream = KafkaUtils.createDirectStream[String, String]( ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String]( Array(config("kafka.topic")), kafkaParam )
    )
    // process kafkaStream into a rating stream: userId|productId|score|timestamp
    val ratingStream = kafkaStream.map{ msg=>
      var attr = msg.value().split("\\|")
      ( attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt )
    }

    // core algorithm: define how the rating stream is processed
    ratingStream.foreachRDD{
      rdds => rdds.foreach{
        case ( userId, productId, score, timestamp ) =>
          println("rating data coming!>>>>>>>>>>>>>>>>>>")

          // TODO: core algorithm steps
          // 1. get the user's recent ratings from redis, as an Array[(productId, score)]
          val userRecentlyRatings = getUserRecentlyRatings( MAX_USER_RATING_NUM, userId, ConnHelper.jedis )

          // 2. get the products most similar to the current product from the similarity matrix, as the candidate list Array[productId]
          val candidateProducts = getTopSimProducts( MAX_SIM_PRODUCTS_NUM, productId, userId, simProcutsMatrixBC.value )

          // 3. compute the recommendation priority of every candidate product, giving the user's real-time recommendation list Array[(productId, score)]
          val streamRecs = computeProductScore( candidateProducts, userRecentlyRatings, simProcutsMatrixBC.value )

          // 4. save the recommendation list to mongodb
          saveDataToMongoDB( userId, streamRecs )
      }
    }

    // start streaming
    ssc.start()
    println("streaming started!")
    ssc.awaitTermination()

  }

  /**
    * get the user's num most recent ratings from redis
    */
  import scala.collection.JavaConversions._
  def getUserRecentlyRatings(num: Int, userId: Int, jedis: Jedis): Array[(Int, Double)] = {
    // read the user's rating queue from redis; the list key is "userId:USERID" and each value has the form PRODUCTID:SCORE
    jedis.lrange( "userId:" + userId.toString, 0, num )
      .map{ item =>
        val attr = item.split("\\:")
        ( attr(0).trim.toInt, attr(1).trim.toDouble )
      }
      .toArray
  }
  // get the current product's similarity list and filter out products the user has already rated, as the candidate list
  def getTopSimProducts(num: Int,
                        productId: Int,
                        userId: Int,
                        simProducts: scala.collection.Map[Int, scala.collection.immutable.Map[Int, Double]])
                       (implicit mongoConfig: MongoConfig): Array[Int] ={
    // get the current product's similarity list from the broadcast similarity matrix
    val allSimProducts = simProducts(productId).toArray

    // get the products the user has already rated, filter them out, then sort and output
    val ratingCollection = ConnHelper.mongoClient( mongoConfig.db )( MONGODB_RATING_COLLECTION )
    val ratingExist = ratingCollection.find( MongoDBObject("userId"->userId) )
      .toArray
      .map{ item=> // only the productId is needed
        item.get("productId").toString.toInt
      }
    // filter the already-rated products out of all similar products
    allSimProducts.filter( x => ! ratingExist.contains(x._1) )
      .sortWith(_._2 > _._2)
      .take(num)
      .map(x=>x._1)
  }
  // compute the recommendation score of each candidate product
  def computeProductScore(candidateProducts: Array[Int],
                          userRecentlyRatings: Array[(Int, Double)],
                          simProducts: scala.collection.Map[Int, scala.collection.immutable.Map[Int, Double]])
  : Array[(Int, Double)] ={
    // a variable-length ArrayBuffer holding the base score of every candidate product, (productId, score)
    val scores = scala.collection.mutable.ArrayBuffer[(Int, Double)]()
    // two maps used as counters of high and low ratings per product, productId -> count
    val increMap = scala.collection.mutable.HashMap[Int, Int]()
    val decreMap = scala.collection.mutable.HashMap[Int, Int]()

    // for every candidate product, compute its similarity with every recently rated product
    for( candidateProduct <- candidateProducts; userRecentlyRating <- userRecentlyRatings ){
      // get the similarity between the current candidate and the current rated product from the similarity matrix
      val simScore = getProductsSimScore( candidateProduct, userRecentlyRating._1, simProducts )
      if( simScore > 0.4 ){
        // weight by the rating, following the formula, to get the base score
        scores += ( (candidateProduct, simScore * userRecentlyRating._2) )
        if( userRecentlyRating._2 > 3 ){
          increMap(candidateProduct) = increMap.getOrDefault(candidateProduct, 0) + 1
        } else {
          decreMap(candidateProduct) = decreMap.getOrDefault(candidateProduct, 0) + 1
        }
      }
    }

    // compute the final recommendation priorities, grouping by productId first
    scores.groupBy(_._1).map{
      case (productId, scoreList) =>
        ( productId, scoreList.map(_._2).sum/scoreList.length + log(increMap.getOrDefault(productId, 1)) - log(decreMap.getOrDefault(productId, 1)) )
    }
    // return the recommendation list sorted by score
      .toArray
      .sortWith(_._2>_._2)
  }

  def getProductsSimScore(product1: Int, product2: Int,
                          simProducts: scala.collection.Map[Int, scala.collection.immutable.Map[Int, Double]]): Double ={
    simProducts.get(product1) match {
      case Some(sims) => sims.get(product2) match {
        case Some(score) => score
        case None => 0.0
      }
      case None => 0.0
    }
  }
  // custom log function with base N
  def log(m: Int): Double = {
    val N = 10
    math.log(m)/math.log(N)
  }
  // write to mongodb
  def saveDataToMongoDB(userId: Int, streamRecs: Array[(Int, Double)])(implicit mongoConfig: MongoConfig): Unit ={
    val streamRecsCollection = ConnHelper.mongoClient(mongoConfig.db)(STREAM_RECS)
    // query by userId and replace the previous result
    streamRecsCollection.findAndRemove( MongoDBObject( "userId" -> userId ) )
    streamRecsCollection.insert( MongoDBObject( "userId" -> userId,
                                  "recs" -> streamRecs.map(x=>MongoDBObject("productId"->x._1, "score"->x._2)) ) )
  }

}

5.4 Real-time system joint debugging

  • The real-time recommendation data flow of our system is: business system -> log -> Flume log collection -> Kafka Streaming data cleaning and preprocessing -> Spark Streaming computation. After the real-time recommendation service code is complete, it should be jointly debugged with the other tools to make sure the whole system runs correctly.

5.4.1 Starting the basic components of a real-time system

  • Start the real-time recommendation system StreamingRecommender as well as mongodb and redis

5.4.2 Start zookeeper

bin/zkServer.sh start

5.4.3 Start kafka

bin/kafka-server-start.sh -daemon ./config/server.properties
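
If the two topics used in this chapter ("log" as the raw input and "recommender" as the cleaned output) do not exist yet, they can be created with the standard Kafka CLI, for example:

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic log
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic recommender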

5.4.4 Build Kafka Streaming program

  • Create a new module under recommender, KafkaStreaming, which is mainly used to preprocess log data and filter out the required content. The pom.xml file needs to introduce dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-streams</artifactId>
        <version>0.10.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>0.10.2.1</version>
    </dependency>
</dependencies>

<build>
    <finalName>kafkastream</finalName>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>com.atguigu.kafkastream.Application</mainClass>
                    </manifest>
                </archive>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
  • Create a new java class com.atguigu.kafkastreaming.Application under src/main/java
import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.TopologyBuilder;

public class Application {
    public static void main(String[] args){

        String brokers = "localhost:9092";
        String zookeepers = "localhost:2181";

        // input and output topics
        String from = "log";
        String to = "recommender";

        // kafka streaming configuration
        Properties settings = new Properties();
        settings.put(StreamsConfig.APPLICATION_ID_CONFIG, "logFilter");
        settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        settings.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, zookeepers);

        StreamsConfig config = new StreamsConfig(settings);

        // topology builder
        TopologyBuilder builder = new TopologyBuilder();

        // define the stream-processing topology
        builder.addSource("SOURCE", from)
                .addProcessor("PROCESS", () -> new LogProcessor(), "SOURCE")
                .addSink("SINK", to, "PROCESS");

        KafkaStreams streams = new KafkaStreams(builder, config);
        streams.start();
    }
}
  • This program will obtain the information flow with the topic "log" for processing, and forward it with "recommender" as the new topic.
  • The stream-processing program LogProcessor.java:
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class LogProcessor implements Processor<byte[],byte[]> {
    private ProcessorContext context;

    public void init(ProcessorContext context) {
        this.context = context;
    }

    public void process(byte[] dummy, byte[] line) {
        String input = new String(line);
        // filter log lines by prefix and extract the content after it
        if(input.contains("PRODUCT_RATING_PREFIX:")){
            System.out.println("product rating coming!!!!" + input);
            input = input.split("PRODUCT_RATING_PREFIX:")[1].trim();
            context.forward("logProcessor".getBytes(), input.getBytes());
        }
    }
    public void punctuate(long timestamp) {
    }
    public void close() {
    }
}
  • After completing the code, start the Application.
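  • Before wiring in Flume, the Kafka part of the pipeline can be smoke-tested with the console tools. This is only a sketch: the payload after the prefix is a placeholder here, and the real format is whatever the business backend writes to its log.

# Produce a test line into the input topic
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic log
> something PRODUCT_RATING_PREFIX:4867|457976|5.0|1395676800

# In another shell, check that the cleaned message shows up on the output topic
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic recommender --from-beginning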

5.4.5 Configure and start flume

  • Create a new log-kafka.properties file in Flume's conf directory and configure Flume to connect to Kafka:
agent.sources = exectail
agent.channels = memoryChannel
agent.sinks = kafkasink

# For each one of the sources, the type is defined
agent.sources.exectail.type = exec
# The following is the absolute path of the log file to collect; change it to your own log directory
agent.sources.exectail.command = tail -f /mnt/d/Projects/BigData/ECommerceRecommenderSystem/businessServer/src/main/log/agent.log
agent.sources.exectail.interceptors=i1
agent.sources.exectail.interceptors.i1.type=regex_filter
# Regex matching the log prefix to keep
agent.sources.exectail.interceptors.i1.regex=.+PRODUCT_RATING_PREFIX.+
# The channel can be defined as follows.
agent.sources.exectail.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.kafka.topic = log
agent.sinks.kafkasink.kafka.bootstrap.servers = localhost:9092
agent.sinks.kafkasink.kafka.producer.acks = 1
agent.sinks.kafkasink.kafka.flumeBatchSize = 20

#Specify the channel the sink should use
agent.sinks.kafkasink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000

After configuration, start flume:
./bin/flume-ng agent -c ./conf/ -f ./conf/log-kafka.properties -n agent -Dflume.root.logger=INFO,console

5.4.6 Start the business system background

  • Add the business code to the system. Note that in log4j.properties under src/main/resources/, the value of log4j.appender.file.File should be changed to your own log file path, which must match the path configured in Flume.
  • Start the business system backend, visit localhost:8088/index.html, click on a product to rate it, and check whether the real-time recommendation list changes.

Chapter 6 Cold Start Problem Handling

  • The whole recommendation system relies heavily on users' preference information to recommend products, which creates a problem: a newly registered user has no preference records at all, so at that point the system cannot produce any recommendations for them.
  • To deal with this, when a user logs in for the first time we can show an interactive dialog that collects their preferences by letting them check preset interest tags.
    Once the user's preferences are known, recommendations for the corresponding product categories can be given directly, as sketched below.
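    A minimal sketch of this idea follows. UserPrefs and coldStartRecommendations are hypothetical names introduced only for illustration; Product, MongoConfig and MONGODB_PRODUCT_COLLECTION are the ones defined in the DataLoader module (or their equivalents in whichever module this runs in).

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical cold-start helper: recommend products whose categories match the tags
// that the new user checked in the first-login dialog.
case class UserPrefs(userId: Int, tags: List[String])

def coldStartRecommendations(prefs: UserPrefs, spark: SparkSession)
                            (implicit mongoConfig: MongoConfig): DataFrame = {
  import spark.implicits._
  // Load the product table from MongoDB, same pattern as the other modules
  val products = spark.read
    .option("uri", mongoConfig.uri)
    .option("collection", MONGODB_PRODUCT_COLLECTION)
    .format("com.mongodb.spark.sql")
    .load()
    .as[Product]
  // Keep products whose category string contains any of the chosen tags
  products
    .filter(p => prefs.tags.exists(tag => p.categories.contains(tag)))
    .limit(20)
    .toDF()
}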

Chapter 7 Other forms of offline similar recommendation services

7.1 Content-based similar recommendations

  • The tag file in the raw data contains the tags users attached to products. This content is hard to convert directly into scores, but we can extract the tag text to build a content feature vector for each product and from that derive a similarity matrix. That matrix can plug straight into the real-time recommendation service to find products similar to the one the user has just rated, giving content-based real-time recommendations. To reduce the influence of very popular tags on feature extraction, we also adjust tag weights with the TF-IDF algorithm so that the features stay as close as possible to user preferences.
  • Based on the above ideas, the core code for obtaining product feature vectors by adding the TF-IDF algorithm is as follows:
// Load the product data set
val productTagsDF = spark
  .read
  .option("uri", mongoConfig.uri)
  .option("collection", MONGODB_PRODUCT_COLLECTION)
  .format("com.mongodb.spark.sql")
  .load()
  .as[Product]
  .map(x => (x.productId, x.name, x.tags.map(c => if (c == '|') ' ' else c)))
  .toDF("productId", "name", "tags").cache()

// Instantiate a tokenizer, which splits on whitespace by default
val tokenizer = new Tokenizer().setInputCol("tags").setOutputCol("words")

// Apply the tokenizer
val wordsData = tokenizer.transform(productTagsDF)

// Define a HashingTF tool
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)

// Apply HashingTF to get raw term frequencies
val featurizedData = hashingTF.transform(wordsData)

// Define an IDF tool
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

// Fit the IDF model on the term-frequency data (document statistics)
val idfModel = idf.fit(featurizedData)

// Apply TF-IDF to get the rescaled feature matrix
val rescaledData = idfModel.transform(featurizedData)

// Extract the feature vectors from rescaledData
val productFeatures = rescaledData.map {
  case row => (row.getAs[Int]("productId"), row.getAs[SparseVector]("features").toArray)
}
  .rdd
  .map(x => (x._1, new DoubleMatrix(x._2)))
  • From the product feature vectors we can then compute the similarity matrix and show similar-product recommendations on the product detail page; on e-commerce sites such a list is usually displayed after the user views a product or completes a purchase.
  • The similarity matrix can also feed the real-time recommendation service to produce a user recommendation list, as sketched below. Both the content-based approach and the latent factor model ultimately serve to extract item feature vectors from which a similarity matrix can be computed; our real-time recommendation algorithm is defined in terms of that similarity.
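    As a sketch of how the productFeatures RDD above can be turned into a similarity matrix (ProductRecs and Recommendation are the project's result case classes; the 0.6 threshold and the helper name cosineSim are assumptions; DoubleMatrix comes from org.jblas as above, and spark.implicits._ is assumed to be imported):

// Pair every product with every other product and compute cosine similarity
val productRecs = productFeatures.cartesian(productFeatures)
  .filter { case (a, b) => a._1 != b._1 }      // skip a product paired with itself
  .map { case (a, b) =>
    val simScore = cosineSim(a._2, b._2)
    (a._1, (b._1, simScore))
  }
  .filter(_._2._2 > 0.6)                       // keep only reasonably similar pairs
  .groupByKey()
  .map { case (productId, recs) =>
    ProductRecs(productId, recs.toList.sortWith(_._2 > _._2).map(x => Recommendation(x._1, x._2)))
  }
  .toDF()

// Cosine similarity between two feature vectors
def cosineSim(a: DoubleMatrix, b: DoubleMatrix): Double =
  a.dot(b) / (a.norm2() * b.norm2())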

7.2 Item-based collaborative filtering similar recommendation

  • Item-based collaborative filtering (Item-CF) only needs the user's everyday behavior data (such as clicks, favorites, and purchases) to derive the similarity between products, and it is widely used in real projects.
  • Our basic idea is that if two products share the same audience (the people interested in them), they are inherently related. Existing behavior data can therefore be used to measure how similar two products' audiences are, and from that derive the similarity between the products. We call this the "co-occurrence similarity" of items, with the formula:

    w_ij = | N_i ∩ N_j | / sqrt( |N_i| × |N_j| )

  • Here N_i is the set of users who purchased (or rated) product i, and N_j is the set of users who purchased product j. For example, if 100 users rated product i, 200 users rated product j, and 20 of them rated both, the co-occurrence similarity is 20 / sqrt(100 × 200) ≈ 0.14.
  • The core code is implemented as follows:
val ratingDF = spark.read
  .option("uri", mongoConfig.uri)
  .option("collection", MONGODB_RATING_COLLECTION)
  .format("com.mongodb.spark.sql")
  .load()
  .as[Rating]
  .map(x => (x.userId, x.productId, x.score))
  .toDF("userId", "productId", "rating")

// Count the ratings per product and attach the count to ratingDF via an inner join
val numRatersPerProduct = ratingDF.groupBy("productId").count()
val ratingWithCountDF = ratingDF.join(numRatersPerProduct, "productId")

// Pair up product ratings by userId, so we can count how often two products are rated by the same user
val joinedDF = ratingWithCountDF.join(ratingWithCountDF, "userId")
  .toDF("userId", "product1", "rating1", "count1", "product2", "rating2", "count2")
  .select("userId", "product1", "count1", "product2", "count2")
joinedDF.createOrReplaceTempView("joined")
val cooccurrenceDF = spark.sql(
  """
    |select product1
    |, product2
    |, count(userId) as coocount
    |, first(count1) as count1
    |, first(count2) as count2
    |from joined
    |group by product1, product2
  """.stripMargin
).cache()

val simDF = cooccurrenceDF.map{ row =>
    // Use the co-occurrence count and each product's own count to compute co-occurrence similarity
    val coocSim = cooccurrenceSim( row.getAs[Long]("coocount"), row.getAs[Long]("count1"), row.getAs[Long]("count2") )
    ( row.getAs[Int]("product1"), ( row.getAs[Int]("product2"), coocSim ) )
  }
  .rdd
  .groupByKey()
  .map{
    case (productId, recs) =>
      ProductRecs( productId,
        recs.toList
          .filter(x => x._1 != productId)
          .sortWith(_._2 > _._2)
          .map(x => Recommendation(x._1, x._2))
          .take(MAX_RECOMMENDATION)
      )
  }
  .toDF()
  • Among them, the function code for calculating co-occurrence similarity is implemented as follows:
def cooccurrenceSim(cooCount: Long, count1: Long, count2: Long): Double = {
  cooCount / math.sqrt(count1 * count2)
}
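
  • The simDF computed above can then be written back to MongoDB in the same way as the other result tables; this is only a sketch, and the ItemCFProductRecs collection name is an assumption:

// Persist the Item-CF similarity lists to MongoDB (collection name assumed)
simDF.write
  .option("uri", mongoConfig.uri)
  .option("collection", "ItemCFProductRecs")
  .mode("overwrite")
  .format("com.mongodb.spark.sql")
  .save()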

Chapter 8 Program Deployment and Operation

8.1 Publish project

  • Compile the project: execute the clean package phase of the root project

8.2 Install the front-end project

  • Unzip website-release.tar.gz into the /var/www/html directory and place the extracted files in that directory's root.
  • Start the Apache server and visit http://IP:80

8.3 Install business server

  • Place BusinessServer.war in Tomcat's webapps directory and place the extracted files in the ROOT directory.
  • Start the Tomcat server

8.4 Kafka configuration and startup

  • Start Kafka
  • Create two Topics in kafka, one for log and one for recommender
  • Start the kafkaStream program to format data between the log and recommender topics.
java -cp kafkastream.jar com.atguigu.kafkastream.Application linux:9092 linux:2181 log recommender

8.5 Flume configuration and startup

  • Create log-kafka.properties in the conf folder in the flume installation directory
agent.sources = exectail
agent.channels = memoryChannel
agent.sinks = kafkasink

# For each one of the sources, the type is defined
agent.sources.exectail.type = exec
agent.sources.exectail.command = tail -f /home/bigdata/cluster/apache-tomcat-8.5.23/logs/catalina.out
agent.sources.exectail.interceptors=i1
agent.sources.exectail.interceptors.i1.type=regex_filter
agent.sources.exectail.interceptors.i1.regex=.+PRODUCT_RATING_PREFIX.+
# The channel can be defined as follows.
agent.sources.exectail.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.kafka.topic = log
agent.sinks.kafkasink.kafka.bootstrap.servers = linux:9092
agent.sinks.kafkasink.kafka.producer.acks = 1
agent.sinks.kafkasink.kafka.flumeBatchSize = 20


#Specify the channel the sink should use
agent.sinks.kafkasink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000
  • Start flume
bin/flume-ng agent -c ./conf/ -f ./conf/log-kafka.properties -n agent

8.6 Deploy streaming computing service

  • Submit SparkStreaming program:
bin/spark-submit --class com.atguigu.streamingRecommender.StreamingRecommender streamingRecommender-1.0-SNAPSHOT.jar

8.7 Azkaban scheduling offline algorithm

  • Create scheduling project
  • Create two job files as follows:
  • Azkaban-stat.job:
type=command
command=/home/bigdata/cluster/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --class com.atguigu.offline.RecommenderTrainerApp
 offlineRecommender-1.0-SNAPSHOT.jar
  • Azkaban-offline.job:
type=command
command=/home/bigdata/cluster/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --class com.atguigu.statisticsRecommender.StatisticsApp
 statisticsRecommender-1.0-SNAPSHOT.jar
  • Package the two job files into a ZIP archive and upload it to Azkaban.
  • Define the scheduled execution time for each task separately.
  • After the schedule is defined, click Schedule.

Origin blog.csdn.net/Lenhart001/article/details/131566357