[Big Data in Practice: E-commerce Recommendation System] Overview

Article directory


Chapter 1 Project System Framework Design (Instructions)


Chapter 2 Tool Environment Construction (Instructions)

  • Install the latest version of MongoDB => resolves the missing-dependency problem when installing MongoDB on Ubuntu
  • Use a CentOS 7 system and follow the environment setup process to install MongoDB, Redis, Spark, ZooKeeper, Flume-ng, and Kafka

Chapter 3 Project creation and initialization of business data

3.1 Create the Maven project in IDEA (omitted)

3.2 Data loading preparation (instructions)

3.3 Initialize data to MongoDB [DataLoader data loading module]

Data loader main implementation + data writing to MongoDB

  • Define case classes for the raw data, read the data file with SparkContext's textFile method, convert it into a DataFrame, and then use the write method provided by Spark SQL to insert the data into MongoDB in a distributed way (a sketch follows this list).
  • Create a new package under DataLoader/src/main/scala, name it com.atguigu.recommender, and create a new Scala class file named DataLoader.
  • Firewall issue: the firewall must be turned off (or the MongoDB port opened) when connecting to MongoDB.
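
A minimal sketch of the DataLoader flow described above, assuming a comma-separated rating file with the fields userId,productId,score,timestamp; the file path and collection name are illustrative assumptions, not the project's exact values:

package com.atguigu.recommender

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Case class for one line of the raw rating file (assumed format: userId,productId,score,timestamp)
case class Rating( userId: Int, productId: Int, score: Double, timestamp: Int )
case class MongoConfig( uri: String, db: String )

object DataLoader {

  // Illustrative file path and collection name; replace them with your own
  val RATING_DATA_PATH = "D:\\data\\ratings.csv"
  val MONGODB_RATING_COLLECTION = "Rating"

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("DataLoader")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    import spark.implicits._

    implicit val mongoConfig = MongoConfig("mongodb://localhost:27017/recommender", "recommender")

    // Read the raw file with textFile, split each line, and convert it into a DataFrame
    val ratingRDD = spark.sparkContext.textFile(RATING_DATA_PATH)
    val ratingDF = ratingRDD.map{ line =>
      val attr = line.split(",")
      Rating( attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt )
    }.toDF()

    // Distributed insert into MongoDB through Spark SQL's write API
    ratingDF.write
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    spark.stop()
  }
}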

StatisticsRecommender statistics recommendation module

Code analysis:

  • temporary table -> result table
  • Register UDF and convert timestamp into year and month format yyyyMM
spark.udf.register("changeDate", (x: Int)=>simpleDateFormat.format(new Date(x * 1000L)).toInt)

Chapter 4 Offline recommendation service construction

4.1 Offline recommendation service

  • The offline recommendation service integrates all of a user's historical data and uses the configured offline statistical and recommendation algorithms to compute and save results periodically. The computed results stay fixed for a period of time; how often they change depends on how often the algorithm is scheduled to run.
  • The offline recommendation service mainly calculates some indicators that can be counted and calculated in advance to provide data support for real-time calculation and front-end business response.
  • Offline recommendation services are mainly divided into statistical recommendations, latent semantic model-based collaborative filtering recommendations, and content-based and Item-CF-based similar recommendations.
  • This chapter mainly introduces the first two parts. Content-based and Item-CF recommendations are similar in overall structure and implementation. We will introduce them in detail in Chapter 7.

4.2 Offline statistical service [statistical recommendation module]

  • Create a new sub-project StatisticsRecommender under recommender; its pom.xml only needs the Spark, Scala, and MongoDB dependencies.

  • Introduce log4j.properties under the resources folder, and then create a new scala singleton object com.atguigu.statistics.StatisticsRecommender under src/main/scala.

  • As before, we first define the case classes, then in the main() method define the configuration, create the SparkSession, load the data, and finally stop Spark.

  • Historical popular product statistics: based on all historical rating data, find the products with the most historical ratings.

    • Read the rating data set through Spark SQL and count, over all ratings, how many ratings each product has received.
    • Sort in descending order and write the final results to MongoDB's RateMoreProducts collection.
  • Recent popular product statistics: based on the ratings, count per month the products with the most ratings in the most recent months.

    • Read the rating data set through Spark SQL, convert the rating timestamp to a month with a UDF, and then count the number of product ratings per month.
    • After the statistics are complete, write the data to MongoDB's RateMoreRecentlyProducts collection.
  • Product average score statistics: based on all users' historical ratings of each product, periodically compute each product's average score.

    • Read the Rating data set saved in MongoDB through Spark SQL and compute the product average scores by executing the SQL statement shown below.
    • After the statistics are complete, write the resulting new DataFrame to MongoDB's AverageProducts collection.
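
The average-score SQL referenced above, as it appears in the main code below:

select productId, avg(score) as avg from ratings group by productId order by avg desc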

Main code (src/main/scala/com.atguigu.statistics/StatisticsRecommender.scala):

package com.atguigu.statistics

import java.text.SimpleDateFormat
import java.util.Date

import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, SparkSession}

case class Rating( userId: Int, productId: Int, score: Double, timestamp: Int )
case class MongoConfig( uri: String, db: String )

object StatisticsRecommender {

  // Names of the collections stored in MongoDB
  val MONGODB_RATING_COLLECTION = "Rating"
  val RATE_MORE_PRODUCTS = "RateMoreProducts"
  val RATE_MORE_RECENTLY_PRODUCTS = "RateMoreRecentlyProducts"
  val AVERAGE_PRODUCTS = "AverageProducts"

  def main(args: Array[String]): Unit = {

    val config = Map(
      "spark.cores" -> "local[1]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender"
    )
    // Create a spark config
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("StatisticsRecommender")
    // Create a spark session
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    import spark.implicits._
    implicit val mongoConfig = MongoConfig( config("mongo.uri"), config("mongo.db") )

    // Load the rating data
    val ratingDF = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .format("com.mongodb.spark.sql")
      .load()
      .as[Rating]
      .toDF()

    // Create a temporary view named ratings
    ratingDF.createOrReplaceTempView("ratings")

    // TODO: run the different statistics with Spark SQL
    // todo: (1) Historical popular products: count ratings per product -> productId, count
    val rateMoreProductsDF = spark.sql("select productId, count(productId) as count from ratings group by productId order by count desc")
    storeDFInMongoDB( rateMoreProductsDF, RATE_MORE_PRODUCTS )

    // todo: (2) Recent popular products: convert the timestamp to yyyyMM and count ratings -> productId, count, yearmonth
    // Create a date formatter
    val simpleDateFormat = new SimpleDateFormat("yyyyMM")
    // Register a UDF that converts the timestamp into the year-month format yyyyMM
    spark.udf.register("changeDate", (x: Int)=>simpleDateFormat.format(new Date(x * 1000L)).toInt)
    // Reshape the raw ratings into productId, score, yearmonth
    val ratingOfYearMonthDF = spark.sql("select productId, score, changeDate(timestamp) as yearmonth from ratings")
    ratingOfYearMonthDF.createOrReplaceTempView("ratingOfMonth")
    val rateMoreRecentlyProductsDF = spark.sql("select productId, count(productId) as count, yearmonth from ratingOfMonth group by yearmonth, productId order by yearmonth desc, count desc")
    // Save the DataFrame to MongoDB
    storeDFInMongoDB( rateMoreRecentlyProductsDF, RATE_MORE_RECENTLY_PRODUCTS )

    // todo: (3) High-quality products: average score per product -> productId, avg
    val averageProductsDF = spark.sql("select productId, avg(score) as avg from ratings group by productId order by avg desc")
    storeDFInMongoDB( averageProductsDF, AVERAGE_PRODUCTS )

    spark.stop()
  }

  // TODO: save a DataFrame to MongoDB
  def storeDFInMongoDB(df: DataFrame, collection_name: String)(implicit mongoConfig: MongoConfig): Unit ={
    df.write
      .option("uri", mongoConfig.uri)
      .option("collection", collection_name)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()
  }
}

4.3 Collaborative filtering recommendation based on latent semantic model [LFM offline recommendation module]

  • The project uses ALS as a collaborative filtering algorithm, and calculates the offline user product recommendation list and product similarity matrix based on the user rating table in MongoDB.

4.3.1 User product recommendation list

  • The model trained by ALS is used to calculate the recommended list of all current user products. The main idea is as follows:

    • The Cartesian product of userId and productId produces a tuple of (userId, productId)
    • Predict the rating corresponding to (userId, productId) through the model.
    • Sort prediction results by prediction score.
    • Return the K products with the highest scores as the current user's recommendation list.
  • The generated recommendation lists are saved to the UserRecs collection of MongoDB.

  • Create a new recommender sub-project OfflineRecommender and introduce the Spark, Scala, MongoDB and jblas dependencies.

  • After the same preliminary steps as before (defining the case classes, declaring the configuration, and creating the SparkSession), you can load the data and start training the model.

4.3.2 Product similarity matrix

  • The product similarity matrix is computed from the ALS model; it is used to look up products similar to the current one and serves the real-time recommendation system.

  • The ALS algorithm used for the offline computation ultimately produces two feature matrices: a user feature matrix U (m x k), in which each user is described by k features, and an item feature matrix V (n x k), in which each item is likewise described by k features.

  • Each row of V (n x k) is a k-dimensional vector. Although we do not know what each individual dimension means, the vector as a whole represents the features of the product corresponding to that row.

  • Therefore each product is represented by its row of V (n x k), a vector <t1, t2, t3, ...>. For any two products p and q, with feature vectors Vp = <tp1, tp2, tp3, ..., tpk> and Vq = <tq1, tq2, tq3, ..., tqk>, their similarity sim(p, q) can be expressed as the cosine of the angle between the two vectors:
    sim(p, q) = (Vp · Vq) / (||Vp|| * ||Vq||)

  • With this formula, the similarity between any two products in the data set can be computed, and product similarities remain essentially fixed over a period of time. The resulting data is saved to MongoDB's ProductRecs collection.

4.3.3 Model evaluation and parameter selection

  • In the process of training the above model, we directly gave the three parameters of rank, iterations and lambda of the latent semantic model.

  • For our model, this is not necessarily the optimal parameter selection, so we need to evaluate the model.

  • A common approach is to calculate the root mean square error (RMSE), which examines the error between the predicted score and the actual score.
    RMSE = sqrt( (1/N) * Σ (r_ui - r̂_ui)² ), where r_ui is the actual rating, r̂_ui the predicted rating, and N the number of test samples.

  • With RMSE, we can try several parameter values and select the combination with the smallest RMSE as the optimal choice for our model.

  • The adjustALSParams method is the core of model evaluation: it takes a set of training data and test data as input and outputs the parameter combination that yields the minimum RMSE.

  • A sketch of adjustALSParams is shown below.

  • The helper function getRMSE, which computes the RMSE, is sketched below as well.

  • Running this evaluation yields the optimal model parameters for the current data.
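
A minimal sketch of adjustALSParams and getRMSE, based on the same MLlib ALS types used in OfflineRecommender; the tested parameter ranges are illustrative assumptions rather than the project's exact values:

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

object ALSTrainer {

  def adjustALSParams(trainData: RDD[Rating], testData: RDD[Rating]): Unit = {
    // Try several (rank, lambda) combinations, train a model for each, and print the best one
    val result = for (rank <- Array(5, 10, 20, 50); lambda <- Array(1.0, 0.1, 0.01))
      yield {
        val model = ALS.train(trainData, rank, 10, lambda)
        val rmse = getRMSE(model, testData)
        (rank, lambda, rmse)
      }
    // Output the parameter combination with the smallest RMSE
    println(result.minBy(_._3))
  }

  def getRMSE(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
    // Predict a rating for every (user, product) pair in the test data
    val userProducts = data.map(item => (item.user, item.product))
    val predictRating = model.predict(userProducts)

    // Join actual and predicted ratings on (user, product) and compute the root mean square error
    val observed = data.map(item => ((item.user, item.product), item.rating))
    val predicted = predictRating.map(item => ((item.user, item.product), item.rating))
    math.sqrt(
      observed.join(predicted).map {
        case ((userId, productId), (actual, pre)) =>
          val err = actual - pre
          err * err
      }.mean()
    )
  }
}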

Code body:

package com.atguigu.offline

import org.apache.spark.SparkConf
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.sql.SparkSession
import org.jblas.DoubleMatrix

case class ProductRating( userId: Int, productId: Int, score: Double, timestamp: Int )
case class MongoConfig( uri: String, db: String )

// Standard recommendation object
case class Recommendation( productId: Int, score: Double )
// A user's recommendation list
case class UserRecs( userId: Int, recs: Seq[Recommendation] )
// A product's similarity list
case class ProductRecs( productId: Int, recs: Seq[Recommendation] )

object OfflineRecommender {

  // Names of the collections stored in MongoDB
  val MONGODB_RATING_COLLECTION = "Rating"

  val USER_RECS = "UserRecs"
  val PRODUCT_RECS = "ProductRecs"
  val USER_MAX_RECOMMENDATION = 20

  def main(args: Array[String]): Unit = {

    val config = Map(
      "spark.cores" -> "local[*]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender"
    )
    // Create a spark config
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("OfflineRecommender")
    // Create a spark session
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()

    import spark.implicits._
    implicit val mongoConfig = MongoConfig( config("mongo.uri"), config("mongo.db") )

    // Load the rating data
    val ratingRDD = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", MONGODB_RATING_COLLECTION)
      .format("com.mongodb.spark.sql")
      .load()
      .as[ProductRating]
      .rdd
      .map(
        rating => (rating.userId, rating.productId, rating.score)
      ).cache()

    // Extract the distinct users and products
    val userRDD = ratingRDD.map(_._1).distinct()
    val productRDD = ratingRDD.map(_._2).distinct()

    // Core computation
    // 1. Train the latent semantic model
    val trainData = ratingRDD.map(x=>Rating(x._1,x._2,x._3))
    // Training parameters: rank = number of latent features, iterations, lambda = regularization coefficient
    val ( rank, iterations, lambda ) = ( 5, 10, 0.01 )
    val model = ALS.train( trainData, rank, iterations, lambda )

    // 2. Obtain the predicted rating matrix and build each user's recommendation list
    // The Cartesian product of userRDD and productRDD gives the empty (user, product) rating matrix to predict
    val userProducts = userRDD.cartesian(productRDD)
    val preRating = model.predict(userProducts)

    // Extract the user recommendation lists from the predicted rating matrix
    val userRecs = preRating.filter(_.rating>0)
      .map(
        rating => ( rating.user, ( rating.product, rating.rating ) )
      )
      .groupByKey()
      .map{
        case (userId, recs) =>
          UserRecs( userId, recs.toList.sortWith(_._2>_._2).take(USER_MAX_RECOMMENDATION).map(x=>Recommendation(x._1,x._2)) )
      }
      .toDF()
    userRecs.write
      .option("uri", mongoConfig.uri)
      .option("collection", USER_RECS)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    // 3. Use the product feature vectors to compute each product's similarity list
    val productFeatures = model.productFeatures.map{
      case (productId, features) => ( productId, new DoubleMatrix(features) )
    }
    // Pair up all products and compute their cosine similarity
    val productRecs = productFeatures.cartesian(productFeatures)
      .filter{
        case (a, b) => a._1 != b._1
      }
      // Compute the cosine similarity
      .map{
        case (a, b) =>
          val simScore = consinSim( a._2, b._2 )
          ( a._1, ( b._1, simScore ) )
      }
      .filter(_._2._2 > 0.4)
      .groupByKey()
      .map{
        case (productId, recs) =>
          ProductRecs( productId, recs.toList.sortWith(_._2>_._2).map(x=>Recommendation(x._1,x._2)) )
      }
      .toDF()
    productRecs.write
      .option("uri", mongoConfig.uri)
      .option("collection", PRODUCT_RECS)
      .mode("overwrite")
      .format("com.mongodb.spark.sql")
      .save()

    spark.stop()
  }

  def consinSim(product1: DoubleMatrix, product2: DoubleMatrix): Double ={
    product1.dot(product2) / ( product1.norm2() * product2.norm2() )
  }
}

Chapter 5 Real-time recommendation service construction [Real-time recommendation module]

  • When applied to a recommendation system, the biggest difference between real-time and offline computation is that real-time results should reflect the user's preferences over the most recent period, while offline results reflect the user's overall preferences based on all rating records since the first rating.
  • User preferences for items will always change over time.
    • For example, if a user u gives a very high rating to product p at a certain moment, then u is very likely to like other products similar to product p in the near future;
    • And if user u gives a very low rating to product q at a certain moment, then it is very likely that u will not like other products similar to product q in the near future.
    • Therefore, for real-time recommendations, when a user evaluates a product, the user will hope that the recommended results will be updated based on the recent ratings, so that the recommended results match the user's recent preferences and satisfy the user's recent tastes.
  • If real-time recommendation kept using the ALS algorithm from offline recommendation, its long running time alone would make it impossible to produce new results in real time. Moreover, since the algorithm works on the whole rating table and a new user rating only updates a single entry in it, the recommendation results after the rating would be essentially the same as before, giving the user the impression that nothing changed and seriously hurting the user experience.
  • In addition, because real-time recommendation must meet real-time or quasi-real-time response requirements, the computational load of the algorithm cannot be too large, otherwise heavy, complex computation would itself degrade the user experience. Given this, recommendation accuracy tends not to be very high: the real-time system cares more about how dynamically its results change, and as long as the reasons for updating the results are sound, the accuracy requirement can be relaxed somewhat.
  • Therefore, there are two main requirements for real-time recommendation algorithms:
    • The system can obviously update the recommendation results after the user’s current rating, or after the last few ratings;
    • The amount of calculation is not large, and it meets the real-time or quasi-real-time requirements in terms of response time;

5.2 Real-time recommendation model and code framework

5.2.1 Real-time recommendation model algorithm design

5.2.2 Real-time recommendation module framework

  • Create a new sub-project StreamingRecommender under recommender and introduce the Spark, Scala, MongoDB, Redis and Kafka dependencies.

  • In the code, we first define the case classes and a connection helper object (used to establish the Redis and MongoDB connections), and define some constants in StreamingRecommender.

  • The real-time recommendation body code is as follows:

def main(args: Array[String]): Unit = {

  val config = Map(
    "spark.cores" -> "local[*]",
    "mongo.uri" -> "mongodb://localhost:27017/recommender",
    "mongo.db" -> "recommender",
    "kafka.topic" -> "recommender"
  )
  // Create a SparkConf configuration
  val sparkConf = new SparkConf().setAppName("StreamingRecommender").setMaster(config("spark.cores"))
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()
  val sc = spark.sparkContext
  val ssc = new StreamingContext(sc,Seconds(2))

  implicit val mongConfig = MongConfig(config("mongo.uri"),config("mongo.db"))
  import spark.implicits._

  // Broadcast the product similarity matrix,
  // converted into a Map[Int, Map[Int, Double]]
  val simProductsMatrix = spark
    .read
    .option("uri",config("mongo.uri"))
    .option("collection",MONGODB_PRODUCT_RECS_COLLECTION)
    .format("com.mongodb.spark.sql")
    .load()
    .as[ProductRecs]
    .rdd
    .map{ recs =>
      (recs.productId,recs.recs.map(x=> (x.productId,x.score)).toMap)
    }.collectAsMap()

  val simProductsMatrixBroadCast = sc.broadcast(simProductsMatrix)

  // Create the connection to Kafka
  val kafkaPara = Map(
    "bootstrap.servers" -> "localhost:9092",
    "key.deserializer" -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id" -> "recommender",
    "auto.offset.reset" -> "latest"
  )

  val kafkaStream = KafkaUtils.createDirectStream[String,String](ssc,LocationStrategies.PreferConsistent,ConsumerStrategies.Subscribe[String,String](Array(config("kafka.topic")),kafkaPara))

  // Message format: userId|productId|score|timestamp
  // Produce the rating stream
  val ratingStream = kafkaStream.map{
    case msg =>
      val attr = msg.value().split("\\|")
      (attr(0).toInt,attr(1).toInt,attr(2).toDouble,attr(3).toInt)
  }

  // Core real-time recommendation algorithm
  ratingStream.foreachRDD{ rdd =>
    rdd.map{
      case (userId,productId,score,timestamp) =>
        println(">>>>>>>>>>>>>>>>")

        // Get the user's most recent M product ratings
        val userRecentlyRatings = getUserRecentlyRating(MAX_USER_RATINGS_NUM,userId,ConnHelper.jedis)

        // Get the K products most similar to product P
        val simProducts = getTopSimProducts(MAX_SIM_PRODUCTS_NUM,productId,userId,simProductsMatrixBroadCast.value)

        // Compute the recommendation priority of each candidate product
        val streamRecs = computeProductScores(simProductsMatrixBroadCast.value,userRecentlyRatings,simProducts)

        // Save the results to MongoDB
        saveRecsToMongoDB(userId,streamRecs)

    }.count()
  }

  // Start the Streaming program
  ssc.start()
  ssc.awaitTermination()
}

5.3 Implementation of real-time recommendation algorithm

  • The premise of real-time recommendation algorithm:
    • The Redis cluster stores each user's K most recent product ratings, so the real-time algorithm can fetch them quickly.
    • The offline recommendation algorithm has calculated the product similarity matrix in MongoDB in advance.
    • Kafka has obtained real-time user rating data.
  • The algorithm process is as follows:
    • The input of the real-time recommendation algorithm is a rating <userId, productId, rate, timestamp>
    • Core elements of execution include:
      • Get the last K ratings of userId
      • Get the K most similar products with productId
      • Calculate the recommendation priority of candidate products
      • Update real-time recommendation results for userId

5.3.1 Get the user’s K latest ratings

  • When the business server receives a user rating, it inserts it by default into the queue corresponding to that user in Redis, in the format userId, productId, rate, timestamp. The real-time algorithm then only needs to read the contents of the corresponding queue.

5.3.2 Get the K most similar products to the current product

In the offline algorithm the product similarity matrix has already been computed, so the K products most similar to any product productId are easy to obtain: read the ProductRecs data from MongoDB, look up the sub-hash table corresponding to productId in simHash, and take the K products with the highest similarity. The output is an array of type Array[Int] representing the products most similar to productId; it is named candidateProducts and serves as the candidate product set.

5.3.3 Product recommendation priority calculation

  • For the candidate product set and the user's K most recent ratings (recentRatings), the recommendation priority is computed in computeProductScore, shown in the code body below.

  • There, getProductsSimScore obtains the similarity between a candidate product and an already rated product.

  • And log is a logarithmic operation, implemented here with base 10 (the common logarithm).

5.3.4 Save results to MongoDB

  • The saveDataToMongoDB function (called saveRecsToMongoDB in the earlier framework sketch) implements saving the results:

5.3.5 Update real-time recommendation results

  • Once the array updatedRecommends<productId, E> of candidate-product recommendation priorities has been computed, it is sent to the web backend server, where it is merged with userId's previous real-time recommendation result recentRecommends<productId, E>, duplicates are replaced, and the K products with the highest priority E are selected as the new real-time recommendation (a sketch follows this list). Specifically:
    • Merge: combine updatedRecommends and recentRecommends into one new <productId, E> array;
    • Replace (remove duplicates): when updatedRecommends and recentRecommends contain the same productId, the priority from recentRecommends is discarded, since it belongs to the previous recommendation, and is replaced by the priority of that productId from updatedRecommends;
    • Select Top-K: from the merged and deduplicated <productId, E> array, pick the K products with the highest recommendation priority as the final result of this real-time recommendation.
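
A minimal sketch of that backend-side merge logic, assuming hypothetical names updatedRecommends, recentRecommends and a parameter k; this is an illustration of the idea, not the project's actual web-backend code:

object RecsMerger {

  // Merge new and previous priorities; on duplicate productIds the new value wins, then keep the top k
  def mergeRecs(updatedRecommends: Seq[(Int, Double)],
                recentRecommends: Seq[(Int, Double)],
                k: Int): Seq[(Int, Double)] = {
    val merged = recentRecommends.toMap ++ updatedRecommends.toMap  // right-hand side replaces duplicates
    merged.toSeq.sortBy(-_._2).take(k)
  }

  def main(args: Array[String]): Unit = {
    val updated = Seq((101, 4.5), (102, 3.2))
    val recent  = Seq((101, 2.0), (103, 4.0), (104, 1.5))
    println(mergeRecs(updated, recent, 3))  // e.g. ((101,4.5), (103,4.0), (102,3.2))
  }
}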

Code body:

package com.atguigu.online

import com.mongodb.casbah.commons.MongoDBObject
import com.mongodb.casbah.{MongoClient, MongoClientURI}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import redis.clients.jedis.Jedis

// Connection helper object that holds the redis and mongodb connections
object ConnHelper extends Serializable{
  // Lazy vals, initialized only when first used
  lazy val jedis = new Jedis("localhost")
  lazy val mongoClient = MongoClient(MongoClientURI("mongodb://localhost:27017/recommender"))
}

case class MongoConfig( uri: String, db: String )

// Standard recommendation object
case class Recommendation( productId: Int, score: Double )
// A user's recommendation list
case class UserRecs( userId: Int, recs: Seq[Recommendation] )
// A product's similarity list
case class ProductRecs( productId: Int, recs: Seq[Recommendation] )

object OnlineRecommender {

  // Constants and collection names
  val MONGODB_RATING_COLLECTION = "Rating"
  val STREAM_RECS = "StreamRecs"
  val PRODUCT_RECS = "ProductRecs"

  val MAX_USER_RATING_NUM = 20
  val MAX_SIM_PRODUCTS_NUM = 20

  def main(args: Array[String]): Unit = {

    val config = Map(
      "spark.cores" -> "local[*]",
      "mongo.uri" -> "mongodb://localhost:27017/recommender",
      "mongo.db" -> "recommender",
      "kafka.topic" -> "recommender"
    )

    // Create the spark conf
    val sparkConf = new SparkConf().setMaster(config("spark.cores")).setAppName("OnlineRecommender")
    val spark = SparkSession.builder().config(sparkConf).getOrCreate()
    val sc = spark.sparkContext
    val ssc = new StreamingContext(sc, Seconds(2))

    import spark.implicits._
    implicit val mongoConfig = MongoConfig( config("mongo.uri"), config("mongo.db") )

    // Load the similarity matrix and broadcast it
    val simProductsMatrix = spark.read
      .option("uri", mongoConfig.uri)
      .option("collection", PRODUCT_RECS)
      .format("com.mongodb.spark.sql")
      .load()
      .as[ProductRecs]
      .rdd
      // Convert the data into a map so that similarity lookups are convenient
      .map{ item =>
        ( item.productId, item.recs.map( x=>(x.productId, x.score) ).toMap )
      }
      .collectAsMap()
    // Define the broadcast variable
    val simProcutsMatrixBC = sc.broadcast(simProductsMatrix)

    // Kafka configuration parameters
    val kafkaParam = Map(
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "recommender",
      "auto.offset.reset" -> "latest"
    )
    // Create a DStream
    val kafkaStream = KafkaUtils.createDirectStream[String, String]( ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String]( Array(config("kafka.topic")), kafkaParam )
    )
    // Transform kafkaStream into a rating stream: userId|productId|score|timestamp
    val ratingStream = kafkaStream.map{ msg =>
      val attr = msg.value().split("\\|")
      ( attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt )
    }

    // Core algorithm: define how the rating stream is processed
    ratingStream.foreachRDD{
      rdds => rdds.foreach{
        case ( userId, productId, score, timestamp ) =>
          println("rating data coming!>>>>>>>>>>>>>>>>>>")

          // TODO: core algorithm flow
          // 1. Get the user's recent ratings from redis, as an Array[(productId, score)]
          val userRecentlyRatings = getUserRecentlyRatings( MAX_USER_RATING_NUM, userId, ConnHelper.jedis )

          // 2. Get the products most similar to the current product from the similarity matrix, as a candidate Array[productId]
          val candidateProducts = getTopSimProducts( MAX_SIM_PRODUCTS_NUM, productId, userId, simProcutsMatrixBC.value )

          // 3. Compute the recommendation priority of each candidate product to get the user's real-time recommendation list, as an Array[(productId, score)]
          val streamRecs = computeProductScore( candidateProducts, userRecentlyRatings, simProcutsMatrixBC.value )

          // 4. Save the recommendation list to mongodb
          saveDataToMongoDB( userId, streamRecs )
      }
    }

    // Start streaming
    ssc.start()
    println("streaming started!")
    ssc.awaitTermination()

  }

  /**
    * Get the most recent num ratings from redis
    */
  import scala.collection.JavaConversions._
  def getUserRecentlyRatings(num: Int, userId: Int, jedis: Jedis): Array[(Int, Double)] = {
    // Read the rating data from the user's rating queue in redis;
    // the list key is userId:USERID and each value has the format PRODUCTID:SCORE
    jedis.lrange( "userId:" + userId.toString, 0, num )
      .map{ item =>
        val attr = item.split("\\:")
        ( attr(0).trim.toInt, attr(1).trim.toDouble )
      }
      .toArray
  }

  // Get the current product's similarity list and filter out products the user has already rated, as the candidate list
  def getTopSimProducts(num: Int,
                        productId: Int,
                        userId: Int,
                        simProducts: scala.collection.Map[Int, scala.collection.immutable.Map[Int, Double]])
                       (implicit mongoConfig: MongoConfig): Array[Int] ={
    // Get the current product's similarity list from the broadcast similarity matrix
    val allSimProducts = simProducts(productId).toArray

    // Get the products the user has already rated, filter them out, sort and output
    val ratingCollection = ConnHelper.mongoClient( mongoConfig.db )( MONGODB_RATING_COLLECTION )
    val ratingExist = ratingCollection.find( MongoDBObject("userId"->userId) )
      .toArray
      .map{ item => // only the productId is needed
        item.get("productId").toString.toInt
      }
    // Filter the already-rated products out of the similarity list
    allSimProducts.filter( x => ! ratingExist.contains(x._1) )
      .sortWith(_._2 > _._2)
      .take(num)
      .map(x=>x._1)
  }

  // Compute the recommendation score of each candidate product
  def computeProductScore(candidateProducts: Array[Int],
                          userRecentlyRatings: Array[(Int, Double)],
                          simProducts: scala.collection.Map[Int, scala.collection.immutable.Map[Int, Double]])
  : Array[(Int, Double)] ={
    // An ArrayBuffer that stores the base score of each candidate product, (productId, score)
    val scores = scala.collection.mutable.ArrayBuffer[(Int, Double)]()
    // Two maps that count high and low ratings per product, productId -> count
    val increMap = scala.collection.mutable.HashMap[Int, Int]()
    val decreMap = scala.collection.mutable.HashMap[Int, Int]()

    // For every candidate product, compute its similarity to every recently rated product
    for( candidateProduct <- candidateProducts; userRecentlyRating <- userRecentlyRatings ){
      // Get the similarity between the candidate product and the rated product from the similarity matrix
      val simScore = getProductsSimScore( candidateProduct, userRecentlyRating._1, simProducts )
      if( simScore > 0.4 ){
        // Weight according to the formula to get the base score
        scores += ( (candidateProduct, simScore * userRecentlyRating._2) )
        if( userRecentlyRating._2 > 3 ){
          increMap(candidateProduct) = increMap.getOrDefault(candidateProduct, 0) + 1
        } else {
          decreMap(candidateProduct) = decreMap.getOrDefault(candidateProduct, 0) + 1
        }
      }
    }

    // Compute the final recommendation priority according to the formula, grouping by productId first
    scores.groupBy(_._1).map{
      case (productId, scoreList) =>
        ( productId, scoreList.map(_._2).sum/scoreList.length + log(increMap.getOrDefault(productId, 1)) - log(decreMap.getOrDefault(productId, 1)) )
    }
      // Return the recommendation list sorted by score
      .toArray
      .sortWith(_._2>_._2)
  }

  def getProductsSimScore(product1: Int, product2: Int,
                          simProducts: scala.collection.Map[Int, scala.collection.immutable.Map[Int, Double]]): Double ={
    simProducts.get(product1) match {
      case Some(sims) => sims.get(product2) match {
        case Some(score) => score
        case None => 0.0
      }
      case None => 0.0
    }
  }

  // Custom log function with base N
  def log(m: Int): Double = {
    val N = 10
    math.log(m)/math.log(N)
  }

  // Write to mongodb
  def saveDataToMongoDB(userId: Int, streamRecs: Array[(Int, Double)])(implicit mongoConfig: MongoConfig): Unit ={
    val streamRecsCollection = ConnHelper.mongoClient(mongoConfig.db)(STREAM_RECS)
    // Query by userId, remove the old entry, and insert the new one
    streamRecsCollection.findAndRemove( MongoDBObject( "userId" -> userId ) )
    streamRecsCollection.insert( MongoDBObject( "userId" -> userId,
                                  "recs" -> streamRecs.map(x=>MongoDBObject("productId"->x._1, "score"->x._2)) ) )
  }

}

5.4 Real-time system joint debugging

  • The real-time recommendation data flow of our system is: business system -> log -> Flume log collection -> Kafka Streams data cleaning and preprocessing -> Spark Streaming computation. Once the code for the real-time recommendation service is complete, it should be jointly debugged with the other tools to make sure the whole system runs correctly.

5.4.1 Starting the basic components of a real-time system

  • Start the real-time recommendation system StreamingRecommender as well as mongodb and redis

5.4.2 Start zookeeper

bin/zkServer.sh start

5.4.3 Start kafka

bin/kafka-server-start.sh -daemon ./config/server.properties

5.4.4 Build Kafka Streaming program

  • Create a new module KafkaStreaming under recommender; it is mainly used to preprocess the log data and filter out the required content. Its pom.xml needs the corresponding dependencies.
  • Create a new java class com.atguigu.kafkastreaming.Application under src/main/java
package com.atguigu.kafkastreaming;

import java.util.Properties;

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.processor.TopologyBuilder;

public class Application {

    public static void main(String[] args){

        String brokers = "localhost:9092";
        String zookeepers = "localhost:2181";

        // Input and output topics
        String from = "log";
        String to = "recommender";

        // Kafka streaming configuration
        Properties settings = new Properties();
        settings.put(StreamsConfig.APPLICATION_ID_CONFIG, "logFilter");
        settings.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        settings.put(StreamsConfig.ZOOKEEPER_CONNECT_CONFIG, zookeepers);

        StreamsConfig config = new StreamsConfig(settings);

        // Topology builder
        TopologyBuilder builder = new TopologyBuilder();

        // Define the stream processing topology
        builder.addSource("SOURCE", from)
                .addProcessor("PROCESS", () -> new LogProcessor(), "SOURCE")
                .addSink("SINK", to, "PROCESS");

        KafkaStreams streams = new KafkaStreams(builder, config);
        streams.start();
    }
}
  • This program consumes the message stream from the "log" topic, processes it, and forwards it to the new "recommender" topic.
  • Stream processing program LogProcessor.java
package com.atguigu.kafkastreaming;

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;

public class LogProcessor implements Processor<byte[],byte[]> {

    private ProcessorContext context;

    public void init(ProcessorContext context) {
        this.context = context;
    }

    public void process(byte[] dummy, byte[] line) {
        String input = new String(line);
        // Filter log lines by prefix and extract the content after it
        if(input.contains("PRODUCT_RATING_PREFIX:")){
            System.out.println("product rating coming!!!!" + input);
            input = input.split("PRODUCT_RATING_PREFIX:")[1].trim();
            context.forward("logProcessor".getBytes(), input.getBytes());
        }
    }

    public void punctuate(long timestamp) {
    }

    public void close() {
    }
}
  • After completing the code, start the Application.

5.4.5 Configure and start flume

  • Create new log-kafka.properties in flume's conf directory and configure flume to connect to kafka:
agent.sources = exectail
agent.channels = memoryChannel
agent.sinks = kafkasink

# For each one of the sources, the type is defined
agent.sources.exectail.type = exec
# The path below is the absolute path of the log file to collect; change it to your own log directory
agent.sources.exectail.command = tail -f /mnt/d/Projects/BigData/ECommerceRecommenderSystem/businessServer/src/main/log/agent.log
agent.sources.exectail.interceptors=i1
agent.sources.exectail.interceptors.i1.type=regex_filter
# Regex matching the log-line prefix to keep
agent.sources.exectail.interceptors.i1.regex=.+PRODUCT_RATING_PREFIX.+
# The channel can be defined as follows.
agent.sources.exectail.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.kafka.topic = log
agent.sinks.kafkasink.kafka.bootstrap.servers = localhost:9092
agent.sinks.kafkasink.kafka.producer.acks = 1
agent.sinks.kafkasink.kafka.flumeBatchSize = 20

#Specify the channel the sink should use
agent.sinks.kafkasink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000

After configuration, start flume:
./bin/flume-ng agent -c ./conf/ -f ./conf/log-kafka.properties -n agent -Dflume.root.logger=INFO,console

5.4.6 Start the business system background

  • Add business code to the system. Note that in log4j.properties under src/main/resources/, the value of log4j.appender.file.File should be replaced with your own log directory, which should be the same as the configuration in flume.
  • Start the business system backend and visit localhost:8088/index.html; click on a product to rate and check whether the real-time recommendation list will change.

Chapter 6 Cold Start Problem Handling

  • The recommendation system relies heavily on users' preference information to recommend products, which creates a problem: newly registered users have no preference records, so at that point no reasonable recommendations can be produced and no recommended items appear.
  • This problem is usually solved by presenting an interactive window when the user logs in for the first time, letting the user check preset interest tags to capture their preferences.
    Once the user's preferences are obtained, recommendations for products of the corresponding types can be given directly (a minimal sketch follows).
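
A minimal sketch of this cold-start idea, assuming a hypothetical query against the Product collection in which tags are stored as an array of strings; the collection layout, field names and method signature are illustrative assumptions, not project code:

import com.mongodb.casbah.Imports._

object ColdStartRecommender {

  // Return up to num productIds whose tags overlap with the interest tags the new user checked
  def recommendByTags(selectedTags: Seq[String], num: Int): Seq[Int] = {
    val mongoClient = MongoClient(MongoClientURI("mongodb://localhost:27017/recommender"))
    val productCollection = mongoClient("recommender")("Product")

    // Query: { tags: { $in: [ ...selectedTags ] } }
    val query = MongoDBObject("tags" -> MongoDBObject("$in" -> selectedTags))
    productCollection.find(query)
      .limit(num)
      .toList
      .map(_.get("productId").toString.toInt)
  }
}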

Chapter 7 Other forms of offline similar recommendation services

7.1 Content-based similar recommendations

  • The tag file in the raw data contains the tags users have attached to products. This content is not easy to convert directly into ratings, but we can extract the tag content to obtain a content feature vector for each product and then compute a similarity matrix from it. This part can be plugged directly into the real-time recommendation system to compute products similar to the ones the user is currently rating, achieving content-based real-time recommendation. To reduce the influence of popular tags on feature extraction, we can also adjust tag weights with the TF-IDF algorithm so that the features stay as close to user preferences as possible.
  • Based on the above ideas, the core code for obtaining product feature vectors by adding the TF-IDF algorithm is as follows:
// Load the product data set
val productTagsDF = spark
  .read
  .option("uri",mongoConfig.uri)
  .option("collection",MONGODB_PRODUCT_COLLECTION)
  .format("com.mongodb.spark.sql")
  .load()
  .as[Product]
  .map(x => (x.productId, x.name, x.genres.map(c => if(c == '|') ' ' else c)))
  .toDF("productId", "name", "tags").cache()

// Instantiate a tokenizer; by default it splits on spaces
val tokenizer = new Tokenizer().setInputCol("tags").setOutputCol("words")

// Apply the tokenizer
val wordsData = tokenizer.transform(productTagsDF)

// Define a HashingTF tool
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(200)

// Apply HashingTF
val featurizedData = hashingTF.transform(wordsData)

// Define an IDF tool
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

// Fit on the term-frequency data to obtain the IDF model (document statistics)
val idfModel = idf.fit(featurizedData)

// Apply TF-IDF to obtain the new feature matrix
val rescaledData = idfModel.transform(featurizedData)

// Extract the feature vectors from the computed rescaledData
val productFeatures = rescaledData.map{
  case row => ( row.getAs[Int]("productId"), row.getAs[SparseVector]("features").toArray )
}
  .rdd
  .map(x => {
    (x._1, new DoubleMatrix(x._2) )
  })
  • The similarity matrix is then computed from the product feature vectors (a sketch of this step follows below), and similar recommendations can be shown on the product detail page; on e-commerce sites a similar-products list is usually displayed after a user browses a product or completes a purchase.
  • The obtained similarity matrix can also provide a basis for real-time recommendations and obtain a user recommendation list. It can be seen that the purpose of content-based and latent semantic models is to extract the feature vectors of items, so that the similarity matrix can be calculated. Our real-time recommendation system algorithm is defined based on similarity.
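
The similarity computation referenced above mirrors the pairwise cosine-similarity step already used in OfflineRecommender; a minimal sketch continuing from the productFeatures RDD obtained above (it reuses the consinSim helper, the Recommendation and ProductRecs case classes from earlier, and the output collection name is an assumption):

// Pair every product with every other product and compute cosine similarity
val productSimDF = productFeatures.cartesian(productFeatures)
  .filter{ case (a, b) => a._1 != b._1 }                 // skip a product paired with itself
  .map{ case (a, b) => ( a._1, ( b._1, consinSim(a._2, b._2) ) ) }
  .filter(_._2._2 > 0.4)                                 // keep only sufficiently similar pairs
  .groupByKey()
  .map{ case (productId, recs) =>
    ProductRecs( productId, recs.toList.sortWith(_._2 > _._2).map(x => Recommendation(x._1, x._2)) )
  }
  .toDF()

// Write the content-based similarity lists to MongoDB (collection name assumed)
productSimDF.write
  .option("uri", mongoConfig.uri)
  .option("collection", "ContentBasedProductRecs")
  .mode("overwrite")
  .format("com.mongodb.spark.sql")
  .save()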

7.2 Item-based collaborative filtering similar recommendation

  • Item-based collaborative filtering (Item-CF) only needs the user's everyday behavior data (such as clicks, favorites and purchases) to derive similarities between products, which makes it widely used in real projects.
  • Our overall idea is that if two products have the same audience (people of interest), then they are inherently related. Therefore, existing behavioral data can be used to analyze the similarity of product audiences, and then derive the similarity between products. We define this method as the "co-occurrence similarity" of items, and the formula is as follows:
    w(i, j) = |Ni ∩ Nj| / sqrt( |Ni| * |Nj| )
  • Among them, Ni is the list of users who purchased product i (or rated product i), and Nj is the list of users who purchased product j.
  • The core code is implemented as follows:
val ratingDF = spark.read
  .option("uri", mongoConfig.uri)
  .option("collection", MONGODB_RATING_COLLECTION)
  .format("com.mongodb.spark.sql")
  .load()
  .as[Rating]
  .map(x=> (x.userId, x.productId, x.score) )
  .toDF("userId", "productId", "rating")

// Count the number of ratings per product and add it to ratingDF with an inner join
val numRatersPerProduct = ratingDF.groupBy("productId").count()
val ratingWithCountDF = ratingDF.join(numRatersPerProduct, "productId")

// Pair the ratings by userId so we can count how often two products are rated by the same user
val joinedDF = ratingWithCountDF.join(ratingWithCountDF, "userId")
  .toDF("userId", "product1", "rating1", "count1", "product2", "rating2", "count2")
  .select("userId", "product1", "count1", "product2", "count2")
joinedDF.createOrReplaceTempView("joined")
val cooccurrenceDF = spark.sql(
  """
    |select product1
    |, product2
    |, count(userId) as coocount
    |, first(count1) as count1
    |, first(count2) as count2
    |from joined
    |group by product1, product2
  """.stripMargin
).cache()

val simDF = cooccurrenceDF.map{ row =>
  // Use the co-occurrence count and the individual counts to compute the co-occurrence similarity
  val coocSim = cooccurrenceSim( row.getAs[Long]("coocount"), row.getAs[Long]("count1"), row.getAs[Long]("count2") )
  ( row.getAs[Int]("product1"), ( row.getAs[Int]("product2"), coocSim ) )
}
  .rdd
  .groupByKey()
  .map{
    case (productId, recs) =>
      ProductRecs( productId,
        recs.toList
          .filter(x=>x._1 != productId)
          .sortWith(_._2>_._2)
          .map(x=>Recommendation(x._1,x._2))
          .take(MAX_RECOMMENDATION)
      )
  }
  .toDF()
  • Among them, the function code for calculating co-occurrence similarity is implemented as follows:
def cooccurrenceSim(cooCount: Long, count1: Long, count2: Long): Double ={
  cooCount / math.sqrt( count1 * count2 )
}

Chapter 8 Program Deployment and Operation

8.1 Publish project

  • Compile the project: execute the clean package phase of the root project

8.2 Install the front-end project

  • Unzip website-release.tar.gz to the /var/www/html directory and place the files inside in the root directory
  • Start the Apache server and visit http://IP:80

8.3 Install business server

  • Place BusinessServer.war in Tomcat's webapps directory and put the extracted files into the ROOT directory
  • Start the Tomcat server

8.4 Kafka configuration and startup

  • Start Kafka
  • Create two Topics in kafka, one for log and one for recommender
  • Start the kafkaStream program to format data between the log and recommender topics.
java -cp kafkastream.jar com.atguigu.kafkastream.Application linux:9092 linux:2181 log recommender
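
For reference, with a ZooKeeper-managed Kafka installation the two topics mentioned above can be created roughly as follows (host names follow the linux:9092 / linux:2181 convention used in this chapter; adjust partitions and replication to your cluster):

bin/kafka-topics.sh --create --zookeeper linux:2181 --replication-factor 1 --partitions 1 --topic log
bin/kafka-topics.sh --create --zookeeper linux:2181 --replication-factor 1 --partitions 1 --topic recommender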

8.5 Flume configuration and startup

  • Create log-kafka.properties in the conf folder in the flume installation directory
agent.sources = exectail
agent.channels = memoryChannel
agent.sinks = kafkasink

# For each one of the sources, the type is defined
agent.sources.exectail.type = exec
agent.sources.exectail.command = tail -f /home/bigdata/cluster/apache-tomcat-8.5.23/logs/catalina.out
agent.sources.exectail.interceptors=i1
agent.sources.exectail.interceptors.i1.type=regex_filter
agent.sources.exectail.interceptors.i1.regex=.+PRODUCT_RATING_PREFIX.+
# The channel can be defined as follows.
agent.sources.exectail.channels = memoryChannel

# Each sink's type must be defined
agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.kafka.topic = log
agent.sinks.kafkasink.kafka.bootstrap.servers = linux:9092
agent.sinks.kafkasink.kafka.producer.acks = 1
agent.sinks.kafkasink.kafka.flumeBatchSize = 20


#Specify the channel the sink should use
agent.sinks.kafkasink.channel = memoryChannel

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel(sink or source)
# can be defined as well
# In this case, it specifies the capacity of the memory channel
agent.channels.memoryChannel.capacity = 10000
  • Start flume
bin/flume-ng agent -c ./conf/ -f ./conf/log-kafka.properties -n agent

8.6 Deploy streaming computing service

  • Submit SparkStreaming program:
bin/spark-submit --class com.atguigu.streamingRecommender.StreamingRecommender streamingRecommender-1.0-SNAPSHOT.jar

8.7 Azkaban scheduling offline algorithm

  • Create scheduling project
  • Create two job files:
  • Azkaban-stat.job:
type=command
command=/home/bigdata/cluster/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --class com.atguigu.statisticsRecommender.StatisticsApp statisticsRecommender-1.0-SNAPSHOT.jar
  • Azkaban-offline.job:
type=command
command=/home/bigdata/cluster/spark-2.1.1-bin-hadoop2.7/bin/spark-submit --class com.atguigu.offline.RecommenderTrainerApp offlineRecommender-1.0-SNAPSHOT.jar
  • Package the job files into a ZIP and upload it to Azkaban
  • Set the scheduled execution time for each task separately
  • After the definition is completed, click Scheduler.

Origin blog.csdn.net/Lenhart001/article/details/131623970