[Movie Recommendation System] Real-time Recommendation

Overview

Technical solutions:

  • Log collection service: use Flume-ng to collect users' movie-rating events from the business platform and send them to the Kafka cluster in real time.
  • Message buffering service: the project uses Kafka as a buffer for the streaming data; it accepts the events collected by Flume and pushes them on to the real-time recommendation part of the project.
  • Real-time recommendation service: the project uses Spark Streaming as the real-time recommendation engine. It consumes the data buffered in Kafka, runs the recommendation algorithm over each rating event, and merges the updated results into the MongoDB database.

1 Implementation ideas

How do we implement this?

  1. First install Redis, which stores each user's K most recent ratings.
  2. Install ZooKeeper and Kafka, both in standalone mode.
  3. Jointly test Kafka and Spark Streaming: Kafka produces a message, Spark Streaming consumes it, generates a recommendation from the data in Redis and MongoDB, and stores the result in MongoDB.
  4. Write the event-tracking code in the business system: during local testing write to a local file, then write to a log file on the cloud server for the remote test.
  5. Write the Flume configuration file, create the two Kafka topics, and test the whole pipeline.

2 Environment preparation

2.1 Redis installation
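A minimal install-and-verify sketch (the version and paths are illustrative, not necessarily what the project used):

wget https://download.redis.io/releases/redis-6.2.6.tar.gz
tar -xzf redis-6.2.6.tar.gz && cd redis-6.2.6
make
src/redis-server redis.conf &
src/redis-cli ping   # a healthy server replies PONG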

2.2 ZooKeeper standalone installation

  • Version: 3.7.1

  • Pitfall encountered: ZooKeeper's AdminServer listens on port 8080 by default, which was already occupied on this machine. Add admin.serverPort=8001 to the zoo.cfg file and restart.
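A sketch of the relevant zoo.cfg lines (dataDir is illustrative):

# zoo.cfg
tickTime=2000
dataDir=/opt/zookeeper/data
clientPort=2181
# move the AdminServer off the occupied port 8080
admin.serverPort=8001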

2.3 Kafka standalone installation

  • Start the broker
bin/kafka-server-start.sh config/server.properties
  • Create a topic
bin/kafka-topics.sh --create --zookeeper 127.0.0.1:2181 --replication-factor 1 --partitions 1 --topic recommender
  • Produce a message
bin/kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic recommender
  • Consume a message
bin/kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 --topic recommender --from-beginning

3 Kafka and Spark Streaming integration test

  • Kafka version: 2.2.0
  • Spark version: 2.3.0
  • Therefore use spark-streaming-kafka-0-10 (see the dependency sketch below)
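A Maven dependency sketch matching these versions (assuming Scala 2.11 builds, the default for Spark 2.3.0):

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.3.0</version>
</dependency>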


  1. Start Kafka and produce a message
  2. Write the consumer program
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

// Kafka connection parameters
    val kafkaParam = Map(
      "bootstrap.servers" -> "<server IP>:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "recommender",
      "auto.offset.reset" -> "latest"
    )
    // Create a DStream from Kafka
    val kafkaStream = KafkaUtils.createDirectStream[String, String]( ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String]( Array(config("kafka.topic")), kafkaParam )
    )

    // Convert the raw data UID|MID|SCORE|TIMESTAMP into a rating stream,
    // e.g. 1|31|4.5|<timestamp>
    val ratingStream = kafkaStream.map{
      msg =>
        val attr = msg.value().split("\\|")
        ( attr(0).toInt, attr(1).toInt, attr(2).toDouble, attr(3).toInt )
    }
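For context, a minimal sketch of the StreamingContext setup around the snippet above (the master, app name, and 2-second batch interval are assumptions, not the project's exact values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// assumed setup; adjust master/appName/batch interval to your deployment
val sparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamingRecommender")
val ssc = new StreamingContext(sparkConf, Seconds(2))

// ... build kafkaStream / ratingStream and the processing logic here ...

ssc.start()
ssc.awaitTermination()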
  3. If Kafka reports a connection error and you are also on a cloud server, pay close attention to Kafka's listener configuration (very important!)

(1) Solution: modify the Kafka configuration file, setting listeners to the internal (private) IP and advertised.listeners to the external (public) IP, as sketched below

(2) Restart the broker; success
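A sketch of the relevant server.properties lines (the addresses are placeholders):

# server.properties
listeners=PLAINTEXT://<internal IP>:9092
advertised.listeners=PLAINTEXT://<public IP>:9092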

  4. Redis error: protected mode is enabled; redis.conf needs to be modified (see the sketch below)
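One way to lift the restriction in redis.conf for a test environment (insecure on a public network; in production prefer keeping protected mode and setting a password):

# redis.conf — test environment only
protected-mode no
bind 0.0.0.0
# safer alternative: keep protected-mode yes and set requirepass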

Effect

Produce one message in Kafka, and the recommended movie results appear in MongoDB.
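For example, feed one message in the UID|MID|SCORE|TIMESTAMP format through the console producer (the values are illustrative):

bin/kafka-console-producer.sh --broker-list 127.0.0.1:9092 --topic recommender
> 1|31|4.5|1564412033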

4 Back-end event tracking (buried points)

After the front end submits a rating, the click event triggers the back-end tracking code, which uses log4j to write the event to a local file.

4.1 Local testing

  • log4j configuration file
log4j.rootLogger=INFO, file, stdout

# write to stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS}  %5p --- [%50t]  %-80c(line:%5L)  :  %m%n


# write to file
log4j.appender.file=org.apache.log4j.RollingFileAppender
# note: property names are case-sensitive and must match the appender name "file"
log4j.appender.file.Append=true
log4j.appender.file.Threshold=INFO
log4j.appender.file.File=F:/demoparent/business/src/main/log/agent.txt
log4j.appender.file.MaxFileSize=1024KB
log4j.appender.file.MaxBackupIndex=1
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS}  %5p --- [%50t]  %-80c(line:%6L)  :  %m%n
  • Tracking implementation
// Tracking log
import org.apache.log4j.Logger;

// Key code. MOVIE_RATING_PREFIX is a string constant; its value must contain
// the literal text matched by the Flume regex filter in section 5.
Logger log = Logger.getLogger(MovieController.class.getName());
log.info(MOVIE_RATING_PREFIX + ":" + uid + "|" + mid + "|" + score + "|" + System.currentTimeMillis() / 1000);

4.2 Remote write test

  1. Install a syslog service on the Linux server for testing
  2. Set the server IP in the host's log4j configuration file
  • log4j configuration: write to the remote server
log4j.appender.syslog=org.apache.log4j.net.SyslogAppender
log4j.appender.syslog.SyslogHost=<server IP>
log4j.appender.syslog.Threshold=INFO
log4j.appender.syslog.layout=org.apache.log4j.PatternLayout
log4j.appender.syslog.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss,SSS}  %5p --- [%20t]  %-130c:(line:%4L)  :   %m%n
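log4j's SyslogAppender sends UDP datagrams to port 514 by default, so the server's syslog daemon must be listening there; a sketch of the legacy rsyslog directives (also remember to add syslog to log4j.rootLogger):

# /etc/rsyslog.conf — accept UDP syslog on the default port 514
$ModLoad imudp
$UDPServerRun 514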

5 Flume configuration

  1. Connect Flume to Kafka: write the Flume agent configuration file
  2. Set the Flume source and sink: the source tails the log file, and the sink is a Kafka topic
# log-kafka.properties
agent.sources = exectail
agent.channels = memoryChannel
agent.sinks = kafkasink

agent.sources.exectail.type = exec
agent.sources.exectail.command = tail -f /project/logs/agent.log
agent.sources.exectail.interceptors = i1
agent.sources.exectail.interceptors.i1.type = regex_filter
agent.sources.exectail.interceptors.i1.regex = .+MOVIE_RATING_PREFIX.+
agent.sources.exectail.channels = memoryChannel

agent.sinks.kafkasink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkasink.kafka.topic = log
agent.sinks.kafkasink.kafka.bootstrap.servers = <server IP>:9092
agent.sinks.kafkasink.kafka.producer.acks = 1
agent.sinks.kafkasink.kafka.flumeBatchSize = 20
agent.sinks.kafkasink.channel = memoryChannel

agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
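The sink above writes to a second topic named log (the two topics mentioned in step 5), which can be created the same way as recommender:

bin/kafka-topics.sh --create --zookeeper 127.0.0.1:2181 --replication-factor 1 --partitions 1 --topic log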

6 Real-time recommendation

ratingStream.foreachRDD{
  rdds => rdds.foreach{
    case (uid, mid, score, timestamp) => {
      println("rating data coming! >>>>>>>>>>>>>>>>")
      println(uid + ",mid:" + mid)
      // 1. Get the user's K most recent ratings from Redis, as Array[(mid, score)]
      val userRecentlyRatings = getUserRecentlyRating( MAX_USER_RATINGS_NUM, uid, ConnHelper.jedis )
      println("user's K most recent ratings: " + userRecentlyRatings)
      // 2. Take the N movies most similar to the current movie from the similarity
      //    matrix as the candidate list, Array[mid]
      val candidateMovies = getTopSimMovies( MAX_SIM_MOVIES_NUM, mid, uid, simMovieMatrixBroadCast.value )
      println("N most similar movies: " + candidateMovies)
      // 3. Compute a recommendation priority for each candidate movie, giving the
      //    current user's real-time recommendation list, Array[(mid, score)]
      val streamRecs = computeMovieScores( candidateMovies, userRecentlyRatings, simMovieMatrixBroadCast.value )
      println("user's real-time recommendation list: " + streamRecs)
      // 4. Save the recommendations to MongoDB
      saveDataToMongoDB( uid, streamRecs )
    }
  }
}
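The helpers above (getUserRecentlyRating, getTopSimMovies, saveDataToMongoDB) come from the full project; as an illustration, a sketch of getUserRecentlyRating, assuming the ratings live in a Redis list keyed uid:<uid> with entries formatted <mid>:<score> (the key layout is an assumption):

import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

// Sketch only: assumes a Redis list "uid:<uid>" holding "<mid>:<score>" entries
def getUserRecentlyRating(num: Int, uid: Int, jedis: Jedis): Array[(Int, Double)] = {
  // take the num most recent entries from the head of the list
  jedis.lrange("uid:" + uid, 0, num - 1).asScala.map{ item =>
    val attr = item.split("\\:")
    ( attr(0).trim.toInt, attr(1).trim.toDouble )
  }.toArray
}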
def computeMovieScores(candidateMovies: Array[Int],
                       userRecentlyRatings: Array[(Int, Double)],
                       simMovies: scala.collection.Map[Int, scala.collection.immutable.Map[Int, Double]]): Array[(Int, Double)] = {
  // An ArrayBuffer holding the base score of each candidate movie
  val scores = scala.collection.mutable.ArrayBuffer[(Int, Double)]()
  // HashMaps holding each candidate movie's boost and penalty counters
  val increMap = scala.collection.mutable.HashMap[Int, Int]()
  val decreMap = scala.collection.mutable.HashMap[Int, Int]()

  for( candidateMovie <- candidateMovies; userRecentlyRating <- userRecentlyRatings){
    // similarity between the candidate movie and the recently rated movie
    val simScore = getMoviesSimScore( candidateMovie, userRecentlyRating._1, simMovies )

    if(simScore > 0.7){
      // base recommendation score of the candidate movie
      scores += ( (candidateMovie, simScore * userRecentlyRating._2) )
      if( userRecentlyRating._2 > 3 ){
        increMap(candidateMovie) = increMap.getOrElse(candidateMovie, 0) + 1
      } else {
        decreMap(candidateMovie) = decreMap.getOrElse(candidateMovie, 0) + 1
      }
    }
  }
  // group by candidate mid and apply the formula to get the final score
  scores.groupBy(_._1).map{
    // after groupBy the data looks like Map( mid -> ArrayBuffer[(mid, score)] )
    case (mid, scoreList) =>
      ( mid, scoreList.map(_._2).sum / scoreList.length + log(increMap.getOrElse(mid, 1)) - log(decreMap.getOrElse(mid, 1)) )
  }.toArray.sortWith(_._2 > _._2)
}
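In formula form: for a candidate movie $c$, let $S_c$ be the set of recent ratings whose similarity to $c$ exceeds 0.7, and let $I_c$ and $D_c$ count the ratings in $S_c$ above and at-or-below 3 (each defaulting to 1 when empty, so its log term vanishes). The code then computes

$$\mathrm{rec}(c) = \frac{1}{|S_c|} \sum_{(r,\,R_r) \in S_c} \mathrm{sim}(c, r)\cdot R_r \;+\; \log I_c \;-\; \log D_c$$

where log is the project's small logarithm helper (commonly base 10 in this kind of project; an assumption here).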

7 Boot sequence

  1. Start the Hadoop and Spark containers
  • cd /docker
  • docker-compose up -d
  • docker-compose ps
  2. Start the MongoDB and Redis services
  • netstat -lanp | grep "27017"
  • bin/redis-server etc/redis.conf
  3. Start the ZooKeeper and Kafka services
  • ./zkServer.sh start
  • bin/kafka-server-start.sh config/server.properties
  4. Start the Flume service
  • bin/flume-ng agent -c ./conf/ -f ./conf/log-kafka.properties -n agent

Achieved effect

After the front end submits a rating successfully, the event is written to the log file. Flume picks it up from the log file and hands it to Kafka without issue; Spark Streaming processes each received message, computes recommendations, and stores them in MongoDB.


Summary

Due to time pressure, this write-up is a bit hasty. If you need the front-end design code and the back-end code, leave me a comment; I will tidy them up and push them to GitHub.

The front-end design has not been done in detail, and the pages will be polished later. As an undergraduate I integrated this with a management system, but there is no time for that now. In short, it took a bit more than a week to reproduce the system from back then, which served as a good review.

During development I ran into many problems: version conflicts, the server's internal vs. external network, Docker container issues, and the design of the collaborative filtering algorithm. On the upside, it forced me to review Vue and SpringBoot.

When you run into problems

  • When you hit a problem, do not thrash at it blindly; calm down, read the error message, and think about why it was raised.
  • Versions matter enormously, so it is best to pin the versions in the project's pom.
  • Building the cluster with docker-compose on a server is fast and simple, but the networking involved (such as port forwarding) takes some patience.
  • Vue-CLI plus Element-UI makes the front end easy to develop.
  • Agree on the interfaces before writing code, otherwise the follow-up becomes very confusing...

Follow-up

  • Optimize the front-end pages and design more features
  • Improve the recommendation algorithm
  • Add a cold-start scheme

Origin: blog.csdn.net/weixin_40433003/article/details/132051071