An example of Spark-based collaborative filtering using ALS

I have been learning Spark recently. There are many ALS examples on the Internet, but most of them are the same one. I decided to write an example myself and get it running end to end, with real results.

1. Data set preparation:

Go to http://grouplens.org/datasets/movielens/ to download the movie rating data, and check the README for an introduction to the dataset.

Preprocess the data:

 

cat u1.base | awk -F "\t" '{print $1"::"$2"::"$3"::"$4}' > ratings.data  
cat u1.test | awk -F "\t" '{print $1"::"$2"::"$3"::"$4}' > test.data  
cat u.item | awk -F "|" '{print $1"::"$2"::"$3}' > movies.data  
cp u.user user.data  

 The result is as follows:

 

 

[root@hongboVM ml-100k]# head -10 ratings.data
1::1::5::874965758
1::2::3::876893171
1::3::4::878542960
1::4::3::876893119
1::5::3::889751712
1::7::4::875071561
1::8::1::875072484
1::9::5::878543541
1::11::2::875072262
1::13::5::875071805

 

[root@hongboVM ml-100k]# head -10 movies.data
1::Toy Story (1995)::01-Jan-1995
2::GoldenEye (1995)::01-Jan-1995
3::Four Rooms (1995)::01-Jan-1995
4::Get Shorty (1995)::01-Jan-1995
5::Copycat (1995)::01-Jan-1995
6::Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)::01-Jan-1995
7::Twelve Monkeys (1995)::01-Jan-1995
8::Babe (1995)::01-Jan-1995
9::Dead Man Walking (1995)::01-Jan-1995
10::Richard III (1995)::22-Jan-1996

 

[root@hongboVM ml-100k]# head -10 user.data
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703

 

  Upload the data files to HDFS, e.g. with hdfs dfs -put ratings.data test.data movies.data user.data /data/.

2. Processing approach:

  First train an ALS model on the ratings data, then use the model to make predictions and print the recommendations.
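The core of the pipeline is parsing each "::"-separated line into a rating. The parsing step can be tried in plain Scala, independent of Spark (a small sketch; parseRating is a hypothetical helper name):

```scala
// Parse one line of ratings.data ("user::item::rating::timestamp")
// using the same pattern match the Spark job applies to every line.
def parseRating(line: String): (Int, Int, Double) =
  line.split("::") match {
    case Array(user, item, rate, _) => (user.toInt, item.toInt, rate.toDouble)
  }

val r = parseRating("1::13::5::875071805")
// r == (1, 13, 5.0)
```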

3. The code is as follows:

 

package com.bohai.mllib

import org.apache.log4j.{Level, Logger}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.{SparkConf, SparkContext}

object MoviesRecommondNew {
  def main(args: Array[String]) {
    //Suppress Spark's log output so the results printed to the console are easy to read
    //A better solution is to configure log4j so that Spark writes its logs to a file
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("MoviesRecommondNew")
    val sc = new SparkContext(conf)
    //ratings.data data
    val data = sc.textFile("/data/ratings.data")
    val test_data = sc.textFile("/data/test.data")
    // Note the delimiter here
    val user_data = sc.textFile("/data/user.data").map(x => x.split("[|]") match {
      case Array(userId, age, gender, occupation, zipCode) => Users(userId.toInt, age.toInt, gender)
    })
    val movie_data = sc.textFile("/data/movies.data").map(x => x.split("::"))

    println("rate data count is : " + data.count())
    val ratings = data.map(x => x.split("::") match {
      case Array(user, item, rate, ts) => Rating(user.toInt, item.toInt, rate.toDouble)
    })

    val test_ratings = test_data.map(x => x.split("::") match {
      case Array(user, item, rate, ts) => Rating(user.toInt, item.toInt, rate.toDouble)
    })
    println("test rate data count is : " +test_ratings.count())
    //println("test rate data is : " + test_ratings.take(2))

    val userIds = user_data.map(_.id)

    println("user data is : " + userIds.count())

    //Generate (movieID, movieName) pairs so that a movie name can be looked up by its ID
    val movieDataMap = movie_data.map(x => (x(0).toInt,x(1))).collectAsMap()
    //broadcast the map to all executors
    val bMovieDataMap = sc.broadcast(movieDataMap)

    val rank = 10
    val numIterations = 10
    //the last argument is the regularization parameter lambda
    val model = ALS.train(ratings, rank, numIterations, 0.001)
    val usersProducts = ratings.map { case Rating(user, prod, rate) => (user, prod) }
    val predictions = model.predict(usersProducts).map { case Rating(user, product, rate) => ((user, product), rate) }
    val ratesAndPreds = ratings.map { case Rating(user, product, rate) => ((user, product), rate) }.join(predictions)
    val MSE = ratesAndPreds.map { case ((user, product), (r1, r2)) =>
      val err = r1 - r2
      err * err
    }.mean()
    println(s"Mean squared Error = $MSE")


    val userID = 384
    val moviesForUser = ratings.keyBy(_.user).lookup(userID)
    println(s"Movies reviewed by user $userID:\n")
    for (movieID <- moviesForUser.map(f => f.product)) {
      //movie_data.filter{x  => x(0).toInt == movieID}.map(x => x(1)).collect().foreach(println)
      println(bMovieDataMap.value.getOrElse(movieID,""))
    }

    println(s"Movies recommended for user $userID:\n")
    val recommendProds:Array[Rating] = model.recommendProducts(userID,20)
    for (recommend <- recommendProds) {
      //println(recommend.user + "," + recommend.product + "," + recommend.rating)
      //movie_data.filter{x  => x(0).toInt == recommend.product}.map(x => x(1)).collect().foreach(println)
      println(bMovieDataMap.value.getOrElse(recommend.product,""))
    }

    println("Recommend 10 movies for each user:\n")
    val allRecommendations = model.recommendProductsForUsers(10).map {
      case (userId, recommends) =>
        (userId, recommends.map(_.product).mkString("::"))
    }
    allRecommendations.take(10).foreach(println)
  }

  //Case classes, available for Spark SQL implicit conversions
  case class Ratings(userId: Int, movieId: Int, rating: Int)

  case class Movies(id: Int, moveTitle: String, releaseDate: String)

  case class Users(id: Int, age: Int, gender: String)

}
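For quick experiments without a standalone cluster, the same code can run locally by setting a master on the SparkConf (a hypothetical tweak, not part of the job submitted below):

```scala
import org.apache.spark.SparkConf

// Local-mode variant: use all local cores instead of a standalone master,
// so the job can be run from an IDE or a single machine
val conf = new SparkConf()
  .setAppName("MoviesRecommondNew")
  .setMaster("local[*]")
```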

   Submit the job to Spark:

 

 

spark-submit --master spark://172.4.23.99:7077 --num-executors 4 --executor-cores 2 --class com.bohai.mllib.MoviesRecommondNew  ./simple-project_2.10-1.0.jar

 The results are as follows:

 

 

rate data count is : 80000                                                      
test rate data count is : 20000
user data is : 943
Mean squared Error = 0.44838904095188975                                        
Movies reviewed by user 384:

Contact (1997)
Starship Troopers (1997)
English Patient, The (1996)
Evita (1996)
Air Force One (1997)
L.A. Confidential (1997)
Titanic (1997)
As Good As It Gets (1997)
Cop Land (1997)
Conspiracy Theory (1997)
Desperate Measures (1998)
Game, The (1997)
Tomorrow Never Dies (1997)
That Darn Cat! (1997)
Peacemaker, The (1997)
Cats Don't Dance (1997)
Movies recommended for user 384:

Amos & Andrew (1993)
I'm Not Rappaport (1996)
Ruling Class, The (1972)
Amateur (1994)
Englishman Who Went Up a Hill, But Came Down a Mountain, The (1995)
To Live (Huozhe) (1994)
Stupids, The (1996)
Cemetery Man (Dellamorte Dellamore) (1994)
Eye for an Eye (1996)
M. Butterfly (1993)
Crooklyn (1994)
8 1/2 (1963)
Herbie Rides Again (1974)
City of Lost Children, The (1995)
Vanya on 42nd Street (1994)
Afterglow (1997)
Addiction, The (1995)
Die xue shuang xiong (Killer, The) (1989)
Haunted World of Edward D. Wood Jr., The (1995)
Mute Witness (1994)
Recommend 10 movies for each user:

(656,1313::998::1480::1206::149::253::1411::1451::974::401)                     
(692,1192::1058::1286::1425::1483::703::1113::1404::960::753)
(932,1313::57::1643::947::601::1128::1131::954::1224::965)
(772,1131::860::1192::1512::1205::1129::967::1128::57::445)
(324,1019::1192::904::1022::982::1262::320::1404::786::1298)
(180,1426::394::1195::1184::793::1389::1069::1208::1245::1120)
(340,860::1192::974::998::1131::1273::440::1178::296::1483)
(320,1286::1056::115::1425::320::916::534::962::960::703)
(752,1286::1643::115::906::800::1160::296::1129::767::1049)
(744,1131::1380::624::1211::1137::782::860::1192::1113::630)

 Summary: (1) the downloaded data is tab- or "|"-separated and needs to be preprocessed; (2) to look up a movieName by movieID, we collect the (movieID, movieName) pairs into a map and broadcast it, which improves lookup performance; (3) the error is only measured against the training ratings here — computing the error against the test set is left for later.
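Point (3) could be filled in by predicting on the held-out test set instead of the training ratings. A sketch, assuming model and test_ratings from the code above are in scope (not run here):

```scala
// Predict ratings for the (user, movie) pairs in the test set,
// join them with the actual test ratings, and average the squared error
val testUserProducts = test_ratings.map { case Rating(user, product, _) => (user, product) }
val testPredictions = model.predict(testUserProducts)
  .map { case Rating(user, product, rate) => ((user, product), rate) }
val testMSE = test_ratings
  .map { case Rating(user, product, rate) => ((user, product), rate) }
  .join(testPredictions)
  .map { case (_, (actual, predicted)) => val err = actual - predicted; err * err }
  .mean()
println(s"Test Mean Squared Error = $testMSE")
```

Note that model.predict silently drops pairs whose user or product never appeared in training, so the join only covers pairs the model can actually score.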

refer to:

http://blog.csdn.net/oopsoom/article/details/34462329

http://blog.javachen.com/2015/06/01/how-to-implement-collaborative-filtering-using-spark-als.html

 
