Collaborative Filtering Algorithm
Load the MovieLens sample dataset that ships with Spark, in which each row contains a user, a movie, the user's rating for that movie, and a timestamp. We use the ALS estimator's fit() method with explicit feedback (implicitPrefs defaults to false) to build the recommendation model, and evaluate the model by the RMSE of its rating predictions.
Import the required packages:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
Define the data structure used to read the data
Create a Rating type with fields of types [Int, Int, Float, Long], then write a function that converts each line of the data into a Rating instance.
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}
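As a quick check of the parsing step, the sketch below runs parseRating on one line in the userId::movieId::rating::timestamp layout. The sample line is an assumed example of that layout, and parseRatingOpt is a hypothetical helper (not part of the original walkthrough) that returns None for malformed lines instead of failing.

```scala
import scala.util.Try

// Same definitions as above, repeated so the snippet runs on its own.
case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}

// Hypothetical helper: returns None for lines that do not have four
// fields or whose fields cannot be converted to numbers.
def parseRatingOpt(str: String): Option[Rating] = str.split("::") match {
  case Array(u, m, r, t) => Try(Rating(u.toInt, m.toInt, r.toFloat, t.toLong)).toOption
  case _                 => None
}

// Assumed sample line in the userId::movieId::rating::timestamp layout.
val sample = "0::2::3::1424380312"
val parsed = parseRating(sample)
println(parsed)                      // prints Rating(0,2,3.0,1424380312)
println(parseRatingOpt("bad line"))  // prints None
```

The strict version is fine for a clean sample file; the Option-returning variant may be preferable when the input can contain broken lines.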
Read data:
Import the implicits, read the MovieLens dataset, and convert each line into a Rating:
import spark.implicits._
val ratings = spark.sparkContext
  .textFile("file:///home/hduser/spark/data/mllib/als/sample_movielens_ratings.txt")
  .map(parseRating)
  .toDF()
Then print the data:
ratings.show()
Build the model
Divide the MovieLens dataset into a training set and a test set:
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
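The split above is random, so the RMSE reported later can change from run to run. A minimal sketch of a seeded split follows, assuming a local SparkSession and a small made-up table; the names session, toyRatings, toyTraining and toyTest, and the seed value 42, are chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for a standalone run. In spark-shell the
// existing `spark` session could be used instead.
val session = SparkSession.builder()
  .master("local[*]")
  .appName("SeededSplitSketch")
  .getOrCreate()
import session.implicits._

// Made-up ratings, used only to demonstrate the call.
val toyRatings = (0 until 100)
  .map(i => (i % 10, i % 7, (i % 5 + 1).toFloat))
  .toDF("userId", "movieId", "rating")

// Passing a seed makes the split repeatable for the same input data
// and partitioning.
val Array(toyTraining, toyTest) = toyRatings.randomSplit(Array(0.8, 0.2), seed = 42L)
println(s"training = ${toyTraining.count()}, test = ${toyTest.count()}")
```

The weights are only approximate targets, so the two counts will usually be close to, but not exactly, 80 and 20.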
Use ALS to build the recommendation models. Here we build two: one with explicit feedback and one with implicit feedback:
val alsExplicit = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
val alsImplicit = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setImplicitPrefs(true)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")
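For reference, the two variants can be written as the following optimization problems. This is the standard textbook formulation, not a transcription of Spark's source: the symbols are assumptions introduced here, with x_u and y_i the latent factor vectors of user u and item i, r_ui the observed rating, lambda corresponding to regParam, and alpha to the alpha parameter of the implicit variant. Spark's implementation additionally scales the regularization term by the number of ratings per user and per item.

```latex
% Explicit feedback: fit only the observed ratings (u, i) \in \Omega
\min_{X, Y} \sum_{(u,i) \in \Omega} \left( r_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)

% Implicit feedback: fit binary preferences p_{ui} with confidence weights c_{ui}
\min_{X, Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2
  + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right),
\qquad
p_{ui} = \begin{cases} 1 & r_{ui} > 0 \\ 0 & \text{otherwise} \end{cases},
\qquad
c_{ui} = 1 + \alpha \, r_{ui}
```

In the implicit case the model therefore predicts a preference strength rather than a rating.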
The implementation in spark.ml has the following parameters:
numBlocks is the number of blocks the users and items are partitioned into in order to parallelize computation (default 10).
rank is the number of latent factors in the model (default 10).
maxIter is the maximum number of iterations (default 10).
regParam is the regularization parameter for ALS (default 0.1).
implicitPrefs determines whether to use the explicit-feedback ALS variant or the variant adapted for implicit-feedback data (default false, that is, explicit feedback).
alpha is a parameter of the implicit-feedback variant that sets the baseline confidence in preference observations (default 1.0).
nonnegative determines whether to apply non-negativity constraints to the least squares problem (default false).
These parameters can be tuned to reduce the root mean square error. For example, increasing maxIter or decreasing regParam usually lowers the error on the training data, although too small a regParam can overfit and make the error on the test set worse.
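The parameters listed above map onto setter methods of the ALS estimator. The sketch below sets each of them explicitly; the values are arbitrary illustrations, not tuned recommendations, and alsCustom is a name chosen here. The coldStartStrategy setting is not in the list above: it is an additional option (available since Spark 2.2) that drops rows whose prediction would be NaN because the user or movie was not seen during training.

```scala
import org.apache.spark.ml.recommendation.ALS

// Illustrative values only; none of these are tuned for MovieLens.
val alsCustom = new ALS()
  .setNumBlocks(10)             // sets both the user and the item block count
  .setRank(20)                  // number of latent factors
  .setMaxIter(10)               // maximum number of iterations
  .setRegParam(0.05)            // regularization parameter
  .setImplicitPrefs(false)      // explicit feedback
  .setAlpha(1.0)                // only used when implicitPrefs is true
  .setNonnegative(false)        // no non-negativity constraint
  .setColdStartStrategy("drop") // assumption: drop NaN predictions at evaluation time
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

// Print every parameter with its documentation and current value.
println(alsCustom.explainParams())
```

Building the estimator does not start any computation; training only happens when fit() is called on a DataFrame.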
Next, train the recommendation model on the training data:
val modelExplicit = alsExplicit.fit(training)
val modelImplicit = alsImplicit.fit(training)
Model prediction
Use the trained recommendation models to predict ratings for the user-movie pairs in the test set, producing a dataset of predicted ratings:
val predictionsExplicit = modelExplicit.transform(test)
val predictionsImplicit = modelImplicit.transform(test)
Print the results to compare the actual ratings with the predicted ones:
predictionsExplicit.show()
predictionsImplicit.show()
Model evaluation
Each model is evaluated by computing the root mean square error (RMSE) of its predictions. The smaller the RMSE, the more accurate the model:
val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
val rmseExplicit = evaluator.evaluate(predictionsExplicit)
val rmseImplicit = evaluator.evaluate(predictionsImplicit)
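To make the metric concrete, here is a plain Scala sketch of the RMSE formula, the square root of the mean of the squared differences between actual and predicted ratings. It is independent of Spark; the function name rmse and the toy pairs are made up for illustration and are not how RegressionEvaluator is implemented internally.

```scala
// RMSE over (actual, predicted) pairs.
def rmse(pairs: Seq[(Double, Double)]): Double = {
  require(pairs.nonEmpty, "need at least one (actual, predicted) pair")
  val squaredErrors = pairs.map { case (actual, predicted) =>
    math.pow(actual - predicted, 2)
  }
  math.sqrt(squaredErrors.sum / pairs.size)
}

// Made-up ratings and predictions: squared errors are 0.25, 0.25 and 4.0,
// so the mean is 1.5 and the RMSE is sqrt(1.5), roughly 1.22.
val toyPairs = Seq((4.0, 3.5), (2.0, 2.5), (5.0, 3.0))
println(rmse(toyPairs))
```

Because the errors are squared before averaging, a single large miss (the third pair here) dominates the result.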
Print out the root mean squared error for both models:
println(s"Explicit:Root-mean-square error = $rmseExplicit")
println(s"Implicit:Root-mean-square error = $rmseImplicit")
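It can be seen that the root mean square errors of the two models are about 1.69 and 1.80. Because the dataset in this example is small, there is a noticeable gap between the predicted and actual ratings. Note also that the implicit model predicts a preference strength rather than a rating, so its RMSE against the ratings is only a rough comparison.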