Spark ML - Collaborative Filtering

http://ihoge.cn/2018/ML1.html

Collaborative Filtering Algorithm

This example uses the MovieLens sample dataset that ships with Spark. Each row contains a user, a movie, the user's rating for that movie, and a timestamp. We build the recommendation model with the ALS estimator from spark.ml, which uses explicit feedback by default (implicitPrefs is false), and evaluate the model by the RMSE of its rating predictions.

Import the required packages:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

Define the record type used to read the data

Create a Rating type with fields of type Int, Int, Float and Long, then write a function that converts each line of the data into a Rating.

case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}
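As a quick check of the parsing step, the sketch below runs the same function on a single line. The sample line is an assumption about the file format (four fields separated by "::"), and parseRatingSafe is a hypothetical helper, not part of the original, that returns None for malformed lines instead of failing.

```scala
import scala.util.Try

case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)

// Same logic as in the text: split on "::" and convert the four fields.
def parseRating(str: String): Rating = {
  val fields = str.split("::")
  assert(fields.size == 4)
  Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
}

// Hypothetical variant: a line with the wrong number of fields or with
// non-numeric fields yields None instead of throwing.
def parseRatingSafe(str: String): Option[Rating] = {
  val fields = str.split("::")
  if (fields.length != 4) None
  else Try(Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)).toOption
}

// Illustrative input line (assumed format: userId::movieId::rating::timestamp).
val sample = "0::2::3::1424380312"
val parsed = parseRating(sample)

// A line with only three fields is rejected by the safe variant.
val bad = parseRatingSafe("0::2::3")
```

With the safe variant, the read step could use flatMap(parseRatingSafe) in place of map(parseRating) so that malformed lines are skipped rather than aborting the job.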

Read data:

Import the implicit conversions, read the MovieLens dataset, and convert each line into a Rating:

import spark.implicits._
val ratings = spark.sparkContext.textFile("file:///home/hduser/spark/data/mllib/als/sample_movielens_ratings.txt").map(parseRating).toDF()

Then print the data:

ratings.show()

Build the model

Split the MovieLens dataset into a training set and a test set:

val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
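Without a seed, randomSplit draws a different split on every run, so the RMSE reported later varies between runs. A minimal sketch of a reproducible split, assuming the ratings DataFrame from above; the helper name splitRatings and the seed value 42 are illustrative choices, not part of the original:

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: split 80/20 with a fixed seed so that repeated runs
// on the same input data and partitioning give the same two sets.
def splitRatings(ratings: DataFrame, seed: Long = 42L): (DataFrame, DataFrame) = {
  val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2), seed)
  (training, test)
}
```

It would be called as val (training, test) = splitRatings(ratings).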

Use ALS to build the recommendation models. Here we build two: one with explicit feedback and one with implicit feedback.

val alsExplicit = new ALS().setMaxIter(5).setRegParam(0.01).setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
val alsImplicit = new ALS().setMaxIter(5).setRegParam(0.01).setImplicitPrefs(true).setUserCol("userId").setItemCol("movieId").setRatingCol("rating")

The ALS implementation in spark.ml has the following parameters:

  • numBlocks is the number of blocks that users and items are partitioned into to parallelize computation (default 10).
  • rank is the number of latent factors in the model (default 10).
  • maxIter is the maximum number of iterations (default 10).
  • regParam is the regularization parameter for ALS (default 0.1 in the spark.ml code, although the Spark guide lists 1.0).
  • implicitPrefs determines whether to use the explicit-feedback variant of ALS or the variant adapted for implicit-feedback data (default false, that is, explicit feedback).
  • alpha applies only to the implicit-feedback variant of ALS and sets the baseline confidence in the observed preferences (default 1.0).
  • nonnegative determines whether to apply non-negativity constraints to the least squares problem (default false).

    These parameters can be tuned to reduce the root mean square error. For example, increasing maxIter or decreasing regParam usually lowers the error on the training data, but too small a regParam can overfit, so settings should be compared by their error on held-out data.
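One way to compare parameter settings on held-out data is Spark's TrainValidationSplit. The following is a minimal sketch rather than the original author's method: the helper name tuneAls and the candidate values in the grid are illustrative assumptions, and setColdStartStrategy assumes Spark 2.2 or later.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.ml.tuning.{ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: choose rank and regParam by RMSE on a validation
// subset of the training data instead of by training error.
def tuneAls(training: DataFrame): ALSModel = {
  val als = new ALS()
    .setMaxIter(5)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
    .setColdStartStrategy("drop") // assumes Spark 2.2+; keeps NaN rows out of the metric

  // Candidate values are illustrative, not recommendations.
  val grid = new ParamGridBuilder()
    .addGrid(als.rank, Array(5, 10))
    .addGrid(als.regParam, Array(0.01, 0.1, 1.0))
    .build()

  val evaluator = new RegressionEvaluator()
    .setMetricName("rmse")
    .setLabelCol("rating")
    .setPredictionCol("prediction")

  val tvs = new TrainValidationSplit()
    .setEstimator(als)
    .setEstimatorParamMaps(grid)
    .setEvaluator(evaluator)
    .setTrainRatio(0.8) // 80% of the input is used to fit, 20% to validate

  // For "rmse" the evaluator treats smaller values as better, so bestModel
  // is the candidate with the lowest validation RMSE.
  tvs.fit(training).bestModel.asInstanceOf[ALSModel]
}
```

The grid here has six combinations, each fitted once, so the cost grows quickly with more candidate values.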

Next, train the recommendation model on the training data:

val modelExplicit = alsExplicit.fit(training)
val modelImplicit = alsImplicit.fit(training)

Model prediction

Use the trained models to predict ratings for the user-movie pairs in the test set, which produces a dataset of predicted ratings for each model:

val predictionsExplicit = modelExplicit.transform(test)
val predictionsImplicit = modelImplicit.transform(test)
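One caveat with a random split: the test set can contain users or movies that never appear in the training set, and the model has no factors for them. ALS returns NaN predictions for such rows by default, which makes the RMSE NaN as well. Spark 2.2 added the coldStartStrategy parameter to drop those rows. A minimal sketch, assuming Spark 2.2 or later and the column names used above; the helper name is hypothetical:

```scala
import org.apache.spark.ml.recommendation.{ALS, ALSModel}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: same explicit-feedback settings as in the text, but
// rows for users or movies unseen in training are dropped from the output
// of transform() instead of receiving a NaN prediction.
def fitDroppingColdStart(training: DataFrame): ALSModel =
  new ALS()
    .setMaxIter(5)
    .setRegParam(0.01)
    .setUserCol("userId")
    .setItemCol("movieId")
    .setRatingCol("rating")
    .setColdStartStrategy("drop")
    .fit(training)
```

The trade-off is that the metric is then computed only over the rows the model can score, so it says nothing about new users or movies.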

Let's output the results and compare the actual results with the predicted results:

predictionsExplicit.show()
predictionsImplicit.show()

Model evaluation

The model is evaluated by calculating the root mean square error of the model. The smaller the root mean square error, the more accurate the model:

val evaluator = new RegressionEvaluator().setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")
val rmseExplicit = evaluator.evaluate(predictionsExplicit)
val rmseImplicit = evaluator.evaluate(predictionsImplicit)
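For reference, the RMSE over a set of (actual, predicted) pairs is the square root of the mean of the squared differences. The pure-Scala sketch below computes that quantity on a small made-up list; the function name rmse and the numbers are illustrative only, and it is not a replacement for RegressionEvaluator on a DataFrame.

```scala
// Root mean square error over (actual, predicted) pairs.
def rmse(pairs: Seq[(Double, Double)]): Double = {
  require(pairs.nonEmpty, "need at least one pair")
  val meanSquaredError =
    pairs.map { case (actual, predicted) => math.pow(actual - predicted, 2) }.sum / pairs.size
  math.sqrt(meanSquaredError)
}

// Made-up example: the errors are 1, 0 and -2, so the mean squared error
// is (1 + 0 + 4) / 3 = 5/3 and the RMSE is its square root.
val example = rmse(Seq((3.0, 2.0), (4.0, 4.0), (1.0, 3.0)))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.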

Print out the root mean squared error for both models:

println(s"Explicit:Root-mean-square error = $rmseExplicit")
println(s"Implicit:Root-mean-square error = $rmseImplicit")

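In this run, the root mean square errors of the two models are about 1.69 and 1.80. Because the dataset in this example is small, there is a noticeable gap between the predicted and actual ratings. The implicit-feedback model predicts a preference strength rather than a rating, so its RMSE against the ratings is not directly comparable with that of the explicit model.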
