[Machine Learning] Principles of the collaborative filtering algorithm and an example based on Spark

Table of Contents

 

Collaborative filtering

Types of collaborative filtering

Evaluation method of collaborative filtering

Cold start problem

Implementation of collaborative filtering algorithm in Spark


Collaborative filtering

Collaborative filtering, abbreviated as the CF algorithm, is a form of "collective computing": it uses the preferences of a large number of existing users to estimate a user's preference for items they have not yet touched. Its core idea is the measurement of similarity.

Collaborative filtering is commonly applied in recommendation systems. These techniques aim to fill in the missing entries of the user-item rating matrix: users and items are described by a small number of known factors, which are then used to predict the missing values.

Types of collaborative filtering

  1. In the user-based collaborative filtering (User CF) method, two users are considered to have similar interests if they show similar preferences, that is, if their preferences for the same items are roughly the same. Items that one of them likes but the other has not yet touched can then be recommended.

For example, the preference matrix of users A, B, and C is as follows:

| User/Item | Item A | Item B | Item C | Item D | Item E |
| --------- | ------ | ------ | ------ | --------- | ------ |
| User A    | √      |        | √      | √         |        |
| User B    |        | √      |        |           | √      |
| User C    | √      |        | √      | recommend |        |

Because user A likes items A, C, and D, while user C likes items A and C, the algorithm considers user A and user C to be similar, and therefore recommends user A's item D to user C.
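As a toy illustration of the User CF idea (plain Java rather than Spark, and user B's preferences are filled in only so that the example has a dissimilar user): represent each user by the set of items they like, measure similarity as the Jaccard overlap of those sets, and recommend the items liked by the most similar user that the target user has not touched yet.

import java.util.*;

public class UserCFSketch {

	// Jaccard similarity: |intersection| / |union| of two item sets
	static double jaccard(Set<String> a, Set<String> b) {
		Set<String> inter = new HashSet<>(a);
		inter.retainAll(b);
		Set<String> union = new HashSet<>(a);
		union.addAll(b);
		return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
	}

	public static void main(String[] args) {
		Map<String, Set<String>> likes = new HashMap<>();
		likes.put("User A", new HashSet<>(Arrays.asList("Item A", "Item C", "Item D")));
		likes.put("User B", new HashSet<>(Arrays.asList("Item B", "Item E")));
		likes.put("User C", new HashSet<>(Arrays.asList("Item A", "Item C")));

		// Find the user most similar to User C
		String target = "User C";
		String mostSimilar = null;
		double best = -1.0;
		for (String other : likes.keySet()) {
			if (other.equals(target)) continue;
			double sim = jaccard(likes.get(target), likes.get(other));
			if (sim > best) { best = sim; mostSimilar = other; }
		}

		// Recommend what the most similar user likes but the target has not seen
		Set<String> recs = new HashSet<>(likes.get(mostSimilar));
		recs.removeAll(likes.get(target));
		System.out.println("Recommend to " + target + ": " + recs); // [Item D]
	}
}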

  2. Recommendations can also be made with the item-based (Item CF) method. This method computes similarities between items from existing user preferences or ratings of those items: items rated the same way by similar users are considered more similar. Once the item-item similarities are known, a user can be represented by the items they have already touched; the items most similar to those known items are then found and recommended to the user. Likewise, the items similar to a user's existing items are used to produce a composite score, and that score estimates the user's preference for each unknown item.

For example, suppose the items are classified by type:

| Item name | Type   |
| --------- | ------ |
| Item A    | Type A |
| Item B    | Type B |
| Item C    | Type C |
| Item D    | Type A |
| Item E    | Type B |

User purchase history matrix:

| User/Item | Item A | Item B | Item C | Item D | Item E |
| --------- | ------ | ------ | ------ | --------- | ------ |
| User A    | √      |        |        | √         |        |
| User B    |        | √      |        |           | √      |
| User C    | √      |        |        | recommend |        |

Items A and D both belong to Type A. User C has purchased item A, and user A has purchased the similar item D, so item D is recommended to user C.
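A matching toy sketch of the Item CF flow (again plain Java; a real system would derive item-item similarity from user ratings or co-occurrence statistics, whereas here, mirroring the classification table above, two items count as similar when they share a type):

import java.util.*;

public class ItemCFSketch {
	public static void main(String[] args) {
		Map<String, String> typeOf = new HashMap<>();
		typeOf.put("Item A", "Type A");
		typeOf.put("Item B", "Type B");
		typeOf.put("Item C", "Type C");
		typeOf.put("Item D", "Type A");
		typeOf.put("Item E", "Type B");

		// User C is represented by the items they have already purchased
		Set<String> purchased = new HashSet<>(Collections.singletonList("Item A"));

		// Recommend unpurchased items that are similar (same type) to a purchased one
		List<String> recs = new ArrayList<>();
		for (Map.Entry<String, String> item : typeOf.entrySet()) {
			boolean similar = purchased.stream()
					.anyMatch(p -> typeOf.get(p).equals(item.getValue()));
			if (similar && !purchased.contains(item.getKey())) {
				recs.add(item.getKey());
			}
		}
		System.out.println("Recommend to User C: " + recs); // [Item D]
	}
}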

Evaluation method of collaborative filtering

(1) Mean absolute error (MAE)

Mean Absolute Error (MAE), a statistical accuracy metric, is widely used to evaluate the recommendation quality of collaborative filtering systems. The evaluation works as follows: the recommender first predicts the user's scores, and the MAE is then the average deviation between those predictions and the user's actual scores in the test set.

Assuming the predicted scores are {p1, p2, …, pn} and the actual scores are {q1, q2, …, qn}, MAE is calculated as:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| p_i - q_i \right|$$

(2) Root mean square error (RMSE)

RMSE is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( p_i - q_i \right)^2}$$
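Both metrics are straightforward to compute once the predictions are available; a minimal sketch with made-up scores:

public class EvalSketch {
	public static void main(String[] args) {
		double[] p = { 4.0, 3.5, 5.0, 2.0 }; // predicted ratings (made-up numbers)
		double[] q = { 4.5, 3.0, 4.0, 2.0 }; // actual test-set ratings (made-up numbers)

		double absSum = 0.0, sqSum = 0.0;
		for (int i = 0; i < p.length; i++) {
			double diff = p[i] - q[i];
			absSum += Math.abs(diff); // accumulates |p_i - q_i| for MAE
			sqSum += diff * diff;     // accumulates (p_i - q_i)^2 for RMSE
		}
		double mae = absSum / p.length;
		double rmse = Math.sqrt(sqSum / p.length);
		System.out.println("MAE = " + mae + ", RMSE = " + rmse);
	}
}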

Cold start problem

When a product has just launched and new users arrive, their interests and hobbies cannot be predicted because there is no behavioral data for them in the application. A newly shelved item runs into the same cold start problem: no browsing, clicking, or purchasing behavior has been collected for it yet, so there is no basis for recommending it.

Implementation of collaborative filtering algorithm in Spark

The collaborative filtering algorithm in Spark ML is based on alternating least squares (ALS).

The ALS algorithm belongs to User-Item CF, also called hybrid CF: it considers both the User side and the Item side, and Spark uses it to learn the unknown latent factors. The relationship between users and products can be abstracted as the triple <User, Item, Rating>, where Rating is the user's score for the product and represents the user's preference for it.

Assume we have a batch of user data containing m Users and n Items. We define an m × n rating matrix R, whose element $r_{ui}$ represents the rating given by the u-th User to the i-th Item.

In practice, m and n are both very large, so the scale of the R matrix easily exceeds hundreds of millions of entries, a volume that traditional matrix factorization methods struggle to handle. On the other hand, no user can possibly rate every product, so R is destined to be a sparse matrix.

 

In practice, User and Item each have many dimensions: a User has a name, gender, age, and so on, while an Item has a name, place of origin, category, and so on. For the computation, the R matrix is mapped onto these dimensions, in a manner analogous to singular value decomposition:

$$R_{m \times n} \approx U_{m \times k} \times V_{k \times n}$$

Usually k is much smaller than m and n, which is what achieves the dimensionality reduction. The data fed to the algorithm carries no dimension information; the dimensions are merely assumed to exist, so these hidden dimensions are also called latent factors.
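To make the factorization concrete, here is the standard regularized ALS objective (stated as background; the document itself does not spell it out): each User u gets a factor vector $x_u \in \mathbb{R}^k$ and each Item i a vector $y_i \in \mathbb{R}^k$, a missing rating is predicted as $\hat{r}_{ui} = x_u^{\top} y_i$, and the factors are chosen to minimize

$$\min_{X,Y} \sum_{(u,i)\,:\,r_{ui}\,\text{known}} \left( r_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$

The "alternating" in the name refers to the solution strategy: fix $Y$ and solve an ordinary least-squares problem for $X$, then fix $X$ and solve for $Y$, and repeat until convergence. Each step has a closed-form solution, which is what makes the method easy to parallelize.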

Explicit feedback

 Explicit feedback is expressed as users’ direct ratings of Items.

Implicit feedback

Rating a product is a very simple user behavior, but in reality there are many other user behaviors that also indirectly reflect user preferences, such as purchase records, search keywords, and even mouse movements. We call these indirect behaviors implicit feedback, to distinguish them from explicit feedback such as ratings.

Implicit feedback has the following characteristics:

1. No negative feedback. Users generally ignore the products they don't like rather than giving them negative reviews.

2. Implicit feedback contains a lot of noise. For example, a TV may be playing a certain program at a certain time while the user has fallen asleep or simply forgotten to change the channel.

3. Explicit feedback expresses the user's preference, while implicit feedback reflects confidence. For example, a user's favorite work may be a movie, yet what they watch longest is a TV series; rice is bought often and in large quantities, yet it may not be the food the user most wants to eat.

4. Implicit feedback is very difficult to quantify.
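Because of these characteristics, the implicit-feedback variant of ALS (the formulation of Hu, Koren and Volinsky, which Spark's implicitPrefs mode follows) does not fit the raw observation counts $r_{ui}$ directly. It instead derives a binary preference $p_{ui}$ and a confidence weight $c_{ui}$ from them, which is where the alpha parameter described below comes in:

$$p_{ui} = \begin{cases} 1, & r_{ui} > 0 \\ 0, & r_{ui} = 0 \end{cases} \qquad c_{ui} = 1 + \alpha\, r_{ui}$$

$$\min_{X,Y} \sum_{u,i} c_{ui} \left( p_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$

A behavior observed more often (a larger $r_{ui}$) does not signal a stronger preference, only greater confidence that the preference $p_{ui} = 1$ is real, which is exactly the preference-versus-confidence distinction in point 3 above.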

1) Spark collaborative filtering algorithm

ALS parameters (a configuration sketch follows the list):

  1. numBlocks: the number of blocks the users and items matrices are partitioned into; the blocks are computed in parallel to speed up the computation. The default is 10.
  2. rank: the number of latent factors in the model. The default is 10.
  3. maxIter: the maximum number of iterations. The default is 10.
  4. regParam: the regularization parameter. The default is 0.1.
  5. implicitPrefs: whether to use implicit feedback. The default is false, i.e. explicit feedback; true switches to implicit feedback.
  6. alpha: only takes effect when implicit feedback is used; it controls the baseline confidence in observed preferences. The default is 1.0.
  7. nonnegative: whether to apply a nonnegativity constraint in ALS. The default is false, i.e. no constraint.
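The sketch below shows how these parameters map onto the Spark ML ALS builder; the values are simply the defaults listed above, and the column names are assumptions about the input Dataset's schema:

import org.apache.spark.ml.recommendation.ALS;

public class ALSConfigSketch {
	public static void main(String[] args) {
		ALS als = new ALS()
				.setNumBlocks(10)        // parallelism of the factorization
				.setRank(10)             // size of the latent factor vectors
				.setMaxIter(10)          // maximum number of iterations
				.setRegParam(0.1)        // regularization parameter
				.setImplicitPrefs(false) // explicit feedback; true switches to implicit
				.setAlpha(1.0)           // baseline confidence, only used with implicit feedback
				.setNonnegative(false)   // no nonnegativity constraint on the factors
				.setUserCol("userId")    // assumed input schema: userId, movieId, rating
				.setItemCol("movieId")
				.setRatingCol("rating");
		// Print every parameter with its documentation and current value
		System.out.println(als.explainParams());
	}
}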

Example:

Download the input data from https://github.com/apache/spark/tree/master/data/mllib/als/sample_movielens_ratings.txt.

import java.io.File;
import java.io.Serializable;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.evaluation.RegressionEvaluator;
import org.apache.spark.ml.recommendation.ALS;
import org.apache.spark.ml.recommendation.ALSModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ALStest {

	// Rating bean for one line of sample_movielens_ratings.txt, whose format is
	// userId::movieId::rating::timestamp (this class mirrors Spark's JavaALSExample).
	public static class Rating implements Serializable {
		private int userId;
		private int movieId;
		private float rating;
		private long timestamp;

		public Rating() {}

		public Rating(int userId, int movieId, float rating, long timestamp) {
			this.userId = userId;
			this.movieId = movieId;
			this.rating = rating;
			this.timestamp = timestamp;
		}

		public int getUserId() { return userId; }
		public int getMovieId() { return movieId; }
		public float getRating() { return rating; }
		public long getTimestamp() { return timestamp; }

		public static Rating parseRating(String str) {
			String[] fields = str.split("::");
			if (fields.length != 4) {
				throw new IllegalArgumentException("Each line must contain 4 fields");
			}
			return new Rating(Integer.parseInt(fields[0]), Integer.parseInt(fields[1]),
					Float.parseFloat(fields[2]), Long.parseLong(fields[3]));
		}
	}

	public static void main(String[] args) {
		String rootDir = System.getProperty("user.dir") + File.separator;
		String fileResourcesDir = rootDir + "resources" + File.separator;

		// Path to the training data
		String filePath = fileResourcesDir + "sample_movielens_ratings.txt";

		SparkSession spark = SparkSession.builder().master("local[4]").appName("test").getOrCreate();
		JavaRDD<Rating> ratingsRDD = spark.read().textFile(filePath).javaRDD().map(Rating::parseRating);
		Dataset<Row> ratings = spark.createDataFrame(ratingsRDD, Rating.class);
		Dataset<Row>[] splits = ratings.randomSplit(new double[] { 0.8, 0.2 });
		Dataset<Row> training = splits[0];
		Dataset<Row> test = splits[1];

		// Train an ALS model on the training data
		ALS als = new ALS().setMaxIter(5).setRegParam(0.01).setUserCol("userId").setItemCol("movieId")
				.setRatingCol("rating");
		ALSModel model = als.fit(training);

		// Drop cold-start data, i.e. rows whose prediction would be NaN
		model.setColdStartStrategy("drop");
		// Predict on the test data
		Dataset<Row> predictions = model.transform(test);
		// Evaluate the model with the root mean square error
		RegressionEvaluator evaluator = new RegressionEvaluator().setMetricName("rmse").setLabelCol("rating")
				.setPredictionCol("prediction");
		double rmse = evaluator.evaluate(predictions);

		// Recommend the top 10 movies for every user
		Dataset<Row> userRecs = model.recommendForAllUsers(10);
		// Recommend the top 10 users for every movie
		Dataset<Row> movieRecs = model.recommendForAllItems(10);

		// Recommend 10 movies for a subset of users
		Dataset<Row> users = ratings.select(als.getUserCol()).distinct().limit(3);
		Dataset<Row> userSubsetRecs = model.recommendForUserSubset(users, 10);

		// Recommend 10 users for a subset of movies
		Dataset<Row> movies = ratings.select(als.getItemCol()).distinct().limit(3);
		Dataset<Row> movieSubSetRecs = model.recommendForItemSubset(movies, 10);

		training.show(false);
		predictions.show(false);
		userRecs.show(false);
		movieRecs.show(false);
		System.out.println("RMSE = " + rmse);

		spark.stop();
	}
}
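A note on the setColdStartStrategy("drop") call above: randomSplit can place a user or a movie in the test set that never occurs in the training set. ALS has no factor vectors for such rows, so transform() produces NaN predictions for them, which would in turn make the RMSE NaN; the "drop" strategy removes those rows before evaluation.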

 

Source: blog.csdn.net/henku449141932/article/details/111993750