Things to note when training an ALS model with PySpark

Collaborative filtering is a recommendation technique that suggests new products to users based on their interactions with, and ratings of, existing products.

The appeal of collaborative filtering is that it only requires a series of user/product interaction records as input. The interactions can be explicit (such as leaving a rating on a shopping site) or implicit (such as visiting a product page without rating anything). From these interactions alone, a collaborative filtering algorithm can learn which products are similar to one another (because the same users interact with them) and which users are similar to one another, and from that it can make new recommendations.

Alternating least squares

MLlib's ml package contains an implementation of alternating least squares (ALS), a commonly used collaborative filtering algorithm that scales well to clusters. It lives in the ml.recommendation.ALS class.

ALS learns a feature vector for each user and each product, such that the dot product of a user's vector and a product's vector is close to that user's rating of the product.
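Concretely, a predicted rating is just the dot product of the two learned vectors. A minimal sketch in plain Python (the factor values here are made up for illustration; real factors come from the trained model):

```python
# Hypothetical learned factors for one user and two items (rank = 3).
user_vec = [0.9, 0.1, 0.4]
item_a = [1.0, 0.0, 0.5]  # an item this user should like
item_b = [0.0, 1.0, 0.1]  # an item this user should not like

def predict_rating(user, item):
    # ALS approximates a rating by the dot product of the two vectors.
    return sum(u * i for u, i in zip(user, item))

print(predict_rating(user_vec, item_a))  # ~1.1, a relatively high score
print(predict_rating(user_vec, item_b))  # ~0.14, a relatively low score
```

Training alternates between solving for all user vectors with the item vectors held fixed and solving for all item vectors with the user vectors held fixed, which is where the "alternating" in alternating least squares comes from.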

It accepts several parameters, listed below:

- rank: the size of the feature vectors. Larger vectors produce a better model but cost more to compute. Default 10.
- maxIter: the number of iterations to run. Default 10.
- regParam: the regularization parameter (lambda). Default 0.01.
- alpha: a constant used to compute confidence in implicit ALS. Default 1.0.
- numUserBlocks, numItemBlocks: the number of blocks to split the user and product data into, which controls the degree of parallelism. You can watch stage times in the Spark UI and raise or lower the parallelism accordingly; as a rule of thumb, aim for each stage's tasks to finish within about 20 seconds.
- implicitPrefs: false for explicit preference data, true for implicit preference data. Default false (explicit).
- coldStartStrategy: the cold-start strategy used during prediction. Default "nan"; "drop" is also available.

Rating matrix

To use the RDD-based ALS API, you need an RDD of mllib.recommendation.Rating objects, each containing a user id, a product id, and a rating. (The DataFrame-based ml API, used in the code below, instead takes a DataFrame with user, item, and rating columns.)

One practical challenge is that ALS requires each id to be a 32-bit integer. If your ids are strings or larger numbers, you can simply use a hash of each id; even if two users or products occasionally collide on the same id, the overall results can still be good.

Another approach is to broadcast() a table that maps each product id to a unique integer id.
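As a sketch of the hashing approach (plain Python; to_als_id is a hypothetical helper name, not a Spark API):

```python
import zlib

def to_als_id(raw_id):
    # Map an arbitrary string id to a non-negative 32-bit integer.
    # zlib.crc32 is stable across runs and Python versions, unlike the
    # built-in hash(), so the same raw id always maps to the same ALS id.
    # Masking off the sign bit keeps the result in [0, 2**31).
    return zlib.crc32(str(raw_id).encode("utf-8")) & 0x7FFFFFFF

print(to_als_id("user_abc"))                           # a stable 31-bit integer
print(to_als_id("user_abc") == to_als_id("user_abc"))  # True: deterministic
```

Such a function could be applied with a Spark UDF before feeding the data to ALS; as noted above, occasional collisions usually do not hurt the overall results much.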

Model results

In the RDD-based API, ALS returns a MatrixFactorizationModel object as its result. You can call predict() on an RDD of (userId, productId) pairs to predict their ratings, or model.recommendProducts(userId, numProducts) to get the numProducts top recommendations for a given user.

Note that unlike most other models in MLlib, a MatrixFactorizationModel is large: it stores one vector for every user and every product. It therefore cannot simply be serialized to disk and read back by another program. However, the feature-vector data inside the model can be saved to a distributed file system; in the ml API these are exposed as model.userFactors and model.itemFactors.

Finally, ALS comes in two variants: explicit ratings (the default) and implicit feedback. When used with explicit ratings, each user's rating of a product is a score (for example, 1 to 5 stars), and the model predicts scores. When used with implicit feedback, each rating represents a confidence that the user will interact with a given product (for example, the value might grow with the number of times the user visits a product page), and the model predicts confidence values as well.

from pyspark.ml.recommendation import ALS

# Assumes an active SparkSession named spark; model_save_path is defined
# elsewhere in the job.
ratings = spark.sql("""
        select
          user_id, item_id, rating
        from dmb_dev.dmb_dev_als_model_rating_matrix
        """).repartition(3600)
train_data, test_data = ratings.randomSplit([0.9, 0.1], seed=4226)
train_data.cache()      

# Implicit-feedback data
als = ALS() \
    .setImplicitPrefs(True) \
    .setAlpha(0.7) \
    .setMaxIter(20) \
    .setRank(10) \
    .setRegParam(0.01) \
    .setNumBlocks(30) \
    .setUserCol("user_id") \
    .setItemCol("item_id") \
    .setRatingCol("rating") \
    .setColdStartStrategy("drop")
print(als.explainParams())
 
als_model = als.fit(train_data)
als_model.write().overwrite().save(model_save_path)
 
# Factor vectors for every user (U) and item (I) in the training set
als_model.userFactors.withColumnRenamed("id", "user_id")\
    .write.format("orc").mode("overwrite")\
    .saveAsTable("dev.dev_als_model_all_trained_users_factor_result")
als_model.itemFactors.withColumnRenamed("id", "item_id")\
    .write.format("orc").mode("overwrite")\
    .saveAsTable("dev.dev_als_model_all_trained_item_factor_result")


Reprinted from: blog.csdn.net/eylier/article/details/132305122