Introduction to Recommendation Systems

1. Recommender System Architecture

Classification of recommendation systems by working principle:

  • Social recommendation
  • Content-based recommendation
  • Popularity-based recommendation
  • Collaborative-filtering-based recommendation

Recommender system architecture:

  • Front-end interface
  • Data (Lambda architecture)
  • Business knowledge
  • Algorithms

Big data Lambda architecture: the Lambda architecture combines real-time data with data precomputed by Hadoop into a hybrid platform, providing a near-real-time view of the data.


Layered architecture:

  • Batch layer:
    • Data is immutable, any computation can be performed, and it scales horizontally
    • High latency
    • Log collection: Flume
    • Distributed storage: Hadoop
    • Distributed computing: Hadoop MapReduce & Spark
    • View storage: NoSQL, Redis, MySQL
  • Real-time processing layer:
    • Stream processing, continuous computation
    • Stores and analyzes data within a certain window
    • Eventual accuracy
    • Real-time data collection (message middleware): Flume & Kafka
    • Real-time data analysis (stream-computing frameworks): Spark Streaming / Storm / Flink
  • Service layer:
    • Supports random reads
    • Must return results within a very short time
    • Reads and merges the results of the batch layer and the real-time layer (see the sketch below)
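A minimal sketch of the service-layer merge, under the assumption that both views are simple per-user dictionaries (the function name and data here are hypothetical, for illustration only):

def merged_view(user_id, batch_view, realtime_view):
    # batch_view: precomputed by the high-latency batch layer (e.g. a nightly Spark job)
    # realtime_view: increments accumulated since the last batch run (e.g. from a stream job)
    result = dict(batch_view.get(user_id, {}))
    for item, delta in realtime_view.get(user_id, {}).items():
        result[item] = result.get(item, 0) + delta    # merged on read: eventual accuracy
    return result

batch = {"u1": {"itemA": 10, "itemB": 3}}
realtime = {"u1": {"itemB": 2, "itemC": 1}}
print(merged_view("u1", batch, realtime))    # {'itemA': 10, 'itemB': 5, 'itemC': 1}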

Recommendation algorithm architecture:

  • Recall stage (candidate generation): recall determines the ceiling of the final recommendation results. (Be careful not to confuse this with the recall-rate metric.)
    Commonly used algorithms:
    • Collaborative filtering: user-based, item-based
    • Content-based: summarize the user's preferences from their behavior, then use text-mining techniques to find items whose content matches those preferences.
    • Latent semantic models
  • Ranking stage: recall sets the ceiling of the final recommendation results, and ranking approaches that limit; it determines what is finally recommended.
    • CTR estimation (click-through-rate estimation, typically with the LR (Logistic Regression) algorithm): estimating whether a user will click on an item requires the user's click data; a sketch follows this list.
  • Strategy adjustment
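A minimal illustrative sketch of CTR estimation with logistic regression; the features and the tiny click log below are made up, not from the article:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy click log: [user_age, item_price, position_on_page] -> clicked (1) or not (0)
X = np.array([[25, 9.9, 1], [34, 49.0, 3], [19, 5.0, 1],
              [42, 99.0, 5], [23, 15.0, 2], [37, 60.0, 4]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[28, 20.0, 2]])[:, 1])    # estimated click-through probability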



2. Recommendation algorithm

2.1 Recommendation model construction process

Data -> Features -> ML Algorithm -> Prediction Output

  • Data collection
    • Explicit data:
      • Ratings
      • Comments/reviews
    • Implicit data:
      • Order history
      • Cart events (add to cart)
      • Page views
      • Click-through
      • Search logs
  • Feature engineering
    • Collaborative filtering: user-item matrix
    • Content-based: word segmentation, TF-IDF, word2vec
  • Model training
    • Collaborative filtering: KNN, matrix factorization
  • Evaluation and model launch

2.2 Recommendation Algorithm Based on Collaborative Filtering

2.2.1 Algorithm idea

Birds of a feather flock together.

2.2.2 Collaborative filtering is based on the following assumptions:

(1) " You may also like what people like you like " : user-based collaborative filtering recommendation (User-Based CF);

(2) " You may also like items similar to the items you like ": item-based collaborative filtering recommendation (Item-Based CF).

2.2.3 Collaborative filtering is implemented in the following steps:

(1) Find the most similar users or items: compute pairwise similarities and sort them to obtain the Top-N most similar users or items.
(2) Generate recommendations based on those similar users or items.


2.2.4 Similarity Calculation

Data Classification

  • Real values (item ratings)
  • Boolean values (user behavior, e.g. clicked or not, bookmarked or not)

Euclidean distance: the distance between two points $p$ and $q$ in space can be expressed as $E(p,q) = \sqrt{\sum\limits^n_{i=1} (p_i - q_i)^2}$


The Euclidean distance is non-negative and unbounded above, while a similarity score is usually expected to lie in $[-1,1]$ or $[0,1]$. The following conversion is commonly used (the greater the distance, the smaller the similarity): $similarity = \frac{1}{1+E(p,q)}$
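A quick numeric check of this conversion, with made-up vectors:

import numpy as np

p = np.array([5, 3, 4])
q = np.array([4, 3, 5])

E = np.sqrt(np.sum((p - q) ** 2))    # Euclidean distance: sqrt(2), about 1.414
print(1 / (1 + E))                   # similarity of about 0.414, always in (0, 1]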

(1) Cosine similarity

$similarity = \cos(\theta) = \frac{\pmb{x} \cdot \pmb{y}}{||\pmb{x}|| \, ||\pmb{y}||} = \frac{\sum\limits^n_{i=1} x_i y_i}{\sqrt{\sum\limits^n_{i=1} x_i^2} \times \sqrt{\sum\limits^n_{i=1} y_i^2}}$

  • Cosine similarity measures the angle between two vectors, using the cosine of the angle as the similarity; the vectors must first be normalized by their lengths in the computation.
  • Cosine similarity is most commonly used for text similarity, user similarity, and item similarity.
  • Feature: it is independent of vector length; as long as two vectors point in the same direction, they are considered "similar", no matter how strong or weak they are (see the sketch below).
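A minimal numpy sketch of the length-independence property (the vectors are made up for illustration):

import numpy as np

x = np.array([1, 2, 3])
y = np.array([2, 4, 6])    # same direction as x, only "stronger"

cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos_sim)             # 1.0: direction matters, magnitude does not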

(2) Pearson correlation coefficient

  • It is essentially a cosine similarity on centered vectors: subtract the means of $\pmb{x}$ and $\pmb{y}$ first, then compute the cosine similarity. That is:
    $similarity = corr(\pmb{x}, \pmb{y}) = \frac{\sum\limits^n_{i=1} (x_i - \overline{x})(y_i - \overline{y})}{\sqrt{\sum\limits^n_{i=1}(x_i - \overline{x})^2} \times \sqrt{\sum\limits^n_{i=1}(y_i - \overline{y})^2}}$
  • The Pearson similarity lies in $[-1, 1]$: $-1$ means negative correlation, $1$ means positive correlation.
  • Pearson similarity measures whether the trends of two variables are consistent; it is not suitable for computing correlations between Boolean values (see the sketch below).
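A small sketch showing Pearson as "centered cosine"; the two vectors happen to be User1's and User2's ratings of Items A-D from the dataset in section 2.3.2 below:

import numpy as np

x = np.array([5.0, 3.0, 4.0, 4.0])    # User1 on Items A-D
y = np.array([3.0, 1.0, 2.0, 3.0])    # User2 on Items A-D

xc, yc = x - x.mean(), y - y.mean()   # center first ...
pearson = np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))
print(round(pearson, 4))              # 0.8528, matching np.corrcoef(x, y)[0, 1]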

(3) Jaccard similarity

  • The ratio of the number of elements in the intersection of two sets to the number in their union; it is well suited to Boolean-vector representations.
  • The numerator is the dot product of the two Boolean vectors, which gives the number of elements in the intersection;
  • The denominator is the elementwise OR of the two Boolean vectors, summed, which gives the number of elements in the union (see the sketch below).
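A quick check of the dot/OR formulation, using User1's and User2's purchase vectors from section 2.3.1 below:

import numpy as np

a = np.array([1, 0, 1, 1, 0])    # User1
b = np.array([1, 0, 0, 1, 1])    # User2

intersection = np.dot(a, b)      # 2 items bought by both
union = np.sum(a | b)            # 4 items bought by either
print(intersection / union)      # 0.5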



2.3 Collaborative filtering implementation

2.3.1 Example: computing recommendations with Jaccard similarity

import pandas as pd
import numpy as np

# Build the dataset; 1/0 indicate whether the user has bought the item
users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
datasets = [
    [1,0,1,1,0],
    [1,0,0,1,1],
    [1,0,1,0,0],
    [0,1,0,1,1],
    [1,1,1,0,1],
]

df = pd.DataFrame(datasets, columns=items, index=users)
print(df)


# Similarity computation

# Directly compute the Jaccard similarity coefficient of two items
from sklearn.metrics import jaccard_score
# Similarity between Item A and Item B
print(jaccard_score(df["Item A"], df["Item B"]))


# Compute pairwise Jaccard similarity coefficients over the whole dataset
from sklearn.metrics.pairwise import pairwise_distances
# pairwise_distances computes the Jaccard distance; 1 - pairwise_distances is the
# Jaccard similarity. It works row-wise by default.
# User-user similarity
user_similar = 1 - pairwise_distances(df.values, metric="jaccard")    # a numpy array over the DataFrame's underlying data
user_similar = pd.DataFrame(user_similar, columns=users, index=users)
print("Pairwise similarity between users:")
print(user_similar)


# Item-item similarity
item_similar = 1 - pairwise_distances(df.T.values, metric="jaccard")
item_similar = pd.DataFrame(item_similar, columns=items, index=items)
print("Pairwise similarity between items:")
print(item_similar)


# Select Top-N similar results and make recommendations

# User-based recommendation
topN_users = {}
# Iterate over each row
for i in user_similar.index:
    # Take the user's similarity row, drop the user itself, then sort
    _df = user_similar.loc[i].drop([i])
    _df_sorted = _df.sort_values(ascending=False)    # ascending defaults to True (ascending order)

    top2 = list(_df_sorted.index[:2])
    topN_users[i] = top2

print("Top-2 similar users:")
print(topN_users)


# Build the recommendation results
rs_results = {}

for user, sim_users in topN_users.items():
    rs_result = set()    # stores this user's recommendations
    for sim_user in sim_users:
        # Build the initial candidate set: items the similar user has bought
        rs_result = rs_result.union(set(df.loc[sim_user].replace(0, np.nan).dropna().index))    # replace 0s with NaN and drop them
    # Filter out items the user has already bought
    rs_result -= set(df.loc[user].replace(0, np.nan).dropna().index)
    rs_results[user] = rs_result
print("Final recommendations:")
print(rs_results)



2.3.2 Rating prediction with the Pearson correlation coefficient

When using the Pearson correlation coefficient to compute similarity, the user-item rating data is required to form a dense matrix.

import pandas as pd
import numpy as np

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]

# User purchase records (rating matrix)

datasets = [
    [5, 3, 4, 4, None],
    [3, 1, 2, 3, 3],
    [4, 3, 4, 3, 5],
    [3, 3, 1, 5, 4],
    [1, 5, 5, 2, 1]
]
# Similarity: for rating data we use the Pearson correlation coefficient in [-1, 1];
# -1 means strong negative correlation, +1 strong positive correlation
# pandas' corr method computes the Pearson correlation coefficient directly

df = pd.DataFrame(datasets, columns=items, index=users)
# A DataFrame is a 2-D labelled data structure with columns of potentially
# different types, with an index (row labels) and columns (column labels)

print("Pairwise similarity between users:")
user_similar = df.T.corr()      # corr works column-wise by default, so transpose for user-user similarity
print(user_similar.round(4))    # keep 4 decimal places


print("物品之间的两两相似度:")
item_similar = df.corr()
print(item_similar.round(4))


Rating prediction from user-user similarity: predict with the similarity-weighted average of the neighbor users' ratings of the item, i.e.

$pred(u,i) = \frac{\sum\limits^n_{v=1} sim(u,v) \times r_{v,i}}{\sum\limits^n_{v=1} sim(u,v)}$

where $sim(u,v)$ is the similarity between user $u$ and neighbor user $v$, and $r_{v,i}$ is user $v$'s rating of item $i$. For example, predicting User1's rating of Item E:

$predict = \frac{0.8528 \times 3 + 0.7071 \times 5}{0.8528 + 0.7071} = 3.91$

Rating prediction from item-item similarity: predict with the user's own ratings of the neighboring items, weighted by item-item similarity, i.e.

$pred(u,i) = \frac{\sum\limits^n_{j=1} sim(i,j) \times r_{u,j}}{\sum\limits^n_{j=1} sim(i,j)}$

where $sim(i,j)$ is the similarity between item $i$ and neighbor item $j$, and $r_{u,j}$ is user $u$'s rating of item $j$. For example, predicting User1's rating of Item E:

$predict = \frac{0.9695 \times 5 + 0.5817 \times 4}{0.9695 + 0.5817} = 4.625$
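A small sketch of the weighted-average prediction above, plugging in User1's neighbor similarities from this section (0.8528 for User2, 0.7071 for User3) and their Item E ratings (3 and 5):

def weighted_predict(neighbours):
    # neighbours: list of (similarity, rating) pairs
    num = sum(sim * rating for sim, rating in neighbours)
    den = sum(sim for sim, _ in neighbours)
    return num / den

# User-based prediction of User1's rating on Item E
print(round(weighted_predict([(0.8528, 3), (0.7071, 5)]), 2))    # 3.91
# Item-based prediction, using the two most similar items instead
print(round(weighted_predict([(0.9695, 5), (0.5817, 4)]), 3))    # 4.625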



2.4 Model-based collaborative filtering recommendation algorithm

The collaborative filtering algorithms introduced above, which compute similarity directly, require the user-item rating matrix to be dense, which is unrealistic in many real application scenarios. When the user-item rating matrix is sparse, we use a model-based collaborative filtering algorithm.

The main idea of the model-based collaborative filtering algorithm is to find patterns in the data and to model the interactions between users and items with machine learning algorithms.

Model-based collaborative filtering recommendation algorithm:

  • CF based on classification, regression, or clustering algorithms
  • CF based on matrix factorization
  • CF based on neural networks
  • CF based on graph models

2.4.1 Collaborative filtering recommendation algorithm based on graph model


  • Represent the user's behavior data as a bipartite graph, then make recommendations for users based on this graph.
  • The relevance of two vertices is evaluated from the number of paths between them, the lengths of those paths, and the vertices they pass through (see the sketch below).
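An illustrative (and deliberately naive) sketch of the path-counting idea: score a candidate item for a user by counting length-3 paths user -> item -> other user -> candidate item over a toy bipartite graph (the data is made up, not from the article):

from collections import defaultdict

# user -> set of items the user interacted with
graph = {
    "A": {"a", "b", "d"},
    "B": {"a", "c"},
    "C": {"b", "e"},
    "D": {"c", "d", "e"},
}

def score_items(user):
    scores = defaultdict(int)
    for item in graph[user]:                        # user -> item
        for other in graph:                         # item -> other user
            if other != user and item in graph[other]:
                for candidate in graph[other]:      # other user -> candidate item
                    if candidate not in graph[user]:
                        scores[candidate] += 1      # one more connecting path
    return dict(scores)

print(score_items("A"))    # {'c': 2, 'e': 2} (key order may vary)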

2.4.2 Collaborative filtering recommendation algorithm based on matrix decomposition

  • ALS (alternating least squares)
  • SVD (singular value decomposition)

(1) ALS alternating least squares

Based on the latent features of users and items, we can predict how much a user will like an unrated item.

The original large matrix is approximately decomposed into the product of two small matrices, so that the actual recommendation computation no longer uses the large matrix but the two small matrices obtained from the decomposition. Concretely: suppose the user-item rating matrix $A$ is $m \times n$, i.e. there are $m$ users and $n$ items. We choose a small number $k$ ($k \ll m$, $k \ll n$; $k$ can be understood as the number of latent features that influence a user's rating of an item) and look for two matrices $U$ and $V$ such that $U_{m\times k} V_{n\times k}^T \approx A_{m\times n}$.

ALS-WR (alternating least squares with weighted-$\lambda$-regularization): this method decomposes the user-item rating matrix into two matrices, one describing users' preferences over the items' latent features and the other describing the latent features each item contains. In the process of matrix decomposition, the missing ratings are filled in, so we can recommend items to users based on this filled-in rating matrix.



The ALS matrix factorization algorithm is widely applied in recommender systems; it decomposes the user-item rating matrix into the users' preference matrix over the items' latent features and the items' mapping matrix onto those latent features.

Unlike the traditional SVD approach to matrix decomposition, ALS (alternating least squares) looks for two low-dimensional matrices $U$ and $V$ such that $U_{m\times k} V_{n\times k}^T \approx A_{m\times n}$, reducing the complexity of the problem from $O(mn)$ to $O((m+n)k)$.

How are the matrices $U$ and $V$ computed?
First initialize $V$, then solve for $U$ from $U V^T \approx A$; that gives the initial $U$ and $V$. Then alternately fix one matrix and optimize the other against the loss function, stopping when the loss drops below a preset threshold or the iteration limit is reached. A minimal sketch follows.
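A minimal numpy sketch of ALS, under the assumptions that 0 marks a missing rating, only observed entries enter the loss, and a $\lambda$-regularization term is added as in ALS-WR (toy data, for illustration only):

import numpy as np

A = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 0, 4],
              [0, 1, 5, 4]], dtype=float)    # 0 = unobserved rating
m, n, k, lam = A.shape[0], A.shape[1], 2, 0.1
observed = A > 0

rng = np.random.default_rng(0)
U = rng.random((m, k))
V = rng.random((n, k))

for _ in range(20):    # alternate: fix V and solve for U, then fix U and solve for V
    for u in range(m):
        idx = observed[u]              # items rated by user u
        Vu = V[idx]
        U[u] = np.linalg.solve(Vu.T @ Vu + lam * np.eye(k), Vu.T @ A[u, idx])
    for i in range(n):
        idx = observed[:, i]           # users who rated item i
        Ui = U[idx]
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k), Ui.T @ A[idx, i])

print(np.round(U @ V.T, 2))    # filled-in matrix, close to A on the observed cells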


3. Recommender System Evaluation

Commonly used evaluation indicators:

  • Accuracy
  • Trust
  • Satisfaction
  • Real-time performance
  • Coverage
  • Robustness
  • Diversity
  • Scalability
  • Novelty
  • Business goals
  • Serendipity
  • User retention

3.1 Exploitation & Exploration

  • Exploitation: choose the best option known so far;
  • Exploration: try options whose payoff is currently uncertain but may prove high in the future.


In the course of making these two types of decisions, continually update the estimate of each decision's uncertainty and optimize the long-term objective.

Practical approaches to the EE problem:

  • Interest expansion: similar topics, complementary recommendations
  • Crowd algorithms: UserCF, user clustering
  • Balancing the proportions of personalized and popularity-based recommendations
  • Randomly discarding part of the user's historical behavior
  • Randomly perturbing model parameters

3.2 Evaluation methods

A combination of offline evaluation and online evaluation (gray/canary release, A/B testing), plus periodic questionnaire surveys.





