[Data Mining and Business Intelligence Decision-Making] Chapter 15 Intelligent Recommendation System - Collaborative Filtering Algorithm

Foreword

My CSDN blog is "Do Bionic Programmers Dream of Electric Sheep". This article was written in Markdown, using CSDN and Typora, and the images are hosted on CSDN, so some of them may carry the watermark of my CSDN handle. The article is my own original work, produced for the daily and major coursework of "Data Mining and Business Intelligence Decision-Making".

This article covers Chapter 15, Intelligent Recommendation System - Collaborative Filtering Algorithm.
For ease of reading, I have divided the article into the following sections:

  1. Basic knowledge
  2. Experimental content
  3. Extended research
  4. Experience

Each section is introduced as follows:

  • Basic knowledge
    • My personal understanding of this chapter's topic, summarized knowledge points, and code and results worth recording.
  • Experimental content
    • The main experiment of this article, i.e. the experiment assigned by the teacher. After running successfully in Jupyter Notebook on my computer, it was exported to Markdown format.
    • The main titles correspond to the subsections of each chapter.
      [figure: screenshot showing the main-title/sub-title structure of an exported notebook]
    • As shown in the figure above, the main title is "PCA principal component analysis and code implementation" and the sub-titles are the submodules within the file. The content under each main title is self-contained, so the same Python library may be imported under more than one main title; to keep the code complete, these repeats are retained.
    • To show that the class work was indeed completed, the code is largely the same as the teacher's, but the Markdown text adds my own understanding. Because the data source is not necessarily identical, the results and plots differ from the tutorial, yet the experiment itself is correct and complete.
    • In addition, some related cases sent by the teacher (experiments not in the course center but shared with the course group, such as the airline customer value analysis case) are also attached to this part.
  • Extended research
    • Content I explored beyond the assigned experiment, including code, knowledge points, and my own experiments.
  • Experience

Basic knowledge

Experimental content

15.2 Three Common Methods of Similarity Calculation

15.2.1 Euclidean distance

import pandas as pd
df = pd.DataFrame([[5, 1, 5], [4, 2, 2], [4, 2, 1]], columns=['用户1', '用户2', '用户3'], index=['物品A', '物品B', '物品C'])
df
user 1 user 2 user 3
Item A 5 1 5
Item B 4 2 2
Item C 4 2 1
import numpy as np
dist = np.linalg.norm(df.iloc[0] - df.iloc[1])
dist
3.3166247903554
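Euclidean distance measures dissimilarity: identical vectors have distance 0, and larger values mean less similar items. To use it as a similarity score, a common convention (my own addition here, not something the chapter states) is to map the distance d into (0, 1] via 1/(1+d):

```python
import numpy as np

# Ratings of 物品A and 物品B by the three users (same data as above).
item_a = np.array([5, 1, 5])
item_b = np.array([4, 2, 2])

dist = np.linalg.norm(item_a - item_b)  # Euclidean distance, sqrt(1 + 1 + 9)
similarity = 1 / (1 + dist)             # map distance into (0, 1]; larger = more similar

print(dist, similarity)
```

With this transform, distance 0 gives similarity 1, and the similarity shrinks toward 0 as items drift apart.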

15.2.2 Cosine similarity with a built-in function

import pandas as pd
df = pd.DataFrame([[5, 1, 5], [4, 2, 2], [4, 2, 1]], columns=['用户1', '用户2', '用户3'], index=['物品A', '物品B', '物品C'])
df
user 1 user 2 user 3
Item A 5 1 5
Item B 4 2 2
Item C 4 2 1
from sklearn.metrics.pairwise import cosine_similarity
user_similarity = cosine_similarity(df)
pd.DataFrame(user_similarity, columns=['物品A', '物品B', '物品C'], index=['物品A', '物品B', '物品C'])
Item A Item B Item C
Item A 1.000000 0.914659 0.825029
Item B 0.914659 1.000000 0.979958
Item C 0.825029 0.979958 1.000000
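As a sanity check on the table above, cosine similarity is just the dot product of two rating vectors divided by the product of their norms. A minimal sketch computing the 物品A/物品B entry by hand:

```python
import numpy as np

a = np.array([5, 1, 5])  # 物品A
b = np.array([4, 2, 2])  # 物品B

# cosine similarity = dot product / (norm(a) * norm(b))
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_ab, 6))  # matches the 物品A/物品B cell of the matrix above
```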

15.2.3 Pearson correlation coefficient (simple version)

from scipy.stats import pearsonr
X = [1, 3, 5, 7, 9]
Y = [9, 8, 6, 4, 2]
corr = pearsonr(X, Y)
print('相关系数r值为' + str(corr[0]) + ',显著性水平P值为' + str(corr[1]))
相关系数r值为-0.9938837346736188,显著性水平P值为0.0005736731093322215
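The same r value can be obtained without SciPy via NumPy's correlation matrix, which is a useful cross-check (note that, unlike scipy's pearsonr, it does not report a p-value):

```python
import numpy as np

X = [1, 3, 5, 7, 9]
Y = [9, 8, 6, 4, 2]

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry [0, 1] is Pearson's r between X and Y.
r = np.corrcoef(X, Y)[0, 1]
print(r)
```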

A small Pearson correlation coefficient case

import pandas as pd
df = pd.DataFrame([[5, 4, 4], [1, 2, 2], [5, 2, 1]], columns=['物品A', '物品B', '物品C'], index=['用户1', '用户2', '用户3'])  
df
Item A Item B Item C
user 1 5 4 4
user 2 1 2 2
user 3 5 2 1
# Pearson correlation between 物品A and each of the other items
A = df['物品A']
corr_A = df.corrwith(A)
corr_A
物品A    1.000000
物品B    0.500000
物品C    0.188982
dtype: float64
# Pearson correlation matrix: pairwise correlations between all items
df.corr()
Item A Item B Item C
Item A 1.000000 0.500000 0.188982
Item B 0.500000 1.000000 0.944911
Item C 0.188982 0.944911 1.000000
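The two calls are consistent: corrwith() against a single column reproduces exactly one column of the full corr() matrix. A small check on the same data:

```python
import pandas as pd

df = pd.DataFrame([[5, 4, 4], [1, 2, 2], [5, 2, 1]],
                  columns=['物品A', '物品B', '物品C'],
                  index=['用户1', '用户2', '用户3'])

corr_A = df.corrwith(df['物品A'])   # correlations against one item
col_A = df.corr()['物品A']          # the same numbers, read off the full matrix
print((corr_A - col_A).abs().max())
```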

15.3 Case Study - Movie Intelligent Recommender System

1. Read data

import pandas as pd 
movies = pd.read_excel('电影.xlsx')
movies.head()
movie number name category
0 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Fighting Old Urchin 2 (1995) Comedy|Romance
3 4 Wait until the time to wake up (1995) Comedy|Drama|Romance
4 5 Father of the Bride 2 (1995) comedy
score = pd.read_excel('评分.xlsx')
score.head()
user ID movie number score
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
df = pd.merge(movies, score, on='电影编号')
df.head()
movie number name category user ID score
0 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 1 4.0
1 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 5 4.0
2 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 7 4.5
3 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 15 2.5
4 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 17 4.5
df.to_excel('电影推荐系统.xlsx')
df['评分'].value_counts()  # count how often each rating value occurs
4.0    26794
3.0    20017
5.0    13180
3.5    13129
4.5     8544
2.0     7545
2.5     5544
1.0     2808
1.5     1791
0.5     1369
Name: 评分, dtype: int64
%matplotlib inline
import matplotlib.pyplot as plt
df['评分'].hist(bins=20)  # hist() draws a histogram; the y-axis is the count of each rating
<AxesSubplot:>

[figure: histogram of the rating distribution]

2. Data Analysis

ratings = pd.DataFrame(df.groupby('名称')['评分'].mean())
ratings.sort_values('评分', ascending=False).head()
score
name
Tomboy (1997) 5.0
The Adventures of Sherlock Holmes and Dr. Watson: The King of Blackmail (1980) 5.0
Robot (2016) 5.0
Oscar (1967) 5.0
The Human Condition III (1961) 5.0
ratings['评分次数'] = df.groupby('名称')['评分'].count()
ratings.sort_values('评分次数', ascending=False).head()
score Number of ratings
name
Forrest Gump (1994) 4.164134 329
The Shawshank Redemption (1994) 4.429022 317
Pulp Fiction (1994) 4.197068 307
The Silence of the Lambs (1991) 4.161290 279
The Matrix (1999) 4.192446 278
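The two groupby passes above (mean, then count) can be fused into one with agg(). A minimal sketch on toy data (the titles 'A' and 'B' are placeholders, not films from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'名称': ['A', 'A', 'B', 'B', 'B'],
                    '评分': [4.0, 5.0, 3.0, 3.5, 2.5]})

# One groupby pass computing both statistics at once.
ratings = toy.groupby('名称')['评分'].agg(['mean', 'count'])
ratings.columns = ['评分', '评分次数']  # same column names as in the text
print(ratings)
```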

3. Data processing

user_movie = df.pivot_table(index='用户编号', columns='名称', values='评分')
user_movie.tail()
name 007 Goldeneye (1995) 100 Girls (2000) 100 Streets (2016) Sequel to 101 Dogs: Adventures in London (2003) 101 Chugo (1961) 101 Reykjavik (2000) 102 Dalmatians (2000) 10 pieces or less (2006) 10(1979) 11:14(2003) ... Dragon Ball: Mystery Adventure (1988) Dragon Ball: The Curse of the Blood Ruby (1986) Dragon Ball: Sleeping Princess in the Devil's Castle (1987) Dragon Seed (1944) The Girl with the Dragon Tattoo (2011) Tequila Sunrise (1988) Lobster (2015) Dragons: Gift of Night's Fury (2011) Dragon: The Bruce Lee Story (1993) Turtle Diaries (1985)
user ID
606 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
607 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
608 4.0 NaN NaN NaN NaN NaN NaN 3.5 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
609 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
610 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4.0 NaN 4.5 NaN NaN NaN

5 rows × 9687 columns

user_movie.describe()  # with this much data, this call can take around a minute
name 007 Goldeneye (1995) 100 Girls (2000) 100 Streets (2016) Sequel to 101 Dogs: Adventures in London (2003) 101 Chugo (1961) 101 Reykjavik (2000) 102 Dalmatians (2000) 10 pieces or less (2006) 10(1979) 11:14(2003) ... Dragon Ball: Mystery Adventure (1988) Dragon Ball: The Curse of the Blood Ruby (1986) Dragon Ball: Sleeping Princess in the Devil's Castle (1987) Dragon Seed (1944) The Girl with the Dragon Tattoo (2011) Tequila Sunrise (1988) Lobster (2015) Dragons: Gift of Night's Fury (2011) Dragon: The Bruce Lee Story (1993) Turtle Diaries (1985)
count 132.000000 4.00 1.0 1.0 44.000000 1.0 9.000000 3.000000 4.000000 4.00 ... 1.0 1.0 2.000000 1.0 42.000000 13.000000 7.000000 1.0 8.00000 2.0
mean 3.496212 3.25 2.5 2.5 3.431818 3.5 2.777778 2.666667 3.375000 3.75 ... 3.5 3.5 3.250000 3.5 3.488095 3.038462 4.000000 5.0 2.81250 4.0
std 0.859381 0.50 NaN NaN 0.751672 NaN 0.833333 1.040833 1.030776 0.50 ... NaN NaN 0.353553 NaN 1.327422 0.431158 0.707107 NaN 1.03294 0.0
min 0.500000 2.50 2.5 2.5 1.500000 3.5 2.000000 1.500000 2.000000 3.00 ... 3.5 3.5 3.000000 3.5 0.500000 2.000000 3.000000 5.0 0.50000 4.0
25% 3.000000 3.25 2.5 2.5 3.000000 3.5 2.000000 2.250000 3.125000 3.75 ... 3.5 3.5 3.125000 3.5 2.625000 3.000000 3.500000 5.0 2.87500 4.0
50% 3.500000 3.50 2.5 2.5 3.500000 3.5 2.500000 3.000000 3.500000 4.00 ... 3.5 3.5 3.250000 3.5 4.000000 3.000000 4.000000 5.0 3.00000 4.0
75% 4.000000 3.50 2.5 2.5 4.000000 3.5 3.000000 3.250000 3.750000 4.00 ... 3.5 3.5 3.375000 3.5 4.000000 3.000000 4.500000 5.0 3.12500 4.0
max 5.000000 3.50 2.5 2.5 5.000000 3.5 4.500000 3.500000 4.500000 4.00 ... 3.5 3.5 3.500000 3.5 5.000000 4.000000 5.000000 5.0 4.00000 4.0

8 rows × 9687 columns
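What pivot_table does here is easiest to see on a tiny frame: rows become users, columns become titles, and any (user, title) pair without a rating becomes NaN. A sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'用户编号': [1, 1, 2],
                    '名称': ['A', 'B', 'A'],
                    '评分': [4.0, 3.5, 5.0]})

user_movie = toy.pivot_table(index='用户编号', columns='名称', values='评分')
print(user_movie)  # user 2 never rated 'B', so that cell is NaN
```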

4. Smart recommendation

FG = user_movie['阿甘正传(1994)']  # FG is short for Forrest Gump, the film's English title
pd.DataFrame(FG).head()
Forrest Gump (1994)
user ID
1 4.0
2 NaN
3 NaN
4 NaN
5 NaN
import numpy as np
np.seterr(divide='ignore',invalid='ignore')
# axis defaults to 0: correlate each column of user_movie with FG
corr_FG = user_movie.corrwith(FG)
similarity = pd.DataFrame(corr_FG, columns=['相关系数'])
similarity.head()
D:\coder\randomnumbers\venv\lib\site-packages\numpy\lib\function_base.py:2845: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar, dtype=dtype)
D:\coder\randomnumbers\venv\lib\site-packages\numpy\lib\function_base.py:518: RuntimeWarning: Mean of empty slice.
  avg = a.mean(axis, **keepdims_kw)
correlation coefficient
007 Goldeneye (1995) 0.217441
100 Girls (2000) NaN
100 Streets (2016) NaN
Sequel to 101 Dogs: Adventures in London (2003) NaN
101 Chugo (1961) 0.141023
similarity.dropna(inplace=True)  # equivalently: similarity = similarity.dropna()
similarity.head()
correlation coefficient
007 Goldeneye (1995) 0.217441
101 Chugo (1961) 0.141023
102 Dalmatians (2000) -0.857589
10 pieces or less (2006) -1.000000
11:14(2003) 0.500000
similarity_new = pd.merge(similarity, ratings['评分次数'], left_index=True, right_index=True)
similarity_new.head()
correlation coefficient Number of ratings
007 Goldeneye (1995) 0.217441 132
101 Chugo (1961) 0.141023 44
102 Dalmatians (2000) -0.857589 9
10 pieces or less (2006) -1.000000 3
11:14(2003) 0.500000 4
# a second way to merge
similarity_new = similarity.join(ratings['评分次数'])
similarity_new.head()
correlation coefficient Number of ratings
007 Goldeneye (1995) 0.217441 132
101 Chugo (1961) 0.141023 44
102 Dalmatians (2000) -0.857589 9
10 pieces or less (2006) -1.000000 3
11:14(2003) 0.500000 4
similarity_new[similarity_new['评分次数'] > 20].sort_values(by='相关系数', ascending=False).head()  # filter by a rating-count threshold
correlation coefficient Number of ratings
Forrest Gump (1994) 1.000000 329
Crazy Twins (1996) 0.723238 31
Thor: The Dark World (2013) 0.715809 21
Fatal Attraction (1987) 0.701856 36
X-Men: Days of Future Past (2014) 0.682284 30
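The whole recommendation step (correlate with the target film, drop NaN, filter by rating count, sort) can be wrapped in one reusable function. The sketch below is my own: the function name recommend and the random toy ratings are assumptions, and the min_ratings parameter plays the role of the "> 20" threshold used above:

```python
import pandas as pd
import numpy as np

def recommend(user_movie, title, min_ratings=20, top_n=5):
    """Titles most correlated with `title`, keeping only frequently rated ones."""
    corr = user_movie.corrwith(user_movie[title])
    counts = user_movie.count()  # number of non-NaN ratings per title
    result = pd.DataFrame({'相关系数': corr, '评分次数': counts}).dropna()
    result = result[result['评分次数'] > min_ratings]
    return result.sort_values('相关系数', ascending=False).head(top_n)

# Synthetic demo: 50 users rating 4 titles with scores 1-5.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(1, 6, size=(50, 4)).astype(float),
                   columns=['A', 'B', 'C', 'D'])
top = recommend(toy, 'A', min_ratings=10)
print(top)  # 'A' correlates 1.0 with itself and tops the list
```

In the real pipeline the target title correlates 1.0 with itself (as Forrest Gump does above), so in practice one would drop the first row before presenting recommendations.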

Supplementary knowledge point: using the groupby() function

import pandas as pd
data = pd.DataFrame([['战狼2', '丁一', 6, 8], ['攀登者', '王二', 8, 6], ['攀登者', '张三', 10, 8], ['卧虎藏龙', '李四', 8, 8], ['卧虎藏龙', '赵五', 8, 10]], columns=['电影名称', '影评师', '观前评分', '观后评分'])
data
movie title film critic Pre-view rating Post-view rating
0 wolf warrior 2 Ding Yi 6 8
1 climber Wang Er 8 6
2 climber Zhang San 10 8
3 Crouching Tiger, Hidden Dragon Li Si 8 8
4 Crouching Tiger, Hidden Dragon Zhao Wu 8 10
means = data.groupby('电影名称')[['观后评分']].mean()
means
Post-view rating
movie title
Crouching Tiger, Hidden Dragon 9.0
wolf warrior 2 8.0
climber 7.0
means = data.groupby('电影名称')[['观前评分', '观后评分']].mean()
means
Pre-view rating Post-view rating
movie title
Crouching Tiger, Hidden Dragon 8.0 9.0
wolf warrior 2 6.0 8.0
climber 9.0 7.0
means = data.groupby(['电影名称', '影评师'])[['观后评分']].mean()
means
Post-view rating
movie title film critic
Crouching Tiger, Hidden Dragon Li Si 8.0
Zhao Wu 10.0
wolf warrior 2 Ding Yi 8.0
climber Zhang San 8.0
Wang Er 6.0
count = data.groupby('电影名称')[['观后评分']].count()
count
Post-view rating
movie title
Crouching Tiger, Hidden Dragon 2
wolf warrior 2 1
climber 2
count = count.rename(columns={'观后评分': '评分次数'})
count
Number of ratings
movie title
Crouching Tiger, Hidden Dragon 2
wolf warrior 2 1
climber 2
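The count-then-rename above can be collapsed into one step with pandas "named aggregation" (available since pandas 0.25), where the output column name is bound to a (source column, aggregation) pair:

```python
import pandas as pd

data = pd.DataFrame([['战狼2', '丁一', 6, 8], ['攀登者', '王二', 8, 6],
                     ['攀登者', '张三', 10, 8], ['卧虎藏龙', '李四', 8, 8],
                     ['卧虎藏龙', '赵五', 8, 10]],
                    columns=['电影名称', '影评师', '观前评分', '观后评分'])

# output column 评分次数 = count of source column 观后评分, per movie
count = data.groupby('电影名称').agg(评分次数=('观后评分', 'count'))
print(count)
```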

Extended research

Experience


Origin blog.csdn.net/Algernon98/article/details/130571749