[Data Mining and Business Intelligence Decision-Making] Chapter 15 Intelligent Recommendation System - Collaborative Filtering Algorithm

Foreword

My CSDN blog is "Do Bionic Programmers Dream of Electric Sheep". This article was written in Markdown, using CSDN and Typora, and the images are hosted on CSDN, so some of them may carry the watermark of my CSDN handle. The article is my own original work, produced for the daily and major coursework of "Data Mining and Business Intelligence Decision-Making".

This article covers Chapter 15, Intelligent Recommendation System - Collaborative Filtering Algorithm.
For ease of reading, I have divided the article into the following sections:

  1. Basic knowledge
  2. Experimental content
  3. Extended research
  4. Experience

Each section is introduced as follows:

  • Basic knowledge
    • My personal understanding of this chapter's topic, summarized knowledge points, and code and results worth recording.
  • Experimental content
    • The main experiment of this article, i.e. the experiment assigned by the teacher. After running successfully in Jupyter Notebook on my computer, it was exported to Markdown format.
    • The main titles correspond to the subsections of each chapter.
      [figure: screenshot showing the main-title/sub-title structure of an exported notebook]
    • As shown in the figure above, the main title is "PCA principal component analysis and code implementation" and the sub-titles are the submodules within the file. The content under each main title is self-contained, so the same Python library may be imported under more than one main title; to keep the code complete, these repeats are retained.
    • To show that the class work was indeed completed, the code is largely the same as the teacher's, but the Markdown text adds my own understanding. Because the data source is not necessarily identical, the results and plots differ from the tutorial, yet the experiment itself is correct and complete.
    • In addition, some related cases sent by the teacher (experiments not in the course center but shared with the course group, such as the airline customer value analysis case) are also attached to this part.
  • Extended research
    • Content I explored beyond the assigned experiment, including code, knowledge points, and my own experiments.
  • Experience

Basic knowledge

Experimental content

15.2 Three Common Methods of Similarity Calculation

15.2.1 Euclidean distance

import pandas as pd
df = pd.DataFrame([[5, 1, 5], [4, 2, 2], [4, 2, 1]], columns=['用户1', '用户2', '用户3'], index=['物品A', '物品B', '物品C'])
df
user 1 user 2 user 3
Item A 5 1 5
Item B 4 2 2
Item C 4 2 1
import numpy as np
dist = np.linalg.norm(df.iloc[0] - df.iloc[1])
dist
3.3166247903554
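Euclidean distance measures dissimilarity: identical vectors have distance 0, and larger values mean less similar items. To use it as a similarity score, a common convention (my own addition here, not something the chapter states) is to map the distance d into (0, 1] via 1/(1+d):

```python
import numpy as np

# Ratings of 物品A and 物品B by the three users (same data as above).
item_a = np.array([5, 1, 5])
item_b = np.array([4, 2, 2])

dist = np.linalg.norm(item_a - item_b)  # Euclidean distance, sqrt(1 + 1 + 9)
similarity = 1 / (1 + dist)             # map distance into (0, 1]; larger = more similar

print(dist, similarity)
```

With this transform, distance 0 gives similarity 1, and the similarity shrinks toward 0 as items drift apart.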

15.2.2 Cosine similarity with a built-in function

import pandas as pd
df = pd.DataFrame([[5, 1, 5], [4, 2, 2], [4, 2, 1]], columns=['用户1', '用户2', '用户3'], index=['物品A', '物品B', '物品C'])
df
user 1 user 2 user 3
Item A 5 1 5
Item B 4 2 2
Item C 4 2 1
from sklearn.metrics.pairwise import cosine_similarity
user_similarity = cosine_similarity(df)
pd.DataFrame(user_similarity, columns=['物品A', '物品B', '物品C'], index=['物品A', '物品B', '物品C'])
Item A Item B Item C
Item A 1.000000 0.914659 0.825029
Item B 0.914659 1.000000 0.979958
Item C 0.825029 0.979958 1.000000
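As a sanity check on the table above, cosine similarity is just the dot product of two rating vectors divided by the product of their norms. A minimal sketch computing the 物品A/物品B entry by hand:

```python
import numpy as np

a = np.array([5, 1, 5])  # 物品A
b = np.array([4, 2, 2])  # 物品B

# cosine similarity = dot product / (norm(a) * norm(b))
cos_ab = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_ab, 6))  # matches the 物品A/物品B cell of the matrix above
```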

15.2.3 Pearson correlation coefficient (simple version)

from scipy.stats import pearsonr
X = [1, 3, 5, 7, 9]
Y = [9, 8, 6, 4, 2]
corr = pearsonr(X, Y)
print('相关系数r值为' + str(corr[0]) + ',显著性水平P值为' + str(corr[1]))
相关系数r值为-0.9938837346736188,显著性水平P值为0.0005736731093322215
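The same r value can be obtained without SciPy via NumPy's correlation matrix, which is a useful cross-check (note that, unlike scipy's pearsonr, it does not report a p-value):

```python
import numpy as np

X = [1, 3, 5, 7, 9]
Y = [9, 8, 6, 4, 2]

# np.corrcoef returns the 2x2 correlation matrix;
# the off-diagonal entry [0, 1] is Pearson's r between X and Y.
r = np.corrcoef(X, Y)[0, 1]
print(r)
```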

A small Pearson correlation coefficient case

import pandas as pd
df = pd.DataFrame([[5, 4, 4], [1, 2, 2], [5, 2, 1]], columns=['物品A', '物品B', '物品C'], index=['用户1', '用户2', '用户3'])  
df
Item A Item B Item C
user 1 5 4 4
user 2 1 2 2
user 3 5 2 1
# Pearson correlation between 物品A and each of the other items
A = df['物品A']
corr_A = df.corrwith(A)
corr_A
物品A    1.000000
物品B    0.500000
物品C    0.188982
dtype: float64
# Pearson correlation matrix: pairwise correlations between all items
df.corr()
Item A Item B Item C
Item A 1.000000 0.500000 0.188982
Item B 0.500000 1.000000 0.944911
Item C 0.188982 0.944911 1.000000
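The two calls are consistent: corrwith() against a single column reproduces exactly one column of the full corr() matrix. A small check on the same data:

```python
import pandas as pd

df = pd.DataFrame([[5, 4, 4], [1, 2, 2], [5, 2, 1]],
                  columns=['物品A', '物品B', '物品C'],
                  index=['用户1', '用户2', '用户3'])

corr_A = df.corrwith(df['物品A'])   # correlations against one item
col_A = df.corr()['物品A']          # the same numbers, read off the full matrix
print((corr_A - col_A).abs().max())
```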

15.3 Case Study - Movie Intelligent Recommender System

1. Read data

import pandas as pd 
movies = pd.read_excel('电影.xlsx')
movies.head()
movie number name category
0 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Fighting Old Urchin 2 (1995) Comedy|Romance
3 4 Wait until the time to wake up (1995) Comedy|Drama|Romance
4 5 Father of the Bride 2 (1995) comedy
score = pd.read_excel('评分.xlsx')
score.head()
user ID movie number score
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
df = pd.merge(movies, score, on='电影编号')
df.head()
movie number name category user ID score
0 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 1 4.0
1 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 5 4.0
2 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 7 4.5
3 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 15 2.5
4 1 Toy Story (1995) Adventure|Animation|Kids|Comedy|Fantasy 17 4.5
df.to_excel('电影推荐系统.xlsx')
df['评分'].value_counts()  # count how often each rating value occurs
4.0    26794
3.0    20017
5.0    13180
3.5    13129
4.5     8544
2.0     7545
2.5     5544
1.0     2808
1.5     1791
0.5     1369
Name: 评分, dtype: int64
%matplotlib inline
import matplotlib.pyplot as plt
df['评分'].hist(bins=20)  # hist() draws a histogram; the y-axis is the count of each rating
<AxesSubplot:>

[figure: histogram of the rating distribution]

2. Data Analysis

ratings = pd.DataFrame(df.groupby('名称')['评分'].mean())
ratings.sort_values('评分', ascending=False).head()
score
name
Tomboy (1997) 5.0
The Adventures of Sherlock Holmes and Dr. Watson: The King of Blackmail (1980) 5.0
Robot (2016) 5.0
Oscar (1967) 5.0
The Human Condition III (1961) 5.0
ratings['评分次数'] = df.groupby('名称')['评分'].count()
ratings.sort_values('评分次数', ascending=False).head()
score Number of ratings
name
Forrest Gump (1994) 4.164134 329
The Shawshank Redemption (1994) 4.429022 317
Pulp Fiction (1994) 4.197068 307
The Silence of the Lambs (1991) 4.161290 279
The Matrix (1999) 4.192446 278
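The two groupby passes above (mean, then count) can be fused into one with agg(). A minimal sketch on toy data (the titles 'A' and 'B' are placeholders, not films from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'名称': ['A', 'A', 'B', 'B', 'B'],
                    '评分': [4.0, 5.0, 3.0, 3.5, 2.5]})

# One groupby pass computing both statistics at once.
ratings = toy.groupby('名称')['评分'].agg(['mean', 'count'])
ratings.columns = ['评分', '评分次数']  # same column names as in the text
print(ratings)
```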

3. Data processing

user_movie = df.pivot_table(index='用户编号', columns='名称', values='评分')
user_movie.tail()
name 007 Goldeneye (1995) 100 Girls (2000) 100 Streets (2016) Sequel to 101 Dogs: Adventures in London (2003) 101 Chugo (1961) 101 Reykjavik (2000) 102 Dalmatians (2000) 10 pieces or less (2006) 10(1979) 11:14(2003) ... Dragon Ball: Mystery Adventure (1988) Dragon Ball: The Curse of the Blood Ruby (1986) Dragon Ball: Sleeping Princess in the Devil's Castle (1987) Dragon Seed (1944) The Girl with the Dragon Tattoo (2011) Tequila Sunrise (1988) Lobster (2015) Dragons: Gift of Night's Fury (2011) Dragon: The Bruce Lee Story (1993) Turtle Diaries (1985)
user ID
606 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
607 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
608 4.0 NaN NaN NaN NaN NaN NaN 3.5 NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
609 4.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
610 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN 4.0 NaN 4.5 NaN NaN NaN

5 rows × 9687 columns

user_movie.describe()  # with this much data, this call can take around a minute
name 007 Goldeneye (1995) 100 Girls (2000) 100 Streets (2016) Sequel to 101 Dogs: Adventures in London (2003) 101 Chugo (1961) 101 Reykjavik (2000) 102 Dalmatians (2000) 10 pieces or less (2006) 10(1979) 11:14(2003) ... Dragon Ball: Mystery Adventure (1988) Dragon Ball: The Curse of the Blood Ruby (1986) Dragon Ball: Sleeping Princess in the Devil's Castle (1987) Dragon Seed (1944) The Girl with the Dragon Tattoo (2011) Tequila Sunrise (1988) Lobster (2015) Dragons: Gift of Night's Fury (2011) Dragon: The Bruce Lee Story (1993) Turtle Diaries (1985)
count 132.000000 4.00 1.0 1.0 44.000000 1.0 9.000000 3.000000 4.000000 4.00 ... 1.0 1.0 2.000000 1.0 42.000000 13.000000 7.000000 1.0 8.00000 2.0
mean 3.496212 3.25 2.5 2.5 3.431818 3.5 2.777778 2.666667 3.375000 3.75 ... 3.5 3.5 3.250000 3.5 3.488095 3.038462 4.000000 5.0 2.81250 4.0
std 0.859381 0.50 NaN NaN 0.751672 NaN 0.833333 1.040833 1.030776 0.50 ... NaN NaN 0.353553 NaN 1.327422 0.431158 0.707107 NaN 1.03294 0.0
min 0.500000 2.50 2.5 2.5 1.500000 3.5 2.000000 1.500000 2.000000 3.00 ... 3.5 3.5 3.000000 3.5 0.500000 2.000000 3.000000 5.0 0.50000 4.0
25% 3.000000 3.25 2.5 2.5 3.000000 3.5 2.000000 2.250000 3.125000 3.75 ... 3.5 3.5 3.125000 3.5 2.625000 3.000000 3.500000 5.0 2.87500 4.0
50% 3.500000 3.50 2.5 2.5 3.500000 3.5 2.500000 3.000000 3.500000 4.00 ... 3.5 3.5 3.250000 3.5 4.000000 3.000000 4.000000 5.0 3.00000 4.0
75% 4.000000 3.50 2.5 2.5 4.000000 3.5 3.000000 3.250000 3.750000 4.00 ... 3.5 3.5 3.375000 3.5 4.000000 3.000000 4.500000 5.0 3.12500 4.0
max 5.000000 3.50 2.5 2.5 5.000000 3.5 4.500000 3.500000 4.500000 4.00 ... 3.5 3.5 3.500000 3.5 5.000000 4.000000 5.000000 5.0 4.00000 4.0

8 rows × 9687 columns
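What pivot_table does here is easiest to see on a tiny frame: rows become users, columns become titles, and any (user, title) pair without a rating becomes NaN. A sketch with made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'用户编号': [1, 1, 2],
                    '名称': ['A', 'B', 'A'],
                    '评分': [4.0, 3.5, 5.0]})

user_movie = toy.pivot_table(index='用户编号', columns='名称', values='评分')
print(user_movie)  # user 2 never rated 'B', so that cell is NaN
```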

4. Smart recommendation

FG = user_movie['阿甘正传(1994)']  # FG is short for Forrest Gump, the film's English title
pd.DataFrame(FG).head()
Forrest Gump (1994)
user ID
1 4.0
2 NaN
3 NaN
4 NaN
5 NaN
import numpy as np
np.seterr(divide='ignore',invalid='ignore')
# axis defaults to 0: correlate each column of user_movie with FG
corr_FG = user_movie.corrwith(FG)
similarity = pd.DataFrame(corr_FG, columns=['相关系数'])
similarity.head()
D:\coder\randomnumbers\venv\lib\site-packages\numpy\lib\function_base.py:2845: RuntimeWarning: Degrees of freedom <= 0 for slice
  c = cov(x, y, rowvar, dtype=dtype)
D:\coder\randomnumbers\venv\lib\site-packages\numpy\lib\function_base.py:518: RuntimeWarning: Mean of empty slice.
  avg = a.mean(axis, **keepdims_kw)
correlation coefficient
007 Goldeneye (1995) 0.217441
100 Girls (2000) NaN
100 Streets (2016) NaN
Sequel to 101 Dogs: Adventures in London (2003) NaN
101 Chugo (1961) 0.141023
similarity.dropna(inplace=True)  # equivalently: similarity = similarity.dropna()
similarity.head()
correlation coefficient
007 Goldeneye (1995) 0.217441
101 Chugo (1961) 0.141023
102 Dalmatians (2000) -0.857589
10 pieces or less (2006) -1.000000
11:14(2003) 0.500000
similarity_new = pd.merge(similarity, ratings['评分次数'], left_index=True, right_index=True)
similarity_new.head()
correlation coefficient Number of ratings
007 Goldeneye (1995) 0.217441 132
101 Chugo (1961) 0.141023 44
102 Dalmatians (2000) -0.857589 9
10 pieces or less (2006) -1.000000 3
11:14(2003) 0.500000 4
# a second way to merge
similarity_new = similarity.join(ratings['评分次数'])
similarity_new.head()
correlation coefficient Number of ratings
007 Goldeneye (1995) 0.217441 132
101 Chugo (1961) 0.141023 44
102 Dalmatians (2000) -0.857589 9
10 pieces or less (2006) -1.000000 3
11:14(2003) 0.500000 4
similarity_new[similarity_new['评分次数'] > 20].sort_values(by='相关系数', ascending=False).head()  # filter by a rating-count threshold
correlation coefficient Number of ratings
Forrest Gump (1994) 1.000000 329
Crazy Twins (1996) 0.723238 31
Thor: The Dark World (2013) 0.715809 21
Fatal Attraction (1987) 0.701856 36
X-Men: Days of Future Past (2014) 0.682284 30
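The whole recommendation step (correlate with the target film, drop NaN, filter by rating count, sort) can be wrapped in one reusable function. The sketch below is my own: the function name recommend and the random toy ratings are assumptions, and the min_ratings parameter plays the role of the "> 20" threshold used above:

```python
import pandas as pd
import numpy as np

def recommend(user_movie, title, min_ratings=20, top_n=5):
    """Titles most correlated with `title`, keeping only frequently rated ones."""
    corr = user_movie.corrwith(user_movie[title])
    counts = user_movie.count()  # number of non-NaN ratings per title
    result = pd.DataFrame({'相关系数': corr, '评分次数': counts}).dropna()
    result = result[result['评分次数'] > min_ratings]
    return result.sort_values('相关系数', ascending=False).head(top_n)

# Synthetic demo: 50 users rating 4 titles with scores 1-5.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(1, 6, size=(50, 4)).astype(float),
                   columns=['A', 'B', 'C', 'D'])
top = recommend(toy, 'A', min_ratings=10)
print(top)  # 'A' correlates 1.0 with itself and tops the list
```

In the real pipeline the target title correlates 1.0 with itself (as Forrest Gump does above), so in practice one would drop the first row before presenting recommendations.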

Supplementary knowledge point: using the groupby() function

import pandas as pd
data = pd.DataFrame([['战狼2', '丁一', 6, 8], ['攀登者', '王二', 8, 6], ['攀登者', '张三', 10, 8], ['卧虎藏龙', '李四', 8, 8], ['卧虎藏龙', '赵五', 8, 10]], columns=['电影名称', '影评师', '观前评分', '观后评分'])
data
movie title film critic Pre-view rating Post-view rating
0 wolf warrior 2 Ding Yi 6 8
1 climber Wang Er 8 6
2 climber Zhang San 10 8
3 Crouching Tiger, Hidden Dragon Li Si 8 8
4 Crouching Tiger, Hidden Dragon Zhao Wu 8 10
means = data.groupby('电影名称')[['观后评分']].mean()
means
Post-view rating
movie title
Crouching Tiger, Hidden Dragon 9.0
wolf warrior 2 8.0
climber 7.0
means = data.groupby('电影名称')[['观前评分', '观后评分']].mean()
means
Pre-view rating Post-view rating
movie title
Crouching Tiger, Hidden Dragon 8.0 9.0
wolf warrior 2 6.0 8.0
climber 9.0 7.0
means = data.groupby(['电影名称', '影评师'])[['观后评分']].mean()
means
Post-view rating
movie title film critic
Crouching Tiger, Hidden Dragon Li Si 8.0
Zhao Wu 10.0
wolf warrior 2 Ding Yi 8.0
climber Zhang San 8.0
Wang Er 6.0
count = data.groupby('电影名称')[['观后评分']].count()
count
Post-view rating
movie title
Crouching Tiger, Hidden Dragon 2
wolf warrior 2 1
climber 2
count = count.rename(columns={'观后评分': '评分次数'})
count
Number of ratings
movie title
Crouching Tiger, Hidden Dragon 2
wolf warrior 2 1
climber 2
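The count-then-rename above can be collapsed into one step with pandas "named aggregation" (available since pandas 0.25), where the output column name is bound to a (source column, aggregation) pair:

```python
import pandas as pd

data = pd.DataFrame([['战狼2', '丁一', 6, 8], ['攀登者', '王二', 8, 6],
                     ['攀登者', '张三', 10, 8], ['卧虎藏龙', '李四', 8, 8],
                     ['卧虎藏龙', '赵五', 8, 10]],
                    columns=['电影名称', '影评师', '观前评分', '观后评分'])

# output column 评分次数 = count of source column 观后评分, per movie
count = data.groupby('电影名称').agg(评分次数=('观后评分', 'count'))
print(count)
```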

Extended research

Experience


Origin blog.csdn.net/Algernon98/article/details/130571749