NMF Learning Exercise: Doing Movie Recommendations


I learned NMF a long time ago and had almost forgotten it. Yesterday YX asked me about "NMF" (same letters, same sound, but a different meaning), and that brought it back to mind.
My study notes are a mess, but fortunately there is plenty of material online. I picked one article, used it to fill in my notes, and am keeping the result here as a memo.

The idea behind NMF is fairly old, and NMF and its related algorithms are mature by now. NMF is not a great fit for movie or product recommendation, where SVD-style algorithms are more common today. But these are only study notes, and a worked example is better than a dry concept.

Scenario

Let's assume a scenario.
Suppose that, as in the current schedule, there are 10 movies showing in theaters, and we put them into an array:

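item = [
    # Look Who's Back, Deadpool, Room, The Lobster, The Big Short
    '希特勒回来了', '死侍', '房间', '龙虾', '大空头',
    # Point Break, The Dressmaker, The Hateful Eight, The Intern, Bridge of Spies
    '极盗者', '裁缝', '八恶人', '实习生', '间谍之桥',
]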

Putting the movies into an array also numbers them from 0 to 9. For example, the movie "The Intern" (实习生) is number 8.
Next, assume our theater has 15 regular customers, and put them into an array as well:

user = ['五柳君', '帕格尼六', '木村静香', 'WTF', 'airyyouth',
        '橙子c', '秋月白', 'clavin_kong', 'olit', 'You_某人',
        '凛冬将至', 'Rusty', '噢!你看!', 'Aron', 'ErDong Chen']

They are numbered 0 to 14.
Then, from the users' viewing records, we extract each user's score for each movie. Using the movie number as the row index and the user number as the column index gives a matrix:

RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5, 4, 3, 2, 1, 4, 1, 3, 4, 5],
     [5, 0, 4, 0, 4, 4, 3, 2, 1, 2, 4, 4, 3, 4, 0],
     [0, 3, 0, 5, 4, 5, 0, 4, 4, 5, 3, 0, 0, 0, 0],
     [5, 4, 3, 3, 5, 5, 0, 1, 1, 3, 4, 5, 0, 2, 4],
     [5, 4, 3, 3, 5, 5, 3, 3, 3, 4, 5, 0, 5, 2, 4],
     [5, 4, 2, 2, 0, 5, 3, 3, 3, 4, 4, 4, 5, 2, 5],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]
)
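The step from viewing records to a matrix can be sketched as follows. This is only an illustration: the `records` list and its (user, movie, score) layout are assumptions of mine, and its three sample scores are copied from the matrix above.

```python
import numpy as np

item = ['希特勒回来了', '死侍', '房间', '龙虾', '大空头',
        '极盗者', '裁缝', '八恶人', '实习生', '间谍之桥']
user = ['五柳君', '帕格尼六', '木村静香', 'WTF', 'airyyouth',
        '橙子c', '秋月白', 'clavin_kong', 'olit', 'You_某人',
        '凛冬将至', 'Rusty', '噢!你看!', 'Aron', 'ErDong Chen']

# Hypothetical viewing records: (user name, movie title, score).
# The three scores are taken from RATE_MATRIX above.
records = [
    ('五柳君', '实习生', 5),
    ('Rusty', '死侍', 4),
    ('橙子c', '房间', 5),
]

# Rows are movies, columns are users; 0 means "not rated".
matrix = np.zeros((len(item), len(user)), dtype=int)
for user_name, movie_title, score in records:
    matrix[item.index(movie_title), user.index(user_name)] = score

print(matrix[8, 0])  # score given by user 0 (五柳君) to movie 8 (实习生)
```

Looking up a score is then just `matrix[movie_number, user_number]`.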

From the matrix, we can read off the following:

  1. Scores are on a 5-point scale.
  2. Ratings reflect personal habits. For example, the user in the first column, 五柳君 ("Wu Liujun"), gives 5 points to every movie he rates :) He really is a kind-hearted guy.
  3. A movie that a user has not rated is recorded as 0. (In practice, nobody actually gives a movie a score of 0.)
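Points 2 and 3 can be checked with a few lines of NumPy. This is a minimal sketch; the names `rated`, `n_rated`, and `mean_score` are mine, and it treats every 0 as "not rated", as described above.

```python
import numpy as np

RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5, 4, 3, 2, 1, 4, 1, 3, 4, 5],
     [5, 0, 4, 0, 4, 4, 3, 2, 1, 2, 4, 4, 3, 4, 0],
     [0, 3, 0, 5, 4, 5, 0, 4, 4, 5, 3, 0, 0, 0, 0],
     [5, 4, 3, 3, 5, 5, 0, 1, 1, 3, 4, 5, 0, 2, 4],
     [5, 4, 3, 3, 5, 5, 3, 3, 3, 4, 5, 0, 5, 2, 4],
     [5, 4, 2, 2, 0, 5, 3, 3, 3, 4, 4, 4, 5, 2, 5],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]
)

rated = RATE_MATRIX > 0                          # True where a score was actually given
n_rated = rated.sum(axis=0)                      # number of rated movies per user (column)
mean_score = RATE_MATRIX.sum(axis=0) / n_rated   # zeros add nothing to the sum

print(n_rated[0], mean_score[0])  # the first column is 五柳君
```

The first user rated 9 of the 10 movies and gave each of them a 5, so his average over rated movies is exactly 5.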

Classification

We use NMF to classify the movies by topic.
This is a typical unsupervised learning method: we do not specify what the topics are, we only assume that topics exist. That implies two tendencies:

  1. Every movie leans toward one or more topics.
  2. Every viewer likes one or more topics.
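In matrix terms, these two tendencies are what NMF models. The notation below is my own shorthand rather than something from the original notes: $V$ is the 10 x 15 rating matrix, $k$ is the number of topics, $W$ holds each movie's weight on each topic, and $H$ holds each user's weight on each topic.

```latex
V \approx W H, \qquad
V \in \mathbb{R}_{\ge 0}^{10 \times 15},\quad
W \in \mathbb{R}_{\ge 0}^{10 \times k},\quad
H \in \mathbb{R}_{\ge 0}^{k \times 15}

V_{ij} \approx \sum_{t=1}^{k} W_{it}\, H_{tj}
```

So the score of movie $i$ by user $j$ is approximated by summing, over the topics, how strongly the movie leans toward a topic times how much the user likes it. All entries of $W$ and $H$ are constrained to be non-negative.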

The key point is that in unsupervised learning we never say what a topic is. It is still easy to picture if you think of real cases, for example a "romance" topic or a "shootout" topic.
The following code runs NMF with 2 topics. It classifies each movie as leaning toward topic 1 or topic 2, and at the same time divides the users into those who prefer topic 1 and those who prefer topic 2.

nmf_model = NMF(n_components=2)  # assume 2 topics
item_dis = nmf_model.fit_transform(RATE_MATRIX)  # movies x topics
user_dis = nmf_model.components_  # topics x users

print('Topic distribution of users:')
print(user_dis)
print('Topic distribution of movies:')
print(item_dis)

With the dataset above, you get results like the following (the exact values can vary with the scikit-learn version):

Topic distribution of users:
[[0.81240799 0.71153396 0.47062388 0.43807017 1.39456425 2.24323719
  1.02417204 1.25356481 1.10517661 1.47624595 1.84626347 0.97437242
  1.14921406 0.8159644  1.14200028]
 [2.23910382 1.70186882 1.34300272 1.09192602 0.68045441 0.
  0.0542231  0.         0.         0.         0.04426552 0.12260418
  0.34109613 0.51642843 0.6157604 ]]
Topic distribution of movies:
[[2.20401687 1.53852775]
 [1.9083879  0.83214869]
 [1.95596132 0.        ]
 [1.87637018 1.65573674]
 [2.48959328 1.41632377]
 [2.38108536 1.08460665]
 [0.         2.29342959]
 [0.         2.27353353]
 [0.         2.32513876]
 [0.         2.23196277]]

The raw numbers are hard to read. What they mean is this: movies with similar values lean toward similar topics, and users with similar values like similar topics.
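One way to read such numbers is to take, for each movie and each user, the topic with the largest weight. The following is a minimal sketch. The `init`, `random_state`, and `max_iter` settings are my own choices for repeatability, not settings from the original notes, and the names `item_topic` and `user_topic` are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5, 4, 3, 2, 1, 4, 1, 3, 4, 5],
     [5, 0, 4, 0, 4, 4, 3, 2, 1, 2, 4, 4, 3, 4, 0],
     [0, 3, 0, 5, 4, 5, 0, 4, 4, 5, 3, 0, 0, 0, 0],
     [5, 4, 3, 3, 5, 5, 0, 1, 1, 3, 4, 5, 0, 2, 4],
     [5, 4, 3, 3, 5, 5, 3, 3, 3, 4, 5, 0, 5, 2, 4],
     [5, 4, 2, 2, 0, 5, 3, 3, 3, 4, 4, 4, 5, 2, 5],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]
)

# init, random_state and max_iter are assumptions made for this sketch
model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
item_dis = model.fit_transform(RATE_MATRIX)  # shape (10, 2): movies x topics
user_dis = model.components_                 # shape (2, 15): topics x users

item_topic = item_dis.argmax(axis=1)  # dominant topic index for each movie
user_topic = user_dis.argmax(axis=0)  # dominant topic index for each user

print(item_topic)
print(user_topic)
```

Which topic ends up with index 0 or 1 is arbitrary; only the grouping matters.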

Data visualization

To see this more intuitively, we can plot the data, which makes the "clustering" much easier to grasp.

# Plot each movie at its (topic 1, topic 2) coordinates
plt.plot(item_dis[:, 0], item_dis[:, 1], 'ro')
plt.xlim((-1, 3))
plt.ylim((-1, 3))
plt.title('the distribution of items (NMF)')  # figure title

# Pair each movie title with its coordinates and label the point.
# fontP is defined in the full code below; it must be a font with CJK glyphs.
for item_name, data in zip(item, item_dis):
    plt.text(data[0], data[1], item_name,
             fontproperties=fontP,
             horizontalalignment='center',
             verticalalignment='top')
plt.show()

# Plot each user at its (topic 1, topic 2) coordinates
user_pos = user_dis.T  # transposed copy, so that each row is one user
plt.plot(user_pos[:, 0], user_pos[:, 1], 'ro')
plt.xlim((-1, 3))
plt.ylim((-1, 3))
plt.title('the distribution of user (NMF)')  # figure title

# Pair each user name with its coordinates and label the point
for user_name, data in zip(user, user_pos):
    plt.text(data[0], data[1], user_name,
             fontproperties=fontP,
             horizontalalignment='center',
             verticalalignment='top')
plt.show()

The code above produces two figures. The first shows the topic distribution of the movies:

The second shows the topic-preference distribution of the users:

As the figures show, we used very few topics and the data is rough, so the points are spread out quite widely. Even so, they fall into roughly two groups.

Movie recommendation

With this in place, we can pick a user by name and recommend movies based on the topics that user prefers.
The data in this example is rough, so the computed results are not very convincing and are for reference only.

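# Note: user_dis must be the (topics x users) matrix from NMF, not a transposed copy
filter_matrix = RATE_MATRIX < 1e-8  # True where a movie has not been rated
rec_mat = np.dot(item_dis, user_dis)  # reconstructed rating matrix
print('Reconstructed matrix, with already-watched movies filtered out')
rec_filter_mat = (filter_matrix * rec_mat).T  # transpose so that rows are users
print(rec_filter_mat)

rec_user = 'Rusty'  # the user to recommend for
rec_userid = user.index(rec_user)  # index of that user
rec_list = rec_filter_mat[rec_userid, :]  # that user's filtered scores, one per movie

print('Movies recommended to the user:')
print(np.nonzero(rec_list))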

The result is the list of movie numbers recommended to Rusty, which here is every movie Rusty has not rated yet:

(array([2, 4, 6, 7, 8, 9]),)
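The list above is unordered. A natural next step is to sort the unrated movies by their reconstructed score, so that the strongest suggestion comes first. This is only a sketch of that idea, not part of the original code: the `init`, `random_state`, and `max_iter` settings and the names `unrated` and `ranked` are my own assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

item = ['希特勒回来了', '死侍', '房间', '龙虾', '大空头',
        '极盗者', '裁缝', '八恶人', '实习生', '间谍之桥']
user = ['五柳君', '帕格尼六', '木村静香', 'WTF', 'airyyouth',
        '橙子c', '秋月白', 'clavin_kong', 'olit', 'You_某人',
        '凛冬将至', 'Rusty', '噢!你看!', 'Aron', 'ErDong Chen']
RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5, 4, 3, 2, 1, 4, 1, 3, 4, 5],
     [5, 0, 4, 0, 4, 4, 3, 2, 1, 2, 4, 4, 3, 4, 0],
     [0, 3, 0, 5, 4, 5, 0, 4, 4, 5, 3, 0, 0, 0, 0],
     [5, 4, 3, 3, 5, 5, 0, 1, 1, 3, 4, 5, 0, 2, 4],
     [5, 4, 3, 3, 5, 5, 3, 3, 3, 4, 5, 0, 5, 2, 4],
     [5, 4, 2, 2, 0, 5, 3, 3, 3, 4, 4, 4, 5, 2, 5],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]
)

# init, random_state and max_iter are assumptions made for this sketch
model = NMF(n_components=2, init='nndsvda', random_state=0, max_iter=1000)
item_dis = model.fit_transform(RATE_MATRIX)  # movies x topics
user_dis = model.components_                 # topics x users
rec_mat = item_dis @ user_dis                # reconstructed scores, movies x users

rec_userid = user.index('Rusty')
# Movie numbers that this user has not rated (score recorded as 0)
unrated = np.flatnonzero(RATE_MATRIX[:, rec_userid] == 0)
# Sort those movies by reconstructed score, highest first
order = np.argsort(-rec_mat[unrated, rec_userid])
ranked = unrated[order]

for idx in ranked:
    print(idx, item[idx], round(float(rec_mat[idx, rec_userid]), 2))
```

The set of movies is the same as before; only the order is new, and with data this rough the ranking should be taken as loosely as the rest of the results.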

Full code

To keep the explanation above readable, the code was split up, simplified, and partly omitted. The complete code follows. Because of XJ's course requirements, it is written in Python 3. And Python 3's support for Chinese really is much better.

#!/usr/bin/env python3

# pip3 install scikit-learn scipy numpy matplotlib

from sklearn.decomposition import NMF
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties

item = [
    '希特勒回来了', '死侍', '房间', '龙虾', '大空头',
    '极盗者', '裁缝', '八恶人', '实习生', '间谍之桥',
]
user = ['五柳君', '帕格尼六', '木村静香', 'WTF', 'airyyouth',
        '橙子c', '秋月白', 'clavin_kong', 'olit', 'You_某人',
        '凛冬将至', 'Rusty', '噢!你看!', 'Aron', 'ErDong Chen']
RATE_MATRIX = np.array(
    [[5, 5, 3, 0, 5, 5, 4, 3, 2, 1, 4, 1, 3, 4, 5],
     [5, 0, 4, 0, 4, 4, 3, 2, 1, 2, 4, 4, 3, 4, 0],
     [0, 3, 0, 5, 4, 5, 0, 4, 4, 5, 3, 0, 0, 0, 0],
     [5, 4, 3, 3, 5, 5, 0, 1, 1, 3, 4, 5, 0, 2, 4],
     [5, 4, 3, 3, 5, 5, 3, 3, 3, 4, 5, 0, 5, 2, 4],
     [5, 4, 2, 2, 0, 5, 3, 3, 3, 4, 4, 4, 5, 2, 5],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0],
     [5, 4, 3, 3, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2],
     [5, 4, 3, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]]
)

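nmf_model = NMF(n_components=2)  # assume 2 topics
item_dis = nmf_model.fit_transform(RATE_MATRIX)  # movies x topics
user_dis = nmf_model.components_  # topics x users

print('Topic distribution of users:')
print(user_dis)
print('Topic distribution of movies:')
print(item_dis)

filter_matrix = RATE_MATRIX < 1e-8  # True where a movie has not been rated
rec_mat = np.dot(item_dis, user_dis)  # reconstructed rating matrix
print('Reconstructed matrix, with already-rated items filtered out:')
rec_filter_mat = (filter_matrix * rec_mat).T  # transpose so that rows are users
print(rec_filter_mat)

rec_user = 'Rusty'  # the user to recommend for
#rec_user = '凛冬将至'  # another user to try
rec_userid = user.index(rec_user)  # index of that user
rec_list = rec_filter_mat[rec_userid, :]  # that user's filtered scores, one per movie

print('Movies recommended to the user:')
print(np.nonzero(rec_list))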


######################################################################
# macOS font path; point this at any font with CJK glyphs on your system
fontP = FontProperties(fname="/System/Library/Fonts/STHeiti Light.ttc")
fontP.set_size('small')

# Plot each movie at its (topic 1, topic 2) coordinates
plt.plot(item_dis[:, 0], item_dis[:, 1], 'ro')
plt.xlim((-1, 3))
plt.ylim((-1, 3))
plt.title('the distribution of items (NMF)')  # figure title

# Pair each movie title with its coordinates and label the point
for item_name, data in zip(item, item_dis):
    plt.text(data[0], data[1], item_name,
             fontproperties=fontP,
             horizontalalignment='center',
             verticalalignment='top')
plt.show()

# Plot each user at its (topic 1, topic 2) coordinates
user_pos = user_dis.T  # transposed copy, so that each row is one user
plt.plot(user_pos[:, 0], user_pos[:, 1], 'ro')
plt.xlim((-1, 3))
plt.ylim((-1, 3))
plt.title('the distribution of user (NMF)')  # figure title

# Pair each user name with its coordinates and label the point
for user_name, data in zip(user, user_pos):
    plt.text(data[0], data[1], user_name,
             fontproperties=fontP,
             horizontalalignment='center',
             verticalalignment='top')
plt.show()

Special thanks: the code and data come from "NMF Non-negative Matrix Factorization Practice".
