Practical implementation of a personalized movie recommendation system based on TensorFlow

An unknown college student, known as Caigou in the world of martial arts
original author: jacky Li
Email: [email protected]

Time of completion: 2022.12.24
Last edited: 2022.12.24

Table of contents

Project Description:

Background of the project

1. Data processing

1. MovieLens Dataset

User data users.dat

Movie data movies.dat

Ratings data ratings.dat

2. Processed data

2. Modeling & Training

1. Embedding layer

2. Text convolution layer

3. Fully connected layer

4. Build calculation graph & training

5. Recommend

3. Web display terminal

1. django framework for web development

2. Show screenshots

4. Self-evaluation and summary of experimental projects

5. The author has something to say

Project Description:

dl_re_web: Folder of Web project

re_sys: Web app

model: After downloading from Baidu Cloud, put the model in this folder

recommend: Network model related

data: training data set folder

DataSet.py: Data set loading related

re_model.py: Network model class

utils.py: tools, crawlers

static: Web page static resources

templates: HTML pages for the Web app

venv: Django project resource folder

db.sqlite3: Django’s own database

manage.py: Django execution script

Network model: network model diagram (visio)

Model Baidu Cloud link: Baidu Netdisk, extraction code: 6xpt

Background of the project

This system combines the natural language processing of neural networks with movie recommendations, uses the MovieLens data set to train a text-based convolutional neural network, and implements a personalized movie recommendation system. Finally, the Django framework is used in combination with the Douban crawler to build the web service of the recommendation system.

Main functions

Recommend movies that users like

Recommend similar movies

Recommend movies that users who have watched also like to watch

Network model

1. Data processing

1. MovieLens Dataset

User data users.dat

Gender field: Convert 'F' and 'M' to 0 and 1

Age field: Convert the age codes to consecutive integers

Movie data movies.dat

Genre field: Some movies belong to more than one category, so convert this field into a list of integers

Title field: Same as above. Build a word-to-integer dictionary of the English titles, convert each title into a list of integers, and remove the year from the title

Note: To make the data easier for the network to process, the lengths of the above two fields need to be unified.

Ratings data ratings.dat

After the data is processed, the three tables are joined with an inner merge and saved as the file data_preprocess.pkl
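The field conversions described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the project's DataSet.py: the helper names, the `<PAD>` token, and the choice to pad titles to 15 words (taken from the (?, 15, 32) shape used later) are assumptions made for the example.

```python
import re

# Hypothetical helpers for illustration; the project's DataSet.py may differ.
GENDER_MAP = {"F": 0, "M": 1}
TITLE_LEN = 15   # assumed fixed title length, matching the (?, 15, 32) shape
PAD = "<PAD>"    # assumed padding token


def strip_year(title):
    """Remove a trailing '(1995)'-style year from a MovieLens title."""
    return re.sub(r"\s*\(\d{4}\)\s*$", "", title)


def build_vocab(titles):
    """Map the padding token and every distinct title word to an integer id."""
    words = sorted({w for t in titles for w in strip_year(t).split()})
    return {w: i for i, w in enumerate([PAD] + words)}


def encode_title(title, vocab, length=TITLE_LEN):
    """Turn a title into a fixed-length list of word ids (padded or truncated)."""
    ids = [vocab[w] for w in strip_year(title).split()][:length]
    return ids + [vocab[PAD]] * (length - len(ids))


def encode_ages(ages):
    """Map raw age codes to consecutive integers 0..n-1."""
    mapping = {a: i for i, a in enumerate(sorted(set(ages)))}
    return [mapping[a] for a in ages]


titles = ["Toy Story (1995)", "Jumanji (1995)"]
vocab = build_vocab(titles)
encoded = [encode_title(t, vocab) for t in titles]
```

The Genre field can be handled the same way as the title, with a genre-to-integer dictionary and padding to a common length.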

2. Processed data

We see that some fields are categorical variables, such as UserID and MovieID, which are very sparse. If one-hot encoding were used, the dimensionality of the data would expand dramatically and the efficiency of the algorithm would drop sharply.

2. Modeling & Training

Build models for different fields of processed data

1. Embedding layer

As noted above, to deal with data sparsity, the one-hot matrix multiplication can be simplified to a table lookup, which greatly reduces the amount of computation. Instead of replacing each word with a vector, we replace it with an index that is used to find the vector in the embedding matrix. During training the embedding vectors are updated as well, so we can also explore the similarity between words in the high-dimensional space.

This system uses TensorFlow's tf.nn.embedding_lookup to find the rows of embeddings given by the ids in input_ids. For example, if input_ids=[1,3,5], it finds rows 1, 3, and 5 of embeddings and returns them as one tensor. It is not merely a table lookup: the vector corresponding to each id is trainable, and the number of trainable parameters is category num * embedding size. In that sense, tf.nn.embedding_lookup can also be seen as a fully connected layer.
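To make the equivalence concrete, here is a small sketch in plain Python (no TensorFlow) showing that multiplying a one-hot vector by the embedding matrix returns the same row as indexing the matrix directly. The 6 x 3 matrix and the helper names are made up for the illustration.

```python
def one_hot(index, size):
    """Return a one-hot row vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matmul_row(row, matrix):
    """Multiply a 1 x n row vector by an n x d matrix."""
    return [sum(r * matrix[i][j] for i, r in enumerate(row))
            for j in range(len(matrix[0]))]


def embedding_lookup(matrix, ids):
    """Pick the rows of the matrix given by ids (what a lookup does)."""
    return [matrix[i] for i in ids]


# Toy embedding matrix: 6 categories, 3 latent factors each.
embeddings = [[float(10 * r + c) for c in range(3)] for r in range(6)]
input_ids = [1, 3, 5]

looked_up = embedding_lookup(embeddings, input_ids)
multiplied = [matmul_row(one_hot(i, len(embeddings)), embeddings)
              for i in input_ids]

# Trainable parameters = category num * embedding size.
num_params = len(embeddings) * len(embeddings[0])
```

Both paths give the same rows, but the lookup touches only 3 rows while the one-hot product walks over all 6 for every id.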

Analysis:

To create an embedding matrix, we have to decide how many latent factors to allocate to each index, that is, how long each vector should be. Common choices are lengths of 32 and 50. Here we choose 32 and 16, so dimension 1 of each field's embedding matrix shape (the second number) is either 32 or 16.

Dimension 0 of the embedding matrices is 6041, 2, 7, and 21. This is the number of rows of each embedding matrix, and it corresponds to the number of possible index values in these four fields. For example, Gender only takes the values 0 and 1 (after data processing), so its embedding matrix has 2 rows.

By now the benefit of an embedding matrix should be clear. Take the UserID field as an example. With one-hot encoding, the data grows to (number of samples) x 6041 values. If the data set is large, or the field has many unique values, training consumes a lot of resources. With an embedding matrix we only need to create a 6041 x 32 matrix and apply tf.nn.embedding_lookup to the UserID data (a full connection that is equivalent to a table lookup), so each UserID is represented by a one-dimensional array of length 32, which greatly reduces the computation time.

As mentioned in the previous point, tf.nn.embedding_lookup is applied to the field data (equivalent to a table lookup), so the shape of each embedding layer is (number of samples, field length, embedding length). The number of samples can be set to the batch size. For User data the field length is always 1, because a single value identifies each entry. For text, an array may be needed, so the field length can be greater than 1; this is explained further in the Movie data processing below. The embedding length is the number of latent factors of the embedding matrix.

Example: construct an embedding matrix and an embedding layer for each of the data set fields UserID, Gender, Age, and JobID.

def create_user_embedding(self, uid, user_gender, user_age, user_job):
    # One trainable embedding matrix per user field, plus the lookup for the current batch.
    with tf.name_scope("user_embedding"):
        uid_embed_matrix = tf.Variable(tf.random_uniform([self.uid_max, self.embed_dim], -1, 1),
                                       name="uid_embed_matrix")  # (6041, 32)
        uid_embed_layer = tf.nn.embedding_lookup(uid_embed_matrix, uid,
                                                 name="uid_embed_layer")  # (?, 1, 32)

        gender_embed_matrix = tf.Variable(tf.random_uniform([self.gender_max, self.embed_dim // 2], -1, 1),
                                          name="gender_embed_matrix")  # (2, 16)
        gender_embed_layer = tf.nn.embedding_lookup(gender_embed_matrix, user_gender,
                                                    name="gender_embed_layer")  # (?, 1, 16)

        age_embed_matrix = tf.Variable(tf.random_uniform([self.age_max, self.embed_dim // 2], -1, 1),
                                       name="age_embed_matrix")  # (7, 16)
        age_embed_layer = tf.nn.embedding_lookup(age_embed_matrix, user_age,
                                                 name="age_embed_layer")  # (?, 1, 16)

        job_embed_matrix = tf.Variable(tf.random_uniform([self.job_max, self.embed_dim // 2], -1, 1),
                                       name="job_embed_matrix")  # (21, 16)
        job_embed_layer = tf.nn.embedding_lookup(job_embed_matrix, user_job,
                                                 name="job_embed_layer")  # (?, 1, 16)
    return uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer

Similarly, the corresponding code creates the embedding matrices for the MovieID, Genres, and Title fields of the movie data. Two points need special attention:

The shape of the Title embedding layer is (?, 15, 32). "?" stands for the batch size, 32 is the chosen number of latent factors, and 15 means that each value of the field is represented by a sequence of length 15.

The shape of the Genres embedding layer is (?, 1, 32). Since a movie may belong to several genres, this field needs special handling: the genre vectors are summed along dimension 1. This actually reduces the expressiveness of the feature, but it keeps the system from recommending only movies of related genres.
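The summation for the Genres field can be sketched as below for a single movie. This is plain Python for illustration only; the function name and the tiny 3 x 2 embedding matrix are assumptions, and the real code works on batched tensors with 32 factors.

```python
def sum_genre_embeddings(genre_ids, embed_matrix):
    """Sum the embedding rows of one movie's genres into a single vector.

    The result has shape (1, dim), mirroring (?, 1, 32) for one movie.
    """
    dim = len(embed_matrix[0])
    total = [0.0] * dim
    for g in genre_ids:
        for j in range(dim):
            total[j] += embed_matrix[g][j]
    return [total]


# Toy matrix: 3 genres, 2 latent factors each.
genre_embeddings = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
movie_genres = [0, 2]  # a movie that belongs to genres 0 and 2
genre_feature = sum_genre_embeddings(movie_genres, genre_embeddings)
```

However many genres a movie has, the result is always a single vector, which is why the field length in the layer shape is 1.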

In summary, after the embedding layer, we get the following model:

For User data

Tensor name	Shape
uid_embed_matrix	(6041, 32)
gender_embed_matrix	(2, 16)
age_embed_matrix	(7, 16)
job_embed_matrix	(21, 16)
uid_embed_layer	(?, 1, 32)
gender_embed_layer	(?, 1, 16)
age_embed_layer	(?, 1, 16)
job_embed_layer	(?, 1, 16)

For Movie data

Tensor name	Shape
movie_id_embed_matrix	(3953, 32)
movie_categories_embed_matrix	(19, 32)
movie_title_embed_matrix	(5215, 32)
movie_id_embed_layer	(?, 1, 32)
movie_categories_embed_layer	(?, 1, 32)
movie_title_embed_layer	(?, 15, 32)

2. Text convolution layer

This article only walks through the shape derivation. For the ideas behind the design of the convolutional layer, please see the references.

The text convolution layer only involves the Title field of the movie data. In fact, the Genres field could also be given a text convolution. However, as explained above, considering the influence of this field on the recommendations, only an ordinary network is designed for Genres.

The convolution process involves the following parameters:

Name and value	Explanation
window_sizes=[2, 3, 4, 5]	The sliding window sizes used by the different convolutions
filter_num=8	The number of convolution kernels (filters)
filter_weights=(window_size, 32, 1, filter_num)	The weights of the convolution kernel; the four parameters are (height, width, number of input channels, number of output channels)
filter_bias=8	The bias of the convolution kernel; its length = the number of output channels = the number of convolution kernels

Process

We use the output of the Title embedding layer, movie_title_embed_layer (shape=(?, 15, 32)), as the input of the convolutional layer. We first expand movie_title_embed_layer by one dimension so that the shape becomes (?, 15, 32, 1); the four dimensions are (batch, height, width, channels).

 movie_title_embed_layer_expand = tf.expand_dims(movie_title_embed_layer, -1) # add one dimension at the end

Use convolution kernels of different sizes for convolution and maximum pooling, and the changes in related parameters will not be described again.

pool_layer_lst = []
for window_size in self.window_sizes:
    with tf.name_scope("movie_txt_conv_maxpool_{}".format(window_size)):
        # Convolution kernel weights
        filter_weights = tf.Variable(tf.truncated_normal([window_size, self.embed_dim, 1, self.filter_num], stddev=0.1),
                                     name="filter_weights")

        # Convolution kernel bias
        filter_bias = tf.Variable(tf.constant(0.1, shape=[self.filter_num]), name="filter_bias")

        # Convolution layer. Arguments: input, kernel weights, strides
        conv_layer = tf.nn.conv2d(movie_title_embed_layer_expand, filter_weights, [1, 1, 1, 1],
                                  padding="VALID", name="conv_layer")

        # Activation layer. The shape stays unchanged
        relu_layer = tf.nn.relu(tf.nn.bias_add(conv_layer, filter_bias), name="relu_layer")

        # Pooling layer. Arguments: input, pooling window size, strides
        maxpool_layer = tf.nn.max_pool(relu_layer, [1, self.sentences_size - window_size + 1, 1, 1],
                                       [1, 1, 1, 1], padding="VALID", name="maxpool_layer")

        pool_layer_lst.append(maxpool_layer)

This gives:

window_size	filter_weights	filter_bias	conv_layer	relu_layer	maxpool_layer
2	(2, 32, 1, 8)	8	(?, 14, 1, 8)	(?, 14, 1, 8)	(?, 1, 1, 8)
3	(3, 32, 1, 8)	8	(?, 13, 1, 8)	(?, 13, 1, 8)	(?, 1, 1, 8)
4	(4, 32, 1, 8)	8	(?, 12, 1, 8)	(?, 12, 1, 8)	(?, 1, 1, 8)
5	(5, 32, 1, 8)	8	(?, 11, 1, 8)	(?, 11, 1, 8)	(?, 1, 1, 8)

Example analysis:

Consider the case window_size=2. First, we take the embedding layer output with one dimension added, movie_title_embed_layer_expand (shape=(?, 15, 32, 1)), as the input of the convolution layer.

The shape of the convolution kernel filter_weights is (2, 32, 1, 8): the kernel height is 2, the width is 32, the number of input channels is 1, and the number of output channels is 8. The number of input channels equals the number of channels output by the previous layer.

The stride of the convolutional layer is 1 in every dimension and the padding method is VALID, so the shape of the convolutional layer output is (?, 14, 1, 8).

After the convolution, the bias is added and the ReLU function is applied as the activation; the shape remains unchanged.

The max pooling window is (1, 14, 1, 1) and the stride in every dimension is 1, so the shape after pooling is (?, 1, 1, 8).

By analogy, for the other values of window_size the pooling layer output shape is also (?, 1, 1, 8).
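The shape arithmetic above follows the usual rule for VALID padding, output = (input - window) // stride + 1. The short sketch below applies it to every window size; the function names are made up for the illustration and the defaults are the values used in this article.

```python
def conv_valid_out(size, window, stride=1):
    """Output length of a VALID convolution or pooling along one axis."""
    return (size - window) // stride + 1


def title_conv_shapes(sentence_size=15, embed_dim=32, filter_num=8,
                      window_sizes=(2, 3, 4, 5)):
    """Return (height, width, channels) after conv and after max pooling."""
    shapes = {}
    for w in window_sizes:
        conv_h = conv_valid_out(sentence_size, w)
        # The kernel is as wide as the embedding, so the width collapses to 1.
        conv_w = conv_valid_out(embed_dim, embed_dim)
        # The pooling window covers the whole remaining height.
        pool_h = conv_valid_out(conv_h, sentence_size - w + 1)
        shapes[w] = {"conv": (conv_h, conv_w, filter_num),
                     "pool": (pool_h, conv_w, filter_num)}
    return shapes


shapes = title_conv_shapes()
```

The batch dimension "?" is left out because it does not change through these layers.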

After obtaining the outputs of the four convolution and pooling branches, we use the following code to concatenate the pooling outputs along axis 3 (the fourth dimension), giving the shape (?, 1, 1, 32), and then reshape the result to three dimensions, (?, 1, 32).

pool_layer = tf.concat(pool_layer_lst, 3, name="pool_layer")  # (?, 1, 1, 32)
max_num = len(self.window_sizes) * self.filter_num  # 32
pool_layer_flat = tf.reshape(pool_layer, [-1, 1, max_num], name="pool_layer_flat")  # (?, 1, 32); this only drops one dimension, ? is still the batch size

Finally, to regularize the model and prevent overfitting, the result passes through a dropout layer; the output shape is still (?, 1, 32).

3. Fully connected layer

The outputs of the embedding layers and of the convolutional layer obtained above are now passed through fully connected layers.

The embedding layers of the User data go through two fully connected layers, and the final shape of the output feature is (?, 200)

def create_user_feature_layer(self, uid_embed_layer, gender_embed_layer, age_embed_layer, job_embed_layer):
    with tf.name_scope("user_fc"):
        # First fully connected layer: changes the last dimension
        uid_fc_layer = tf.layers.dense(uid_embed_layer, self.embed_dim, name="uid_fc_layer", activation=tf.nn.relu)
        gender_fc_layer = tf.layers.dense(gender_embed_layer, self.embed_dim, name="gender_fc_layer",
                                          activation=tf.nn.relu)
        age_fc_layer = tf.layers.dense(age_embed_layer, self.embed_dim, name="age_fc_layer", activation=tf.nn.relu)
        job_fc_layer = tf.layers.dense(job_embed_layer, self.embed_dim, name="job_fc_layer", activation=tf.nn.relu)
        # Each of the four layers now has shape (?, 1, 32)

        # Second fully connected layer
        user_combine_layer = tf.concat([uid_fc_layer, gender_fc_layer, age_fc_layer, job_fc_layer], 2)  # (?, 1, 128)
        user_combine_layer = tf.contrib.layers.fully_connected(user_combine_layer, 200, tf.tanh)  # (?, 1, 200)
        user_combine_layer_flat = tf.reshape(user_combine_layer, [-1, 200])  # (?, 200)
    return user_combine_layer, user_combine_layer_flat

In the same way, two layers of full connections are performed on the Movie data, and the final shape of the output feature is (?, 200)

def create_movie_feature_layer(self, movie_id_embed_layer, movie_categories_embed_layer, dropout_layer):
    with tf.name_scope("movie_fc"):
        # First fully connected layer
        movie_id_fc_layer = tf.layers.dense(movie_id_embed_layer, self.embed_dim, name="movie_id_fc_layer",
                                            activation=tf.nn.relu)  # (?, 1, 32)
        movie_categories_fc_layer = tf.layers.dense(movie_categories_embed_layer, self.embed_dim,
                                                    name="movie_categories_fc_layer",
                                                    activation=tf.nn.relu)  # (?, 1, 32)

        # Second fully connected layer; dropout_layer is the title feature from the text convolution
        movie_combine_layer = tf.concat([movie_id_fc_layer, movie_categories_fc_layer, dropout_layer], 2)  # (?, 1, 96)
        movie_combine_layer = tf.contrib.layers.fully_connected(movie_combine_layer, 200, tf.tanh)  # (?, 1, 200)

        movie_combine_layer_flat = tf.reshape(movie_combine_layer, [-1, 200])  # (?, 200)
    return movie_combine_layer, movie_combine_layer_flat

4. Build calculation graph & training

Construct the computational graph and train. The task is treated as a regression problem: the user feature vector and the movie feature vector are multiplied element by element and summed (a dot product) to obtain a predicted rating, and the loss is the mean squared error.

inference = tf.reduce_sum(user_combine_layer_flat * movie_combine_layer_flat, axis=1)  # dot product per sample, (?,)
inference = tf.expand_dims(inference, axis=1)  # (?, 1)
cost = tf.losses.mean_squared_error(targets, inference)
loss = tf.reduce_mean(cost)
global_step = tf.Variable(0, name="global_step", trainable=False)
optimizer = tf.train.AdamOptimizer(lr)   # pass in the learning rate
gradients = optimizer.compute_gradients(loss)  # gradients of the loss
train_op = optimizer.apply_gradients(gradients, global_step=global_step)
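What the first lines of the graph compute can be checked by hand on tiny numbers. The sketch below is plain Python with made-up two-dimensional feature vectors (the real features have length 200); the function names are hypothetical.

```python
def predict_ratings(user_features, movie_features):
    """Row-wise dot product: one predicted rating per (user, movie) pair."""
    return [sum(u * m for u, m in zip(user_row, movie_row))
            for user_row, movie_row in zip(user_features, movie_features)]


def mean_squared_error(targets, predictions):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


# A batch of two samples with toy 2-dimensional features.
users = [[1.0, 2.0], [0.5, 0.5]]
movies = [[2.0, 1.0], [4.0, 2.0]]
targets = [5.0, 3.0]

predictions = predict_ratings(users, movies)   # 1*2 + 2*1 and 0.5*4 + 0.5*2
loss_value = mean_squared_error(targets, predictions)
```

Here the predictions are 4.0 and 3.0, so the loss is ((5 - 4)^2 + (3 - 3)^2) / 2 = 0.5.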

Model save

The saved model includes: processed training data, network after training, user feature matrix, and movie feature matrix.

loss image

After some simple parameter tuning, batch_size turned out to have a large impact on the loss, but if batch_size is too large the loss jitters considerably. As the learning rate was gradually decreased, the loss first decreased and then increased, so in the end the original author's fixed parameters gave the best results.

5. Recommend

A random factor is added so that repeated recommendations for the same movie do not always return the same results.

1. Recommend movies the user likes: use the user feature vector and the movie feature matrix to compute the predicted ratings of all movies, and take the top K with the highest ratings
2. Recommend similar movies: compute the cosine similarity between the selected movie's feature vector and the whole movie feature matrix, and take the top K with the largest similarity
3. Recommend movies that users who have watched also like to watch

3.1 First select the top K users who like a certain movie and obtain the user feature vectors of these users.

3.2 Calculate the ratings of all movies by these users

3.3 Select each user's highest-rated movie as a recommendation
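The three recommendation modes can be sketched in plain Python as follows. This is an illustration under simplifying assumptions, not the project's code: the function names and toy two-dimensional features are made up, the random factor is left out, the selected movie is excluded from its own similar list, and "users who like a movie" is taken to mean the users with the highest predicted rating for it.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def top_k_for_user(user_vec, movie_matrix, k):
    """Mode 1: movies with the highest predicted rating for one user."""
    order = sorted(range(len(movie_matrix)),
                   key=lambda m: dot(user_vec, movie_matrix[m]), reverse=True)
    return order[:k]


def top_k_similar(movie_id, movie_matrix, k):
    """Mode 2: movies whose features are most cosine-similar to movie_id."""
    target = movie_matrix[movie_id]
    scores = [(cosine_similarity(target, vec), idx)
              for idx, vec in enumerate(movie_matrix) if idx != movie_id]
    return [idx for _, idx in sorted(scores, reverse=True)[:k]]


def also_liked(movie_id, user_matrix, movie_matrix, k):
    """Mode 3: favourite movie of each of the top-k fans of movie_id."""
    fans = sorted(range(len(user_matrix)),
                  key=lambda u: dot(user_matrix[u], movie_matrix[movie_id]),
                  reverse=True)[:k]
    picks = []
    for u in fans:
        best = max(range(len(movie_matrix)),
                   key=lambda m: dot(user_matrix[u], movie_matrix[m]))
        if best not in picks:
            picks.append(best)
    return picks


movie_matrix = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
user_matrix = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]
```

For example, top_k_similar(0, movie_matrix, 2) ranks movie 1 first because its feature vector points almost in the same direction as movie 0.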

3. Web display terminal

1. django framework for web development

Since the given data set contains no other information about the user, only "recommend similar movies" and "recommend movies that users who have watched also like to watch" are displayed; the "recommend movies that users like" module is not displayed. The data set also does not include the Chinese movie titles, pictures, or other such data, so I added a Douban crawler to the web project, which requests the data for each recommendation and parses and packages it accordingly.

Load the model when the server starts, and encapsulate the tensorflow session in advance. When calling relevant methods, directly pass in the global session, avoiding loading the model for every request.
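One simple way to make sure the model is loaded only once per process is to cache the loader. The sketch below uses functools.lru_cache, which loads lazily on the first call rather than at server start as described above, so it is a variation on the same idea. The names get_model, _load_model_from_disk, and the "model" path are hypothetical, and the loader body is a placeholder for restoring the TensorFlow session.

```python
import functools

LOAD_COUNT = 0  # only here to show how often the loader really runs


def _load_model_from_disk(path):
    """Placeholder for the expensive step (restoring the TensorFlow session)."""
    global LOAD_COUNT
    LOAD_COUNT += 1
    return {"path": path}


@functools.lru_cache(maxsize=1)
def get_model(path="model"):
    """Return the shared model, loading it on the first call only."""
    return _load_model_from_disk(path)


first = get_model()
second = get_model()  # served from the cache, no second load
```

Every view can then call get_model() and receive the same object instead of reloading the model for each request.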

Most of the time spent on front-end request recommendations is the time spent on crawler requests, and if the access frequency is too high, the request will be rejected by Douban for a period of time.

2. Show screenshots

Backend recommendation results

Recommend movies that users like

Recommend similar movies

Recommend movies that users who have watched also like to watch

4. Self-evaluation and summary of experimental projects

Through this experiment I got my first real introduction to deep learning and gained a basic understanding of how to use the TensorFlow framework. The topic of this experiment, recommendation, is a direction I like. I had previously learned about algorithms such as collaborative filtering, and studying how deep networks extract features from data has deepened my understanding of recommendation. ~~Of course, the core code and model architecture of this experiment are copied.~~

5. The author has something to say

If you need the code, please privately message the blogger and he will reply after seeing it.
