Intelligent movie recommendation system based on TensorFlow + CNN + collaborative filtering algorithm - deep learning algorithm application (including WeChat applet, ipynb engineering source code) + MovieLens data set (2)



Preface

This project is built around the MovieLens dataset and uses a 2D text convolutional network model implemented in TensorFlow. Combined with a collaborative filtering algorithm that computes the cosine similarity between movies, it offers two different movie recommendation methods, triggered when the user taps a movie in the applet.

First, the project uses the MovieLens dataset, which contains a large number of user ratings and comments on movies. This data is used to train collaborative filtering algorithms to recommend movies that are similar to the user's preferences.

Secondly, the project uses the 2D text convolutional network model in TensorFlow, which can process the text description information of movies. By learning the text features of the movie, the model can better understand the content and style of the movie.

When users interact with the mini program, there are two different ways to recommend movies:

  1. Collaborative filtering recommendation: based on the user's historical ratings, the collaborative filtering algorithm recommends movies similar to the user's preferences. This is a traditional recommendation method that analyzes the behavior of the user and of other users.

  2. Text convolutional network recommendation: the user triggers the text convolutional network model by tapping a movie or entering a text description. The model analyzes the movie's text information and recommends other movies that match the input movie or description. This method focuses more on similarity of content and plot.

Taken together, this project combines collaborative filtering and deep learning technology to provide users with two different but effective ways to recommend movies. This improves user experience and makes it easier for them to find movies that suit their tastes.
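The cosine similarity mentioned above can be sketched as follows. This is a minimal illustration, not the project's actual code: `movie_matrix` is a hypothetical (n_movies × n_features) array standing in for the movie feature vectors the trained network produces.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical movie feature vectors, one row per movie.
movie_matrix = np.array([[1.0, 0.0, 2.0],
                         [2.0, 0.0, 4.0],
                         [0.0, 3.0, 0.0]])

# Similarity of movie 0 to every movie; the top scores (excluding itself)
# would be the recommendations.
sims = [cosine_similarity(movie_matrix[0], row) for row in movie_matrix]
```

Movie 1 points in the same direction as movie 0 (similarity 1.0), while movie 2 is orthogonal to it (similarity 0.0), so movie 1 would be recommended first.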

Overall design

This part includes the overall system structure diagram and system flow chart.

Overall system structure diagram

The overall structure of the system is shown in the figure.
[Figure: overall system structure]

System flow chart

The system flow is shown in the figure.

[Figure: system flow chart]

The model training process is shown in the figure.

[Figure: model training process]

The server operation process is shown in the figure.

[Figure: server operation process]

Operating environment

This part covers the Python environment, TensorFlow environment, back-end server, Django, and WeChat applet environments.

Module implementation

This project includes three modules: model training, the Django back end, and the WeChat applet front end. The function and related code of each module are given below.

1. Model training

Download the dataset and extract it to the ./ml-1m folder in the project directory. The dataset consists of user data (users.dat), movie data (movies.dat), and rating data (ratings.dat).

1) Data set analysis

users.dat contains the fields user ID, gender, age, occupation ID, and zip code.

The dataset README is available at http://files.grouplens.org/datasets/movielens/ml-1m-README.txt and describes the data as follows:

UserID, Gender, Age, Occupation, and Zip-code denote the user ID, gender, age, occupation, and zip code respectively. M represents male and F represents female. The age values are coded as:

  • 1: “Under 18”
  • 18: “18-24”
  • 25: “25-34”
  • 35: “35-44”
  • 45: “45-49”
  • 50: “50-55”
  • 56: “56+”

The occupations are coded as:

  • 0: “other” or not specified
  • 1: “academic/educator”
  • 2: “artist”
  • 3: “clerical/admin”
  • 4: “college/grad student”
  • 5: “customer service”
  • 6: “doctor/health care”
  • 7: “executive/managerial”
  • 8: “farmer”
  • 9: “homemaker”
  • 10: “K-12 student”
  • 11: “lawyer”
  • 12: “programmer”
  • 13: “retired”
  • 14: “sales/marketing”
  • 15: “scientist”
  • 16: “self-employed”
  • 17: “technician/engineer”
  • 18: “tradesman/craftsman”
  • 19: “unemployed”
  • 20: “writer”

View the first 5 rows of users.dat. The relevant code is as follows:

# View users.dat
import pandas as pd

users_title = ['UserID', 'Gender', 'Age', 'OccupationID', 'Zip-code']
users = pd.read_table('./ml-1m/users.dat', sep='::', header=None,
                      names=users_title, engine='python')
users.head()

The results are shown in the figure.
[Figure: first 5 rows of users.dat]

UserID, Gender, Age, and Occupation are all category fields; the zip code field is not used. The ratings.dat data has the fields user ID, movie ID, rating, and timestamp. According to the dataset README: UserID ranges from 1 to 6040; MovieID ranges from 1 to 3952; Rating is the score, with a maximum of 5 stars; Timestamp is the timestamp; and each user has at least 20 ratings. View the first 5 rows of ratings.dat. The results are shown in the figure. The relevant code is as follows:

# View ratings.dat
import pandas as pd

ratings_title = ['UserID', 'MovieID', 'Rating', 'timestamps']
ratings = pd.read_table('./ml-1m/ratings.dat', sep='::', header=None,
                        names=ratings_title, engine='python')
ratings.head()

[Figure: first 5 rows of ratings.dat]
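The README claims quoted above (ratings are 1 to 5 stars, every user has at least 20 ratings) can be sanity-checked with pandas. This is a sketch on a toy DataFrame; on the real data you would reuse the `ratings` frame loaded above.

```python
import pandas as pd

# Toy stand-in for the real ratings frame: two users, 20 ratings each.
ratings = pd.DataFrame({
    'UserID':  [1] * 20 + [2] * 20,
    'MovieID': list(range(1, 21)) * 2,
    'Rating':  [5, 3] * 20,
})

# All ratings must lie between 1 and 5 stars (inclusive).
assert ratings['Rating'].between(1, 5).all()

# Every user must have at least 20 ratings.
counts = ratings.groupby('UserID')['MovieID'].count()
assert (counts >= 20).all()
```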

The rating field Rating is the target of supervised learning, and the timestamp field is not used. The movies.dat data has the fields movie ID, movie title, and movie genres. The dataset README describes it as follows:

Use MovieID, Title, and Genres, where MovieID and Genres are category fields and Title is text. Titles are identical to those provided by IMDB (including the year of release). Genres are pipe-separated and selected from the following genres:

[Figure: the genre list from the dataset README]

View the first 5 rows of movies.dat. The result is shown in the figure. The relevant code is as follows:

# View movies.dat
import pandas as pd

movies_title = ['MovieID', 'Title', 'Genres']
movies = pd.read_table('./ml-1m/movies.dat', sep='::', header=None,
                       names=movies_title, engine='python')
movies.head()

[Figure: first rows of movies.dat]

2) Data preprocessing

Examining the field types in the dataset shows that some are category fields, which are converted to one-hot encoding. However, one-hot encoding the UserID and MovieID fields would make the input extremely sparse and sharply expand its dimensionality, so during preprocessing these fields are converted to plain numbers instead. The operations are as follows:

  • UserID, Occupation and MovieID remain unchanged.
  • Gender field: F and M need to be converted into 0 and 1.
  • Age field: converted into 7 consecutive numbers 0~6.

Genres field: it is a category field and needs to be converted to numbers. First build a dictionary mapping each genre string to a number. Since some movies combine multiple genres, convert the Genres field of each movie into a list of numbers.

Title field: processed the same way as Genres. First, create a dictionary from words to numbers; second, convert the Title into a list of numbers and remove the year from the Title.
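The year-removal step can be sketched with the same regex that appears in the preprocessing code below, shown here on a single hypothetical title:

```python
import re

# Capture everything before the trailing "(year)" as group 1 and the
# digits of the year as group 2.
pattern = re.compile(r'^(.*)\((\d+)\)$')

m = pattern.match('Toy Story (1995)')
title, year = m.group(1), m.group(2)
# Note: group 1 keeps the trailing space before the parenthesis.
```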

Unify the lengths of the Genres and Title fields so the neural network can process them, padding the empty positions with the number corresponding to &lt;PAD&gt;. The code that implements the data preprocessing is as follows:

# Data preprocessing
import pickle
import re

import pandas as pd

def load_data():
    # Process users.dat
    users_title = ['UserID', 'Gender', 'Age', 'JobID', 'Zip-code']
    users = pd.read_table('./ml-1m/users.dat', sep='::', header=None,
                          names=users_title, engine='python')
    # Drop the zip code
    users = users.filter(regex='UserID|Gender|Age|JobID')
    users_orig = users.values
    # Map gender and age to numbers
    gender_map = {'F': 0, 'M': 1}
    users['Gender'] = users['Gender'].map(gender_map)
    age_map = {val: ii for ii, val in enumerate(set(users['Age']))}
    users['Age'] = users['Age'].map(age_map)
    # Process movies.dat
    movies_title = ['MovieID', 'Title', 'Genres']
    movies = pd.read_table('./ml-1m/movies.dat', sep='::', header=None,
                           names=movies_title, engine='python')
    movies_orig = movies.values
    # Strip the release year from the Title
    pattern = re.compile(r'^(.*)\((\d+)\)$')
    title_map = {val: pattern.match(val).group(1) for val in set(movies['Title'])}
    movies['Title'] = movies['Title'].map(title_map)
    # Build a genre-to-number dictionary
    genres_set = set()
    for val in movies['Genres'].str.split('|'):
        genres_set.update(val)
    genres_set.add('<PAD>')
    genres2int = {val: ii for ii, val in enumerate(genres_set)}
    # Convert the genres to equal-length lists of numbers (length 18)
    genres_map = {val: [genres2int[row] for row in val.split('|')]
                  for val in set(movies['Genres'])}
    for key in genres_map:
        for cnt in range(max(genres2int.values()) - len(genres_map[key])):
            genres_map[key].insert(len(genres_map[key]) + cnt, genres2int['<PAD>'])
    movies['Genres'] = movies['Genres'].map(genres_map)
    # Build a Title-word-to-number dictionary
    title_set = set()
    for val in movies['Title'].str.split():
        title_set.update(val)
    title_set.add('<PAD>')
    title2int = {val: ii for ii, val in enumerate(title_set)}
    # Convert the Titles to equal-length lists of numbers (length 15)
    title_count = 15
    title_map = {val: [title2int[row] for row in val.split()]
                 for val in set(movies['Title'])}
    for key in title_map:
        for cnt in range(title_count - len(title_map[key])):
            title_map[key].insert(len(title_map[key]) + cnt, title2int['<PAD>'])
    movies['Title'] = movies['Title'].map(title_map)
    # Process ratings.dat
    ratings_title = ['UserID', 'MovieID', 'ratings', 'timestamps']
    ratings = pd.read_table('./ml-1m/ratings.dat', sep='::', header=None,
                            names=ratings_title, engine='python')
    ratings = ratings.filter(regex='UserID|MovieID|ratings')
    # Merge the three tables
    data = pd.merge(pd.merge(ratings, users), movies)
    # Split the data into an X table and a y table
    target_fields = ['ratings']
    features_pd, targets_pd = data.drop(target_fields, axis=1), data[target_fields]
    features = features_pd.values
    targets_values = targets_pd.values
    return title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig

# Load the data and save it locally
# title_count: length of the Title field (15)
# title_set: the set of Title words
# genres2int: genre-to-number dictionary
# features: the input X
# targets_values: the learning target y
# ratings: Pandas object for the ratings data
# users: Pandas object for the user data
# movies: Pandas object for the movie data
# data: Pandas object combining the three datasets
# movies_orig: raw movie data before preprocessing
# users_orig: raw user data before preprocessing
# Call the preprocessing function
title_count, title_set, genres2int, features, targets_values, ratings, users, movies, data, movies_orig, users_orig = load_data()
# Save the preprocessing results
pickle.dump((title_count, title_set, genres2int, features,
             targets_values, ratings, users, movies, data,
             movies_orig, users_orig), open('preprocess.p', 'wb'))
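The pickled results can be reloaded in a later session (for example, when building the model) so that load_data() does not have to be rerun. A minimal round-trip sketch, using small stand-in values and a demo file name rather than the real preprocess.p:

```python
import pickle

# Stand-in values; in the project these are the objects returned by load_data().
title_count = 15
genres2int = {'<PAD>': 0, 'Comedy': 1}

# Dump to a demo file (the project uses preprocess.p).
with open('preprocess_demo.p', 'wb') as f:
    pickle.dump((title_count, genres2int), f)

# Reload in the same order the tuple was dumped.
with open('preprocess_demo.p', 'rb') as f:
    loaded_count, loaded_genres = pickle.load(f)
```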

View the preprocessed data as shown in the figure.

[Figure: preprocessed data]

The processed movies data is shown in the figure.

[Figure: processed movies data]

Other related blogs

Intelligent movie recommendation system based on TensorFlow + CNN + collaborative filtering algorithm - deep learning algorithm application (including WeChat applet, ipynb engineering source code) + MovieLens data set (1)

Intelligent movie recommendation system based on TensorFlow+CNN+collaborative filtering algorithm - deep learning algorithm application (including WeChat applet, ipynb engineering source code)+MovieLens data set (3)

Intelligent movie recommendation system based on TensorFlow + CNN + collaborative filtering algorithm - deep learning algorithm application (including WeChat applet, ipynb engineering source code) + MovieLens data set (4)

Intelligent movie recommendation system based on TensorFlow + CNN + collaborative filtering algorithm - deep learning algorithm application (including WeChat applet, ipynb engineering source code) + MovieLens data set (5)

Intelligent movie recommendation system based on TensorFlow + CNN + collaborative filtering algorithm - deep learning algorithm application (including WeChat applet, ipynb engineering source code) + MovieLens data set (6)

Intelligent movie recommendation system based on TensorFlow + CNN + collaborative filtering algorithm - deep learning algorithm application (including WeChat applet, ipynb engineering source code) + MovieLens data set (7)

Project source code download

For details, please see my blog resource download page


Download other information

If you want to learn more about the learning routes and knowledge systems related to artificial intelligence, you are welcome to read my other blog, "Heavyweight | Complete Artificial Intelligence AI Learning - Basic Knowledge Learning Route"; all materials can be downloaded directly from the network disk with no strings attached.
That blog draws on well-known open-source platforms on GitHub, AI technology platforms, and experts in related fields, including Datawhale, ApacheCN, AI Youdao, and Dr. Huang Haiguang, covering nearly 100 GB of related material. I hope it helps all my friends.

Origin: blog.csdn.net/qq_31136513/article/details/133124641