Case study: Predicting users' ratings of movies on Movielens with a Behavior Sequence Transformer (BST) recommendation model


Description: Rating prediction on the Movielens dataset using the Behavior Sequence Transformer (BST) model.

Introduction

This example demonstrates the Behavior Sequence Transformer (BST) model, by Qiwei Chen et al., using the Movielens dataset. The BST model leverages users' sequential behavior when watching and rating movies, as well as user profile and movie features, to predict the user's rating of a target movie.

More specifically, the BST model aims to predict the rating of a target movie by accepting the following inputs (a hypothetical input record is sketched after this list):

  1. A fixed-length sequence of movie_ids for the movies the user has watched.
  2. A fixed-length sequence of the ratings the user gave to the movies in that sequence.
  3. A set of user features, including user_id, sex, occupation, and age_group.
  4. A set of genres for each movie in the input sequence and for the target movie.
  5. A target_movie_id for which to predict the rating.
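
For concreteness, here is a hypothetical sketch of one such input record, using the feature names defined later in this example; the values are illustrative and not taken from the actual dataset.

# A hypothetical single input record (illustrative values only)
example = {
    "user_id": "user_1",
    "sequence_movie_ids": ["movie_10", "movie_20", "movie_30"],  # watched movies
    "sequence_ratings": [4.0, 3.0, 5.0],  # ratings given to those movies
    "sex": "M",
    "age_group": "group_25",
    "occupation": "occupation_4",
    "target_movie_id": "movie_40",  # the movie whose rating we predict
}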

This example makes the following modifications to the original BST model:

  1. We incorporate movie features (genres) into the embedding process for each input sequence and target movie, rather than treating them as “other features” outside the transformer layer.
  2. We use the rating of each movie in the input sequence, together with its position in the sequence, to update the movie's embedding before feeding it to the self-attention layer.

Note that this example should run on TensorFlow 2.4 or higher.

The dataset

We use the 1M version of the Movielens dataset.
The dataset includes approximately 1 million ratings from 6,000 users on 4,000 movies,
and also includes some user characteristics and movie genres. Additionally, each user-movie rating timestamp is provided,
which allows creating a sequence of movie ratings for each user, as expected by the BST model.

Setup

# Import the required libraries
import os  # Operating-system utilities

os.environ["KERAS_BACKEND"] = "tensorflow"  # Use TensorFlow as the Keras backend

import math  # Math utilities
from zipfile import ZipFile  # For extracting zip files
from urllib.request import urlretrieve  # For downloading files from a URL

import keras  # Keras, for building the deep learning model
import numpy as np  # Numerical arrays and matrices
import pandas as pd  # Tabular data handling
import tensorflow as tf  # TensorFlow, for building and training the model
from keras import layers  # Keras layers
from keras.layers import StringLookup  # String lookup layer

Prepare data

Download and prepare the DataFrames

First, let's download the movielens data.

The downloaded folder will contain three data files: users.dat, movies.dat, and ratings.dat.

# Download the Movielens dataset zip file
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip", "movielens.zip")

# Create a ZipFile object for the downloaded archive
zip_file = ZipFile("movielens.zip", "r")

# Extract everything in the archive into the current directory
zip_file.extractall()

We then load the data into pandas DataFrames using the correct column names.

# Load the users data
users = pd.read_csv(
    "ml-1m/users.dat",  # Path to the users data file
    sep="::",  # Fields are separated by double colons
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],  # Column names
    encoding="ISO-8859-1",  # ISO-8859-1 encoding
    engine="python",  # Use the Python parsing engine
)

# Load the ratings data
ratings = pd.read_csv(
    "ml-1m/ratings.dat",  # Path to the ratings data file
    sep="::",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
    encoding="ISO-8859-1",
    engine="python",
)

# Load the movies data
movies = pd.read_csv(
    "ml-1m/movies.dat",  # Path to the movies data file
    sep="::",
    names=["movie_id", "title", "genres"],
    encoding="ISO-8859-1",
    engine="python",
)

Here, we do some simple data processing to fix the data types of the columns.

# Prefix each user_id in the users data
users["user_id"] = users["user_id"].apply(lambda x: f"user_{x}")

# Prefix each age_group in the users data
users["age_group"] = users["age_group"].apply(lambda x: f"group_{x}")

# Prefix each occupation in the users data
users["occupation"] = users["occupation"].apply(lambda x: f"occupation_{x}")

# Prefix each movie_id in the movies data
movies["movie_id"] = movies["movie_id"].apply(lambda x: f"movie_{x}")

# Prefix each movie_id in the ratings data
ratings["movie_id"] = ratings["movie_id"].apply(lambda x: f"movie_{x}")

# Prefix each user_id in the ratings data
ratings["user_id"] = ratings["user_id"].apply(lambda x: f"user_{x}")

# Convert the rating column to float
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))

Each movie has multiple genres. We split them into separate columns in the movies DataFrame.

# Define the list of movie genres
genres = ["Action", "Adventure", "Animation", "Children's", "Comedy", "Crime"]
genres += ["Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical"]
genres += ["Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

# For each genre, add a binary column to the movies DataFrame:
# 1 if the movie has that genre, 0 otherwise
for genre in genres:
    movies[genre] = movies["genres"].apply(
        lambda values: int(genre in values.split("|"))
    )

Transform the movie ratings data into sequences

First, let's sort the ratings data using unix_timestamp, and then group the movie_id values and the rating values by user_id.

The output DataFrame will have a record for each user_id, containing two ordered lists (sorted by rating date): the movies they have rated and the ratings they gave those movies.

# Sort the ratings by "unix_timestamp" and group them by "user_id"
ratings_group = ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")

# Build a new DataFrame ratings_data with the columns:
# user_id, movie_ids, ratings, timestamps
ratings_data = pd.DataFrame(
    data={
        "user_id": list(ratings_group.groups.keys()),  # The grouped user IDs
        "movie_ids": list(ratings_group.movie_id.apply(list)),  # Each user's list of movie IDs
        "ratings": list(ratings_group.rating.apply(list)),  # Each user's list of ratings
        "timestamps": list(ratings_group.unix_timestamp.apply(list)),  # Each user's list of timestamps
    }
)

Now, let's split the movie_ids list into a set of fixed-length sequences, and do the same for ratings.
Set the sequence_length variable to change the length of the model's input sequence.
You can also change step_size to control the number of sequences generated for each user.

# Define the window size and step size
sequence_length = 4
step_size = 2

# Given a list of values, a window size, and a step size, return a list of
# fixed-length sequences
def create_sequences(values, window_size, step_size):
    sequences = []  # Holds the generated sequences
    start_index = 0
    while True:
        end_index = start_index + window_size
        seq = values[start_index:end_index]  # Slice out a window
        if len(seq) < window_size:  # The remaining values are shorter than the window
            seq = values[-window_size:]  # Take the last window_size values instead
            if len(seq) == window_size:
                sequences.append(seq)
            break
        sequences.append(seq)
        start_index += step_size  # Advance the window
    return sequences

# Apply create_sequences to the movie ID lists
ratings_data.movie_ids = ratings_data.movie_ids.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

# Apply create_sequences to the rating lists
ratings_data.ratings = ratings_data.ratings.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

# Drop the timestamps column; it is no longer needed
del ratings_data["timestamps"]
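
As a quick sanity check (an addition, not part of the original example), here is what create_sequences produces for a toy input:

# Toy illustration: window_size=4, step_size=2
create_sequences([1, 2, 3, 4, 5], window_size=4, step_size=2)
# -> [[1, 2, 3, 4], [2, 3, 4, 5]]
# The final window is right-aligned to the end of the list, so no trailing
# values are dropped.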

After that, we process the output so that each sequence becomes a separate record in the DataFrame. In addition, we join the user features with the ratings data.

# Explode the "movie_ids" column so that each sequence becomes its own row
ratings_data_movies = ratings_data[["user_id", "movie_ids"]].explode(
    "movie_ids", ignore_index=True
)

# Explode the "ratings" column the same way
ratings_data_rating = ratings_data[["ratings"]].explode("ratings", ignore_index=True)

# Concatenate the exploded "movie_ids" and "ratings" columns into one DataFrame
ratings_data_transformed = pd.concat([ratings_data_movies, ratings_data_rating], axis=1)

# Join the user features onto the sequences via "user_id"
ratings_data_transformed = ratings_data_transformed.join(
    users.set_index("user_id"), on="user_id"
)

# Serialize each movie ID sequence as a comma-separated string
ratings_data_transformed.movie_ids = ratings_data_transformed.movie_ids.apply(
    lambda x: ",".join(x)
)

# Serialize each rating sequence as a comma-separated string
ratings_data_transformed.ratings = ratings_data_transformed.ratings.apply(
    lambda x: ",".join([str(v) for v in x])
)

# Drop the "zip_code" column; it is not used
del ratings_data_transformed["zip_code"]

# Rename "movie_ids" to "sequence_movie_ids" and "ratings" to "sequence_ratings"
ratings_data_transformed.rename(
    columns={"movie_ids": "sequence_movie_ids", "ratings": "sequence_ratings"},
    inplace=True,
)

With a sequence_length of 4 and a step_size of 2, we end up with 498,623 sequences.
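
If you want to verify this count yourself, a quick check such as the following should do (the exact number depends on sequence_length and step_size):

# Number of generated sequences
print(len(ratings_data_transformed.index))  # expected: 498623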

Finally, we split the data into a training set and a test set, accounting for 85% and 15% of the total data respectively, and store them as CSV files.

# Draw a random number in [0, 1) for each row; rows where the draw is <= 0.85
# go to the training split
random_selection = np.random.rand(len(ratings_data_transformed.index)) <= 0.85

# Select ~85% of the data for training
train_data = ratings_data_transformed[random_selection]

# The remaining ~15% is used for testing
test_data = ratings_data_transformed[~random_selection]

# Save the training data as CSV: no index, "|" as delimiter, no header
train_data.to_csv("train_data.csv", index=False, sep="|", header=False)

# Save the test data the same way
test_data.to_csv("test_data.csv", index=False, sep="|", header=False)

Define metadata

# CSV_HEADER is the list of column names in ratings_data_transformed
CSV_HEADER = list(ratings_data_transformed.columns)

# Map each categorical feature to its list of unique values (its vocabulary)
CATEGORICAL_FEATURES_WITH_VOCABULARY = {
    "user_id": list(users.user_id.unique()),
    "movie_id": list(movies.movie_id.unique()),
    "sex": list(users.sex.unique()),
    "age_group": list(users.age_group.unique()),
    "occupation": list(users.occupation.unique()),
}

# The user features
USER_FEATURES = ["sex", "age_group", "occupation"]

# The movie features
MOVIE_FEATURES = ["genres"]
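
Since the embedding dimensions used later are derived from these vocabularies (the square root of each vocabulary size), it can be helpful to print them. This quick check is an addition to the original example:

# Print each vocabulary size and the embedding dimension derived from it
for name, vocabulary in CATEGORICAL_FEATURES_WITH_VOCABULARY.items():
    print(name, len(vocabulary), int(math.sqrt(len(vocabulary))))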

Create tf.data.Dataset for training and evaluation

# get_dataset_from_csv builds a tf.data.Dataset from a CSV file.
# Arguments:
# - csv_file_path: path to the CSV file
# - shuffle: whether to shuffle the data (default False)
# - batch_size: the batch size (default 128)
def get_dataset_from_csv(csv_file_path, shuffle=False, batch_size=128):

    # process splits each example's serialized sequences and separates the
    # target movie and the target rating from the rest of the sequence
    def process(features):
        # Split the comma-separated movie ID sequence into a dense tensor
        movie_ids_string = features["sequence_movie_ids"]
        sequence_movie_ids = tf.strings.split(movie_ids_string, ",").to_tensor()

        # The last movie ID in the sequence is the target movie
        features["target_movie_id"] = sequence_movie_ids[:, -1]

        # The input sequence is everything except the last movie ID
        features["sequence_movie_ids"] = sequence_movie_ids[:, :-1]

        # Split the comma-separated rating sequence and convert it to floats
        ratings_string = features["sequence_ratings"]
        sequence_ratings = tf.strings.to_number(
            tf.strings.split(ratings_string, ","), tf.dtypes.float32
        ).to_tensor()

        # The last rating in the sequence is the target the model predicts
        target = sequence_ratings[:, -1]

        # The input ratings are everything except the last rating
        features["sequence_ratings"] = sequence_ratings[:, :-1]

        return features, target

    # Build the dataset from the CSV file
    dataset = tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=CSV_HEADER,
        num_epochs=1,
        header=False,
        field_delim="|",
        shuffle=shuffle,
    ).map(process)

    return dataset
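
As a minimal sketch (assuming train_data.csv has already been written), you can peek at one batch to verify the shapes the pipeline produces:

# Inspect one batch: each sequence feature has sequence_length - 1 columns,
# and the target holds the last rating of each sequence
sample_dataset = get_dataset_from_csv("train_data.csv", batch_size=4)
for features, target in sample_dataset.take(1):
    print(features["sequence_movie_ids"].shape)  # (4, 3)
    print(features["target_movie_id"].shape)  # (4,)
    print(target.shape)  # (4,)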

Create model inputs

# create_model_inputs returns a dictionary of the model's Keras inputs
def create_model_inputs():
    return {
        "user_id": keras.Input(name="user_id", shape=(1,), dtype="string"),  # User ID, a string
        "sequence_movie_ids": keras.Input(
            name="sequence_movie_ids", shape=(sequence_length - 1,), dtype="string"
        ),  # Sequence of movie IDs, as strings
        "target_movie_id": keras.Input(
            name="target_movie_id", shape=(1,), dtype="string"
        ),  # Target movie ID, a string
        "sequence_ratings": keras.Input(
            name="sequence_ratings", shape=(sequence_length - 1,), dtype=tf.float32
        ),  # Sequence of ratings, as floats
        "sex": keras.Input(name="sex", shape=(1,), dtype="string"),  # Sex, a string
        "age_group": keras.Input(name="age_group", shape=(1,), dtype="string"),  # Age group, a string
        "occupation": keras.Input(name="occupation", shape=(1,), dtype="string"),  # Occupation, a string
    }

Encode input features

The encode_input_features method works as follows:

  1. Each categorical user feature is encoded using layers.Embedding, with an embedding dimension equal to the square root of the feature's vocabulary size. The embeddings of these features are concatenated to form a single input tensor.

  2. Each movie in the movie sequence, and the target movie, is encoded using layers.Embedding, with an embedding dimension equal to the square root of the number of movies.

  3. The multi-hot genre vector of each movie is concatenated with its embedding vector, and processed with a nonlinear layers.Dense to output a vector of the same dimension as the movie embedding.

  4. A position embedding is added to each movie embedding in the sequence, and the result is multiplied by the movie's rating from the ratings sequence.

  5. The target movie embedding is concatenated to the sequence movie embeddings, producing a tensor of shape [batch size, sequence length, embedding size], as expected by the attention layer of the Transformer architecture.

  6. The method returns a tuple of two elements: encoded_transformer_features and encoded_other_features.

# encode_input_features encodes the model inputs.
# Arguments:
# - inputs: the dictionary of model inputs
# - include_user_id: whether to include the user ID (default True)
# - include_user_features: whether to include the user features (default True)
# - include_movie_features: whether to include the movie features (default True)
# Returns:
# - encoded_transformer_features: the features fed to the transformer layer
# - encoded_other_features: the remaining (non-sequence) features
def encode_input_features(
    inputs,
    include_user_id=True,
    include_user_features=True,
    include_movie_features=True,
):
    # Initialize the transformer feature and other feature lists
    encoded_transformer_features = []
    encoded_other_features = []

    # Collect the names of the non-sequence ("other") features
    other_feature_names = []
    if include_user_id:
        other_feature_names.append("user_id")
    if include_user_features:
        other_feature_names.extend(USER_FEATURES)

    # Encode the user features
    for feature_name in other_feature_names:
        # Convert the string input values into integer indices
        vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
        idx = StringLookup(vocabulary=vocabulary, mask_token=None, num_oov_indices=0)(
            inputs[feature_name]
        )
        # Compute the embedding dimension
        embedding_dims = int(math.sqrt(len(vocabulary)))
        # Create an embedding layer with the computed dimension
        embedding_encoder = layers.Embedding(
            input_dim=len(vocabulary),
            output_dim=embedding_dims,
            name=f"{feature_name}_embedding",
        )
        # Convert the indices into embeddings
        encoded_other_features.append(embedding_encoder(idx))

    # Create a single embedding vector for the user features
    if len(encoded_other_features) > 1:
        encoded_other_features = layers.concatenate(encoded_other_features)
    elif len(encoded_other_features) == 1:
        encoded_other_features = encoded_other_features[0]
    else:
        encoded_other_features = None

    # Create the movie embedding encoder
    movie_vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY["movie_id"]
    movie_embedding_dims = int(math.sqrt(len(movie_vocabulary)))
    # Create a lookup to convert string movie IDs into integer indices
    movie_index_lookup = StringLookup(
        vocabulary=movie_vocabulary,
        mask_token=None,
        num_oov_indices=0,
        name="movie_index_lookup",
    )
    # Create an embedding layer with the computed dimension
    movie_embedding_encoder = layers.Embedding(
        input_dim=len(movie_vocabulary),
        output_dim=movie_embedding_dims,
        name="movie_embedding",
    )
    # Create a frozen lookup table for the movies' genre vectors
    genre_vectors = movies[genres].to_numpy()
    movie_genres_lookup = layers.Embedding(
        input_dim=genre_vectors.shape[0],
        output_dim=genre_vectors.shape[1],
        embeddings_initializer=keras.initializers.Constant(genre_vectors),
        trainable=False,
        name="genres_vector",
    )
    # Create a processing layer that fuses a movie embedding with its genres
    movie_embedding_processor = layers.Dense(
        units=movie_embedding_dims,
        activation="relu",
        name="process_movie_embedding_with_genres",
    )

    # encode_movie encodes a given (batch of) movie ID(s)
    def encode_movie(movie_id):
        # Convert the string input values into integer indices
        movie_idx = movie_index_lookup(movie_id)
        movie_embedding = movie_embedding_encoder(movie_idx)
        encoded_movie = movie_embedding
        if include_movie_features:
            movie_genres_vector = movie_genres_lookup(movie_idx)
            encoded_movie = movie_embedding_processor(
                layers.concatenate([movie_embedding, movie_genres_vector])
            )
        return encoded_movie

    # Encode the target movie ID
    target_movie_id = inputs["target_movie_id"]
    encoded_target_movie = encode_movie(target_movie_id)

    # Encode the sequence of movie IDs
    sequence_movies_ids = inputs["sequence_movie_ids"]
    encoded_sequence_movies = encode_movie(sequence_movies_ids)
    # Create the position embeddings
    position_embedding_encoder = layers.Embedding(
        input_dim=sequence_length,
        output_dim=movie_embedding_dims,
        name="position_embedding",
    )
    positions = tf.range(start=0, limit=sequence_length - 1, delta=1)
    encoded_positions = position_embedding_encoder(positions)
    # Retrieve the sequence ratings to incorporate them into the movie encodings
    sequence_ratings = inputs["sequence_ratings"]
    sequence_ratings = keras.ops.expand_dims(sequence_ratings, -1)
    # Add the position encodings to the movie encodings and multiply by the ratings
    encoded_sequence_movies_with_position_and_rating = layers.Multiply()(
        [(encoded_sequence_movies + encoded_positions), sequence_ratings]
    )

    # Construct the transformer inputs
    for i in range(sequence_length - 1):
        feature = encoded_sequence_movies_with_position_and_rating[:, i, ...]
        feature = keras.ops.expand_dims(feature, 1)
        encoded_transformer_features.append(feature)
    encoded_transformer_features.append(encoded_target_movie)

    encoded_transformer_features = layers.concatenate(
        encoded_transformer_features, axis=1
    )

    return encoded_transformer_features, encoded_other_features

Create the BST model

# Model configuration
include_user_id = False  # Whether to include the user ID feature
include_user_features = False  # Whether to include the user features
include_movie_features = False  # Whether to include the movie features

hidden_units = [256, 128]  # Units of the fully-connected hidden layers
dropout_rate = 0.1  # Dropout rate
num_heads = 3  # Number of attention heads

# create_model builds the BST model
def create_model():
    inputs = create_model_inputs()  # Create the model inputs
    transformer_features, other_features = encode_input_features(
        inputs, include_user_id, include_user_features, include_movie_features
    )  # Encode the input features

    # Create a multi-head attention layer
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=transformer_features.shape[2], dropout=dropout_rate
    )(transformer_features, transformer_features)

    # Transformer block
    attention_output = layers.Dropout(dropout_rate)(attention_output)
    x1 = layers.Add()([transformer_features, attention_output])
    x1 = layers.LayerNormalization()(x1)
    x2 = layers.LeakyReLU()(x1)
    x2 = layers.Dense(units=x2.shape[-1])(x2)
    x2 = layers.Dropout(dropout_rate)(x2)
    transformer_features = layers.Add()([x1, x2])
    transformer_features = layers.LayerNormalization()(transformer_features)
    features = layers.Flatten()(transformer_features)

    # Concatenate the other (non-sequence) features, if any
    if other_features is not None:
        features = layers.concatenate(
            [features, layers.Reshape([other_features.shape[-1]])(other_features)]
        )

    # Fully-connected layers
    for num_units in hidden_units:
        features = layers.Dense(num_units)(features)
        features = layers.BatchNormalization()(features)
        features = layers.LeakyReLU()(features)
        features = layers.Dropout(dropout_rate)(features)

    outputs = layers.Dense(units=1)(features)  # Output layer: the predicted rating
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

model = create_model()  # Build the model

Run training and evaluation experiments

# Compile the model
model.compile(
    optimizer=keras.optimizers.Adagrad(learning_rate=0.01),  # Adagrad optimizer, learning rate 0.01
    loss=keras.losses.MeanSquaredError(),  # Mean squared error loss
    metrics=[keras.metrics.MeanAbsoluteError()],  # Track the mean absolute error
)

# Read the training data
train_dataset = get_dataset_from_csv("train_data.csv", shuffle=True, batch_size=265)

# Fit the model on the training data
model.fit(train_dataset, epochs=5)

# Read the test data
test_dataset = get_dataset_from_csv("test_data.csv", batch_size=265)

# Evaluate the model on the test data
_, mae = model.evaluate(test_dataset, verbose=0)
print(f"Test MAE: {round(mae, 3)}")  # Print the mean absolute error on the test data

You should achieve a Mean Absolute Error (MAE) of approximately 0.7 on the test data.

Conclusion

The BST model uses a Transformer layer in its architecture to capture the sequential signals of user behavior sequences in recommendations.

You can try training the model with different configurations, for example by increasing the input sequence length or training for a larger number of epochs. In addition, you can try including other features, such as the movie release year and the user's zip code, as well as crossed features such as sex X genre.

Origin: blog.csdn.net/wjjc1017/article/details/135197984