LightGCN, a lightweight graph convolutional network: explanation and practice

Reprinted from: deephub

Recommender systems are among the most influential ML applications in industry today. From Taobao to Douyin, tech companies are constantly trying to build better recommender systems for their specific apps. The task is not getting any easier, because users expect more options to choose from every day. A model must therefore not only make good recommendations but also make them efficiently. The model introduced here is the Light Graph Convolution Network, or LightGCN¹.

2836641543047b5b14f5c2f0085eba58.png

Imagine users and items as the nodes of a bipartite graph, in which each user is connected to the items they have selected. Finding the best items to recommend then becomes a link prediction problem.
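To make the graph view concrete, here is a minimal sketch that stores a bipartite user-item graph as a biadjacency matrix and lists the pairs a link predictor would have to rank. The toy interactions and sizes are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical toy interactions: (user index, item index) pairs.
interactions = [(0, 0), (0, 1), (1, 1), (2, 2), (2, 3)]
num_users, num_items = 3, 4

# Biadjacency matrix R: R[u, i] = 1 if user u has selected item i.
R = np.zeros((num_users, num_items), dtype=np.int64)
for u, i in interactions:
    R[u, i] = 1

# Link prediction ranks the user-item pairs that are not connected yet.
candidates = np.argwhere(R == 0)
print(R)
print(len(candidates), "candidate links")
```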

Example dataset

As a practical example, the users are music listeners and the "items" are the music artists they look for. The original dataset is available at ³.

e063a157afeda7ef248d6bb6c2049d29.png

The dataset contains 1824 users, 6854 artists and 20,664 labels (user-artist connections). A typical artist is associated with about 3 users and a typical user with about 11 artists, because this dataset contains substantially more artists than users. A useful feature of the dataset is that it records the creation time of each connection, which lets us split the data by connection time into a training set (the earliest connections) and a test set (the latest ones)³. Our goal is to build a recommender model that predicts the new labels/connections formed in the future.
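A time-based split of this kind could look like the sketch below. The column names and the toy rows are assumptions made for illustration; they are not the schema of the real dataset.

```python
import pandas as pd

# Hypothetical toy log; the column names are assumptions.
log = pd.DataFrame({
    "userID":   [1, 1, 2, 2, 3, 3],
    "artistID": [10, 11, 10, 12, 11, 13],
    "year":     [2007, 2008, 2007, 2009, 2008, 2010],
})


def split_by_year(df, test_year):
    """Train on connections made before test_year, test on those made in test_year."""
    train = df[df["year"] < test_year]
    test = df[df["year"] == test_year]
    return train, test


train_df, test_df = split_by_year(log, test_year=2009)
print(len(train_df), "training rows,", len(test_df), "test rows")
```

Splitting by time rather than at random keeps the model from seeing connections that were created after the ones it is asked to predict.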

Embedding-based models

LightGCN is an embedding-based model: it tries to find the best embeddings (vectors) for users and items. It also needs a scoring function f that scores candidate user-item pairs, so that the highest-scoring items can be recommended.

40e0fcb844fe57ed5c4433782afe7ee3.png

Users with similar preferences should end up with similar embeddings, and users with different preferences with dissimilar ones.

Before turning to LightGCN, here is a brief introduction to a simpler embedding-based model. Matrix factorization has been used in traditional recommender systems for many years with very good results, so it serves as our baseline model:

807356245b90eef0ea1df704347f5851.png

The figure above shows how the matrix factorization (MF) process relates to the original graph and to the embedding matrices².

Matrix factorization serves here as the baseline against which LightGCN is compared. Its scoring function f is simply the scalar product of two embeddings, and the model is trained by minimizing the Frobenius norm of the matrix (R - HW), where R is the user-item adjacency matrix, H contains the user embeddings, and W contains the item embeddings². The scoring function f is the same for LightGCN, but to understand that model intuitively, first consider what performance goal it is optimized for.
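The baseline can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the toy matrix R, the embedding size, the learning rate and the use of plain gradient descent are all choices made here, not the setup used for the results reported later.

```python
import numpy as np

rng = np.random.default_rng(0)
num_users, num_items, dim = 3, 4, 2

# Toy user-item adjacency matrix R (hypothetical data).
R = np.array([[1, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
H = rng.normal(scale=0.1, size=(num_users, dim))  # user embeddings, one row per user
W = rng.normal(scale=0.1, size=(dim, num_items))  # item embeddings, one column per item


def frobenius_loss(R, H, W):
    # Squared Frobenius norm of (R - HW); it has the same minimizer as the norm itself.
    return np.linalg.norm(R - H @ W, ord="fro") ** 2


lr = 0.1  # assumed step size
initial = frobenius_loss(R, H, W)
for _ in range(500):
    E = R - H @ W  # residual
    # Simultaneous gradient step on H and W (the factor 2 is absorbed into lr).
    H, W = H + lr * E @ W.T, W + lr * H.T @ E
final = frobenius_loss(R, H, W)

# f(u, i) is the scalar product of user u's and item i's embeddings.
scores = H @ W
print(round(initial, 3), "->", round(final, 3))
```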

But how to measure performance?

A popular performance measure works as follows. For each user, take the actual new edges of that user in the test set and the model's top K predictions (the K items with the highest score f(user, item)). The user's score is the fraction of the actual new edges that appear among the top K predictions. The scores of all users are then averaged to obtain the final score, called Recall@K².

341ab96b71f3d3617258f3e413729c5f.png
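A minimal sketch of Recall@K follows. The function name, the argument layout and the toy scores are hypothetical, and excluding items already seen in training from the ranking is an assumption made here.

```python
import numpy as np


def recall_at_k(scores, train_mask, test_items, k):
    """Mean Recall@K over users that have at least one test item.

    scores:     (num_users, num_items) array of f(user, item)
    train_mask: boolean array of the same shape, True where the edge was seen in training
    test_items: dict mapping a user index to the set of items of its new test edges
    """
    # Items already seen in training can never be recommended again.
    masked = np.where(train_mask, -np.inf, scores)
    recalls = []
    for user, relevant in test_items.items():
        if not relevant:
            continue
        top_k = np.argsort(-masked[user])[:k]
        hits = len(relevant.intersection(top_k.tolist()))
        recalls.append(hits / len(relevant))
    return float(np.mean(recalls))


scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.4, 0.6]])
train_mask = np.array([[True, False, False, False],
                       [False, True, False, False]])
test_items = {0: {2, 3}, 1: {0}}
result = recall_at_k(scores, train_mask, test_items, k=2)
print(result)
```

In this toy case user 0 gets both of its new items into the top 2 (recall 1.0) and user 1 gets none (recall 0.0), so the average is 0.5.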

However, Recall@K is not differentiable, so a differentiable loss function has to be designed before the LightGCN model can be trained with gradient-based optimization.

The loss function is designed so that the scoring function yields a large value for future positive edges and a small value for future negative edges². A good way to combine these two goals is to require that, for a given user u, the difference between the score of a future positive edge and the score of a future negative edge be large:

7084a56930a8029773a31c9db554ab8b.png

Applying the sigmoid function maps the difference of the two scores to the interval (0, 1), so it can be treated as a probability. For a given user u, the terms for all pairs of positive and negative edges are combined into that user's loss. These per-user losses² are then averaged across all users to obtain the final loss, known as the Bayesian Personalized Ranking (BPR) loss:

b773ef28f1b583e869b67fdee9d5db28.png
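The loss described above can be sketched with NumPy as follows. The function name and the way positives and negatives are passed in are assumptions for illustration; a trainable version would use an autograd framework such as PyTorch instead.

```python
import numpy as np


def bpr_loss(scores, pos_edges, neg_items_per_user):
    """BPR loss on a score matrix.

    scores:             (num_users, num_items) array of f(user, item)
    pos_edges:          list of (user, positive item) pairs
    neg_items_per_user: dict mapping a user to the list of its negative items
    """
    per_user = {}
    for u, i in pos_edges:
        neg = np.asarray(neg_items_per_user[u])
        diff = scores[u, i] - scores[u, neg]
        # -log(sigmoid(x)) equals log(1 + exp(-x)); logaddexp computes it stably.
        losses = np.logaddexp(0.0, -diff)
        per_user.setdefault(u, []).append(losses.mean())
    # Average over each user's positive edges first, then over users.
    return float(np.mean([np.mean(v) for v in per_user.values()]))


pos_edges = [(0, 0), (1, 1)]
negatives = {0: [1, 2], 1: [0, 2]}
good = np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]])  # positives score highest
bad = -good                                          # positives score lowest
print(bpr_loss(good, pos_edges, negatives), bpr_loss(bad, pos_edges, negatives))
```

The loss is small when positive edges outscore negative ones and grows when the order is reversed, which is the behaviour the ranking objective asks for.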

LightGCN

Now to the main topic of this article. Matrix factorization only captures the first-order edge structure of the graph (only the information from the immediate neighbors of a given node), whereas we want the model to capture higher-order graph structure. LightGCN does this, and it starts training from node embeddings initialized with matrix factorization:

e5b158e7a25af4b662160ad6df7846ca.png

After the embeddings are initialized, LightGCN refines them with 3 layers. In each layer, every node obtains a new embedding by combining the embeddings of its neighbors. This can be thought of as a kind of graph convolution (see below for a comparison with image convolution):

b935be44147f923d5aa1ebcc4afd1cab.gif

Image convolution (left) can be seen as a special case of graph convolution (right). Graph convolution is a node-permutation-invariant operation.

As the figure above suggests, stacking more layers, much like stacking convolutional layers, lets a given node receive information from nodes farther away, which captures the higher-order graph structure we want. But how exactly are the embeddings combined into a new embedding at each iteration k? Here are two examples:

1f5e57d82338163a7d28e021f6f86d62.png

The figure above shows how the embeddings of adjacent items affect a user's embedding in the next layer, and vice versa¹. The influence of an initial embedding decreases with each iteration, while it reaches more nodes farther from its origin. This is what is meant by diffusing embeddings. This particular form of diffusion can also be vectorized, which speeds up the process, by building a diffusion matrix:

296d4d31b365983d1be1d6cfcc350e15.png

Build a diffusion matrix from a degree matrix and an adjacency matrix².

Since the diffusion matrix is calculated from the degree matrix D and the adjacency matrix A, it only needs to be computed once and contains no learnable parameters. The only learnable parameters in the model are the shallow input embeddings of the nodes. These are multiplied by the diffusion matrix K times, giving (K + 1) embeddings per node (including the initial one), which are then averaged to obtain the final embedding:

15057cf43415002891ea366da6b1ecf2.png
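The propagation can be sketched with dense NumPy matrices. The symmetric normalization D^(-1/2) A D^(-1/2) used below follows the LightGCN paper¹ and is an assumption about the formula in the figure; note that the PyTorch Geometric code in this article aggregates neighbors with a plain mean instead. The function name and toy data are hypothetical.

```python
import numpy as np


def lightgcn_embeddings(R, E0, num_layers):
    """Propagate embeddings with a fixed diffusion matrix and average the layers.

    R:  (num_users, num_items) biadjacency matrix
    E0: (num_users + num_items, dim) initial embeddings, users first
    """
    num_users, num_items = R.shape
    n = num_users + num_items
    # Adjacency matrix of the whole bipartite graph.
    A = np.zeros((n, n))
    A[:num_users, num_users:] = R
    A[num_users:, :num_users] = R.T
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5
    # Diffusion matrix: D^(-1/2) A D^(-1/2); it has no learnable parameters.
    A_tilde = inv_sqrt[:, None] * A * inv_sqrt[None, :]
    layers = [E0]
    for _ in range(num_layers):
        layers.append(A_tilde @ layers[-1])
    # Average the (K + 1) embeddings of every node.
    return np.mean(layers, axis=0), A_tilde


R = np.array([[1, 1, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)
E0 = np.random.default_rng(0).normal(size=(7, 2))
final, A_tilde = lightgcn_embeddings(R, E0, num_layers=2)
print(final.shape)
```

Because the diffusion matrix is fixed, only E0 would receive gradients during training, which is what keeps the model light.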

Now that we understand how the model propagates the input embeddings forward, we can implement it with PyTorch Geometric and then use the BPR loss described above to optimize the user and item embeddings. PyG (PyTorch Geometric) is a library built on PyTorch for writing and training graph neural networks (GNNs).

import torch

import torch.nn as nn
import torch.nn.functional as F
import torch_scatter
from torch_geometric.nn.conv import MessagePassing


class LightGCNStack(torch.nn.Module):
    def __init__(self, latent_dim, args):
        super(LightGCNStack, self).__init__()
        conv_model = LightGCN
        self.convs = nn.ModuleList()
        self.convs.append(conv_model(latent_dim))
        assert (args.num_layers >= 1), 'Number of layers is not >=1'
        for l in range(args.num_layers-1):
            self.convs.append(conv_model(latent_dim))

        self.latent_dim = latent_dim
        self.num_layers = args.num_layers
        self.dataset = None
        self.embeddings_users = None
        self.embeddings_artists = None

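    def reset_parameters(self):
        # The model keeps separate embedding tables for users and artists
        self.embeddings_users.reset_parameters()
        self.embeddings_artists.reset_parameters()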

    def init_data(self, dataset):
        self.dataset = dataset
        self.embeddings_users = torch.nn.Embedding(num_embeddings=dataset.num_users, embedding_dim=self.latent_dim).to('cuda')
        self.embeddings_artists = torch.nn.Embedding(num_embeddings=dataset.num_artists, embedding_dim=self.latent_dim).to('cuda')

    def forward(self):
        x_users, x_artists, batch = self.embeddings_users.weight, self.embeddings_artists.weight, \
                                                self.dataset.batch

        final_embeddings_users = torch.zeros(size=x_users.size(), device='cuda')
        final_embeddings_artists = torch.zeros(size=x_artists.size(), device='cuda')
        final_embeddings_users = final_embeddings_users + x_users/(self.num_layers + 1)
        final_embeddings_artists = final_embeddings_artists + x_artists/(self.num_layers+1)
        for i in range(self.num_layers):
            x_users = self.convs[i]((x_artists, x_users), self.dataset.edge_index_a2u, size=(self.dataset.num_artists, self.dataset.num_users))
            x_artists = self.convs[i]((x_users, x_artists), self.dataset.edge_index_u2a, size=(self.dataset.num_users, self.dataset.num_artists))
            final_embeddings_users = final_embeddings_users + x_users/(self.num_layers+1)
            final_embeddings_artists = final_embeddings_artists + x_artists/(self.num_layers + 1)

        return final_embeddings_users, final_embeddings_artists

    def decode(self, z1, z2, pos_edge_index, neg_edge_index):  # only pos and neg edges
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1)  # concatenate pos and neg edges
        logits = (z1[edge_index[0]] * z2[edge_index[1]]).sum(dim=-1)  # dot product
        return logits

    def decode_all(self, z_users, z_artists):
        prob_adj = z_users @ z_artists.t()  # get adj NxN
        #return (prob_adj > 0).nonzero(as_tuple=False).t()  # get predicted edge_list
        return prob_adj

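    def BPRLoss(self, prob_adj, real_adj, edge_index):
        # prob_adj: predicted user-artist scores; real_adj: known adjacency matrix;
        # edge_index: positive (user, artist) edges, shape [2, num_edges]
        loss = 0
        # Scores of the positive edges (rows and columns indexed explicitly)
        pos_scores = prob_adj[edge_index[0], edge_index[1]]
        for pos_score, node_index in zip(pos_scores, edge_index[0]):
            # Every artist the user is not connected to counts as a negative
            neg_scores = prob_adj[node_index, real_adj[node_index] == 0]
            # Mean of -log(sigmoid(pos - neg)) over this user's negatives
            loss = loss - torch.sum(torch.log(torch.sigmoid(pos_score.repeat(neg_scores.size()[0]) - neg_scores))) / \
                   neg_scores.size()[0]

        return loss / edge_index.size()[1]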

    def topN(self, user_id, n):
        z_users, z_artists = self.forward()
        scores = torch.squeeze(z_users[user_id] @ z_artists.t())
        return torch.topk(scores, k=n)


class LightGCN(MessagePassing):
    def __init__(self, latent_dim, **kwargs):
        super(LightGCN, self).__init__(node_dim=0, **kwargs)
        self.latent_dim = latent_dim

    def forward(self, x, edge_index, size=None):
        return self.propagate(edge_index=edge_index, x=(x[0], x[1]), size=size)

    def message(self, x_j):
        return x_j

    def aggregate(self, inputs, index, dim_size=None):
        return torch_scatter.scatter(src=inputs, index=index, dim=0, dim_size=dim_size, reduce='mean')

Prediction with LightGCN

PyTorch Geometric also provides utilities that help simplify the training process. After training, users with similar preferences have similar embeddings, so they are likely to like similar items. The final embeddings returned by the model can therefore be used to score items a user has not seen yet and predict that user's preference for them. Each user is recommended the K highest-scoring items that are new to that user. Just as in matrix factorization, the scoring function f is simply a scalar product of embeddings, computed efficiently by matrix multiplication:

e3bd6de1141caa8a01b4f83353028363.png

The test set also contains new users that did not appear in the training set. For these users, we simply recommend the K items that are most popular among all users in the training set.
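The prediction step, including the popularity fallback for unknown users, can be sketched as follows. The function, its arguments and the toy embeddings are hypothetical and independent of the wrapper class shown next.

```python
from collections import Counter

import numpy as np


def recommend(user, k, user_emb, item_emb, seen, popular_items):
    """Return k item indices for a user, or the most popular items for unknown users."""
    if user not in user_emb:
        return popular_items[:k]
    # f(user, item) for every item at once: one matrix-vector product.
    scores = item_emb @ user_emb[user]
    seen_items = list(seen.get(user, ()))
    if seen_items:
        scores[seen_items] = -np.inf  # only recommend items new to the user
    return np.argsort(-scores)[:k].tolist()


item_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.0]])
user_emb = {0: np.array([1.0, 0.2])}
seen = {0: {2}}
# Popularity ranking built from hypothetical training interactions.
train_items = [2, 2, 2, 0, 0, 1]
popular_items = [item for item, _ in Counter(train_items).most_common()]

known = recommend(0, 2, user_emb, item_emb, seen, popular_items)
unknown = recommend(99, 2, user_emb, item_emb, seen, popular_items)
print(known, unknown)
```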

from functools import partial

import get_pyg_data
from model import LightGCNStack
import torch

from src.data_preprocessing import TrainTestGenerator
from src.evaluator import Evaluator
from train_test import train, test
from torch_geometric.utils import train_test_split_edges
import time

import pandas as pd


class objectview(object):
    def __init__(self, *args, **kwargs):
        d = dict(*args, **kwargs)
        self.__dict__ = d


# Wrapper for evaluation
class LightGCN_recommender:
    def __init__(self, args):
        self.args = objectview(args)
        self.model = LightGCNStack(latent_dim=64, args=self.args).to('cuda')
        self.a_rev_dict = None
        self.u_rev_dict = None
        self.a_dict = None
        self.u_dict = None

    def fit(self, data: pd.DataFrame):
        # Default rankings when userID is not in training set
        self.default_recommendation = data["artistID"].value_counts().index.tolist()

        # LightGCN
        data, self.u_rev_dict, self.a_rev_dict, self.u_dict, self.a_dict = get_pyg_data.load_data(data)
        data = data.to("cuda")
        self.model.init_data(data)
        self.optimizer = torch.optim.Adam(params=self.model.parameters(), lr=0.001)

        best_val_perf = test_perf = 0

        for epoch in range(1, self.args.epochs+1):
            start = time.time()
            train_loss = train(self.model, data, self.optimizer)
            val_perf, tmp_test_perf = test(self.model, (data, data))
            if val_perf > best_val_perf:
                best_val_perf = val_perf
                test_perf = tmp_test_perf
            log = 'Epoch: {:03d}, Loss: {:.4f}, Val: {:.4f}, Test: {:.4f}, Elapsed time: {:.2f}'
            print(log.format(epoch, train_loss, best_val_perf, test_perf, time.time()-start))

    def recommend(self, user_id, n):
        try:
            recommendations = self.model.topN(self.u_dict[str(user_id)], n=n)
        except KeyError:
            # Unknown user: fall back to the n most popular artists in the training set
            recommendations = self.default_recommendation[:n]
        else:
            recommendations = recommendations.indices.cpu().tolist()
            recommendations = list(map(lambda x: self.a_rev_dict[x], recommendations))
        return recommendations


def evaluate(args):
    data_dir = "../data/"
    data_generator = TrainTestGenerator(data_dir)

    evaluator = Evaluator(partial(LightGCN_recommender, args), data_generator)
    evaluator.evaluate()

    evaluator.save_results('../results/lightgcn.csv', '../results/lightgcn_time.csv')
    print('Recall:')
    print(evaluator.get_recalls())
    print('MRR:')
    print(evaluator.get_mrr())


if __name__=='__main__':
    # best_val_perf = test_perf = 0
    # data = get_pyg_data.load_data()
    #data = train_test_split_edges(data)

    args = {'model_type': 'LightGCN', 'num_layers': 3, 'batch_size': 32, 'hidden_dim': 32,
         'dropout': 0, 'epochs': 1000, 'opt': 'adam', 'opt_scheduler': 'none', 'opt_restart': 0, 'weight_decay': 5e-3,
         'lr': 0.1, 'lambda_reg': 1e-4}

    evaluate(args)

Comparative Results

The model was evaluated on three test sets, one for each of the years 2008, 2009, and 2010. For a given test set, the training data consists of all connections made in earlier years. For example, the model tested on the 2010 test set was trained on all previous years (including 2008 and 2009), whereas the model tested on the 2008 test set was trained only on data from 2007 and earlier.

After the model has produced predictions, it is evaluated using the Recall@K introduced earlier. The first table below shows the results with matrix factorization as the baseline, while the second table below shows the results obtained with LightGCN:

2047982387760f5e4528967d9e305702.png

Recall@K score by matrix factorization

f70f104c73e47b6d552b5f2ff4485912.png

Recall@K score of LightGCN

As expected, Recall@K increases with K. The model seems to perform best on the 2010 test set, probably because the training set is largest in that case. Both tables clearly show that LightGCN outperforms the matrix factorization baseline in Recall@K. The figure below shows the average Recall@K over the three years.

95674f501854031b24e28d26530c4bd9.png

Another metric that can be used is the mean reciprocal rank (MRR). It tries to capture how confident the model is in the connections it predicts. It considers the set Q of all new connections that are actually correct. For each such connection, it counts the incorrectly predicted connections (false positives) ranked above it to obtain a rank for that connection (the smallest possible rank is 1, because the correct connection itself is counted). The reciprocals of these ranks are averaged to obtain the MRR:

c36848d8e73db1c24c5fefc724681480.png
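A minimal sketch of this metric is shown below. It assumes that the rank of a correct connection is one plus the number of incorrect candidate items scored above it for the same user, and that items seen in training are not candidates; the function name and toy data are hypothetical.

```python
import numpy as np


def mean_reciprocal_rank(scores, train_mask, test_items):
    """MRR over all correct new connections.

    scores:     (num_users, num_items) array of f(user, item)
    train_mask: boolean array of the same shape, True where the edge was seen in training
    test_items: dict mapping a user index to the set of items of its new test edges
    """
    reciprocal_ranks = []
    for user, relevant in test_items.items():
        if not relevant:
            continue
        # Incorrect candidates: items neither seen in training nor correct.
        incorrect = ~train_mask[user]
        incorrect[list(relevant)] = False
        for item in relevant:
            higher = np.sum(scores[user, incorrect] > scores[user, item])
            reciprocal_ranks.append(1.0 / (1 + higher))
    return float(np.mean(reciprocal_ranks))


scores = np.array([[0.9, 0.1, 0.8, 0.3],
                   [0.2, 0.7, 0.4, 0.6]])
train_mask = np.array([[True, False, False, False],
                       [False, True, False, False]])
test_items = {0: {2, 3}, 1: {0}}
mrr = mean_reciprocal_rank(scores, train_mask, test_items)
print(mrr)
```

Here both correct connections of user 0 have rank 1, while the single correct connection of user 1 is outranked by two incorrect items and has rank 3, so the MRR is (1 + 1 + 1/3) / 3.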

Regarding MRR, we can again clearly see that the LightGCN model performs better than the matrix factorization model, as shown in the following table:

03cb9205c135891b5c985624c9f4595a.png

However, LightGCN takes much longer to train than the matrix factorization model used to initialize its embeddings. Still, as its name suggests, LightGCN is very lightweight compared with other graph convolutional networks: it has no learnable parameters other than the input embeddings, which makes it much faster to train than other GCN-based models used for recommender systems.

eea3b79d7ea2de775c1f20dd7772e25d.png

As for prediction time, both models generate predictions within milliseconds, so the gap is essentially negligible.

References

  1. Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. LightGCN: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 639–648, 2020. arXiv:2002.02126

  2. Visualizations taken from a lecture given by Jure Leskovec, available at http://web.stanford.edu/class/cs224w/slides/13-recsys.pdf

  3. Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. 2nd workshop on information heterogeneity and fusion in recommender systems (HetRec 2011). In Proceedings of the 5th ACM Conference on Recommender Systems, RecSys 2011, New York, NY, USA, 2011. ACM.

  4. Code: https://github.com/tm1897/mlg_cs224w_project/tree/main (Authors: Ermin Omeragic, Tomaz Marticic, Jurij Nastran)

Author: jn2279


Origin blog.csdn.net/qq_33431368/article/details/123515948