Introduction to Factorization Machine and PyTorch Code Implementation

Factorization Machines (FM) are a model for machine learning tasks such as recommender systems, regression, and classification. Proposed by Steffen Rendle in 2010, FM extends linear models so that they can deal effectively with high-dimensional sparse data and model feature combinations well. It is one of the classic recommendation-system models, and because it is simple and interpretable it is still widely used in search advertising and recommendation. In this article we introduce it in detail and build a simple implementation in PyTorch.

We use the MovieLens 1M dataset of users, movies, and ratings, and we want to recommend movies with a factorization machine. The rating/movie features are: movieId, rating, timestamp, title, and genres. The user features are: age, gender, occupation, and zip code. Movies without any rating are removed from the dataset.

 import numpy as np
 import pandas as pd
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
 import torch.optim as optim
 import matplotlib.pyplot as plt
 import seaborn as sns
 from collections import defaultdict
 from sklearn.preprocessing import LabelEncoder
 from sklearn.manifold import TSNE
 from torch.utils.data import Dataset, DataLoader
 
 device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
 
 DATA_DIR = './data/ml-1m/'
 df_movies = pd.read_csv(DATA_DIR+'movies.dat', sep='::',
                         names=['movieId', 'title','genres'],
                         encoding='latin-1',
                         engine='python')
 user_cols = ['userId', 'gender' ,'age', 'occupation', 'zipcode']
 df_users = pd.read_csv(DATA_DIR+'users.dat', sep='::',
                        header=None,
                        names=user_cols,
                        engine='python')
 df = pd.read_csv(DATA_DIR+'ratings.dat', sep='::',
                  names=['userId','movieId','rating','time'],
                  engine='python')
 # Left merge removes movies with no rating. # of unique movies: 3883 -> 3706
 df = df.merge(df_movies, on='movieId', how='left')
 df = df.merge(df_users, on='userId', how='left')
 df = df.sort_values(['userId', 'time'], ascending=[True, True]).reset_index(drop=True)

After merging, each row of the dataframe contains a rating together with the corresponding movie and user metadata.

Data preprocessing

The largest movieId in our dataset is 3952, but there are only 3706 unique movieIds, so we need to remap the ids to a contiguous range (3952 -> 3706):

 d = defaultdict(LabelEncoder)
 cols_cat = ['userId', 'movieId', 'gender', 'age', 'occupation']
 for c in cols_cat:
     d[c].fit(df[c].unique())
     df[c+'_index'] = d[c].transform(df[c])
     print(f'# unique {c}: {len(d[c].classes_)}')
 
 min_num_ratings = df.groupby(['userId'])['userId'].transform(len).min()
 print(f'Min # of ratings per user: {min_num_ratings}')
 print(f'Min/Max rating: {df.rating.min()} / {df.rating.max()}')
 print(f'df.shape: {df.shape}')

The printed output shows the number of unique values for each feature, the minimum number of ratings per user, and the min/max rating.

For factorization machines, an additional step is required after label encoding: adding feature offsets. The offsets shift each feature's indices so that all features share one global index space; for example, if userId occupies indices 0-6039, then the movieId indices start at 6040, and so on. This lets us use a single embedding matrix instead of one embedding matrix per feature plus a for loop, which noticeably improves training efficiency.

 feature_cols = ['userId_index', 'movieId_index', 'gender_index', 'age_index',
                 'occupation_index']
 # Get offsets
 feature_sizes = {}
 for feat in feature_cols:
     feature_sizes[feat] = len(df[feat].unique())
 feature_offsets = {}
 NEXT_OFFSET = 0
 for k,v in feature_sizes.items():
     feature_offsets[k] = NEXT_OFFSET
     NEXT_OFFSET += v
 
 # Add offsets to each feature column
 for col in feature_cols:
     df[col] = df[col].apply(lambda x: x + feature_offsets[col])
 print('Offset - feature')
 for k, os in feature_offsets.items():
     print(f'{os:<6} - {k}')

Split the data and create the Dataset and DataLoader

 THRES = 5
 cols = ['rating', *feature_cols]
 # The last THRES ratings of each user are held out for validation
 df_train = df[cols].groupby('userId_index').head(-THRES).reset_index(drop=True)
 df_val = df[cols].groupby('userId_index').tail(THRES).reset_index(drop=True)
 print(f'df_train shape: {df_train.shape}')
 print(f'df_val shape: {df_val.shape}')
 df_train.head(3)

The Dataset and DataLoader are as follows:

 class MovieDataset(Dataset):
     """ Movie DS uses x_feats and y_feat """
     def __init__(self, df, x_feats, y_feat):
         super().__init__()
         self.df = df
         self.x_feats = df[x_feats].values
         self.y_rating = df[y_feat].values
     def __len__(self):
         return len(self.df)
     def __getitem__(self, idx):
         return self.x_feats[idx], self.y_rating[idx]
 
 BS = 1024
 ds_train = MovieDataset(df_train, feature_cols, 'rating')
 ds_val = MovieDataset(df_val, feature_cols, 'rating')
 dl_train = DataLoader(ds_train, BS, shuffle=True, num_workers=2)
 dl_val = DataLoader(ds_val, BS, shuffle=True, num_workers=2)
 
 xb, yb = next(iter(dl_train))
 print(xb.shape, yb.shape)
 print(xb)
 print(yb)

FM model

The main goal of FM is to model interactions between features. In problems with many discrete features, a traditional linear model with explicit cross terms quickly runs into the curse of dimensionality. FM instead uses factorization to capture the implicit relationships between features, so it can learn feature interactions in high-dimensional sparse data without explicitly parameterizing every possible feature combination.

The core idea of FM is to represent each feature by a latent vector and to express the interaction between two features as the inner product of their vectors. Concretely, FM learns a scalar weight for each feature (its individual importance) and a latent vector for each feature (used for the pairwise interactions).

Put simply, a factorization machine can be trained on any number of features. It models pairwise (feature-to-feature) interactions by taking the dot product of each feature's latent vector with every other feature's latent vector and summing the results.

In addition to the pairwise interactions, the paper adds a global offset and a per-feature bias; both are included in our PyTorch implementation. Below is the model equation from the paper, where n is the number of features and k is the dimension of the latent vectors.
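In the paper's notation, with global bias w0, linear weights wi, and a k-dimensional latent vector vi for each feature i:

 \hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j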

In the equation above, every feature interacts with every other feature. With n features and k-dimensional latent vectors, evaluating the double sum directly costs O(k * n²). The paper derives an equivalent form, shown below, that can be computed in O(k * n).
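The pairwise-interaction term is rewritten as a "square of sums minus sum of squares" over each latent dimension f:

 \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle \, x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{n} v_{i,f}\, x_i \right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2} \right]

This is exactly what the pow_of_sum and sum_of_pow terms compute in the forward pass below.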

In our implementation, the inputs are the (label-encoded and offset) feature indices, so an nn.Embedding layer is used to look up each feature's latent vector; since each categorical feature is one-hot, the lookup is equivalent to the multiplication x_i * v_i in the formula.

 class FM(nn.Module):
     """ Factorization Machine + user/item bias, weight init., sigmoid_range 
         Paper - https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf
     """
     def __init__(self, num_feats, emb_dim, init, bias, sigmoid):
         super().__init__()
         self.x_emb = nn.Embedding(num_feats, emb_dim)
         self.bias = bias
         self.sigmoid = sigmoid
         if bias:
             self.x_bias = nn.Parameter(torch.zeros(num_feats))
             self.offset = nn.Parameter(torch.zeros(1))
         if init:
             self.x_emb.weight.data.uniform_(0., 0.05)
 
     def forward(self, X):
         # Derived time complexity - O(nk)
         x_emb = self.x_emb(X) # [bs, num_feats] -> [bs, num_feats, emb_dim]
         pow_of_sum = x_emb.sum(dim=1).pow(2) # -> [bs, emb_dim]
         sum_of_pow = x_emb.pow(2).sum(dim=1) # -> [bs, emb_dim]
         fm_out = (pow_of_sum - sum_of_pow).sum(1)*0.5  # -> [bs]
         if self.bias:
             x_biases = self.x_bias[X].sum(1) # -> [bs]
             fm_out +=  x_biases + self.offset # -> [bs]
         if self.sigmoid:
             return self.sigmoid_range(fm_out, low=0.5) # -> [bs]
         return fm_out
 
     def sigmoid_range(self, x, low=0, high=5.5):
         """ Sigmoid function with range (low, high) """
         return torch.sigmoid(x) * (high-low) + low
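
As a quick sanity check of this equivalence (a small illustrative snippet, not part of the original article's code), we can compare the vectorized "square of sums minus sum of squares" computation used in forward() with an explicit double loop over feature pairs on random embeddings:

 import torch

 torch.manual_seed(0)
 bs, num_feats, emb_dim = 2, 5, 3
 x_emb = torch.randn(bs, num_feats, emb_dim)

 # Fast O(k*n) form, as used in FM.forward
 fast = 0.5 * (x_emb.sum(dim=1).pow(2) - x_emb.pow(2).sum(dim=1)).sum(1)

 # Naive O(k*n^2) form: explicit sum over all feature pairs
 naive = torch.zeros(bs)
 for i in range(num_feats):
     for j in range(i + 1, num_feats):
         naive += (x_emb[:, i] * x_emb[:, j]).sum(1)

 print(torch.allclose(fast, naive, atol=1e-5))  # expected: True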

Training

The model is trained with the AdamW optimizer and the mean squared error (MSE) loss. For convenience, the hyperparameters are collected in a configuration dictionary (CFG).

 CFG = {
     'lr': 0.001,
     'num_epochs': 8,
     'weight_decay': 0.01,
     'sigmoid': True,
     'bias': True,
     'init': True,
 }
 n_feats = int(pd.concat([df_train, df_val]).max().max())
 n_feats = n_feats + 1 # "+ 1" to account for 0 - indexing
 mdl = FM(n_feats, emb_dim=100,
          init=CFG['init'], bias=CFG['bias'], sigmoid=CFG['sigmoid'])
 mdl.to(device)
 opt = optim.AdamW(mdl.parameters(), lr=CFG['lr'], weight_decay=CFG['weight_decay'])
 loss_fn = nn.MSELoss()
 print(f'Model weights: {list(dict(mdl.named_parameters()).keys())}')

The training script is a standard PyTorch training loop:

 epoch_train_losses, epoch_val_losses = [], []
 
 for i in range(CFG['num_epochs']):
     train_losses, val_losses = [], []
     mdl.train()
     for xb,yb in dl_train:
         xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
         preds = mdl(xb)
         loss = loss_fn(preds, yb)
         train_losses.append(loss.item())
         opt.zero_grad()
         loss.backward()
         opt.step()
     mdl.eval()
     with torch.no_grad():  # no gradient tracking needed for validation
         for xb, yb in dl_val:
             xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
             preds = mdl(xb)
             loss = loss_fn(preds, yb)
             val_losses.append(loss.item())
     # Start logging
     epoch_train_loss = np.mean(train_losses)
     epoch_val_loss = np.mean(val_losses)
     epoch_train_losses.append(epoch_train_loss)
     epoch_val_losses.append(epoch_val_loss)
     s = (f'Epoch: {i}, Train Loss: {epoch_train_loss:0.2f}, '
          f'Val Loss: {epoch_val_loss:0.2f}'
         )
     print(s)

Results

Let's run some sanity checks. The model's predictions fall roughly in the range [0.65, 5.45], which deviates slightly from the actual rating range [1, 5], but the prediction distribution looks reasonable:

lpreds, lratings = [], []
mdl.eval()
for xb,yb in dl_val:
    xb, yb = xb.to(device), yb.to(device, dtype=torch.float)
    preds = mdl(xb)
    lpreds.extend(preds.detach().cpu().numpy().tolist())
    lratings.extend(yb.detach().cpu().numpy().tolist())

print(f'Preds min/max: {min(lpreds):0.2f} / {max(lpreds):0.2f}')
print(f'Rating min/max: {min(lratings):0.2f} / {max(lratings):0.2f}')
plt.figure(figsize=(4,2))
plt.hist(lratings, label='ratings', bins=(np.arange(1,7)-0.5),
         rwidth=0.25, color='blue')
plt.hist(lpreds, label='preds', bins=20, rwidth=0.5, color='red')
plt.title('Ratings & Predictions Distribution')
plt.grid()
plt.legend();

We can use t-SNE to inspect the trained nn.Embedding weights; the Children's, Horror, and Documentary movies form visibly separate groups:

# Check TSNE for genres - Make dataframe of movie + embeddings + biases
movies = df.drop_duplicates('movieId_index').reset_index(drop=True)
movies['movieId'] = d['movieId'].transform(movies.movieId)
# Get movie embeddings and biases
idxs_movies = torch.tensor(movies['movieId_index'].values, device=device)
movie_embs = mdl.x_emb.weight[idxs_movies].detach()   # detach so they can be reused for inference/pickling
movie_biases = mdl.x_bias[idxs_movies].detach()
movies['emb'] = movie_embs.tolist()
movies['bias'] = movie_biases.tolist()

# Check TSNE, and scatter plot movie embeddings
# Movie embeddings do get separated after training
genre_cols = ['Children\'s', 'Horror', 'Documentary']
GENRES = '|'.join(genre_cols)
print(f'Genres: {GENRES}')

movies_subset = movies[movies['genres'].str.contains(GENRES)].copy()
X = np.stack(movies_subset['emb'].values)
ldr = TSNE(n_components=2, init='pca', learning_rate='auto', random_state=42)
Y = ldr.fit_transform(X)
movies_subset['x'] = Y[:, 0]
movies_subset['y'] = Y[:, 1]

def single_genre(genres):
    """ Filter movies for genre in genre_cols"""
    for genre in genre_cols:
        if genre in genres: return genre

movies_subset['genres'] = movies_subset['genres'].apply(single_genre)
plt.figure(figsize=(5, 5))
ax = sns.scatterplot(x='x', y='y', hue='genres', data=movies_subset)

We can also get movie recommendations for "Toy Story 2 (1999)": inference here is simply ranking movies by the cosine similarity between their embeddings and the query movie's embedding.

# Helper function/dictionaries to convert from movie name to LabelEncoder index/label
d_name2le = dict(zip(df.title, df.movieId))
d_le2name = {v:k for k,v in d_name2le.items()}

def name2itemId(names):
    """Give movie name, returns labelEncoder label. This is before adding any offset"""
    if not isinstance(names, list):
        names = [names]
    return d['movieId'].transform([d_name2le[name] for name in names])

# Input: movie name. Output: movie recommendations using cosine similarity
IDX = name2itemId('Toy Story 2 (1999)')[0] # IDX = 2898, before offset
IDX = IDX + feature_offsets['movieId_index'] # IDX = 8938, after offset to get input movie emb
emb_toy2 = mdl.x_emb(torch.tensor(IDX, device=device))
cosine_sim = torch.tensor(
    [F.cosine_similarity(emb_toy2, emb, dim=0) for emb in movie_embs]
)
top8 = cosine_sim.argsort(descending=True)[:8]
movie_sims = cosine_sim[top8]
movie_recs = movies.iloc[top8.detach().numpy()]['title'].values
for rec, sim in zip(movie_recs, movie_sims):
    print(f'{sim.tolist():0.3f} - {rec}')

Build helper dictionaries that show how the LabelEncoder encodes the user metadata features (gender and age):

d_age_meta = {'Under 18': 1, '18-24': 18, '25-34': 25, '35-44': 35,
              '45-49': 45, '50-55': 50, '56+': 56
             }
d_gender = dict(zip(d['gender'].classes_, range(len(d['gender'].classes_))))
d_age = dict(zip(d['age'].classes_, range(len(d['age'].classes_))))
print(f'Gender mapping: {d_gender}')
print(f'Age mapping: {d_age}')

With these mappings we can make recommendations for specific user segments, for example cold-start movie recommendations for males aged 18-24.

# Get cold start movie recs for a male (GENDER=1), ages 18-24 (AGE=1)
GENDER = 1
AGE = 1
gender_emb = mdl.x_emb(
    torch.tensor(GENDER+feature_offsets['gender_index'], device=device)
)
age_emb = mdl.x_emb(
    torch.tensor(AGE+feature_offsets['age_index'], device=device)
)
metadata_emb = gender_emb + age_emb
rankings = movie_biases + (metadata_emb*movie_embs).sum(1) # dot product
rankings = rankings.detach().cpu()
for i, movie in enumerate(movies.iloc[rankings.argsort(descending=True)]['title'].values[:10]):
    print(i, movie)

Deployment

A simple deployment can be done with Streamlit.

First, save the model and the helper objects to files:

SAVE = False
if SAVE:
    movie_embs_cpu = movie_embs.cpu()
    d_utils = {'label_encoder': d,
               'feature_offsets': feature_offsets,
               'movie_embs': movie_embs_cpu,
               'movies': movies,
               'd_name2le': d_name2le,
              }
    pd.to_pickle(d_utils, 'data/d_utils.pkl', protocol=4)
    mdl_scripted = torch.jit.script(mdl)
    mdl_scripted.save('mdls/fm_pt.pkl')

The saved artifacts can then be loaded in a Streamlit app to serve recommendations.
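
The article does not include the app code itself; a minimal sketch along the following lines could load the saved artifacts and reuse the cosine-similarity recommendation logic from above. The file paths and dictionary keys follow the save block, while the app.py layout and the use of st.selectbox/st.write are only illustrative assumptions (the scripted model could likewise be loaded with torch.jit.load for rating prediction):

# app.py - illustrative sketch only; paths and keys follow the save block above
import pandas as pd
import torch.nn.functional as F
import streamlit as st

utils = pd.read_pickle('data/d_utils.pkl')
movies = utils['movies']            # movie metadata, rows aligned with movie_embs
movie_embs = utils['movie_embs']    # tensor of shape [num_movies, emb_dim]

st.title('FM movie recommender')
name = st.selectbox('Pick a movie', sorted(movies['title'].unique()))

# Rank all movies by cosine similarity to the selected movie's embedding
row = movies[movies['title'] == name].index[0]
sims = F.cosine_similarity(movie_embs[row].unsqueeze(0), movie_embs, dim=1)
top = sims.argsort(descending=True)[:8]
st.write(movies.iloc[top.numpy()][['title', 'genres']])

The app can then be started with streamlit run app.py.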

Summary

The FM model can be seen as a combination of a linear model and low-rank matrix factorization: it copes well with high-dimensional data, keeps the number of model parameters small, and captures the interaction information between features. In addition, training FM is relatively simple and efficient. The factorization machine is a powerful machine learning model that is especially suited to high-dimensional sparse data and has been widely applied in recommender systems, advertising, personalized recommendation, and other fields.

The full code for this article is here:

https://avoid.overfit.cn/post/57c0d06f61ed4b67b9487750e8d2d211

By Daniel Lam


Origin: blog.csdn.net/m0_46510245/article/details/132004778