Recommender system using factorization machine with examples and code

1. Description

        In my previous articles, I discussed the basics of recommender systems, matrix factorization, and neural collaborative filtering (NCF); you can find them in the My Blog section below. This time, I'll explore the factorization machine through examples and code.

        Some advantages of using factorization machines for recommender systems are:

  • It handles sparse and high-dimensional data relatively well.
  • You can add meta information about users and items to get more context. Factorization machines are therefore not pure collaborative filtering methods like NCF and matrix factorization, which rely only on user-item interactions.

        A factorization machine is a supervised ML algorithm that can be used for classification and regression, although it is best known for its use in recommender systems.

It can be viewed as an extension of linear regression: in addition to capturing linear relationships, it can also capture higher-order relationships by introducing higher-order feature interactions through latent factorization.

2. What is a higher-order feature interaction?

        A higher-order interaction refers to the combined effect of two or more features on the target variable, where the impact is not linear and cannot be represented as the sum of the individual features' effects. For example:

Let's say we have labeled data on whether a user will click on an ad. The feature set contains:

User ID, user age, ad type, ad ID, clicked or not (label)

We may find that the impact of Ad Type on the likelihood of a user clicking on an ad depends on User Age. For example, younger users may be more likely to click on image ads, while older users may prefer video ads. This interaction means that the effect of “ad type” is not consistent across all age groups. To capture this higher-order interaction, the model needs to consider how “ad type” and “user age” interact to affect click-through rates.

Factorization machines can help us capture higher-order interactions that linear regression ignores.
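To make this concrete, here is a minimal numpy sketch (the feature layout and weights below are made up for illustration): a purely linear model scores age and ad type independently, while adding an explicit age × ad-type cross term lets the score depend on the combination. Factorization machines learn such interaction terms automatically instead of requiring us to hand-craft them.

import numpy as np

# Hypothetical rows: [user_age, ad_is_image, ad_is_video] (ad type one-hot encoded)
X = np.array([
    [22.0, 1.0, 0.0],   # young user, image ad
    [60.0, 1.0, 0.0],   # older user, image ad
    [22.0, 0.0, 1.0],   # young user, video ad
    [60.0, 0.0, 1.0],   # older user, video ad
])

w = np.array([0.01, 0.30, 0.10])        # made-up weights for a plain linear model
linear_score = X @ w                    # ad type contributes the same amount at every age

# Capturing the interaction needs a cross term such as age * ad_is_video
X_cross = np.column_stack([X, X[:, 0] * X[:, 2]])
w_cross = np.append(w, 0.02)            # extra weight for the age x video interaction
interaction_score = X_cross @ w_cross   # now video ads score higher for older users

print(linear_score)
print(interaction_score)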

        The FM model equation can contain n-way interactions between features, i.e., interactions of different orders. The most common configuration is the second-order model, which has a weight for each individual feature in the dataset plus an interaction term for every pair of features. I will explain the two-way interaction in this article.

3. How is it implemented? Let us understand with an example

        Suppose we have the following user-item interaction data

        As you can see, in addition to the user-item interactions, it also has some meta information about the users and items.

  • One-hot encode the categorical features present in the dataset. This includes one-hot encoding the user (User ID) and item (Movie ID) columns, but not the label (Rating) column, as sketched below.
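A minimal sketch of this step using pandas (the column names and values below are invented for illustration):

import pandas as pd

# Hypothetical user-item interaction data with meta information
df = pd.DataFrame({
    "UserID":  ["u1", "u2", "u1"],
    "MovieID": ["m10", "m10", "m20"],
    "Age":     [25, 41, 25],
    "Genre":   ["Action", "Comedy", "Action"],
    "Rating":  [4, 3, 5],                      # label, not encoded
})

# One-hot encode the categorical columns; the numeric Age and the Rating label stay as they are
X = pd.get_dummies(df.drop(columns=["Rating"]), columns=["UserID", "MovieID", "Genre"])
y = df["Rating"]
print(X.columns.tolist())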

4. Factorization machine equation

y = w₀ + ∑ᵢ wᵢ xᵢ + ∑ᵢ ∑ⱼ>ᵢ ⟨vᵢ, vⱼ⟩ xᵢ xⱼ

where

y = the label we are predicting

w₀ = bias (global intercept)

wᵢ = weight of feature xᵢ

xᵢ = feature value from the one-hot encoded feature set

⟨vᵢ, vⱼ⟩ = dot product between the latent vectors of features i and j

Note: if the third term is ignored, the remaining terms form a linear regression equation.

  • The first two terms are similar to linear regression: w₀ is the bias and wᵢ xᵢ captures the weight of each one-hot encoded feature.
  • The third term is what captures the higher-order interactions:

∑ᵢ ∑ⱼ>ᵢ ⟨vᵢ, vⱼ⟩ xᵢ xⱼ — for every pair of columns obtained after one-hot encoding, we multiply the two feature values together with the dot product of the latent vector representations of those columns.

Suppose we have 2 features, Age and Genre, where Genre has two possible values, Action and Comedy. After one-hot encoding Genre (into Genre_Action and Genre_Comedy), the third term of the FM looks like this:

⟨v_age, v_genre_action⟩ * age * genre_action +

⟨v_age, v_genre_comedy⟩ * age * genre_comedy +

⟨v_genre_action, v_genre_comedy⟩ * genre_action * genre_comedy

Here you can see that for each pair of OHE columns we get a term that captures their interaction through the dot product of their latent vector representations. Since Genre has two possible values, it is split into two columns, while Age is numeric, so there is no OHE for that column. So for these 3 columns we get 3 distinct pairs.

w₀, wᵢ, and vᵢ are the parameters that the model learns during training.
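To make the equation concrete, here is a small numpy sketch of the prediction step, assuming w₀, w, and V have already been learned (the values below are just random placeholders). It uses the standard O(n·k) reformulation of the pairwise term instead of an explicit double loop over feature pairs:

import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM score for a single feature vector x.
    w0: scalar bias, w: (n,) per-feature weights, V: (n, k) latent vectors."""
    linear = w0 + w @ x
    # Pairwise term: sum over i < j of <v_i, v_j> * x_i * x_j,
    # computed as 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    xv = x @ V                                        # shape (k,)
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

# Toy usage with random, untrained parameters
rng = np.random.default_rng(0)
n, k = 5, 3
x = rng.random(n)
print(fm_predict(x, w0=0.1, w=rng.normal(size=n), V=rng.normal(scale=0.1, size=(n, k))))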

One last question:

Why don't we use a full weight matrix instead of latent vectors in the third term? Or are the latent vectors just another form of weight matrix?

Latent vectors are not the same as a weight matrix; they are much more compact.

This is because the total number of latent parameters (one small vector per feature) is usually far smaller than the number of unique feature pairs formed in the third term, each of which would otherwise need its own weight. This is particularly advantageous when working with high-dimensional datasets, where a full pairwise weight matrix becomes computationally expensive and memory-intensive.
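A quick back-of-the-envelope comparison (the numbers below are illustrative, not taken from any particular dataset):

n = 100_000   # one-hot encoded columns, e.g. many user and item IDs
k = 50        # latent vector size

full_pairwise_weights = n * (n - 1) // 2   # one weight per feature pair: ~5 billion
fm_latent_params = n * k                   # one k-dimensional vector per feature: 5 million

print(f"{full_pairwise_weights:,} vs {fm_latent_params:,}")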

Enough math, time to take action

First, we need a dummy dataset. We will create one using sklearn.datasets with 10 features.

Note: this example works for any classification task. For recommendations, the entire process remains the same.

#pip install git+https://github.com/coreylynch/pyFM

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from pyfm import pylibfm

from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100000,n_features=10, n_clusters_per_class=1)

Let's look at our training dataset and labels.
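Since the data comes from make_classification, a quick print shows 100,000 rows of 10 continuous features and a 0/1 label:

print(X.shape)   # (100000, 10)
print(X[:3])     # first three rows of continuous features
print(y[:10])    # binary labels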

  • Next, we will use the generated dataset to build, for each row, a dictionary whose keys are the feature names (here, just the column indices) and whose values are the corresponding feature values. Each row of the training dataset will be represented by a dictionary like this:
# Represent each row as {column_index: feature_value} for DictVectorizer
data = [{str(j): value for j, value in enumerate(row)} for row in X]

Time for the train-test split:

X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.1, random_state=42)
  • Now, as discussed in the factorization machine equation, the categorical features are converted into one-hot vectors. This is done with DictVectorizer (numeric values, such as the ones in our synthetic dataset, are passed through as-is), and it produces a sparse matrix:
v = DictVectorizer()
X_train = v.fit_transform(X_train)
X_test = v.transform(X_test)
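In our synthetic dataset every feature is numeric, so DictVectorizer simply passes the values through; with a real recommendation dataset, the string-valued columns are where the one-hot encoding actually happens. A small standalone example with made-up dictionaries (get_feature_names_out requires a reasonably recent scikit-learn):

from sklearn.feature_extraction import DictVectorizer

toy = [
    {"UserID": "u1", "Genre": "Action", "Age": 25},
    {"UserID": "u2", "Genre": "Comedy", "Age": 41},
]
dv = DictVectorizer(sparse=False)
print(dv.fit_transform(toy))
print(dv.get_feature_names_out())   # ['Age' 'Genre=Action' 'Genre=Comedy' 'UserID=u1' 'UserID=u2']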

Next, let us train the factorization machine and analyze the results on the test data

fm = pylibfm.FM(num_factors=50, num_iter=10, verbose=True, task="classification", initial_learning_rate=0.0001, learning_rate_schedule="optimal")

fm.fit(X_train,y_train)

# Evaluate on the held-out test split
from sklearn.metrics import log_loss, accuracy_score

preds = fm.predict(X_test)  # predicted probabilities of class 1
print("Validation log loss: %.4f" % log_loss(y_test, preds))
print("Validation accuracy: %.4f" % accuracy_score(y_test, np.where(preds < 0.5, 0, 1)))

The implementation shown above is very simple. However, there are a few key hyperparameters you should consider tuning:

num_factors: size of the latent vectors (v). The larger the size, the more complex the interactions it can capture, but at a higher computational and memory cost.

task: classification (e.g., implicit feedback such as clicks) or regression (e.g., explicit feedback such as ratings).
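If you want to see how num_factors affects the results before settling on a value, a quick sweep over the same split might look like this (a rough sketch, not a proper cross-validation, and it retrains the model several times, so it is slow):

from sklearn.metrics import log_loss

for k in (10, 25, 50, 100):
    candidate = pylibfm.FM(num_factors=k, num_iter=10, task="classification",
                           initial_learning_rate=0.0001, learning_rate_schedule="optimal")
    candidate.fit(X_train, y_train)
    print(k, log_loss(y_test, candidate.predict(X_test)))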

With verbose=True, you can watch the training progress (the training log loss after each iteration) while the model fits.

As you can see, we achieve roughly 97% validation accuracy with a low validation log loss, which is great!

Many other libraries also provide implementations of factorization machines, such as xLearn and fastFM; you can try those out as well.

Mehul Gupta


Origin blog.csdn.net/gongdiwudu/article/details/132252266