1. Description
In my previous articles, I covered the basics of recommender systems, matrix factorization, and neural collaborative filtering (NCF); you can find them in the My Blog section below. This time, I'll explore factorization machines through examples and code.
Some advantages of using factorization machines for recommender systems are:
- They handle sparse, high-dimensional data relatively well.
- You can add meta information about users and items to give the model more context. Factorization machines are therefore not pure collaborative filtering methods like NCF and matrix factorization, which use only user-item interactions.
A factorization machine (FM) is a supervised ML algorithm that can be used for both classification and regression, although it is best known for its use in recommender systems. It can be viewed as an extension of linear regression: in addition to capturing linear relationships, it captures higher-order relationships by modeling feature interactions through latent factorization.
2. What is high-order feature interaction?
Higher-order interaction refers to the combined effect of two or more features on the target variable, where the effect is not additive: it cannot be represented as the sum of the effects of the individual features.
For example, let's say we have classification data on whether a user will click on an ad. The feature set is:
User ID, user age, ad type, ad ID, click or not (label)
We may find that the impact of Ad Type on the likelihood of a user clicking on an ad depends on User Age. For example, younger users may be more likely to click on image ads, while older users may prefer video ads. This interaction means that the effect of “ad type” is not consistent across all age groups. To capture this higher-order interaction, the model needs to consider how “ad type” and “user age” interact to affect click-through rates.
Factorization machines can help us capture higher-order interactions that linear regression ignores.
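To see why a purely additive model misses this kind of effect, here is a minimal sketch on made-up data (the four rows and the exact click pattern are assumptions for illustration, not real ad data). Young users click only image ads and older users click only video ads. A least-squares fit on the two features alone predicts 0.5 everywhere, while adding the product of the two features recovers the pattern exactly.

```python
import numpy as np

# Hypothetical data: one row per (is_young, is_image_ad) combination
is_young = np.array([0, 0, 1, 1], dtype=float)
is_image = np.array([0, 1, 0, 1], dtype=float)
# Clicks happen for old + video and for young + image
clicked = np.array([1, 0, 0, 1], dtype=float)

ones = np.ones_like(is_young)

# Additive (linear) model: bias + w1 * is_young + w2 * is_image
A_linear = np.column_stack([ones, is_young, is_image])
w_linear, *_ = np.linalg.lstsq(A_linear, clicked, rcond=None)
pred_linear = A_linear @ w_linear

# Same model plus an explicit interaction column is_young * is_image
A_inter = np.column_stack([ones, is_young, is_image, is_young * is_image])
w_inter, *_ = np.linalg.lstsq(A_inter, clicked, rcond=None)
pred_inter = A_inter @ w_inter

print("linear model predictions:    ", np.round(pred_linear, 3))
print("with interaction predictions:", np.round(pred_inter, 3))
```

Adding product columns by hand quickly becomes impractical when there are many one-hot columns, which is the gap a factorization machine is meant to fill.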
The general FM model equation can include feature interactions of any order. The most common configuration is the second-order model, which has a weight for every single feature in the dataset and an interaction term for each pair of features. I will explain only two-way (second-order) interactions in this article.
3. How is it implemented? Let us understand with an example
Suppose we have the following user-item interaction data
As you can see, in addition to user-item interactions, it also has some meta-information about users and items.
- The first step is to one-hot encode the categorical features in the dataset. This includes the User (User ID) and Item (Movie ID) columns, but not the label (Rating) column.
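As a rough sketch of this step, the snippet below one-hot encodes a tiny made-up ratings table by hand. The user names, movie titles, and the `one_hot_rows` helper are hypothetical and used only for illustration; the rating column is kept aside as the label.

```python
# Hypothetical user-item interactions: (user_id, movie_id, rating)
interactions = [
    ("alice", "Inception", 5),
    ("bob", "Inception", 3),
    ("alice", "Titanic", 1),
]

def one_hot_rows(rows):
    """One-hot encode the user and movie columns; return column names, features, labels."""
    users = sorted({u for u, _, _ in rows})
    movies = sorted({m for _, m, _ in rows})
    columns = [f"user={u}" for u in users] + [f"movie={m}" for m in movies]
    index = {name: i for i, name in enumerate(columns)}

    features, labels = [], []
    for user, movie, rating in rows:
        x = [0] * len(columns)
        x[index[f"user={user}"]] = 1    # exactly one user column is 1
        x[index[f"movie={movie}"]] = 1  # exactly one movie column is 1
        features.append(x)
        labels.append(rating)           # the label is not encoded
    return columns, features, labels

columns, features, labels = one_hot_rows(interactions)
print(columns)
for x, y in zip(features, labels):
    print(x, "->", y)
```

Each row ends up with exactly two non-zero entries, which is why the resulting matrix is sparse once there are many users and movies.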
4. Factorization machine equation
y = w₀ + ∑_{i=1..n} (w_i * x_i) + ∑_{i=1..n} ∑_{j=i+1..n} (<v_i . v_j> * x_i * x_j)
where
y = predicted label
w₀ = bias
w_i = weight of feature i
x_i = value of feature i in the one-hot encoded feature set
n = number of columns after one-hot encoding
<v_i . v_j> = dot product between the latent vectors of features i and j
Note: If term 3 is ignored, the first two terms form a linear regression equation.
- The first two terms are the same as in linear regression, where w₀ is the bias and
- w_i * x_i captures the weight of each one-hot encoded feature.
- Term 3 is the important one: it captures the higher-order (pairwise) interactions.
In ∑_{i} ∑_{j>i} (<v_i . v_j> * x_i * x_j), for each pair of distinct columns obtained after one-hot encoding, we multiply the two column values together with the dot product between the latent vector representations of those columns.
Suppose we have 2 features, Age and Genre, where Genre has two possible values: Action and Comedy. After one-hot encoding Genre (into Genre_Action and Genre_Comedy), the 3rd term of the FM will look like this:
<v_age . v_genre_action> * age * genre_action +
<v_age . v_genre_comedy> * age * genre_comedy +
<v_genre_action . v_genre_comedy> * genre_action * genre_comedy
Here you can see that for each pair of distinct columns, we have a term that captures their interaction through their latent vector representations. Since "genre" has two possible values, it is split into two columns. "Age" is numeric, so it is not one-hot encoded. So for 3 columns we get 3 pairwise combinations. A feature is not paired with itself in the standard second-order FM, which is why there are no terms such as <v_age . v_age> * age * age. Note also that genre_action and genre_comedy are never both 1 in the same row, so the last term is always zero in practice.
w₀, w_i and v_i are the parameters that the model learns during training.
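The second-order equation can be sketched directly in NumPy. The snippet below is an illustration under stated assumptions, not the code of any particular library: the parameter values are random rather than learned, and the function names `fm_predict_naive` and `fm_predict_fast` are made up. The second function uses a well-known algebraic rewrite of the pairwise term from the FM literature, which gives the same value without looping over every pair.

```python
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """Second-order FM score for one sample: loops over every pair i < j."""
    n = len(x)
    score = w0 + np.dot(w, x)
    for i in range(n):
        for j in range(i + 1, n):
            score += np.dot(V[i], V[j]) * x[i] * x[j]
    return score

def fm_predict_fast(x, w0, w, V):
    """Same score via 0.5 * sum_f [(sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2]."""
    xv = V.T @ x                      # shape (k,)
    x2v2 = (V ** 2).T @ (x ** 2)      # shape (k,)
    return w0 + np.dot(w, x) + 0.5 * np.sum(xv ** 2 - x2v2)

rng = np.random.default_rng(0)
n_features, k = 3, 4                  # columns: age, genre_action, genre_comedy
w0 = 0.1                              # bias (random stand-in, not learned)
w = rng.normal(size=n_features)       # one weight per column
V = rng.normal(size=(n_features, k))  # one latent vector of size k per column

x = np.array([0.3, 1.0, 0.0])         # scaled age, genre_action=1, genre_comedy=0

naive = fm_predict_naive(x, w0, w, V)
fast = fm_predict_fast(x, w0, w, V)
print(naive, fast)
```

Both functions return a raw score. For a classification task, a library would typically pass this score through a sigmoid to get a probability.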
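One last question:
Why don't we use a weight matrix instead of latent vectors in the third term? Or is a latent vector similar to a weight matrix?
Latent vectors are not the same as a weight matrix, and together they are usually much smaller.
A full weight matrix would need a separate weight for every pair of features, that is n(n-1)/2 weights for n columns, whereas latent vectors of size k need only n * k values, with k usually much smaller than n. This is particularly advantageous with high-dimensional data sets, where a complete weight matrix becomes computationally expensive and memory-intensive. With sparse data there is a second benefit: most feature pairs rarely occur together, so a separate weight per pair could not be estimated reliably, while a latent vector is shared across all the pairs its feature takes part in.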
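A quick back-of-the-envelope calculation makes the difference concrete. The sizes below (10,000 one-hot columns, latent size 50) are arbitrary examples chosen for illustration.

```python
n_columns = 10_000   # hypothetical number of columns after one-hot encoding
k = 50               # hypothetical latent vector size

# Full weight matrix: one separate weight per pair of distinct columns
pairwise_weights = n_columns * (n_columns - 1) // 2

# Factorized version: one latent vector of size k per column
latent_values = n_columns * k

print("full pairwise weights:", pairwise_weights)
print("latent vector values: ", latent_values)
print("ratio:", pairwise_weights / latent_values)
```

The gap grows with the number of columns, since the pairwise count grows quadratically while the latent vector count grows linearly.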
Enough math, time to take action
First, we need a dummy dataset. We will create one using sklearn.datasets with 10 features.
Note: This example works for any type of classification. For recommendations, the entire process remains the same
# pip install git+https://github.com/coreylynch/pyFM
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from pyfm import pylibfm

# Dummy binary classification data: 100,000 rows, 10 numeric features
X, y = make_classification(n_samples=100000, n_features=10, n_clusters_per_class=1)
Let's look at our training data set and labels
- Next, we convert the dataset into a list of dictionaries, one per row, with the feature names as keys and the feature values as values. Each row of the data set is represented by one such dictionary:
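# One dictionary per row: {"f0": value, "f1": value, ...}
# enumerate keeps every column, even if two values in a row are equal
data = [{f"f{j}": value for j, value in enumerate(row)} for row in X]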
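To make the row-to-dictionary step concrete, here is a small sketch on a hand-made array (the values and the `f0`, `f1`, ... key names are made up for illustration). It pairs each column index with its value using `enumerate`, which keeps every column even when two values in a row happen to be equal.

```python
import numpy as np

# Tiny stand-in for X: 2 rows, 3 features (values are made up)
X_small = np.array([[0.5, -1.2, 3.0],
                    [1.0, 1.0, 2.0]])  # second row has a repeated value

# One dictionary per row, keyed by a hypothetical feature name
rows = [{f"f{j}": float(value) for j, value in enumerate(row)} for row in X_small]

for r in rows:
    print(r)
```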
Train test split time
X_train, X_test, y_train, y_test = train_test_split(data, y, test_size=0.1, random_state=42)
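- Now we convert the list of dictionaries into a feature matrix using DictVectorizer. It one-hot encodes string-valued (categorical) features, as discussed for the factorization machine, and passes numeric values through unchanged. Since our dummy features are all numeric, here it simply produces a sparse matrix with one column per feature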
v = DictVectorizer()
X_train = v.fit_transform(X_train)
X_test = v.transform(X_test)
Next, let us train the factorization machine and analyze the results on the test data
fm = pylibfm.FM(num_factors=50, num_iter=10, verbose=True, task="classification",
                initial_learning_rate=0.0001, learning_rate_schedule="optimal")
fm.fit(X_train, y_train)

# Evaluate on the held-out test set
from sklearn.metrics import log_loss, accuracy_score
y_prob = fm.predict(X_test)  # predictions, thresholded at 0.5 below
print("Validation log loss: %.4f" % log_loss(y_test, y_prob))
print("Validation accuracy: %.4f" % accuracy_score(y_test, np.where(y_prob < 0.5, 0, 1)))
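For readers who want to see what these two metrics compute, here is a minimal NumPy sketch on four made-up predictions (the labels and probabilities are invented for illustration). It applies the same 0.5 threshold as above for accuracy and the standard binary log loss formula.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.2, 0.6, 0.4])   # hypothetical predicted P(class = 1)

# Accuracy: threshold probabilities at 0.5, then compare with the labels
y_hat = np.where(y_prob < 0.5, 0, 1)
accuracy = np.mean(y_hat == y_true)

# Binary log loss: -mean(y * log(p) + (1 - y) * log(1 - p))
logloss = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print("accuracy: %.4f" % accuracy)
print("log loss: %.4f" % logloss)
```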
The implementation shown above is very simple. However, there are a few key hyperparameters you should consider tuning:
num_factors: Size of each latent vector (v). The larger the size, the more complex the interactions it can capture, but the higher the computational and memory costs.
task: "classification" (for example, implicit feedback such as click or no click) or "regression" (for example, explicit feedback such as ratings).
The screenshots attached below depict the training in progress
As you can see, we achieved about 97% accuracy together with a low log loss, which is great!
Many other libraries also provide implementations of factorization machines, such as xlearn and fastFM, which you can try out as well.
Mehul Gupta