Recommender Systems | Study Notes: Attentional Factorization Machines

Paper

Attentional Factorization Machines: Learning the Weight of Feature Interactions via Attention Networks


Abstract

  • Despite its effectiveness, FM can be hindered by its modelling of all feature interactions with the same weight, as not all feature interactions are equally useful and predictive.
  • In this work, we improve FM by discriminating the importance of different
    feature interactions.
  • Empirically, it is shown that on the regression task AFM betters FM with an 8.6% relative improvement, and consistently outperforms the state-of-the-art deep learning methods Wide&Deep [Cheng et al., 2016] and DeepCross [Shan et al., 2016] with a much simpler structure and fewer model parameters.

1 Introduction

  • The key problem with polynomial regression (PR) (and other similar cross feature-based solutions, such as the wide component of Wide&Deep [Cheng et al., 2016]) is that for sparse datasets where only a few cross features are observed, the parameters for unobserved cross features cannot be estimated.
  • To address the generalization issue of PR, factorization machines (FMs) were proposed [Rendle, 2010], which parameterize the weight of a cross feature as the inner product of the embedding vectors of its constituent features.
  • Despite great promise, we argue that FM can be hindered by its modelling of all factorized interactions with the same weight.
  • The interactions with less useful features should be assigned a lower weight, as they contribute less to the prediction.
  • Nevertheless, FM lacks such capability of differentiating the importance of feature interactions, which may result in suboptimal prediction.
  • In this work, we improve FM by discriminating the importance of feature interactions.
  • We devise a novel model named AFM, which leverages a recent advance in neural network modelling, the attention mechanism [Chen et al., 2017a; 2017b], to enable feature interactions to contribute differently to the prediction.
  • More importantly, the importance of a feature interaction is automatically learned from data without any human domain knowledge.
  • Extensive experiments show that our use of attention on FM serves two benefits: it not only leads to better performance, but also provides insight into which feature interactions contribute more to the prediction.

2 Factorization Machines

FM estimates the target as a global bias plus linear terms plus factorized pairwise interactions:

$$\hat{y}_{FM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \hat{w}_{ij}\, x_i x_j, \qquad \hat{w}_{ij} = \mathbf{v}_i^\top \mathbf{v}_j$$

Each feature $i$ is associated with an embedding vector $\mathbf{v}_i \in \mathbb{R}^k$, so the weight of a cross feature is the inner product of the embeddings of its constituent features.
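As a concrete illustration (not the paper's code), here is a minimal NumPy sketch of the FM prediction rule above; it uses Rendle's $O(kn)$ reformulation of the pairwise sum, and the function name is my own:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM prediction: global bias + linear terms + factorized pairwise terms.

    x:  (n,) feature vector (typically a sparse one-hot encoding)
    w0: scalar global bias
    w:  (n,) linear weights
    V:  (n, k) embedding matrix, row i = v_i
    """
    linear = w0 + w @ x
    # Rendle (2010): sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i V[i,f] x_i)^2 - sum_i V[i,f]^2 x_i^2 ]
    s = V.T @ x                   # (k,)
    s_sq = (V ** 2).T @ (x ** 2)  # (k,)
    pairwise = 0.5 * np.sum(s ** 2 - s_sq)
    return linear + pairwise
```

For a one-hot input this evaluates exactly the double sum in the equation above, but in $O(kn)$ rather than $O(kn^2)$ time.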


3 Attentional Factorization Machines

3.1 Model

[Figure 1 of the paper: the neural network architecture of AFM, with a sparse input layer, an embedding layer, the pair-wise interaction layer, the attention-based pooling layer, and the final prediction score.]

  • The pair-wise interaction layer and the attention-based pooling layer are the main contributions of this paper.

Pair-wise Interaction Layer


  • It expands $m$ vectors to $m(m-1)/2$ interacted vectors, where each interacted vector is the element-wise product of two distinct vectors, encoding their interaction (a code sketch follows):

    $$f_{PI}(\mathcal{E}) = \{ (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j \}_{(i,j) \in \mathcal{R}_x}$$

    Here $\mathcal{E} = \{ \mathbf{v}_i x_i \}_{i \in \mathcal{X}}$ is the set of embeddings of the non-zero features in $\mathbf{x}$, $\mathcal{R}_x = \{ (i,j) \}_{i \in \mathcal{X},\, j \in \mathcal{X},\, j > i}$, and $\odot$ denotes the element-wise product.
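A minimal sketch of this layer, assuming the $m$ non-zero features' embeddings (already scaled by their feature values $x_i$) are stacked row-wise in a matrix; the function name is hypothetical:

```python
import numpy as np
from itertools import combinations

def pairwise_interaction_layer(E):
    """Expand m embedding vectors into m(m-1)/2 interacted vectors.

    E: (m, k) matrix whose i-th row is v_i * x_i for non-zero feature i.
    Returns an (m*(m-1)//2, k) matrix of element-wise products
    (v_i ⊙ v_j) x_i x_j, one row per distinct feature pair.
    """
    m = E.shape[0]
    return np.stack([E[i] * E[j] for i, j in combinations(range(m), 2)])
```

Note that sum-pooling these rows and projecting the result to a scalar recovers FM's pairwise term, which is why AFM strictly generalizes FM.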

Attention-based Pooling Layer

  • The idea is to allow different parts to contribute differently when compressing them into a single representation:

    $$f_{Att}(f_{PI}(\mathcal{E})) = \sum_{(i,j) \in \mathcal{R}_x} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j$$

    where $a_{ij}$ is the attention score of feature interaction $(i, j)$.
  • For features that have never co-occurred in the training data, the attention scores of their interactions cannot be estimated. To address this generalization problem, we further parameterize the attention score with a multi-layer perceptron (MLP), which we call the attention network:

    $$a'_{ij} = \mathbf{h}^\top \mathrm{ReLU}\big(\mathbf{W}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j + \mathbf{b}\big), \qquad a_{ij} = \frac{\exp(a'_{ij})}{\sum_{(i,j) \in \mathcal{R}_x} \exp(a'_{ij})}$$

    where $\mathbf{W} \in \mathbb{R}^{t \times k}$, $\mathbf{b} \in \mathbb{R}^t$, and $\mathbf{h} \in \mathbb{R}^t$ are the attention network's parameters, and $t$ is the attention factor (hidden layer size).

The attention network is what computes $a_{ij}$. With it, the overall AFM model is

$$\hat{y}_{AFM}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \mathbf{p}^\top \sum_{i=1}^{n} \sum_{j=i+1}^{n} a_{ij}\, (\mathbf{v}_i \odot \mathbf{v}_j)\, x_i x_j$$
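To make the two equations above concrete, here is a hedged NumPy sketch of the attention network and pooling step (shapes follow the notation above; the function names are my own):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    return np.exp(z) / np.exp(z).sum()

def attention_pooling(interactions, W, b, h, p):
    """Attention-based pooling over the pair-wise interacted vectors.

    interactions: (P, k) rows of (v_i ⊙ v_j) x_i x_j, with P = m(m-1)/2
    W: (t, k), b: (t,), h: (t,)  -- the one-layer attention network
    p: (k,)                      -- projection to the prediction score
    Returns the pairwise part of y_AFM and the attention scores a_ij.
    """
    hidden = np.maximum(0.0, interactions @ W.T + b)  # ReLU(W e_ij + b), (P, t)
    a = softmax(hidden @ h)                           # a_ij, (P,)
    pooled = a @ interactions                         # sum_ij a_ij e_ij, (k,)
    return pooled @ p, a
```

Adding $w_0 + \mathbf{w}^\top \mathbf{x}$ to the returned score gives the full $\hat{y}_{AFM}(\mathbf{x})$.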

3.2 Learning

  • In this paper, we focus on the regression task and optimize the squared loss $L = \sum_{\mathbf{x} \in \mathcal{T}} \big(\hat{y}_{AFM}(\mathbf{x}) - y(\mathbf{x})\big)^2$.
  • To optimize the objective function, we employ stochastic gradient descent (SGD); a minimal update sketch follows.
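An illustrative SGD step on the squared loss for a single example; `params` and `grads` are assumed to be dicts of parameter arrays and of $\partial \hat{y} / \partial \theta$ values (computing the gradients themselves is omitted here):

```python
def sgd_update(params, grads, y_hat, y, lr=0.01):
    """One SGD step on the squared loss L = (y_hat - y)^2.

    params: dict of parameter arrays, e.g. keys 'w0', 'w', 'V', 'W', 'b', 'h', 'p'
    grads:  dict with the same keys, holding d(y_hat)/d(param)
    """
    dL_dyhat = 2.0 * (y_hat - y)    # derivative of the squared loss w.r.t. y_hat
    for name in params:
        params[name] -= lr * dL_dyhat * grads[name]   # chain rule
    return params
```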

Overfitting Prevention

As AFM has a stronger representation ability than FM, it may be even
easier to overfit the training data.

  • Here we consider two techniques to prevent
    overfitting — dropout and L2 regularization — that have been
    widely used in neural network models.
  • The idea of dropout is to randomly drop some neurons (along with their connections) during training [Srivastava et al., 2014]; AFM applies it to the pair-wise interaction layer. For the attention network, a one-layer MLP, L2 regularization is used instead, giving the objective

    $$L = \sum_{\mathbf{x} \in \mathcal{T}} \big(\hat{y}_{AFM}(\mathbf{x}) - y(\mathbf{x})\big)^2 + \lambda \|\mathbf{W}\|^2$$
  • We do not employ dropout on the attention network, as we find that the joint use of dropout on both the interaction layer and the attention network leads to stability issues and degrades performance. A sketch of both techniques follows.
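An illustrative sketch of the two techniques as AFM applies them (dropout on the interacted vectors, L2 only on the attention network's $\mathbf{W}$); the inverted-dropout rescaling is a common convention, not necessarily the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

def dropout(E, rate, training=True):
    """Inverted dropout on the (P, k) pair-wise interacted vectors."""
    if not training or rate == 0.0:
        return E
    mask = rng.random(E.shape) > rate
    return E * mask / (1.0 - rate)   # rescale so expected activations match

def l2_penalty(W, lam):
    """L2 term added to the squared loss; applied only to W."""
    return lam * np.sum(W ** 2)
```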

4 Related Work

FMs [Rendle, 2010] are mainly used for supervised learning under sparse settings, for example in situations where categorical variables are converted to sparse feature vectors via one-hot encoding.

  • By directly extending FM with an attention mechanism that learns the importance of each feature interaction, our AFM is more interpretable and empirically demonstrates superior performance over Wide&Deep and DeepCross.

5 Experiments


5.1 Experimental Settings

Datasets.

  • Frappe
    The Frappe dataset has been used for context-aware recommendation; it contains 96,203 app usage logs of users under different contexts. The eight context variables are all categorical, including weather, city, daytime, and so on. We convert each log (user ID, app ID, and context variables) to a feature vector via one-hot encoding, obtaining 5,382 features.

  • MovieLens
    The MovieLens dataset has been used for personalized tag recommendation; it contains 668,953 tag applications of users on movies. We convert each tag application (user ID, movie ID, and tag) to a feature vector via one-hot encoding, obtaining 90,445 features. A small encoding sketch follows.
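A sketch of the one-hot conversion described above, using scikit-learn; the toy logs are hypothetical stand-ins, not actual Frappe records:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical (user ID, app ID, context) logs standing in for Frappe rows.
logs = np.array([
    ["u1", "appA", "sunny"],
    ["u2", "appB", "rainy"],
    ["u1", "appB", "sunny"],
])

enc = OneHotEncoder()     # returns a sparse matrix by default
X = enc.fit_transform(logs)
print(X.shape)            # (3, total number of distinct feature values)
```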

Evaluation Protocol

Baselines.

5.2 Hyper-parameter Investigation (RQ1)

First, we explore the effect of dropout on the pair-wise interaction layer.

  • By setting the dropout ratio to a proper value, both AFM and FM can be significantly improved.
  • Our implementation of FM offers better performance than LibFM; it adapts the learning rate for each parameter based on its frequency and employs dropout.
  • AFM outperforms FM and LibFM by a large margin.

We then study whether L2 regularization on the attention network is beneficial to AFM.

  • Simply using dropout on the pair-wise interaction layer is insufficient to prevent overfitting for AFM; more importantly, tuning the L2 regularization on the attention network can further improve the generalization of AFM.

5.3 Impact of the Attention Network (RQ2)

We now focus on analyzing the impact of the attention network on AFM.

  • We observe that AFM converges faster than FM.
  • On Frappe, both the training and test errors of AFM are much lower than those of FM, indicating that AFM can better fit the data and lead to more accurate predictions.
  • On MovieLens, although AFM achieves a slightly higher training error than FM, the lower test error shows that AFM generalizes better to unseen data.

Micro-level Analysis

Besides the improved performance, another key advantage of AFM is that it is more explainable: the attention score of each feature interaction can be interpreted directly.

5.4 Performance Comparison (RQ3)

  • First, we see that AFM achieves the best performance among all methods.

  • Second, HOFM improves over FM, which is attributed to its modelling of higher-order feature interactions. However, the slight improvement comes at the rather expensive cost of almost doubling the number of parameters.

  • Lastly, DeepCross performs the worst, due to a severe overfitting problem.


6 Conclusion and Future Work

  • Our AFM enhances FM by learning the importance of feature interactions with an attention network, which improves not only the representation ability but also the interpretability of an FM model.

  • As AFM has a relatively high complexity, quadratic in the number of non-zero features, we will consider improving its learning efficiency, for example by using learning-to-hash [Zhang et al., 2016b; Shen et al., 2015] and data-sampling [Wang et al., 2017b] techniques.


Reposted from blog.csdn.net/cat_xing/article/details/88912671