FM/FFM

FM (Factorization Machine)

The FM algorithm can be trained in linear time, which makes it a very efficient model. Its biggest strength is that it learns well on sparse data: it captures relationships between features through second-order interaction terms while preserving both training efficiency and predictive power.
One-hot encoded features: after encoding, most samples are very sparse and the feature space is large.
By observing large amounts of sample data, one finds that combining certain features increases their correlation with the label. For example, combinations such as "USA" with "Thanksgiving" or "China" with "Chinese New Year" have a positive effect on user clicks.
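To make the sparsity concrete, here is a minimal sketch (the column names and values are illustrative, not from the original post) of how one-hot encoding turns a few categorical fields into a wide, mostly-zero feature matrix:

```python
# A minimal sketch (illustrative data, not from the original post) of one-hot
# encoding: each categorical value becomes its own 0/1 column, so the encoded
# matrix is wide and mostly zero.
import pandas as pd

samples = pd.DataFrame({
    "country": ["USA", "China", "USA"],
    "holiday": ["Thanksgiving", "Chinese New Year", "Christmas"],
})

encoded = pd.get_dummies(samples)
print(encoded)
# With thousands of categories per field, the number of columns explodes
# while each individual row stays extremely sparse.
```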
The polynomial model is the most intuitive way to incorporate feature combinations. For computational efficiency, we only discuss the second-order polynomial model.

\[ y(x) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n\sum_{j=i+1}^n w_{ij}x_ix_j \]

As can be seen from this formula, the combination features introduce \(n(n-1)\over 2\) parameters in total, and any two of them are independent. When the samples containing a given feature combination are insufficient, the learned parameters will be inaccurate, which seriously hurts the model's predictive performance and stability.
So how can the training of the quadratic-term parameters be fixed? Matrix factorization offers an approach. In model-based collaborative filtering, a rating matrix can be decomposed into a user matrix and an item matrix, so that each user and each item is represented by a latent vector. Similarly, the weight matrix W can be decomposed as \(W = V^TV\), where the j-th column of V is the k-dimensional latent vector of feature j. Each parameter then becomes \(w_{ij} = ⟨v_i, v_j⟩\), and this is the core idea of the FM model. The FM model equation is therefore:

\[ y(x) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n\sum_{j=i+1}^n ⟨v_i,v_j⟩x_ix_j \\ ⟨v_i,v_j⟩ = \sum_{f=1}^k v_{i,f}\cdot v_{j,f} \]
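As an illustration, a direct implementation of this equation might look like the following sketch (toy random weights, not real data); the nested loop makes the \(\mathcal O(kn^2)\) cost of the pairwise term explicit:

```python
# A direct, readable implementation of the FM equation above, using toy random
# weights; the double loop makes the O(k n^2) cost of the pairwise term explicit.
import numpy as np

def fm_predict_naive(x, w0, w, V):
    """x: (n,) features, w0: bias, w: (n,) linear weights, V: (n, k) latent vectors."""
    n = len(x)
    y = w0 + w @ x
    for i in range(n):
        for j in range(i + 1, n):
            y += (V[i] @ V[j]) * x[i] * x[j]   # <v_i, v_j> * x_i * x_j
    return y

rng = np.random.default_rng(0)
n, k = 6, 3
x = rng.integers(0, 2, size=n).astype(float)   # sparse 0/1 feature vector
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
print(fm_predict_naive(x, w0, w, V))
```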

Each latent vector has length k (k ≪ n) and contains k factors describing the feature.
The formula above is a general fitting equation: with different loss functions it can be used for regression, binary classification, and other tasks. For example, MSE (mean squared error) loss can be used for regression, while hinge or cross-entropy loss can be used for classification. For binary classification, the FM output needs to pass through a sigmoid transformation, just as in logistic regression.

The complexity of the FM formula as written is \(\mathcal O(kn^2)\), but through the following equivalent transformation the quadratic term can be simplified, reducing the complexity to \(\mathcal O(kn)\):

\[ \sum_{i=1}^n\sum_{j=i+1}^n⟨v_i,v_j⟩x_ix_j=\frac{1}{2}\sum_{f=1}^k[(\sum_{i=1}^nv_{i,f}x_i)^2-\sum_{i=1}^nv_{i,f}^2x_i^2] \]
Detailed derivation:

\[ \begin{align} &\sum_{i=1}^n\sum_{j=i+1}^n⟨v_i,v_j⟩x_ix_j \\ =&\frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n⟨v_i,v_j⟩x_ix_j-\frac{1}{2}\sum_{i=1}^n⟨v_i,v_i⟩x_ix_i \\ =&\frac{1}{2}(\sum_{i=1}^n\sum_{j=1}^n\sum_{f=1}^kv_{i,f}v_{j,f}x_ix_j-\sum_{i=1}^n\sum_{f=1}^kv_{i,f}v_{i,f}x_ix_i) \\ =&\frac{1}{2}\sum_{f=1}^k[(\sum_{i=1}^nv_{i,f}x_i)·(\sum_{j=1}^nv_{j,f}x_j)-\sum_{i=1}^nv_{i,f}^2x_i^2] \\ =&\frac{1}{2}\sum_{f=1}^k[(\sum_{i=1}^nv_{i,f}x_i)^2- \sum_{i=1}^nv_{i,f}^2x_i^2] \end{align} \]
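A quick numerical check of this equivalence, using random toy values (not data from the post), might look like:

```python
# A quick numerical check of the O(kn) reformulation against the naive pairwise
# sum, on random toy values (no real data involved).
import numpy as np

rng = np.random.default_rng(1)
n, k = 8, 4
x = rng.normal(size=n)
V = rng.normal(size=(n, k))

# Naive O(k n^2) pairwise interaction term.
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(n) for j in range(i + 1, n))

# Simplified O(k n) form: 1/2 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]
xv = x @ V                                   # (k,): per-factor sums sum_i v_{i,f} x_i
fast = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))

assert np.isclose(naive, fast)
print(naive, fast)
```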

The central contributions of the FM model can be summarized in the following three points:

  1. FM reduces the impact of insufficient training data for the cross-term parameters: after one-hot encoding, the sample data is very sparse, especially for feature combinations, so the cross-term parameters cannot be learned adequately, leaving the model incomplete or unstable. Borrowing the idea of matrix factorization, each feature is represented by a k-dimensional latent vector, and the cross-term parameter \(w_{ij}\) is expressed as the inner product of the corresponding latent vectors, i.e. \(⟨v_i, v_j⟩\). This turns the problem of learning the cross-term parameters \(w_{ij}\) directly into learning n k-dimensional latent vectors, one per feature.
  2. FM improves the model's predictive ability: the interaction term can estimate feature combinations that never appear in the training set.
  3. FM improves parameter-learning efficiency: compared with the polynomial model, computing and updating the parameters becomes linear in complexity. Viewed from the interaction term, FM is simply a way of expressing pairwise feature interactions as a function, and it can be extended to higher-order forms that take the interactions among several different feature components into account (a sketch of the degree-3 form is shown after this list). For example, in an advertising scenario that considers the relationships among the three dimensions User-Ad-Context, the corresponding FM model has degree 3.
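For reference, a sketch of the degree-3 form following the standard higher-order FM formulation (the superscript distinguishes the third-order latent vectors from the second-order ones; this formula is not from the original post):

\[ y(x) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n\sum_{j=i+1}^n ⟨v_i,v_j⟩x_ix_j + \sum_{i=1}^n\sum_{j=i+1}^n\sum_{l=j+1}^n \Bigl(\sum_{f=1}^{k_3} v^{(3)}_{i,f}v^{(3)}_{j,f}v^{(3)}_{l,f}\Bigr)x_ix_jx_l \]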

Compared with other models, FM has the following advantages:

  • FM is a flexible model: with suitable feature transformations, FM can mimic a second-order polynomial-kernel SVM, the MF model, and the SVD++ model;
  • Compared with a second-order polynomial-kernel SVM, FM has an advantage on sparse samples; moreover, FM's training/prediction complexity is linear, whereas the polynomial-kernel SVM must compute the kernel matrix, whose complexity is quadratic in n.

FFM (Field-aware Factorization Machine)

A drawback of FM: because every pair of features is combined, any two features are directly or indirectly correlated, so the latent vectors used in all pairwise combinations are coupled with each other, which actually limits the model's complexity. If, on the other hand, every pairwise feature combination were completely independent, as with kernel-based feature crosses, the model would have higher freedom and expressiveness, but also a much higher computational cost. FFM sits exactly between the two.
FFM introduces the concept of fields (feature groups) into the problem. Features of the same nature are assigned to the same field, and each feature learns a separate latent vector for every other field it is combined with, so the number of independent feature-combination parameters is greatly reduced compared with fully independent cross terms.
Suppose the n features of a sample belong to f fields; then the quadratic term of FFM has nf latent vectors. In the FM model, each feature has only one latent vector, so FM can be seen as the special case of FFM in which all features belong to a single field. The FFM model equation is as follows:

\[ y(x) = w_0 + \sum_{i=1}^n w_ix_i + \sum_{i=1}^n\sum_{j=i+1}^n ⟨v_{i,f_j},v_{j,f_i}⟩x_ix_j \]

If each latent vector has length k, then FFM has nfk quadratic-term parameters, far more than the nk of FM.
Because the latent vectors used by FFM for different feature crosses are independent of each other, FFM achieves better combination results; but this also means the quadratic term of FFM cannot be simplified as in FM, so its complexity remains \(\mathcal O(kn^2)\).
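A minimal numpy sketch of the FFM interaction term (field assignments and weights below are illustrative placeholders) makes the field-aware indexing explicit:

```python
# A minimal sketch of the FFM interaction term: feature i keeps one latent vector
# per field, and crossing feature i with feature j uses v_{i, field(j)} and
# v_{j, field(i)}. Field assignments and weights here are illustrative placeholders.
import numpy as np

def ffm_interactions(x, field, V):
    """x: (n,) feature values, field: (n,) field index per feature,
    V: (n, f, k) latent vectors -- one k-vector per (feature, field) pair."""
    n, y = len(x), 0.0
    for i in range(n):
        for j in range(i + 1, n):
            vi = V[i, field[j]]            # latent vector of feature i w.r.t. j's field
            vj = V[j, field[i]]            # latent vector of feature j w.r.t. i's field
            y += (vi @ vj) * x[i] * x[j]
    return y

rng = np.random.default_rng(2)
n, f, k = 6, 3, 4
x = rng.integers(0, 2, size=n).astype(float)
field = np.array([0, 0, 1, 1, 2, 2])       # which field each feature belongs to
V = rng.normal(size=(n, f, k))             # n*f latent vectors in total
print(ffm_interactions(x, field, V))
```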

Solving for the weights:
The libFFM implementation uses stochastic gradient descent with AdaGrad and omits the constant and linear terms from the FFM equation, so the model equation becomes:

\[ \phi(w,x) = \sum_{j_1,j_2\in C_2} ⟨w_{j_1,f_2},w_{j_2,f_1}⟩x_{j_1}x_{j_2} \]

where \(C_2\) is the set of pairwise combinations of nonzero features, \(j_1\) is a feature belonging to field \(f_1\), and \(w_{j_1,f_2}\) is the latent vector of feature \(j_1\) with respect to field \(f_2\). This FFM implementation uses the logistic loss with an L2 penalty term, and is therefore only suitable for binary classification problems:

\[ \underset{w}{\min}\sum_{i=1}^n\log\left(1+\exp(-y_i\phi(w,x_i))\right)+\frac{\lambda}{2}\|w\|^2 \]
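A rough single-sample sketch of this kind of update (simplified relative to libFFM; the learning rate, regularization strength, and initialization below are placeholder choices) could look like:

```python
# A rough single-sample AdaGrad update for the objective above (logistic loss +
# L2), simplified relative to libFFM: constant and linear terms are dropped and
# the hyper-parameters are placeholders.
import numpy as np

def ffm_phi(x, field, W):
    """Interaction-only FFM score; W has shape (n, f, k)."""
    n, phi = len(x), 0.0
    for j1 in range(n):
        for j2 in range(j1 + 1, n):
            if x[j1] == 0.0 or x[j2] == 0.0:
                continue
            phi += (W[j1, field[j2]] @ W[j2, field[j1]]) * x[j1] * x[j2]
    return phi

def adagrad_step(x, y, field, W, G, lam=2e-5, eta=0.2):
    """One stochastic step; y in {-1, +1}. G accumulates squared gradients
    (initialize it with ones so the first division is well defined)."""
    kappa = -y / (1.0 + np.exp(y * ffm_phi(x, field, W)))   # d(logistic loss)/d(phi)
    n = len(x)
    for j1 in range(n):
        for j2 in range(j1 + 1, n):
            if x[j1] == 0.0 or x[j2] == 0.0:
                continue
            # Gradients of the regularized loss w.r.t. the two latent vectors.
            g1 = lam * W[j1, field[j2]] + kappa * W[j2, field[j1]] * x[j1] * x[j2]
            g2 = lam * W[j2, field[j1]] + kappa * W[j1, field[j2]] * x[j1] * x[j2]
            G[j1, field[j2]] += g1 ** 2
            G[j2, field[j1]] += g2 ** 2
            W[j1, field[j2]] -= eta * g1 / np.sqrt(G[j1, field[j2]])
            W[j2, field[j1]] -= eta * g2 / np.sqrt(G[j2, field[j1]])
```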

Comparison of FM/FFM with other models

  • vs. neural networks: neural networks have difficulty directly handling high-dimensional sparse discrete features, because these lead to an enormous number of connection parameters; a factorization machine can be viewed as embedding the sparse high-dimensional discrete features into a low-dimensional space.
  • vs. gradient boosted trees: when the data is not highly sparse, gradient boosted trees can effectively learn fairly complex feature combinations; but on highly sparse data, the number of possible feature combinations far exceeds the number of samples even at second order, so gradient boosted trees cannot learn such high-order combinations.

Source: www.cnblogs.com/makefile/p/ffm.html