【Recommendation System 02】DeepFM, YoutubeDNN, DSSM, MMOE

1 DeepFM

Online Advertising Dataset: Criteo Labs

Description: Contains millions of click feedback records for display ads, which can be used as a benchmark for click-through rate (CTR) prediction.

Each record has 40 columns: the first column is the label, where 1 means the ad was clicked and 0 means it was not. The remaining 39 features consist of 13 dense features and 26 sparse features.
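As a minimal loading sketch (the file name `criteo_sample.txt`, the tab-separated header-less layout, and the column names `I1`-`I13` / `C1`-`C26` are assumptions based on the usual Criteo release format):

```python
import pandas as pd

dense_cols = [f"I{i}" for i in range(1, 14)]    # 13 numerical (dense) features
sparse_cols = [f"C{i}" for i in range(1, 27)]   # 26 categorical (sparse) features
columns = ["label"] + dense_cols + sparse_cols

# Hypothetical sample file; the raw Criteo data is tab-separated with no header row.
df = pd.read_csv("criteo_sample.txt", sep="\t", header=None, names=columns)
print(df["label"].value_counts())               # 1 = clicked, 0 = not clicked
```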


1.1 Feature Engineering

  • Dense features: also known as numerical features, such as salary and age. Two operations are applied to dense features in this tutorial (see the sketch after this list):
    • Normalize them with MinMaxScaler so their values fall in [0, 1]
    • Discretize them into new sparse features
  • Sparse features: also known as categorical features, such as gender and education. Sparse features are encoded directly with LabelEncoder, which maps each original category string to an integer id; the model then learns an embedding vector for each id.
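A sketch of these preprocessing steps with sklearn, assuming `df`, `dense_cols`, and `sparse_cols` from the loading step above; bin counts and fill values are illustrative choices, not the tutorial's exact settings:

```python
from sklearn.preprocessing import KBinsDiscretizer, LabelEncoder, MinMaxScaler

df[dense_cols] = df[dense_cols].fillna(0)
df[sparse_cols] = df[sparse_cols].fillna("-1")

# 1) Normalize dense features into [0, 1].
df[dense_cols] = MinMaxScaler().fit_transform(df[dense_cols])

# 2) Discretize each dense feature into a new sparse (bucketized) feature.
kbins = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
buckets = kbins.fit_transform(df[dense_cols]).astype(int)
bucket_cols = [f"{c}_bucket" for c in dense_cols]
for i, c in enumerate(bucket_cols):
    df[c] = buckets[:, i]

# 3) Label-encode sparse features; each integer id later gets its own embedding.
for c in sparse_cols + bucket_cols:
    df[c] = LabelEncoder().fit_transform(df[c].astype(str))
```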

1.2 Torch-RecHub framework

  1. The Torch-RecHub framework is built mainly on PyTorch and sklearn. It is easy to use and extend, reproduces practical industry recommendation models, is highly modular, and supports common layers, ranking models, recall (retrieval) models, and multi-task learning.
  2. How to use it: build data loaders with DataGenerator, construct a lightweight model, train it with the unified trainer, and finally evaluate the model.
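A rough sketch of that workflow; the import paths and class names (DenseFeature, SparseFeature, DataGenerator, DeepFM, CTRTrainer) follow the public Torch-RecHub examples and may differ between versions, and the feature columns come from the preprocessing sketch above:

```python
from torch_rechub.basic.features import DenseFeature, SparseFeature
from torch_rechub.models.ranking import DeepFM
from torch_rechub.trainers import CTRTrainer
from torch_rechub.utils.data import DataGenerator

# Describe the features so the model knows which columns are dense vs. embedded.
dense_feas = [DenseFeature(c) for c in dense_cols]
sparse_feas = [SparseFeature(c, vocab_size=df[c].nunique(), embed_dim=16)
               for c in sparse_cols]

# 1) Build train/validation/test DataLoaders.
dg = DataGenerator({c: df[c].values for c in dense_cols + sparse_cols},
                   df["label"].values)
train_dl, val_dl, test_dl = dg.generate_dataloader(split_ratio=[0.7, 0.1],
                                                   batch_size=256)

# 2) Build a lightweight model; 3) train and evaluate with the unified trainer.
model = DeepFM(deep_features=dense_feas, fm_features=sparse_feas,
               mlp_params={"dims": [256, 128], "dropout": 0.2, "activation": "relu"})
trainer = CTRTrainer(model, optimizer_params={"lr": 1e-3}, n_epoch=2, device="cpu")
trainer.fit(train_dl, val_dl)
print("test AUC:", trainer.evaluate(trainer.model, test_dl))
```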

1.3 Background

DeepFM is a model proposed by Huawei Noah's Ark Laboratory in 2017.

FM (Factorization Machines) was proposed to address the difficulty of training model parameters on sparse data. As a recommendation algorithm, FM is widely used in recommender systems and computational advertising, typically to predict click-through rate (CTR) and conversion rate (CVR).

https://zhuanlan.zhihu.com/p/342803984

  • Limitation of a plain DNN (fully connected network): feeding one-hot features in directly makes the number of network parameters explode, so the one-hot features are first converted into dense vectors (embeddings).
  • FNN (Factorization-machine supported Neural Network) and PNN (Product-based Neural Network): FNN connects a pre-trained FM module to a DNN; PNN instead adds a product layer between the embedding layer and hidden layer 1, using the product layer in place of FM pre-training.

2 FM Part

  • The FM layer consists of a first-order term and a second-order (pairwise interaction) term; their sum (the logit) is passed through a sigmoid to obtain the prediction.

  • Model formula:

    $$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$

    where $v_i \in \mathbb{R}^k$ is the latent vector of feature $i$ and $\langle v_i, v_j \rangle$ is their inner product.

    The FM model formula is a general fitting equation: paired with different loss functions it can handle regression, classification, and similar problems. FM can also make predictions for new samples in linear time.

  • Advantages:

    1. By using the inner product of latent vectors as the weight of a cross feature, the weights of cross features can be trained effectively even when the data is very sparse (the two features are not required to be non-zero in the same sample).
    2. After an algebraic reformulation of the formula, the computation takes O(nk) time, so it is very efficient (see the derivation after this list).
    3. Although the overall feature space in recommendation scenarios is very large, FM training and prediction only need to touch the non-zero features of each sample, which further speeds up training and online prediction.
    4. Because the model is computationally efficient and can automatically mine long-tail, low-frequency items in sparse scenarios, it is applicable to all three stages of recall, pre-ranking, and ranking; in different stages the sample construction, fitting target, and online serving differ.
  • Disadvantage: it can only model explicit second-order feature crosses and cannot capture higher-order crosses.
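The O(nk) complexity in point 2 comes from rewriting the pairwise interaction term of the formula above so that feature pairs never have to be enumerated:

$$\sum_{i=1}^{n}\sum_{j=i+1}^{n}\langle v_i, v_j\rangle\, x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right]$$

The right-hand side is two passes over the $n$ features for each of the $k$ latent dimensions, hence O(nk); only the non-zero $x_i$ contribute, which is why sparse samples are cheap.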

3 Deep Part

  • Model composition:

    1. Dense embeddings are fed through fully connected hidden layers, which avoids the parameter explosion a DNN would suffer on raw one-hot inputs.
    2. The output of the embedding layer is the concatenation of the embedding vectors of all id-type features, which is then fed into the DNN.
  • Model formula:

    $$a^{(l+1)} = \sigma\!\left(W^{(l)} a^{(l)} + b^{(l)}\right), \qquad y_{DNN} = \sigma\!\left(W^{|H|+1} a^{|H|} + b^{|H|+1}\right)$$

    where $a^{(l)}$ is the output of hidden layer $l$ and $|H|$ is the number of hidden layers; the final DeepFM prediction combines both parts as $\hat{y} = \mathrm{sigmoid}(y_{FM} + y_{DNN})$.
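Below is a minimal PyTorch sketch of this Deep part (shared embeddings concatenated and fed to an MLP), combined with the FM output through a sigmoid as in the formula above; field counts, vocabulary sizes, and layer widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DeepPart(nn.Module):
    """Concatenate the embeddings of all sparse (id-type) fields and run an MLP."""
    def __init__(self, field_dims, embed_dim=16, hidden=(256, 128)):
        super().__init__()
        self.embeddings = nn.ModuleList(nn.Embedding(d, embed_dim) for d in field_dims)
        layers, in_dim = [], embed_dim * len(field_dims)
        for h in hidden:
            layers += [nn.Linear(in_dim, h), nn.ReLU()]
            in_dim = h
        layers.append(nn.Linear(in_dim, 1))       # produces the y_DNN logit
        self.mlp = nn.Sequential(*layers)

    def forward(self, x_sparse):                  # x_sparse: (batch, num_fields) int ids
        emb = torch.cat([e(x_sparse[:, i]) for i, e in enumerate(self.embeddings)], dim=1)
        return self.mlp(emb).squeeze(1)

field_dims = [1000] * 26                          # assumed vocabulary size per sparse field
deep = DeepPart(field_dims)
x = torch.randint(0, 1000, (4, 26))
y_dnn = deep(x)
y_fm = torch.zeros_like(y_dnn)                    # stand-in for the FM part's logit
y_hat = torch.sigmoid(y_fm + y_dnn)               # final DeepFM prediction
```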

4 YoutubeDNN

Reference: Implementation of YouTubeDNN

Recall is performed by nearest-neighbor search over the user embeddings and item embeddings, and negative sampling is used in the softmax.
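A toy sketch of that recall step, assuming user and item embeddings have already been trained: score every item by inner product with the user vector and keep the top k (in production an approximate nearest-neighbor index such as Faiss would replace the brute-force search):

```python
import numpy as np

def recall_top_k(user_emb, item_embs, k=10):
    """Brute-force inner-product nearest neighbors; returns the indices of the top-k items."""
    scores = item_embs @ user_emb             # (num_items,)
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
user_emb = rng.normal(size=64)                # illustrative 64-dim user vector
item_embs = rng.normal(size=(10000, 64))      # illustrative item embedding table
print(recall_top_k(user_emb, item_embs, k=5))
```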

The input data is highly sparse; it is processed with embeddings and average pooling to obtain the user's watch/search interest vector.

Discrete data can be processed by embedding, and continuous data can be processed by normalization or bucketing

  • Skip-gram: word vectors are trained by using the center word to predict its context words. The model's input is the center word, and (center, context) training pairs are produced by sliding a window over the sequence. The center word's vector is looked up from the word-vector matrix (a matrix multiplication with its one-hot vector), multiplied with the context matrix to get the similarity between the center word and every word in the vocabulary, a softmax turns the similarities into probabilities, and the index with the highest probability is the output (a small sketch follows this list).

    In other words: convert a word into a one-hot vector, multiply it with the embedding matrix to get its embedding vector, and feed that into the softmax to get the prediction.

    Loss function (assuming a 10,000-word vocabulary), i.e. the softmax cross-entropy:

    $$p(t \mid c) = \frac{e^{\theta_t^{\top} e_c}}{\sum_{j=1}^{10000} e^{\theta_j^{\top} e_c}}, \qquad L(\hat{y}, y) = -\sum_{i=1}^{10000} y_i \log \hat{y}_i$$

    where $e_c$ is the embedding of the center word and $\theta_t$ is the context-matrix vector of word $t$.

  • YoutubeDNN training: the right-hand part is similar to skip-gram, except the "center word" is obtained directly as an embedding; the left-hand part concatenates the user's feature vectors into one large vector and reduces its dimensionality with a DNN.
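A compact sketch of the skip-gram step described in the first bullet, assuming a 10,000-word vocabulary: look up the center word's vector, multiply it with the context matrix to score every word, then apply a softmax cross-entropy loss against the observed context word:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, embed_dim = 10000, 64                 # illustrative sizes

class SkipGram(nn.Module):
    def __init__(self):
        super().__init__()
        self.center_emb = nn.Embedding(vocab_size, embed_dim)            # word-vector matrix
        self.context_mat = nn.Linear(embed_dim, vocab_size, bias=False)  # context matrix

    def forward(self, center_ids):
        v_c = self.center_emb(center_ids)         # look up the center-word vector
        return self.context_mat(v_c)              # similarity of the center word with every word

model = SkipGram()
center = torch.tensor([42])                       # one (center, context) pair from a sliding window
context = torch.tensor([7])
loss = F.cross_entropy(model(center), context)    # softmax over the full 10,000-word vocabulary
loss.backward()
```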

5 DSSM

DSSM (Deep Structured Semantic Model), the classic two-tower (twin-tower) model.

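A minimal two-tower sketch under assumed feature sizes: a user tower and an item tower each produce an embedding, and the matching score is their cosine similarity (inner product of normalized vectors):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def tower(in_dim, out_dim=32):
    """One tower: a small MLP mapping raw features to a dense embedding."""
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

user_tower = tower(in_dim=64)                 # assumed user-feature dimension
item_tower = tower(in_dim=48)                 # assumed item-feature dimension

user_x = torch.randn(8, 64)
item_x = torch.randn(8, 48)
u = F.normalize(user_tower(user_x), dim=1)    # user embedding
v = F.normalize(item_tower(item_x), dim=1)    # item embedding
score = (u * v).sum(dim=1)                    # cosine similarity per (user, item) pair
```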

In NLP, negative sampling draws samples from words other than the target word. k is usually 5-20, and uniform sampling cannot be used because it would mostly return high-frequency words. The following distribution can be used instead:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j} f(w_j)^{3/4}}$$

where $f(w_i)$ is the corpus frequency of word $w_i$.
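A short numpy sketch of that sampling rule (the 3/4-power word-frequency heuristic), with made-up word counts:

```python
import numpy as np

freqs = np.array([900, 50, 30, 15, 5], dtype=float)    # illustrative word frequencies
probs = freqs ** 0.75
probs /= probs.sum()                                    # P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)

rng = np.random.default_rng(0)
negatives = rng.choice(len(freqs), size=5, p=probs)     # draw k = 5 negative samples
print(probs, negatives)
```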

6 Multi-task Learning Concepts

First of all, what is multi-task learning?

In a conventional classification task, each instance usually corresponds to a single label; for example, the i-th instance might belong only to the second class.


In multi-label learning, however, one instance can correspond to several labels at once.

Loss function:

$$L = \frac{1}{m}\sum_{i=1}^{m}\sum_{j=1}^{k} -\Big[\, y_j^{(i)} \log \hat{y}_j^{(i)} + \big(1 - y_j^{(i)}\big)\log\big(1 - \hat{y}_j^{(i)}\big) \Big]$$

  • m is the number of samples

  • j indexes the labels, j = 1 ... k

  • Multi-task learning shares the same low-level features

  • For multi-task learning, we can try a single neural network large enough to handle all tasks (a minimal sketch follows this list)
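A minimal sketch of one network handling all labels with shared low-level features and computing the loss above; all sizes are illustrative:

```python
import torch
import torch.nn as nn

num_features, num_labels = 20, 4              # illustrative sizes (k = 4 labels)

model = nn.Sequential(
    nn.Linear(num_features, 64), nn.ReLU(),   # shared low-level features
    nn.Linear(64, num_labels),                # one output unit (logit) per label j
)

x = torch.randn(8, num_features)              # m = 8 samples
y = torch.randint(0, 2, (8, num_labels)).float()   # each instance may carry several labels

# Sum of per-label binary cross-entropies, averaged over the m samples,
# matching the loss function written above.
loss = nn.BCEWithLogitsLoss(reduction="sum")(model(x), y) / x.shape[0]
loss.backward()
```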

7 MMOE

Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts (MMoE)

Paper https://dl.acm.org/doi/10.1145/3219819.3220007

Resources for the MMoE model:

  1. Video introduction (YouTube): https://www.youtube.com/watch?v=Dweg47Tswxw

  2. Open-source Keras implementation: https://github.com/drawbridge/k

  • Multi-task model: learning both the commonalities and the differences between tasks can improve modeling quality and efficiency.
  • Multi-task model design patterns:
    1. Hard parameter sharing: the bottom layers are shared hidden layers that learn patterns common to all tasks, while task-specific fully connected layers on top learn each task's own patterns.
    2. Soft parameter sharing: instead of a single shared bottom there are multiple bottoms (towers), and each task assigns different weights to them.
    3. Task-sequence dependency modeling: suitable when there are sequential dependencies between tasks, e.g. click followed by conversion.


A Mixture of Experts (MoE) is a kind of neural network and also a combined (ensemble-style) model; it suits datasets whose samples are generated by different underlying processes. Unlike an ordinary neural network, it trains multiple models on separate portions of the data; each model is called an expert, and a gating module decides which experts to use. The actual output of the model is a weighted combination of the experts' outputs, with the weights produced by the gating model. Each expert can be a different function (various linear or non-linear functions). In short, a mixture of experts combines multiple models to solve a single task.

Mixture-of-experts systems come in two architectures: competitive MoE and cooperative MoE. In competitive MoE, the data are forced to concentrate locally in discrete regions of the input space, while cooperative MoE imposes no such constraint.

  • MoE model principle: the outputs of multiple Experts are aggregated, and the weight of each Expert is produced by a gating network (an attention-like mechanism).

  • MMoE extends OMoE (one-gate MoE): instead of a single shared gate, each task has its own gating network over the shared Experts (see the sketch after this list).

  • Characteristics:

    1. Avoids task conflicts: each task's gate adjusts the mixture and selects the combination of Experts that helps that task.
    2. Models the relationships between tasks.
    3. Flexible parameter sharing.
    4. The model converges quickly during training.
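A compact PyTorch sketch of the MMoE idea described above: several shared experts, and for each task a softmax gate that mixes the expert outputs before a task-specific tower; all sizes are illustrative:

```python
import torch
import torch.nn as nn

class MMoE(nn.Module):
    """Minimal MMoE: shared experts, one softmax gate per task, one tower per task."""
    def __init__(self, in_dim, num_experts=4, num_tasks=2, expert_dim=32):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
            for _ in range(num_experts))
        self.gates = nn.ModuleList(nn.Linear(in_dim, num_experts) for _ in range(num_tasks))
        self.towers = nn.ModuleList(nn.Linear(expert_dim, 1) for _ in range(num_tasks))

    def forward(self, x):
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            w = torch.softmax(gate(x), dim=1).unsqueeze(2)              # (B, E, 1) per-task gate
            task_in = (w * expert_out).sum(dim=1)                       # weighted expert mixture
            outputs.append(torch.sigmoid(tower(task_in)).squeeze(1))    # task-specific tower
        return outputs                                                  # one prediction per task

model = MMoE(in_dim=16)
y_task1, y_task2 = model(torch.randn(8, 16))
```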


Origin blog.csdn.net/weixin_42322991/article/details/125449432