Practice of sparse large models in iQiyi advertising ranking scenarios

01

   Background and current status

1. Characteristics of data in the advertising field

Data in the advertising field can be divided into continuous-value features and discrete-value features. Unlike AI fields such as image, video, and speech, the raw data in advertising is mostly presented in the form of IDs, such as user IDs, ad IDs, and the sequence of ad IDs a user has interacted with. The ID space is very large, which gives advertising data its distinctive high-dimensional, sparse character.

Continuous Value Features
  • There are both static features (such as a user's age) and dynamic features derived from user behavior (such as the number of times a user has clicked on ads in a certain industry).
  • The advantage is good generalization: a user's preference for an industry can generalize to other users who share the same statistical characteristics for that industry.
  • The disadvantage is that the lack of memorization leads to low discrimination. For example, two users with identical statistical characteristics may still behave very differently. In addition, continuous-value features require a large amount of manual feature engineering.
Discrete Value Features
  • Discrete-value features are fine-grained features. Some are enumerable (such as user gender or industry ID), while others are high-dimensional (such as user ID or ad ID).
  • The advantage is strong memorization and high discrimination. Discrete-value features can also be combined to learn cross and collaborative information.
  • The disadvantage is relatively weak generalization.
Advertising is a scenario that requires strong memorization of users and strong differentiation of media traffic, so discrete-value features are the basis for personalized prediction and optimization in advertising models. High-dimensional sparse data such as ad IDs, user IDs, and their combinations can be used as discrete-value features, so that users with different behaviors can be well distinguished at the feature level. In general, there are two ways to use discrete-value features:
  • One-hot Encoding
  • Feature embedding (Embedding)
One-hot encoding of high-dimensional discrete-value features easily leads to the "curse of dimensionality", which shows up as parameter explosion, slow model convergence, and weak generalization. It is therefore only suitable for discrete values with a limited, enumerable range. For large-scale IDs, feature embedding maps the sparse IDs of a high-dimensional space into vectors in a low-dimensional dense space.
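To make the contrast concrete, here is a minimal TensorFlow sketch (illustrative feature names and sizes, not iQiyi's production code) showing one-hot encoding for a small enumerable feature versus an embedding lookup for a large ID space:

```python
import tensorflow as tf

# One-hot encoding: reasonable only when the vocabulary is tiny and enumerable.
gender_ids = tf.constant([0, 1, 0])                # e.g. 0 = male, 1 = female
gender_one_hot = tf.one_hot(gender_ids, depth=2)   # shape [3, 2]

# Feature embedding: a large ID space is mapped into a low-dimensional dense
# space through a [vocab_size, embedding_dim] table instead of exploding into
# a huge one-hot vector.
vocab_size, embedding_dim = 1_000_000, 16          # assumed sizes for illustration
embedding_table = tf.Variable(
    tf.random.normal([vocab_size, embedding_dim], stddev=0.01))
user_ids = tf.constant([12345, 678910])
user_embeddings = tf.nn.embedding_lookup(embedding_table, user_ids)  # shape [2, 16]
```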

2. Current status of iQiyi’s advertising ranking model
In 2016, the Wide & Deep model proposed by Google formally introduced deep learning into the recommendation field. Wide & Deep unified the modeling of memorization and generalization and quickly became the baseline model for industrial search, advertising, and recommendation. iQiyi's advertising ranking business likewise evolved from an online-learning FM model to a DNN model in 2019.
Our DNN model is trained and served on the open-source TensorFlow framework. TensorFlow uses dense Tensors as its basic unit for computing, storing, and transmitting data. The shape of the Tensor that stores the embeddings of discrete-value ID features must be determined in advance, i.e., fixed to [vocabulary_size, embedding_dimension], where vocabulary_size has to be chosen manually according to the ID space. Therefore, when introducing high-dimensional sparse ID features, we first hash each ID feature into the vocabulary_size range.
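The static-Embedding path described above looks roughly like the following sketch (assumed sizes and feature names, not the production model); the fixed table shape is what forces the hash step and the resulting collisions:

```python
import tensorflow as tf

vocabulary_size, embedding_dimension = 100_000, 16   # shape fixed up front

user_id_embedding = tf.Variable(
    tf.random.normal([vocabulary_size, embedding_dimension], stddev=0.01),
    name="user_id_embedding")

raw_user_ids = tf.constant(["u_839201", "u_17465520", "u_98431"])
# Billion-scale raw IDs are squeezed into a 100K hash space, so different IDs
# can collide and the feature information is damaged.
hashed_ids = tf.strings.to_hash_bucket_fast(raw_user_ids, num_buckets=vocabulary_size)
user_vectors = tf.nn.embedding_lookup(user_id_embedding, hashed_ids)
```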
Currently we run into the following problems when using high-dimensional sparse ID features:
  • Feature collisions: if vocabulary_size is set too large, training efficiency drops sharply and training fails with out-of-memory (OOM) errors. Therefore, even for billion-scale user ID features we only allocate a hash space on the order of 100,000, so the hash collision rate is high, feature information is damaged, and offline evaluation shows no positive benefit.
  • Inefficient IO: features such as user ID and ad ID are high-dimensional and sparse, meaning the parameters updated in each training step are only a small fraction of the total. Under TensorFlow's native static Embedding mechanism, saving and restoring the model must process the entire dense Tensor, which brings huge IO overhead and cannot support the training of sparse large models.


02

   Sparse large model practice in advertising

As mentioned above, discrete-value features are the basis for further personalized prediction, yet we ran into the problems described above when using high-dimensional sparse ID features. In 2023 we therefore adopted mainstream open-source technology from the industry to build training and inference for sparse large models.
1. Algorithm framework
Over the past few years, the industry has explored TensorFlow support for sparse large models in recommendation extensively and has landed it in real business scenarios. We chose the TFRA (TensorFlow Recommenders Addons) dynamic Embedding open-source component mainly for the following reasons:
  • The TFRA API is compatible with the TensorFlow ecosystem (it reuses the original optimizers and initializers, and the APIs have the same names and consistent behavior), so TensorFlow can support the training and inference of ID-type sparse large models in a more native way. The cost of learning and using it is low, and it does not change algorithm engineers' modeling habits.
  • Memory grows and shrinks dynamically, saving resources during training; hash collisions are effectively avoided, so feature information stays lossless.
Based on TensorFlow 2.6.0 and TFRA 0.6.0, we carried out the following iterations (a sketch follows this list):
  • Upgrading static Embedding to dynamic Embedding: the manual hashing logic for discrete-value features is replaced by TFRA dynamic Embedding for storing, accessing, and updating parameters, which guarantees in the algorithm framework that the embeddings of all discrete-value features are collision-free and that all discrete-value features are learned losslessly.
  • Using high-dimensional sparse ID features: as mentioned above, under TensorFlow's static Embedding, user ID and ad ID features showed no benefit in offline evaluation because of hash collisions. After the framework upgrade, we reintroduced user ID and ad ID features and obtained positive gains both offline and online.
  • Using high-dimensional sparse combined ID features: we introduced combinations of the user ID with coarse-grained ad IDs, such as the user ID crossed with the industry ID and with the App package name. Combined with the feature admission function, we also introduced even sparser combinations of user ID and ad ID.
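A hedged sketch of what the upgrade looks like with the public TFRA 0.6 dynamic-Embedding API (feature names, dimensions, and the crossing scheme are illustrative assumptions, not iQiyi's actual model code):

```python
import tensorflow as tf
import tensorflow_recommenders_addons as tfra

de = tfra.dynamic_embedding

# Dynamic, hash-table-backed embedding: no vocabulary_size to pre-set, so raw
# int64 IDs are used directly and hash collisions are avoided.
user_id_embedding = de.get_variable(
    name="user_id_dynamic_embedding",
    key_dtype=tf.int64,
    value_dtype=tf.float32,
    dim=16,
    initializer=tf.keras.initializers.RandomNormal(stddev=0.01))

user_ids = tf.constant([8391027465520001, 1746552098431], dtype=tf.int64)
user_vectors = de.embedding_lookup(user_id_embedding, user_ids, name="user_id_lookup")

# A combined feature such as user ID x industry ID can be fed the same way
# after fingerprinting the joined string key into a very large int64 key space.
industry_ids = tf.constant(["ind_301", "ind_57"])
cross_keys = tf.strings.to_hash_bucket_fast(
    tf.strings.join([tf.strings.as_string(user_ids), industry_ids], separator="_"),
    num_buckets=2**62)

# Training reuses the native optimizers through TFRA's wrapper.
optimizer = de.DynamicEmbeddingOptimizer(tf.keras.optimizers.Adam(1e-3))
```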


2. Model update

While implementing the sparse large model, we encountered many problems in training, inference, deployment, and updates. We analyzed and optimized each of them in depth, and finally achieved efficient and stable training, inference, and deployment updates for sparse large models.
When TensorFlow Serving hot-updates a sparse large model, inference latency spikes occur because memory allocation and release compete with each other. TensorFlow allocates and releases memory mainly in two scenarios:
  • The Tensors of the variables themselves are allocated when the model is restored: memory is allocated when the model is loaded and released when the model is unloaded.
  • The memory of intermediate output Tensors is allocated during the forward computation of an RPC request and released after the request is processed.
Therefore, when a sparse large model is updated, the Restore OP allocates a large amount of memory while the new model is loaded, and a large amount of memory is released when the old model is unloaded. RPC inference requests are not interrupted during this process, so the two kinds of allocation and release conflict and compete, causing inference latency glitches. Based on this analysis, we designed memory-allocation isolation: the memory for model parameters and the memory for RPC requests are allocated and released in separate, independent memory spaces, which eliminates the latency glitches during hot updates of sparse large models.
Finally, the sparse large model files are sharded and distributed via P2P transmission within the same data center, which reduces the pressure on the storage backend and on dedicated network lines, and solves the storage and bandwidth problems caused by frequent updates of sparse large models.

03

   Overall benefits

At present, the deep learning platform efficiently and stably supports the training, inference, and deployment updates of billion-scale parameter models. We have fully launched three sparse large models in the performance-advertising CVR and DCVR scenarios, which directly drove a 4.3% increase in performance advertising revenue while keeping inference latency essentially unchanged.


04

   Future outlook

Currently, all values of a given feature in the advertising sparse large model are assigned the same Embedding dimension. In real business, the data distribution of high-dimensional features is extremely uneven: a very small number of high-frequency feature values account for a very high proportion of occurrences, and the long tail is severe. Using a fixed Embedding dimension for all feature values weakens Embedding representation learning: for low-frequency features the dimension is too large and the model risks over-fitting, while high-frequency features carry rich information that needs to be represented, so the dimension is too small and the model risks under-fitting. Therefore, we will explore adaptively learning the feature Embedding dimension to further improve the accuracy of model prediction.
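As a hedged illustration of the direction (one simple, frequency-bucketed form of the idea, not the adaptive method to be explored): high-frequency values get a wide embedding, long-tail values a narrow one, and both are projected to a common width so the downstream network is unchanged.

```python
import tensorflow as tf

HIGH_DIM, LOW_DIM, OUT_DIM = 32, 4, 32
high_freq_vocab, low_freq_vocab = 50_000, 5_000_000   # assumed split by frequency

high_table = tf.Variable(tf.random.normal([high_freq_vocab, HIGH_DIM], stddev=0.01))
low_table = tf.Variable(tf.random.normal([low_freq_vocab, LOW_DIM], stddev=0.01))
low_proj = tf.Variable(tf.random.normal([LOW_DIM, OUT_DIM], stddev=0.01))

def lookup(ids, is_high_freq):
    """Return [batch, OUT_DIM] embeddings for ids routed by frequency bucket."""
    if is_high_freq:
        return tf.nn.embedding_lookup(high_table, ids)          # already OUT_DIM wide
    return tf.nn.embedding_lookup(low_table, ids) @ low_proj    # project 4 -> 32
```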

At the same time, we will explore incremental export of the model, i.e., loading only the parameters that changed during incremental training into TensorFlow Serving. This reduces network transmission and loading time during model updates, enables minute-level updates of sparse large models, and improves model freshness.
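A minimal sketch of the incremental-export idea (an assumption about how it could be wired, not an existing pipeline): record which embedding keys were touched in an incremental training window, then export only those key/value pairs for Serving to load. The table object is assumed to expose a lookup(keys) method, as TFRA's dynamic_embedding.Variable does.

```python
import tensorflow as tf

touched_keys = set()

def record_step(batch_ids):
    """Call once per training step with the 1-D int64 tensor of keys used in the batch."""
    touched_keys.update(int(k) for k in batch_ids.numpy())

def export_delta(dynamic_table):
    """Return (keys, values) containing only the parameters updated in this window."""
    keys = tf.constant(sorted(touched_keys), dtype=tf.int64)
    values = dynamic_table.lookup(keys)   # fetch current embeddings for the changed keys
    return keys, values
```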


You may also want to read

Revealing the secret of memory explosion: solving the OOM problem of large model distributed training

iQIYI Performance Advertising Dual Bidding Optimization Process

iQIYI Data Lake Practice-Advertising Data Lake Application

This article is shared from the WeChat public account - iQIYI Technology Product Team (iQIYI-TP).
