Want to know how big companies make recommendations? Facebook open-sources its deep learning recommendation model, DLRM

The DLRM model is implemented in Facebook's open source frameworks PyTorch and Caffe2. DLRM improves on other models by combining the principles of collaborative filtering and predictive analytics, allowing it to process production-scale data efficiently and achieve state-of-the-art results.

Facebook has open-sourced the model and published a related paper, aiming to help researchers in the field tackle the unique challenges this class of models presents. Facebook hopes to encourage further experimentation in algorithms, modeling, system co-design, and benchmarking, which can help uncover new models and more efficient systems that deliver more relevant content to people using a wide range of digital services.

Understanding the DLRM model

The DLRM model uses embeddings to process categorical features, while a bottom multilayer perceptron (MLP) processes the continuous features. The model then computes the second-order interactions between the different features. Finally, a top MLP processes the result and feeds it into a sigmoid function to produce the probability of a click.

Figure 1: The DLRM model handles continuous (dense) features and categorical (sparse) features that describe users and products, as shown in the figure. The model exercises a range of hardware and software components, such as memory capacity and bandwidth as well as communication and compute resources.
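To make this flow concrete, here is a minimal PyTorch sketch of the architecture just described (toy dimensions chosen to match the debug-mode run shown later; an illustration, not the actual dlrm_s_pytorch.py code):

import torch
import torch.nn as nn

class TinyDLRM(nn.Module):
    def __init__(self, dense_dim=4, emb_sizes=(4, 3, 2), emb_dim=2):
        super().__init__()
        # One embedding table per categorical (sparse) feature.
        self.emb = nn.ModuleList(
            nn.EmbeddingBag(n, emb_dim, mode="sum") for n in emb_sizes
        )
        # Bottom MLP for the continuous (dense) features.
        self.bot = nn.Sequential(nn.Linear(dense_dim, emb_dim), nn.ReLU())
        # Top MLP over the dense vector plus all pairwise interactions.
        n_vecs = len(emb_sizes) + 1
        n_pairs = n_vecs * (n_vecs - 1) // 2
        self.top = nn.Sequential(nn.Linear(emb_dim + n_pairs, 1), nn.Sigmoid())

    def forward(self, dense_x, sparse_idx):
        x = self.bot(dense_x)                        # bottom MLP
        vecs = [x] + [e(idx) for e, idx in zip(self.emb, sparse_idx)]
        z = torch.stack(vecs, dim=1)                 # (batch, n_vecs, emb_dim)
        inter = torch.bmm(z, z.transpose(1, 2))      # second-order interactions
        i, j = torch.triu_indices(z.size(1), z.size(1), offset=1)
        feats = torch.cat([x, inter[:, i, j]], dim=1)
        return self.top(feats)                       # probability of a click

# Example: a batch of 2 samples, each sparse feature given as a bag of indices.
model = TinyDLRM()
dense = torch.rand(2, 4)
sparse = [torch.tensor([[1], [0]]), torch.tensor([[0], [1]]), torch.tensor([[1], [0]])]
print(model(dense, sparse))  # two click probabilities in (0, 1)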

Benchmark and system co-design

The open source implementation of DLRM can be used as a benchmark to measure two things:

  • The execution speed of the model (and its related operators).

  • The impact of different numerical techniques on accuracy.

These measurements can be taken on different hardware platforms, such as the Big Basin AI platform.
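As a toy illustration of the first measurement, the sketch below times a stand-in MLP and reports milliseconds per iteration (the real benchmark times DLRM itself and its individual operators):

import time
import torch
import torch.nn as nn

# Stand-in model; shapes here are illustrative only.
model = nn.Sequential(nn.Linear(13, 512), nn.ReLU(), nn.Linear(512, 1))
x = torch.rand(128, 13)  # one minibatch of 13 dense features

with torch.no_grad():
    start = time.time()
    for _ in range(100):
        model(x)
print(f"{(time.time() - start) / 100 * 1000:.3f} ms/it")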

The DLRM benchmark provides two versions of the code, one in PyTorch and one in Caffe2. There is also a variant implemented with Glow C++ operators. (The code differs slightly in each case to accommodate the specifics of each framework, but the overall structure is similar.) These implementations allow us to compare the Caffe2 and PyTorch frameworks, as well as Glow, which currently focuses on accelerators. Perhaps the best features of each framework can eventually be extracted and integrated into one.


The DLRM benchmark supports the generation of random and synthetic inputs, and it lets the model customize the indices corresponding to the categorical features. There are several reasons for this: for example, if an application uses a particular dataset that cannot be shared for privacy reasons, the categorical features can instead be expressed through distributions. And if we want to exercise system components, such as when studying memory behavior, we may want to capture the fundamental locality of accesses from the original trace within the synthetic trace.
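A hedged sketch of what such input generation can look like (illustrative table sizes and a made-up skewed distribution; the benchmark's own generators live in dlrm_data_pytorch.py and dlrm_data_caffe2.py):

import torch

torch.manual_seed(0)
batch, dense_dim = 128, 13
table_sizes = [100, 50, 20]  # illustrative embedding table sizes

dense_x = torch.rand(batch, dense_dim)

# "Random" inputs: indices drawn uniformly from each embedding table.
uniform_idx = [torch.randint(0, n, (batch,)) for n in table_sizes]

# "Synthetic" inputs: indices drawn from a custom (here, skewed) distribution,
# standing in for a private dataset whose raw indices cannot be shared.
synthetic_idx = []
for n in table_sizes:
    weights = torch.arange(n, 0, -1, dtype=torch.float)  # hot items first
    synthetic_idx.append(torch.multinomial(weights, batch, replacement=True))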

In addition, Facebook uses a variety of personalized recommendation models depending on the user scenario. For example, to serve at scale with high performance, inputs can be batched on a single machine and inference distributed across multiple models running in parallel. Moreover, the large fleet of servers in Facebook's data centers is architecturally heterogeneous, from different SIMD widths to different cache hierarchies. This heterogeneity provides additional opportunities for hardware/software co-design and optimization. (See the paper "The Architectural Implications of Facebook's DNN-based Personalized Recommendation" for an in-depth analysis of the architecture of Facebook's neural recommendation systems.)

Parallel Computing

As shown in Figure 1, the DLRM benchmark consists of compute-dominated MLPs and memory-capacity-limited embeddings. It is therefore natural to rely on data parallelism to improve the performance of the MLPs and on model parallelism to satisfy the memory capacity requirements of the embeddings. The DLRM benchmark provides a parallel implementation that follows this approach. During the interaction step, DLRM requires an efficient all-to-all communication primitive, which we call the butterfly shuffle. It shuffles the embedding lookup results of a minibatch on each device and distributes them across all devices, so that each device receives its slice of the minibatch's embedding lookups. In the figure below, each color represents a different element of the minibatch, and each number represents a device and the embeddings assigned to it. We plan to optimize this system and publish details of its performance in a future blog post.

Figure 3: Schematic diagram of the DLRM butterfly shuffle.
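To make the data movement concrete, here is a plain-Python sketch of the shuffle pattern (illustrative only; the real implementation exchanges tensors between devices):

# Device d starts with its own tables' lookups for the whole minibatch and
# ends with every table's lookups for its slice of the minibatch.
n_devices = 3
minibatch = ["a", "b", "c"]  # one "color" per minibatch element, one slice per device

# before[d][m]: device d's embedding lookup for minibatch element m.
before = [[f"emb{d}({m})" for m in minibatch] for d in range(n_devices)]

# All-to-all exchange: after[d][e] = device e's lookup for element d's slice.
after = [[before[e][d] for e in range(n_devices)] for d in range(n_devices)]

for d, row in enumerate(after):
    print(f"device {d}: {row}")
# device 0: ['emb0(a)', 'emb1(a)', 'emb2(a)']
# device 1: ['emb0(b)', 'emb1(b)', 'emb2(b)']
# device 2: ['emb0(c)', 'emb1(c)', 'emb2(c)']

In a real multi-device run, this exchange maps naturally onto an all-to-all collective such as torch.distributed.all_to_all in PyTorch.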


Modeling and algorithm experiment

The DLRM benchmark is written in Python and supports flexible implementation; the model structure, dataset, and other parameters are defined on the command line. DLRM can be used for both inference and training. During training, DLRM adds backward-propagation operators to the computation graph, allowing parameter updates.
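As a hedged sketch of what this means in practice, one training iteration with binary cross-entropy loss (the --loss-function=bce setting used below) might look like the following, with a stand-in model in place of the full DLRM network:

import torch
import torch.nn as nn

# Stand-in model; the real script builds DLRM from its command-line flags.
model = nn.Sequential(nn.Linear(13, 4), nn.ReLU(), nn.Linear(4, 1), nn.Sigmoid())
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCELoss()

x = torch.rand(128, 13)                     # minibatch of dense features
y = torch.randint(0, 2, (128, 1)).float()   # click / no-click targets

opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()  # backward operators added to the graph during training
opt.step()       # parameter update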

The code is complete and can run on public datasets, including the Kaggle Display Advertising Challenge dataset. This dataset contains 13 continuous features and 26 categorical features, which define the size of the MLP input layer and the number of embeddings used in the model; the remaining parameters can be set on the command line. For example, the DLRM model can be run with the following command:

python dlrm_s_pytorch.py --arch-sparse-feature-size=16 --arch-mlp-bot="13-512-256-64-16" --arch-mlp-top="512-256-1" --data-generation=dataset --data-set=kaggle --processed-data-file=./input/kaggle_processed.npz --loss-function=bce --round-targets=True --learning-rate=0.1 --mini-batch-size=128 --print-freq=1024 --print-time

The training results are shown below. Figure 4: The left plot shows the binary cross-entropy loss during training and testing; the right plot shows the accuracy during training and testing.

The DLRM model can be run on real datasets, which lets us measure the model's accuracy; this is especially useful when experimenting with different numerical techniques and other models. We plan a more in-depth analysis of the impact of quantization and algorithmic experiments on this model in future work.

In the long run, developing new and better methods for applying deep learning to recommendation and personalization tools (and improving the efficiency and performance of these models) will lead to new ways of connecting people with the content most relevant to them.

DLRM model open source code

The DLRM model's input consists of dense features and sparse features. Dense features are vectors of floating-point numbers, while sparse features are lists of indices into the embedding tables. The selected embedding vectors are fed into MLP networks (the triangles in the figure), and in some cases the vectors interact through operators.
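As a hedged illustration of this input format, the sketch below builds one dense tensor and one sparse lookup using PyTorch's EmbeddingBag convention of flat indices plus per-sample offsets (toy sizes; this mirrors, but is not, the benchmark's own data pipeline):

import torch
import torch.nn as nn

dense_x = torch.rand(2, 4)               # 2 samples, 4 dense features each

emb = nn.EmbeddingBag(4, 2, mode="sum")  # one table: 4 rows, embedding dim 2
indices = torch.tensor([1, 0, 1])        # sample 0 -> [1]; sample 1 -> [0, 1]
offsets = torch.tensor([0, 1])           # where each sample's bag starts

pooled = emb(indices, offsets)           # shape (2, 2): one vector per sample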

DLRM implementation

There are two implementations of the DLRM model:

  • DLRM PyTorch: dlrm_s_pytorch.py

  • DLRM Caffe2: dlrm_s_caffe2.py

DLRM data generation and loading: 

dlrm_data_pytorch.py, dlrm_data_caffe2.py, data_utils.py

DLRM test script (in the ./test directory):

dlrm_s_test.sh

DLRM benchmark scripts (in the ./bench directory):

dlrm_s_benchmark.sh, dlrm_s_criteo_kaggle.sh

Training

Train a small model:

python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6
time/loss/accuracy (if enabled):
Finished training it 1/3 of epoch 0, -1.00 ms/it, loss 0.451893, accuracy 0.000%
Finished training it 2/3 of epoch 0, -1.00 ms/it, loss 0.402002, accuracy 0.000%
Finished training it 3/3 of epoch 0, -1.00 ms/it, loss 0.275460, accuracy 0.000%


Train in debug mode:


$ python dlrm_s_pytorch.py --mini-batch-size=2 --data-size=6 --debug-mode
model arch:
mlp top arch 3 layers, with input to output dimensions:
[8 4 2 1]
# of interactions
8
mlp bot arch 2 layers, with input to output dimensions:
[4 3 2]
# of features (sparse and dense)
4
dense feature size
4
sparse feature size
2
# of embeddings (= # of sparse features) 3, with dimensions 2x:
[4 3 2]
data (inputs and targets):
mini-batch: 0
[[0.69647 0.28614 0.22685 0.55131]
[0.71947 0.42311 0.98076 0.68483]]
[[[1], [0, 1]], [[0], [1]], [[1], [0]]]
[[0.55679]
[0.15896]]
mini-batch: 1
[[0.36179 0.22826 0.29371 0.63098]
[0.0921 0.4337 0.43086 0.49369]]
[[[1], [0, 2, 3]], [[1], [1, 2]], [[1], [1]]]
[[0.15307]
[0.69553]]
mini-batch: 2
[[0.60306 0.54507 0.34276 0.30412]
[0.41702 0.6813 0.87546 0.51042]]
[[[2], [0, 1, 2]], [[1], [2]], [[1], [1]]]
[[0.31877]
[0.69197]]
initial parameters (weights and bias):
[[ 0.05438 -0.11105]
[ 0.42513 0.34167]
[-0.1426 -0.45641]
[-0.19523 -0.10181]]
[[ 0.23667 0.57199]
[-0.16638 0.30316]
[ 0.10759 0.22136]]
[[-0.49338 -0.14301]
[-0.36649 -0.22139]]
[[0.51313 0.66662 0.10591 0.13089]
[0.32198 0.66156 0.84651 0.55326]
[0.85445 0.38484 0.31679 0.35426]]
[0.17108 0.82911 0.33867]
[[0.55237 0.57855 0.52153]
[0.00269 0.98835 0.90534]]
[0.20764 0.29249]
[[0.52001 0.90191 0.98363 0.25754 0.56436 0.80697 0.39437 0.73107]
[0.16107 0.6007 0.86586 0.98352 0.07937 0.42835 0.20454 0.45064]
[0.54776 0.09333 0.29686 0.92758 0.569 0.45741 0.75353 0.74186]
[0.04858 0.7087 0.83924 0.16594 0.781 0.28654 0.30647 0.66526]]
[0.11139 0.66487 0.88786 0.69631]
[[0.44033 0.43821 0.7651 0.56564]
[0.0849 0.58267 0.81484 0.33707]]
[0.92758 0.75072]
[[0.57406 0.75164]]
[0.07915]
DLRM_Net(
  (emb_l): ModuleList(
    (0): EmbeddingBag(4, 2, mode=sum)
    (1): EmbeddingBag(3, 2, mode=sum)
    (2): EmbeddingBag(2, 2, mode=sum)
  )
  (bot_l): Sequential(
    (0): Linear(in_features=4, out_features=3, bias=True)
    (1): ReLU()
    (2): Linear(in_features=3, out_features=2, bias=True)
    (3): ReLU()
  )
  (top_l): Sequential(
    (0): Linear(in_features=8, out_features=4, bias=True)
    (1): ReLU()
    (2): Linear(in_features=4, out_features=2, bias=True)
    (3): ReLU()
    (4): Linear(in_features=2, out_features=1, bias=True)
    (5): Sigmoid()
  )
)
time/loss/accuracy (if enabled):
Finished training it 1/3 of epoch 0, -1.00 ms/it, loss 0.451893, accuracy 0.000%
Finished training it 2/3 of epoch 0, -1.00 ms/it, loss 0.402002, accuracy 0.000%
Finished training it 3/3 of epoch 0, -1.00 ms/it, loss 0.275460, accuracy 0.000%
updated parameters (weights and bias):
[[ 0.0543 -0.1112 ]
[ 0.42513 0.34167]
[-0.14283 -0.45679]
[-0.19532 -0.10197]]
[[ 0.23667 0.57199]
[-0.1666 0.30285]
[ 0.10751 0.22124]]
[[-0.49338 -0.14301]
[-0.36664 -0.22164]]
[[0.51313 0.66663 0.10591 0.1309 ]
[0.32196 0.66154 0.84649 0.55324]
[0.85444 0.38482 0.31677 0.35425]]
[0.17109 0.82907 0.33863]
[[0.55238 0.57857 0.52154]
[0.00265 0.98825 0.90528]]
[0.20764 0.29244]
[[0.51996 0.90184 0.98368 0.25752 0.56436 0.807 0.39437 0.73107]
[0.16096 0.60055 0.86596 0.98348 0.07938 0.42842 0.20453 0.45064]
[0.5476 0.0931 0.29701 0.92752 0.56902 0.45752 0.75351 0.74187]
[0.04849 0.70857 0.83933 0.1659 0.78101 0.2866 0.30646 0.66526]]
[0.11137 0.66482 0.88778 0.69627]
[[0.44029 0.43816 0.76502 0.56561]
[0.08485 0.5826 0.81474 0.33702]]
[0.92754 0.75067]
[[0.57379 0.7514 ]]
[0.07908]

Testing

Test whether the code runs normally:

./test/dlrm_s_tests.sh
Running commands ...
python dlrm_s_pytorch.py
python dlrm_s_caffe2.py
Checking results ...
diff test1 (no numeric values in the output = SUCCESS)
diff test2 (no numeric values in the output = SUCCESS)
diff test3 (no numeric values in the output = SUCCESS)
diff test4 (no numeric values in the output = SUCCESS)

Benchmark model

Performance benchmark

./bench/dlrm_s_benchmark.sh


The benchmark also supports the Kaggle Display Advertising Challenge dataset, which requires the following preparation:

  • Specify the raw data file: --raw-data-file=<path/train.txt>

  • Preprocessing: the processed data is stored as an .npz file under <root_dir>/input/kaggle_data/*.npz

  • Subsequent runs can use the processed file directly: --processed-data-file=<path/*.npz>

./bench/dlrm_s_criteo_kaggle.sh

Model saving and loading

Save the model during training with --save-model=<path/model.pt>. The model is saved only when its test accuracy improves. A saved model can be loaded with --load-model=<path/model.pt> and then used either to continue training or, with --inference-only specified, to evaluate on the test dataset.
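A hedged sketch of the save/load cycle behind these flags (stand-in model and made-up accuracy values; the actual script also stores additional training state):

import torch
import torch.nn as nn

model = nn.Linear(4, 1)          # stand-in for the full DLRM network
best_acc, test_acc = 0.0, 0.91   # hypothetical accuracies from an evaluation pass

# Save only when test accuracy improves, matching the flag's described behavior.
if test_acc > best_acc:
    best_acc = test_acc
    torch.save({"state_dict": model.state_dict(), "test_acc": test_acc}, "model.pt")

# Load the checkpoint to continue training, or to run inference only.
ckpt = torch.load("model.pt")
model.load_state_dict(ckpt["state_dict"])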

