Breaking the shackles of computing power: Alibaba's intelligent computing power engine DCAF saves 20% of GPU compute


In recent years, the rapid development of the Internet has brought the industry both opportunities and challenges. On the service side, traffic volume keeps growing, and the mobile era has pushed it to new heights. Take Taobao as an example: its DAU has reached hundreds of millions, and the huge volume of requests hitting Taobao's back-end services every day puts the entire system under tremendous pressure. During major promotions such as the annual Double 11, instantaneous traffic peaks are a serious test for the back-end servers.

Traffic growth is certainly a serious challenge, but it is a long-standing one that we have considerable experience handling. The changes brought by deep learning, on the other hand, have put the entire industrial real-time serving system in a completely new situation, one that demands a fresh perspective.


Figure 1 Model effect and computing power cost

The growth of hardware computing capability has driven the broad adoption of deep learning, and models are now upgraded far more frequently than in the past. Take recommendation and advertising systems as an example. In the initial stage of deep learning, which we call the 1.0 period, the dividend of growing computing power let most teams quickly roll out personalization algorithms and make huge progress across many domains. Alibaba alone has published a series of works in recent years: DIN, DIEN, MIND, TDM, and so on.

This wave has delivered substantial gains, but the models' rising demand for computing power will quickly erode the cost-reduction dividends accumulated over the past years. Even if hardware costs keep falling linearly (note that CPUs can no longer keep pace with Moore's Law), the algorithmic gains obtainable from the same computing power will narrow significantly. In this new stage, computing power may turn from the driving force of algorithm evolution into a source of resistance.


Figure 2 Algorithm and system

To break through this challenge, we propose the idea of personalized computing power + personalized algorithms.

In past system designs, computing power was usually treated as a fixed constraint. In the deep learning era, however, models have far more freedom and plasticity, so we instead define computing power as a variable to be optimized. We propose computing power & algorithm co-design: provide personalized results from the algorithm's perspective, and at the same time allocate computing power in a personalized way, offering differentiated algorithm solutions to different traffic.

Unlike traditional single-point optimizations such as pruning, quantization, batching, and kernel fusion, we want to work with the system to perform full-link dynamic computing power allocation from a more macro, holistic perspective. This article is our preliminary attempt along this line: we are the first to implement the dynamic computation allocation framework DCAF (Dynamic Computation Allocation Framework, https://arxiv.org/pdf/2006.09684) in Alibaba's targeted advertising platform.

Since May 2020, DCAF has been serving part of the traffic of Alibaba's targeted advertising and has achieved solid results. In the ad ranking stage, DCAF saves 20% of GPU computing power without hurting performance, a preliminary validation of its effectiveness in improving system efficiency.

Motivation

Take a recommendation/display advertising system as an example. Facing heavy online serving pressure and a huge candidate set (such as an e-commerce product library of more than 1 billion items), a typical system splits the end-to-end process of finding the optimal display results into multiple modules, forming a cascaded architecture in which each module's candidate set shrinks in turn:

Figure 3 Cascade system

The system is usually a cascade of multiple stages. Each stage's service is deployed independently, and upstream and downstream modules follow a high-cohesion, low-coupling design, which makes them relatively easy to maintain. In the process, the overall recommendation problem is decomposed into multiple sub-problems; each sub-problem has its own constraints on computing resources, latency, or response time, adopts its own algorithms, and is usually maintained by a different team in the organization.

This approach is indeed effective, reduces the difficulty of the overall problem, and has been widely adopted in industry. But it is not optimal. Consider: how much computing resource and latency should each module be allocated for the best result? How large should each module's candidate set be? Would jointly optimizing these variables across modules do better?

Past system designs treated computing power allocation as a fixed constraint or boundary, but we believe it should be a variable optimized jointly with the system. Consider: every request differs in its commercial value for advertising and in the user's likely intent to buy or browse; these differences vary across users, and even across different states of the same user. Our algorithms already produce personalized results for different requests, so why shouldn't our system also provide personalized algorithm solutions, letting different traffic run different algorithms with different amounts of computing resources?

At the macro level, resources are finite for any system, so the ultimate goal of any commercial platform can be abstracted as maximizing revenue under a computing power constraint. The mainstream practice today, whether in daily serving or in handling emergencies, is to allocate the same computing power to all traffic. Given that traffic differs in value, allocating equal computing power to all of it can never reach the global optimum. We therefore need to allocate computing power differentially according to traffic value in order to break through the performance ceiling.

Based on these judgments, our team tried to improve the system's serving capability from a completely new angle. We believe that different traffic carries different value, and the system should handle each request differentially: computing power consumption should be proportional to traffic value. In the past, the platform optimized the system as a whole while serving every request with equal computing power. Yet among all the requests hitting Taobao's back end, our team observed a large amount of low-value traffic that consumes as much computing power as high-value traffic while contributing almost nothing to platform revenue, which is clearly a waste of the platform's computing resources. Based on this, we were the first to implement the dynamic computation allocation framework DCAF (Dynamic Computation Allocation Framework) in Alibaba's targeted advertising platform.

Figure 4 DCAF dynamic computing power allocation

In essence, DCAF dynamically allocates the platform's computing power according to traffic value, breaking the traditional practice of treating all traffic equally and further releasing the platform's serving capability from the allocation side.

In addition, DCAF can automatically adjust its allocation policy according to system state (such as CPU/GPU utilization, QPS, and failure rate), so the platform can regulate itself promptly and intelligently when facing emergencies such as traffic floods, greatly reducing the need for human intervention.


Challenges

As stated above, DCAF aims to maximize overall revenue under limited computing resources while keeping the system stable, which places high demands on both the engine and the algorithm.

On the engine side, the system needs deep awareness of its own state and the ability to regulate it. Under the current architecture of independently cascaded modules, a central control module is logically required to make real-time computing power decisions for each module: on one hand, it collects overall system state to give the dynamic allocation algorithm a basis for revenue-maximizing decisions; on the other hand, it adjusts each stage's computing power ceiling in real time according to current traffic, keeping the service stable.

On the algorithm side, traffic value must be modeled accurately. Once it is computed, the core problem is deciding how much computing power to allocate to each level of traffic value, maximizing effectiveness while keeping total resource consumption within the quota.

Algorithm design

From the platform's point of view, traffic differs in value; for example, high-conversion traffic is worth more to the platform. In the past, the platform ignored this difference and spent the same computing power on all traffic: for every request it would retrieve a roughly equal-sized ad candidate set and rank it precisely. Obviously, this coarse scheme, which ignores traffic value and spends equal computing power on every request, leaves the platform's computing power underutilized and ultimately causes losses.

The dynamic allocation scheme proposed here instead starts from traffic value differentiation, allocating computing power "on demand" according to traffic value, and thereby maximizes platform revenue under the limited computing power constraint.

Based on the above analysis, we abstract dynamic computing power allocation as a knapsack problem: under the overall computation constraint (knapsack capacity), each request is dynamically allocated computing power (item weight) according to its traffic value (item value), yielding the globally optimal allocation. Once modeled as a knapsack problem, DCAF's allocation can be shown to be globally optimal in theory.

Concretely, DCAF maps different amounts of computing resource to different algorithm actions (number of candidates scored, model complexity, and so on). By estimating the expected revenue of each action on the same request and combining it with each action's resource cost, DCAF picks the most cost-effective action for every request, satisfying the computation constraint while breaking the performance ceiling previously imposed by limited computing power. For example, low-value traffic can be served by scoring a smaller ad candidate set, and high-value traffic by scoring a larger one.

We formalize the problem as follows:

$$\max_{\{x_{ij}\}} \sum_i \sum_j x_{ij} Q_{ij} \quad \text{s.t.} \quad \sum_i \sum_j x_{ij} q_j \le C, \quad \sum_j x_{ij} \le 1, \quad x_{ij} \in \{0,1\}$$

Equation 1 DCAF knapsack problem

Here q_j is the computing power consumed when action j is taken, and Q_ij is the expected revenue generated when DCAF assigns action j to request i. DCAF automatically assigns exactly one action j to each request i. The whole problem is therefore to maximize the sum of expected revenue over all requests, subject to the overall computing power constraint C.
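To make the knapsack formulation concrete, here is a minimal brute-force sketch on a toy instance. All Q, q, and C values are invented for illustration, and exhaustive search is only feasible at toy scale; a production system would use the Lagrangian solution derived below.

```python
from itertools import product

def best_assignment(Q, q, C):
    """Exhaustively search the per-request action assignment that maximizes
    total expected revenue sum_i Q[i][j_i], subject to the computation
    budget sum_i q[j_i] <= C.  Exponential in the number of requests, so
    toy-scale only."""
    n_requests, n_actions = len(Q), len(q)
    best_gain, best_plan = 0.0, None
    for plan in product(range(n_actions), repeat=n_requests):
        cost = sum(q[j] for j in plan)
        if cost > C:
            continue  # violates the knapsack capacity
        gain = sum(Q[i][j] for i, j in enumerate(plan))
        if gain > best_gain:
            best_gain, best_plan = gain, plan
    return best_gain, best_plan

# Three requests, two actions (small vs. large candidate set).
Q = [[1.0, 1.2],   # low-value request: little gain from extra compute
     [1.0, 3.0],   # high-value request: large gain from extra compute
     [1.0, 2.8]]
q = [1.0, 4.0]     # compute cost of each action
gain, plan = best_assignment(Q, q, C=9.0)
```

The budget allows upgrading only two requests to the expensive action, and the optimum upgrades the two high-value ones, which is exactly the "compute proportional to traffic value" behavior described above.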

Solving the problem

We solve the dynamic allocation knapsack problem through its dual.

First, we construct the Lagrangian of the problem,

$$\mathcal{L}(x, \lambda, \mu) = \sum_i \sum_j x_{ij} Q_{ij} + \lambda \Big( C - \sum_i \sum_j x_{ij} q_j \Big) + \sum_i \mu_i \Big( 1 - \sum_j x_{ij} \Big)$$

Equation 2 DCAF Lagrange function

from which we obtain the dual of the original problem,

$$\min_{\lambda \ge 0,\; \mu_i \ge 0} \; \max_{x_{ij} \ge 0} \; \mathcal{L}(x, \lambda, \mu)$$

Equation 3 DCAF dual function

A series of derivations then yields the final allocation rule,

$$j_i^{*} = \arg\max_{j} \; \big( Q_{ij} - \lambda\, q_j \big)$$

Equation 4 DCAF action selection

Here Q_ij, the expected revenue of request i under action j, can be estimated by a DNN model from user behavior and other features. λ is the Lagrange multiplier, which is usually hard to solve for. In dynamic computing power allocation, however, allocation and platform revenue typically follow the law of diminishing marginal returns. Under this assumption, we can use offline data and a bisection search to find the globally optimal λ. Finally, DCAF applies the allocation rule to assign the most cost-effective action to every request.
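The per-request selection rule and the bisection search for λ can be sketched as follows, on toy data. The monotone-cost property the bisection relies on is the diminishing-returns assumption stated above; the numbers are invented for illustration.

```python
def choose_action(Q_i, q, lam):
    """Equation-4-style rule: pick the action j maximizing Q_ij - lam * q_j."""
    return max(range(len(q)), key=lambda j: Q_i[j] - lam * q[j])

def solve_lambda(Q, q, C, lo=0.0, hi=10.0, iters=60):
    """Bisection search for the Lagrange multiplier.  Raising lam penalizes
    compute more, so total cost decreases monotonically in lam; we return
    the smallest lam (within tolerance) whose total cost fits the budget C."""
    def total_cost(lam):
        return sum(q[choose_action(Q_i, q, lam)] for Q_i in Q)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if total_cost(mid) > C:
            lo = mid   # over budget: penalize compute harder
        else:
            hi = mid   # within budget: try a smaller penalty
    return hi

# Toy example: three requests, two actions (cheap vs. expensive).
Q = [[1.0, 1.2], [1.0, 3.0], [1.0, 2.8]]   # expected revenue per action
q = [1.0, 4.0]                             # compute cost per action
lam = solve_lambda(Q, q, C=9.0)
actions = [choose_action(Q_i, q, lam) for Q_i in Q]
```

At the solved λ, only the two requests whose revenue gain justifies the extra compute are upgraded to the expensive action, and the budget C is respected without enumerating assignments.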

Engine architecture

Figure 5 DCAF system architecture

The DCAF framework consists of an online execution module and an offline estimation module.

The online execution module estimates Q_ij from the request's real-time features (Request Value Online Estimation) and applies the allocation rule in real time to decide the action taken for each request (Policy Execution).

Beyond that, the online module monitors system metrics in real time (Information Collection and Monitoring). Using system information such as current QPS and RT, it limits in real time the most expensive action the system may take, giving us a strong lever for system control during traffic floods.

In essence, then, DCAF's online regulation has two parts: first, at the platform revenue level, modeling the whole allocation problem as a knapsack problem to maximize platform revenue; second, at the system level, using real-time awareness of system metrics to adjust the system's overall computing power bounds, strengthening control over the system and safeguarding stable operation.
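As a rough illustration of this system-level guardrail, the online module can clamp the policy's chosen action whenever latency or QPS pressure rises. The thresholds and action count below are hypothetical, not DCAF's actual configuration:

```python
# Hypothetical settings; real values would come from the serving platform's
# monitoring (QPS, RT, and failure rate are the signals named above).
MAX_ACTION = 5          # actions sorted by ascending compute cost
RT_SOFT_LIMIT_MS = 120

def cap_action(requested_action, current_rt_ms, current_qps, qps_capacity):
    """Clamp the knapsack policy's chosen action when the system is under
    pressure: the policy proposes an action, and this guardrail lowers the
    ceiling so a traffic spike cannot exhaust the cluster."""
    ceiling = MAX_ACTION
    if current_rt_ms > RT_SOFT_LIMIT_MS:
        ceiling -= 1                      # latency pressure: step down
    if current_qps > 0.9 * qps_capacity:
        ceiling -= 1                      # near QPS capacity: step down
    return min(requested_action, max(ceiling, 0))
```

Under normal load the policy's decision passes through unchanged; under combined latency and QPS pressure even the highest-value requests are capped at a cheaper action, which is the "adjusting the overall computing power bounds" behavior described above.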

The offline module serves two purposes:

  1. Training the Q_ij estimation model (Request Value Offline Estimation Model): building a DNN model from offline logs to estimate Q_ij. Unlike conventional CTR estimation, what is predicted here is the expected revenue of request i under each action.
  2. Solving for λ (Lagrange Multiplier Solver): using offline logs and bisection search, under certain assumptions, to find the globally optimal Lagrange multiplier λ.

Experiments

Offline experiments

By simulating DCAF's allocation policy offline, we can obtain the platform's expected revenue and the corresponding computing power consumption under the policy.

 Experimental setup

  • Action j controls the number of ads a request sends to the CTR model in the ad ranking stage.
  • q_j is the number of ads requested from the CTR model under action j.
  • Q_ij is the expected ecpm of request i under action j.
  • C is the total number of requests the CTR model can afford within a time window.
  • The baseline is the current system, which requests the same number of ad candidates for all traffic.
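Under this setup, the offline simulation amounts to replaying logged requests under different λ values and recording (cost, ecpm) pairs, which is what the figure below plots. A toy sketch with synthetic logs follows; the distributions and action sizes are invented, not the paper's data:

```python
import random

random.seed(0)

# Synthetic logs: for each request, the expected ecpm under each action.
# Action j sends q[j] ads to the CTR model; larger candidate sets give
# roughly diminishing ecpm gains.  All numbers are invented.
q = [100, 200, 400, 800]
logs = [[random.uniform(0.5, 1.5) * (j + 1) ** 0.5 for j in range(len(q))]
        for _ in range(1000)]

def replay(lam):
    """Replay the logs under multiplier lam; return (total cost, total ecpm)."""
    cost = ecpm = 0.0
    for Q_i in logs:
        j = max(range(len(q)), key=lambda k: Q_i[k] - lam * q[k])
        cost += q[j]
        ecpm += Q_i[j]
    return cost, ecpm

# Sweeping lam traces out the cost/ecpm frontier: lam = 0 spends maximum
# compute, and larger lam trades ecpm for compute savings.
frontier = [(lam, *replay(lam)) for lam in (0.0, 0.001, 0.002, 0.004)]
```

Picking the point on this frontier that matches the baseline's cost (or its ecpm) gives the equal-compute and equal-performance comparisons reported below.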

 Experiment 1


Figure 6 Relationship among lambda, ecpm and cost

The figure shows that, compared with the baseline, DCAF increases ecpm by 3.7% at the same computing power. At the same performance, DCAF significantly reduces computing power by 49%, i.e. 49% fewer ad requests. At the same computing power, a random policy would drop overall ecpm by about 36.8%.

 Experiment 2


Figure 7 ecpm vs. cost

Experiment 2 shows that, to reach the same platform revenue (ecpm), the baseline must send far more ads to the CTR model than DCAF does. In other words, at the same platform performance, DCAF saves a substantial amount of computing resources.

 Experiment 3


Figure 8 Relationship between action and ecpm

Here the actions are sorted by computing power consumption from low to high. The figure shows the total ecpm of each action together with its computing power consumption. We can observe that DCAF achieves the globally optimal result by taking different actions on different traffic.

Online experiments


Table 1 Online results at equal computing power


Table 2 Computing power savings at equal performance

From May 20, 2020 to May 30, 2020, we launched DCAF in Taobao's targeted advertising system and ran a rigorous, fair online A/B test. We use Revenue Per Mille (RPM) to measure platform revenue, and define computing power consumption as the total number of ads sent to the CTR model for ad ranking. We ran two comparative experiments: first, comparing platform revenue at the same computing power consumption; second, comparing computing power consumption when platform revenue is essentially flat. Table 1 shows that, at the same computing power, DCAF achieves CTR +0.91% and RPM +0.42%. Table 2 shows that, with RPM essentially flat, DCAF reduces the number of ad requests by 25%, corresponding to about 20% of GPU computing resources.

Future work

DCAF dynamically allocates computing power by differentiating among traffic, thereby lifting the performance ceiling under limited computing power. Although we differentiate at the traffic level based on traffic value, we cannot ignore issues such as "fairness" that may arise at the user granularity.

For a platform, maximizing revenue under limited computing resources matters, but respecting users and optimizing their experience are the cornerstones of long-term development. From a long-term perspective, DCAF therefore needs to keep watching the "fairness" issues that traffic differentiation may cause, and treat users' long-term experience as an important goal in future iterations.

On the other hand, while DCAF can theoretically guarantee optimal computing power allocation within a single module, it does not yet consider the allocation problem from the perspective of the whole system, so its strategy is still sub-optimal globally.

In the future, DCAF will gradually evolve from single-module regulation to unified full-link regulation, realizing the truly globally optimal benefit under system resource constraints.



Origin blog.51cto.com/15060462/2674764