Upgrading the fused mixed-ranking service for the Taobao information flow

A recommendation system is an information filtering system that predicts user preferences and selects, from a large volume of information, the content a user is likely to be interested in. A complete recommendation pipeline consists mainly of the following processing stages: multi-channel recall -> material completion -> fine ranking and filtering -> mixed ranking -> adaptation and output. Mixed ranking (also called shuffling) is the last processing layer before results are output. It normalizes and ranks recommendation results from different sources, both to obtain the sequence with the best recommendation effect for the user and to improve the diversity, personalization, and reach of the recommendations.

Current state of the technical pipeline

▐ Existing pipeline

The Taobao information flow is a typical recommendation system. It contains many types of business cards, such as products, advertisements, cloud themes, short videos, and live broadcasts. We divide these cards into two categories: advertising results and natural recommendation results. In the ranking stage, two serial processing modules handle mixed ranking for the two types of results separately.

Diagram of the post-purchase information flow mixed-ranking process

Advertising results: Advertising mainly uses a dynamic slot display strategy. The dynamic display service provided by the advertising side decides which slots show advertisements, which advertising results are shown, and the corresponding advertising billing. The decision goal is to maximize commercial value. All recommended candidates are passed in as contextual features when the decision is made, but the order of the natural results is not decided here.
Natural results: The re-ranking of natural results does not use the advertising candidate set as contextual features, and likewise makes no decisions about the ranking of the advertising candidates. It re-ranks only within the natural results to obtain the sequence with the best user value.

In the final output sequence, advertising results are placed first, in the slots chosen by the dynamic display service, and the natural recommendation results fill the remaining vacant slots.
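
The legacy merge step described above can be sketched as follows. This is a minimal illustration, not the production code: the function name `fill_slots`, the card identifiers, and the choice to stop the page once natural results run out are all assumptions made for the example.

```python
from typing import Dict, List


def fill_slots(ad_slots: Dict[int, str], natural: List[str], page_size: int) -> List[str]:
    """Put each ad at its decided slot index and fill every other slot
    with the next natural result, in the order they were ranked."""
    page: List[str] = []
    natural_iter = iter(natural)
    for slot in range(page_size):
        if slot in ad_slots:
            # Slot was claimed by the (hypothetical) ad display decision.
            page.append(ad_slots[slot])
            continue
        card = next(natural_iter, None)
        if card is None:
            # Assumption: end the page early when natural results run out.
            break
        page.append(card)
    return page


if __name__ == "__main__":
    print(fill_slots({1: "ad_A", 4: "ad_B"}, ["n1", "n2", "n3", "n4"], 6))
```

Note that neither input influences the other: the ad slots are fixed before the natural results are considered, which is exactly the limitation discussed in the next section.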

▐ Problems

Inconsistent strategy goals prevent a globally optimal result: The advertising display strategy is driven mainly by commercial value and gives little weight to the user value of natural results. Adjusting the tradeoff coefficient between the two can trade one metric for the other, but it clearly cannot produce a globally optimal sequence.
Algorithm strategy iteration is tightly coupled with business logic iteration: In the current pipeline, algorithm engineers and engineering developers have to work on the same codebase. In addition, the strategy modules involved are scattered across different stages of the pipeline. For example, the advertising eCPM value service that the advertising dynamic display service depends on is called during the completion stage, while the actual dynamic display results are processed during mixed ranking. This raises the complexity of the overall system and the cost of keeping it stable.

▐ Solution

To address these problems, we want to upgrade the current mixed-ranking strategy service as a whole. The upgraded service should have the following characteristics:

Adjusted mixed-ranking objective: The mixed-ranking service must weigh both user value and commercial value, and take maximizing the overall value of the page as its objective.
Decoupling strategy from business logic: The mixed-ranking strategy logic is extracted from the server-side business pipeline and accessed as an independent service. Subsequent iterations are maintained by algorithm engineers within the new service, so that strategy iteration is independent of business iteration in the engineering pipeline. This makes the division of labor clearer and reduces maintenance costs.

Implementation plan

▐ Technology selection

The new fused mixed-ranking service uses xrec as its code framework. xrec is a business framework built on the TPP graph-based engine. Its main advantages are:

Componentized recommendation business flow: The xrec framework abstracts the business nodes of the pipeline into components. Developers only need to implement the logic of each node according to the component specification defined by the framework and orchestrate the flow through a JSON file in a fixed format. They do not need to handle flow orchestration at the code level.
Fully asynchronous, concurrent performance optimization: Unlike the pipelined, step-by-step execution of the TPE framework used in the original engineering pipeline, the xrec framework improves scenario performance by automating multi-channel concurrency and encapsulating data operations. It describes the business flow as a graph, so users can achieve large-scale, safe concurrency without learning concurrent programming. Data serialization/deserialization, data conversion, and common external service calls are encapsulated as operators, so that performance-tuned platform modules replace user code that has not been performance-tuned.
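
To illustrate the general idea of describing a flow as a graph and letting a scheduler run independent nodes concurrently, here is a minimal sketch using Python's asyncio. It is not xrec's API or its JSON schema, neither of which is shown in this article. The flow format (node name mapped to the names of its dependencies), `run_flow`, and the three toy components are hypothetical.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, List

# Hypothetical flow description: node name -> names of the nodes it depends on.
FLOW_JSON = """
{
  "recall_a": [],
  "recall_b": [],
  "merge": ["recall_a", "recall_b"]
}
"""

Component = Callable[[Dict[str, Any]], Awaitable[Any]]


async def recall_a(_inputs: Dict[str, Any]) -> List[str]:
    await asyncio.sleep(0.01)  # stands in for a remote call
    return ["item_1", "item_2"]


async def recall_b(_inputs: Dict[str, Any]) -> List[str]:
    await asyncio.sleep(0.01)  # stands in for a remote call
    return ["item_3"]


async def merge(inputs: Dict[str, Any]) -> List[str]:
    return inputs["recall_a"] + inputs["recall_b"]


COMPONENTS: Dict[str, Component] = {"recall_a": recall_a, "recall_b": recall_b, "merge": merge}


async def run_flow(flow: Dict[str, List[str]], components: Dict[str, Component]) -> Dict[str, Any]:
    """Start one task per node; each node waits only for its own dependencies,
    so nodes without a dependency between them overlap. Assumes an acyclic graph."""
    tasks: Dict[str, "asyncio.Task[Any]"] = {}

    async def run_node(name: str) -> Any:
        deps = flow[name]
        dep_results = await asyncio.gather(*(tasks[d] for d in deps))
        return await components[name](dict(zip(deps, dep_results)))

    # All tasks are created before any of them starts running.
    for name in flow:
        tasks[name] = asyncio.create_task(run_node(name))
    results = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, results))


def run_example() -> Dict[str, Any]:
    return asyncio.run(run_flow(json.loads(FLOW_JSON), COMPONENTS))


if __name__ == "__main__":
    print(run_example()["merge"])
```

The component authors write only the three small functions; the concurrency between `recall_a` and `recall_b` falls out of the graph description rather than from hand-written concurrent code.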

The xrec framework saves algorithm developers a lot of work, but it also places more constraints on how code is written. Development must strictly follow the framework's rules.

▐ Pipeline design

Mixed-ranking service pipeline design

Based on the xrec framework, we built an independent TPP service (xhuffle) that hosts the fused mixed-ranking strategy logic for all advertising and natural results. The overall service pipeline is as follows. Internally, xhuffle calls the advertising eCPM value estimation service and the recommendation unified value model in parallel to obtain value information for advertising and natural results. The fused mixed-ranking mechanism module aggregates this value information and decides the ordering of all cards, either assigning slot positions to cards or re-ranking them. Finally, it calls the advertising billing service to obtain billing information for the advertising results.

In the original engineering pipeline, the service modules that mixed ranking depends on were scattered across different stages. In the new service, all mixed-ranking logic is consolidated into one independent service that can be iterated on its own, which greatly reduces development and maintenance costs.
The recommendation unified value model and the advertising eCPM estimation service are maintained by the recommendation side and the advertising side respectively. Each is responsible for producing recommendation value scores and advertising value scores.
The fused mixed-ranking mechanism module is jointly maintained and iterated by the advertising and recommendation sides.
The advertising billing service is maintained by the advertising side. By calling the advertising EADS service, generation of advertising billing strings is kept within the advertising service to ensure information security.
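
The call sequence inside the service (two value lookups in parallel, a fusion decision, then billing for ads only) can be sketched as below. This is a sketch under stated assumptions, not the xhuffle implementation: the three service functions are local stubs with made-up scores, and the fusion step simply sorts by a single score that is assumed to be directly comparable between ads and natural results. The article does not describe the real fusion mechanism.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Optional

# Made-up scores standing in for the two value services.
AD_SCORES = {"ad_1": 0.9, "ad_2": 0.2}
NATURAL_SCORES = {"item_1": 0.7, "item_2": 0.4}


@dataclass
class Card:
    card_id: str
    is_ad: bool
    value: float = 0.0
    billing: Optional[str] = None


async def estimate_ad_value(cards: List[Card]) -> Dict[str, float]:
    """Stub for the advertising eCPM value estimation service."""
    return {c.card_id: AD_SCORES[c.card_id] for c in cards}


async def estimate_natural_value(cards: List[Card]) -> Dict[str, float]:
    """Stub for the recommendation unified value model."""
    return {c.card_id: NATURAL_SCORES[c.card_id] for c in cards}


async def fetch_billing(card: Card) -> str:
    """Stub for the advertising billing service."""
    return f"billing:{card.card_id}"


async def mix(cards: List[Card]) -> List[Card]:
    ads = [c for c in cards if c.is_ad]
    naturals = [c for c in cards if not c.is_ad]
    # Step 1: query both value services in parallel.
    ad_values, natural_values = await asyncio.gather(
        estimate_ad_value(ads), estimate_natural_value(naturals)
    )
    for c in cards:
        c.value = ad_values[c.card_id] if c.is_ad else natural_values[c.card_id]
    # Step 2: fusion decision (assumption: one comparable score, highest first).
    ranked = sorted(cards, key=lambda c: c.value, reverse=True)
    # Step 3: request billing information for advertising cards only.
    ranked_ads = [c for c in ranked if c.is_ad]
    billings = await asyncio.gather(*(fetch_billing(c) for c in ranked_ads))
    for c, b in zip(ranked_ads, billings):
        c.billing = b
    return ranked


def run_example() -> List[Card]:
    cards = [Card("item_2", False), Card("ad_2", True), Card("ad_1", True), Card("item_1", False)]
    return asyncio.run(mix(cards))


if __name__ == "__main__":
    for card in run_example():
        print(card)
```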

Overall pipeline diagram of the xhuffle service

In addition, the post-purchase information flow still contains some business fixed-slot strategies, such as fixed slots for cloud themes and short videos, which the original mixed-ranking strategy did not take into account. The mixed-ranking strategy could still assign cards to slots already reserved by these business strategies, so the fixed-slot business cards interfered with the mixed-ranking results and directly affected business metrics. In the xhuffle service, we pass the business fixed-slot information to the mixed-ranking module as a service input and proactively avoid those slots, so that mixed-ranking results and business fixed-slot results do not interfere with each other.
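
One simple way to honor reserved slots is to hand out only the free positions to the ranked cards. The sketch below assumes the reserved slots arrive as a set of zero-based indices and that the output is a slot-to-card mapping; both the input shape and the name `assign_slots` are illustrative, not the service's actual interface.

```python
from typing import Dict, List, Set


def assign_slots(ranked_cards: List[str], reserved_slots: Set[int], page_size: int) -> Dict[int, str]:
    """Map ranked cards onto the page, skipping slots reserved by business strategies."""
    free_slots = [s for s in range(page_size) if s not in reserved_slots]
    # zip stops at the shorter sequence, so surplus cards or slots are left unassigned.
    return dict(zip(free_slots, ranked_cards))


if __name__ == "__main__":
    print(assign_slots(["ad_1", "item_1", "item_2"], {1, 3}, 6))
```

Because the reserved indices never appear in the result, the business fixed-slot cards can be placed afterwards without colliding with the mixed-ranking output.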

Service invocation options in the engineering pipeline

Once the xhuffle service is introduced, the key concern for the upstream engineering pipeline is when to call it. The basic idea is to call xhuffle after pre-filtering in the ranking stage is complete, let it make decisions on the pre-filtered advertising and natural candidate sets, and then determine the final output card sequence from the mixed-ranking results. This avoids making decisions on cards that will be filtered out, which improves slot utilization, and it also shrinks the candidate set, which relieves some pressure on the service.

We considered two invocation options.

Option 1: Split the ranking stage and call services in parallel

The existing pipeline executes the ranking stage serially. To accommodate the new external service call, Option 1 splits the ranking stage into two stages:

Pre-ranking stage: This stage mainly filters cards before ranking. Once the pre-filtered card sequence is available, calls to the mixed-ranking service and to the other external services of the engineering pipeline are issued in parallel.
Post-ranking stage: This stage ranks and truncates the card sequence according to the mixed-ranking results to determine the final card sequence to be adapted for output.

Option 1 engineering pipeline diagram

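Parallel invocation appears to relieve pressure on pipeline RT, but it introduces a new problem. The candidate sequence entering the ranking stage is usually several times larger than the sequence the stage finally outputs. In the shopping cart scenario, for example, each request finally returns 20 cards, while the ranking stage typically receives up to 100. In the original pipeline, the other engineering processing steps only handle the 20 cards whose order has been finalized. If that processing is moved earlier, then even after pre-filtering the number of cards those services handle still grows three- to fourfold, silently adding load to downstream services.

Among these external services, the UMP shopping-guide post-coupon price interface is the most problematic. UMP limits a single call to at most 15 cards, and anything beyond that must be split into multiple batched calls, so handling even the original 20 cards already takes two calls. A larger number of cards directly increases the request volume to the downstream service.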
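
The effect of the batch limit can be worked through with a few lines of arithmetic. The limit of 15 and the baseline of 20 cards come from the text above. The card counts 60 and 80 are illustrative only, standing in for the three- to fourfold growth mentioned earlier, and `ump_calls` is a hypothetical helper name.

```python
import math

UMP_BATCH_LIMIT = 15  # per-call card limit stated in the article


def ump_calls(card_count: int, batch_limit: int = UMP_BATCH_LIMIT) -> int:
    """Number of batched calls needed to cover card_count cards."""
    return math.ceil(card_count / batch_limit)


if __name__ == "__main__":
    for count in (20, 60, 80):
        print(count, "cards ->", ump_calls(count), "calls per request")
```

With 20 cards each request costs 2 calls; with 60 or 80 cards it costs 4 or 6. The exact multiplier depends on how many cards survive pre-filtering and on batch rounding, but every extra batch is an extra downstream request for the same 20 cards of final output.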

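During early small-traffic validation, we found that on the experiment traffic the call QPS to the UMP service interface grew roughly threefold, which matches the analysis above. A small-traffic experiment does not expose the concrete problems caused by QPS growth, but if this option were rolled out fully, the downstream UMP interface would carry six to eight times the entrance traffic, which is far too much load. Moreover, the number of cards finally output does not increase, so the additional resource consumption is redundant rather than useful.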

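Option 2: Call the service serially

Given the redundant resource consumption of Option 1, we proposed a second invocation option: make the xhuffle service a serial module of the overall ranking stage and call it serially right after pre-filtering completes.

Option 2 engineering pipeline diagram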

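This approach puts more pressure on pipeline RT: because execution is serial, the service call latency adds directly to the overall pipeline latency. To ease the RT pressure, we took the following two measures:

Pipeline optimization inside the xhuffle service. The largest share of latency in the mixed-ranking service comes from calling the recommendation unified value model. The initial design went through an external TPP service; this has been optimized into a direct RTP call inside the service. The qinfo data needed for the call is taken directly from the cached product recall data and does not need to be regenerated.
Where user experience is not affected, the post-purchase engineering pipeline moderately relaxes its timeout limit to lower the client-side timeout rate. Currently, every scenario has relaxed its timeout limit by 50 ms.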
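
A serial call with a latency budget might look like the sketch below. The article does not say what the pipeline does when the mixed-ranking call times out, so the fallback here (return the cards in their incoming order) is purely an assumption, as are the function names and the timeout values used in the demo.

```python
import asyncio
from typing import Awaitable, Callable, List

MixCall = Callable[[List[str]], Awaitable[List[str]]]


async def mix_with_timeout(mix_call: MixCall, cards: List[str], timeout_s: float) -> List[str]:
    """Await the mixed-ranking call inline; on timeout, keep the incoming order."""
    try:
        return await asyncio.wait_for(mix_call(cards), timeout=timeout_s)
    except asyncio.TimeoutError:
        # Assumed fallback: degrade to the un-mixed order instead of failing the request.
        return list(cards)


async def fast_mix(cards: List[str]) -> List[str]:
    """Toy service that answers immediately."""
    return list(reversed(cards))


async def slow_mix(cards: List[str]) -> List[str]:
    """Toy service that exceeds the budget used in the demo."""
    await asyncio.sleep(0.2)
    return list(reversed(cards))


if __name__ == "__main__":
    print(asyncio.run(mix_with_timeout(fast_mix, ["a", "b", "c"], 0.1)))
    print(asyncio.run(mix_with_timeout(slow_mix, ["a", "b", "c"], 0.05)))
```

Since the call sits on the critical path, the budget passed as `timeout_s` bounds how much the service can add to overall latency, which is why both speeding up the service and relaxing the scenario timeout affect the timeout rate.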

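Comparison of the two options

Option	Advantages	Disadvantages
Option 1 (parallel call)	Parallel invocation has little impact on overall pipeline RT	Moving the other engineering processing earlier makes downstream services handle three to four times as many cards, causing redundant resource consumption
Option 2 (serial call)	Low pipeline modification cost and no redundant resource consumption	Service latency adds directly to overall pipeline latency, putting more pressure on system stability

After weighing both, we concluded that the redundant resource consumption of Option 1 was unacceptable, and chose Option 2 as the official pipeline modification plan.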

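Summary and outlook

After the pipeline changes above, the xhuffle service has been fully rolled out in the in-purchase and post-purchase information flows, and the good-price edition information flow is being onboarded gradually. After a series of optimization iterations, the xhuffle service has delivered gains in both natural and advertising metrics while maintaining system stability.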

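▐ Pipeline stability results

Mixed-ranking service metrics: In the entrance scenarios, the average RT of service calls stays within 30 ms and P99 stays within 70 ms. The service call timeout rate is stable within 0.5%.
Overall system stability metrics for the entrance scenarios: Overall pipeline latency is under control, and the overall timeout rate stays within 0.3%.
Client-side user experience metrics: Because every scenario relaxed its RT timeout limit, we use changes in client-side interface latency to reflect the impact on perceived user experience. Comparing per-client interface latency before and after relaxing the RT limit shows no noticeable change in perceived experience.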

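▐ Future outlook

Upgrade the mixed-ranking strategy for short video, live broadcast, and other businesses, reducing the constraints that business fixed slots place on mixed ranking.
Incorporate rule-based strategies such as category dispersion.
Build a general-purpose onboarding solution for the mixed-ranking service pipeline, so that a single solution provides mixed-ranking strategy services to more scenarios.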

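About the team

The Homepage team of Homepage & Information Flow Technology at Taotian Group is responsible for the homepage and information flow recommendation of the group's e-commerce platform. Scenarios such as the Mobile Taobao homepage, the information flow, and NewDetail serve hundreds of millions of users every day, and the peak QPS of core systems during major promotions reaches the tens of millions. Our work covers end-to-end performance optimization across the full pipeline, traffic efficiency, user experience, increasing the engagement of merchants and creators on Taobao, and improving how the commercial ecosystem operates. Over the past few years we have focused on supporting the core pipeline business of the Mobile Taobao homepage and recommendation information flow and on abstracting business platforms, working closely with industry-leading algorithm teams to keep expanding business boundaries and repeatedly surpass core business metrics.
There is enormous traffic here, enough to satisfy your ambitions of working hands-on with high-concurrency, large-scale distributed systems;
There are cutting-edge algorithm application scenarios here, where you can explore all kinds of intelligent innovation;
There are demanding system metric requirements here, where you can experience the satisfaction of optimizing complex systems.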


This article is shared from the WeChat official account 大淘宝技术 (AlibabaMTT).