Application of Reinforcement Learning in Intelligent Replenishment Scenario

About the author: Ying Rushi is an engineer on the Guanyuan algorithm team. A graduate of the Department of Computer Science, Imperial College London, he focuses on reinforcement learning, time-series algorithms, and their applications. He works deeply in retail and consumer goods scenarios, solving supply chain and logistics optimization problems and providing customers with machine-learning-based AI solutions.

1. Background

With the rapid development of cutting-edge technologies such as big data, artificial intelligence, and cloud computing, the retail and consumer goods industry is becoming digitized and intelligent across the entire chain, from manufacturing and procurement to sales and service.

Taking the smart replenishment scenario in the end-to-end supply chain solution as an example, this article shows how Guanyuan's AI solution enables enterprises to replenish goods intelligently.

Smart replenishment helps avoid missed or incorrect orders, effectively controls inventory turnover, reduces the out-of-stock rate, eases the manual workload, and improves ordering efficiency.

Existing intelligent replenishment solutions fall mainly into two categories: end-to-end architectures and multi-step architectures.

End-to-end architectures include deep neural network models and end-to-end operations research optimization models; a multi-step architecture usually consists of two parts, a sales forecasting model and a replenishment model. The end-to-end deep neural network model's dependence on large volumes of data is already one of the biggest obstacles to deploying AI applications today, while the multi-step architecture suffers from errors that stack and amplify across multiple models, leading to unsatisfactory final results.

Commercial deployment of machine learning must weigh many factors, such as model stability, model complexity, and decision interpretability. Current technical solutions depend heavily on input data, with low model stability and weak generalization ability; this makes commercial deployment harder and limits their ability to extend to new business scenarios.

This article analyzes the technical difficulties of the intelligent replenishment scenario and explains how the Guanyuan AI solution, built on imitation learning and inverse reinforcement learning within a few-shot framework, improves on existing technical solutions.

The Guanyuan AI solution adheres to the tenet of "making the business actually use it" and holds that intelligent replenishment should reduce manual workload and enhance human decision-making rather than replace it. Especially in the post-pandemic era, human judgment remains irreplaceable for handling urgent information in time; today's business world requires human-computer collaboration to produce high-quality decisions.

2. Technical difficulties

This section analyzes the technical difficulties of the intelligent replenishment scenario from three angles: model stability, model complexity, and decision interpretability.

2.1. Model Stability

Model stability can be analyzed from two perspectives, the model's input and its output:

  • From the input perspective, stability is determined by how strongly the model depends on its data.

  • From the output perspective, stability manifests as the strength of the model's generalization ability.

2.1.1. Data Dependence

Data dependence can be subdivided into data quality dependence and data volume dependence:

  • Data Quality Dependence

Refers to data accuracy, completeness, timeliness, relevance, consistency, reliability, reasonable representation, accessibility, and so on.

  • Data Volume Dependence

Refers to the amount of data required for model training to reach convergence.

Deep neural network models require massive training data, i.e., their data volume demands are high, and they also demand high data quality, as the famous machine learning adage "garbage in, garbage out" warns. When such a model runs into a "drift problem", it adapts poorly and its performance inevitably degrades.

Drift problems can generally be divided into the following two categories:

  • Data Drift

Refers to a change in the distribution of the input data, which makes it hard for models trained on historical data to perform well on the new data.

  • Concept Drift

Refers to the situation where the pattern the model has learned no longer holds.

In contrast to data drift, the input data distribution stays the same; what changes is the relationship between the model's inputs and outputs.

When data drift or concept drift occurs, either the model's input distribution changes or the pattern the model has learned no longer holds true. A typical example is the Covid-19 outbreak.

In 2020, the Covid-19 pandemic swept the world. Almost overnight, people's travel patterns, dining habits, and supply chain stocking changed dramatically. These changes included shifts in data distribution (data drift), such as the surge in online orders and the sharp drop in offline orders as shopping moved online, as well as concept drift, such as international tourism and related businesses being hit hard during the pandemic while old patterns may return as conditions improve (reoccurring concepts).

Such changes affect all models, no matter how stable they were previously known to be. When sudden drift occurs, the model's future performance cannot be guaranteed.
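To make the drift discussion concrete, here is a minimal sketch (our illustration, not part of any replenishment product) of one common way to flag data drift, assuming a reference window and a recent window of the same feature are available as arrays:

```python
# A minimal data-drift check: a two-sample Kolmogorov-Smirnov test
# comparing a historical reference window against recent data.
import numpy as np
from scipy.stats import ks_2samp

def detect_data_drift(reference: np.ndarray, recent: np.ndarray,
                      alpha: float = 0.01) -> bool:
    """Flag drift when the two samples are unlikely to share a distribution."""
    statistic, p_value = ks_2samp(reference, recent)
    return p_value < alpha  # small p-value => distributions differ

# Usage: compare pre- and post-shock sales windows (synthetic numbers).
rng = np.random.default_rng(0)
pre = rng.normal(100, 10, size=365)   # stable historical demand
post = rng.normal(60, 25, size=90)    # demand pattern after a shock
print(detect_data_drift(pre, post))   # True: drift detected
```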

In the smart replenishment scenario, data quality is worrying on the one hand: inventory, scrap, and arrival information can be inaccurate, and product information maintenance is often delayed. On the other hand, the scenario suffers from the drift problem: in the post-pandemic era, both data distributions and the underlying patterns have changed drastically. Because existing technical architectures depend heavily on data quality and volume, their adjustment cycles are long and the gains are limited, making it hard to keep up with today's fast-changing business needs.

2.1.2. Model Generalization

The previous section analyzed model stability from the perspective of the model's data; this section analyzes it from the perspective of generalization ability.

The generalization scenarios of machine learning models fall into two categories:

  • Weak generalization

The training data and test data come from the same distribution; also known as interpolation or robustness.

  • Strong generalization

The training data comes from a different distribution than the test data; also known as extrapolation or understanding.

"Weak generalization" usually assumes that the training and test data distributions are the same. But in practical problems, even in the case of "large sample limit", there will always be differences between the two distributions. In the smart replenishment scenario, store business updates, surrounding customer flow changes, and areas temporarily affected by the epidemic will cause the data of the training model and the data of the test model to not meet the independent and identical distribution conditions. In this case, whether it is an end-to-end neural network model trained based on a large amount of historical data or a multi-step operational research architecture, both face the problem of data drift, which weakens the generalization ability of the model.

In the "strong generalization" category, the model is evaluated on a completely different data distribution. Reinforcement learning aims to address model generalization in such changing scenarios. When the intelligent learning system understands more about the world, it is easier to obtain learning signals, and the fewer samples are needed to make decisions. This is why few shot learning, imitation learning, and learning to learn are important: they will free us from brute force solutions with large variance and little useful information.

Existing solutions adopt the i.i.d. assumption, which undermines even their "weak generalization" performance; and when they encounter different data distributions, their "strong generalization" ability is in urgent need of improvement.

2.1.3. Model Decay

Machine learning has a concept called model decay: a model's historical performance does not guarantee its future performance. This phenomenon is also referred to as model drift, decay, or staleness. Models therefore require regular maintenance, preserving their performance through retraining or even rebuilding.

Given the data dependence and drift problems described above, every model inevitably decays. Because existing technical solutions depend heavily on data, they are more prone to decay once drift occurs, incurring high maintenance costs and sometimes even requiring a model redesign, which makes iteration expensive. The figure below summarizes the relationship between generalization ability, decay risk, and data dependence.

As the figure shows, the more a model depends on its data, the weaker its generalization ability (green line) and the higher its decay risk (red line); this is the main deficiency of existing technical solutions.

2.2. Model Complexity

Model complexity is considered mainly in terms of training difficulty and iteration cost. Because business requirements vary constantly, the model decay problem described above is unavoidable, and regular retraining and iterative upgrades are essential; training difficulty and iteration cost therefore become key factors in assessing model complexity.

2.2.1. Model Trainability

Deep neural networks require massive data input, hardware support such as high-performance GPUs, and long training times. The demand for massive data limits intelligent replenishment to mature stores that have been open for a long time and rules out newly opened stores with little data.

In a multi-step model architecture, the multiple models have different optimization objectives, and the objectives of the intermediate steps are not fully aligned with the final goal: an improvement in forecast accuracy, for example, does not necessarily improve inventory turnover (the bullwhip effect). This also makes the models harder to train, often producing the awkward situation where 1 + 1 < 2.

2.2.2. Model Iteration Cost

Deep neural networks take a long time to train; typically a target loss function is specified and the model is trained by minimizing it. When the business logic changes, the business knowledge must be re-distilled, the objective function modified, new features built, the network structure adjusted, and so on.

Some models instead build a mathematical model around total operating cost, use it as the loss function, and solve for the recommended replenishment quantity. But when requirements change, say, the business goal shifts from minimizing total operating cost to expanding and seizing market share, accepting a price war to attract traffic, the objective is no longer the lowest total operating cost.

In summary, from the perspective of model complexity, these technical solutions train slowly, suffer a gap between intermediate and final objectives, and rely on a single objective function. When business requirements change, adjustment is costly, difficult, and slow, and the solutions cannot adapt to the rapidly changing needs of the business world.

2.3. Replenishment Decision Interpretability

Intelligent replenishment decisions ultimately reach business and domain experts, so they must be highly interpretable. From the perspective of decision interpretability, black-box prediction and blind assumptions are both problems a technical solution must address.

2.3.1. Black-Box Prediction

A deep neural network model takes raw data in and outputs a replenishment value directly. This is convenient, but it increasingly behaves like a black box: apart from the inputs and outputs, engineers and business users know nothing about what happens in between.

For business and domain experts, AI-driven decisions must first solve a basic trust problem. When the model recommends an action, experts need reasons to believe the model is right: they need to know its decision logic, understand its weaknesses, and be sure the risks are under control. If experts do not trust the model, its decisions will not be used widely. A useful model is a model that gets used; if the model never makes it into production use, even outstanding black-box prediction accuracy is worthless.

2.3.2. Blind Assumptions

Machine learning comes with many assumptions. Clever assumptions can simplify a model, but blind assumptions can make it fail fatally.

The most common blind assumption is that training and test data are independent and identically distributed. This implicit assumption is hard to spot, because training a model does not require the data to satisfy it: an engineer can simply take the data, train, and obtain a model, yet a violated i.i.d. assumption undermines the model's predictions at the level of its underlying logic.

Another kind of blind assumption lives at the business level. The Economic Order Quantity (EOQ), a favorite in academia, is rarely used in practice: both holding costs and operating costs are hard to measure and essentially unobtainable. Moreover, day-to-day operations look more at quantities, whether the stock level can meet demand, or how many units must be sold to recoup one unsold unit, while inventory rental costs and fixed operating costs get little attention. Similarly, a solution that builds a mathematical model around a single objective (such as total operating cost), uses it as the loss function, and directly outputs a recommended replenishment quantity runs into the same real-world problem that total operating cost is hard to estimate.
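For reference, the classical EOQ formula makes the paragraph's point visible, since it depends on exactly the cost inputs that are hard to obtain in practice:

$$Q^{*} = \sqrt{\frac{2DS}{H}}$$

where $D$ is annual demand, $S$ is the fixed cost per order, and $H$ is the annual holding cost per unit. Without reliable estimates of $S$ and $H$, the "optimal" order quantity cannot be trusted.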

A third kind of blind assumption appears in some multi-step models that split replenishment into a forecasting part and a replenishment part. The scheme assumes that a more accurate sales forecast yields a more reasonable replenishment decision, but forecast errors accumulate and amplify as models are stacked, and the optimization objectives of the intermediate steps are not fully aligned with the final goal: an improvement in forecast accuracy, for example, does not necessarily improve turnover.

3. Guanyuan AI Technology

The Guanyuan AI team works deeply in supply chain scenarios. Guided by the principle of "making the business actually use it", it adopts a few-shot data framework based on imitation learning and inverse reinforcement learning, improving existing supply chain technical solutions from the following angles.

  • Model Stability

Reduce dependence on data quality

Improve the model's generalization ability

Reduce the maintenance cost of model decay

  • Model Complexity

Reduce model training difficulty

Reduce model iteration cost

  • Replenishment Decision Interpretability

Avoid black-box prediction

Avoid blind assumptions

3.1. Theoretical Foundations

3.1.1. What Is Reinforcement Learning (RL)?

Reinforcement learning (RL) is a field of machine learning concerned with how to act in an environment so as to maximize expected cumulative reward. It is the third basic machine learning paradigm alongside supervised and unsupervised learning. Unlike supervised learning, RL needs neither labeled input-output pairs nor explicit correction of suboptimal actions; its focus is balancing exploration (of the unknown) against exploitation (of existing knowledge). This exploration-exploitation trade-off has been studied most thoroughly in the multi-armed bandit problem and in finite MDPs.
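As a small illustration of the exploration-exploitation trade-off (a textbook sketch, unrelated to any specific replenishment system), an epsilon-greedy bandit strategy explores a random arm with probability epsilon and otherwise exploits the current best estimate:

```python
# Textbook epsilon-greedy strategy for the multi-armed bandit.
import random
from typing import List

def epsilon_greedy(q_values: List[float], epsilon: float = 0.1) -> int:
    """With probability epsilon explore a random arm; otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)   # exploit

def update(q_values: List[float], counts: List[int], arm: int, reward: float):
    """Incremental-mean update of the pulled arm's value estimate."""
    counts[arm] += 1
    q_values[arm] += (reward - q_values[arm]) / counts[arm]
```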

Basic reinforcement learning is modeled as a Markov decision process (MDP), which includes the following elements:

  1. State space

  2. Action space

  3. Reward function

  4. Policy

  5. Rules for transitions between states (the transition probability matrix)

Of these elements, the first four, state space, action space, reward function, and policy, belong to the agent; the fifth, the state-transition rules, is a property of the environment.
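To make the five elements concrete, here is a minimal sketch (illustrative field names and numbers only, not Guanyuan's production design) of how a replenishment MDP might be laid out in Python:

```python
# A minimal, illustrative MDP container for a replenishment agent.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]   # e.g., (stock on hand, stock in transit)
Action = int              # e.g., replenishment quantity

@dataclass
class ReplenishmentMDP:
    states: List[State]                        # 1. state space
    actions: List[Action]                      # 2. action space
    reward: Callable[[State, Action], float]   # 3. reward function
    policy: Callable[[State], Action]          # 4. policy (agent side)
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # 5. environment dynamics

# Example: reward favors landing near a target level; policy orders up to it.
mdp = ReplenishmentMDP(
    states=[(s, t) for s in range(10) for t in range(5)],
    actions=list(range(10)),
    reward=lambda s, a: -abs(s[0] + s[1] + a - 6),  # distance from target of 6
    policy=lambda s: max(0, 6 - s[0] - s[1]),
    transition={},  # omitted: filled in by a simulator or estimated from data
)
print(mdp.policy((2, 1)))  # orders 3 units
```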

For a deeper discussion of reinforcement learning, see the article 《深入了解强化学习》 (A Deep Dive into Reinforcement Learning).

3.1.2. RL and Imitation Learning

Imitation learning, learning from expert demonstration, is a way to let an agent make intelligent decisions the way a human expert does. On the road to general artificial intelligence, researchers found it very hard to teach agents to reason through hand-coded programming, because doing so involves enormous manual engineering. In teaching a vehicle to drive autonomously, for example, many constraints must be considered (driving safely without accidents, driving smoothly for comfort, and so on), and designing specific supervision signals for these constraints to guide the agent is a difficult task. Humans, by contrast, can perform these tasks with relative ease and can provide the agent with large numbers of demonstrations. Using such expert demonstrations to teach an agent to make intelligent decisions is the core problem imitation learning addresses.

Constantly emerging new tasks have prompted researchers to design a wide variety of imitation learning algorithms. It is generally agreed that they fall into two families:

  • Behavioral Cloning

  • Adversarial Imitation Learning

Behavioral cloning algorithms try to minimize the difference between the agent's actions and the expert's, reducing imitation learning to a familiar regression or classification task. Adversarial imitation learning algorithms instead use inverse reinforcement learning (IRL) to construct an adversarial reward function and then maximize that reward to imitate the expert's behavior.

The Guanyuan AI solution combines behavioral cloning and adversarial imitation learning. Using experts' historical replenishment decisions as demonstrations, imitation learning trains the agent's replenishment decision-making up to expert level. On that basis, when the agent faces a complex business scenario with an unclear reward function, inverse reinforcement learning constructs an adversarial reward function so that the agent can still make first-rate replenishment decisions.

3.2. Architecture Design

From bottom to top, Guanyuan's AI smart replenishment architecture has three parts: MDP design, imitation learning modeling, and intelligent decision-making.

The MDP design is the infrastructure. On top of it, the business scenario is abstracted into an imitation learning model with two sub-models: behavioral cloning and adversarial imitation learning. The behavioral cloning algorithm handles simple business scenarios where the reward function is known and serves as the base strategy (exploitation); the adversarial imitation learning algorithm handles complex scenarios where the reward function is unknown and serves as the exploration strategy (exploration).

3.2.1. Markov Decision Process (MDP) Design

The MDP design covers the state space, the action space, the reward function, and the policy.

  • State space design

The state space describes the environment information the agent perceives and how it changes; it is the model's abstraction of the environment.

In the smart replenishment scenario, the state space includes information such as product stock on hand, stock in transit, and store type.

  • Action space design

The action space describes the operations the agent can perform, such as moving up, down, left, or right, attacking, or dodging in a game.

In the smart replenishment scenario, the action space covers whether to replenish, how much to replenish, and so on.

At the same time, the action space design incorporates business knowledge and introduces several new concepts to describe replenishment actions at a finer granularity. For example:

  1. Trigger stock: when a product's actual stock falls below the trigger stock, the store clerk's replenishment action is triggered

  2. Expect stock: the level the clerk expects the product to be replenished up to

  3. Replenishment frequency: the time interval between two consecutive replenishments of a product

After introducing this action space information, a product's replenishment decision is based on the following rule (see the sketch after this rule):

When a product's actual stock falls below its trigger stock, the agent triggers a replenishment decision; the replenishment quantity is the difference between the expect stock and the actual stock. The model also takes the replenishment frequency into account to keep the replenishment behavior reasonable.
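A minimal sketch of that rule, assuming the three action-space parameters have already been learned for a product (all names and numbers here are illustrative):

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class ReplenishmentParams:
    trigger_stock: int     # replenish when stock falls below this
    expect_stock: int      # target level to replenish up to
    frequency_days: int    # minimum days between replenishments

def decide_replenishment(stock_on_hand: int, last_order: date, today: date,
                         p: ReplenishmentParams) -> int:
    """Return the suggested order quantity (0 means do not replenish)."""
    too_soon = today - last_order < timedelta(days=p.frequency_days)
    if stock_on_hand >= p.trigger_stock or too_soon:
        return 0
    return p.expect_stock - stock_on_hand  # order up to the expected level

# Usage: stock dropped to 3 against a trigger of 5 and a target of 12.
params = ReplenishmentParams(trigger_stock=5, expect_stock=12, frequency_days=2)
print(decide_replenishment(3, date(2023, 1, 1), date(2023, 1, 4), params))  # 9
```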

  • Reward function design

In a reinforcement learning task, the agent keeps improving its policy using feedback signals from the environment gathered during exploration. These signals are called rewards; a reward is immediate, while the accumulated reward is called the return. As the concrete, numerical embodiment of the task objective, the reward signal is the bridge between people and the algorithm: algorithm engineers "translate" customer expectations and task objectives into a reward function that guides the training of the reinforcement learning algorithm.

In the replenishment scenario, the reward function can be designed around daily revenue and net profit, e.g., the higher a store's net profit, the higher the reward. It can also be designed around the scrap rate, e.g., the lower a store's scrap rate, the higher the reward.

  • Policy design

Policy design builds on the designs of the state space, the action space, and the reward function. In the smart replenishment scenario, replenishment policies can take many forms, for example:

  1. In daily operations, the policy maximizes the store's daily revenue and net profit;

  2. When capturing market share, the policy can maximize the quantity of goods on display while tolerating a higher scrap rate.

Depending on how the reward functions are designed, the model's policy can be adjusted flexibly; net profit and scrap rate can even be weighed together, combining multiple reward functions into a composite reward function (sketched below).

Policy design steers replenishment decisions toward business needs by adjusting the reward function, which better suits the ever-changing business scenarios of today's commercial world.
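As a hedged illustration of such a composite reward (the weights and functional form are our assumptions, not the production design), net profit and scrap rate might be blended as follows:

```python
# Illustrative composite reward: trades off net profit against scrap rate.
def composite_reward(net_profit: float, scrap_rate: float,
                     w_profit: float = 1.0, w_scrap: float = 200.0) -> float:
    """Higher profit raises the reward; a higher scrap rate lowers it."""
    return w_profit * net_profit - w_scrap * scrap_rate

# Daily operations: penalize scrap heavily.
print(composite_reward(net_profit=800.0, scrap_rate=0.05))                 # 790.0
# Market-capture mode: tolerate scrap by shrinking its penalty weight.
print(composite_reward(net_profit=500.0, scrap_rate=0.15, w_scrap=50.0))   # 492.5
```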

3.2.2. Imitation Learning Modeling

  • Behavioral Cloning

Behavioral cloning suits scenarios where the reward function is known or the expert demonstrations are already good enough; the model only needs to minimize the difference between the agent's and the expert's actions, typically reducing imitation learning to a familiar regression or classification task.

This section uses optimal stocking-quantity prediction and optimal stocking-day prediction as two examples of behavioral cloning.

  • Optimal stocking-quantity prediction

First, stocking-quantity prediction differs from sales forecasting: multi-step technical solutions typically still need to handle case-pack conversion, shelf display, warehouse overstocking, and other business concerns after forecasting sales.

The stocking quantity is generally much larger than the sales volume and already accounts for case-pack conversion, shelf display, warehouse overstocking, and similar business logic. By imitating experts' stocking behavior, this approach predicts the stocking quantity directly instead of forecasting sales and then converting to a stocking quantity, greatly reducing the loss of accuracy (a minimal sketch follows).
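A minimal sketch of behavioral cloning as plain regression, assuming a table of historical states and the expert's chosen stocking quantities (the features and numbers are illustrative; as the summary below notes, tree models work equally well):

```python
# Behavioral cloning as regression on expert stocking decisions.
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative features: [stock on hand, stock in transit, day of week]
expert_states = np.array([
    [3, 2, 0], [8, 0, 1], [1, 4, 2], [6, 1, 3], [2, 0, 4],
])
expert_orders = np.array([7, 0, 6, 2, 9])  # expert's historical order quantities

cloned_policy = LinearRegression().fit(expert_states, expert_orders)

# The cloned policy now maps a new state to a stocking quantity.
new_state = np.array([[4, 1, 2]])
print(round(float(cloned_policy.predict(new_state)[0])))
```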

  • Optimal stocking-day prediction

Store ordering must account for logistics delays, i.e., lead time, which differs markedly across products and logistics modes. This approach first clusters products so that those with similar logistics attributes form one group, then learns store managers' habit of stocking ahead of time to obtain the stocking-day lead for each cluster (see the sketch below).
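A possible sketch of that clustering step, assuming per-product logistics features and observed expert stocking leads are available (all values hypothetical):

```python
# Illustrative lead-time clustering: group products with similar logistics
# attributes, then learn a per-cluster stocking-day lead from expert habits.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per product: [mean lead time, lead-time std. dev.]
logistics = np.array([[2, 0.5], [3, 0.6], [10, 2.0], [9, 1.8], [2, 0.4]])
expert_lead_days = np.array([3, 3, 12, 11, 2])  # experts' observed stocking leads

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(logistics)

# Per-cluster lead: the median of expert leads within each cluster.
cluster_lead = {int(c): int(np.median(expert_lead_days[clusters == c]))
                for c in np.unique(clusters)}
print(cluster_lead)  # e.g., {0: 3, 1: 11} (cluster ids may vary)
```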

In summary, behavioral cloning targets simple business scenarios that can be abstracted as simple regression or classification problems, where linear regression or tree models work well. Notably, behavioral cloning sharply reduces the model's dependence on data: practice shows that even with low-quality inventory data, it can obtain a good initial replenishment policy purely from experts' replenishment decisions, whereas existing solutions make far less reasonable decisions when data quality is low.

  • Adversarial Imitation Learning

Adversarial imitation learning suits complex business scenarios: demonstration data exists, but simply imitating it will not reach optimal performance, and the reward function is unknown or hard to design. In this case, inverse reinforcement learning (IRL) can fit a reward function, which is then maximized to guide the agent in generating a replenishment policy.

Such scenarios include, but are not limited to, new-product replenishment and the phase-out of old products for new ones. Since the reward function for these complex businesses cannot be pinned down, a reward function can be trained from experts' historical replenishment policies for new products and their phase-out policies for old and new products, and an optimal replenishment policy can then be generated from that reward function.

For example, the trained reward function might reflect a 15% scrap rate for the new-product replenishment business, or a 20% phase-out ratio in the product-replacement scenario. In that case, even when the model encounters a brand-new product it has never seen, because it has acquired the reward functions of the new-product replenishment and product-replacement businesses through inverse reinforcement learning, it can still replenish the unseen product sensibly. This is one manifestation of the model's strong generalization (extrapolation) ability.
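To sketch the adversarial piece, here is a minimal GAIL-style discriminator in PyTorch (an illustrative stand-in, not the production model): it scores (state, action) pairs as expert versus agent, and its output becomes a learned reward for the agent.

```python
# Minimal GAIL-style discriminator: learn a reward from expert pairs.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))  # logits

disc = Discriminator(state_dim=3, action_dim=1)
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# Illustrative batches: expert pairs labeled 1, agent pairs labeled 0.
expert_s, expert_a = torch.randn(32, 3), torch.randn(32, 1)
agent_s, agent_a = torch.randn(32, 3), torch.randn(32, 1)

loss = (bce(disc(expert_s, expert_a), torch.ones(32, 1))
        + bce(disc(agent_s, agent_a), torch.zeros(32, 1)))
opt.zero_grad(); loss.backward(); opt.step()

# Learned reward for the agent: high when it fools the discriminator.
with torch.no_grad():
    d = torch.sigmoid(disc(agent_s, agent_a))
    reward = -torch.log(1.0 - d + 1e-8)
```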

3.2.3. Intelligent Decision-Making

Once the MDP design and imitation learning modeling are complete, the decision process begins. Decisions split into base decisions (exploitation), which clone store managers' behavior to reach the average level of most experts and handle simple business scenarios, and exploration decisions (exploration), which build on inverse reinforcement learning and learned reward functions to handle complex scenarios. Base decisions emphasize weak generalization; exploration decisions target strong generalization.

Meanwhile, this architecture keeps collecting new real-world data to improve the model. This process is known as the DAgger (Dataset Aggregation) algorithm: the policy obtained by behavioral cloning interacts continually with the environment to produce new data; behavioral cloning is then retrained on the aggregated dataset, the policy interacts with the environment again, and the cycle repeats. Through data aggregation and environment interaction, DAgger greatly reduces the number of unvisited states, improving both the model's weak and strong generalization.
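A minimal sketch of the DAgger loop described above, assuming hypothetical `rollout` (runs the current policy and returns the states it visits) and `expert_label` (queries the expert for actions in those states) helpers:

```python
# Illustrative DAgger loop: aggregate expert-labeled data across iterations.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def dagger(expert_label, rollout, init_states, init_actions, n_iters=5):
    """expert_label(states) -> expert actions; rollout(policy) -> visited states."""
    states, actions = init_states, init_actions   # seed with expert demonstrations
    policy = DecisionTreeRegressor().fit(states, actions)
    for _ in range(n_iters):
        visited = rollout(policy)                 # 1. run the current policy
        labels = expert_label(visited)            # 2. ask the expert for labels
        states = np.vstack([states, visited])     # 3. aggregate the dataset
        actions = np.concatenate([actions, labels])
        policy = DecisionTreeRegressor().fit(states, actions)  # 4. retrain
    return policy
```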

3.3. Summary

This section summarizes the benefits the Guanyuan AI smart replenishment solution brings, along with the resulting gains in commercial deployment and business iteration, giving it stronger commercial landing and business expansion capabilities.

Model stability

  • Built on an imitation learning architecture, the model depends little on the quality and volume of input data, so it can support replenishment for newly opened stores.

  • The model adjusts to sudden data drift faster and adapts quickly to business changes; it takes less time than deep neural network models and copes better with model decay.

  • Combining behavioral cloning with inverse reinforcement learning improves both weak and strong generalization; in practice the model performs well in both simple and complex business scenarios.

Model complexity

  • With no explicit loss function, the solution can rely on imitation learning alone to clone experts' replenishment behavior, keeping training difficulty low.

  • It abandons the multi-layer architecture of deep neural networks in favor of a pioneering use of reinforcement learning via imitation learning; it needs little training data and is easy to train.

  • Its explicit MDP design (state space, action space, reward function) enables fast, low-cost iteration in a complex, fast-changing business world.

Replenishment decision interpretability

  • Traditional deep neural networks take historical sales as input and output sales forecasts through multiple hidden layers, so overall interpretability is low.

  • This solution's finer-grained, explicit MDP design (state space, action space, reward function) explains the replenishment logic better and strengthens decision interpretability.

  • Data drift and concept drift can be detected by monitoring the action space, i.e., parameters such as expect stock, trigger stock, and replenishment frequency that describe the replenishment decision logic. When the model returns a "strange" replenishment decision, domain experts can examine the corresponding product's action space to analyze the decision logic and judge whether it is reasonable.

  • It avoids black-box prediction, and it avoids the logical flaws caused by blind assumptions about data distribution, business logic, or model architecture.

4. Guanyuan AI and Outlook

The Guanyuan AI solution takes "making the business actually use it" as its tenet, combining concrete business scenarios to maximize business value. Beyond the AI technical solution described above, Guanyuan also has rich experience in product technology, enterprise services, and business promotion; see the Guanyuan Data website for details. Among some top customers in their industries, our products have passed the milestone of more than 20,000 active analysts and data-driven decision makers, and one can imagine the enormous advantage in decision efficiency and quality such enterprises gain in fierce market competition. Friends who are interested are warmly welcome to discuss, exchange ideas, and explore opportunities for cooperation.
