Reinforcement Learning: How to deal with large-scale discrete action space

https://www.toutiao.com/a6701973206141501964/

 

After the deep learning wave, how should models in search and recommendation keep iterating? Reinforcement learning has shone in the field of games; can it also be applied to search and recommendation? Search and recommendation problems can often be viewed as sequential decision problems, so introducing the idea of reinforcement learning to maximize long-term return is quite natural, and there has already been related exploration in industry. I will therefore write a series of posts introducing recent applications of reinforcement learning to search and recommendation.

This post introduces two papers on reinforcement learning with large-scale discrete action spaces.

The first, published by DeepMind in 2015, is entitled:

Deep Reinforcement Learning in Large Discrete Action Spaces

Link:

https://arxiv.org/abs/1512.07679

This one gets an in-depth read;

The second, published at AAAI 2019, is entitled:

Large-scale Interactive Recommendation with Tree-structured Policy Gradient

Link:

https://arxiv.org/abs/1811.05869

This one gets a lighter read.

I. Introduction

Traditional recommendation models consider only a single recommendation at a time. Could the policy be improved by treating consecutive recommendations as a sequence? In fact, there is already some work that introduces reinforcement learning to model the recommendation process. The problem is that the number of items in recommendation scenarios is often very large, and such a large discrete action space prevents many RL methods from being applied effectively. For example, a DQN-based approach learns a policy of the form:

 

$$\pi(s) = \arg\max_{a \in A} Q(s, a)$$

Here A denotes the item set: the Q function has to be evaluated for every item in A. If |A| is very large, the time cost is unacceptable. The advantage of this approach, however, is that the Q function often generalizes well across actions. In actor-critic methods, the actor network is typically similar to a classifier, outputting a probability distribution over actions through a softmax, which avoids the performance problem of DQN-style methods; the drawback is that such methods do not generalize well to actions that have rarely appeared. So we want a method whose complexity is at most linear in the size of the action space while still generalizing well across actions.
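To make the cost argument concrete, here is a minimal, hypothetical sketch (NumPy; not from either paper) of DQN-style action selection over a large catalog. The names `q_values` and `item_embeddings` are illustrative stand-ins for a learned critic and the item set:

```python
import numpy as np

rng = np.random.default_rng(0)

num_items, dim = 100_000, 32          # real catalogs can be millions of items
item_embeddings = rng.normal(size=(num_items, dim))  # one row per item


def q_values(state, items):
    """Toy stand-in for a learned Q network: a simple dot-product score."""
    return items @ state


def dqn_style_action(state):
    # O(|A|): every item must be scored before taking the argmax,
    # which is what makes this approach too slow for huge catalogs.
    return int(np.argmax(q_values(state, item_embeddings)))


state = rng.normal(size=dim)
print("chosen item:", dqn_style_action(state))
```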

II. Wolpertinger Architecture

This algorithm is proposed in the first paper; the overall procedure is shown below.

 

[Figure: Wolpertinger architecture]

The algorithm is built on the actor-critic framework and trained with DDPG; here I only cover the action-selection part, which is the focus of the paper. The actor first computes $f_{\theta^\pi}(s)$ to obtain a proto-action $\hat{a}$. However, $\hat{a}$ may not be a valid action, i.e. $\hat{a} \notin A$. The algorithm then finds the $k$ actions in $A$ most similar to $\hat{a}$, expressed as:

$$g_k(\hat{a}) = \underset{a \in A}{\operatorname{arg\,min}^k} \; \| a - \hat{a} \|_2$$

This step can be solved approximately (for example with an approximate nearest-neighbor search) in sub-linear time, avoiding the problem of excessive time complexity. But some actions with low Q values may happen to lie near $\hat{a}$, so directly choosing the single action closest to $\hat{a}$ is not ideal. To avoid selecting such an abnormal action, the candidates are re-ranked by their Q values:

$$\pi(s) = \underset{a \in g_k(\hat{a})}{\operatorname{arg\,max}} \; Q_{\theta^Q}(s, a)$$

The parameters involved include those of the action-generation (actor) network $f_{\theta^\pi}$ and of the critic network $Q_{\theta^Q}$. Filtering candidates by their Q values makes the action choice more robust.
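A rough sketch of this selection step, under my own assumptions (an `actor` and a `critic` callable and an `item_embeddings` matrix already exist; the exact nearest-neighbor search here would be replaced by an approximate index such as FLANN in a real system):

```python
import numpy as np

def wolpertinger_action(state, actor, critic, item_embeddings, k=10):
    # 1. Proto-action: a point in the continuous action space that
    #    need not correspond to any valid item.
    proto_action = actor(state)

    # 2. The k valid actions closest to the proto-action (exact search
    #    here; an ANN index makes this sub-linear in |A|).
    dists = np.linalg.norm(item_embeddings - proto_action, axis=1)
    candidates = np.argpartition(dists, k)[:k]

    # 3. Re-rank the k candidates by their Q values so that a nearby
    #    but low-value action is not selected.
    q = np.array([critic(state, item_embeddings[i]) for i in candidates])
    return int(candidates[np.argmax(q)])


# Toy usage with stand-in actor/critic functions.
rng = np.random.default_rng(1)
items = rng.normal(size=(10_000, 16))
actor = lambda s: s                     # proto-action equals the state (toy)
critic = lambda s, a: float(s @ a)      # dot-product value (toy)
print(wolpertinger_action(rng.normal(size=16), actor, critic, items, k=20))
```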

III. TPGR Model

This algorithm is proposed in the second paper. Its main idea is to preprocess the item set with hierarchical clustering so as to solve the efficiency problem; the model framework is shown below. The left half shows a balanced tree structure: the whole tree corresponds to a hierarchical clustering of the full item set, and each leaf node corresponds to a specific item. The right half shows how decisions are made over the tree: each non-leaf node has its own policy network, and along the path from the root to a leaf, these policy networks make decisions in turn. The TPGR model reduces the decision complexity from

$$O(|A|)$$

to

$$O(d \times |A|^{1/d}),$$

where d denotes the depth of the tree. The two parts are described separately below.

 

[Figure: TPGR model framework]

  • Balanced hierarchical clustering

 

The goal of this step is to hierarchically cluster the full item set; the clustering result can be represented as a tree. The paper emphasizes that this tree is balanced, that is, the height difference between subtrees is at most 1 and every subtree also satisfies the balance property. Let the depth of the tree be d; apart from the parents of leaf nodes, every other internal node has exactly c subtrees. The relationship among d, c, and the number of items |A| is then:

 

$$c = \lceil |A|^{1/d} \rceil$$
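As a quick numeric check (my own example, not from the paper): with one million items and a tree of depth 2, each policy network only has to choose among roughly a thousand children, so one recommendation costs about d × c decisions instead of |A|:

```python
import math

num_items, depth = 1_000_000, 2
branching = math.ceil(num_items ** (1.0 / depth))  # c = ceil(|A|^(1/d))
flat_cost = num_items                              # O(|A|) scoring
tree_cost = depth * branching                      # O(d * |A|^(1/d))
print(branching, flat_cost, tree_cost)             # 1000 1000000 2000
```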

Two questions are involved: how to represent items and how to cluster them. For the first, one can represent an item by its row of the rating matrix, or by a latent vector obtained from matrix factorization, among other options. For the second, the paper proposes two methods; for lack of space I only describe the one based on a modified K-means. Concretely: first run ordinary K-means to obtain the c cluster centers; then loop over the clusters, and for each cluster add the item closest to its center (using Euclidean distance as the similarity measure); after one pass over all clusters, keep looping over them until every item has been assigned. Assigning items this way makes the final number of items in each cluster essentially equal, which meets the balance requirement; a sketch is given below.
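A minimal sketch of this balanced assignment as I read it (not the authors' code; the function name `balanced_clusters` and the use of scikit-learn's KMeans are my own choices):

```python
import numpy as np
from sklearn.cluster import KMeans

def balanced_clusters(item_vectors, c):
    # Ordinary K-means gives the c cluster centers.
    centers = KMeans(n_clusters=c, n_init=10).fit(item_vectors).cluster_centers_

    unassigned = set(range(len(item_vectors)))
    clusters = [[] for _ in range(c)]
    # Round-robin over clusters: each cluster greedily takes its nearest
    # remaining item, so cluster sizes end up differing by at most one.
    while unassigned:
        for j in range(c):
            if not unassigned:
                break
            remaining = np.array(sorted(unassigned))
            dists = np.linalg.norm(item_vectors[remaining] - centers[j], axis=1)
            pick = int(remaining[np.argmin(dists)])
            clusters[j].append(pick)
            unassigned.remove(pick)
    return clusters
```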

  • Tree-structured policy network

 

As mentioned above, every non-leaf node has its own policy network. Its input is the state at the current node, and its output is a probability distribution over the node's children, i.e. the probability of moving to each child. In the framework figure above, we start at the root node, move to one of its children according to the probabilities output by its policy network, and continue in the same way until a leaf node is reached; the item corresponding to that leaf is then recommended to the user. The policy networks are trained with the REINFORCE algorithm, with the gradient update:

 

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(a \mid s) \, Q^{\pi_\theta}(s, a) \right]$$

where $Q^{\pi_\theta}(s, a)$ denotes the expected cumulative reward of taking action $a$ in state $s$ under policy $\pi_\theta$; it can be estimated by sampling.
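To make the traversal concrete, here is a conceptual sketch under my own assumptions (PyTorch; the `tree` object with a `children(node)` method and the `policy_nets` mapping of node ids to small networks are hypothetical). It samples one root-to-leaf path and accumulates the log-probabilities that the REINFORCE gradient above differentiates:

```python
import torch

def sample_path(state, policy_nets, tree, root=0):
    node, log_prob = root, torch.tensor(0.0)
    while tree.children(node):                        # stop at a leaf
        probs = torch.softmax(policy_nets[node](state), dim=-1)
        dist = torch.distributions.Categorical(probs)
        choice = dist.sample()
        log_prob = log_prob + dist.log_prob(choice)   # sum log-probs along the path
        node = tree.children(node)[choice.item()]
    return node, log_prob                             # leaf node = recommended item

# REINFORCE step: weight the path's negative log-probability by the
# sampled cumulative reward, then backpropagate.
#   leaf, log_prob = sample_path(state, policy_nets, tree)
#   loss = -sampled_return * log_prob
#   loss.backward(); optimizer.step()
```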

IV. Summary

Real-world recommendation systems usually have a recall (candidate-generation) stage that narrows the candidate items down to the order of tens to hundreds, so from that point of view the need to handle a truly large-scale discrete action space is not that pressing.

In TPGR, the balanced tree and the cap on the number of subtrees per node are only there to keep the time complexity within a manageable order of magnitude, which is the prerequisite for handling a large-scale discrete action space. But item distributions are skewed in practice, so forcing the full item set into clusters of roughly equal size may not match reality well. This will certainly affect model quality, and more experimentation and exploration of how the tree is built may be needed.

About the author:

Yang Yiming, senior algorithm engineer at Didi Chuxing, graduated from the University of Science and Technology of China, and writes a Zhihu column on recommendation and advertising models.


Origin: blog.csdn.net/weixin_42137700/article/details/91945767