Understanding DRN: A Deep Reinforcement Learning Framework for News Recommendation

Disclaimer: This article is original. All rights reserved. https://blog.csdn.net/weixin_41864878/article/details/90080576

This post is about a 2018 paper from Microsoft on building a news recommendation system with reinforcement learning.
I have been studying recommendation systems for a bit over a month and kept feeling that my models were too simple: just naively pushing everything through fully-connected layers. So I went looking for other ways to do recommendation and found this paper, which I think is quite good overall; these are my learning notes on it.
If I have misunderstood anything, corrections from more experienced readers are welcome.

Summary

Problems left unresolved by current mainstream recommendation algorithms:
(1) Most models use only CTR as the objective function.
(2) Few try to exploit user feedback to improve recommendation quality.
(3) Most methods repeatedly recommend similar content to the user (which matches my experience; most current apps recommend in this mode).
Contributions of this paper:
(1) It introduces a user return pattern to supplement CTR, making use of more user feedback.
(2) It introduces an efficient exploration strategy so that the recommended news stays fresh for the user.

Introduction and Related Work

News recommendation faces three major challenges:
First, the dynamics of the recommendation process are hard to handle. These dynamics have two parts: (1) News is highly time-sensitive and expires quickly. In the dataset used in the paper, the average time from a news item's publication to its last click is 4.1 hours, so the candidate pool changes very rapidly. (2) A user's interests shift as they read, so the model needs to be updated regularly; the figure below makes this very clear. Moreover, the current recommendation affects what the user will want to read in the future. The paper gives a simple, sensible example: if news A and B are both recommended and you read A first, you will probably want to see more news like A and be less interested in B.
[Figure from the paper: users' reading interests shift over time]
Second, the model tends to recommend similar content to the user. Two earlier ways to alleviate this: (1) the ε-greedy strategy, which however may push content completely unrelated to the user; (2) UCB, whose response time is long, since an item must be clicked many times before an accurate reward estimate is available.
The paper's solution: (1) introduce a DQN framework to better learn news features and user preferences, since DQN can consider the current reward and the future reward at the same time; MAB-based methods cannot give an explicit future reward, and MDP-based methods do not scale to large data. (2) Introduce user activeness, derived from when the user returns to the app, as additional user feedback; user activeness can be computed at any time. (3) Use Dueling Bandit Gradient Descent (DBGD) to select candidate items for exploration in the current recommendation environment.
The overall framework is shown below: user clicks are the immediate reward, user activeness is the future reward, user features and context features represent the state, and the features of the news candidate pool are the actions. Together these form the basic reinforcement learning setup.
[Figure: overall reinforcement learning framework of DRN]
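In code terms, this reinforcement-learning framing could be written as a simple transition record. This is my own illustration; the field names are placeholders, not the paper's notation.

```python
# Illustrative transition record for the RL framing described above.
from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    state: np.ndarray         # user features + context features
    action: np.ndarray        # features of the pushed news item
    click_reward: float       # immediate reward: did the user click?
    activeness_reward: float  # future reward: user activeness signal
    next_state: np.ndarray    # state at the user's next request
```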

Method

Model Framework

[Figure: DRN model framework, with offline and online parts]
The model is split into an offline part and an online part.
In the offline part, the model takes four classes of news-level and user-level features from the user logs as input and computes the DQN reward.
The online part has four steps (a sketch of the loop follows this list):
(1) PUSH: at time t1 the user u issues a request; the agent takes u and the news candidate pool as input and generates a top-k recommendation list L.
(2) FEEDBACK: when the user clicks, feedback information B is generated.
(3) MINOR UPDATE: after t1, the agent uses the user u, the generated list L, and the feedback B to compare the exploitation network Q with the exploration network Q', and keeps the better one.
(4) MAJOR UPDATE: after a fixed period Tr, the agent updates Q using the feedback B and samples of user activeness stored in memory (the memory stores recent click history and user activeness scores).
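To make the online loop concrete, here is a minimal sketch of the four steps. The names (`agent.recommend`, `minor_update`, `major_update`, the `env` and `memory` objects) and the one-hour update period are my own placeholders, not the paper's API; it only illustrates the control flow described above.

```python
# Illustrative sketch of DRN's online loop (names and period are assumptions).
import time

def online_loop(agent, memory, candidate_pool, env, k=10, major_period=3600.0):
    """env supplies user requests and click feedback; agent holds Q and Q'."""
    last_major = time.time()
    while True:
        # (1) PUSH: a user request arrives; build a top-k list L.
        user, t1 = env.wait_for_request()
        L = agent.recommend(user, candidate_pool, top_k=k)
        env.push(user, L)

        # (2) FEEDBACK: record which items in L the user clicked.
        B = env.collect_clicks(user, L)
        memory.store(user, L, B, t1)

        # (3) MINOR UPDATE: compare exploitation net Q with exploration net Q'
        # on this feedback and keep whichever performed better.
        agent.minor_update(user, L, B)

        # (4) MAJOR UPDATE: every Tr seconds, retrain Q from sampled memory
        # (recent click history plus user-activeness scores).
        if time.time() - last_major > major_period:
            agent.major_update(memory.sample())
            last_major = time.time()
```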

Feature construction

This part is quite important, but the paper does not go into much detail (a rough sketch of how these features might be assembled follows the list).
(1) News features: 417-dimensional one-hot features, covering headline, provider, ranking, entity name, category, topic category, and click counts in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year. Each click count takes 1 dimension, so the remaining fields account for 412 dimensions in total; the paper does not say how many dimensions each of them takes.
(2) User features: 413 × 5 dimensions: the headline, provider, ranking, entity name, category, and topic category of the news the user clicked in the last 1 hour, 6 hours, 24 hours, 1 week, and 1 year, plus the click count in each window.
(3) User-news features: 25 dimensions: the frequency of the category, topic category, and provider across all of the user's browsing history.
(4) Context features: 32 dimensions describing the context when a click happens, including time, weekday, and the freshness of the news (the gap between request time and news publish time).
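As a rough illustration (my own sketch; only the reported totals are used, since the per-field splits are not given), the state and action inputs could be assembled like this:

```python
# Rough sketch of how the four feature groups might be concatenated.
# Only the reported total dimensions are checked; per-field splits are unknown.
import numpy as np

def build_inputs(news_feat, user_feat, user_news_feat, context_feat):
    """news_feat: (417,), user_feat: (413*5,), user_news_feat: (25,), context_feat: (32,)."""
    assert news_feat.shape == (417,)
    assert user_feat.shape == (413 * 5,)
    assert user_news_feat.shape == (25,)
    assert context_feat.shape == (32,)
    # State = user features + context features.
    state = np.concatenate([user_feat, context_feat])
    # Action = news features + user-news interaction features.
    action = np.concatenate([news_feat, user_news_feat])
    return state, action
```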

Reward computation

The total reward can be written as:
$$Q(s, a) = r_{\text{immediate}} + \gamma\, r_{\text{future}}$$

$$y_{s,a,t} = r_{a,t+1} + \gamma\, Q\big(s_{a,t+1},\ \arg\max_{a'} Q(s_{a,t+1}, a'; W_t);\ W_t'\big)$$
Here s is the current state, $r_{a,t+1}$ is the immediate reward after action a is taken (there is always a delay in observing the reward, so the time index is t+1 rather than t), and $W_t$ and $W_t'$ are the parameters of Q and Q' respectively.
In this formula, at the next state $s_{a,t+1}$ the agent considers a set of candidate actions {a'}; the parameters W' corresponding to the a' that yields the maximum reward then replace the current W.
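For concreteness, here is a small sketch of how such a Double-DQN style target could be computed under my reading of the formula above. `q_net` and `q_prime_net` stand for Q (parameters $W_t$) and Q' (parameters $W_t'$), and the discount factor is an assumed value.

```python
# Sketch of a Double-DQN style target: q_net picks the best next action,
# q_prime_net evaluates it. gamma is an assumed discount factor.
import numpy as np

def ddqn_target(r_immediate, next_state, candidate_actions,
                q_net, q_prime_net, gamma=0.9):
    """candidate_actions: list of action feature vectors {a'} at s_{a,t+1}."""
    q_values = np.array([q_net(next_state, a) for a in candidate_actions])
    best_a = candidate_actions[int(np.argmax(q_values))]  # argmax under W_t
    future = q_prime_net(next_state, best_a)              # evaluated under W_t'
    return r_immediate + gamma * future
```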
The figure below shows how the features are fed in: the value function is built from the static user and context features, while the action (advantage) function is built from all of the features, static and dynamic.
[Figure: feature inputs to the value function and the action (advantage) function]
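A minimal toy version of that split (my own sketch with arbitrary hidden sizes, not the paper's architecture details): V(s) sees only the user and context features, while A(s, a) sees the full state-action vector.

```python
# Toy dueling-style Q network. Hidden sizes and initialization are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def mlp(in_dim, hidden=64):
    W1, b1 = rng.normal(size=(in_dim, hidden)) * 0.01, np.zeros(hidden)
    W2, b2 = rng.normal(size=(hidden, 1)) * 0.01, np.zeros(1)
    def forward(x):
        h = np.maximum(x @ W1 + b1, 0.0)   # ReLU
        return (h @ W2 + b2).item()
    return forward

state_dim, action_dim = 413 * 5 + 32, 417 + 25
value_net = mlp(state_dim)                    # V(s): user + context features only
advantage_net = mlp(state_dim + action_dim)   # A(s, a): all features

def q_value(state, action):
    return value_net(state) + advantage_net(np.concatenate([state, action]))

print(q_value(np.zeros(state_dim), np.zeros(action_dim)))   # example call
```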

Fitting user activeness

Whenever a user request occurs, that moment is marked as a user return (a small doubt of mine: what if the user refreshes several times in a row within a short period?).
The paper fits this with a survival model; I will simply screenshot the chain of integrals:
[Equations from the paper: survival-model derivation of user activeness]
After defining the variables and fitting, it becomes:
[Equation from the paper: fitted user activeness function]
The function roughly looks like the plot below, and its meaning then becomes fairly clear: the maximum never exceeds 1, the initial value is 0.5, and each jump marks a user request.
[Figure: user activeness over time, with jumps at user requests]
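To make the shape of that curve concrete, here is a small sketch of how such an activeness signal might behave. The decay rate and the size of the jump are assumptions on my part (the paper fits them with a survival model); only the qualitative behaviour follows the description above: start at 0.5, decay between requests, jump at each request, cap at 1.

```python
# Toy user-activeness curve: exponential decay between requests, a fixed
# boost at each request, clipped to 1.0. Decay rate and boost size are
# made-up numbers, not the paper's fitted values.
import math

def activeness(request_times, t, s0=0.5, boost=0.32, decay=0.1):
    """Activeness at time t (hours), given sorted request timestamps."""
    s, last = s0, 0.0
    for tr in request_times:
        if tr > t:
            break
        s = min(1.0, s * math.exp(-decay * (tr - last)) + boost)
        last = tr
    return s * math.exp(-decay * (t - last))

# Example: requests at hours 1, 2, and 10; query the activeness at hour 12.
print(round(activeness([1.0, 2.0, 10.0], 12.0), 3))
```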

Exploration

This section mainly explains how the DQN selects the candidate list L, as shown below:
[Figure: exploration with Dueling Bandit Gradient Descent (DBGD)]
The exploration network Q' is obtained by adding a small perturbation to the current network: ∆W = α · rand(−1, 1) · W and W' = W + ∆W. Q and Q' each produce a top-k candidate list, L and L', and the list L̂ that is actually pushed to the user is assembled by randomly picking items from L and L'. For example, if the first item of L̂ is taken from L, the next one is drawn at random from L'. This draw is not completely uniform: the chance of an item being picked is related to its predicted score, so higher-scoring items are more likely to be drawn (similar in spirit to TensorFlow's log_uniform_candidate_sampler).
After L̂ is pushed to the user, the user's clicks produce the feedback B. If the recommendations generated by Q' perform better than those from Q, then Q is replaced by Q'; this acts as a form of model selection.
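A rough sketch of this explore-and-compare step, under my own simplifications (uniform interleaving instead of score-weighted sampling, and a hard replacement of the weights when Q' wins):

```python
# Sketch of DBGD-style exploration: perturb the weights, interleave the two
# lists, and keep whichever network earned more clicks. Simplified: items are
# drawn uniformly from L or L' rather than score-weighted.
import random
import numpy as np

def explore_step(W, score_fn, candidates, clicks_fn, alpha=0.1, k=5):
    delta = alpha * np.random.uniform(-1, 1, size=W.shape) * W   # ∆W = α·rand(−1,1)·W
    W_prime = W + delta

    rank = lambda w: sorted(candidates, key=lambda c: -score_fn(w, c))[:k]
    L, L_prime = rank(W), rank(W_prime)

    # Interleave: each slot of the pushed list L̂ comes from L or L' at random.
    pushed = [random.choice(pair) for pair in zip(L, L_prime)]
    clicked = clicks_fn(pushed)                     # feedback B from the user

    # Minor update: keep the exploration weights if their items got more clicks.
    wins = lambda lst: sum(1 for c in clicked if c in lst)
    return W_prime if wins(L_prime) > wins(L) else W
```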

Experimental results

Metrics: precision@5 and nDCG@5.
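For reference, a small sketch of how these two metrics are typically computed for a ranked list (my own generic implementation, not the paper's evaluation code):

```python
# Generic precision@k and nDCG@k for a ranked recommendation list.
import math

def precision_at_k(ranked, relevant, k=5):
    return sum(1 for item in ranked[:k] if item in relevant) / k

def ndcg_at_k(ranked, relevant, k=5):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

# Example: items 'a' and 'c' are relevant among the top-5 recommendations.
print(precision_at_k(['a', 'b', 'c', 'd', 'e'], {'a', 'c'}))              # 0.4
print(round(ndcg_at_k(['a', 'b', 'c', 'd', 'e'], {'a', 'c'}), 3))
```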
[Figures and tables from the paper: experimental results]
