"Reinforcement Learning" Reading Notes 2: Multi-armed Bandits

Other 2018-09-22 10:17:01

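Copyright notice: this is an original article by the blogger. Discussion and sharing are welcome, but do not repost without the blogger's permission. https://blog.csdn.net/qjf42/article/details/79655483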

"Reinforcement Learning: An Introduction" Reading Notes - Table of Contents

Differences between Reinforcement Learning and Supervised Learning

evaluate vs instruct

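In other words, in RL the effect of an action is not black-and-white: the consequence (feedback, reward) may be different every time the action is taken
- Not i.i.d.: it depends on the environment and/or the previous actions
- The reward may be stochastic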

Problem definition (k-armed bandit problem)

k actions => k stationary distributions of the reward $R$
Objective
- $\max\ E(\sum_t R_t)$

Some concepts

`exploitation vs exploration (EE)`

exploitation: greedy move
exploration: nongreedy trial

reward & value

the value of an action $a$, denoted $q_*(a)$, is the expected reward given that $a$ is selected

i.e. $q_*(a) = E[R_t \mid A_t = a]$
Estimated by the sample average (empirical distribution):
- $Q_t(a) = \frac{\sum_{i=1}^{t-1} R_i \cdot 1_{A_i=a}}{\sum_{i=1}^{t-1} 1_{A_i=a}}$
- Incremental form (after taking some action $a$; here $t$ counts how many times $a$ has been taken): $Q_t(a) = Q_{t-1}(a) + \frac{1}{t}(R_t(a) - Q_{t-1}(a)) = Q_{t-1}(a) + \alpha(t)(R_t(a) - Q_{t-1}(a))$
More generally,
- $NewEstimate \leftarrow OldEstimate + StepSize \cdot (Target - OldEstimate)$
- Here, StepSize can be monotonically decreasing, constant (exponential smoothing), …
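
The incremental update can be tried out in a few lines of Python. This is only a sketch: the helper name `update_estimate`, the reward list, and the value `alpha = 0.5` are made up for illustration. It contrasts a $1/n$ step size, which reproduces the sample average, with a constant step size, which weights recent rewards more heavily.

```python
def update_estimate(q, reward, step_size):
    """Move the old estimate toward the observed reward by step_size."""
    return q + step_size * (reward - q)

# Hypothetical rewards observed for one action.
rewards = [1.0, 0.0, 2.0, 1.0]

# Step size 1/n on the n-th reward: equivalent to the sample average.
q_avg = 0.0
for n, r in enumerate(rewards, start=1):
    q_avg = update_estimate(q_avg, r, 1.0 / n)

# Constant step size: exponential recency-weighted average.
alpha = 0.5
q_exp = 0.0
for r in rewards:
    q_exp = update_estimate(q_exp, r, alpha)

print(q_avg, q_exp)
```

With the $1/n$ schedule the initial estimate is overwritten by the first reward, whereas with a constant step size it keeps a weight of $(1-\alpha)^n$ after $n$ updates.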

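Several methods

$\varepsilon$-greedy

Algorithm
- With probability $p = 1-\varepsilon$, take the greedy action (exploitation)
- With probability $p = \varepsilon$, take a nongreedy action (exploration)
Advantages
- Simple to implement
- Performance is never too bad, even when the distributions are nonstationary
Drawbacks
- Usually converges rather slowly
- With plain $\varepsilon$-greedy, the fraction of steps that take the optimal (greedy) action after convergence is $1-\varepsilon < 1$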
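Possible improvements
- Decrease $\varepsilon$ over time
- Choose a larger initial estimate $Q_1(a)$ (optimistic initial values)
  - encourage exploration; choose it large enough that all actions get covered
  - harmless even in the nonstationary case, because the effect is only temporary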
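
A minimal sketch of $\varepsilon$-greedy on a small stationary bandit may make the steps concrete. Everything about the environment is an assumption for illustration: Gaussian rewards with unit variance, the arm means in the usage line, and the function name `run_epsilon_greedy`. The exploratory branch here draws uniformly from all arms (the greedy one included), which is the common textbook variant and differs slightly from a strictly nongreedy choice.

```python
import random

def run_epsilon_greedy(true_means, epsilon, steps, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k   # value estimates Q(a)
    n = [0] * k     # how often each arm was pulled
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                    # explore: uniform over all arms
        else:
            a = max(range(k), key=lambda i: q[i])   # exploit: greedy arm, first on ties
        r = rng.gauss(true_means[a], 1.0)           # assumed Gaussian reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]                   # incremental sample average
        total += r
    return q, n, total / steps

q, n, avg_reward = run_epsilon_greedy([0.2, 0.5, 1.0], epsilon=0.1, steps=2000)
print(n, round(avg_reward, 3))
```

Passing a larger initial value instead of `0.0` for `q` would be one way to experiment with optimistic initial estimates.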

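UCB (Upper-Confidence-Bound)

Algorithm
- $A_t = \arg\max_a \left( Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)+\epsilon}} \right)$
- $\epsilon \rightarrow 0$ or $1$; it keeps the denominator nonzero while $N_t(a) = 0$
- $c$ is the parameter that balances EE (analogous to a confidence level)
Drawbacks
- Less widely applicable than $\varepsilon$-greedy, e.g. with nonstationary distributions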
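
The selection rule can be sketched as follows. The names `ucb_select` and `run_ucb`, the default `c = 2.0`, the tiny `eps`, and the Gaussian reward model are assumptions for illustration only. With a very small `eps`, an arm that has never been tried gets an enormous bonus once $t \ge 2$, so every arm is sampled early on.

```python
import math
import random

def ucb_select(q, n, t, c=2.0, eps=1e-8):
    """Pick argmax_a of Q(a) + c * sqrt(ln t / (N(a) + eps)); requires t >= 1."""
    scores = [q[a] + c * math.sqrt(math.log(t) / (n[a] + eps))
              for a in range(len(q))]
    return max(range(len(q)), key=lambda a: scores[a])

def run_ucb(true_means, steps, c=2.0, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    q = [0.0] * k
    n = [0] * k
    for t in range(1, steps + 1):
        a = ucb_select(q, n, t, c)
        r = rng.gauss(true_means[a], 1.0)   # assumed Gaussian reward
        n[a] += 1
        q[a] += (r - q[a]) / n[a]           # incremental sample average
    return q, n

q, n = run_ucb([0.2, 0.5, 1.0], steps=1000)
print(n)
```

A rarely tried arm can win the argmax even with a lower estimate, because its bonus term shrinks only as its own count grows.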

Gradient Bandit

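Algorithm
- Definitions
  - $H_t(a)$ is the preference for action $a$
  - $\pi_t(a) = P_t(A_t = a) = \mathrm{softmax}_t(H_t(a)) = \frac{e^{H_t(a)}}{\sum_b e^{H_t(b)}}$, not argmax
- Update
  - $H_{t+1}(A_t) = H_t(A_t) + \alpha (R_t - \bar R_t)(1 - \pi_t(A_t))$
  - $H_{t+1}(a) = H_t(a) - \alpha (R_t - \bar R_t)\pi_t(a), \text{ for all } a \ne A_t$
  - $\bar R_t$ is the average reward so far and serves as the baseline
- Derivation
  - $E(R_t) = \sum_x \pi_t(x) q_*(x)$
  - $H_{t+1}(a) = H_t(a) + \alpha \frac{\partial E(R_t)}{\partial H_t(a)} = \dots$
Advantages
- A general idea that carries over to the full RL problem later on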
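
The two update rules translate fairly directly into code. The sketch below is illustrative rather than definitive: the function names, `alpha = 0.1`, the Gaussian reward model, and the choice to use the running average of earlier rewards as the baseline $\bar R_t$ are all assumptions.

```python
import math
import random

def softmax(h):
    m = max(h)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in h]
    s = sum(exps)
    return [e / s for e in exps]

def gradient_bandit_step(h, action, reward, baseline, alpha):
    """Apply the preference update for the taken action and for all others."""
    pi = softmax(h)
    new_h = []
    for a in range(len(h)):
        if a == action:
            new_h.append(h[a] + alpha * (reward - baseline) * (1 - pi[a]))
        else:
            new_h.append(h[a] - alpha * (reward - baseline) * pi[a])
    return new_h

def run_gradient_bandit(true_means, steps, alpha=0.1, seed=0):
    rng = random.Random(seed)
    h = [0.0] * len(true_means)
    baseline = 0.0
    for t in range(1, steps + 1):
        pi = softmax(h)
        a = rng.choices(range(len(h)), weights=pi)[0]   # sample from the policy
        r = rng.gauss(true_means[a], 1.0)               # assumed Gaussian reward
        h = gradient_bandit_step(h, a, r, baseline, alpha)
        baseline += (r - baseline) / t                  # average of rewards seen so far
    return h, softmax(h)

h, pi = run_gradient_bandit([0.2, 0.5, 1.0], steps=2000)
print([round(p, 3) for p in pi])
```

Each update leaves the sum of the preferences unchanged, since the increase for the taken action is exactly offset by the decreases for the others.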

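Others

Bayesian methods (posterior sampling / Thompson sampling)

Assume the value follows some (unknown) stationary distribution $f$
Assume a (fixed) prior distribution $f_{pri}$; take a series of actions and, based on the results, obtain the posterior distribution $f_{post}$ (which converges to $f$)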
e.g
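
One common concrete case, sketched here under assumptions that are not in the notes: rewards are Bernoulli (0 or 1) and each arm starts with a uniform Beta(1, 1) prior. The function name `thompson_bernoulli` and the success probabilities in the usage line are hypothetical. At every step one value is sampled from each arm's posterior and the arm with the largest sample is played.

```python
import random

def thompson_bernoulli(true_probs, steps, seed=0):
    rng = random.Random(seed)
    k = len(true_probs)
    wins = [1] * k     # Beta alpha parameters, Beta(1, 1) prior
    losses = [1] * k   # Beta beta parameters
    for _ in range(steps):
        # One posterior sample per arm, then act greedily on the samples.
        samples = [rng.betavariate(wins[a], losses[a]) for a in range(k)]
        a = max(range(k), key=lambda i: samples[i])
        if rng.random() < true_probs[a]:   # simulated Bernoulli reward
            wins[a] += 1
        else:
            losses[a] += 1
    return wins, losses

wins, losses = thompson_bernoulli([0.3, 0.5, 0.7], steps=1000)
posterior_means = [w / (w + l) for w, l in zip(wins, losses)]
print(posterior_means)
```

Exploration comes from the posterior itself: arms with few observations have wide posteriors and so occasionally produce the largest sample.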

How to compare (parameters & algorithms)

learning curve
- x-axis: the parameter; y-axis: average sum of rewards (e.g. over 1000 experiments)
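
Such a comparison might be produced along these lines. This is a rough sketch with assumed settings: $\varepsilon$-greedy as the algorithm being tuned, a fresh random 10-armed Gaussian problem for each run, and hypothetical names `average_reward` and `parameter_study`. The returned dictionary maps each parameter value (x-axis) to the reward averaged over runs (y-axis).

```python
import random

def average_reward(epsilon, true_means, steps, rng):
    """Average reward per step of one epsilon-greedy run."""
    k = len(true_means)
    q, n, total = [0.0] * k, [0] * k, 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            a = max(range(k), key=lambda i: q[i])
        r = rng.gauss(true_means[a], 1.0)
        n[a] += 1
        q[a] += (r - q[a]) / n[a]
        total += r
    return total / steps

def parameter_study(epsilons, runs=200, k=10, steps=1000, seed=0):
    rng = random.Random(seed)
    results = {}
    for eps in epsilons:
        scores = []
        for _ in range(runs):
            # Each run gets its own randomly generated bandit problem.
            true_means = [rng.gauss(0.0, 1.0) for _ in range(k)]
            scores.append(average_reward(eps, true_means, steps, rng))
        results[eps] = sum(scores) / runs
    return results
```

Calling `parameter_study([0.01, 0.1, 0.3])` returns one averaged score per value of $\varepsilon$, which can then be plotted against the parameter.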

Other points

associative search (contextual bandits)

These are problems that involve different situations (environments), where the situation is still independent of former actions
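
A simple way to picture this is to keep a separate table of estimates for each situation. The sketch below assumes $\varepsilon$-greedy as the per-situation learner, Gaussian rewards, and situations drawn uniformly at random regardless of what was done before; the function name `run_contextual` and the context labels are hypothetical.

```python
import random

def run_contextual(true_means_by_context, epsilon, steps, seed=0):
    rng = random.Random(seed)
    contexts = list(true_means_by_context)
    k = len(true_means_by_context[contexts[0]])
    q = {c: [0.0] * k for c in contexts}   # one estimate table per situation
    n = {c: [0] * k for c in contexts}
    for _ in range(steps):
        c = rng.choice(contexts)           # situation does not depend on past actions
        if rng.random() < epsilon:
            a = rng.randrange(k)
        else:
            a = max(range(k), key=lambda i: q[c][i])
        r = rng.gauss(true_means_by_context[c][a], 1.0)
        n[c][a] += 1
        q[c][a] += (r - q[c][a]) / n[c][a]
    return q, n

q, n = run_contextual({"red": [1.0, 0.0], "green": [0.0, 1.0]},
                      epsilon=0.1, steps=500)
print(n)
```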

If actions are allowed to affect the next situation as well as the reward, then we have the full reinforcement learning problem.
