Markov Decision Processes (MDP) - Lecture Notes for CS188 (and ShanghaiTech CS181)

Copyright notice: this post is original work; if reposting, please cite the source: https://blog.csdn.net/liubai01/article/details/84441464

Note: these notes organize the basic concepts of our school's CS181 course (the slides borrow from Berkeley CS188). Since the course is taught and examined in English, English appears throughout.

Contents

 

1 Markov Decision Processes mechanics

1.1 Markov Decision Process definition

1.2 What "Markov" means

1.3 Optimal policy

1.4 MDP search tree

2 Solving MDPs

2.1 Optimal Quantities

2.2 Value of states

2.3 Value iteration

2.4 Policy iteration

Reference


1 Markov Decision Processes mechanics

1.1 Markov Decision Process definition

An MDP is defined by:

  • A set of states s ∈ S
  • A set of actions a ∈ A
  • A transition function T(s, a, s') = P(s' | s, a), the probability that taking action a in state s leads to state s'
  • A reward function R(s, a, s')
  • A start state, and possibly a terminal state

1.2 What "Markov" means

For Markov decision processes, "Markov" means that action outcomes depend only on the current state:

P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t, S_{t-1}=s_{t-1}, A_{t-1}=a_{t-1}, \cdots, S_0=s_0) = P(S_{t+1}=s' \mid S_t=s_t, A_t=a_t)

1.3 Optimal policy

For an MDP, we want an optimal policy \pi^*: S \mapsto A:

  • A policy π gives an action for each state
  • An optimal policy is one that maximizes expected utility if followed
  • An explicit policy defines a reflex agent

1.4 MDP search tree

Discounting: each time we descend a level of the search tree, we multiply in the discount γ once, so a reward R(s, a, s') received at depth t is effectively weighted by γ^t.
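As a toy illustration (the function name and reward sequence below are assumptions for this sketch, not course code), the discounted return of a reward sequence is the γ^t-weighted sum:

```python
# Discounted return of a reward sequence: G = sum_t gamma^t * r_t.
# Minimal sketch; the example rewards are illustrative assumptions.
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Each level deeper multiplies in one more factor of gamma:
print(discounted_return([1, 1, 1], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```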

2 Solving MDPs

2.1 Optimal Quantities

1. The value (utility) of a state s: V^*(s)=expected utility starting in s and acting optimally.

2. The value (utility) of a q-state (s,a): Q^*(s,a)=expected utility starting out having taken action a from state s and (thereafter) acting optimally

3. The optimal policy: \pi^*(s)=optimal action from state s

2.2 Value of states

The optimal values are defined by mutual recursion (the Bellman equations):

V^*(s)=\max_a Q^*(s, a)

Q^*(s, a)=\sum_{s'} T(s, a, s')[R(s, a, s')+\gamma V^*(s')]

Combining the two gives V^*(s)=\max_a \sum_{s'} T(s, a, s')[R(s, a, s')+\gamma V^*(s')].

2.3 Value iteration

1. Define V_k(s) to be the optimal value of s if the game ends in k more time steps. This time-limited value estimate is computed by iterating the Bellman update:

V_{k+1}(s)=\max_a \sum_{s'} T(s, a, s')[R(s, a, s')+\gamma V_k(s')]

2. Policy extraction: once the values converge, read off the optimal policy:

\pi^*(s)=\arg \max_a \sum_{s'} T(s, a, s')[R(s, a, s')+\gamma V^*(s')]=\arg \max_a Q^*(s, a)
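The update above can be sketched in code. The toy 3-state MDP below (states, actions, T, R) is an illustrative assumption, not from the course materials:

```python
# Value iteration on a tiny hypothetical MDP.
# T[(s, a)] lists (next_state, probability); R[(s, a, s2)] is the reward.
GAMMA = 0.9

states = ["A", "B", "END"]
actions = {"A": ["go", "stay"], "B": ["go"], "END": []}
T = {
    ("A", "go"): [("B", 0.8), ("A", 0.2)],
    ("A", "stay"): [("A", 1.0)],
    ("B", "go"): [("END", 1.0)],
}
R = {
    ("A", "go", "B"): 1.0, ("A", "go", "A"): 0.0,
    ("A", "stay", "A"): 0.1,
    ("B", "go", "END"): 10.0,
}

def value_iteration(tol=1e-6):
    V = {s: 0.0 for s in states}
    while True:
        V_new = {}
        for s in states:
            if not actions[s]:  # terminal state: no future reward
                V_new[s] = 0.0
                continue
            # Bellman update: V_{k+1}(s) = max_a sum_{s'} T[R + gamma * V_k(s')]
            V_new[s] = max(
                sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)])
                for a in actions[s]
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```

After convergence, policy extraction is a one-step look-ahead over the same sums, taking the argmax instead of the max.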

2.4 Policy iteration

Step 1: Policy evaluation: fix the current policy \pi_i and compute its values by iterating

V^{\pi_i}_{k+1}(s)=\sum_{s'} T(s, \pi_i(s), s')[R(s, \pi_i(s), s')+\gamma V^{\pi_i}_k(s')]

Step 2: Policy improvement: using the evaluated values V^{\pi_i}(s') from step 1, update the policy with a one-step look-ahead:

\pi_{i+1}(s)=\arg \max_a \sum_{s'} T(s, a, s')[R(s, a, s')+\gamma V^{\pi_i}(s')]

Policy iteration: repeat the two steps until the policy converges.
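A sketch of the two-step loop, on a hypothetical toy MDP (states, actions, T, R are illustrative assumptions):

```python
# Policy iteration: evaluate the current policy, then greedily improve it.
GAMMA = 0.9

states = ["A", "B", "END"]
actions = {"A": ["go", "stay"], "B": ["go"], "END": []}
T = {
    ("A", "go"): [("B", 0.8), ("A", 0.2)],
    ("A", "stay"): [("A", 1.0)],
    ("B", "go"): [("END", 1.0)],
}
R = {
    ("A", "go", "B"): 1.0, ("A", "go", "A"): 0.0,
    ("A", "stay", "A"): 0.1,
    ("B", "go", "END"): 10.0,
}

def policy_evaluation(pi, tol=1e-8):
    # Iteratively solve V^pi(s) = sum_{s'} T(s, pi(s), s')[R + gamma * V^pi(s')]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if not actions[s]:  # terminal state
                continue
            a = pi[s]
            v = sum(p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def policy_iteration():
    pi = {s: actions[s][0] for s in states if actions[s]}  # arbitrary start
    while True:
        V = policy_evaluation(pi)
        # Improvement: pi_{i+1}(s) = argmax_a sum_{s'} T[R + gamma * V^{pi_i}(s')]
        stable = True
        for s in states:
            if not actions[s]:
                continue
            best = max(actions[s], key=lambda a: sum(
                p * (R[(s, a, s2)] + GAMMA * V[s2]) for s2, p in T[(s, a)]))
            if best != pi[s]:
                pi[s] = best
                stable = False
        if stable:  # policy unchanged: converged
            return pi, V
```

Unlike value iteration, each outer loop does a full (here, iterative) evaluation of a fixed policy before improving it, and the loop stops as soon as the policy stops changing.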

Reference

1. Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 3rd Edition. Chapter 17.

2. UC Berkeley, CS188. Lecture 13: Markov Decision Processes.
