RL-Zhao-(1): Basic concepts [state value (v), action value (q), policy (π), reward, return, trajectories, episode]


1.1 A grid world example

Consider the example shown in Figure 1.2, where a robot moves in a grid world. The robot, called the agent, can move between adjacent cells in the grid, and at each time step it occupies exactly one cell. The white cells are accessible, and the orange cells are forbidden. There is a target cell that the robot would like to reach. We will use such grid world examples throughout this book since they are intuitive for illustrating new concepts and algorithms.

The ultimate goal of the agent is to find a “good” policy that enables it to reach the target cell when starting from any initial cell. How can the “goodness” of a policy be defined? The idea is that the agent should reach the target without entering any forbidden cells, taking unnecessary detours, or colliding with the boundary of the grid.
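To make the setup concrete, here is a minimal sketch of such a grid world in Python. The 3x3 layout and the positions of the forbidden and target cells are assumptions chosen for illustration; the actual layout is the one shown in Figure 1.2.

```python
# Minimal grid-world sketch. The 3x3 layout and the positions of the
# forbidden cell and target cell below are illustrative assumptions,
# not taken from Figure 1.2.
FREE, FORBIDDEN, TARGET = 0, 1, 2

grid = [
    [FREE, FREE,      FREE],
    [FREE, FORBIDDEN, FREE],
    [FREE, FREE,      TARGET],
]

def is_forbidden(row, col):
    """Cells the agent should avoid entering."""
    return grid[row][col] == FORBIDDEN

def is_target(row, col):
    """The cell the agent would like to reach."""
    return grid[row][col] == TARGET
```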

1.2 State and action

The first concept to be introduced is the state, which describes the agent’s status with respect to the environment.

In the grid world example, the state corresponds to the agent's location. Since there are nine cells, there are nine states as well. They are indexed as s1, s2, ..., s9, as shown in Figure 1.3(a). The set of all the states is called the state space, denoted as S = {s1, ..., s9}.
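As a small sketch, the nine states can be enumerated in Python as follows. The row-major indexing (s1 in the top-left cell, s9 in the bottom-right) is assumed to match Figure 1.3(a), and the (row, col) coordinate convention is an illustrative choice.

```python
# Index the 3x3 grid cells as states s1..s9 in row-major order
# (s1 = top-left, s9 = bottom-right), assumed to follow Figure 1.3(a).
n_rows, n_cols = 3, 3

# State space S = {s1, ..., s9}, each state mapped to its (row, col) cell.
state_space = {
    f"s{row * n_cols + col + 1}": (row, col)
    for row in range(n_rows)
    for col in range(n_cols)
}

print(state_space["s1"])  # (0, 0): the top-left cell
print(state_space["s9"])  # (2, 2): the bottom-right cell
```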

1.3 State transition

1.4 Policy

1.5 Reward

1.6 Trajectories, returns, and episodes

1.7 Markov decision processes

