Getting Started with Reinforcement Learning: Q-learning

1. Introduction

This article contains my recent study notes on reinforcement learning (Q-learning), written mainly as a summary and daily record. It covers the necessary introductory knowledge.

Without further ado, let's get started!

2. Concept

We have all experienced situations like the following when we were young: we did something wrong and were punished. After learning from it, we avoid making the same mistake when we encounter a similar situation again. Likewise, good behaviors are often rewarded, which encourages us to repeat them on more occasions.

Similarly, a reinforcement learning agent takes certain actions according to its policy and receives positive or negative feedback (reward) depending on whether the action taken was beneficial. This reward is then used to update the policy, and the whole process repeats until an optimal policy is reached, as shown in Figure 1.
[Figure 1: the agent-environment interaction loop]

The goal of a reinforcement learning agent is to optimize the actions it takes, so as to obtain the highest possible reward through continuous interaction with the dynamic environment (Env).

3. Model-Free and Model-Based

Before explaining and implementing the Q-learning algorithm, we need to note that RL algorithms fall into two broad categories: Model-Based algorithms and Model-Free algorithms.

The purpose of Model-Based algorithms is to learn a model of the environment through interaction with it, so that the agent can predict the return of a given action before taking it (by building an environment model, it can predict what will happen after each action) and thus plan its actions. Model-Free algorithms, on the other hand, must take actions, observe their consequences, and then learn from them (see Figure 2). It is important to note that the term "model" here does not refer to a machine learning model, but to a model of the environment itself.

[Figure 2: model-based vs. model-free learning]

Strictly speaking, Q-learning is a Model-Free algorithm, because its learning consists of taking actions, receiving rewards, and continuously improving based on the results of those actions.

4. Q-learning

The Q-learning algorithm uses a Q-table (a 2D matrix) of state-action pairs, where each value Q(S, a) in the matrix is an estimate of the Q-value of taking action a in state S (the Q-value will be introduced later). As the agent interacts with the environment (Env) and iterates toward the optimal policy, the Q-values in the Q-table gradually converge to their optimal values.

All these terms can be confusing at first, and it is normal to have many questions: what is a Q-value? How is the Q-table constructed? How is the Q-value updated?

Next, let's gradually introduce the above concepts and corresponding problems!

5. Build Q-Table

As described above, the Q-table is a matrix where each element corresponds to a state-action pair. Thus, the Q-table is an m x n matrix, where m is the number of possible states and n is the number of possible actions. The Q-values must be given initial values; generally, all entries of the Q-table are initialized to zero.
Example:
For simplicity, assume that the environment (Env) is a room with 4 possible states (a, b, c, d), as shown in the image below. Let's also assume that the agent can perform 4 possible actions: up, down, left, and right.
[Image: a room environment with four states a, b, c, d]
Considering the above agent and environment, the Q-table will be a 4x4 matrix, with 4 rows corresponding to the 4 possible states and 4 columns corresponding to the 4 possible actions. As shown below, all values have been initialized to zero.
[Image: 4x4 Q-table initialized to zeros]
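As a minimal sketch of this setup (assuming the room example above; the state and action names are just illustrative labels, and NumPy is used for the matrix), the Q-table can be built like this:

```python
import numpy as np

# States and actions from the room example above (illustrative labels).
states = ["a", "b", "c", "d"]               # 4 possible states
actions = ["up", "down", "left", "right"]   # 4 possible actions

# Q-table: one row per state, one column per action, all values initialized to zero.
q_table = np.zeros((len(states), len(actions)))
print(q_table)  # 4x4 matrix of zeros
```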

6. Q-Values

Once the Q-table is initialized, the agent can start interacting with the environment (Env) and updating the Q-values to reach an optimal policy. But how is the Q-value updated?

First, it is important to introduce the concept of a value function. In general, a value function measures the benefit an agent can obtain from being in a given state, or from a given state-action pair.

There are two types of value functions: the State-Value Function, v(S), which measures the benefit of following a certain policy from a given state; and the Action-Value Function, q(S, A), which measures the benefit of taking a certain action from a given state while following a certain policy. More specifically, these functions return the expected benefit of following a given policy starting from a state (for the state-value function) or from a state-action pair (for the action-value function). The diagram is as follows:

[Image: state-value and action-value functions]

The output of the Action-Value Function is called the Q-value; as mentioned earlier, it is the basic unit that makes up the Q-table. Thus, the Q-table gives us the expected benefit of taking a certain action from a certain state, i.e. the information the agent will use to act optimally in the environment (Env). The agent's goal is therefore to iteratively find the optimal action-value function q*(S, A), the one that returns the highest possible reward from any state-action pair under any policy.
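For reference, the standard definitions of these two functions can be sketched as follows (with π the policy, γ the discount factor, and R the reward):

$$
v_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s\right],
\qquad
q_{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \,\middle|\, S_t = s,\, A_t = a\right]
$$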

7. Updating the Q-Values

One property of the optimal action-value function q*(S, A) is that it must satisfy the Bellman optimality equation, shown below. The Bellman equation can be used to iteratively approximate the optimal action-value function, which is the agent's main goal.
[Image: Bellman optimality equation]
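For reference, the standard form of this equation can be sketched as (with γ the discount factor):

$$
q_{*}(s, a) = \mathbb{E}\!\left[R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') \,\middle|\, S_t = s,\, A_t = a\right]
$$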
In the case of Q-learning, an adaptation of the Bellman optimality equation, shown in the figure below, is used to iteratively update the Q-values in the Q-table. At each iteration, this equation reduces the error between the current Q-value and the target Q-value, gradually bringing the two into agreement. Note that the update equation uses a parameter α called the learning rate, which weights the newly computed Q-value at each update/iteration. In practice, a good value for this parameter is usually found by trial and error.
[Image: Q-learning update rule with learning rate α]
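The update referenced above is commonly written as follows (a sketch of the standard tabular Q-learning update, with α the learning rate and γ the discount factor):

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t)\right]
$$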

8. Algorithm process

Now that all the components and steps of the algorithm have been explained, it is time to put it all together and let the agent learn. The following pseudocode of the Q-learning algorithm will serve as a reference during implementation.
[Image: Q-learning pseudocode]
The process is as follows:

  1. Initialize the Q-Table
    The shape of the Q-table depends on the number of possible states and actions, and all its values are set to zero, as previously described.
  2. Train for each episode
    In every training episode, the agent needs to reach the goal state. The agent starts from a random state and, at each step of the episode, does the following (see the sketch after this list):
    a) Take an action according to the policy (the most commonly used choice is an ε-greedy policy).
    b) Compute a new Q-value from the new state reached and the reward obtained, according to the Q-value update equation mentioned earlier.
    c) Continue iterating from the newly reached state to the next step.
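Below is a minimal Python sketch of this training loop. The environment object `env`, its `reset()` method, and its `step(action)` method (assumed here to return the next state, the reward, and a done flag) are hypothetical stand-ins rather than a specific library API; the sketch only illustrates the pseudocode above.

```python
import numpy as np

def train_q_learning(env, n_states, n_actions, n_episodes=1000,
                     alpha=0.1, gamma=0.99, epsilon=0.1):
    """Sketch of the tabular Q-learning loop described above."""
    q_table = np.zeros((n_states, n_actions))       # 1. initialize the Q-table to zeros
    for _ in range(n_episodes):                     # 2. train for each episode
        state = env.reset()                         # start from an initial state
        done = False
        while not done:
            # a) choose an action with an epsilon-greedy policy
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)    # explore
            else:
                action = int(np.argmax(q_table[state]))  # exploit
            # b) take the action, observe the reward and the new state
            next_state, reward, done = env.step(action)
            #    then update Q(s, a) with the Q-learning update rule
            td_target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            # c) continue from the newly reached state
            state = next_state
    return q_table
```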

Training ends when all episodes are done. At this point, the Q-values in the Q-table will be optimal (assuming training went well), which means that in each state S, choosing the action A with the highest Q-value yields the maximum reward. Finally, to use a trained agent outside of training, it only needs to choose the action with the highest Q-value at each step, because the Q-table was already optimized during training.
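As a usage sketch (same hypothetical `env` interface as above), running a trained agent then amounts to always picking the action with the highest Q-value in the current state:

```python
import numpy as np

def run_trained_agent(env, q_table, max_steps=100):
    """Follow the greedy policy implied by a trained Q-table (sketch)."""
    state = env.reset()
    for _ in range(max_steps):
        action = int(np.argmax(q_table[state]))   # action with the highest Q-value
        state, reward, done = env.step(action)    # hypothetical env interface
        if done:
            break
    return state
```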

9. Summary

This article has focused on the theory of the Q-learning algorithm and its related concepts rather than on a complete code implementation. In the future, I plan to provide concrete code examples. If you are interested, please stay tuned.
