1. Introduction
This article contains my recent study notes on reinforcement learning, written mainly as a summary and a daily record. It covers the introductory knowledge needed to understand Q-learning.
Without further ado, let's get started!
2. Concept
We have all experienced situations like the following when we were young: we did something wrong and were punished. Having learned from it, we avoided making the same mistake when similar situations came up again. Likewise, good behavior was often rewarded, which encouraged us to repeat it on more occasions.
Similarly, a reinforcement learning agent takes actions according to its policy and receives positive or negative feedback, a reward, depending on whether the action taken was beneficial. This reward is then used to update the policy, and the whole process is repeated until an optimal policy is reached, as shown in Figure 1. The goal of a reinforcement learning agent is to optimize its actions so as to achieve the highest possible reward through the agent's continuous interaction with the dynamic environment Env.
[Figure 1: the agent–environment loop, with the agent sending an action to Env and receiving a reward in return.]
3. Model-Free and Model-Based
Before explaining and implementing the Q-learning algorithm, we should note that RL algorithms fall into two broad categories: model-based algorithms and model-free algorithms.
The purpose of a model-based algorithm is to learn a model of the environment through interaction with it, so that the agent can predict the return of a given action before taking it (by building an environment model, it can predict what will happen after each action) and thus plan its actions ahead of time. Model-free algorithms, on the other hand, must take actions, observe the consequences, and learn from them (see Figure 2). It is important to note that the term "model" here does not refer to a machine learning model, but to a model of the environment itself.
Strictly speaking, Q-learning is a model-free algorithm, because its learning consists of taking actions, receiving rewards, and continuously improving from the results of those actions.
4. Q-learning
The Q-learning algorithm uses a Q-table (a 2D matrix) of state-action pairs, in which each value Q(S, a) is an estimate of the Q-value of taking action a in state S (Q-values will be introduced later). As the agent interacts with the environment Env, the Q-values in the Q-table keep converging toward their optimal values while the algorithm iterates to find the optimal policy.
Understanding all these terms is complicated at first, and it is normal to have many questions, such as: What is a Q-value? How is the Q-table constructed? How are Q-values updated?
Next, let's introduce these concepts and answer these questions step by step!
5. Build Q-Table
As stated above, the Q-table is a matrix in which each element corresponds to a state-action pair. Thus, the Q-table is an m x n matrix, where m is the number of possible states and n is the number of possible actions. The Q-values of the Q-table must be given initial values, and generally all of them are initialized to zero.
Example:
For simplicity, assume that the environment Env is a room with 4 possible states (a, b, c, d), as shown in the image below. At the same time, assume that the agent can perform 4 possible actions: up, down, left, and right.
Given this agent and environment, the Q-table is a 4x4 matrix, with 4 rows corresponding to the 4 possible states and 4 columns corresponding to the 4 possible actions. As shown below, all values have been initialized to zero.
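As a quick illustration of this example, here is a minimal sketch in Python (the numpy representation and the state/action name lists are my own choices for illustration, not something prescribed by the article):

```python
import numpy as np

# The 4 states of the example room and the 4 possible actions.
states = ["a", "b", "c", "d"]
actions = ["up", "down", "left", "right"]

# Q-table: one row per state, one column per action, all zeros initially.
q_table = np.zeros((len(states), len(actions)))
print(q_table)
# [[0. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]
#  [0. 0. 0. 0.]]
```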
6. Q-Values
Once the Q-table is initialized, the agent can start interacting with the environment Env and updating the Q-values to reach an optimal policy. But how are Q-values updated?
First, it is important to introduce the concept of the value function. In general, a value function measures the benefit the agent can obtain from being in a given state, or from a given state-action pair.
There are two types of value functions: the state-value function v(S), which gives the benefit of following a certain policy from a certain state; and the action-value function q(S, A), which gives the benefit of taking a certain action from a certain state while following a certain policy. More specifically, these functions return the expected benefit of following a given policy starting from a state (for state-value functions) or from a state-action pair (for action-value functions). The diagram is as follows:
The output of the action-value function is called the Q-value and, as mentioned earlier, it is the basic unit that makes up the Q-table. Thus, the Q-table gives us the expected benefit of taking a certain action from a certain state, i.e. the information the agent uses to act optimally in the environment Env. The agent's goal is therefore to iteratively find the optimal action-value function q*(S, A), which returns the highest possible reward from any state-action pair under any policy.
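The diagram itself is not reproduced here, but the standard textbook definitions behind these two functions can be written out explicitly. This is the usual Sutton & Barto formulation, assuming a discounted return G_t; it is assumed, not confirmed, to match the missing diagram:

```latex
% Value functions under a policy \pi, with discounted return G_t
v_{\pi}(s)   = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s \,\right]
             = \mathbb{E}_{\pi}\!\left[\, \textstyle\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s \,\right]
q_{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\, G_t \mid S_t = s,\, A_t = a \,\right]

% The optimal action-value function the agent is after
q_{*}(s,a)   = \max_{\pi}\; q_{\pi}(s,a)
```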
7. Q-Value Update
One property of the optimal action-value function q*(S, A) is that it must satisfy the Bellman optimality equation, shown below. The Bellman equation can therefore be used to iterate toward the optimal action-value function, which is the agent's main goal.
In the case of Q-learning, an adaptation of the Bellman optimality equation, shown in the figure below, is used to iteratively update the Q-values in the Q-table. At each iteration, this equation reduces the error between the current Q-value and the target Q-value, gradually bringing the two together. Note that the update equation uses a parameter α called the learning rate, which weights the newly computed Q-value at each update/iteration. In practical experiments, the ideal value of this parameter is usually found by trial and error.
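The two equations referenced above appear as figures in the original. Their standard textbook forms, which the figures presumably show (γ is the discount factor, α the learning rate), are:

```latex
% Bellman optimality equation for the action-value function
q_{*}(s,a) = \mathbb{E}\!\left[\, R_{t+1} + \gamma \max_{a'} q_{*}(S_{t+1}, a') \;\middle|\; S_t = s,\, A_t = a \,\right]

% Q-learning update rule derived from it
Q(S_t, A_t) \leftarrow Q(S_t, A_t)
  + \alpha \left[\, R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \,\right]
```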
8. Algorithm process
Now that all the components and steps of the algorithm have been explained, it's time to put it all together and let the agent learn. The following is the pseudocode of the Q-learning algorithm, which will be used as a reference during implementation.
The process is as follows:
- Initialize the Q-table. Its shape depends on the number of possible states and actions, and all of its values are set to zero, as previously described.
- Train for a number of episodes. In every episode, the agent needs to reach the goal state. The agent starts from a random state, and at each step of the episode it does the following (see the sketch after this list):
  a) Take an action according to the policy (most commonly an ε-greedy policy).
  b) Compute a new Q-value from the new state reached and the reward obtained, according to the Q-value update equation mentioned earlier.
  c) Continue iterating from the newly reached state to the next step.
- Training ends when all episodes are done. At this point, the Q-values of the Q-table will be optimal (as long as the training worked), which means that in each state S, the agent obtains the maximum reward by choosing the action A with the highest Q-value.
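Below is a minimal sketch of this training loop in Python. The environment interface (reset() returning a state, step(action) returning the next state, the reward, and a done flag) and all hyperparameter values are assumptions for illustration, not something specified in the original article:

```python
import numpy as np

def train_q_learning(env, n_states, n_actions,
                     episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> state and
    step(action) -> (next_state, reward, done); states/actions are integers."""
    q_table = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()                 # starting state for this episode
        done = False
        while not done:
            # a) epsilon-greedy policy: explore with probability epsilon,
            #    otherwise exploit the current Q-table.
            if np.random.random() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(q_table[state]))
            # b) take the action, observe the outcome, apply the update:
            #    Q(S,A) <- Q(S,A) + alpha * (R + gamma * max_a Q(S',a) - Q(S,A))
            next_state, reward, done = env.step(action)
            td_target = reward + gamma * np.max(q_table[next_state])
            q_table[state, action] += alpha * (td_target - q_table[state, action])
            # c) continue from the newly reached state
            state = next_state
    return q_table
```

Any environment with discrete integer states and actions that exposes this interface would work; the 4-state room from the earlier example is one such case.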
Finally, to use a trained agent outside of training, it is only necessary to make it choose the action with the highest Q-value at each step, since the Q-table has already been optimized during training.
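Under the same assumed environment interface, exploitation then reduces to a purely greedy rollout, sketched below:

```python
import numpy as np

def run_greedy(env, q_table, max_steps=100):
    """Act greedily with respect to a trained Q-table (no exploration)."""
    state = env.reset()
    for _ in range(max_steps):
        action = int(np.argmax(q_table[state]))  # pick the highest-Q action
        state, reward, done = env.step(action)
        if done:
            break
```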
9. Summary
This article has focused on the theory of the Q-learning algorithm and its related concepts, without covering its implementation in code. In the future, I plan to provide concrete code examples. If you are interested, please stay tuned!