Deep Reinforcement Learning [1] - Essential Basics for Getting Started with Reinforcement Learning (Including Python Maze Game Solving Examples)


Essential foundation for getting started with reinforcement learning

1. Reinforcement Learning and Machine Learning

Machine learning is one way of implementing artificial intelligence. It can be summarized as the process of determining the parameters of an entire system from a collection of input data.

Machine learning can be divided into supervised learning, semi-supervised learning, unsupervised learning and reinforcement learning.

1.1 Supervised learning

Supervised learning is a common learning method in which a model is trained using a labeled dataset. In supervised learning, each data point consists of an input vector and a label. For example, we can train a model using labeled images, where each image is marked as containing a specific object or category, and then let the model learn to recognize similar objects.

In supervised learning, we train a model to learn how to map input data to the correct output labels. During training, the model adjusts its weights and parameters by comparing its predictions with the correct answers. Once trained, the model can be applied to a new dataset and its outputs used as predictions. Common supervised learning algorithms include decision trees, random forests, support vector machines, logistic regression, and neural networks, among others.
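As a concrete illustration of this workflow, here is a minimal sketch assuming scikit-learn is installed; the tiny dataset and its labels are invented for the example:

from sklearn.linear_model import LogisticRegression

# Toy labeled dataset (made up for this example): the label is 1 when the two
# feature values are both fairly large.
X = [[0.1, 0.2], [0.9, 0.8], [0.2, 0.3], [0.7, 0.9], [0.4, 0.1], [0.8, 0.6]]
y = [0, 1, 0, 1, 0, 1]

model = LogisticRegression().fit(X, y)            # adjust weights against the labels
print(model.predict([[0.9, 0.9], [0.1, 0.1]]))    # predictions for new, unseen inputs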

1.2 Semi-supervised learning

Semi-supervised learning is a learning method in machine learning that combines supervised and unsupervised learning. In semi-supervised learning, we have some labeled data and some unlabeled data, and our goal is to use this data to train a model so that we can classify or make predictions on new data.

In semi-supervised learning, we can use unsupervised learning algorithms to utilize unlabeled data to discover patterns and structures in the data, thereby improving the performance of the model. For example, we can use clustering algorithms to group unlabeled data, and then use these groups to help us classify labeled data. Alternatively, we can use semi-supervised learning algorithms to automatically label unlabeled data and use it along with labeled data to train models.

Semi-supervised learning is useful when labeled data is scarce or expensive to obtain, for example in medical image recognition or speech recognition. Because semi-supervised learning exploits the information in unlabeled data, it can improve the accuracy of the model and reduce the cost of training. Common semi-supervised learning algorithms include semi-supervised support vector machines, semi-supervised clustering, and graph-based semi-supervised learning.
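The self-training (pseudo-labeling) idea described above can be sketched as follows, again assuming scikit-learn is installed; the labeled and unlabeled data are invented for the example:

import numpy as np
from sklearn.linear_model import LogisticRegression

# A small labeled set and a larger unlabeled set (both made up for this sketch).
X_labeled = np.array([[0.1, 0.2], [0.9, 0.8], [0.2, 0.1], [0.8, 0.9]])
y_labeled = np.array([0, 1, 0, 1])
X_unlabeled = np.random.rand(20, 2)

# 1) Train on the labeled data only.
model = LogisticRegression().fit(X_labeled, y_labeled)

# 2) Use the current model to assign pseudo-labels to the unlabeled data.
pseudo_labels = model.predict(X_unlabeled)

# 3) Retrain on the labeled and pseudo-labeled data together.
model = LogisticRegression().fit(np.vstack([X_labeled, X_unlabeled]),
                                 np.concatenate([y_labeled, pseudo_labels]))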

1.3 Unsupervised Learning

In machine learning, unsupervised learning is a learning method in which a model is trained using an unlabeled dataset without the use of labels or indicator variables. In unsupervised learning, our goal is to discover structure and patterns in data so that we can better understand and interpret the data.

In unsupervised learning, we typically use algorithms such as clustering, dimensionality reduction, anomaly detection, and association rules to discover patterns and structures in data. Clustering algorithms are used to group data points into clusters with similar characteristics. Dimensionality reduction algorithms are used to convert high-dimensional datasets into low-dimensional representations so that we can better visualize and understand the data. Anomaly detection algorithms are used to identify outliers that are different from other data points in the data. Association rule algorithms are used to discover associations in data, such as market basket analysis.

Unsupervised learning is very useful in many situations, especially when we don't know the correct answer or label. It can help us discover potential patterns and structures in data, thereby improving the efficiency and accuracy of data analysis. Common unsupervised learning algorithms include k-means clustering, principal component analysis, autoencoders, density clustering, and association rules, etc.
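As a small illustration, the following sketch clusters invented, unlabeled 2-D points with k-means, assuming scikit-learn is installed:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled toy data: two clouds of points, with no labels attached.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])

kmeans = KMeans(n_clusters=2, n_init=10).fit(X)
print(kmeans.labels_[:10])        # cluster index assigned to each point
print(kmeans.cluster_centers_)    # the two discovered cluster centres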

1.4 Reinforcement Learning

In machine learning, reinforcement learning is a method for training an agent to perform tasks in a particular environment. In reinforcement learning, the agent interacts with the environment and learns from the actions it performs and the rewards it receives. The goal is for the agent to learn a policy that maximizes long-term reward.

In reinforcement learning, we define an environment, and the agent decides which actions to perform by observing the state of the environment. After an action is performed, the environment gives the agent a reward or penalty as feedback on the quality of that action. Through this interaction, the agent learns a policy, a mapping from states to actions, that maximizes long-term reward. This is typically achieved by using a value function to estimate the expected long-term reward of the current state and updating the policy accordingly.

Reinforcement learning is very useful in many application scenarios, such as in games, robot control, natural language processing, etc. Common reinforcement learning algorithms include Q-learning, policy gradient, deep reinforcement learning, etc.
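For instance, Q-learning maintains a table Q(s, a) of action values and improves it from observed transitions. A minimal sketch of one tabular Q-learning update follows; the state, action, and numbers are illustrative and not taken from the article:

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference update of a tabular Q function."""
    td_target = r + gamma * np.max(Q[s_next])     # estimated long-term reward
    Q[s, a] += alpha * (td_target - Q[s, a])      # move Q(s, a) toward the target
    return Q

# Example: 5 states, 3 actions, one observed transition (s=0, a=2, r=1, s'=1).
Q = np.zeros((5, 3))
Q = q_learning_update(Q, s=0, a=2, r=1.0, s_next=1)
print(Q[0])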

1.5 Deep Learning

Deep learning is a machine learning method that uses artificial neural networks to learn representations of input data so that tasks such as classification, regression, clustering, generation, etc. can be performed. Neural networks in deep learning typically contain multiple layers, each of which performs some simple computation and passes its results to the next layer.

The main advantage of deep learning is that it can automatically learn feature representation by training on a large amount of labeled data, and achieve efficient learning without manual feature extraction. Compared with traditional machine learning methods, deep learning performs better when dealing with large, high-dimensional data, such as images, audio, natural language processing, etc. Deep learning has been widely used in many fields, such as computer vision, speech recognition, natural language processing, machine translation, recommendation system, etc.

The core of deep learning is the neural network model, the most common of which are the convolutional neural network (CNN) and the recurrent neural network (RNN). CNNs are mainly used to process image and video data, while RNNs are better suited to sequence data such as text and speech. In addition, other deep learning architectures, such as the autoencoder, the generative adversarial network (GAN), and the variational autoencoder (VAE), are also widely used in different fields.
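As a minimal illustration of the layered structure described above, here is a sketch of a tiny fully connected network, assuming PyTorch is installed; the layer sizes are arbitrary:

import torch
import torch.nn as nn

# Each layer performs a simple computation (a linear map plus a nonlinearity)
# and passes its result to the next layer.
model = nn.Sequential(
    nn.Linear(784, 128),   # input layer -> hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),    # hidden layer -> 10 class scores
)

x = torch.randn(32, 784)   # a batch of 32 random inputs flattened to vectors
scores = model(x)          # forward pass; shape (32, 10)
print(scores.shape)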

[Figure: the relationship among the learning methods described above]

2. Some concepts in reinforcement learning

2.1 Agent, Action, State

Take the Super Mario game as an example. The entity that carries out actions in the game is called the agent (Agent).


Each frame of the game screen is a state $s$.


The actions that Mario can take are denoted $a$, with $a \in \{\text{left}, \text{right}, \text{up}\}$.


2.2 Policy function, reward

The policy function gives the probability of Mario's next action. It is written $\pi(a \mid s)$ and takes values between 0 and 1; after observing the state, the action is sampled randomly according to these probabilities:

$$\pi(a \mid s) = P(A = a \mid S = s)$$

For example, in a given state the probabilities might be:

$$\pi(\text{left} \mid s) = 0.2, \quad \pi(\text{right} \mid s) = 0.1, \quad \pi(\text{up} \mid s) = 0.7$$
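In code, sampling an action from such a policy can be sketched with numpy as follows (the probabilities are the illustrative values above):

import numpy as np

actions = ["left", "right", "up"]
pi_s = np.array([0.2, 0.1, 0.7])          # pi(a|s) for the current state s
a = np.random.choice(actions, p=pi_s)     # randomly sample the next action
print(a)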

Reinforcement learning is a machine learning method whose objective is reward, and its idea is modeled on how living organisms learn from experience. It has no labeled data, so the reward is a crucial signal: the ultimate goal of reinforcement learning is to maximize the total reward, and the design of the reward guides the direction of the whole learning process.

In this scenario, the reward $R$ could be designed as follows (a toy implementation is sketched after the list):

  • Collecting a gold coin: $R = +1$
  • Winning the game: $R = +10000$
  • Touching an enemy (without stomping on it): $R = -10000$
  • Nothing happens: $R = 0$
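A toy implementation of this reward design might look like the sketch below; the event names are invented for illustration:

# The reward values are the ones chosen in this example, not a fixed standard.
def reward(event):
    table = {"coin": 1, "win": 10000, "hit_enemy": -10000, "nothing": 0}
    return table[event]

print(reward("coin"), reward("hit_enemy"))   # -> 1 -10000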

2.3 State transition

The process of moving from an old state to a new state is called a state transition. The transition depends on the chosen action: after the agent randomly samples and executes an action, the current state changes.


The state transition is random, and the randomness comes from the environment. Denote the old state by $s$ and the new state by $s'$; the state transition function $p$ satisfies:

$$p(s' \mid s, a) = P(S' = s' \mid S = s, A = a)$$
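A sketch of such a stochastic transition function for a tiny, made-up 3-state environment (all probabilities are illustrative):

import numpy as np

# p[s][a] is a probability distribution over the next state s'.
p = {
    0: {"right": [0.1, 0.8, 0.1]},
    1: {"right": [0.0, 0.2, 0.8]},
}

def next_state(s, a):
    return np.random.choice([0, 1, 2], p=p[s][a])

print(next_state(0, "right"))   # usually 1, occasionally 0 or 2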

2.4 The interaction process between the agent and the environment

The process of the agent interacting with the environment can be summarized as:

  • At time $t$, the environment generates the state $s_t$; the agent uses $s_t$ to make its subsequent decision.

  • In state $s_t$, the agent randomly samples an action $a_t$ according to the policy probabilities and applies it to the environment.

  • After receiving the action $a_t$ taken in state $s_t$, the environment generates the reward $r_t$ and the state $s_{t+1}$ at time $t+1$, completing one closed loop of the interaction.

The process of using reinforcement learning to play this game is therefore:

  • Observe a frame (state $s_1$)
  • Sample an action $a_1$ (up, left, or right)
  • Observe a new frame (state $s_2$) and receive the reward $r_1$
  • Sample an action $a_2$
  • ... (repeat)

The resulting $(state, action, reward)$ trajectory is:

$$s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T$$
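In code, collecting such a trajectory can be sketched as the loop below, assuming a hypothetical env object with reset()/step() methods and a sample_action(s) function implementing $\pi(a \mid s)$:

def run_episode(env, sample_action, max_steps=1000):
    trajectory = []                       # list of (state, action, reward) tuples
    s = env.reset()
    for _ in range(max_steps):
        a = sample_action(s)              # sample a_t from pi(.|s_t)
        s_next, r, done = env.step(a)     # environment returns r_t and s_{t+1}
        trajectory.append((s, a, r))
        s = s_next
        if done:                          # episode ends (e.g. game won or lost)
            break
    return trajectory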

2.5 Discounted rewards

The cumulative return $U$ is defined as:

$$U_t = R_t + R_{t+1} + R_{t+2} + R_{t+3} + \ldots$$

Regarding the relative importance of rewards, a reward received in the future matters less than one received at the current time $t$, so the present should carry a higher weight. Following Wang Shusen's example, receiving 100 yuan now is clearly worth more than a promise of 100 yuan in the future. A discount factor $\gamma$ is therefore added, and the definition of $U$ becomes:

$$U_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \gamma^3 R_{t+3} + \ldots$$
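A sketch of computing this discounted return for a finite reward sequence:

def discounted_return(rewards, gamma=0.9):
    """Compute U_t for a finite reward sequence R_t, R_{t+1}, ..."""
    u = 0.0
    for k, r in enumerate(rewards):
        u += (gamma ** k) * r
    return u

print(discounted_return([1, 0, 0, 10000]))   # 1 + 0.9**3 * 10000 = 7291.0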
Randomness of the discounted return

At time $t$, the cumulative return $U_t$ is random, and its randomness has two sources:

  • The sampling of actions is random:
    $$P[A = a \mid S = s] = \pi(a \mid s)$$

  • The generation of the new state is random:
    $$P[S' = s' \mid S = s, A = a] = p(s' \mid s, a)$$

For any $i \geq t$, the reward $R_i$ depends on the random variables $S_i$ and $A_i$. Therefore, given the state $s_t$, the cumulative return $U_t$ depends on the random variables:

$$A_t, A_{t+1}, A_{t+2}, \ldots \quad \text{and} \quad S_{t+1}, S_{t+2}, \ldots$$

2.6 Action value function

The action value function $Q(s, a)$ is defined as:

$$Q_\pi(s_t, a_t) = E[U_t \mid S_t = s_t, A_t = a_t]$$

That is, the action value function is the expectation of the cumulative return $U_t$; under the policy $\pi$, it evaluates how good it is to take action $a_t$ in state $s_t$.

The optimal action value function $Q^*$ is defined as:

$$Q^*(s_t, a_t) = \max_{\pi} Q_\pi(s_t, a_t)$$

Action value function: given a policy function $\pi$, $Q_\pi(s, a)$ evaluates how good it is for the agent to sample action $a$ in state $s$.
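As a small illustration, with a toy Q-table the action that $Q$ rates highest in a state can be read off directly (all numbers are made up):

import numpy as np

# Q[s, a] scores how good action a is in state s.
Q = np.array([[1.0, 2.5, 0.3],    # state 0
              [0.2, 0.1, 4.0]])   # state 1

s = 0
best_a = np.argmax(Q[s])   # the action the Q function rates highest in state s
print(best_a)              # -> 1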

2.7 State value function

The state value function $V_\pi$ is defined as:

$$V_\pi(s_t) = E_A[Q_\pi(s_t, A)]$$
Depending on whether the action space is discrete or continuous, this can be written as:

  • Discrete action space:
    $$V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \sum_{a} \pi(a \mid s_t) \cdot Q_\pi(s_t, a)$$

  • Continuous action space:
    $$V_\pi(s_t) = E_A[Q_\pi(s_t, A)] = \int \pi(a \mid s_t) \cdot Q_\pi(s_t, a) \, da$$

For a given policy function $\pi$, $V_\pi(s)$ evaluates how good the state $s$ is.

Taking the expectation over all states, $E_S[V_\pi(S)]$ evaluates how good the policy function $\pi$ itself is.
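For a discrete action space, the sum above is easy to sketch in code (the probabilities and Q values are made up for illustration):

import numpy as np

pi_s = np.array([0.2, 0.1, 0.7])   # pi(a|s) for one state
Q_s = np.array([1.0, -0.5, 2.0])   # Q_pi(s, a) for each action

V_s = np.sum(pi_s * Q_s)           # V_pi(s) = sum_a pi(a|s) * Q_pi(s, a)
print(V_s)                         # 0.2*1.0 + 0.1*(-0.5) + 0.7*2.0 = 1.55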

3. Python Reinforcement Learning Maze Example

The source code in this section comes from the repository Wanghailin2019/Learing-DRL-by-PyTorch-cookbook. The underlying book was written by Yutaro Ogawa from Japan; the author's GitHub source code is annotated in Japanese, and that repository translates the annotations into Chinese.

From "Learning Deep Reinforcement Learning PyTorch Programming Practice by Doing It"

# Import the required packages
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Initial state of the maze

# Declare the figure size and the figure variable
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()

# Draw the red walls
plt.plot([1, 1], [0, 1], color='red', linewidth=2)
plt.plot([1, 2], [2, 2], color='red', linewidth=2)
plt.plot([2, 2], [2, 1], color='red', linewidth=2)
plt.plot([2, 3], [1, 1], color='red', linewidth=2)

# Draw the labels S0-S8 for the states
plt.text(0.5, 2.5, 'S0', size=14, ha='center')
plt.text(1.5, 2.5, 'S1', size=14, ha='center')
plt.text(2.5, 2.5, 'S2', size=14, ha='center')
plt.text(0.5, 1.5, 'S3', size=14, ha='center')
plt.text(1.5, 1.5, 'S4', size=14, ha='center')
plt.text(2.5, 1.5, 'S5', size=14, ha='center')
plt.text(0.5, 0.5, 'S6', size=14, ha='center')
plt.text(1.5, 0.5, 'S7', size=14, ha='center')
plt.text(2.5, 0.5, 'S8', size=14, ha='center')
plt.text(0.5, 2.3, 'START', ha='center')
plt.text(2.5, 0.3, 'GOAL', ha='center')

# Set the drawing range and hide the ticks
ax.set_xlim(0, 3)
ax.set_ylim(0, 3)
plt.tick_params(axis='both', which='both', bottom=False, top=False,
                labelbottom=False, right=False, left=False, labelleft=False)

# Draw the current position S0 as a green circle
line, = ax.plot([0.5], [2.5], marker="o", color='g', markersize=60)

# Set the initial value theta_0 of the parameter theta, which determines the initial policy

# Rows are states 0-7; columns are the movement directions up, right, down, left
theta_0 = np.array([[np.nan, 1, 1, np.nan],  # s0
                    [np.nan, 1, np.nan, 1],  # s1
                    [np.nan, np.nan, 1, 1],  # s2
                    [1, 1, 1, np.nan],  # s3
                    [np.nan, np.nan, 1, 1],  # s4
                    [1, np.nan, np.nan, np.nan],  # s5
                    [1, np.nan, np.nan, np.nan],  # s6
                    [1, 1, np.nan, np.nan],  # s7 (s8 is the goal, so it needs no policy)
                    ])

# Define a function that converts the policy parameter theta into the action policy pi
def simple_convert_into_pi_from_theta(theta):
    # simply compute the percentages

    [m, n] = theta.shape  # get the size of the matrix theta
    pi = np.zeros((m, n))
    for i in range(0, m):
        pi[i, :] = theta[i, :] / np.nansum(theta[i, :])  # compute the percentage for each action

    pi = np.nan_to_num(pi)  # convert nan to 0

    return pi

# Compute the initial policy pi
pi_0 = simple_convert_into_pi_from_theta(theta_0)
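# For example, row s0 of theta_0 is [nan, 1, 1, nan], so pi_0[0] becomes
# [0, 0.5, 0.5, 0]: from S0 the agent moves right or down with equal probability.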

# Define a function that returns the state s after one step of movement

def get_next_s(pi, s):
    direction = ["up", "right", "down", "left"]

    next_direction = np.random.choice(direction, p=pi[s, :])
    # choose a direction according to the probabilities pi[s, :]

    if next_direction == "up":
        s_next = s - 3  # moving up decreases the state number by 3
    elif next_direction == "right":
        s_next = s + 1  # moving right increases the state number by 1
    elif next_direction == "down":
        s_next = s + 3  # moving down increases the state number by 3
    elif next_direction == "left":
        s_next = s - 1  # moving left decreases the state number by 1

    return s_next

# Define a function that moves the agent through the maze until it reaches the goal

def goal_maze(pi):
    s = 0  # starting position
    state_history = [0]  # list recording the agent's movement trajectory

    while True:  # loop until the goal is reached
        next_s = get_next_s(pi, s)
        state_history.append(next_s)  # append the next state (the agent's position) to the record

        if next_s == 8:  # terminate when the goal is reached
            break
        else:
            s = next_s

    return state_history

# Move through the maze toward the goal

state_history = goal_maze(pi_0)

print(state_history)
print("求解迷宫路径所需的步数是 " + str(len(state_history) - 1))

# Visualize the agent's movement
# Reference URL: http://louistiao.me/posts/notebooks/embedding-matplotlib-animations-in-jupyter-notebooks/
from matplotlib import animation
from IPython.display import HTML


def init():
    '''Initialize the background image'''
    line.set_data([], [])
    return (line,)


def animate(i):
    '''Content of each frame'''
    state = state_history[i]  # the current position to draw
    x = (state % 3) + 0.5  # x coordinate: remainder of the state number divided by 3, plus 0.5
    y = 2.5 - int(state / 3)  # y coordinate: 2.5 minus the integer quotient of the state number divided by 3
    line.set_data([x], [y])
    return (line,)


# Generate the animation from the initialization and drawing functions
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=len(
    state_history), interval=200, repeat=False)

HTML(anim.to_jshtml())


Source: blog.csdn.net/qq_38853759/article/details/130190054