Basics of using q-learning for reinforcement learning

reinforcement learning

The agent learns through a policy (strategy); q-learning is based on the Markov chain model.

Markov chain: the return is the current reward plus the discounted future return, R(t) = reward(t) + γ·R(t+1), where γ is the discount factor. After many iterations of the Markov chain the value estimates become stable, so the optimal policy can be obtained.
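As a quick illustration, a minimal sketch of that recursion (the reward sequence and γ value here are invented just for the example):

# Discounted return R(t) = reward(t) + gamma * R(t+1)
gamma = 0.9
rewards = [1.0, 1.0, 1.0, 1.0]    # hypothetical reward(t) for t = 0..3

def discounted_return(rewards, gamma, t=0):
    if t >= len(rewards):
        return 0.0
    return rewards[t] + gamma * discounted_return(rewards, gamma, t + 1)

print(discounted_return(rewards, gamma))   # 1 + 0.9 + 0.81 + 0.729 = 3.439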

q-learning

  • Construct a qtable: a two-dimensional array whose two dimensions are state and action; during the qtable iteration process the values stay no greater than 1 (see the sketch after the table).

|        | action1 | action2 | action3 |
|--------|---------|---------|---------|
| state1 |         |         |         |
| state2 |         |         |         |
| state3 |         |         |         |
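A minimal sketch of building such a table with numpy (the 3×3 size is only illustrative, matching the table above):

import numpy as np

n_states = 3    # state1..state3 (illustrative)
n_actions = 3   # action1..action3 (illustrative)

# Q-table: rows are states, columns are actions, initialised to zero
qtable = np.zeros((n_states, n_actions))
print(qtable.shape)   # (3, 3)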

Action update formula: Q(s,a) ← Q(s,a) + α[reward + γ·max_{a'} Q(s',a') - Q(s,a)]. This scores each action, and numpy's argmax then returns the index of the highest-scoring action.
γ is the discount factor: the larger it is, the more weight the estimated future value max_{a'} Q(s',a') carries; the smaller it is, the more the immediate reward dominates.
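A minimal sketch of one update step plus the argmax choice, assuming the illustrative qtable from the sketch above (the state indices, reward, and learning rate here are made up for the example):

alpha = 0.8    # learning rate (illustrative)
gamma = 0.9    # discount factor (illustrative)

s, s_next = 0, 1           # current and next state index (illustrative)
a = np.argmax(qtable[s])   # pick the action with the highest Q value
reward = 1.0               # reward returned by the environment (illustrative)

# Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
qtable[s, a] += alpha * (reward + gamma * np.max(qtable[s_next]) - qtable[s, a])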

The training process uses an ε-greedy strategy: with some probability a random action is explored, otherwise the best known action in the qtable is exploited.
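A minimal sketch of that selection, reusing the illustrative qtable and state index from the sketches above, with an assumed exploration probability of 0.3 (the value used in the training code further below):

epsilon = 0.3   # exploration probability (assumption matching the code below)

if np.random.random() < epsilon:
    a = np.random.randint(n_actions)   # explore: random action
else:
    a = np.argmax(qtable[s])           # exploit: best known action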

Using gym

import gym

quit = False
env = gym.make("CartPole-v1", render_mode="human")
print(env.observation_space, env.action_space)
# reset returns the environment's internal state: 4 parameters.
# For the qtable those 4 parameters together form one state value; each parameter's
# range is then divided into n bins to get discrete states. action = 0 or 1 (left / right).
state = env.reset()
while not quit:
    env.render()
    obs, reward, terminated, truncated, info = env.step(1)   # always push right, just to watch the cart move
    if terminated or truncated:
        env.reset()      # restart once the episode ends

Official demo

env = gym.make('CartPole-v0', render_mode="human")
for i_episode in range(20):
    observation, info = env.reset()   # re-initialise the environment for every episode
    for t in range(100):
        env.render()                  # display the environment
        print(observation)
        action = env.action_space.sample()   # pick a random action
        observation, reward, terminated, truncated, info = env.step(action)
        if terminated or truncated:   # check whether the episode has ended
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()


q-learning

Following the official demo: with gym you do not need to build the reward yourself. gym assigns the corresponding reward value according to the agent's action, and the step function returns it together with the next observation and the episode-end flags.
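For reference, a minimal sketch of what one call returns under the current gym step API (the variable names are just placeholders):

# obs: next observation, reward: value assigned by gym,
# terminated/truncated: episode-end flags, info: extra diagnostics
obs, reward, terminated, truncated, info = env.step(action)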

alpha = 0.8   # learning rate
nstate = 50   # bins per observation dimension; the pole only balances near the initial
              # position, so evenly splitting the whole range is not ideal for training
gamma = 1     # discount factor
env = gym.make("CartPole-v0", render_mode="human")
table = np.zeros((nstate, nstate, nstate, nstate, env.action_space.n))   # two actions: left and right
print(env.observation_space, env.action_space)

for i in range(10000):
    t = 0
    observation = env.reset()
    high = env.observation_space.high.copy()
    low = env.observation_space.low.copy()
    high[1] = high[3] = 10    # redefine the velocity bounds (originally +/-inf),
    low[1] = low[3] = -10     # otherwise the state index never changes
    div = (high - low) / nstate
    state = tuple(np.clip(((observation[0] - low) / div).astype(int), 0, nstate - 1))
    while True:
        env.render()
        if np.random.random() < 0.3:              # explore: random action
            action = env.action_space.sample()
        else:                                     # exploit: best action in the Q-table
            action = np.argmax(table[state])
        t += 1
        observation = env.step(action)
        next_state = tuple(np.clip(((observation[0] - low) / div).astype(int), 0, nstate - 1))
        # Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
        table[state][action] += alpha * (observation[1] + gamma * np.max(table[next_state]) - table[state][action])
        state = next_state
        if observation[2] or observation[3]:      # terminated or truncated
            print("{} Episode finished after {} timesteps".format(state, t + 1))
            break

MountainCar example


env = gym.make("MountainCar-v0", render_mode="human")
n_states = 40
iter_max = 10000
gamma = 1.0
epsilon = 0.3
alpha = 0.5

def obs_to_state(obs):   #把参数范围划分40个状态,求当前值在哪个状态区间
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    env_dx = (env_high - env_low) / n_states
    state_index = tuple(((obs - env_low) / env_dx).astype(int))
    return state_index

Q = np.zeros((n_states, n_states, env.action_space.n))
obs = env.reset()
s = obs_to_state(obs[0])
while True:
    env.render()
    if np.random.uniform(0, 1) < epsilon:
        a = env.action_space.sample()
    else:
        a = np.argmax(Q[s])
    obs = env.step(a)
    if obs[2]: break
    next_s = obs_to_state(obs[0])
    td_target = obs[1] + gamma * np.max(Q[next_s])
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    s = next_s

print(Q)

Save parameter values

numpy.save("1", qtable)       # writes the Q-table to 1.npy
qtable = numpy.load("1.npy")  # note: load needs the .npy extension

Once the qtable is saved it can be reused for decision-making: for the current state, look up its index in the multi-dimensional array (the same index used during training) and take the best action, with the random exploration removed. To reach a stable result the training has to visit essentially every entry in the qtable, so if training is too slow a DQN network can be used instead.
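A minimal sketch of that decision step, assuming the CartPole table, discretisation (low, div), and the file name 1.npy from the examples above:

table = np.load("1.npy")                                    # reload the trained Q-table
observation = env.reset()
state = tuple(((observation[0] - low) / div).astype(int))   # same discretisation as in training
action = np.argmax(table[state])                            # greedy action, no random exploration
observation = env.step(action)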

In games outside gym you need to define the reward yourself; the larger the difference between the reward values of different states, the faster the agent learns.
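As an illustration only, a hypothetical shaped reward for a custom game (the function and its thresholds are invented for this sketch, not from any library):

def custom_reward(distance_to_goal):
    # hypothetical shaping: states closer to the goal get clearly larger rewards,
    # so neighbouring states are easier to tell apart during learning
    if distance_to_goal < 1.0:
        return 10.0                      # reached (or almost reached) the goal
    return -0.1 * distance_to_goal       # small penalty that grows with distance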

q-learning is well suited to discrete actions, but it does not handle continuously changing states very well (they have to be discretised into bins, as above).


Origin blog.csdn.net/daoer_sofu/article/details/133013871