Q-learning: A Reinforcement Learning Solution to the Towers of Hanoi Puzzle

Our goal is to write Q-learning code and then use it to solve the Towers of Hanoi puzzle.

Introduction to reinforcement learning

We will not repeat the basic formal definitions here; instead we go straight to the useful parts.

Steps of reinforcement learning:

  • For each state, compute the potential reward of every action available in that state (the state-action value).

    • These values are usually recorded in a Q table, which can be written as \(Q[(state,move):value]\).
  • For the Towers of Hanoi puzzle, since we can reach the final goal, we set the final reinforcement \(r\) = 1.

  • For reinforcement learning there are two strategies for choosing actions (note: the two choices lead to different equations for updating the Q table):

    • Strategy 1: always choose the smaller value; a smaller value means we are closer to the goal.
    • Strategy 2: always choose the larger value; a larger value means we are closer to the goal.
    • Here we set the goal reinforcement to 1 and use the smaller-value rule to choose actions. The selection equation is:
      • $ a_t = \mathop{\arg\min}_{a} Q(s_t,a). $
      • where $a_t$ is the chosen action and $s_t$ is the current state. Read it as: in state $s_t$ there are several candidate actions $a$, and we choose the action $a_t$ with the smallest Q value.
  • Now consider how the Q table is updated.

  • For the Q-table update we use the following two equations (with r = 1). (Note: all Q values are initialized to 0 and then updated for each state-action pair encountered.)

    • If the goal is reached

      \[ \begin{align*} Q(s_t,a_t) = Q(s_t,a_t) + \rho (r - Q(s_t,a_t)) \end{align*} \]
      • Alternatively, the value can simply be set to 1 to mark the goal; for simplicity we assign 1 directly here.
    • Otherwise

      \[ \begin{align*} Q(s_t,a_t) = Q(s_t,a_t) + \rho (r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)) \end{align*} \]
    • Understanding the equations above:

      • First, in state $s_t$ we choose action $a_t$ according to the Q-table values and move to $s_{t+1}$.
      • In state $s_{t+1}$, the first thing we do is update the Q value of action $a_t$ taken in the previous state $s_t$.
      • At this point the Q table gives us the value of $a_{t+1}$ in $s_{t+1}$, and we have the goal reinforcement \(r\) = 1.
      • We treat $(r + Q(s_{t+1},a_{t+1}))$ as the "actual" Q value of action $a_t$ in state $s_t$, while $Q(s_t,a_t)$ is the current estimate.
      • Therefore \((r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))\) is the difference between the actual value and the estimate; multiplying it by the learning rate $\rho$ gives the amount learned on each update.
      • Finally, adding this difference to the old estimate $Q(s_t,a_t)$ gives the updated $Q(s_t,a_t)$.

The above explains the basic reinforcement learning procedure; a minimal code sketch of the selection and update rules follows.
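
To make this concrete, here is a minimal sketch of the two rules, assuming states and moves have already been converted to hashable tuples and Q is a plain dictionary keyed by (state, move). The helper names chooseAction and updateQ are only for illustration and are not the functions used later in this post.

def chooseAction(Q, state, moves):
    # greedy step: pick the move with the smallest Q value (unseen pairs count as 0)
    values = [Q.get((state, m), 0) for m in moves]
    return moves[values.index(min(values))]

def updateQ(Q, oldKey, newKey, rho, atGoal):
    # oldKey = (s_t, a_t), newKey = (s_{t+1}, a_{t+1}); the reinforcement r is fixed to 1
    if atGoal:
        Q[oldKey] = 1                        # goal reached: assign 1 directly
    else:
        estimate = Q.get(oldKey, 0)          # current estimate Q(s_t, a_t)
        actual = 1 + Q.get(newKey, 0)        # r + Q(s_{t+1}, a_{t+1})
        Q[oldKey] = estimate + rho * (actual - estimate)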

What needs to be done

  • First, we visualize the Towers of Hanoi puzzle so that the results and the solution process are easy to observe.

  • Simply put, a state can be represented as [[1, 2, 3], [], []]: the three inner lists are the three pegs, and the numbers are the three disks, where the value of a number indicates the disk's size.

  • A disk-moving action can likewise be simplified to [1, 2] or (1, 2), meaning: take the top disk of peg 1 and place it on peg 2 (pegs are numbered 1, 2, 3 from left to right).

  • We can then write the following four functions:

    • printState(state): print the state of the towers for visualization
    • validMoves(state): return all valid moves from the current state
    • makeMove(state, move): return the state obtained by applying move (the action)
    • stateMoveTuple(state, move): convert state and move (the action) into tuple format, i.e. (state, move); we store the Q table as a dictionary keyed by these tuples, which keeps things simple
  • Next, write the epsilonGreedy function (which uses epsilon together with epsilonDecayFactor)

    • This function works as follows: draw a random number; if it is less than the preset epsilon, pick a random move; otherwise pick the move with the smallest Q value from the Q table.
    • In the epsilonGreedy function (if np.random.uniform() < epsilon), a small epsilon means the Q table is more likely to be used to choose moves, while too large an epsilon leads to convergence problems. For this problem we apply epsilon *= epsilonDecayFactor after each repetition so that epsilon keeps shrinking toward 0 (a short numeric illustration of this decay follows the pseudocode lists below).
  • trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF, startState, goalState)

    • Train a Q table from the given start and goal states so that it encodes a sensible policy.
    • Pseudocode for trainQ:
      1. Initialize Q.
      2. Repeat:
        1. Use the epsilonGreedy function to get the move and compute stateNew.
        2. If (state, move) is not in Q, set Q[(state, move)] = 0.
        3. If stateNew is goalState, set Q[(state, move)] = 1.
        4. Otherwise (not at goal):
          1. If this is not the first step, update Qold = Qold + rho * (1 + Qnew - Qold).
          2. Shift the current state and move to the old ones.
  • testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState)

    • Given the desired start and goal states, automatically pick the best sequence of moves according to the values in the Q table.

    • Pseudocode for testQ:

      1. Get Q from trainQ.
      2. Repeat:
        1. Use the validMoves function to get the list of possible moves.
        2. Use the Q table to look up the value of each \((state, move)\); if a pair is not in Q, treat its value as infinity.
        3. Choose the move by \(\arg\min Q[(state, move)]\).
        4. Record the move and state in path.
        5. If at goal, return path.
        6. If step > maxSteps, return 'Goal not reached in maxSteps'.

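As a quick numeric illustration of the epsilon decay mentioned above (pure arithmetic, assuming the decay factor 0.7 used in the tests below), epsilon shrinks fast enough that after a handful of repetitions the moves are almost always chosen greedily from the Q table:

epsilon = 1.0
epsilonDecayFactor = 0.7
for n in range(1, 21):
    epsilon *= epsilonDecayFactor         # same decay step applied once per repetition in trainQ
    if n in (1, 5, 10, 20):
        print(n, round(epsilon, 4))       # 1 0.7, 5 0.1681, 10 0.0282, 20 0.0008
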
Code & Test

import numpy as np
import random
import matplotlib.pyplot as plt
import copy
%matplotlib inline
def stateModify(state):
    # Pad each peg with leading zeros up to N rows and flatten the result row by row,
    # so the towers can be printed as an N x 3 grid.
    N = 3
    row = []
    flattened = []
    columns = len(state)
    stateCopy = copy.deepcopy(state)   # deep copy, so the caller's state is not modified
    for i in range(columns):
        row.append(len(state[i]))
    # add 0s on top of the shorter pegs
    for i in range(columns):
        while row[i] < N:
            stateCopy[i].insert(0, 0)
            row[i] = len(stateCopy[i])
    # read the padded pegs row by row
    for i in range(max(row)):
        for j in range(len(stateCopy)):
            flattened.append(stateCopy[j][i])
    return flattened
def printState(state):
    statePrint = stateModify(state)
    # print the state 
    i = 0
    for num in statePrint:
        # if the number is zero, we print ' '
        if num == 0:
            print(" ",end=" ")
        else:
            print(num, end=" ")
        i += 1
        if i%3 == 0:
            print("")
    print('------')
def validMoves(state):
    actions = []    
    # check left 
    if state[0] != []:
        # left to middle
        if state[1]==[] or state[0][0] < state[1][0]:
            actions.append([1,2])
        # left to right
        if state[2]==[] or state[0][0] < state[2][0]:
            actions.append([1,3])
   
    # check middle
    if state[1] != []:
        # middle to left
        if state[0]==[] or state[1][0] < state[0][0]:
            actions.append([2,1])
        # middle to right   
        if state[2]==[] or state[1][0] < state[2][0]:
            actions.append([2,3])
    
    # check right        
    if state[2] != []:
        # right to left
        if state[0]==[] or state[2][0] < state[0][0]:
            actions.append([3,1])
        # right to middle
        if state[1]==[] or state[2][0] < state[1][0]:
            actions.append([3,2])            
    return actions
def stateMoveTuple(state, move):
    stateTuple = []
    returnTuple = [tuple(move)]
    for i in range (len(state)):
        stateTuple.append(tuple(state[i]))
    returnTuple.insert(0,tuple(stateTuple))
    return tuple(returnTuple)
def makeMove(state, move):
    # move = [from, to] with 1-based peg numbers: move the top disk of peg 'from' onto peg 'to'
    stateMove = copy.deepcopy(state)
    stateMove[move[1]-1].insert(0, stateMove[move[0]-1][0])
    stateMove[move[0]-1].pop(0)
    return stateMove
def epsilonGreedy(Q, state, epsilon, validMovesF):
    validMoveList = validMovesF(state)
    if np.random.uniform() < epsilon:
        # random move
        lens = len(validMoveList)
        return validMoveList[random.randint(0, lens-1)]
    else:
        # greedy move: the valid move with the smallest Q value (unseen pairs default to 0)
        Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in validMoveList])
        return validMoveList[np.argmin(Qs)]
def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF, startState, goalState):
    epsilon = 1.0
    outcomes = np.zeros(nRepetitions)
    Q = {}
    for nGames in range(nRepetitions):
        epsilon *= epsilonDecayFactor
        step = 0
        done = False
        state = copy.deepcopy(startState)
    
        while not done:
            step += 1
            move = epsilonGreedy(Q, state, epsilon, validMovesF)         
            stateNew = makeMoveF(state,move)
            if stateMoveTuple(state, move) not in Q:
                Q[stateMoveTuple(state, move)] = 0 
                
            if stateNew == goalState:
#                 Q[stateMoveTuple(state, move)] += learningRate * (1 - Q[stateMoveTuple(state, move)])
                Q[stateMoveTuple(state, move)] = 1
                done = True
                outcomes[nGames] = step  
                
            else:
                if step > 1:
                    Q[stateMoveTuple(stateOld, moveOld)] += learningRate * \
                                    (1 + Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)]) 
                stateOld = copy.deepcopy(state)
                moveOld = copy.deepcopy(move)
                state = copy.deepcopy(stateNew)
    return Q, outcomes                  
def testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState):
    state = copy.deepcopy(startState)
    path = []
    path.append(state)
    step = 0
    while True:
        step += 1
        Qs = []
        validMoveList = validMovesF(state)
        for m in validMoveList:
            if stateMoveTuple(state, m) in Q:
                Qs.append(Q[stateMoveTuple(state, m)])
            else:
                # unseen state-move pairs are treated as very large (never chosen if avoidable)
                Qs.append(0xffffff)
        stateNew = makeMoveF(state, validMoveList[np.argmin(Qs)])
        path.append(stateNew)
        if stateNew == goalState:
            return path
        elif step >= maxSteps:
            print('Goal not reached in {} steps'.format(maxSteps))
            return []
        state = copy.deepcopy(stateNew)

def minsteps(steps, minStepOld, nRepetitions):
    # Count how many of the earliest repetitions have to be dropped before the
    # mean number of steps falls to 7 or fewer (7 is the optimum for 3 disks).
    # Return the smaller of this count and minStepOld, plus a flag saying
    # whether the new count is an improvement.
    delStep = 0
    steps = list(steps)
    while delStep != nRepetitions:
        if np.mean(steps) > 7:
            steps.pop(0)
            delStep += 1
        else:
            if delStep < minStepOld:
                return delStep, True
            else:
                return minStepOld, False
    if delStep < minStepOld:
        return delStep, True
    else:
        return minStepOld, False

def findBetter(nRepetitions, learningRate, epsilonDecayFactor):
    # Grid-search learningRate and epsilonDecayFactor, keeping the pair that needs
    # the fewest repetitions before the mean steps-to-goal reaches 7.
    Q, steps = trainQ(nRepetitions, 0.5, 0.7, validMoves, makeMove,
                      startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
    minStepOld, _ = minsteps(steps, 0xffffff, nRepetitions)
    bestlRate = 0.5
    besteFactor = 0.7
    LAndE = []
    for k in range(10):
        for i in range(len(learningRate)):
            for j in range(len(epsilonDecayFactor)):
                Q, steps = trainQ(nRepetitions, learningRate[i], epsilonDecayFactor[j], validMoves, makeMove,
                                  startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
                minStepNew, B = minsteps(steps, minStepOld, nRepetitions)
                if B:
                    bestlRate = learningRate[i]
                    besteFactor = epsilonDecayFactor[j]
                    minStepOld = copy.deepcopy(minStepNew)
        LAndE.append([bestlRate, besteFactor])
    return LAndE

Test part

state = [[1, 2, 3], [], []]
printState(state)

1     
2     
3     
------

state = [[1, 2, 3], [], []]
move =[1, 2]
stateMoveTuple(state, move)

(((1, 2, 3), (), ()), (1, 2))

state = [[1, 2, 3], [], []]
newstate = makeMove(state, move)
newstate

[[2, 3], [1], []]

Q, stepsToGoal = trainQ(100, 0.5, 0.7, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
path = testQ(Q, 20, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
path

[[[1, 2, 3], [], []],
 [[2, 3], [], [1]],
 [[3], [2], [1]],
 [[3], [1, 2], []],
 [[], [1, 2], [3]],
 [[1], [2], [3]],
 [[1], [], [2, 3]],
 [[], [], [1, 2, 3]]]

for s in path:
    printState(s)
    print()

1     
2     
3     
------

2     
3   1 
------

3 2 1 
------

  1   
3 2   
------

  1   
  2 3 
------

1 2 3 
------   

    2 
1   3 
------

    1 
    2 
    3 
------


# find better learningRate and epsilonDecayFactor
learningRate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
epsilonDecayFactor = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
LAndE = findBetter(100,learningRate,epsilonDecayFactor)
print(LAndE)

[[0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6]]
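
matplotlib is imported at the top of the code but not used in this excerpt. As an optional extra, here is a minimal sketch (assuming the stepsToGoal array returned by the earlier trainQ call is still in scope) that plots how quickly training converges toward the optimal 7 moves:

plt.plot(stepsToGoal)             # steps needed to reach the goal in each repetition
plt.axhline(7, color='r')         # 7 moves (2**3 - 1) is optimal for 3 disks
plt.xlabel('repetition')
plt.ylabel('steps to goal')
plt.show()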
