MATLAB Reinforcement Learning Toolbox (2): Training a Q-learning agent in an MDP environment

This example shows how to train a Q-learning agent to solve a general Markov Decision Process (MDP) environment.
(Figure: MDP state-transition graph with eight states; each transition between states is labeled with its reward.)
Here:

  1. Each circle represents a state.

  2. At each state, there is a decision to go up or down.

  3. The agent starts from state 1.

  4. The agent receives a reward equal to the value of each transition in the graph.

  5. The training goal is to collect the largest cumulative reward.

Create MDP environment

Create an MDP model with eight states and two actions ("up" and "down").

MDP = createMDP(8,["up";"down"]);
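
As a quick check (not part of the original example), you can inspect the model that createMDP returns; the state and action names below are the defaults generated by createMDP, and the transition and reward arrays are states-by-states-by-actions.

% Optional inspection of the freshly created model (assumes the default
% property names of the object returned by createMDP).
MDP.States      % ["s1"; "s2"; ... ; "s8"]
MDP.Actions     % ["up"; "down"]
size(MDP.T)     % 8-by-8-by-2, initialized to zeros
size(MDP.R)     % 8-by-8-by-2, initialized to zeros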

To model the transitions from the figure above, modify the state transition matrix and reward matrix of the MDP. By default, both matrices contain zeros. For example, in the following commands:

1. The first two lines specify the transition from state 1 to state 2 by taking action 1 ("up"), and provide a +3 reward for this transition.

2. The next two lines specify the transition from state 1 to state 3 by taking action 2 ("down"), and provide a +1 reward for this transition.

MDP.T(1,2,1) = 1;
MDP.R(1,2,1) = 3;
MDP.T(1,3,2) = 1;
MDP.R(1,3,2) = 1;

Similarly, specify the state transitions and rewards for the rest of the diagram.

% State 2 transition and reward
MDP.T(2,4,1) = 1;
MDP.R(2,4,1) = 2;
MDP.T(2,5,2) = 1;
MDP.R(2,5,2) = 1;
% State 3 transition and reward
MDP.T(3,5,1) = 1;
MDP.R(3,5,1) = 2;
MDP.T(3,6,2) = 1;
MDP.R(3,6,2) = 4;
% State 4 transition and reward
MDP.T(4,7,1) = 1;
MDP.R(4,7,1) = 3;
MDP.T(4,8,2) = 1;
MDP.R(4,8,2) = 2;
% State 5 transition and reward
MDP.T(5,7,1) = 1;
MDP.R(5,7,1) = 1;
MDP.T(5,8,2) = 1;
MDP.R(5,8,2) = 9;
% State 6 transition and reward
MDP.T(6,7,1) = 1;
MDP.R(6,7,1) = 5;
MDP.T(6,8,2) = 1;
MDP.R(6,8,2) = 1;
% State 7 transition and reward
MDP.T(7,7,1) = 1;
MDP.R(7,7,1) = 0;
MDP.T(7,7,2) = 1;
MDP.R(7,7,2) = 0;
% State 8 transition and reward
MDP.T(8,8,1) = 1;
MDP.R(8,8,1) = 0;
MDP.T(8,8,2) = 1;
MDP.R(8,8,2) = 0;
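
As an optional sanity check (not in the original example), you can verify that every state has a fully specified transition distribution for both actions; with deterministic transitions, each row of the transition matrix should sum to 1 for each action.

% Optional check: for each action, every row of MDP.T(:,:,a) should sum
% to 1 once all transitions are specified.
assert(all(abs(sum(MDP.T,2) - 1) < 1e-12, 'all'), ...
    "Some state/action pair has an incomplete transition distribution.")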

Specify states "s7" and "s8" as the terminal states of the MDP.

MDP.TerminalStates = ["s7";"s8"];

Create a reinforcement learning MDP environment for this process model.

env = rlMDPEnv(MDP);

To specify that the initial state of the agent is always state 1, define a reset function that returns the initial agent state. This function is called at the start of each training episode and each simulation. Create an anonymous function handle that sets the initial state to 1.

env.ResetFcn = @() 1;
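
The reset function can return any valid state index. As an illustration only (not used in this example), a variant that starts each episode from a random non-terminal state could look like this:

% Alternative reset (illustration only): start from a random state 1-6,
% avoiding the terminal states s7 and s8.
% env.ResetFcn = @() randi(6);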

Fix the random generator seed for reproducibility.

rng(0)

Create Q-learning agent

To create a Q-learning agent, first create a Q table using the observation and action specifications from the MDP environment. Set the learning rate of the representation to 1.

obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
qTable = rlTable(obsInfo, actInfo);
qRepresentation = rlQValueRepresentation(qTable, obsInfo, actInfo);
qRepresentation.Options.LearnRate = 1;
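
Before training, the Q table contains all zeros. Assuming the rlTable object exposes its values through a Table property (as in recent toolbox releases), you can confirm its size:

% Optional: the table has one row per state and one column per action,
% so it is 8-by-2 and initialized to zeros before training.
size(qTable.Table)   % [8 2]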

Next, use this table representation to create a Q-learning agent and configure epsilon-greedy exploration.

agentOpts = rlQAgentOptions;
agentOpts.DiscountFactor = 1;
agentOpts.EpsilonGreedyExploration.Epsilon = 0.9;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
qAgent = rlQAgent(qRepresentation,agentOpts);
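
With these settings, the agent initially explores on almost every step (Epsilon = 0.9) and gradually shifts toward exploitation. A rough sketch of the schedule, assuming epsilon is reduced multiplicatively by EpsilonDecay after each agent step and floored at the default EpsilonMin of 0.01:

% Rough illustration of the assumed epsilon update rule:
% epsilon <- max(epsilon*(1 - EpsilonDecay), EpsilonMin).
epsilon = 0.9;
for k = 1:500
    epsilon = max(epsilon*(1 - 0.01), 0.01);
end
epsilon   % reaches the 0.01 floor after a few hundred steps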

Train Q-learning agent

To train the agent, first specify the training options. For this example, use the following options:

  1. Train for at most 200 episodes, with each episode lasting at most 50 time steps.

  2. Stop training when the average cumulative reward that the agent receives over 30 consecutive episodes is greater than 13.

trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.MaxEpisodes = 200;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 13;
trainOpts.ScoreAveragingWindowLength = 30;

Use the train function to train the agent. This may take several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.

doTraining = false;

if doTraining
    % Train the agent.
    trainingStats = train(qAgent,env,trainOpts);
else
    % Load pretrained agent for the example.
    load('genericMDPQAgent.mat','qAgent');
end


Verify Q-learning results

To verify the training results, simulate the agent in the training environment using the sim function. The agent successfully finds the optimal path, which yields a cumulative reward of 13 (s1 to s2 with reward 3, s2 to s5 with reward 1, s5 to s8 with reward 9).

Data = sim(qAgent,env);
cumulativeReward = sum(Data.Reward)

cumulativeReward = 13
Because the discount factor is set to 1, the values in the Q table of the trained agent match the undiscounted returns of the environment.

QTable = getLearnableParameters(getCritic(qAgent));
QTable{1}


TrueTableValues = [13,12;5,10;11,9;3,2;1,9;5,1;0,0;0,0]
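
These values can be reproduced by hand with undiscounted backward induction from the terminal states; for example, using the rewards defined earlier:

% Worked check (illustration): undiscounted Q values backed up from the
% terminal states s7 and s8.
Q_s5_down = 9;                            % s5 -> s8, terminal
Q_s2_up   = 2 + max(3, 2);                % s2 -> s4, then best of s4:  5
Q_s2_down = 1 + Q_s5_down;                % s2 -> s5, then s5 -> s8:   10
Q_s1_up   = 3 + max(Q_s2_up, Q_s2_down)   % 3 + 10 = 13, as in row 1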

