Train a Q-Learning Agent in an MDP Environment
This example shows how to train a Q-learning agent to solve a general Markov Decision Process (MDP) environment.
Here:
- Each circle represents a state.
- In each state, the agent decides whether to go up or down.
- The agent starts from state 1.
- The agent receives a reward equal to the value of each transition in the graph.
- The training goal is to maximize the cumulative reward.
Create MDP Environment
Create an MDP model with eight states and two actions ("up" and "down").
MDP = createMDP(8,["up";"down"]);
To model the transitions from the figure above, modify the state transition matrix and reward matrix of the MDP. By default, these matrices contain zeros.
Specify the state transition and reward matrices of the MDP. For example, in the following commands:
1. The first two lines specify the transition from state 1 to state 2 by taking action 1 ("up"), and provide a reward of +3 for this transition.
2. The next two lines specify the transition from state 1 to state 3 by taking action 2 ("down"), and provide a reward of +1.
MDP.T(1,2,1) = 1;
MDP.R(1,2,1) = 3;
MDP.T(1,3,2) = 1;
MDP.R(1,3,2) = 1;
Similarly, specify the state transitions and rewards for the remaining transitions in the graph.
% State 2 transition and reward
MDP.T(2,4,1) = 1;
MDP.R(2,4,1) = 2;
MDP.T(2,5,2) = 1;
MDP.R(2,5,2) = 1;
% State 3 transition and reward
MDP.T(3,5,1) = 1;
MDP.R(3,5,1) = 2;
MDP.T(3,6,2) = 1;
MDP.R(3,6,2) = 4;
% State 4 transition and reward
MDP.T(4,7,1) = 1;
MDP.R(4,7,1) = 3;
MDP.T(4,8,2) = 1;
MDP.R(4,8,2) = 2;
% State 5 transition and reward
MDP.T(5,7,1) = 1;
MDP.R(5,7,1) = 1;
MDP.T(5,8,2) = 1;
MDP.R(5,8,2) = 9;
% State 6 transition and reward
MDP.T(6,7,1) = 1;
MDP.R(6,7,1) = 5;
MDP.T(6,8,2) = 1;
MDP.R(6,8,2) = 1;
% State 7 transition and reward
MDP.T(7,7,1) = 1;
MDP.R(7,7,1) = 0;
MDP.T(7,7,2) = 1;
MDP.R(7,7,2) = 0;
% State 8 transition and reward
MDP.T(8,8,1) = 1;
MDP.R(8,8,1) = 0;
MDP.T(8,8,2) = 1;
MDP.R(8,8,2) = 0;
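Each element MDP.T(i,j,k) is the probability of moving from state i to state j under action k, so every specified transition row must sum to 1. As a quick sanity check (a sketch, not part of the original example), you can verify this for all eight states:

```matlab
% Sanity check: for each state and action, the transition
% probabilities over next states must sum to 1.
for s = 1:8
    for a = 1:2
        assert(abs(sum(MDP.T(s,:,a)) - 1) < eps, ...
            "Transition row for state %d, action %d does not sum to 1.", s, a);
    end
end
```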
Specify states "s7" and "s8" as the terminal states of the MDP.
MDP.TerminalStates = ["s7";"s8"];
Create a reinforcement learning MDP environment for this process model.
env = rlMDPEnv(MDP);
To specify that the initial state of the agent is always state 1, specify a reset function that returns the initial agent state. This function is called at the beginning of each training episode and simulation. Create an anonymous function handle that sets the initial state to 1.
env.ResetFcn = @() 1;
Fix the random generator seed for reproducibility.
rng(0)
Create Q-Learning Agent
To create a Q-learning agent, first use the observation and action specifications from the MDP environment to create a Q-table representation. Set the learning rate of the representation to 1.
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
qTable = rlTable(obsInfo, actInfo);
qRepresentation = rlQValueRepresentation(qTable, obsInfo, actInfo);
qRepresentation.Options.LearnRate = 1;
Next, use this Q-table representation to create a Q-learning agent and configure its epsilon-greedy exploration.
agentOpts = rlQAgentOptions;
agentOpts.DiscountFactor = 1;
agentOpts.EpsilonGreedyExploration.Epsilon = 0.9;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
qAgent = rlQAgent(qRepresentation,agentOpts);
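For reference, the tabular Q-learning update the agent applies after each step is, in standard notation:

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \Big( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Big)
```

With LearnRate = 1 and DiscountFactor = 1 as configured above, this simplifies to Q(s,a) ← r + max over a' of Q(s',a').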
Train Q-Learning Agent
To train the agent, first specify the training options. For this example, use the following options:
- Train for at most 200 episodes, with each episode lasting at most 50 time steps.
- Stop training when the average cumulative reward over 30 consecutive episodes is greater than 13.
trainOpts = rlTrainingOptions;
trainOpts.MaxStepsPerEpisode = 50;
trainOpts.MaxEpisodes = 200;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 13;
trainOpts.ScoreAveragingWindowLength = 30;
Use the train function to train the agent. Training may take a few minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
% Train the agent.
trainingStats = train(qAgent,env,trainOpts);
else
% Load pretrained agent for the example.
load('genericMDPQAgent.mat','qAgent');
end
Verify Q-Learning Results
To verify the training results, use the sim function to simulate the agent in the training environment. The agent successfully finds the optimal path, which yields a cumulative reward of 13.
Data = sim(qAgent,env);
cumulativeReward = sum(Data.Reward)
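To see which path the agent follows, you can also roll out the greedy policy by hand from the learned Q-table (a sketch, assuming the MDP and qAgent variables defined above; the transitions in this example are deterministic):

```matlab
% Follow the greedy policy from state 1 until a terminal state.
Q = getLearnableParameters(getCritic(qAgent));
Q = Q{1};
state = 1;
path = state;
while ~ismember(state, [7 8])
    [~, action] = max(Q(state,:));                % greedy action
    nextState = find(MDP.T(state,:,action), 1);   % deterministic next state
    path(end+1) = nextState; %#ok<AGROW>
    state = nextState;
end
path  % expected to visit 1 -> 2 -> 5 -> 8, collecting 3 + 1 + 9 = 13
```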
cumulativeReward = 13
Because the discount factor is set to 1, the values in the Q-table of the trained agent match the undiscounted returns of the environment.
QTable = getLearnableParameters(getCritic(qAgent));
QTable{1}
TrueTableValues = [13,12;5,10;11,9;3,2;1,9;5,1;0,0;0,0]
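As a final check (a sketch, assuming the variables above), you can compare the learned table entry-by-entry against the true undiscounted returns; for the provided pretrained agent this difference should be near zero:

```matlab
% Maximum absolute difference between learned and true Q-values.
maxError = max(abs(QTable{1} - TrueTableValues), [], 'all')
```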