OpenAI GYM CarRacing DQN: Deep Q learning to train self-driving cars
Introduction
In the field of reinforcement learning, training a CarRacing 2D agent capable of autonomous driving is a fascinating challenge. In this post, we walk through code for a Deep Q-Network (DQN), implemented with TensorFlow and Keras, that trains a model to navigate the virtual race track of the CarRacing environment.
DQN algorithm principle
Q-values and the Bellman equation
The Q-value (the expected cumulative reward for a state-action pair) is defined by the Bellman equation:
\[ Q(s,a) = r(s,a) + \gamma \max_{a' \in A} Q(s', a') \]
- \(s\) is the current state
- \(a\) is the action taken
- \(r(s,a)\) is the reward received after taking action \(a\) in state \(s\)
- \(s'\) is the next state
- \(A\) is the action space, and \(a'\) ranges over the actions in it
- \(\gamma\) is the discount rate, which weights the importance of future rewards
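As a rough numeric illustration of the equation above, the sketch below evaluates its right-hand side for a single transition. The function name `bellman_target` and all numbers are made up for this example and do not come from the project's code.

```python
def bellman_target(reward, next_q_values, gamma):
    """Return r(s, a) + gamma * max over a' of Q(s', a')."""
    return reward + gamma * max(next_q_values)


# Hypothetical values: reward 1.0 and three actions available in the next state.
next_q_values = [0.5, 2.0, 1.0]

# With gamma = 0.9 the best next-state value (2.0) is discounted and added to the reward.
print(bellman_target(1.0, next_q_values, gamma=0.9))

# With gamma = 0.0 future rewards are ignored and only the immediate reward remains.
print(bellman_target(1.0, next_q_values, gamma=0.0))
```

A discount rate close to 1 makes the agent value long-term reward almost as much as immediate reward, while a rate close to 0 makes it short-sighted.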
DQN structure
DQN combines Q-learning with deep learning, replacing the Q-table with a neural network that maps a state to one Q-value per action. The structure of the model is as follows:
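from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

# Built inside the agent class, which provides frame_stack_num, action_space and learning_rate.
model = Sequential()
# Two convolution + pooling stages extract features from the stacked 96x96 frames.
model.add(Conv2D(filters=6, kernel_size=(7, 7), strides=3, activation='relu', input_shape=(96, 96, self.frame_stack_num)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(filters=12, kernel_size=(4, 4), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(216, activation='relu'))
# Linear output layer: one Q-value per action.
model.add(Dense(len(self.action_space), activation=None))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=self.learning_rate, epsilon=1e-7))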
- The input is a stack of three consecutive top-view frames (frame_stack_num), each 96x96 pixels
- Convolutional and max-pooling layers capture image features
- The final fully connected layer outputs one Q-value per action
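To see how the image shrinks on its way through the layers listed above, the following sketch traces the spatial size by hand. It assumes the Keras defaults that the model code does not override: `padding='valid'` for both layer types and a pooling stride equal to the pool size. The helper name `valid_output_size` is introduced here for illustration only.

```python
def valid_output_size(size, kernel, stride):
    """Output width/height of an unpadded ('valid') convolution or pooling layer."""
    return (size - kernel) // stride + 1


size = 96                                            # each input frame is 96x96
size = valid_output_size(size, kernel=7, stride=3)   # Conv2D 7x7, strides=3 -> 30
size = valid_output_size(size, kernel=2, stride=2)   # MaxPooling2D 2x2      -> 15
size = valid_output_size(size, kernel=4, stride=1)   # Conv2D 4x4            -> 12
size = valid_output_size(size, kernel=2, stride=2)   # MaxPooling2D 2x2      -> 6

flattened = size * size * 12                         # 12 filters in the last conv layer
print(size, flattened)                               # 6 432
```

Under these assumptions the `Flatten` layer hands 432 features to the 216-unit dense layer, which is a small network by image-model standards.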
Training process design
Experience Replay
Consecutive frames from the same lap are highly correlated. To break this temporal correlation, experience replay stores past transitions in an experience pool and samples from it at random for training.
def memorize(self, state, action, reward, next_state, done):
    # Store the action as its index in the action space so it can index the Q-value vector later.
    self.memory.append((state, self.action_space.index(action), reward, next_state, done))
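The snippet above does not show how `self.memory` is created. One common choice, assumed here rather than taken from the original code, is a bounded `collections.deque` that discards the oldest transitions once it is full. The class name `ReplayMemory` and the default capacity are hypothetical.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action_index, reward, next_state, done) tuples."""

    def __init__(self, capacity=5000):  # capacity is an assumed value
        # A deque with maxlen silently drops its oldest entry once it is full.
        self.buffer = deque(maxlen=capacity)

    def memorize(self, state, action_index, reward, next_state, done):
        self.buffer.append((state, action_index, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and recent transitions.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for step in range(5):
    memory.memorize(state=step, action_index=0, reward=1.0, next_state=step + 1, done=False)

print(len(memory))                              # 3: the two oldest transitions were dropped
print(sorted(t[0] for t in memory.sample(3)))   # [2, 3, 4]
```

Bounding the buffer keeps memory use constant and gradually phases out transitions collected by much older, weaker versions of the policy.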
Target Network
A separate target network is used to compute the training targets. Because its weights change only when they are copied from the main model, the targets move more slowly and training is more stable.
def update_target_model(self):
    self.target_model.set_weights(self.model.get_weights())
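The text does not say how often the weights are copied. The toy sketch below assumes a fixed interval (`update_target_every`, a hypothetical name and value) and uses a minimal stand-in class instead of a real Keras model, only to show how the target lags behind the online model between syncs.

```python
class ToyModel:
    """Stand-in for a Keras model: exposes only get_weights() and set_weights()."""

    def __init__(self, weights):
        self._weights = list(weights)

    def get_weights(self):
        return list(self._weights)

    def set_weights(self, weights):
        self._weights = list(weights)


model = ToyModel([0.0])
target_model = ToyModel([0.0])
update_target_every = 5  # assumed sync interval, not taken from the original code

history = []
for step in range(1, 11):
    # Pretend that one training step nudges the online weights by 1.0.
    model.set_weights([w + 1.0 for w in model.get_weights()])
    if step % update_target_every == 0:
        # Same call as in update_target_model.
        target_model.set_weights(model.get_weights())
    history.append(target_model.get_weights()[0])

print(history)  # [0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 10.0]
```

Between syncs the target weights stay frozen, so the values the online model is trained toward do not shift with every gradient step.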
Training loop
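import random
import numpy as np

def replay(self, batch_size):
    # Draw a random minibatch of stored transitions.
    minibatch = random.sample(self.memory, batch_size)
    train_state = []
    train_target = []
    for state, action_index, reward, next_state, done in minibatch:
        # Start from the current predictions so only the taken action's Q-value changes.
        target = self.model.predict(np.expand_dims(state, axis=0))[0]
        if done:
            # Terminal transition: there is no future reward to bootstrap from.
            target[action_index] = reward
        else:
            # Bootstrap from the target network's best Q-value for the next state.
            t = self.target_model.predict(np.expand_dims(next_state, axis=0))[0]
            target[action_index] = reward + self.gamma * np.amax(t)
        train_state.append(state)
        train_target.append(target)
    # Fit the online model for one epoch on the collected states and targets.
    self.model.fit(np.array(train_state), np.array(train_target), epochs=1, verbose=0)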
In each training cycle, a batch of data is randomly selected from the experience pool, the target Q value is calculated, and the model weights are updated.
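To make the target calculation concrete, here is a small worked example in plain Python. The helper `build_target` and all Q-values are invented for illustration; they mirror the two branches of `replay` but are not part of the original code.

```python
def build_target(current_q, next_q, action_index, reward, done, gamma):
    """Copy the current Q-values and overwrite only the entry of the action taken."""
    target = list(current_q)
    if done:
        # Terminal step: the target is just the reward.
        target[action_index] = reward
    else:
        # Non-terminal step: add the discounted best Q-value of the next state.
        target[action_index] = reward + gamma * max(next_q)
    return target


current_q = [1.0, 2.0, 3.0]   # hypothetical model prediction for the state
next_q = [0.0, 4.0, 1.0]      # hypothetical target-model prediction for the next state

print(build_target(current_q, next_q, action_index=0, reward=1.0, done=False, gamma=0.5))
# [3.0, 2.0, 3.0]
print(build_target(current_q, next_q, action_index=0, reward=1.0, done=True, gamma=0.5))
# [1.0, 2.0, 3.0]
```

Because the untouched entries equal the model's own predictions, their squared error is zero, so with the mean squared error loss only the Q-value of the action actually taken contributes to the gradient.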
Training results and model evolution
Over the course of training, the model gradually learns to navigate the track:
After 400 rounds of training
The model still struggles with sharp turns and occasionally drifts off the track.
After 500 rounds of training
The model is more proficient, making fewer errors and driving more smoothly.
After 600 rounds of training
The model becomes reckless in its pursuit of reward and leaves the track on sharp turns.
Summary
This post walked through training a self-driving agent with the DQN algorithm. With experience replay and a target network, the model gradually learns Q-values that yield better driving behavior, although the results after 600 rounds show that more training does not always mean safer driving. Deep Q-learning is a powerful and flexible method for decision-making problems in complex environments, and it offers useful ideas for research and applications in autonomous driving.