Double Deep Q-Networks (DDQN) - A Quick Intro (with Code)
In the previous post, we discussed how Deep Q-Networks has proven to be a powerful tool for solving RL-based problems. However, over the years, several modifications to it have resulted in a great performance. In this article, I’ll discuss one of these modifications known as Double Deep Q-Networks (DDQN).
- Overestimation in Q-learning
- Introduction to Double DQN
- Performance of DDQN
- Implementation of DDQN
- Final Thoughts
Overestimation in Q-learning
One of the challenges in Q-learning is that the Q function is updated based on estimates of the future rewards, rather than the true rewards. This can lead to an overestimation of the Q values, especially when the estimates are made using an inaccurate model of the environment. In other words, the agent may think it would get a higher reward than it actually would, leading to wrong decisions.
There are various techniques have been developed to address the problem of overestimation in Q-learning. One such technique is Double DQN.
Introduction to Double DQN
Value estimates of DDQN vs DQN
Double DQN is a variant of the deep Q-network (DQN) algorithm that addresses the problem of overestimation in Q-learning. It was introduced in 2015 by Hado van Hasselt et al. in their paper “Deep Reinforcement Learning with Double Q-Learning”.
In traditional DQN, the Q function is updated using the Bellman equation, which involves estimating the maximum expected future reward for each action. However, this can lead to the overestimation of the Q values, as described in the previous subtopic. On the other hand, DDQN addresses this problem by separating the action selection and action evaluation steps in the Q-learning update.
Specifically, in DDQN, the action with the maximum Q value is selected using one network (the “selection network”), and the Q value for this action is evaluated using a separate network (the “target network”).
Performance of DDQN
Performance of DDQN vs DQN
Double DQN has been evaluated on a variety of reinforcement learning tasks and has been shown to improve performance on many of them. In the original paper, the authors demonstrated improved performance on the Atari 2600 game suite compared to traditional DQN.
Additionally, other studies have also found that DDQN can lead to improved performance on tasks such as playing the card game Hanabi, controlling a simulated robot arm, and navigating a simulated 3D environment. In general, DDQN has been found to be particularly effective at reducing the overestimation of the Q values and improving the stability of learning.
However, Double DQN is not a panacea and may not always lead to improved performance. In fact, some studies have found that DDQN can perform worse than traditional DQN on certain tasks, such as playing the game of Go. Therefore, it is important to carefully evaluate the performance of DDQN on each specific task to determine whether it is likely to be beneficial. A comprehensive list of results performed by the authors of the DDQN paper is available on page 10 of the paper.
In summary, DDQN has been shown to improve performance on many reinforcement learning tasks, but it is not always the best choice and its effectiveness can vary depending on the specific task.
Implementation of DDQN
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import numpy as np
import random
# Define the network architecture
class QNetwork(nn.Module):
def __init__(self, state_size, action_size):
super(QNetwork, self).__init__()
self.fc1 = nn.Linear(state_size, 64)
self.fc2 = nn.Linear(64, 64)
self.fc3 = nn.Linear(64, action_size)
def forward(self, x):
x = torch.relu(self.fc1(x))
x = torch.relu(self.fc2(x))
x = self.fc3(x)
return x
# Define the replay buffer
class ReplayBuffer:
def __init__(self, capacity):
self.capacity = capacity
self.buffer = []
self.index = 0
def push(self, state, action, reward, next_state, done):
if len(self.buffer) < self.capacity:
self.buffer.append(None)
self.buffer[self.index] = (state, action, reward, next_state, done)
self.index = (self.index + 1) % self.capacity
def sample(self, batch_size):
batch = np.random.choice(len(self.buffer), batch_size, replace=False)
states, actions, rewards, next_states, dones = [], [], [], [], []
for i in batch:
state, action, reward, next_state, done = self.buffer[i]
states.append(state)
actions.append(action)
rewards.append(reward)
next_states.append(next_state)
dones.append(done)
return (
torch.tensor(np.array(states)).float(),
torch.tensor(np.array(actions)).long(),
torch.tensor(np.array(rewards)).unsqueeze(1).float(),
torch.tensor(np.array(next_states)).float(),
torch.tensor(np.array(dones)).unsqueeze(1).int()
)
def __len__(self):
return len(self.buffer)
# Define the Double DQN agent
class DDQNAgent:
def __init__(self, state_size, action_size, seed, learning_rate=1e-3, capacity=1000000,
discount_factor=0.99, tau=1e-3, update_every=4, batch_size=64):
self.state_size = state_size
self.action_size = action_size
self.seed = seed
self.learning_rate = learning_rate
self.discount_factor = discount_factor
self.tau = tau
self.update_every = update_every
self.batch_size = batch_size
self.steps = 0
self.qnetwork_local = QNetwork(state_size, action_size)
self.qnetwork_target = QNetwork(state_size, action_size)
self.optimizer = optim.Adam(self.qnetwork_local.parameters(), lr=learning_rate)
self.replay_buffer = ReplayBuffer(capacity)
self.update_target_network()
def step(self, state, action, reward, next_state, done):
# Save experience in replay buffer
self.replay_buffer.push(state, action, reward, next_state, done)
# Learn every update_every steps
self.steps += 1
if self.steps % self.update_every == 0:
if len(self.replay_buffer) > self.batch_size:
experiences = self.replay_buffer.sample(self.batch_size)
self.learn(experiences)
def act(self, state, eps=0.0):
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
state = torch.from_numpy(state).float().unsqueeze(0).to(device)
self.qnetwork_local.eval()
with torch.no_grad():
action_values = self.qnetwork_local(state)
self.qnetwork_local.train()
# Epsilon-greedy action selection
if random.random() > eps:
return np.argmax(action_values.cpu().data.numpy())
else:
return random.choice(np.arange(self.action_size))
def learn(self, experiences):
states, actions, rewards, next_states, dones = experiences
# Get max predicted Q values (for next states) from target model
Q_targets_next = self.qnetwork_target(next_states).detach().max(1)[0].unsqueeze(1)
# Compute Q targets for current states
Q_targets = rewards + self.discount_factor * (Q_targets_next * (1 - dones))
# Get expected Q values from local model
Q_expected = self.qnetwork_local(states).gather(1, actions.view(-1, 1))
# Compute loss
loss = F.mse_loss(Q_expected, Q_targets)
# Minimize the loss
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
# Update target network
self.soft_update(self.qnetwork_local, self.qnetwork_target)
def update_target_network(self):
# Update target network parameters with polyak averaging
for target_param, local_param in zip(self.qnetwork_target.parameters(), self.qnetwork_local.parameters()):
target_param.data.copy_(self.tau * local_param.data + (1.0 - self.tau) * target_param.data)
def soft_update(self, local_model, target_model):
for target_param, local_param in zip(target_model.parameters(), local_model.parameters()):
target_param.data.copy_(self.tau * local_param.data + (1.0 - self.tau) * target_param.data)
Explanation
-
QNetwork
: a PyTorch module that defines the architecture of the Q-network. It takes in a state and outputs the action values for all actions. -
ReplayBuffer
: a class that stores experiences in a circular buffer and samples a batch of experiences randomly for learning. -
DDQNAgent
: the main class that implements the Double DQN algorithm. It has the following methods:-
__init__
: initializes the local and target Q-networks, the optimizer, the replay buffer, and some hyperparameters. It also callsupdate_target_network
to initialize the target network with the same weights as the local network. -
step
: stores an experience in the replay buffer and learns from a batch of experiences everyupdate_every
steps. -
act
: selects an action using an epsilon-greedy policy based on the action values output by the local Q-network. -
learn
: performs a learning step using a batch of experiences. It computes the Q-targets using the local Q-network and the action values output by the target Q-network, and then minimizes the loss between the Q-targets and the expected Q-values using the local Q-network. It then updates the target Q-network using polyak averaging. -
update_target_network
: updates the target Q-network with polyak averaging. -
soft_update
: a helper function that updates the target Q-network with polyak averaging.
-
To use this Double DQN implementation, you can create an instance of the DDQNAgent
class and call its act
and step
methods in your training loop. You can also customize the hyperparameters and the network architecture by modifying the __init__
method of the DDQNAgent
class.
Training DDQN
Cart Pole task from OpenAI Gym
Assuming we’ve saved the previous code snippet in a file called ddqn.py
, the following snippet can be used to import and train the agent on performing the CartPole task:
import gym
import numpy as np
from ddqn import DDQNAgent
import matplotlib.pyplot as plt
# Create the environment
env = gym.make('CartPole-v0')
# Get the state and action sizes
state_size = env.observation_space.shape[0]
action_size = env.action_space.n
# Set the random seed
seed = 0
# Create the DDQN agent
agent = DDQNAgent(state_size, action_size, seed)
# Set the number of episodes and the maximum number of steps per episode
num_episodes = 1000
max_steps = 1000
# Set the exploration rate
eps = eps_start = 1.0
eps_end = 0.01
eps_decay = 0.995
# Set the rewards and scores lists
rewards = []
scores = []
# Run the training loop
for i_episode in range(num_episodes):
print(f'Episode: {i_episode}')
# Initialize the environment and the state
state = env.reset()[0]
score = 0
# eps = eps_end + (eps_start - eps_end) * np.exp(-i_episode / eps_decay)
# Update the exploration rate
eps = max(eps_end, eps_decay * eps)
# Run the episode
for t in range(max_steps):
# Select an action and take a step in the environment
action = agent.act(state, eps)
next_state, reward, done, trunc, _ = env.step(action)
# Store the experience in the replay buffer and learn from it
agent.step(state, action, reward, next_state, done)
# Update the state and the score
state = next_state
score += reward
# Break the loop if the episode is done or truncated
if done or trunc:
break
print(f"\tScore: {score}, Epsilon: {eps}")
# Save the rewards and scores
rewards.append(score)
scores.append(np.mean(rewards[-100:]))
# Close the environment
env.close()
plt.ylabel("Score")
plt.xlabel("Episode")
plt.plot(range(len(rewards)), rewards)
plt.plot(range(len(rewards)), scores)
plt.legend(['Reward', "Score"])
plt.show()
The progress of training may look like the following:
DDQN training
Final Thoughts
That’s it! Hope you found this helpful. Let me know if there are any questions in the comments below. Also, feel free to check out my other posts on Reinforcement Learning here.