Deep Q-Networks (DQN) Tutorial

Introduction

Deep Q-Networks (DQN) combine Q-Learning with deep neural networks. Instead of storing a Q-table, the algorithm uses a neural network to approximate the Q-function. This approach allows Reinforcement Learning to handle large or continuous state spaces that traditional Q-Learning cannot manage.

In Moroccan Arabic (Darija): Deep Q-Networks (DQN) combine reinforcement learning with neural networks. Instead of using a big Q-table, we train a neural network to learn the value of each action in each state. This way we can work in very large environments.
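
To make the contrast with a Q-table concrete, here is a minimal sketch; the tiny linear "network" below is purely illustrative, and the PyTorch example later in this tutorial uses a real multi-layer network:

import numpy as np

# Tabular Q-Learning needs one entry per (state, action) pair,
# which cannot be enumerated when states are continuous.
q_table = {}   # e.g. q_table[(state_id, action)] = value

# The DQN idea: a parameterized function maps any state vector to one Q-value per action.
state_size, action_size = 4, 2
weights = np.random.randn(state_size, action_size) * 0.01

def q_values(state):
    return state @ weights   # shape (action_size,): one Q-value for every action

print(q_values(np.random.rand(state_size)))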

Core Concepts Explained

  • Q-Network: A neural network that takes a state as input and outputs Q-values for each action.
  • Target Network: A copy of the Q-network used to stabilize learning.
  • Experience Replay: A memory buffer that stores past experiences to break correlation between samples.
  • Loss Function: Mean squared error between predicted Q-values and target Q-values.

In Moroccan Darija: We have a neural network called the Q-Network that outputs the Q-values for every action. We also keep a copy of it (the Target Network) that helps make learning stable. We also record past experiences in a memory buffer so that we do not keep learning from the same data over and over.
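
Before the full example, here is a minimal sketch of the replay-memory and target-network ideas in isolation; the names ReplayBuffer and sync_target are only illustrative, and the PyTorch example later in this tutorial implements the same mechanics with a plain deque:

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size memory of (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=2000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

def sync_target(policy_net, target_net):
    # Copy the online Q-network weights into the target network
    target_net.load_state_dict(policy_net.state_dict())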

DQN Update Rule

The DQN loss function is based on the Bellman equation:


Loss = (R + γ * max_{a'} Q_target(s', a') - Q(s, a))²

Here R is the immediate reward, γ is the discount factor, and the maximum is taken over the actions a' available in the next state s'. We minimize this loss to update the weights of the Q-network; the target network's weights are only refreshed periodically.
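
To make the update concrete, here is a minimal numeric sketch that computes the target and the loss for a single transition (the tensor values and γ are made up for illustration):

import torch

gamma = 0.95
reward = torch.tensor(1.0)                         # R
q_next = torch.tensor([0.2, 0.7])                  # Q_target(s', a') for the two actions
q_current = torch.tensor(0.5, requires_grad=True)  # Q(s, a) predicted by the Q-network

target = reward + gamma * q_next.max()             # R + γ * max_{a'} Q_target(s', a') = 1.665
loss = (target.detach() - q_current) ** 2          # squared TD error ≈ 1.357
loss.backward()                                    # gradient flows only into Q(s, a)
print(target.item(), loss.item())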

Python Example: Basic DQN with PyTorch


import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque

# Neural Network for Q-function
class DQN(nn.Module):
    def __init__(self, state_size, action_size):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_size, 24)
        self.fc2 = nn.Linear(24, 24)
        self.fc3 = nn.Linear(24, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Problem dimensions (example: a CartPole-like task with 4 state features and 2 actions)
state_size = 4
action_size = 2
policy_net = DQN(state_size, action_size)
target_net = DQN(state_size, action_size)
target_net.load_state_dict(policy_net.state_dict())
optimizer = optim.Adam(policy_net.parameters(), lr=0.001)

# Replay memory
memory = deque(maxlen=2000)
gamma = 0.95
batch_size = 32

def replay():
    if len(memory) < batch_size:
        return
    minibatch = random.sample(memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*minibatch)
    states = torch.tensor(np.array(states), dtype=torch.float32)
    next_states = torch.tensor(np.array(next_states), dtype=torch.float32)
    actions = torch.tensor(actions, dtype=torch.int64)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    dones = torch.tensor(dones, dtype=torch.float32)

    # Q-values the policy network assigns to the actions that were actually taken
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze()
    # Bellman targets computed from the (frozen) target network
    max_next_q_values = target_net(next_states).max(1)[0]
    targets = rewards + (1 - dones) * gamma * max_next_q_values
    loss = nn.MSELoss()(q_values, targets.detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example: fill the replay memory with random transitions and train on them
for i in range(100):
    s = np.random.rand(state_size)
    a = random.randint(0, action_size - 1)
    r = random.random()
    s_next = np.random.rand(state_size)
    done = random.choice([0, 1])
    memory.append((s, a, r, s_next, done))
    replay()
    if i % 20 == 0:
        # Periodically copy the policy-network weights into the target network
        target_net.load_state_dict(policy_net.state_dict())

print("Training step completed.")

Explanation of the Example

  • The Q-network predicts Q-values for all actions given a state (see the sketch below).
  • The target network is updated periodically to keep learning stable.
  • The replay buffer stores experiences to train with random samples.
  • The network learns by minimizing the Bellman loss.

In Moroccan Darija: The neural network learns gradually from the experiences stored in memory. Each time, it computes the error between the predicted value and the target value (the Bellman loss) and corrects itself. The target network is refreshed from time to time so that learning does not become unstable.
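
As a quick illustration of the first point, the trained policy_net from the example above can be queried directly to pick the greedy action for a state (a minimal sketch reusing the state_size and policy_net defined earlier):

import numpy as np
import torch

state = np.random.rand(state_size)   # any 4-dimensional state vector
with torch.no_grad():                # no gradients needed for action selection
    q_values = policy_net(torch.tensor(state, dtype=torch.float32))
action = int(q_values.argmax())      # greedy action under the current Q-estimates
print(q_values, action)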

Advanced Concepts

  • Double DQN: Reduces overestimation by separating action selection and evaluation (see the sketch after this list).
  • Dueling DQN: Splits the network into value and advantage streams for better learning.
  • Prioritized Experience Replay: Samples more important experiences more often.

In Moroccan Darija: There are improved variants such as Double DQN, which reduces estimation errors, and Dueling DQN, which separates the overall state value from the advantage of each action. There is also Prioritized Replay, so that the agent learns more from the most important experiences.
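
To give a flavour of the first variant, here is a minimal sketch of how the target computation inside replay() would change for Double DQN, reusing the networks and tensors from the example above (a sketch of the idea, not a full implementation):

# Inside replay(), replace the max-based target with:
next_actions = policy_net(next_states).argmax(1, keepdim=True)          # policy net selects a'
double_q = target_net(next_states).gather(1, next_actions).squeeze(1)   # target net evaluates a'
targets = rewards + (1 - dones) * gamma * double_q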

10 Exercises for Practice

  1. Explain the difference between Q-Learning and DQN.
  2. Define the role of the target network in DQN.
  3. Implement a DQN for the CartPole environment using Gym.
  4. Add Experience Replay to your Q-Learning agent.
  5. Experiment with different batch sizes and learning rates.
  6. Implement Double DQN and compare the results.
  7. Add visualization of episode rewards during training.
  8. Explain why DQN is more stable than vanilla Q-Learning.
  9. Test your trained agent in a new environment.
  10. Discuss how neural networks help approximate Q-values in complex environments.